Friday, November 8, 2024

CrowdStrike Says Buggy Validator Was Behind Massive Outage

A major disruption to Windows PCs in the U.S., U.K., Australia, South Africa and other countries was caused by an error in a CrowdStrike Falcon Sensor update, the cloud security company announced on July 19. Emergency services, airports and law enforcement reported downtime. About 8.5 million Windows devices were affected.

The problem stemmed from a Rapid Response Content update in the Falcon Sensor, CrowdStrike said on July 24. This type of update is intended to respond to fast-moving threats, and uses a Template Instance to define specific behaviors. “Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data” on July 19, CrowdStrike wrote in a Preliminary Post-Incident Review. The Content Validator is a procedure to “perform validation checks on the content before it is published,” CrowdStrike wrote. The Template Instance passed other quality checks, but, due to the bug, an error was allowed to pass through to deployment.

“When received by the sensor and loaded into the Content Interpreter, problematic content in Channel File 291 resulted in an out-of-bounds memory read triggering an exception,” CrowdStrike wrote. “This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSOD).”
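To illustrate the class of failure described here, the following is a minimal sketch, not CrowdStrike's code. It assumes a hypothetical content record (`TemplateInstance`) whose entries refer to inputs by index. The hypothetical `validate` function rejects any index that falls outside the available inputs before publication, and the hypothetical `interpret` function skips such references at load time instead of raising an exception.

```python
from dataclasses import dataclass


@dataclass
class TemplateInstance:
    """Hypothetical content record: each entry is the index of an input to read."""
    name: str
    field_indexes: list[int]


def validate(instance: TemplateInstance, input_count: int) -> list[str]:
    """Return a list of problems; an empty list means the content may be published."""
    problems = []
    for idx in instance.field_indexes:
        if not 0 <= idx < input_count:
            problems.append(
                f"{instance.name}: index {idx} is outside the "
                f"{input_count} available inputs"
            )
    return problems


def interpret(instance: TemplateInstance, inputs: list[str]) -> list[str]:
    """Read the referenced inputs, skipping out-of-range references instead of failing."""
    values = []
    for idx in instance.field_indexes:
        if 0 <= idx < len(inputs):
            values.append(inputs[idx])
    return values
```

In this sketch the two checks are deliberately redundant: the validator is the first line of defense, and the bounds check in the interpreter limits the damage if faulty content is published anyway.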

The problem did not stem from a kernel driver, as had been previously reported.

Blue Screen of Death widespread due to CrowdStrike outage

Affected organizations saw the infamous Blue Screen of Death, the Windows system crash alert. American Airlines, United and Delta flights were delayed on the morning of July 19 because the issue affected the airlines’ IT systems. U.K. media outlet Sky News reported on its own television outage early Friday morning. The New Hampshire emergency services department reported that it was back online after a disruption to 911 services early Friday.

“The issue has been identified, isolated and a fix has been deployed,” CrowdStrike said on Friday. However, outages on some machines that were initially affected are still being reported.

Microsoft 365 reported a service degradation warning on Friday morning, but this appears to be a separate incident.

CrowdStrike accounted for 14.74% of total software revenue across security software segments and regions in 2023, according to data Gartner sent to TechRepublic by email. Microsoft accounted for 40.16%.

SEE: Downtime costs the world’s largest companies $400 billion a year, according to Splunk.

What steps can businesses take if they are affected by the CrowdStrike outage?

The first step is to identify which hosts are impacted. From there, follow CrowdStrike’s instructions for repairing or recovering Windows.

On Saturday, Microsoft released a Recovery Tool that can be run from a USB drive or through a Preboot Execution Environment (PXE).

On Friday, Microsoft recommended restarting Azure Virtual Machines running the CrowdStrike Falcon agent. This may require a lot of reboots, with some users reporting success after as many as 15. Other options are to restore from a backup earlier than July 18 at 04:09 UTC, or to try to repair the OS disk by using a repair VM. 

“Because of the way in which the update has been deployed, recovery options for affected machines are manual and thus limited,” said Forrester VP and Principal Analyst Andras Cser in a prepared statement emailed to TechRepublic. “Administrators must attach a physical keyboard to each affected system, boot into Safe Mode, remove the compromised CrowdStrike update, and then reboot. Some administrators have also stated they have been unable to gain access to BitLocker hard drive encryption keys to perform remediation steps.”

CrowdStrike recommends that its customers keep in touch with CrowdStrike representatives. Organizations, even those not directly affected, should check in with their SaaS partners to see whether they might be experiencing issues.

Beware of misinformation

Because this incident affects such a wide range of major organizations, the potential for misinformation is high.

“There will be a lot of misinformation about how to reconfigure your computers or which critical system files to delete,” said former NSA cybersecurity expert Evan Dornbush in an email to TechRepublic. “Don’t fall victim to downloading phony solutions.”

On Saturday, CrowdStrike highlighted a malware campaign targeting Spanish-speaking CrowdStrike customers which disguised itself as a fix for the outage. The malware is a ZIP file attached to a bogus “utility for automating recovery,” according to CrowdStrike’s blog post.

“This is a great time to reflect on password management, since the fix may eventually require administrative access to systems that have not rebooted in quite some time,” Dornbush said.

Assess your recovery plan and support your team

Assess your organization’s reliance on one provider or service, and be sure your organization has a strong recovery process in place.

It’s also a good time for IT team leaders to make sure their personnel have the support they need.

“This disruption hit on Friday evening in some geographies, right as people were headed home for their weekend,” noted Forrester Principal Analyst Allie Mellen in a prepared statement emailed to TechRepublic. “Tech incidents like this require an all-hands-on-deck approach, and your teams will be working 24/7 over the weekend to recover. Support your teams by ensuring they have adequate support and rest breaks to avoid burnout and mistakes. Clearly communicate roles, responsibilities, and expectations.”

When reached for comment, CrowdStrike directed TechRepublic to the official statement.

What is CrowdStrike doing in response?

In the July 24 Preliminary Post-Incident Review, CrowdStrike said it is taking the following steps to improve its deployment process:

Software Resiliency and Testing

  • Improve Rapid Response Content testing by using testing types such as:
    • Local developer testing
    • Content update and rollback testing
    • Stress testing, fuzzing and fault injection
    • Stability testing
    • Content interface testing
  • Add additional validation checks to the Content Validator for Rapid Response Content. A new check is in process to guard against this type of problematic content from being deployed in the future.
  • Enhance existing error handling in the Content Interpreter.

Rapid Response Content Deployment

  • Implement a staggered deployment strategy for Rapid Response Content in which updates are gradually deployed to larger portions of the sensor base, starting with a canary deployment.
  • Improve monitoring for both sensor and system performance, collecting feedback during Rapid Response Content deployment to guide a phased rollout.
  • Provide customers with greater control over the delivery of Rapid Response Content updates by allowing granular selection of when and where these updates are deployed.
  • Provide content update details via release notes, which customers can subscribe to.
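
The staggered deployment idea in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not CrowdStrike's deployment system: the function `staged_rollout`, its parameters, the stage percentages and the failure budget are all hypothetical. Each stage pushes the update to a larger cumulative share of hosts, beginning with a small canary group, and the rollout halts as soon as a stage exceeds the allowed failure rate.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], bool],
    stage_percents: Sequence[int] = (1, 10, 50, 100),  # assumed cumulative stages
    max_failure_rate: float = 0.0,  # assumed budget: any failure halts the rollout
) -> tuple[list[str], bool]:
    """Deploy to growing slices of hosts and stop when a stage is unhealthy.

    Returns the hosts that received the update and whether the rollout completed.
    """
    updated: list[str] = []
    done = 0
    for percent in stage_percents:
        # Cumulative number of hosts this stage should reach (rounded up).
        target = min(len(hosts), (len(hosts) * percent + 99) // 100)
        batch = hosts[done:target]
        if not batch:
            continue
        failures = 0
        for host in batch:
            updated.append(host)
            if not deploy(host):  # deploy() returns False if the host is unhealthy
                failures += 1
        done = target
        if failures / len(batch) > max_failure_rate:
            return updated, False  # halt before reaching the remaining hosts
    return updated, done == len(hosts)
```

With 100 hosts and the default stages, a fault that appears on the first canary host stops the rollout after one machine rather than all 100.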

This article has been updated as more information became available. TechRepublic has reached out to Microsoft for comment. 
