
CrowdStrike Offers Mea Culpa to House Committee

A contrite CrowdStrike executive this week described the company’s faulty July 19 content configuration update that crashed 8.5 million Windows systems worldwide as resulting from a “perfect storm” of issues that have since been addressed.

Testifying before members of the House Committee on Homeland Security on Sept. 24, CrowdStrike’s senior vice president, Adam Meyers, apologized for the incident and reassured the committee about the steps the company has implemented since then to prevent a similar failure.

The House Committee called for the hearing in July after a CrowdStrike content configuration update for the company’s Falcon sensor caused millions of Windows systems to crash, triggering widespread and lengthy service disruptions for businesses, government agencies, and critical infrastructure organizations worldwide. Some have pegged losses to affected organizations from the incident at billions of dollars.

Chess Game Gone Awry

When asked to explain the root cause for the incident, Meyers told the House Committee that the problem stemmed from a mismatch between what the Falcon sensor expected and what the content configuration update actually contained.

Essentially, the update caused the Falcon sensor to try to apply a threat detection configuration for which there were no corresponding rules on what to do. “If you think about a chessboard [and] trying to move a chess piece to someplace where there’s no square,” Meyers said. “That’s effectively what happened inside the sensor. This was kind of a perfect storm of issues.”

CrowdStrike’s validation and testing processes for content configuration updates did not catch the issue because this specific scenario had not occurred before, Meyers explained.
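To make the chessboard analogy concrete, the sketch below illustrates the general class of bug Meyers described: a content update that references a detection-rule slot the installed sensor never defined. This is a minimal Python illustration with hypothetical rule names, not CrowdStrike’s actual code; the real sensor runs as a kernel driver, where the equivalent out-of-bounds read brings down the entire machine rather than raising a catchable error.

```python
# Illustrative sketch only -- NOT CrowdStrike's actual code.
# The sensor ships with a fixed table of detection-rule handlers
# (hypothetical names for illustration).
SENSOR_RULE_TABLE = [
    "detect_process_injection",   # slot 0
    "detect_credential_theft",    # slot 1
    "detect_lateral_movement",    # slot 2
]

def apply_content_update(rule_slots_referenced):
    """Apply a content configuration update that references rule slots by index."""
    for slot in rule_slots_referenced:
        # In user-space Python this raises IndexError; in a kernel driver,
        # the equivalent out-of-bounds read crashes the whole machine.
        handler = SENSOR_RULE_TABLE[slot]
        print(f"activating {handler}")

# A well-formed update references only slots the sensor knows about.
apply_content_update([0, 2])

# The faulty update is like moving a chess piece to a square that isn't
# on the board: it references a slot with no corresponding rule.
try:
    apply_content_update([0, 3])   # slot 3 does not exist
except IndexError:
    print("out-of-bounds rule reference -- a kernel driver would crash here")
```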

Rep. Morgan Luttrell of Texas characterized CrowdStrike’s failure to spot the buggy update as a “very large miss,” especially for a company with a large presence in government and critical infrastructure sectors. “You mentioned North Korea, China, and Iran [and other] outside actors are trying to get us every day,” Luttrell said during the hearing. “We shot ourselves in the foot inside of the house” with the faulty update, he added. Luttrell demanded to know what preventive measures CrowdStrike has implemented since July.

In his written testimony and responses to questions from committee members, Meyers listed several changes that CrowdStrike has implemented to prevent a similar lapse. The measures include new validation and testing processes, more control for customers over how and when they receive updates, and a phased rollout process that enables CrowdStrike to quickly reverse an update if problems surface. Following the incident, CrowdStrike has also begun treating all content updates as code, meaning they receive the same level of scrutiny and testing as code updates.
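The phased rollout Meyers described follows the familiar canary-deployment pattern: push an update to progressively larger rings of machines and halt or revert the moment fleet telemetry looks unhealthy. The sketch below is a simplified illustration of that pattern; the ring names, percentages, and health threshold are hypothetical and not CrowdStrike’s actual pipeline.

```python
# Simplified sketch of a phased (canary) rollout with automatic rollback.
# Ring names, fractions, and thresholds are hypothetical.
ROLLOUT_RINGS = [
    ("internal",       0.01),   # the vendor's own systems first
    ("early_adopters", 0.05),   # customers who opt in to the newest content
    ("general",        1.00),   # everyone else
]

CRASH_RATE_THRESHOLD = 0.001    # abort if more than 0.1% of hosts report problems

def ring_is_healthy(ring: str) -> bool:
    """Placeholder health check: in reality, poll telemetry from hosts in this ring."""
    crash_rate = 0.0            # stand-in value for illustration
    return crash_rate <= CRASH_RATE_THRESHOLD

def deploy_content_update(update_id: str) -> bool:
    """Push an update ring by ring, reverting immediately if a ring looks bad."""
    for ring, fraction in ROLLOUT_RINGS:
        print(f"pushing {update_id} to {ring} ({fraction:.0%} of fleet)")
        if not ring_is_healthy(ring):
            print(f"problems detected in {ring}; rolling back {update_id}")
            return False        # halt the rollout and revert
    return True                 # update reached the entire fleet

deploy_content_update("example-content-update")
```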

Multiple Changes

“Since July 19, 2024, we have implemented multiple enhancements to our deployment processes to make them more robust and help prevent recurrence of such an incident — without compromising our ability to protect customers against rapidly-evolving cyber threats,” Meyers said in written testimony.

Meyers defended the need for companies like CrowdStrike to be able to continue making updates at the kernel level of the operating system when committee members probed him about the potential risks associated with the practice. “I would suggest that while things can be conducted in user mode, from a security perspective, kernel visibility is certainly critical,” he stated. In its root cause analysis of the incident, CrowdStrike noted that considerable work still needs to happen within the Windows ecosystem for security vendors to be able to issue updates directly to user space instead of the Windows kernel.

Missing the Bigger Picture?

But some viewed the hearing as not going far enough to identify and focus on some of the more significant takeaways from the incident. “To think of the July 19 outage as a CrowdStrike failure is simply wrong,” says Jim Taylor, chief product and technology officer at RSA. “More than 8 million devices failed, and it’s not CrowdStrike’s fault that those didn’t have backups built to withstand an outage, or that the Microsoft systems they were running couldn’t default to on-premises backups,” he notes.

Taylor argues the global outage was the result of organizations for years abdicating responsibility for building resilient systems and instead relying on a limited number of cloud vendors to carry out critical business functions. “Focusing on one company misses the forest for the trees,” he says. “I wish the hearing had done more to ask what organizations are doing to build resilient systems capable of withstanding an outage.”

Grant Leonard, chief information security officer (CISO) of Lumifi, says one shortcoming of the hearing was its overemphasis on the root cause of the outage and its relatively light focus on lessons learned. “Questions about CrowdStrike’s decision-making process during the crisis, their communication strategies with affected clients, and their plans for preventing similar incidents in the future would have provided more actionable insights for the industry,” Leonard says. “Exploring these areas could help other companies improve their incident response protocols and quality assurance processes.”

Leonard expects the hearing will result in a renewed emphasis on quality assurance processes across the cybersecurity industry. “We will likely see an uptick in solid reviews and trial runs of business continuity and disaster recovery plans,” he says. The incident could also lead to a more cautious approach to auto-updates and patching across the industry, with companies implementing more rigorous testing protocols. “Additionally, it could prompt a reevaluation of liability and indemnity clauses in cybersecurity service contracts, potentially shifting the balance of responsibility between vendors and clients.”

