The Microsoft CrowdStrike Incident: A Case for a Shared Validation & Verification Process

Executive Summary

The July 19th Microsoft CrowdStrike incident starkly underscored the critical importance of an effective Validation & Verification (V&V) program in protecting against severe information system outages that can significantly impact businesses and customers across various industries, including healthcare.

When V&V processes fail, organizations can promote seriously defective platform-level software to production systems, causing catastrophic disruptions for platform and application stakeholders, customers, and third parties who are indirectly affected by the resulting service outages.

The CrowdStrike incident affected mission-critical systems and software applications on a global scale, with widespread short and long-term consequences.

The Incident

On July 19th, 2024, a global information technology service disruption occurred due to a defective software update to CrowdStrike’s Falcon Sensor application, resulting in a 'blue screen of death' (BSOD) on approximately 8.5 million Microsoft Windows devices worldwide.

This incident grounded airline flights, disrupted healthcare and hospital patient care services, and impacted various sectors including investment and retail markets, financial institutions, news broadcasters, and other organizations providing mission-critical services.

Root Cause Analysis Perspectives

Failure to Observe Change, Risk and Security Management Best Practices

Organizations that consistently employ change management, risk assessment, and security best practices are positioned to detect and isolate software defects during V&V on pre-production test systems.

Risk assessment and security experts give business IT managers the guidance they need to determine when to schedule a software update to mission-critical systems, once the V&V process is complete.

Put another way, not every security threat is critical and urgent, requiring immediate actions, including circumventing business processes designed to protect business and customer interests.

Prioritizing Short-Term Benefit Over Future Risks and Consequences

In the race to bring new software products, capabilities, and features to market as quickly as possible, there's been a growing trend of prioritizing speed and schedule over ensuring a product meets stakeholder and customer critical needs and expectations.

Brian Krebs's Krebs on Security column on the Microsoft CrowdStrike incident, and the reader comments on it, are testament to the growing trend of pushing code into production without regard to whether it has been validated by functional and security validation teams.

This growing practice has led to the deployment of new software products and updates without thorough risk assessment and validation, creating a risk of long-term negative consequences.

Software Validation: A Shared Responsibility

The elephant in the room that was evident as the Microsoft CrowdStrike event unfolded was the lack of shared stakeholder responsibility for software validation prior to production release to mission-critical systems.

Microsoft and their enterprise stakeholders must take responsibility for validating any changes that could disrupt their business operations.

History continues to demonstrate that relying solely on third-party validation is not a viable strategy.

Litigation & Liability: A Serious Consequence of Ineffective V&V Programs

The potential for negligence lawsuits and other legal challenges arising from the Microsoft CrowdStrike incident could be a powerful motivator for organizations to re-evaluate their processes and make the necessary investments in validation, verification, risk assessment, and sustainability programs.

While the Microsoft CrowdStrike incident was not a cyberattack, affected healthcare service organizations suffered effects similar to those of a ransomware attack.

When an EHR or other patient care software application is rendered unavailable, patients can be harmed, which in turn can lead to lawsuits.

Potential for Federal Class-Action Lawsuits

Beck Andrew Salgado, reporting for the Austin American-Statesman, wrote about a proposed class action lawsuit filed in federal court in Austin, Texas, by three Delta Air Lines passengers, alleging that CrowdStrike was negligent in testing and deploying its software. This lawsuit is in addition to the lawsuit for damages filed by Delta Air Lines.

V&V Investment Payoff: Averting a Business Disaster

It's a safe bet that there were Microsoft CrowdStrike business customers that caught the defective update in their pre-production validation platforms, averting the impact of this incident on their business and customers.

It's also likely that some of their customers alerted both Microsoft and CrowdStrike to serious findings from pre-production validation.

As investigations of this incident progress, reports may come to light of organizations that detected the CrowdStrike software defect during pre-production validation and alerted Microsoft and CrowdStrike to their findings.

Impact on the Healthcare Industry

The Microsoft CrowdStrike incident has profound implications for the healthcare industry. As seen during the outage, critical healthcare services were disrupted, highlighting the sector's dependency on reliable IT systems.

Hospitals and healthcare providers that have not implemented a rigorous Validation & Verification (V&V) program to ensure the stability and security of their systems need to consider the significant risks and costs associated with taking no action.

Key Considerations

Patient Safety. Healthcare IT disruptions directly affect patient care and safety. Robust V&V processes help protect patients and maintain seamless operations.

Regulatory Compliance. Healthcare providers must comply with stringent regulations, like HIPAA. A strong V&V framework supports adherence to HIPAA and other mandatory regulatory standards and laws.

Financial Stability. Preventing IT outages safeguards revenue cycles and minimizes financial losses due to downtime.

Mandatory Review of Current HIT Platforms and Integrated Applications

Healthcare organizations that were impacted by a production push of the Microsoft CrowdStrike update need to undertake a serious review of their EHR, revenue cycle, patient scheduling, and other mission-critical platforms that were knocked offline, causing massive disruptions to patient care, providers, and allied healthcare services.

As reported by Becker's Hospital Review, one estimate put losses for the healthcare industry at $1.9 billion USD.

Leaders of affected hospitals and healthcare organizations who balked at the cost of effective V&V programs now have to invest many times that initial cost to prevent and mitigate the effects of future incidents.

Additionally, affected organizations will continue to be vulnerable to new defects, as they update programs and implement new programs to manage future incidents.

According to Marianne Kolbasuk McGee of HealthInfoSec, hospitals using the Epic EHR platform were directly and indirectly impacted, resulting in a loss of Epic telehealth service or an inability to access the Epic EHR platform, due to Microsoft Windows workstations failing.

Mike Toole and Kristina Rex, WBZ CBS News Boston, reported on Mass General Brigham canceling non-urgent surgeries and hospital visits. Nurses reported providing patient care without access to EHR platforms (!).

Meghan Mahoney, a Neurosciences Intensive Care Unit nurse, put the crisis in context: "I can't even quantify it into words. We do everything on our computers now, right? We have our electronic health systems which have people's medical history, their allergies. Everything that they have done to establish care with us is on our electronic health system."

Potential for Federal Class-Action Lawsuits

Karen Blum, reporting for the Association of Health Care Journalists, cited the Lurie Children's Hospital federal class-action lawsuit filed in U.S. District Court in Chicago. Such lawsuits can be filed months after a cyberattack or similar failure that renders HIT platforms inaccessible or functionally inoperable.

Lessons Learned

Over time, organizations have learned many of these lessons and built them into their best practices, operational policies, and procedures.

These policies and business practices drive risk assessment and change management programs to ensure the safety, security, functionality, and reliability of mission-critical business information and healthcare IT systems and software applications.

- Corporate, non-profit, and government institution leaders need to recognize, acknowledge, and take definitive actions to ensure ongoing investments in validation, risk assessment, and security programs to ensure the long-term viability and sustainability of the information technology they depend on.

- By adopting a long-term sustainable business perspective and prioritizing these aspects from the outset, companies can avoid costly mistakes, build trust with their customers, and position themselves for greater success.

- Ensuring the functionality, stability, security, and sustainability of a company's IT infrastructure is a shared responsibility between IT platform, system, and software providers, and the customers who rely on these technologies.

- Customers should implement their own validation, risk assessment, and security business practices, policies, and processes to manage software updates destined for mission-critical production IT platforms.

- Validation and Risk Assessment are ongoing programs, not one-and-done projects. IT platforms, applications, and integrations are ever-changing, along with security and risk assessment. Change drives updates to Validation and Verification assets and platforms, to ensure the integrity and sustainability of business platforms, applications, and integrations.

By involving all stakeholders in software release validation, the likelihood of incidents like the CrowdStrike outage can be significantly reduced.
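
One way to picture this shared responsibility is as a promotion gate: an update reaches mission-critical production systems only after both the provider and the customer have recorded a passing pre-production validation result. The following is a minimal sketch of that idea. The names `ValidationResult`, `may_promote`, and `REQUIRED_PARTIES`, along with the party labels and update identifiers, are hypothetical and chosen for illustration; they are not drawn from any Microsoft or CrowdStrike tooling.

```python
from dataclasses import dataclass

# Hypothetical parties whose sign-off is required before a production release.
REQUIRED_PARTIES = frozenset({"provider", "customer"})


@dataclass(frozen=True)
class ValidationResult:
    """One party's validation outcome for a specific update (illustrative)."""
    party: str       # e.g. "provider" or "customer"
    update_id: str   # identifier of the update that was tested
    passed: bool     # True if pre-production validation succeeded


def may_promote(update_id: str, results: list[ValidationResult]) -> bool:
    """Return True only if every required party passed validation for this update."""
    relevant = [r for r in results if r.update_id == update_id]
    # Any recorded failure blocks promotion, regardless of other sign-offs.
    if any(not r.passed for r in relevant):
        return False
    signed_off = {r.party for r in relevant}
    # A missing sign-off blocks promotion just like a failure does.
    return REQUIRED_PARTIES <= signed_off


if __name__ == "__main__":
    results = [
        ValidationResult("provider", "update-001", True),
        ValidationResult("customer", "update-001", False),
    ]
    # The customer's pre-production validation failed, so the gate stays closed.
    print(may_promote("update-001", results))  # prints False
```

The design choice worth noting is that a missing sign-off is treated the same as a failed one, so relying solely on one party's validation never opens the gate.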

Conclusion and Takeaways

Priyanka Aash, Cofounder of the CISOPlatform, a social network for security professionals, published her perspective on the Microsoft CrowdStrike incident, which is well worth reading.

Here are a few of my conclusions and takeaways:

- Sometimes it takes a major incident like this to serve as a wake-up call and drive real change within an industry. Companies may have heard the message about the importance of quality and security for years, but it's only when they see the tangible consequences firsthand that they're more likely to take these issues seriously.

- It's unfortunate that it often takes drastic events, such as the Microsoft CrowdStrike incident, to spur business, technology, and government leaders into action.

- As painful and costly lessons are learned from these incidents, more organizations will adopt industry best practices that help ensure a more viable and sustainable business and technology infrastructure that future generations can rely on.

- Business stakeholders, customers, and consumers should evaluate their software validation practices and processes that support mission-critical programs, projects, and processes.

- Contingency plans need to be current, with organizations ready to implement them on short notice.

- Companies that value quality and excellence in customer experience are not immune from V&V lapses, as demonstrated by the Apple iPadOS 18 software update failure that reportedly 'bricked' some customers' premium iPad tablets.

Interested in hearing an artificial intelligence agent's perspective on this column? Check out this discussion between a couple of Google's Gemini AI agents.

Posted

7.25.2024

Information Technology
