Massive Outage Highlights Need for Resilient Operating System
A massive outage affected multiple businesses worldwide due to a routine application update, highlighting the critical need for a new generation of resilient operating systems in large-scale deployments.
Today, we woke up to a massive outage affecting multiple businesses worldwide. Airlines, banks, credit card companies, and other industries were impacted. Many wonder how a simple maintenance update to one of hundreds of applications could leave so many systems unable to boot, causing business outages and chaos. Because each affected device requires a manual, local fix, full recovery could take weeks. This incident, caused by a CrowdStrike update that affected the Windows operating system, highlights the critical need for a more resilient operating system for large-scale deployments.
The Root Cause
The root cause of this disruption seems to be a routine application update that left systems failing with a Blue Screen of Death (BSOD) on every reboot. The affected devices included desktops, servers, terminals, and edge devices. Because these devices are widely dispersed and each one has to be fixed manually, recovery is considerably harder. Such incidents emphasize the vulnerabilities inherent in traditional operating systems when managing extensive IT infrastructures.
The Scale Problem
As we scale, we must be concerned about how something as routine as a maintenance update failure can dramatically affect our business continuity and potentially our brand and reputation. What makes it worse is that this will not be the last maintenance error. The only question is when the next one will occur. Application errors, administrator mistakes, and other issues can always happen.
In the past, when everything ran on central systems with manual procedures, administrators could access machines directly, and resolution was not too time-consuming. They could walk up to the systems and repair them in hours. Today, however, environments are dispersed and massively distributed across the edge, cloud, and remote sites, so unbootable systems are not easily accessible, and recovering all of them could take weeks of work and possibly travel. To understand the scale, in this incident we have seen images of airport display terminals showing a blue screen, which illustrates the potential cost and logistical challenge of sending technicians to each affected location.
Learnings
This is why we need a new generation of operating systems designed to be always ready to service: operating systems that always boot into a ready-to-service state, with automated health checks and rollback capabilities that make recovery time negligible and allow administrators to perform repairs remotely if needed.
The Need for an Immutable, Transactional, Enterprise-Ready OS Supporting Full Rollback
Although some general-purpose Linux distributions like SUSE Linux Enterprise Server offer excellent resiliency and rollback capabilities backed by the Btrfs filesystem, they can’t cover every case in which an error makes the system unusable. To have an always ready-to-service (and ready-to-boot) OS, we need an immutable, transactional operating system like SUSE Linux Enterprise Micro. Unlike traditional systems, it offers automated health checks and rollback capabilities, so a maintenance error can be undone, leaving a booted system that is ready to service and can be corrected centrally without manual intervention. An immutable system always keeps the last ready-to-boot copy of the OS. In the case of an error, automatic health checks detect the issue and roll back to a previous known good state. This ensures consistent booting and minimizes downtime, maintenance costs, and potential brand damage or liability.
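The update, health check, and rollback cycle described above can be illustrated with a small toy model. This is a minimal sketch of the general idea only, not how SUSE Linux Enterprise Micro is implemented: the `Host` class, the `transactional_update` and `boots_cleanly` functions, and the `sensor-agent` package name are all hypothetical, and a "snapshot" is simply a dictionary of package versions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical model: a snapshot maps package names to versions.
Snapshot = Dict[str, str]


@dataclass
class Host:
    snapshots: List[Snapshot]  # history of OS states, oldest first
    active: int = 0            # index of the snapshot the host boots from

    def current(self) -> Snapshot:
        return self.snapshots[self.active]


def transactional_update(
    host: Host,
    changes: Snapshot,
    health_check: Callable[[Snapshot], bool],
) -> bool:
    """Apply changes to a new snapshot and keep it only if it passes the check."""
    known_good = host.active
    # The running snapshot is never modified; the update lands in a copy.
    candidate = {**host.current(), **changes}
    host.snapshots.append(candidate)
    host.active = len(host.snapshots) - 1  # simulate rebooting into the candidate
    if health_check(candidate):
        return True
    host.active = known_good               # automatic rollback to the known good state
    return False


def boots_cleanly(snapshot: Snapshot) -> bool:
    # Stand-in health check: one specific version plays the faulty update.
    return snapshot.get("sensor-agent") != "2.0-bad"


host = Host(snapshots=[{"kernel": "6.4", "sensor-agent": "1.9"}])
ok = transactional_update(host, {"sensor-agent": "2.0-bad"}, boots_cleanly)
print(ok, host.current()["sensor-agent"])
```

In this model the faulty update is rejected and the host keeps booting from its previous snapshot, so it stays reachable for a central fix. A real system would run its health checks during boot rather than on a dictionary.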
Benefits in Large-Scale Environments
In large-scale environments, like cloud deployments, data centers, or large enterprise networks, the complexity of IT management is magnified. Here, the advantages of using an immutable operating system like SUSE Linux Enterprise Micro become evident. Its transactional updates ensure changes are implemented only if they pass all predefined health checks after boot, preventing incomplete or faulty updates or configuration changes from disrupting operations. This significantly reduces the risk of downtime and associated maintenance costs. Additionally, in the event of an error, the OS can seamlessly roll back to a previous stable state, ensuring continuity of service. This robust approach enhances overall system reliability and efficiency, making SUSE Linux Enterprise Micro ideal for managing extensive IT infrastructures, especially in remote environments where minimizing travel and manual interventions is crucial.
Infrastructure Ready for the Edge
We have learned that the edge is a very specific scenario in which the impact of incidents like this one is multiplied. Therefore, edge infrastructure must be equipped with solutions designed to handle such situations. SUSE Edge leverages SUSE Linux Enterprise Micro to provide a robust solution. SUSE Edge ensures that dispersed and remote systems are always in a ready-to-service state, offering automated health checks and rollback capabilities. This makes managing and recovering edge devices efficient and reliable, significantly reducing the risk and impact of system failures. Learn more about SUSE Edge and its capabilities here.
An Additional Learning: The Need to Implement Processes for Patching
To further minimize risks, it’s crucial to implement processes that test patches in staging environments before updates are deployed to production. This applies not only to OS patches but also to updates for every application. Tools like SUSE Manager can facilitate and automate this process by managing patch testing and staging in preproduction environments, ensuring updates are reliable and reducing the likelihood of system failures.
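As a rough illustration of such a process, the sketch below promotes an update through a sequence of environments and stops at the first stage where any host fails its check. It is a simplified model under stated assumptions, not the SUSE Manager API: the `staged_rollout` function, the stage names, and the host names are all hypothetical, and `deploy_and_check` stands in for whatever deployment and validation step an organization actually uses.

```python
from typing import Callable, Dict, List


def staged_rollout(
    stages: Dict[str, List[str]],
    deploy_and_check: Callable[[str], bool],
) -> Dict[str, str]:
    """Deploy stage by stage; halt promotion at the first failing stage."""
    results: Dict[str, str] = {}
    halted = False
    for stage, hosts in stages.items():  # dicts preserve insertion order
        if halted:
            results[stage] = "skipped"
            continue
        # Deploy to every host in the stage and collect the ones that fail.
        failures = [host for host in hosts if not deploy_and_check(host)]
        if failures:
            results[stage] = "failed"
            halted = True                # later stages never receive the update
        else:
            results[stage] = "passed"
    return results


deployed: List[str] = []


def fake_deploy(host: str) -> bool:
    # Stand-in for a real deployment plus health check; one staging host fails.
    deployed.append(host)
    return host != "stage-02"


stages = {
    "test": ["test-01"],
    "staging": ["stage-01", "stage-02"],
    "production": ["prod-01", "prod-02", "prod-03"],
}
results = staged_rollout(stages, fake_deploy)
print(results)
```

Here the failure in staging keeps the update away from every production host, which is the property a staged patching process is meant to guarantee.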
Conclusion
The recent outage is a stark reminder of the risks associated with conventional operating systems in managing extensive and remote IT estates. By adopting an always-ready-to-service OS, organizations can mitigate such risks, ensuring a more resilient and manageable IT environment.
In a previous blog post, I explored the challenges faced by software vendors and integrators in maintaining remote and dispersed systems, particularly when updates lead to critical errors. This recent outage is just another reminder of these challenges. While that blog post focused on immutable operating systems like SUSE Linux Enterprise Micro, the broader point is that resilient operating systems with features like automatic rollbacks can play a significant role in ensuring system uptime in large-scale deployments.
To learn more about what is included in SUSE Linux Enterprise Micro, visit this link.