Why Systems are Broken on Purpose
Systems are broken intentionally to ensure that they are resilient and fault-tolerant before release. This practice helps systems handle unexpected disruptions, preventing potential damage once the system is live in production.
Chaos engineering is not just about randomly causing failures or breaking things in uncontrolled scenarios. Each experiment is thought through, planned, and evaluated before a failure is caused in a production or pre-production environment.
Chaos engineering should be implemented when the system has achieved a high level of maturity in terms of resilience strategies and monitoring. Before causing chaos, it's important to ensure that the system's behaviour is observed, monitored, and understood so that the failure and response to stress can be evaluated properly.
Key reasons include:
Identifying weaknesses: discovering potential or hidden vulnerabilities.
Improving resilience and fault tolerance: testing how well the system can recover during disruptions.
Preparing systems and engineering teams to deal with unexpected disruptions in the real world.
Chaos engineering is usually performed in the production environment. However, chaos testing doesn't have to be confined to the pre-production or post-production stage. It can also happen early in the development cycle, as part of CI/CD pipelines.
Developers can run resilience tests on new code just as it is being developed. This is to prevent regression in some of the resilience that may have already been built up.
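As a minimal sketch of what such a resilience test could look like in a CI pipeline, the example below injects a failing dependency and checks that the caller still degrades gracefully. The names `get_price`, `failing_dependency`, and the cached-price fallback are hypothetical and only illustrate the idea; they do not come from any particular tool.

```python
from typing import Callable


def get_price(fetch_live: Callable[[], float], cached_price: float) -> float:
    """Return the live price, falling back to a cached value if the dependency fails."""
    try:
        return fetch_live()
    except ConnectionError:
        # Degrade gracefully instead of propagating the outage to the caller.
        return cached_price


def failing_dependency() -> float:
    """Injected fault: stands in for a pricing service that is unreachable."""
    raise ConnectionError("pricing service unavailable")


def test_price_survives_dependency_outage() -> None:
    # Resilience regression check: the caller must still get an answer.
    assert get_price(failing_dependency, cached_price=9.99) == 9.99


def test_price_uses_live_value_when_healthy() -> None:
    # The fallback must not mask a healthy dependency.
    assert get_price(lambda: 12.5, cached_price=9.99) == 12.5
```

Run with a test runner such as pytest on every commit, a check like this fails as soon as a change removes the fallback, which is the kind of resilience regression described above.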
Observability in Chaos Engineering
Observability involves understanding the internal state of a software system by analysing its external outputs. In chaos engineering, it is used to explore the different modes of failure within a system, and the resulting insights feed iterative safety improvements.
Before intentionally causing a failure, the engineer or team must ask: what is the risk tolerance of the production environment, and is it suitable for introducing chaos?
Chaos engineering follows four basic steps:
Hypothesis: This is the first step where engineers think about what could happen to the application when changing a variable. It allows chaos engineers to ask questions, write down assumptions, and compare them to real-life scenarios.
Blast Radius: This is the area of impact during a chaos experiment. The blast radius is defined when testing specific variables, such as increased memory load on a component, to ensure that disruptions are controlled and do not damage the overall system or application.
Testing: This involves executing the chaos experiments to observe how the system functions under failure conditions. For instance, chaos engineers may use a simulated environment or carry out chaos experiments in production environments to cause disruptions to services, infrastructure, networks, and devices. If there's a different result from the assumptions made in the hypothesis, the chaos engineers would analyse the insights derived from the experiment to identify weaknesses and then rebuild the component or make necessary adjustments to improve the resilience of the system.
Insights: Insights are derived from the results of the hypothesis and testing. Chaos engineers use insights to restructure or rebuild components, enhancing their performance or functionality under unexpected conditions.
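The four steps above can be sketched in code. The following is a toy illustration under stated assumptions, not the API of any chaos tool: the `Experiment` class, the `run` function, and the three-replica "system" are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Experiment:
    hypothesis: str                    # step 1: what should stay true under failure
    blast_radius: List[str]            # step 2: the only targets that may be disrupted
    inject: Callable[[str], None]      # step 3: applies the fault to one target
    steady_state: Callable[[], bool]   # True while the system behaves as hypothesised


def run(experiment: Experiment) -> Dict[str, object]:
    """Run one experiment and return the observations used for insights (step 4)."""
    disrupted: List[str] = []
    if not experiment.steady_state():
        # Do not inject faults into a system that is already unhealthy.
        return {"hypothesis": experiment.hypothesis, "held": None, "disrupted": disrupted}
    held = True
    for target in experiment.blast_radius:
        experiment.inject(target)
        disrupted.append(target)
        if not experiment.steady_state():
            held = False  # hypothesis violated: stop before widening the impact
            break
    return {"hypothesis": experiment.hypothesis, "held": held, "disrupted": disrupted}


# Toy system: three replicas, and the service is "up" while any replica is healthy.
replicas = {"replica-a": True, "replica-b": True, "replica-c": True}


def stop_replica(name: str) -> None:
    replicas[name] = False


def service_is_up() -> bool:
    return any(replicas.values())


insights = run(Experiment(
    hypothesis="The service stays up while two of three replicas are stopped",
    blast_radius=["replica-a", "replica-b"],
    inject=stop_replica,
    steady_state=service_is_up,
))
print(insights)
```

Keeping the blast radius as an explicit list means the experiment can never touch a target that was not agreed in advance, and stopping at the first violated check keeps the impact from widening once the hypothesis has already been disproved.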
Benefits and Limitations of Chaos Engineering
Benefits | Limitations |
Early Detection of Issues | Inaccurate Conclusion |
Improved resilience and recovery processes | Disruption in Production Systems |
Improve business | High-Cost |
Extensive understanding of the system | Scope and Coverage |
Improved Decision Making | Hard to Simulate |
Collaboration | |
Benefits of Chaos Engineering
There are several benefits of chaos engineering, which include:
Early Detection of Issues: Chaos engineering helps engineers discover potential issues early, before they surface as incidents for users of the deployed application.
Improved resilience and recovery processes: Injecting failures and observing how the system responds can help engineers identify weaknesses and improve the application’s resilience. It also pushes engineers to create recovery plans and procedures to ensure the system recovers from failures quickly.
Improve business: Chaos engineering can build resilient and reliable systems that enhance customer experience, which can boost the business and draw in more clients.
Extensive understanding of the system: Chaos engineering generates insights into intricate interactions and dependencies within the system, enabling engineers to make informed decisions about the architecture and design of the system.
Improved Decision Making: Chaos engineering insights help engineers make informed decisions leading to more efficient and reliable systems.
Collaboration: Applying faults to a system requires coordination and communication across teams, and it works best when each team understands that everyone is working together to improve the resilience of the system.
Limitations of Chaos Engineering
Limitations of Chaos Engineering include:
Inaccurate Conclusion: In some cases, the chaos experiments can give false positives and false negatives, depending on how the experiment is designed and executed.
Disruption in Production Systems: Injecting chaos into the production environment comes with many risks. In an attempt to simulate realistic failure scenarios, actual service disruption or downtime can occur.
Therefore, planning and creating mitigation strategies are essential to minimise the impact of these disruptions on operations.
High-Cost: Chaos engineering can require significant resources, such as financial investment and time, depending on the complexity of the system.
Scope and Coverage: Achieving comprehensive coverage with chaos testing can be challenging because not all failure scenarios can be realistically simulated. Some vulnerabilities may also remain undetected, so it is recommended that chaos testing is supplemented with other testing methods, such as penetration testing and security testing.
Hard to Simulate: It can be difficult to create scenarios that mimic real-world conditions. This is because systems consist of components that depend on each other, so a simulated failure can have a ripple effect, making it difficult to predict outcomes and analyse results.
Types of Testing in Chaos Engineering
Types of experiments in chaos engineering include:
Dependency Failures: Dependency testing focuses on dependencies such as databases, APIs, and microservices. Chaos engineers can simulate scenarios where a dependency is unavailable and test the system's resilience during such a failure.
Network Latency Test: Increase network latency to see how the system handles slow communication or how quickly it can send and receive data.
DDoS (Distributed Denial of Service) Attack: This involves sending high volumes of traffic to observe the system’s response to a simulated DDoS attack.
Blackhole Attack: A blackhole attack drops targeted IP packets at the transport level, shutting off communication to and from a component.
Packet Loss and Corruption: This involves mimicking poor or mobile internet connections to see how components, or the entire application, respond to faulty or missing data.
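To make the latency and packet-loss experiments above more concrete, here is a minimal application-level sketch. Real chaos tools usually inject these faults at the network layer; the wrapper below only imitates the effect around a single call. The function `with_network_faults` and its parameters are hypothetical names chosen for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_network_faults(
    call: Callable[[], T],
    latency_s: float,
    loss_rate: float,
    rng: random.Random,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Invoke call() after applying simulated packet loss and added latency."""
    if rng.random() < loss_rate:
        # The request is "lost": the caller sees a timeout instead of a reply.
        raise TimeoutError("simulated packet loss")
    sleep(latency_s)  # simulated network delay before the reply arrives
    return call()


# Example: 200 ms of added latency and a 30% chance of losing each request.
rng = random.Random(42)  # seeded so that a run can be repeated
outcomes = []
for _ in range(5):
    try:
        outcomes.append(with_network_faults(lambda: "ok", 0.2, 0.3, rng, sleep=lambda s: None))
    except TimeoutError:
        outcomes.append("lost")
print(outcomes)
```

Passing the random generator and the `sleep` function in as arguments keeps the experiment repeatable and lets tests skip the real delay, which matters when the same fault has to be replayed to confirm a fix.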
The software landscape is evolving, and the industry is moving beyond reactive chaos engineering and adopting a more proactive approach to building resilient systems. This involves a shift from breaking things to designing systems that are fault-tolerant and self-healing.
Designing Systems for Resilience
In this approach, resilience is incorporated into the core design of software systems instead of relying on chaos engineering to discover and address weaknesses afterwards. Building fault-tolerant and self-healing capabilities from the outset ensures that the system can withstand and recover from outages without significantly impacting users or services.
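One common way to build this kind of fault tolerance into a design is a circuit breaker, which stops calling a failing dependency for a while and serves a fallback instead. The sketch below is a simplified illustration, not a production implementation; the `CircuitBreaker` class and its parameter names are hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Fail fast while a dependency is unhealthy, then retry after a cool-down."""

    def __init__(
        self,
        max_failures: int,
        reset_after_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, operation: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()  # open: skip the dependency entirely
            # Cool-down elapsed: close the circuit and try the dependency again.
            self.opened_at = None
            self.failures = 0
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # too many failures: open the circuit
            return fallback()
        self.failures = 0  # a success resets the failure count
        return result
```

A caller would wrap each request to the dependency, for example `breaker.call(fetch_live_data, use_cached_data)`, where both function names are placeholders. The self-healing part is the cool-down: once it elapses, traffic returns to the dependency without manual intervention.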
Observability and Continuous Improvement
The modern software landscape places emphasis on observability and continuous improvement. Monitoring and observability tools provide real-time visibility into system performance, allowing engineers to identify and address potential issues before they become critical.
Shifting Left
This is the evolution of chaos engineering that empowers developers to build resilient systems from an early stage instead of relying on chaos engineering experts. This approach involves providing developers with the right tools and processes to integrate into their code from the outset to minimise the impact of failures.
Here are a few tools you can use to perform chaos engineering:
Litmus
Litmus helps chaos engineers carry out controlled tests in the production stage. Engineers can use it to capture logs, detect bugs, run test suites, and generate reports.
Chaos Mesh
Built for cloud applications, Chaos Mesh provides a dashboard with various built-in experiments and timeframes to inject chaos into software systems. Engineers can design custom experiments and perform status checks of different components and development stages, either during pre-production or post-production.
Gremlin
Gremlin is a paid chaos engineering tool that provides engineers with three attack modes and failure scenarios to help develop resilient and reliable software. It also offers features like latency injection, memory leak testing, and CLI support.
Chaos Monkey
Chaos Monkey is an open-source tool, originally built at Netflix, that randomly terminates instances in production so that engineers can confirm their services tolerate instance failures.
AWS Fault Injection Simulator
AWS Fault Injection Service (formerly AWS Fault Injection Simulator) is a resilience testing tool for setting up and running controlled fault injection experiments across various AWS services, enabling teams to build confidence in how their applications behave under failure. It is used to find performance bottlenecks and other weaknesses missed by traditional software tests.