Concept

Chaos Engineering: the practice of deliberately introducing controlled failures into a system to build confidence in its ability to withstand real-world turbulence.

Key Principles & Goals

  • Proactive resilience: The core goal is to find weaknesses before they lead to significant problems for users.
  • Controlled experimentation: Chaos engineering involves hypothesis-driven, controlled experiments, not random destruction.
  • Hypothesize steady state: A steady state of normal system behavior is defined, and the hypothesis is that it will remain steady despite the introduced failure.
  • Minimize blast radius: Experiments are designed to have a limited impact to avoid causing widespread damage.
  • Validate monitoring: Experiments help validate that monitoring and alerting systems are working correctly.
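
One simple way to limit the blast radius is to cap the share of instances an experiment is allowed to touch. The sketch below is only an illustration: `pick_targets`, the `web-N` instance names, and the 10% cap are assumptions made for the example, not part of any particular chaos tool.

```python
import random


def pick_targets(instances, max_percent=10, seed=None):
    """Pick a small random subset of instances to receive the fault.

    max_percent caps the blast radius. At least one instance is chosen
    so that the experiment still exercises something.
    """
    if not 0 < max_percent <= 100:
        raise ValueError("max_percent must be between 1 and 100")
    instances = list(instances)
    if not instances:
        return []
    # Integer arithmetic avoids floating-point rounding surprises.
    count = max(1, len(instances) * max_percent // 100)
    rng = random.Random(seed)  # seeded so a run can be reproduced
    return rng.sample(instances, count)


fleet = [f"web-{i}" for i in range(20)]  # hypothetical instance names
targets = pick_targets(fleet, max_percent=10, seed=42)
```

With 20 instances and a 10% cap, only two instances are selected; the rest of the fleet keeps serving traffic untouched.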

How it works

  • Define steady state: Establish a baseline of normal system performance.

  • Form a hypothesis: Predict how the system will behave under a specific failure scenario, e.g., “If a web server fails, the load balancer will redirect traffic to healthy servers.”

  • Introduce a failure: Use tools to inject a controlled fault, such as increasing CPU usage, introducing network latency, or shutting down a server.

How to introduce failures: Fault Injection in Distributed Systems

  • Observe the system: Monitor how the system responds to the failure and compare the result to the hypothesis.

  • Fix vulnerabilities: Based on the results, make the necessary fixes to improve the system’s resilience.
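
The steps above can be sketched as a small experiment runner. This is a minimal illustration under stated assumptions, not a real chaos tool: `run_experiment`, `ExperimentResult`, and `ToyPool` are hypothetical names, and the "system" is an in-memory stand-in for servers behind a load balancer. In practice the steady-state check would query real metrics and the fault would come from a fault-injection tool.

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    hypothesis: str
    baseline_ok: bool        # was the system healthy before the fault?
    held_during_fault: bool  # did the steady state survive the fault?


def run_experiment(hypothesis, steady_state_ok, inject_fault, rollback):
    """Run one experiment: check baseline, inject, observe, always roll back."""
    # Define steady state: refuse to experiment on an unhealthy system.
    if not steady_state_ok():
        return ExperimentResult(hypothesis, False, False)
    # Introduce a failure.
    inject_fault()
    try:
        # Observe the system while the fault is active.
        held = steady_state_ok()
    finally:
        # Undo the fault even if the observation raised an error.
        rollback()
    return ExperimentResult(hypothesis, True, held)


class ToyPool:
    """In-memory stand-in for servers behind a load balancer (hypothetical)."""

    def __init__(self, names):
        self.healthy = {name: True for name in names}

    def route(self):
        # Return the first healthy server, or None if none is available.
        for name, ok in self.healthy.items():
            if ok:
                return name
        return None


pool = ToyPool(["web-1", "web-2", "web-3"])
result = run_experiment(
    hypothesis="If a web server fails, traffic is redirected to healthy servers",
    steady_state_ok=lambda: pool.route() is not None,
    inject_fault=lambda: pool.healthy.update({"web-1": False}),
    rollback=lambda: pool.healthy.update({"web-1": True}),
)
```

Here `result.held_during_fault` is true because two healthy servers remain. Running the same experiment against a pool with a single server would falsify the hypothesis, which is the kind of weakness the final step is meant to fix.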

Drill in action: Mediary Fault Drill


Benefits of chaos engineering

  • Increases availability and reliability.
  • Reduces Mean Time To Resolution (MTTR) and Mean Time To Detection (MTTD).
  • Improves incident response preparedness by simulating real-world events.
  • Helps validate disaster recovery plans.
  • Validates that redundancy and failover mechanisms are working correctly.
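
To track whether experiments actually reduce MTTD and MTTR, both can be computed from incident timestamps. The sketch below uses made-up timestamps purely for illustration, and it assumes one common convention: MTTD is measured from incident start to detection, and MTTR from incident start to resolution. Some teams measure MTTR from detection instead, so the keys should be adjusted to match local definitions.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average the elapsed minutes between two timestamps across incidents."""
    return mean(
        (incident[end_key] - incident[start_key]).total_seconds() / 60
        for incident in incidents
    )


# Made-up incident records, for illustration only.
incidents = [
    {
        "started": datetime(2024, 1, 1, 10, 0),
        "detected": datetime(2024, 1, 1, 10, 6),
        "resolved": datetime(2024, 1, 1, 10, 40),
    },
    {
        "started": datetime(2024, 1, 2, 9, 0),
        "detected": datetime(2024, 1, 2, 9, 2),
        "resolved": datetime(2024, 1, 2, 9, 20),
    },
]

mttd = mean_minutes(incidents, "started", "detected")  # (6 + 2) / 2 = 4.0
mttr = mean_minutes(incidents, "started", "resolved")  # (40 + 20) / 2 = 30.0
```

Comparing these figures before and after a series of experiments gives a rough indication of whether detection and recovery are improving.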