
Chaos engineering: A step toward reliability

Sathyapriya Bhaskar,

Director, Intelligent Automation

Published: September 13, 2022

A journey toward microservices

IT organizations have undergone a massive transformation in people, processes, and technology to address business needs such as faster time-to-market. People have reskilled to full stack; processes have moved from waterfall to iterative and agile; teams have become lean; and DevOps practices have automated software development and deployment. Although a monolithic architecture allows logical modularity, any enhancement or new feature still requires deploying the application as a single unit, making continuous deployment tedious and time-consuming. To resolve this challenge, organizations have migrated from monolithic architecture to service-oriented architecture and, in most cases, to microservices. The shift to microservices has helped teams develop, test, and deploy applications more efficiently, quickly, and frequently. Moreover, the adoption of microservices has significantly improved key performance indicators (KPIs), such as deployment frequency and lead time for changes, helping bring features to market on shorter timelines.

Technological challenges of microservices architecture

IT enterprises that have embraced microservices have seen enormous benefits. However, challenges remain in managing operational and technical complexity. An increase in the number and interdependencies of microservices can also increase the points of failure, which in turn affect the following metrics:

  • Mean time to resolution (MTTR): the average amount of time between the start of an incident and its resolution
  • Mean time between failures (MTBF): the average time elapsed between failures

These metrics measure system availability and are used by site reliability, operations, and development teams to support contracts with service level agreements (SLAs). A higher recovery time (MTTR) and more frequent failures (lower MTBF) reduce system availability and reliability, significantly affecting the service level agreement (SLA) and service level objective (SLO) commitments in those contracts.
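
As a rough illustration, both averages can be computed from incident records. The incident timestamps below are hypothetical, and this sketch uses one common convention for MTBF (time from one incident's resolution to the next incident's start); definitions vary across teams.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (start, resolved) timestamps, for illustration only.
incidents = [
    (datetime(2022, 9, 1, 10, 0), datetime(2022, 9, 1, 10, 45)),
    (datetime(2022, 9, 5, 14, 0), datetime(2022, 9, 5, 15, 30)),
    (datetime(2022, 9, 9, 8, 0),  datetime(2022, 9, 9, 8, 20)),
]

# MTTR: average time from the start of an incident to its resolution.
mttr = sum((end - start for start, end in incidents), timedelta()) / len(incidents)

# MTBF: average time elapsed between one failure's resolution and the next failure.
gaps = [incidents[i + 1][0] - incidents[i][1] for i in range(len(incidents) - 1)]
mtbf = sum(gaps, timedelta()) / len(gaps)

print(f"MTTR: {mttr}")
print(f"MTBF: {mtbf}")
```

Tracking these two numbers over time shows whether reliability work (such as chaos experiments) is actually shortening recovery and spacing out failures.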

Adopting chaos engineering to improve availability

Chaos engineering involves a series of controlled experiments run on a system to build confidence in its ability to withstand turbulent conditions in production. According to a 2021 report by Gremlin [1], 23% of teams who frequently ran chaos engineering experiments had an MTTR of under one hour, and 60% had an MTTR of under 12 hours. This is why implementing chaos experiments early and often in the software development life cycle (SDLC) enables the following:

  • Improved MTTR: Fixing defects found in the non-production environment prepares teams and improves their speed in resolving incidents in production.
  • Improved MTBF: Failures can be identified and resolved proactively in the non-production environment, thereby reducing failures in production.

Today, many top companies achieve high availability and reliability by implementing chaos experiments as part of their SDLC. Improving MTBF and MTTR keeps system availability high and prevents deviation from the committed SLAs/SLOs.

Chaos testing can reveal system weaknesses

Organizations can introduce chaos engineering by using a six-step strategy, depicted in the graphic below:


Figure 1: Six-step strategy for introducing chaos engineering in an organization.

Let us elaborate on the six steps above with the help of a sample microservices-based movie-booking application, deployed on a Kubernetes cluster on a public cloud provider, as depicted in Figure 2 below.

Various services, such as show timing, movie rating, ticket purchasing, payment, email, and SMS, are deployed in individual containers, each wrapped in a separate pod. The front-end user interface and database also run in pods.


Figure 2: Movie-booking application deployment on a Kubernetes cluster on a public cloud provider.

Step 1: Discovery

During this phase, teams must collaborate to get the required details about an application and its environment by:

  • Analyzing all services/components, functionality, touchpoints, and dependencies between services, including upstream-downstream, independent services, configurations, and data stores 
  • Exploring infrastructure and deployment approach
  • Identifying points of failure for each component/service
  • Determining the impact on business for each failure point 

Figure 3: Depiction of a movie-booking application’s services and its dependencies.

Step 2: Define steady state

During this stage, teams should:

  • Measurably define the application’s normal state behavior 
  • Measure business and operational metrics (e.g., latency, throughput, error rate, etc.)
  • For the identified metrics, mark the acceptable value range 

In the sample movie-booking application (Figure 2), the book movie service is a business-critical component, as it is tied to revenue generation. Typically, during the first few days of a movie's release, this service receives massive concurrent traffic. Hence, throughput (the ability to serve these requests without degradation) becomes the key metric, and the product management team defines the acceptable throughput as 50-65 requests per second for the book movie service.
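
A steady-state definition like this can be encoded directly as a check against monitoring data. The sketch below is illustrative: the function name and the sample values are hypothetical, and in practice the samples would come from a monitoring tool such as Grafana's data source rather than a hard-coded list.

```python
# Acceptable throughput range (requests/second) defined by product management
# for the book movie service.
STEADY_STATE_RANGE = (50, 65)

def within_steady_state(samples, low=STEADY_STATE_RANGE[0], high=STEADY_STATE_RANGE[1]):
    """Return True if every throughput sample falls inside the acceptable range."""
    return all(low <= s <= high for s in samples)

# Hypothetical pre-experiment throughput samples: the baseline is healthy.
pre_experiment = [52, 58, 61, 55]
print(within_steady_state(pre_experiment))
```

Making the steady state executable like this means the same check can be reused before, during, and after every experiment.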

Step 3: Build a hypothesis

Once the steady state has been defined for the services, identify the failures to inject and hypothesize what will go wrong (e.g., what could impact the service, the system, and customers?). Prioritize the experiments by customer impact and frequency of occurrence.

Let’s terminate the pod that’s running the book movie service on the sample application. We hypothesize that Kubernetes will automatically detect the termination and provision a new pod.

Step 4: Run the experiment, and measure and analyze the results 

Start running experiments in a non-production environment. It is vital to understand how the system behaves before running an experiment: measure the required metrics under normal conditions, and then measure the same metrics after injecting the failure. If the experiment causes severe impact, abort it and execute the rollback plan to revert to the steady state.
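
The run-measure-abort loop described above can be sketched as a small harness. Everything here is a hypothetical stand-in: the fault injector, the metric source, and the rollback hook would be supplied by real chaos and monitoring tooling, and the abort threshold is an illustrative value.

```python
# Abort the experiment if more than 5% of requests fail (illustrative threshold).
ERROR_RATE_ABORT_THRESHOLD = 0.05

def run_experiment(inject_fault, measure_metrics, rollback):
    """Measure baseline metrics, inject a fault, re-measure, and roll back on severe impact."""
    baseline = measure_metrics()      # steady-state metrics before injection
    inject_fault()                    # e.g., terminate the book movie pod
    observed = measure_metrics()      # same metrics after injection
    if observed["error_rate"] > ERROR_RATE_ABORT_THRESHOLD:
        rollback()                    # severe impact: revert to the steady state
        return {"aborted": True, "baseline": baseline, "observed": observed}
    return {"aborted": False, "baseline": baseline, "observed": observed}
```

The returned baseline and observed metrics are exactly the pre/post pairs that step 4 says to compare for deviations.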


Figure 4: A schematic representation of the overall experiment, defined below.

Follow the steps below to perform the book movie pod kill experiment:

  • Set up Grafana, a multi-platform, open-source analytics, visualization, and monitoring tool. Monitor throughput, error rate, and latency before performing the experiment, as highlighted in Figures 5 and 6.

Figure 5: Pre-experiment throughput is in the defined range of 50 to 65, and there are no errors.


Figure 6: Pre-experiment latency results are in the expected range.

  • Leverage the open-source performance tool, Apache JMeter, to simulate concurrent load for a specific duration on the Book Movie service requests.
  • Run the chaos experiment (terminating the book movie pod) by leveraging any chaos tool.
  • Observe post-experiment throughput, error rate, and latency in Grafana, as highlighted in Figures 7 and 8.

Figure 7: Throughput and error after running the experiment (e.g., after terminating the book movie pod).


Figure 8: Latency after running the experiment (e.g., after terminating the book movie pod).

  • Analyze the error rate, thresholds, and throughput recorded pre- and post-experiment to verify deviations.

The data shows that when the book movie pod is terminated, the error rate and latency rise above the normal range. The throughput dips until Kubernetes detects the termination and automatically brings up a new pod running the book movie service; this glitch must be fixed.

Step 5: Fix and retest

In this case, one potential fix is to scale the number of replicas, which ensures that a specified number of pods are always running. Even if one pod fails, the other pod(s) will handle requests. Note that after the fix, as shown in the images below, there are zero errors, and throughput and latency are as expected and well within range.
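
As an illustration, assuming the book movie service is managed by a Kubernetes Deployment (the names and image below are hypothetical), the fix amounts to raising the replica count in the Deployment spec:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: book-movie            # illustrative name for the book movie service
spec:
  replicas: 3                 # run three pods so one failure does not drop traffic
  selector:
    matchLabels:
      app: book-movie
  template:
    metadata:
      labels:
        app: book-movie
    spec:
      containers:
        - name: book-movie
          image: example/book-movie:latest   # placeholder image
```

With `replicas: 3`, Kubernetes keeps three pods running at all times; if the chaos experiment kills one, the remaining two continue serving requests while a replacement is scheduled.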


Figure 9: Post-fix, there are zero errors in throughput.


Figure 10: Post-fix, latency is in the acceptable range.

Glitches in the application can also be resolved by implementing microservice resilience design patterns. Developers can create failure-resistant microservices applications by using resilience patterns such as circuit breaker, retry, and timeout.
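
To make two of these patterns concrete, here is a minimal sketch of a retry decorator and a circuit breaker. This is not tied to any specific resilience framework; the thresholds and delays are illustrative, and production implementations (e.g., in libraries such as Resilience4j) add features like half-open states and exponential backoff.

```python
import time

def retry(times=3, delay=0.1):
    """Retry pattern: re-attempt a failing call a fixed number of times before giving up."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            for attempt in range(times):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == times - 1:
                        raise          # out of attempts: propagate the failure
                    time.sleep(delay)  # brief pause before the next attempt
        return wrapper
    return decorator

class CircuitBreaker:
    """Circuit breaker pattern: after repeated failures, fail fast instead of calling a broken service."""
    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = 0

    def call(self, fn, *args, **kwargs):
        if self.failures >= self.failure_threshold:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
            self.failures = 0  # a success resets the failure count
            return result
        except Exception:
            self.failures += 1
            raise
```

Retry absorbs transient glitches (such as the brief window while Kubernetes replaces a killed pod), while the circuit breaker stops a persistently failing dependency from dragging down its callers.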

Step 6: Iteratively increase the blast radius

Once an experiment runs successfully, the recommendation is to gradually increase its blast radius. For example, in the sample experiment above, we terminate one service, then progressively terminate others. Likewise, start experimenting in staging environments before moving to production.

Chaos engineering’s best practices 

  • Target experiments in a pre-production environment: Start experiments in the development or staging environment. After gaining confidence in the lower environment and understanding the impact of the experiments, run experiments in production environments.
  • Leverage automation: Start with a manual test using tools, but take advantage of automation to get faster feedback.
  • Run experiments continuously: As you gain confidence with the experiments, ensure chaos experiments run continuously in the CI/CD pipeline.
  • Iteratively increase the blast radius: Focus on known experiments and start small (e.g., begin with one container, or measure network latency between two services).
  • Recovery plan: Ensure a recovery plan for each experiment.

Closing thoughts

To ensure user satisfaction and retention in the digital world, businesses must consider certain factors (time to market, user experience, system reliability, and the ability to recover quickly) to run operations and provide value to end users. Incorporating chaos engineering into the SDLC enables both operational and technical benefits, such as preventing revenue losses, reducing maintenance costs, improving incident management, decreasing the number of severe incidents, lightening the burden on on-call staff, and building a better understanding of system failure modes that leads to improved system design.

Because of its many advantages, organizations should adopt chaos engineering to facilitate the creation of reliable software systems.

References:

1. Horgan, Aileen. “The State of Chaos Engineering in 2021.” Gremlin, January 26, 2021.

 

 

Sathyapriya Bhaskar


Director, Intelligent Automation

Sathyapriya Bhaskar has over 18 years of experience in quality engineering and DevOps. She is a practicing solutions architect, helping customers design and implement SDLC automation solutions. She is also a certified chaos engineering practitioner and professional.
