IT organizations have undergone a massive transformation in people, processes, and technology to address business needs such as faster time-to-market. People have reskilled to full stack, processes have moved from waterfall to iterative and agile, teams have become lean, and DevOps practices have been applied to automate software development and deployment. Although a monolithic architecture allows logical modularity, any enhancement or new feature must still be deployed as a single unit, making continuous deployment tedious and time-consuming. To resolve this challenge, organizations have migrated from monolithic architecture to service-oriented architecture and, in most cases, to microservices. The shift to microservices has helped teams develop, test, and deploy applications more efficiently, quickly, and frequently. Moreover, the adoption of microservices has significantly improved key performance indicators (KPIs), such as deployment frequency and lead time for changes, helping bring features to market on a shorter timeline and better serve the business.
IT enterprises that have embraced microservices have seen enormous benefits. However, challenges remain in managing the resulting operational and technical complexity. As the number of microservices and their interdependencies grows, so do the potential points of failure, which in turn affect the following metrics:
- Mean time to repair (MTTR)
- Mean time between failures (MTBF)
These metrics measure system availability and are used by system reliability, operations, and development teams to support contracts with service level agreements (SLAs). A higher recovery time (MTTR) and more frequent failures (a lower MTBF) reduce system availability and reliability, significantly affecting the SLA/SLO (service level agreement/service level objective) commitments made in contracts.
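As a point of reference (a standard reliability approximation, not taken from this article's source material), steady-state availability can be expressed in terms of these two metrics as

$$\text{Availability} \approx \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$

so lowering MTTR and raising MTBF both push availability toward the levels that SLAs demand.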
Chaos engineering involves running a series of controlled experiments on a system to test and build confidence in its ability to withstand turbulent conditions in production. According to a 2021 report by Gremlin¹, 23% of teams who frequently ran chaos engineering projects had an MTTR of under one hour, and 60% had an MTTR of under 12 hours. This is why implementing chaos experiments early and often in the software development life cycle (SDLC) enables the following:
Today, many top companies achieve high availability and reliability by implementing chaos experiments as part of their SDLC. Improving MTBF and MTTR keeps system availability high and prevents deviation from the committed SLAs/SLOs.
Organizations can introduce chaos engineering by using a six-step strategy, depicted in the graphic below:
Figure 1: Six-step strategy for introducing chaos engineering in an organization.
Let us elaborate on the six steps mentioned above with the help of a sample microservices-based movie-booking application. It is deployed on a Kubernetes cluster on a public cloud provider, as depicted in Figure 2 below.
Various services, like show timing, movie rating, ticket purchasing, payment, email, and SMS, are deployed in individual containers, each wrapped in a separate pod. The front-end user interface and the database also run in their own pods.
Figure 2: Movie-booking application deployment on a Kubernetes cluster on a public cloud provider.
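To make that layout concrete, here is a minimal sketch that lists the application's pods grouped by service using the official Kubernetes Python client. The namespace and the app label values are assumptions for illustration, not taken from the actual deployment.

```python
# Minimal sketch: list the movie-booking pods per service with the official
# Kubernetes Python client (pip install kubernetes).
# The namespace and label values below are illustrative assumptions.
from kubernetes import client, config

ASSUMED_NAMESPACE = "movie-booking"
ASSUMED_SERVICES = [
    "show-timing", "movie-rating", "book-movie",
    "payment", "email", "sms", "frontend", "database",
]

def list_pods_by_service():
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    core = client.CoreV1Api()
    for svc in ASSUMED_SERVICES:
        pods = core.list_namespaced_pod(ASSUMED_NAMESPACE, label_selector=f"app={svc}")
        names = [p.metadata.name for p in pods.items]
        print(f"{svc}: {names or 'no pods found'}")

if __name__ == "__main__":
    list_pods_by_service()
```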
Step 1: Discovery
During this phase, teams must collaborate to get the required details about an application and its environment by:
Figure 3: Depiction of a movie-booking application's services and their dependencies.
Step 2: Define steady state
During this stage, teams should:
In the sample movie-booking application (Figure 2), the book movie service is a business-critical component, as it is tied directly to revenue generation. Typically, during the first few days of a movie release, this component receives massive concurrent traffic. Hence, throughput (the ability to service these requests without degradation) becomes the key metric to track, and the product management team defines the acceptable throughput for the book movie service as 50-65 requests per second.
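One lightweight way to capture such a steady-state definition is as data plus a check, so every experiment can be evaluated against the same thresholds. The sketch below is illustrative: only the 50-65 requests-per-second band comes from the definition above, while the error-rate and latency thresholds are assumed placeholder values.

```python
# Illustrative steady-state definition for the book movie service.
# Only the 50-65 req/s throughput band comes from the product team's
# definition; the error-rate and latency thresholds are assumed placeholders.
from dataclasses import dataclass

@dataclass
class SteadyState:
    min_throughput_rps: float = 50.0   # lower bound of the defined range
    max_throughput_rps: float = 65.0   # upper bound of the defined range
    max_error_rate: float = 0.01       # assumed: at most 1% errors
    max_p95_latency_ms: float = 500.0  # assumed: p95 latency under 500 ms

    def is_met(self, throughput_rps: float, error_rate: float,
               p95_latency_ms: float) -> bool:
        return (
            self.min_throughput_rps <= throughput_rps <= self.max_throughput_rps
            and error_rate <= self.max_error_rate
            and p95_latency_ms <= self.max_p95_latency_ms
        )

BOOK_MOVIE_STEADY_STATE = SteadyState()
```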
Step 3: Build a hypothesis
Once the steady state has been defined for the services, identify the faults to inject and hypothesize what will go wrong (e.g., what could impact the service, the system, and customers?). Then prioritize the experiments based on customer impact and frequency of occurrence.
Let’s terminate the pod that’s running the book movie service on the sample application. We hypothesize that Kubernetes will automatically detect the termination and provision a new pod.
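As a rough illustration of how this fault could be injected, whether by a chaos tool or a small script, the sketch below deletes one pod of the book movie service using the Kubernetes Python client; the namespace and label selector are assumptions.

```python
# Illustrative fault injection: terminate one pod of the book movie service.
# The namespace and label selector are assumptions for this sketch; in
# practice a chaos tool's pod-delete experiment would do the same thing.
import random
from kubernetes import client, config

def terminate_book_movie_pod(namespace: str = "movie-booking",
                             label_selector: str = "app=book-movie") -> str:
    config.load_kube_config()
    core = client.CoreV1Api()
    pods = core.list_namespaced_pod(namespace, label_selector=label_selector).items
    if not pods:
        raise RuntimeError("No book movie pods found to terminate")
    victim = random.choice(pods).metadata.name
    core.delete_namespaced_pod(name=victim, namespace=namespace)
    return victim

if __name__ == "__main__":
    print("Terminated pod:", terminate_book_movie_pod())
```

If the hypothesis holds, the Deployment's ReplicaSet should schedule a replacement pod shortly after the deletion.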
Step 4: Run the experiment, then measure and analyze the results
Start by running experiments in a non-production environment. It is vital to understand how the system behaves before running the experiment: measure the required metrics under normal conditions, and then measure the same metrics after injecting the fault. If the experiment causes a severe impact, abort it and execute the rollback plan to revert to the steady state.
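Putting the pieces together, the sketch below shows one way the before/after measurement and the abort condition could be scripted around the fault injection. The fetch_metrics and rollback functions are hypothetical stubs that would be wired to your monitoring stack and runbook; they are not part of the original experiment tooling.

```python
# Illustrative experiment harness: measure, inject the fault, re-measure,
# and abort if the steady state is violated. fetch_metrics() and rollback()
# are hypothetical stubs to be wired to your monitoring stack and runbook.
import time

def fetch_metrics() -> dict:
    # Placeholder: query your monitoring system for the book movie service's
    # request rate, error rate, and p95 latency.
    return {"throughput_rps": 58.0, "error_rate": 0.0, "p95_latency_ms": 220.0}

def rollback() -> None:
    # Placeholder: execute the agreed rollback plan (e.g., redeploy or scale up).
    print("Rolling back to steady state...")

def run_pod_kill_experiment(steady_state, inject_fault, settle_seconds: int = 60):
    before = fetch_metrics()
    assert steady_state.is_met(**before), "System not in steady state; do not start"

    inject_fault()              # e.g., terminate_book_movie_pod() from the earlier sketch
    time.sleep(settle_seconds)  # give the system time to react

    after = fetch_metrics()
    if not steady_state.is_met(**after):
        rollback()
        return {"result": "hypothesis not met", "before": before, "after": after}
    return {"result": "hypothesis held", "before": before, "after": after}
```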
Figure 4: A schematic representation of the overall experiment, defined below.
The steps below describe how to perform a book movie pod-kill experiment:
Figure 5: Pre-experiment throughput is in the defined range of 50 to 65 requests per second, and there are no errors.
Figure 6: Pre-experiment latency results are in the expected range.
Figure 7: Throughput and errors after running the experiment (i.e., after terminating the book movie pod).
Figure 8: Latency after running the experiment (i.e., after terminating the book movie pod).
The data shows that when the book movie pod is terminated, the error rate and latency rise above the normal range, and throughput dips until Kubernetes detects the termination and automatically brings up a new pod running the book movie service; this glitch must be fixed.
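To quantify how long that glitch lasts, i.e., how long Kubernetes takes to bring the service back, a simple timed poll of the pod list can be used, as in the sketch below; the namespace and label selector are again assumptions.

```python
# Illustrative recovery-time measurement: after the pod is terminated, poll
# until a book movie pod is Running and ready again, and time how long it takes.
# The namespace and label selector are assumptions for this sketch.
import time
from kubernetes import client, config

def measure_recovery_seconds(namespace: str = "movie-booking",
                             label_selector: str = "app=book-movie",
                             timeout: int = 300) -> float:
    config.load_kube_config()
    core = client.CoreV1Api()
    start = time.time()
    while time.time() - start < timeout:
        pods = core.list_namespaced_pod(namespace, label_selector=label_selector).items
        healthy = [
            p for p in pods
            if p.status.phase == "Running"
            and all(c.ready for c in (p.status.container_statuses or []))
        ]
        if healthy:
            return time.time() - start
        time.sleep(2)
    raise TimeoutError("No healthy book movie pod came back within the timeout")
```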
Step 5: Fix and retest
In this case, one potential fix is to scale up the number of replicas, which ensures that a specified number of pods is running at all times; even if one pod fails, the remaining pod(s) will handle the requests. Note that after the fix, in the images below, there are zero errors, and throughput and latency are as expected and well within range.
Figure 9: Post-fix, throughput is within range and there are zero errors.
Figure 10: Post-fix, latency is in the acceptable range.
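One way to apply that fix programmatically, equivalent to raising spec.replicas on the deployment manifest or running a kubectl scale command, is sketched below; the deployment name and namespace are assumptions.

```python
# Illustrative fix: scale the book movie deployment to multiple replicas so
# that one pod failing does not take the service down. The deployment name
# and namespace are assumptions for this sketch.
from kubernetes import client, config

def scale_book_movie(replicas: int = 3,
                     deployment: str = "book-movie",
                     namespace: str = "movie-booking") -> None:
    config.load_kube_config()
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name=deployment,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

if __name__ == "__main__":
    scale_book_movie(replicas=3)
```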
Glitches like this can also be mitigated by implementing microservice resilience design patterns. Developers can create failure-resistant microservice applications by using resilience patterns such as circuit breaker, retry, and timeout.
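For illustration, a caller of the book movie service could combine these three patterns roughly as follows. This is a minimal sketch rather than a production implementation, the endpoint URL is an assumption, and in practice a resilience library or a service mesh would usually provide these behaviors.

```python
# Minimal sketch of retry, timeout, and circuit-breaker patterns for a caller
# of the book movie service. In practice these usually come from a resilience
# library or a service mesh rather than hand-rolled code.
import time
import requests

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_seconds: int = 30):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: allow a trial call after the cool-down period.
        return time.time() - self.opened_at >= self.reset_seconds

    def record(self, success: bool) -> None:
        if success:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()

breaker = CircuitBreaker()

def book_movie(payload: dict, url: str = "http://book-movie/book",
               retries: int = 3, timeout_s: float = 2.0) -> dict:
    """Call the (assumed) booking endpoint with timeout, retry, and breaker."""
    for attempt in range(retries):
        if not breaker.allow():
            raise RuntimeError("Circuit open: failing fast instead of waiting")
        try:
            resp = requests.post(url, json=payload, timeout=timeout_s)
            resp.raise_for_status()
            breaker.record(success=True)
            return resp.json()
        except requests.RequestException:
            breaker.record(success=False)
            time.sleep(2 ** attempt)  # exponential backoff before retrying
    raise RuntimeError("book movie call failed after all retries")
```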
Step 6: Iteratively increase the blast radius
Once the experiments run successfully, the recommendation is to gradually increase the blast radius. For example, in the sample experiment above, we terminate one service and then progressively terminate others. Similarly, start experimenting in staging environments before moving to production.
To ensure user satisfaction and retention in the digital world, businesses must consider certain factors (time to market, user experience, system reliability, and the ability to recover quickly) to run operations and deliver value to end users. Incorporating chaos engineering into the SDLC brings both operational and technical benefits, such as prevented revenue losses, reduced maintenance costs, improved incident management, fewer incidents, a lighter burden on on-call staff, a better understanding of system failure modes, and improved system design.
Given these many advantages, organizations should take steps toward chaos engineering to facilitate the creation of reliable software systems.
References:
1. Gremlin, "The State of Chaos Engineering 2021" report.
Director, Intelligent Automation
Sathyapriya Bhaskar has over 18 years of experience in Quality Engineering & DevOps. She is a practicing Solutions Architect, helping customers to design and implement SDLC automation solutions. She is also a certified chaos engineering practitioner and professional.