8 June 2022

8 mins read

Chaos Engineering or fixing in the production

Jadranko Kovacec

QA Automation Quality Engineer

Companies that strive to build a stable and resilient product invest large amounts of resources in testing it. However, that does not always yield the desired results.

Does fixing the software product in production sound better than breaking the software? Well, that may depend on who we ask.

So maybe we should choose our words carefully when presenting this to stakeholders or managers. Some may not like “destructive” terms, and sometimes we may need to use “creative” ones. But if we want to expose the software’s weaknesses and make it more robust, reliable, and stable, we need to break it first and then fix it.

What is it called when you experiment with software to see how long it will last under the given pressure and how long it will be able to hold on to its integrity before malfunctioning? Does this have a name?

Let me introduce you to Chaos Engineering (CE from here on).

What is Chaos Engineering?

One common definition says that CE is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

Let’s say that you build software that provides some functionality. It works well and passes every test you run. But what if, in the production environment, that software is pushed to its limits and eventually malfunctions? The loss of revenue could be huge. The larger the system, the more significant the loss could be. To avoid that, we should minimise this risk.

CE should be implemented in large systems that are available to users almost all the time. The more users, the greater the impact of a failure.

Let’s dive deeper into the subject.

Chaos Engineering in Practice

When we speak about software development, we need to think about a software system’s resiliency, which is typically specified as a software requirement. Resiliency is the system’s ability to tolerate failures while still ensuring adequate Quality of Service (QoS). Development teams often find this requirement hard to meet because of limited know-how or a tight project schedule. This is where Chaos Engineering comes in.

CE is a technique that helps meet the resilience requirement and can be applied to application failures, infrastructure failures and network failures.

This is done by injecting failures into a specific system in a controlled and monitored way to gain insight into its resilience. If massive flaws emerge, we should find a way to mitigate them. With CE, we can uncover weaknesses in a system, find ways to deal with them, and build a more resilient system, which minimises the risk of lost revenue or of the “market shame” that users can create by reviewing the software.

CE experiments follow these steps.

Step #1

We should define a normal state of the observed system that represents its normal behaviour and has some measurable output. For that purpose, we should measure the system’s output over a short period of time and set that data as the normal, or steady, state.
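To make this step concrete, here is a minimal Python sketch of turning raw measurements into a steady-state summary. The chosen metrics (mean latency, 95th-percentile latency, error rate) and the names SteadyState and summarize are assumptions made for illustration; your system may call for different measurable outputs.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Baseline numbers that describe normal behaviour (hypothetical metrics)."""
    mean_latency_ms: float
    p95_latency_ms: float
    error_rate: float


def summarize(latencies_ms: list[float], errors: int, requests: int) -> SteadyState:
    """Reduce the observations from the measurement window to a baseline."""
    if not latencies_ms or requests <= 0:
        raise ValueError("need at least one observation")
    ordered = sorted(latencies_ms)
    # Nearest-rank 95th percentile, computed with integer maths: ceil(0.95 * n).
    rank = (95 * len(ordered) + 99) // 100
    return SteadyState(
        mean_latency_ms=statistics.fmean(ordered),
        p95_latency_ms=ordered[rank - 1],
        error_rate=errors / requests,
    )


if __name__ == "__main__":
    # Made-up sample values, only to show the call.
    print(summarize([120.0, 130.0, 125.0, 480.0], errors=1, requests=200))
```

The summary is deliberately small: a handful of numbers is easier to compare later than a full trace.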

Step #2

We should hypothesise that this steady state will continue in both the control group and the experimental group. The hypothesis should be based on the measurable data that we have already collected.

Step #3

We should introduce situations that represent real-world events: server issues (server shutdown, malformed responses, traffic spikes, resource exhaustion), network issues (low bandwidth, latency, packet loss or complete link failure), hard disk issues (resource exhaustion, corrupted data), etc.
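As an illustration of fault injection at the application level, the sketch below wraps a call to a dependency and adds latency plus a configurable share of failures. The names with_faults and InjectedFault are hypothetical and not part of any CE tool; real experiments usually rely on dedicated tooling that works at the infrastructure or network layer.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFault(RuntimeError):
    """Raised in place of a real response to simulate a failing dependency."""


def with_faults(call: Callable[[], T], *, error_rate: float, delay_s: float,
                rng: random.Random,
                sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call() after adding latency, failing a fraction of the requests."""
    sleep(delay_s)                     # simulated network latency
    if rng.random() < error_rate:      # simulated server failure
        raise InjectedFault("injected failure")
    return call()


if __name__ == "__main__":
    rng = random.Random(7)             # seeded so a run can be repeated
    for _ in range(5):
        try:
            print(with_faults(lambda: "ok", error_rate=0.4, delay_s=0.05, rng=rng))
        except InjectedFault as fault:
            print("failed:", fault)
```

Passing the random generator and the sleep function as parameters keeps the injected behaviour repeatable and easy to test.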

Preferably, we should do these experiments directly in the production environment to ensure the authenticity and relevance of the currently deployed system. But be careful not to impact users during the process. If the user experience is affected, stop testing immediately. Learn from this experience and adapt your testing strategy. Try again.
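The stop-immediately rule can be automated. The following Python sketch assumes that a monitoring system supplies a stream of error-rate readings during the experiment and that a rollback function removes the injected fault; both are placeholders, and the threshold is an arbitrary example value.

```python
from typing import Callable, Iterable


def run_with_abort(error_rates: Iterable[float], max_error_rate: float,
                   rollback: Callable[[], None]) -> bool:
    """Watch error-rate readings taken while the fault is active.

    Returns True if the experiment ran to completion and False if it was
    aborted because users were being affected.
    """
    for reading in error_rates:
        if reading > max_error_rate:
            rollback()      # remove the injected fault straight away
            return False
    return True


if __name__ == "__main__":
    readings = [0.01, 0.02, 0.20, 0.01]          # made-up monitoring data
    completed = run_with_abort(readings, max_error_rate=0.05,
                               rollback=lambda: print("fault removed"))
    print("completed" if completed else "aborted")
```

Once the guard trips, no further readings are consumed, which mirrors the advice to stop, learn and try again with an adapted strategy.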

Step #4

We should try to disprove the hypothesis by looking for a difference in the steady state between the control group and the experimental group.
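One simple way to look for that difference is to compare each metric of the experimental group with the same metric of the control group. The Python sketch below is only an assumption about how such a check could look: the metric names, the 10% tolerance and the function names are illustrative, and a real analysis may need proper statistical tests.

```python
def relative_change(control: float, experiment: float) -> float:
    """Relative change of the experimental value against the control value."""
    if control == 0:
        return 0.0 if experiment == 0 else float("inf")
    return (experiment - control) / control


def degraded_metrics(control: dict[str, float], experiment: dict[str, float],
                     max_increase: float = 0.10) -> list[str]:
    """Return the metrics that grew by more than max_increase.

    An empty list means we failed to disprove the hypothesis: the steady
    state held in the experimental group as well.
    """
    return [name for name, base in control.items()
            if relative_change(base, experiment[name]) > max_increase]


if __name__ == "__main__":
    control = {"p95_latency_ms": 200.0, "error_rate": 0.01}      # made-up data
    experiment = {"p95_latency_ms": 210.0, "error_rate": 0.05}
    print(degraded_metrics(control, experiment))
```

Every metric that the function returns points at a weakness worth investigating before the next run.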

Step #5

Automate the process so that the experiments run continuously. Running them manually could be unsustainable.

The system should perform despite its failures and this is where Chaos Engineering becomes relevant.

Benefits of Chaos Engineering

CE can be a bit difficult to set up at first, but in the long term it should pay off. We can see its benefits from a Business, Customer and Technical perspective.

From the Business perspective, system failures can put the company in an undesirable position on the market, which can lead to revenue loss, a bad reputation, high maintenance costs, difficulties in hiring new employees and winning new business contracts, etc.

From the Customer perspective, CE can help maintain good product reviews, high product availability, a good reputation amongst current users and a stable inflow of new users.

From the Technical perspective, this technique can help reduce system issues or incidents, help us understand the system’s weaknesses better, improve the system and speed up incident resolution.

Real-life examples of what can go wrong are numerous.

We all use social media for communication, and most of us belong to some groups. Let’s say that a notification should reach you just in time for you to take a certain action, but it arrives too late. Another scenario: imagine that an app on your smartphone tracks your location and gives you incorrect information.

Here are some more examples. Imagine how many things could go wrong with the technology used for autonomous driving. Another case could be an in-house software solution that creates the PDF files employees often need to win new business. If that solution stopped working just when something had to be delivered on time, the impact on company profit could be noticeable.

Finally, imagine that your company is a big bank with many branches and a large share of a country’s market. A new version of the bank application is released. The application was only superficially tested due to a lack of resources, but it was decided to let it go into production anyway. When switching to production, some environmental issues arise, and the application becomes unavailable to all its users. The app rating in the store is dropping. The company’s reputation is also falling. Things like this should be avoided through flexibility and educating the engineers.

Maybe with CE in place, those things wouldn’t have happened.

Let’s get some things straight

If you have not yet encountered the term Chaos Engineering, you may be wondering about a few things, so let me explain them.

CE is not meant to produce chaos but to be a solution in case the mentioned problems arise.
Everything is done in a controlled environment (at least it should be if you don’t want to mess things up).
Everything is also monitored, as we need that data for later.
Apart from adding new features, developers care about reliability and write and run unit tests.
Software solutions should be well tested before production and CE goes beyond that. Testing provides data that we expect. CE provides unexpected data that can be used to make resilient solutions.
CE is mainly meant for big systems, as mentioned earlier, but it is also useful for smaller ones.
No special tooling is required.
CE works best when it is automated.
With experiments in CE, we learn new system properties and gain new knowledge or build our confidence in the software solution.

Usage of Chaos Engineering

Back in 2011, Netflix published the article “The Netflix Simian Army”, which introduced Chaos Engineering for the first time. That story also brought “Chaos Monkey” to light (developed in 2010 by the Netflix engineering team), a software tool that randomly simulates failures of production instances. Some of us may have watched the video of West African soldiers giving an AK-47 to an ape (YouTube: “Ape With AK-47”); the results are similarly unpredictable. Also, Chaos Monkey does not run as a service. Instead, a scheduled cron job calls Chaos Monkey once every weekday to create a schedule of terminations.
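To give a feel for what a schedule of terminations could look like, here is a small Python sketch. It is not Chaos Monkey’s code or configuration: the function name, the one-victim-per-group rule and the 09:00 to 15:00 window are all assumptions made for illustration.

```python
import random
from datetime import date, datetime, time, timedelta


def schedule_terminations(groups: dict[str, list[str]], day: date,
                          rng: random.Random,
                          start: time = time(9, 0), end: time = time(15, 0)
                          ) -> list[tuple[datetime, str]]:
    """Pick one instance per group and a random moment inside the window."""
    window_start = datetime.combine(day, start)
    window_s = int((datetime.combine(day, end) - window_start).total_seconds())
    if window_s <= 0:
        raise ValueError("end must be later than start")
    plan = []
    for name in sorted(groups):
        instances = groups[name]
        if not instances:
            continue                      # nothing to terminate in this group
        victim = rng.choice(instances)
        when = window_start + timedelta(seconds=rng.randrange(window_s))
        plan.append((when, victim))
    return sorted(plan)                   # earliest termination first


if __name__ == "__main__":
    groups = {"api": ["api-1", "api-2"], "db": ["db-1"]}   # hypothetical names
    for when, victim in schedule_terminations(groups, date(2022, 6, 8),
                                              random.Random()):
        print(when.isoformat(), victim)
```

Restricting terminations to a daytime window means that engineers are around to react when something breaks.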

We mentioned the “Simian Army” a moment ago, so what is it? It’s a whole suite of failure-inducing tools that goes beyond Chaos Monkey itself. It contains Latency Monkey, Conformity Monkey, Doctor Monkey, Janitor Monkey, Security Monkey, 10-18 Monkey, Chaos Gorilla and Chaos Kong. We’ll leave it at that for now, as describing what each tool does would take too long. The suite was created in 2011. In 2012, Chaos Monkey was made publicly available.

In 2016, “Gremlin” was introduced by Kolton Andrus and Matthew Fornaciari, and in late 2017 it was made publicly available as the world’s first managed enterprise CE solution.

In 2019, CE began to receive serious consideration for adoption, primarily from e-commerce and big tech. That is understandable, given the direct impact that downtime has on revenue.

In 2020, Amazon Web Services (AWS) added CE to the “Well-Architected Framework” (WAF), and the “Fault Injection Simulator” (FIS) was introduced as a service for natively running chaos experiments on AWS. Soon, some huge businesses started to adopt Chaos Engineering.

In 2021, Gremlin published its first State of Chaos Engineering report, showing the importance of CE.

Lower Mean Time to Resolution (MTTR) and Mean Time to Detection (MTTD), fewer bugs and higher system resilience are the benefits found in a study of the impact of CE in companies that embraced it.

Tools worth mentioning are Chaos Mesh, Litmus and ChaosBlade, alongside the already mentioned Gremlin, Simian Army and Chaos Monkey. Other tools are also available, but I’ll leave exploring them to you.
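For readers who have not worked with these two metrics, the Python sketch below shows one way to compute them from incident timestamps. Definitions vary between teams; here it is assumed that MTTD runs from the start of an incident to its detection and MTTR from the start to its resolution. The Incident record and the sample data are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    started: datetime
    detected: datetime
    resolved: datetime


def mean_delta(deltas: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time from the start of an incident to its detection."""
    return mean_delta([i.detected - i.started for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time from the start of an incident to its resolution."""
    return mean_delta([i.resolved - i.started for i in incidents])


if __name__ == "__main__":
    day = datetime(2022, 6, 8)
    incidents = [
        Incident(day.replace(hour=10), day.replace(hour=10, minute=10), day.replace(hour=11)),
        Incident(day.replace(hour=12), day.replace(hour=12, minute=20), day.replace(hour=12, minute=30)),
    ]
    print("MTTD:", mttd(incidents), "MTTR:", mttr(incidents))
```

Tracking both numbers before and after introducing CE is one way to check whether the experiments pay off.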

Using Chaos Engineering may indicate that a company follows a “shift left” testing strategy, in which testing starts at the earliest stages of the software development lifecycle (SDLC).

And finally, here’s how to set up CE in a best-practice way:

Design for system resilience at the infrastructure, networking, data, application, people and culture levels.
Use timeouts, retries, and fallbacks, techniques that Netflix adopted.
When starting to create a hypothesis, try the Computational Thinking technique.
Have a dedicated Chaos Engineering team set up everything that CE requires.
Educate yourselves and your employees.
Use Canary Deployment and Canary Testing, as they are the safest way to inject failure and monitor the results.
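The timeouts, retries and fallbacks from the list above can be combined in a few lines. The Python sketch below is a generic illustration, not Netflix’s implementation: the function name, the default values and the choice of exceptions to retry on are assumptions, and the operation is expected to enforce the timeout it is given.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_resilience(operation: Callable[[float], T],
                         fallback: Callable[[], T], *,
                         timeout_s: float = 1.0, retries: int = 2,
                         backoff_s: float = 0.1,
                         sleep: Callable[[float], None] = time.sleep) -> T:
    """Try operation(timeout_s) up to retries + 1 times, then use the fallback."""
    for attempt in range(retries + 1):
        try:
            return operation(timeout_s)
        except (TimeoutError, ConnectionError):
            if attempt < retries:
                # Exponential backoff before the next attempt.
                sleep(backoff_s * 2 ** attempt)
    return fallback()


if __name__ == "__main__":
    def always_down(timeout_s: float) -> str:
        raise TimeoutError(f"no answer within {timeout_s}s")

    # Falls back to a cached value after three failed attempts.
    print(call_with_resilience(always_down, lambda: "cached value"))
```

A chaos experiment that injects latency or failures is a direct way to check that such a wrapper really degrades gracefully.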

Embracing Chaos Engineering can bring a return on investment (ROI) in the form of more available and reliable company services.

I hope that future software development will use the best strategies and practices available, including the ones explained here.
