No production system or platform is immune to outages. Nor can we guarantee that a newly launched product will not go down within a few hours. In some cases the outage could perhaps have been avoided with more effort from software developers and system administrators in configuring the applications, but nothing is that obvious when you run a vast distributed infrastructure and a variety of cloud services.
Internet users seem to have accepted that it is “normal” for any of their apps or cloud platforms to crash every so often. They spend hours unable to access the service, or cannot watch the latest episode of their series. We have seen it with Facebook, WhatsApp or, even more frustrating, HBO Spain at the premiere of Game of Thrones. But how do we avoid these massive outages? And, of course, the monetary losses that come with them?
This is where Chaos Engineering comes in. Its starting point is to ask how any distributed infrastructure would cope if an army of monkeys got in to sabotage the network configuration, shutting down machines in AWS or randomly pulling the plug on some of the connected microservices. What would happen? Would we still respond normally? Or would we respond with terrible 500 or 503 errors, or with nothing at all?
Chaos Engineering is not something new, nor the hype of the moment
We have to go back to the end of 2010, when Netflix published its magnificent post about the lessons learned from migrating to Amazon Web Services. On the one hand, it made clear that a company of its size needed the cloud in order to keep growing; on the other, it voiced the concern of how to ensure that everything would keep working with dozens of services already distributed and now moved entirely out of its own data centers.
So the first thing they did was create a set of tools affectionately named Chaos Monkey. Initially, its job was to randomly kill AWS instances and see how that affected their infrastructure.
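The core idea of randomly killing an instance can be sketched in a few lines. This is only an in-memory simulation under stated assumptions: `Instance`, `InstanceGroup` and `unleash_monkey` are hypothetical names invented for illustration, not part of Netflix's tooling, and no real cloud API is called.

```python
from __future__ import annotations

import random
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical stand-in for a cloud instance."""
    instance_id: str
    running: bool = True


@dataclass
class InstanceGroup:
    """Hypothetical group of instances serving one service."""
    name: str
    instances: list[Instance] = field(default_factory=list)

    def healthy(self) -> list[Instance]:
        """Return the instances that are still running."""
        return [i for i in self.instances if i.running]


def unleash_monkey(group: InstanceGroup, rng: random.Random) -> Instance | None:
    """Terminate one randomly chosen running instance, if any remain."""
    candidates = group.healthy()
    if not candidates:
        return None
    victim = rng.choice(candidates)
    # Assumption: a real tool would call the cloud provider here instead.
    victim.running = False
    return victim


if __name__ == "__main__":
    group = InstanceGroup("api", [Instance(f"i-{n:04d}") for n in range(4)])
    victim = unleash_monkey(group, random.Random())
    if victim is not None:
        print(f"terminated {victim.instance_id}, {len(group.healthy())} still running")
```

The interesting part of a real run is not the termination itself but what the rest of the platform does afterwards.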
In this way, Netflix made sure it could predict what the user experience would be when one of its most important components collapsed. If the recommendation system went down, they could redirect that traffic to something more static, such as the list of most popular series. If the platform’s streaming system struggled, it could lower the playback quality until the platform stabilized, so that the user could keep watching without interruptions. In no case should a failure of the recommendation database affect another isolated service such as streaming. If that happened we would have a problem, probably a poorly conceived and totally unnecessary dependency.
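The fallback from personalised recommendations to a static list could look roughly like this. It is a minimal sketch, not Netflix's implementation: `fetch_recommendations`, `home_page_titles`, `POPULAR_TITLES` and the `service_up` flag are all hypothetical, and the outage is simulated rather than real.

```python
from __future__ import annotations

# Hypothetical static content that can be served without the recommender.
POPULAR_TITLES = ["Title A", "Title B", "Title C"]


class RecommendationUnavailable(Exception):
    """Raised when the (hypothetical) recommendation service cannot answer."""


def fetch_recommendations(user_id: str, service_up: bool) -> list[str]:
    """Stand-in for a remote call; `service_up` simulates the outage."""
    if not service_up:
        raise RecommendationUnavailable(user_id)
    return [f"Personalised pick for {user_id}"]


def home_page_titles(user_id: str, service_up: bool = True) -> list[str]:
    """Return personalised titles, degrading to the popular list on failure."""
    try:
        return fetch_recommendations(user_id, service_up)
    except RecommendationUnavailable:
        # Degrade gracefully instead of propagating the error to the user.
        return list(POPULAR_TITLES)


if __name__ == "__main__":
    print(home_page_titles("u1"))
    print(home_page_titles("u1", service_up=False))
```

The point of the design is that the caller never sees the failure of the recommender: it only sees a less personalised page.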
Cause intentional failures to discover errors in our production platform
The methodology that Chaos Engineering promotes is simple in concept but complicated to execute. If unit tests are the most “micro” part, focused on checking that our code does what it says it does, then Chaos Engineering can be considered the “macro” part, ensuring the correct functioning of our entire infrastructure (in production).
Today it is very difficult to bring the importance of Chaos Engineering into the culture of many development teams. The fault possibly lies with technical managers, such as the CTO or CIO, who are unable to understand what production failures really cost the company when they occur.
Obviously, a culture in which the software development team proactively looks for failures in its own system is not the norm; more often we tear our hair out afterwards, when something inexplicably fails.
Although it is hard, we need to find those failures in our architecture before they happen: the unnecessary latencies, the random failures that we do not know how to detect, and anything else that would prevent us from adequately serving all incoming requests when our platform hits a peak of active users.
Probably the first to be convinced will be those who have had to fight production failures on a Friday night or a holiday. They are not willing to let that happen again.
Chaos Engineering in practice
As described in the Chaos Engineering manifesto, in order to address the uncertainty of large-scale distributed systems, we must follow a series of steps to carry out the necessary experiments:
- We have to be able to measure our system under normal conditions. If we are not monitoring our platform, not only will we be unable to find out what went wrong when we run the experiment, we are also unable to see that something is failing right now. Always monitor your applications and the health of your components.
- Form hypotheses about the state the production system will reach under the failures. How should it behave? What do we expect from it?
- Define the points of failure of our system: we have to agree on events that cause errors similar to those in the real world, such as servers that stop working, disks that run out of space or switch to read-only mode, unstable network connections, high response latencies, and so on.
- Draw up a plan for reviewing the results of the experiment: whether our hypothesis held, or whether we found real weaknesses that bring the system down. We have to fix them before they occur for real.
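The four steps above can be sketched as a small experiment loop: measure a baseline, inject a failure, measure again, and compare against the hypothesis. Everything here is an assumption made for illustration: the `Service` class simulates a service behind a health-checking load balancer, the error rate is the only metric, and the `tolerance` threshold is arbitrary.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Service:
    """Hypothetical service fronted by a health-checking load balancer."""
    replicas_up: list[bool]

    def handle_request(self) -> bool:
        # The request succeeds as long as at least one replica is healthy.
        return any(self.replicas_up)


@dataclass
class ExperimentResult:
    baseline_error_rate: float
    failure_error_rate: float
    hypothesis_held: bool


def error_rate(service: Service, requests: int) -> float:
    """Fraction of simulated requests that fail."""
    failures = sum(1 for _ in range(requests) if not service.handle_request())
    return failures / requests


def run_experiment(
    service: Service,
    inject: Callable[[Service], None],
    rollback: Callable[[Service], None],
    requests: int = 100,
    tolerance: float = 0.01,
) -> ExperimentResult:
    """Hypothesis: the error rate under failure stays within `tolerance` of the baseline."""
    baseline = error_rate(service, requests)  # step 1: measure normal conditions
    inject(service)                           # step 3: cause a realistic failure
    try:
        under_failure = error_rate(service, requests)
    finally:
        rollback(service)                     # always restore the system
    held = under_failure - baseline <= tolerance  # steps 2 and 4: check the hypothesis
    return ExperimentResult(baseline, under_failure, held)


def kill_first_replica(service: Service) -> None:
    service.replicas_up[0] = False


def kill_all_replicas(service: Service) -> None:
    service.replicas_up[:] = [False] * len(service.replicas_up)


def restore_replicas(service: Service) -> None:
    service.replicas_up[:] = [True] * len(service.replicas_up)


if __name__ == "__main__":
    svc = Service([True, True, True])
    print(run_experiment(svc, kill_first_replica, restore_replicas))
    print(run_experiment(svc, kill_all_replicas, restore_replicas))
```

Running the rollback in a `finally` block is a deliberate choice: an experiment that fails halfway should not leave the injected failure in place.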
Tools to use Chaos Monkey in your applications
Over the years, both Netflix and AWS have developed tools to simulate these events in complex environments. Netflix named its collection the SimianArmy, in which we find different agents, each in charge of a fundamental aspect of our configuration:
- Introduce artificial delays in the RESTful client-server layer to simulate service degradation and measure whether, and how, final response times are affected.
- Find instances that do not adhere to defined best practices and shut them down. For example, machines that are misconfigured and not being autoscaled properly.
- Detect unused resources in AWS and remove them, or discover vulnerabilities and security misconfigurations.
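As an illustration of the first agent in the list, artificial delays can be injected with a small decorator around a request handler. This is a sketch under assumptions, not SimianArmy code: `inject_latency`, `get_profile` and `timed` are hypothetical names, and the delay and probability values are arbitrary.

```python
from __future__ import annotations

import functools
import random
import time
from typing import Any, Callable


def inject_latency(
    delay_s: float, probability: float, rng: random.Random | None = None
) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that sleeps `delay_s` seconds before a fraction of calls."""
    rng = rng or random.Random()

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            if rng.random() < probability:
                time.sleep(delay_s)  # the artificial delay
            return func(*args, **kwargs)

        return wrapper

    return decorator


@inject_latency(delay_s=0.05, probability=1.0)
def get_profile(user_id: str) -> dict[str, str]:
    """Hypothetical REST handler used only for this example."""
    return {"id": user_id}


def timed(func: Callable[..., Any], *args: Any) -> tuple[Any, float]:
    """Call `func` and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    result, elapsed = timed(get_profile, "42")
    print(result, f"{elapsed:.3f}s")
```

Comparing the timings with and without the decorator shows how much of the injected delay reaches the caller, which is the measurement the agent is after.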
You can consult the open source SimianArmy project on GitHub, as well as its migration into the Spinnaker continuous delivery platform. There are also projects like Gremlin that try to bring the concepts of Chaos Engineering to more companies in an intuitive way through their platform.
And if you want to dig deeper into the topic, I recommend this list of resources on GitHub: Awesome Chaos Engineering, as well as the book that the Netflix team wrote on the subject.
As mentioned above, these experiments need to be adopted as part of the development team’s culture. To carry out the steps described, we need the flexibility to simulate these events in our production environment or, if we have one, in a staging environment for a more isolated initial experiment.
We have to remember, though, that our goal is to find failures that are as real as possible. And, although it sounds bad, that means running those tests in production.