A huge topic, but it’s an important aspect of automation in DevOps.

Knowing some of the concepts and terminology will help you talk to your team about it.

We’ll look at ways that teams operate their production environments and respond to problems. I’m avoiding talking about specific tools, since those change all the time; work together with your team to learn the ones you choose.

Going back to the State of DevOps report and what makes teams successful with continuous delivery: it found that monitoring, observability, and continuous testing are key practices.

Think back to a time when you thought your team had done a thorough job of testing, but after you released you had a terrible production failure in a case you had not even thought about.

These unknown unknowns can really get us: our test environments can never fully emulate production, and we can’t anticipate every way customers will use our product. It’s not realistic to think that we can test everything, though in certain domains, such as safety-critical ones, we might need to.

If we can respond quickly to production failures, then we can take advantage of technology to help us identify problems. We can also learn which features customers really want and use.

Alerts

Traditionally, much of alerting was derived from black-box monitoring methods. Black-box monitoring refers to observing a system from the outside, which is useful for seeing the symptoms of a problem.

Our site might be down, but the symptom is not necessarily the root cause; our site might be down because the database went down.
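
As an illustration, here’s a minimal black-box probe sketched in Python using only the standard library: it checks the site from the outside, just as a user or an external monitoring service would, knowing nothing about the internals. The URL and the alert message are hypothetical.

```python
import urllib.error
import urllib.request

def check_site(url: str, timeout: float = 5.0) -> bool:
    """Black-box probe: request the page from outside and report whether it responds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, TimeoutError):
        return False

if __name__ == "__main__":
    # Hypothetical endpoint; in practice an external monitor runs this probe on a schedule.
    if not check_site("https://example.com/health"):
        print("ALERT: site is not responding")
```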

White-box monitoring refers to a category of monitoring tools and techniques that work with data reported from the internals of a system. Together with request tracing, metrics, and logs, these are the things that help us with observability.

Start by monitoring an application, then collect, aggregate, and analyze metrics to improve your understanding of how the system behaves.

We can set up alerts, although we want to do that wisely, and it’s something the whole team needs to decide together: log and report errors and related data in a centralized way, with developers instrumenting the application code to log information about each event in the system.

Other components, such as application servers, log information too, and structured logging can provide valuable insights.
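
To make that concrete, here’s a minimal sketch of structured logging in Python using only the standard library: each event is emitted as one JSON object so a centralized logging system can index and search it. The logger name and the extra fields are hypothetical.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object for centralized collection."""
    def format(self, record: logging.LogRecord) -> str:
        event = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become searchable attributes on the record.
        for key in ("user_id", "order_id", "duration_ms"):
            if hasattr(record, key):
                event[key] = getattr(record, key)
        return json.dumps(event)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One structured record per interesting event in the system.
logger.info("payment accepted", extra={"user_id": "u-123", "order_id": "o-456", "duration_ms": 87})
```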

Tracing

When a problem does occur, tracing lets you see how it happened: which function was executing, how long that function took, what parameters were passed, and how far into the flow the user got.

Tracing shows individual journeys through the application. It can be valuable for identifying bottlenecks, though the tools can be expensive because there are large amounts of data to process.
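
As a rough sketch of the kind of data a trace records, here’s a hand-rolled Python decorator that captures the function name, its parameters, and how long it ran. Real tracing tools and standards (OpenTelemetry, for example) do far more, including propagating context across services; the decorator and field names here are just for illustration.

```python
import functools
import time

def traced(func):
    """Record a simple 'span' for each call: name, arguments, and duration."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            duration_ms = (time.perf_counter() - start) * 1000
            print({
                "span": func.__name__,
                "args": args,
                "kwargs": kwargs,
                "duration_ms": round(duration_ms, 2),
            })
    return wrapper

@traced
def lookup_price(sku: str) -> float:
    time.sleep(0.05)  # stand-in for a slow database call
    return 9.99

lookup_price("ABC-123")
```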

Tracing is focused on the application rather than the underlying infrastructure. Today’s tracing tools can deal with huge amounts of data and often use machine learning to help analyze it.

Metrics are numbers measured over intervals of time, so they are cheaper and easier to store and process than logs. Metrics are generated by applications and operating systems.

They are a good place to start monitoring, and they let you see when particular resources go up or down.

One example of a widely used tool for metrics is Prometheus.
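
As a hedged sketch of what instrumenting an application for metrics can look like, here’s a small example using the prometheus_client Python library. The metric names and the port are assumptions, and Prometheus itself would be configured separately to scrape the exposed endpoint.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric names; Prometheus scrapes them from the /metrics endpoint.
REQUESTS = Counter("shop_requests_total", "Total requests handled")
LATENCY = Histogram("shop_request_seconds", "Request latency in seconds")

def handle_request():
    REQUESTS.inc()
    with LATENCY.time():
        time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)  # expose metrics at http://localhost:8000/metrics
    while True:
        handle_request()
```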

You’ll often hear people use the acronym ELK, a commonly used open-source tool stack. I’m not recommending it one way or the other, just noting that it’s one stack of tools that provides a database and search engine for log data, a way to get the log information from your servers into that database, and a way to query the database, with visual dashboards and analytics to help you quickly see patterns and problems.

Observability

Observability is what’s required to gain visibility into the behavior of applications and infrastructure; some people have even said it’s more important than unit tests.

It’s about being able to ask arbitrary questions about your environment without knowing ahead of time what you wanted to ask. With monitoring, by contrast, if we want to set up alerts we have to anticipate what problems might happen.

Observability tools tell us how a system behaved in the context of the production environment, with its unpredictable inputs, unpredictable system behavior, and unpredictable user behavior, using all the data and tools we have at our disposal.

Observability lets us identify problems and then dig down really fast to see exactly what happened, pinpointed even to an individual user, so we can quickly reproduce those errors and stop the customer pain by reverting the latest change or deploying a hotfix for the error we’ve found.

Chaos Engineering

Chaos engineering is another way to do testing in production, and like observability it helps us find the unknown unknowns.

There are different approaches to this, but I like the one Sara Wells uses: thoughtful and planned experiments, done safely in production or in test environments, to reveal system weaknesses.

We have a hypothesis and an evaluation: we do the tests and evaluate whether what we expected to happen actually did.
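
As a rough sketch of that hypothesis-and-evaluation loop, here’s what a chaos experiment might look like in Python. The steady-state measurement, the fault injection, and the threshold are all hypothetical placeholders; a real experiment would read from your monitoring system and keep the blast radius small.

```python
import random

def error_rate() -> float:
    """Hypothetical steady-state measurement, e.g. read from your monitoring system."""
    return random.uniform(0.0, 0.02)

def inject_fault() -> None:
    """Hypothetical fault injection, e.g. stopping one instance of a service."""
    print("stopping one instance of the recommendation service")

def run_experiment(threshold: float = 0.05) -> None:
    # Hypothesis: if one instance fails, the error rate stays below the threshold.
    baseline = error_rate()
    inject_fault()
    observed = error_rate()
    print(f"baseline={baseline:.3f} observed={observed:.3f}")
    if observed < threshold:
        print("Hypothesis held: the system tolerated the failure.")
    else:
        print("Hypothesis failed: investigate and fix the weakness we found.")

run_experiment()
```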

Chaos Monkey is a tool Netflix produced to do their chaos engineering. That approach was a little more random, trying different random things to see what happened, and of course having people on hand in case it caused terrible problems.

There’s a lot of tooling and a lot of products coming out for this, so it’s a new area to get involved with.

We can automate regression tests in production and take advantage of testing in the actual production environment, if we can do it safely: as long as we don’t impact customers and as long as we can hide the features we’re testing.

Using feature release toggles, this can be a really important way to see how things behave in production. Regression tests are not just for your test environments.

I mentioned chaos engineering as one way to do that. This does not mean letting customers find our bugs; it’s only feasible with observability and the capability to quickly revert or fix a problem.

We are able to do this because we can deploy to production without releasing the changes, thanks to feature toggles and other means such as canary and dark launches.
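
To illustrate the idea, here’s a minimal sketch of a feature toggle in Python: the new code path is deployed but stays hidden until the flag is turned on, or turned on only for a small canary group. The flag names and the in-memory store are assumptions; a real team would typically use a flag service or configuration system.

```python
# Hypothetical in-memory flag store; in practice this comes from a flag service or config.
FLAGS = {
    "new_checkout_flow": {"enabled": False, "allow_users": {"internal-tester-1"}},
}

def is_enabled(flag: str, user_id: str) -> bool:
    """A feature is on if it's globally enabled or the user is in the canary allow list."""
    config = FLAGS.get(flag, {})
    return config.get("enabled", False) or user_id in config.get("allow_users", set())

def checkout(user_id: str) -> str:
    if is_enabled("new_checkout_flow", user_id):
        return "new checkout flow"   # deployed, but only released to allowed users
    return "existing checkout flow"  # everyone else keeps the current behavior

print(checkout("internal-tester-1"))  # new checkout flow
print(checkout("regular-customer"))   # existing checkout flow
```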