My recent blog post discussed the role that DevOps has had in fixing some of the big issues of managing operations, and how Site Reliability Engineering and SRE services can help take things even further. This time we’ll take a more concrete and detailed look at the inner workings of SRE that help make all this happen.
Why the traditional way of handling ops doesn’t always cut it
Sometimes the original may indeed be the best option on the market, but the claim doesn’t necessarily hold true when talking about ops management – not categorically, anyway.
The traditional way of handling operations has been to allocate people to work things out manually. This is a very expensive approach by default, so different means have been implemented to minimize costs, for example outsourcing operations to an external partner who offers a cheaper hourly rate or a package price for handling the service. Service level agreements (SLAs) are then used to strike an acceptable balance between the cost and the quality of the service.
A considerable share of IT budgets is spent on running costs for critical and non-critical services. It is understandable that minimizing these costs is something every IT director seeks to do – and that model is quite sensible for some use cases, but certainly not all of them. Fortunately, as we’ve established, it’s not the only model there is.
DevOps focuses on maximizing the value of services
DevOps takes a different approach to the problem. Instead of concentrating on minimizing the cost of operations, it seeks to maximize the value of the whole service. Rather than creating a logical silo between developing a service and operating it, DevOps treats the entire service as one entity that requires a set of different skills and roles to build and operate successfully, in order to generate end-user satisfaction and business value. It’s an elegant approach, and a really nice way of thinking, if you ask me.
Some core ways for DevOps to maximize service value include:
- putting all the necessary people together to avoid communication silos,
- making small but frequent deployments to manage risks,
- using measurements and data to back up decisions, and
- building a fast feedback loop from ideas to production.
Site Reliability Engineering – all the good stuff without any of the nonsense
SRE takes the tried and tested DevOps approach and enhances it by adding some extra focus on production. It retains the emphasis that services do need to be operated – but with the important distinction that an expensive workforce should not be wasted on doing the same thing over and over again, since computers are more predictable and way cheaper. SRE uses various means to decrease the manual work needed for operations in order to create more time to build new features and better reliability.
SRE takes an aggressive approach to reducing what could best be described as toil, or repetitive manual work: wherever possible, it should be taken out of human hands and automated away. Sometimes that means building automation, sometimes it means improving service reliability in order to end up with fewer incidents, and sometimes it means increasing self-service so that certain work is no longer needed at all.
Reducing manual work will of course require engineering work first, so SRE can be thought of as an investment in a better future. SRE has mechanisms to ensure that enough time is allocated to fix things preemptively, because not fixing them will cost even more time in the future. It is the solution to the never-ending issue of not having time to fix root causes because putting out fires takes all your available time.
The second core principle of SRE is to focus on end-user and business value. Like DevOps, SRE measures and uses data for decision-making, but it also takes this a bit further by introducing the concept of Service Level Objectives (SLOs). An SLO is a target for an aggregate measurement of how well end-users can use your service to do something that brings value to you.
For example, if you have an online storefront, some of your SLOs would cover whether customers can log in, find the products they are looking for, successfully make a purchase and come away happy with the whole journey.
Yes, you’ll have to measure CPU and latency and whatnot, but these technical stats are not enough on their own. They have to be aggregated into meaningful measurements of the main journeys that end-users can take in your service.
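As a minimal sketch of that aggregation, assume a hypothetical log of journey attempts in which each record carries the journey name and whether the user completed it. The names `JourneyAttempt` and `journey_success_ratio` and the sample data are made up for illustration and do not come from any particular SRE tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JourneyAttempt:
    """One end-user attempt at a journey (hypothetical record shape)."""
    journey: str     # e.g. "login", "search", "checkout"
    succeeded: bool  # True if the user completed the journey


def journey_success_ratio(attempts, journey):
    """Return the share of successful attempts for one journey, or None if there is no data."""
    relevant = [a for a in attempts if a.journey == journey]
    if not relevant:
        return None
    return sum(a.succeeded for a in relevant) / len(relevant)


# Made-up sample data, for illustration only.
attempts = [
    JourneyAttempt("checkout", True),
    JourneyAttempt("checkout", True),
    JourneyAttempt("checkout", False),
    JourneyAttempt("checkout", True),
    JourneyAttempt("login", True),
]

checkout_ratio = journey_success_ratio(attempts, "checkout")
print(f"checkout success ratio: {checkout_ratio:.2%}")
```

In a real service the `succeeded` flag would itself be derived from lower-level signals such as error codes and latency thresholds; the point here is only that the number being tracked describes a user journey rather than a single machine.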
The beauty of SRE is that the business has to set the desired SLO. It is not an IT decision but a business one, because the stricter the SLO is, the happier your end-users will be, but the slower your development speed will be.
Once the SLO has been set and the service is being measured against it (the measurement itself is called the Service Level Indicator, or SLI), SRE makes sure that the service is managed to that level. If the SLO is at risk of being missed, SRE has mechanisms to decrease feature development speed and to prioritize improving the quality of the service. And if your SLO is not even remotely at risk, that means that you can run faster and increase the business feature development speed.
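One common mechanism for this in SRE practice is the error budget: the amount of failure the SLO still tolerates within a measurement window. The sketch below is one possible way to turn an SLI into a pacing decision, not a definitive implementation. The function names, the 25% threshold and the three policy labels are assumptions chosen for illustration.

```python
SHIP = "ship features at full speed"
SLOW = "slow down and schedule reliability work"
FREEZE = "pause feature work and fix reliability"


def error_budget_remaining(slo_target, good_events, total_events):
    """Fraction of the error budget left in the window (1.0 = untouched, <= 0 = spent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * total_events
    actual_failures = total_events - good_events
    return 1.0 - actual_failures / allowed_failures


def release_policy(budget_remaining, caution_threshold=0.25):
    """Map the remaining budget to a pacing decision (thresholds are assumptions)."""
    if budget_remaining <= 0.0:
        return FREEZE
    if budget_remaining < caution_threshold:
        return SLOW
    return SHIP


# Made-up example: a 99.9% SLO with 40 failed requests out of 100,000.
remaining = error_budget_remaining(0.999, good_events=99_960, total_events=100_000)
print(f"budget remaining: {remaining:.0%} -> {release_policy(remaining)}")
```

With a 99.9% target, 100,000 requests allow roughly 100 failures, so 40 failures leave about 60% of the budget and feature work can continue. The same function returns a negative value once the budget is overspent, which is the signal to shift effort toward reliability.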
SRE understands that failures are inevitable, and by default they are considered a product of development speed. The faster you run, the more problems there will be. But if you use DevOps principles such as frequent and small deployments to limit the size and blast radius of each risk, you can reach top running speed without risking your SLOs.
Like DevOps, SRE is a highly iterative method. You can begin your journey by introducing just some concepts of SRE and setting a relatively loose SLO. Then, little by little, you’ll start to improve both the service quality and automation level, as well as the level of applying SRE principles in your team.
Eventually, you’ll reach a point where SRE has a lower total cost of ownership than outsourcing the operations to the cheapest partner. That is because computers work faster than humans and all that service anti-fragility and automation work has reduced the need for human intervention and manual work. It will take a while for sure, but the payoff is more than worth the wait.