What is SRE?
In a nutshell, Site Reliability Engineering, SRE, is...
- Taking a software engineering approach to operations and infrastructure
- Enabling enterprises to reliably and economically scale their critical internal and external services
- Maximizing service/product value instead of just minimizing costs, while still lowering costs in the long run
- Measuring and optimizing what really matters
- Balancing service reliability and the pace of developing new features
- An opinionated collection of principles and practices
As you can see, SRE can be summarized from multiple perspectives. Let’s unpack each of them.
SRE takes a software engineering approach to operations and infrastructure
Software engineers, or developers as they are usually called, spend time engineering solutions. You pay them to design and build solutions. Operators, however, traditionally use their time to manually operate and maintain the solution developers have built.
These two worlds - engineering solutions and maintaining solutions - have been quite different and that has caused lots of problems.
DevOps puts the developers and operators into the same team, thus breaking down silos between them.
Site Reliability Engineering takes that thought a bit further: why not apply a software engineering approach to operations as well? Instead of doing things manually, let's engineer solutions that reduce the need for manual work.
SREs reduce the need for manual work in the future. Some examples of how this can be done: improving processes, increasing self-service, building automation and tooling, and improving service reliability.
SRE enables enterprises to reliably and economically scale their critical internal and external services
The traditional way of doing operations relies on human labor. A service level is agreed on (the SLA), and by default only the minimum amount of work required to meet that level is done.
When a service grows, the amount of operations work grows with it. Operations costs grow linearly with the size and complexity of the service.
SRE is cheaper than traditional ops in the long run because the work needed grows less than linearly. This is because SREs use their time to engineer solutions, which reduces the need for manual work in the future.
Better service quality means fewer incidents to handle, automation reduces manual work, and antifragility means problems get fixed before they affect the business.
So SRE should be thought of as an investment, which has a better ROI than the traditional decrease-the-costs ops approach.
A way to maximize service/product value instead of just minimizing the costs
Ask any IT director what their top five priorities are, and decreasing IT costs will surely be on the list.
That is understandable, because 40-90% of service costs are incurred after launch, and a major part of the whole IT budget can go to running costs.
Yes, decreasing IT-related costs should be a high priority!
Unfortunately, the cost-minimization route far too often leads to local optimization, and the organization ends up paying more.
Example: There is a heavyweight internal system, like SAP, centralized API management, or data storage, that directly or indirectly almost every development team has to use.
With the cost-optimization approach, maintenance of such a system is often outsourced to a vendor that promises cheap labor. A service level is set to ensure uptime. What is then monitored is the running cost of the system and whether the SLA has been breached.
What is not monitored is how much money is wasted because the maintenance team cannot serve the dependent project teams properly. You have all seen the problems: long waits for change requests, access requests that never seem to work on the first try, and an overall black-box mentality.
The money saved through lower labor costs does not cover the money lost to an operating model that makes good service impossible.
SRE fixes that by changing the operating model. Systems are operated as services, and service teams do everything they can to serve their internal and external users. They spend money to build a better service, whether that means improving the system, the process, or self-service. That spending leads to large cost savings later, as the dependent development teams waste fewer resources.
The initial investment might be a bit higher than with traditional ops, but in the long run this operating model saves more money.
A way to measure and optimize what really matters
As previously stated, SRE takes a service approach for operations. Services, as the name implies, are built to serve their users.
There is a sweet spot between the resources used to run and improve a service and the value it brings. Beyond that optimum, the return on further investment is no longer justified.
SRE has mechanisms both for measuring service value and for finding the sweet spot. Let's first discuss the former.
A key concept of SRE is the "Service Level Objective" (SLO), which states the desired level of how well the service should serve its users.
Desired level, not maximum level, because of the diminishing returns after the sweet spot. The business makes the decision about the desired level, as they also pay for the service development costs.
Example: "We have an online store. Obviously, we want our customers to be able to find and buy products. So instead of just measuring whether the online store is up, we should measure whether the end-users can find the products they are seeking and for example, whether the purchase process can be completed successfully."
SLOs force you to think about what the main user journeys of your service are and then decide what level of failures is acceptable in a month.
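The acceptable level of failures in a month can be turned into a concrete number, often called an error budget in SRE literature. The sketch below is only an illustration: the function names, the 99% target, and the request counts are assumptions made for the example, not figures from this article.

```python
def allowed_failures(slo_target: float, total_requests: int) -> int:
    """Return how many requests may fail in the period without breaching the SLO.

    slo_target is a fraction, e.g. 0.99 for a 99% objective (hypothetical value).
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    if total_requests < 0:
        raise ValueError("total_requests must not be negative")
    # Round first to absorb floating-point noise, then truncate to whole requests.
    return int(round(total_requests * (1.0 - slo_target), 6))


def remaining_budget(slo_target: float, total_requests: int, failed_requests: int) -> int:
    """Return how many more failures the service can absorb this period."""
    return allowed_failures(slo_target, total_requests) - failed_requests


if __name__ == "__main__":
    # Hypothetical month: 200,000 requests, 1,500 of them failed so far.
    print(allowed_failures(0.99, 200_000))
    print(remaining_budget(0.99, 200_000, 1_500))
```

A negative remaining budget would mean the objective has already been missed for the period.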
A way to balance service reliability and the pace of developing new features
SRE takes a realistic approach to operations: they cannot be done without failures. What can be done, however, is to decrease the size and impact of failures and find a sweet spot between development speed and the number of failures.
Failures can be seen as a by-product of development speed. The faster you run, the more failures there will be. The optimal number of failures is never zero, as that would mean a total halt of development.
The balance between speed and failure differs from service to service. The decision is left to the business, as they are the ones who benefit from the speed and get hurt by the failures.
The SLO is the mechanism that ensures the chosen level is met.
Using the example above, the company can decide that "95% of the product searches should return a list of products successfully within 5 seconds".
That is an SLO.
Then the service can be managed to that level. As long as the measured service level stays at or above the SLO threshold, you can keep on running.
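As a rough illustration, the search SLO above could be checked like this. The `SearchRequest` record and the function names are hypothetical; a real service would read these measurements from its monitoring system rather than from an in-memory list.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SearchRequest:
    """One observed product search (hypothetical record shape)."""
    succeeded: bool          # did the search return a list of products?
    latency_seconds: float   # how long the response took


def search_sli(requests: Sequence[SearchRequest], max_latency_seconds: float = 5.0) -> float:
    """Fraction of searches that succeeded within the latency limit."""
    if not requests:
        return 1.0  # assumption: no traffic counts as no failures
    good = sum(
        1 for r in requests
        if r.succeeded and r.latency_seconds <= max_latency_seconds
    )
    return good / len(requests)


def meets_slo(sli: float, target: float = 0.95) -> bool:
    """True while the measured level is at or above the objective."""
    return sli >= target


if __name__ == "__main__":
    observed = [SearchRequest(True, 0.8)] * 19 + [SearchRequest(True, 7.2)]
    sli = search_sli(observed)
    print(sli, meets_slo(sli))
```

Note that a search counts as good only when it both succeeds and is fast enough, which mirrors the wording of the example SLO.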
Breaching the SLO has consequences. The most common consequence is that new business features are put on hold until the quality of the service is improved and the SLO is met again.
SRE has lots of mechanisms that help you detect and fix problems even before the SLO is breached.
An opinionated collection of principles and practices
SRE is an opinionated collection of principles and practices. It is built on top of DevOps best practices, with an extra focus on production, business and end-user.
DevOps is a broad set of principles, guidelines, and culture. SRE is one implementation of DevOps. The relationship is like that of Agile and Scrum: the former is a set of principles, and the latter is one implementation of them with defined practices.
SRE implements DevOps, meaning that all DevOps principles are incorporated into SRE. On top of that, SRE takes an extra focus on reliability, scalability, business outcomes, and end-users.
We’ll dive into the SRE principles and practices next.
SRE Principles and practices
Principle #1: Operations is a software engineering problem
We already covered this at a high level, but let’s dig a bit deeper.
Using a software engineering approach for operations and infrastructure means:
- Treat infrastructure and configuration as code: store them in version control, and use development best practices like peer review and automated tests to reduce risk and human error
- Do what software engineers do: invest time to make more time in the future and build to create lasting value
- Because operations are treated with a software engineering approach, most of the tools, techniques, and languages used for operations are familiar to developers. This narrows the gap between the two skill sets.
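To make the first point a bit more concrete, here is a minimal sketch of configuration treated as code with an automated check. The configuration keys, the service name, and the validation rules are all invented for illustration; real setups typically use dedicated infrastructure-as-code tooling, but the idea of reviewing and testing configuration before it reaches production is the same.

```python
# Hypothetical service configuration, kept in version control next to the code.
SERVICE_CONFIG = {
    "name": "online-store-search",
    "environment": "production",
    "replicas": 3,
    "memory_mb": 512,
}


def validate_config(config: dict) -> list[str]:
    """Return a list of problems found; an empty list means the config passes."""
    errors = []
    if not config.get("name"):
        errors.append("name is required")
    if config.get("memory_mb", 0) <= 0:
        errors.append("memory_mb must be positive")
    # Illustrative rule: production services must not run as a single instance.
    if config.get("environment") == "production" and config.get("replicas", 0) < 2:
        errors.append("production services need at least 2 replicas")
    return errors


if __name__ == "__main__":
    problems = validate_config(SERVICE_CONFIG)
    print("config OK" if not problems else problems)
```

Run as part of a peer-reviewed change, a check like this catches a risky edit (for example, dropping production to one replica) before a human has to notice it.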
Principle #2: Service Levels and proactiveness
Three terms used in SRE:
- Service Level Agreement, SLA, is the minimum level of service promised to customers, and breaching that usually means direct or indirect monetary compensation
- Service Level Objective, SLO, is an internal goal for the service level. Service is managed to that level and breaching SLO usually has consequences like slowing new feature development until the level is restored. SLOs measure things that matter, like whether the end-user can successfully find and purchase products from an online store, not just what some CPU load measurement is.
- Service Level Indicator, SLI, is a measurement that indicates the level of service. SLI tells whether you are within your SLO. SLI is often an aggregate of multiple technical measurements from all layers of the service.
And three golden rules:
- The business sets the SLO (with the help of others). This way the business decides the balance between feature development speed and service level, and the business also carries the responsibility for managing the service to hit that level.
- SLOs measure things that matter. SLOs force you to treat the service as a whole, understand the value it brings to end-users and the business, and then optimize toward the sweet spot between value and cost
- SRE is about being proactive. Thanks to measuring and monitoring SLOs, the moment service level starts to deteriorate, a problem can be found and fixed before it even causes any outages.
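One simple way to act before an SLO is breached is to raise a warning while the measured level is still above the objective but getting close to it. The sketch below assumes a hypothetical 95% objective and a two-percentage-point warning margin; both numbers and the status names are illustrative choices.

```python
def slo_status(sli: float, target: float = 0.95, warning_margin: float = 0.02) -> str:
    """Classify a measured service level relative to its objective.

    Returns "breached" below the target, "warning" when the level is within
    warning_margin above the target, and "ok" otherwise.
    """
    if sli < target:
        return "breached"
    if sli < target + warning_margin:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Hypothetical measurements from three consecutive days.
    for measured in (0.99, 0.96, 0.90):
        print(measured, slo_status(measured))
```

The "warning" state is the window in which a team can investigate and fix a deteriorating service before users or the business feel an outage.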
Principle #3: Get rid of toil (aka repetitive work)
"Toil" is any manual repetitive and tactical work, without lasting value. The SRE team constantly engineers solutions to reduce the amount of toil.
Let’s be clear here: manual work is not always evil. Rebooting a failed production server has high tactical value and you should do that. But having to reboot production servers (or even having servers) is toil that most likely can be removed via automation, tooling, and process improvements.
Removing toil doesn't always require automating the process. Sometimes toil can be removed by improving the process or increasing self-service.
SREs eat their own dog food, meaning that they both do the manual work and engineer solutions to reduce it. To ensure they have enough time for the latter, there is usually a cap on how much time they can spend on manual work. That cap is usually 50%.
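The limit on manual work is easy to track once time is logged by type. The following sketch assumes a team records hours in two hypothetical buckets, toil and engineering, and uses the 50% figure mentioned above as the default limit.

```python
def toil_share(toil_hours: float, engineering_hours: float) -> float:
    """Fraction of logged time that went to manual, repetitive work."""
    total = toil_hours + engineering_hours
    if total <= 0:
        raise ValueError("no hours logged")
    return toil_hours / total


def exceeds_toil_limit(toil_hours: float, engineering_hours: float, limit: float = 0.5) -> bool:
    """True when toil takes up more of the team's time than the limit allows."""
    return toil_share(toil_hours, engineering_hours) > limit


if __name__ == "__main__":
    # Hypothetical week: 30 hours of toil, 10 hours of engineering work.
    print(toil_share(30, 10), exceeds_toil_limit(30, 10))
```

When the check returns true, that is a signal to shift effort toward removing toil rather than absorbing more of it.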
Principle #4: Automation
Building automation should dominate what SRE does. Humans should spend their time on things that matter - that is building more automation or reacting to those operational problems that cannot be automated.
Why is automation so important? Many of us have operated a service with so many fires to fight that no one has time to improve the quality of the service. This never-ending negative spiral burns people out and costs a lot.
This situation rarely arises overnight; it is often the product of bad management decisions. Developing new features without investing in automation increases the need for manual work, until the team is no longer able to do anything but fight fires.
No one deliberately builds a warehouse that collapses after just one year, no matter how busy the organization is. Unfortunately, that kind of short-term optimization is common with digital services, and it will have huge costs in the future.
Principle #5: Reduce the cost of failure
Failures are inevitable and a product of development speed. What can be done is to greatly reduce the cost and impact of a failure.
Frequent deployments to production decrease batch size, which decreases the risk and impact of failure. Freeze periods are often a signal that deployment sizes are way too large and infrequent.
SREs build automation, which handles the deployment. Humans are error-prone, so removing the need for manual work for the deployment process reduces the number of failures.
There are lots of deployment tactics to reduce the size of failures. For example, a "canary deployment" means deploying the new version to only a small share of users first, say 1%. If something goes wrong, the other 99% of users are not affected.
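One way to pick which users see the canary is to hash a stable user identifier into a bucket, so the same user always gets the same version. This is a sketch under assumptions: the function name and the idea of routing by user ID are illustrative, and real traffic-splitting is usually done by a load balancer or deployment platform.

```python
import hashlib


def in_canary(user_id: str, canary_percent: float = 1.0) -> bool:
    """Decide deterministically whether a user is routed to the new version.

    canary_percent is the share of users, in percent, that get the canary.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000          # stable bucket in 0..9999
    return bucket < canary_percent * 100       # 1.0% maps to buckets 0..99


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(20_000)]
    canary_users = sum(in_canary(u) for u in users)
    print(f"{canary_users} of {len(users)} users see the new version")
```

Hashing instead of random selection keeps each user's experience consistent between requests, which makes problems in the canary easier to trace.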
Automatic rollbacks to the last working configuration are also a common tactic.
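The rollback idea can be sketched in a few lines. The `Service` class and the `health_check` callback are hypothetical stand-ins: in practice the activation step and the health check would talk to a real deployment system and real monitoring.

```python
from typing import Callable


class Service:
    """Tracks which version is active and which one last passed its health check."""

    def __init__(self, initial_version: str) -> None:
        self.active_version = initial_version
        self.last_good_version = initial_version

    def deploy(self, new_version: str, health_check: Callable[[str], bool]) -> bool:
        """Activate new_version; roll back automatically if it is unhealthy.

        Returns True if the new version stays active, False if it was rolled back.
        """
        self.active_version = new_version
        if health_check(new_version):
            self.last_good_version = new_version
            return True
        # Health check failed: restore the last working version without human action.
        self.active_version = self.last_good_version
        return False


if __name__ == "__main__":
    service = Service("v1")
    service.deploy("v2", health_check=lambda version: True)
    service.deploy("v3", health_check=lambda version: False)
    print(service.active_version)
```

Because the rollback target is recorded at the moment a version passes its check, a failed deployment never needs a person to remember what the last working state was.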
Principle #6: Shared ownership
SREs share the same skills with the development teams. That helps to break boundaries between development and operations. That also helps the developers to be responsible for the operations instead of externalizing the problem to the operations people.
When the development team follows SRE best practices or staffs the team with SRE experts, the end goal is that more engineers will have experience in production deployments, not fewer.
So although the SRE experts have excellent operations skills, they don't take sole responsibility for operations. Instead, they share the best practices with the whole team.
There are no individual roles within SRE. A Site Reliability Engineer is a person who is really skilled at Site Reliability Engineering, but any team can start incorporating SRE best practices into their daily work.
However, there are clearly different roles of SRE teams. Here are the most common:
Operates what they build
- "We are SRE" - any ops or DevOps team simply starts learning and applying SRE practices
- "Embedded SRE" - one or more SRE experts are embedded in an existing development team to evolve its practices toward SRE
Operates foundations that other teams use
- "Platform SRE" or "Tooling SRE" - builds and operates centralized development platforms and common solutions that most of the teams use
Helps others to operate better
- "SRE center of practice" - a central SRE team that works as a pool of experts and also helps development teams to be more efficient
Operates services built by others
- "Google SRE" - a model where development and operations are separated. This usually only makes sense in a very large organization
These roles can be combined. For example, a Platform SRE team can also work as an SRE center of practice, both supporting teams with their ops problems and building common solutions for them.
As you can see, only one of these roles operates services built by others. So SRE does not necessarily mean taking operations away from the development team, and in most cases that is not the optimal solution unless your scale is huge.
Site Reliability Engineering, SRE, is an approach where operations and infrastructure are handled with software engineering practices.
The main goals of SRE are to reduce manual work by building automation, improving processes, increasing self-service, and improving service reliability.
SRE costs less than traditional operations in the long run due to reducing manual work and maximizing the value of a service. For external services, this helps to optimize and manage the value the service brings to the end-user and thus to the business. For internal services, this has massive benefits as all the dependent development teams will waste less time due to better service.
SRE uses DevOps principles and best practices, so evolving the practices of existing teams is easier when the basics are already familiar.