How Site Reliability Engineering can do more for your dev and ops

29 Apr 2022|Opinion

DevOps has become a staple in the toolbox of many software companies, and for good reason. Site Reliability Engineering (SRE) implements DevOps principles with an extra focus on production and operations. It can help companies ramp up new ways of working, lower operational costs, boost development speed, and increase the business value of the services they create.


Let’s face it: operating IT services is hard to get right by accident. For one, it’s very labor-intensive, and the costs can get high: after all, an estimated 40–90% of a service’s total cost of ownership is incurred after it is launched. Downtime and production issues can affect end-user happiness and, eventually, the business. Brittle production hurts your employees as well: no one wants to wake up in the middle of the night to fight fires.

You can gain short-term cost savings by outsourcing and externalizing the problem. However, this usually comes with downsides. First and foremost, there’s the communication barrier between dev and ops, a lack of end-to-end visibility, and the endless responsibility ping-pong. On top of that, long waiting times for non-critical requests and decreased reliability are also possible side effects.

All of these hurt end-user satisfaction, the speed at which you can develop new business features, or both – and in the end, that hurts your business.

DevOps is the remedy, but sometimes it is not enough

DevOps has played a huge role in fixing many of these issues in software development, for example by:

tearing down silos between dev and ops
establishing end-to-end responsibility within one team, and
bringing about the cultural change necessary to constantly deliver value with small and frequent changes

Credit where credit is due – all these are known to greatly increase the ROI of service development. But in reality, the problems seldom end there.

One commonly overlooked phenomenon is that DevOps teams tend to be more mature on the dev side than on the ops side. Cloud, security, and overall operations knowledge might be quite low in any given DevOps team, which may lead to problems with reliability, scalability, and automation. Handling production is one thing, and most DevOps teams can manage it. Handling production well while simultaneously optimizing the end-user journey, increasing development speed, reducing manual work, and fixing production issues before they happen, all in a data-driven way, is another story.

Additionally, even when the DevOps team itself works efficiently, it might still depend on an internal or outsourced ops team, which creates a bottleneck and lowers the velocity of every team that relies on it.

When the number of autonomous DevOps teams increases, new challenges arise. Teams are very effective at building new business features and looking after their own service, but work that would help other teams might get deprioritized because it falls outside the team’s scope. This leads to a situation where teams reinvent the wheel and solve the same problems in too many ways. Without a centralized platform team or a culture of using a chunk of each team’s velocity to help other teams flourish, the overall velocity starts dropping.

How can SRE improve your dev and ops?

Site reliability engineering (SRE) is a paradigm that emphasizes the Ops part of DevOps and does so using Dev practices. It both speeds up development and brings ops back into the spotlight. Rather than treating ops simply as a cost to be minimized, it views ops as a strategic investment that enables and maximizes business value, developer productivity, and end-user satisfaction.

SRE’s key benefits can be summarized into two distinct areas: faster development speed and smoother operations and services.

Teams applying SRE practices are able to spend more of their time developing better services. This is possible because SRE takes a stance on reducing manual work, called “toil”. Anything that is repetitive and needs manual intervention is a candidate for removal. SREs use engineering practices to get rid of toil, and any means that helps is fair game: building automation, increasing reliability, improving alerting, or removing the need altogether with a process change. SREs invest time now to create even more time in the future.
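To make the "invest time to create more time" trade-off concrete, here is a minimal back-of-the-envelope sketch in Python. The function names and all of the numbers are hypothetical and for illustration only; they do not come from any SRE tool or from real project data. It assumes toil can be estimated in hours per week and that automation may leave a small amount of upkeep behind.

```python
def break_even_weeks(automation_hours: float,
                     toil_hours_per_week: float,
                     residual_hours_per_week: float = 0.0) -> float:
    """Weeks until the time spent on automation is paid back by saved toil."""
    saved_per_week = toil_hours_per_week - residual_hours_per_week
    if saved_per_week <= 0:
        raise ValueError("automation must save time every week to ever pay off")
    return automation_hours / saved_per_week


def net_hours_saved(automation_hours: float,
                    toil_hours_per_week: float,
                    weeks: int,
                    residual_hours_per_week: float = 0.0) -> float:
    """Hours gained (or lost, if negative) after the given number of weeks."""
    saved_per_week = toil_hours_per_week - residual_hours_per_week
    return saved_per_week * weeks - automation_hours


if __name__ == "__main__":
    # Hypothetical task: 40 h to automate, 3 h/week of manual work today,
    # 0.5 h/week of upkeep left over once it is automated.
    print(break_even_weeks(40, 3, 0.5))     # 16.0
    print(net_hours_saved(40, 3, 52, 0.5))  # 90.0
```

With these made-up figures the automation pays for itself in 16 weeks and frees up roughly 90 hours over a year. A rough estimate like this is only one input when prioritizing toil removal; reliability gains and fewer night-time interruptions do not show up in it.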

As the need for manual operations work decreases, more time is left for development. This speeds up feature development, which is good news for the business. The business is not the only winner: developers welcome the chance to spend more time on the development work they love. The group that wins the most, though, is the users of the service.

SRE is built around end-user happiness. It measures, manages, and optimizes the service for end-users, be they internal or external, as that is what matters the most.

SRE does not reinvent the wheel but builds on top of DevOps best practices. Any team familiar with DevOps can evolve to the SRE way of working.

Make no mistake – unlocking the benefits of SRE is a gradual process rather than an instant switch. It starts with understanding what SRE means in your context and picking the most suitable model, and only then proceeding to roll out and ramp up the new way of working. But with an experienced SRE provider like Futurice, the direction is always clear: steadily towards your long-term vision and increased business value.