So, you're just learning about the new kid on the block, huh? Service Level Objectives (SLOs) seem to be popping up everywhere, but you're not really sure what they are all about? No worries, we've got you covered!
Why Implement the SLO Methodology
SLOs are an essential part of a broader methodology known as Site Reliability Engineering (SRE). In a broad sense, SRE is simply a way to take software engineering principles and apply them to infrastructure and operations in order to create highly reliable systems. The assumption here is that operations is a software problem, therefore SRE should use software engineering approaches to solve it! Beyer et al. mention the following fundamental principles that SRE aims for.
- Minimize toil and automate as much as you can: If a machine can perform an operation, it should! Toil refers to mundane, repetitive operational work that grows linearly with the service and provides no enduring value. Time spent on operational tasks means time not spent on projects! Automation is a blessing. When done right, it means that a task is predictable, reliable, and effortless. Free up your personnel to focus on what matters.
- Aim for simplicity: Not only in your code but also in your approach to reliability. Software simplicity is a prerequisite to reliability. Systems are complex on their own, but you should be able to manage, expand, and talk about them regardless of this fact.
- Embrace risk: And finally, the most relevant one: reduce the cost of failure and allow yourself to move faster. Forget 100% reliability, it's expensive and unnecessary. Focus only on being reliable enough to meet your users' quality standards.
Hopefully, these sound appealing, but where to start? There are three concepts at the core of this methodology: Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets. Together they form what Hidalgo et al. call the reliability stack.
Before we go over them, let us highlight the following: The SLO methodology is a continuous and iterative process. The main purpose of SLOs is to provide you with new, meaningful data that allows you to look into your service from your users' point of view. It empowers you to make better decisions that influence how the users experience your product. It won't make your services reliable on its own, and it will probably need to be rethought and adjusted as time goes on. Nevertheless, if used properly, SLOs can become one of the pillars of your decision-making process.
SLIs, SLOs and Error Budgets
Service Level Indicators
SLIs are a quantitative measure of a specific aspect of service reliability. They allow you to classify interactions between your service and your customers into one of two states: they were either good events or bad events.
You may think of your SLIs as the manifestation of your standards of quality regarding several aspects of your service. There may be SLIs regarding availability, latency, throughput, frequency, etc. Let's go over some of the simplest examples:
You may want to track how you're doing regarding uptime, in which case time periods where you're up count as good events and time periods where you're down count as bad events. You may also care about how responsive your service feels, so you track the amount of time it takes for your service to answer user requests, in which case fast responses count as good events and slow responses count as bad events.
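As a rough illustration of the latency example above, here is a minimal sketch of how responses could be sorted into good and bad events. The 300 ms threshold and the `is_good_event` helper are assumptions made up for this example; the right threshold depends on what your users actually perceive as fast.

```python
# Hypothetical threshold: responses at or below 300 ms count as "fast".
LATENCY_THRESHOLD_MS = 300.0


def is_good_event(latency_ms: float, threshold_ms: float = LATENCY_THRESHOLD_MS) -> bool:
    """Classify a single response: True for a good event, False for a bad one."""
    return latency_ms <= threshold_ms


# Made-up response times, in milliseconds, for three user requests.
latencies_ms = [120.0, 450.0, 300.0]
labels = [is_good_event(latency) for latency in latencies_ms]
good_count = sum(labels)  # True counts as 1, False as 0
```

Every event ends up in exactly one of the two states, which is what makes the ratio-based objectives in the next section possible.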
Service Level Objectives
The binary event classification allowed by SLIs grants the possibility of calculating the ratio of good events over all the events within a certain period of time. Having that ratio, you can then define service-level objectives that set reliability targets for the SLI.
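To make the ratio concrete, here is a small sketch that computes it and checks it against a target. The function names and the 99.5% target are assumptions for illustration only, and treating an empty window as fully reliable is just one possible convention.

```python
def sli_ratio(good_events: int, total_events: int) -> float:
    """Ratio of good events over all events in the evaluation window."""
    if total_events == 0:
        # Assumption: with no events at all, nothing went wrong.
        return 1.0
    return good_events / total_events


def meets_objective(good_events: int, total_events: int, target: float) -> bool:
    """True when the measured ratio is at or above the SLO target."""
    return sli_ratio(good_events, total_events) >= target


# Example: 9,990 good events out of 10,000 against a 99.5% objective.
ratio = sli_ratio(9_990, 10_000)
on_target = meets_objective(9_990, 10_000, target=0.995)
```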
By establishing an objective for a specific time window, you allow your team to evaluate your performance over time. Your service's reliability stops being an abstract thing that is difficult to grasp and starts being an objective, tangible metric.
With your objective defined, you are now targeting a certain percentage of good events that's smaller than 100%. This means that you have some wiggle room for errors, which can be smaller or greater depending on your objective. This wiggle room is your error budget. As long as the number of errors doesn't push your ratio of good events below your objective, you're all right: you still have budget for errors. Otherwise, you've broken your objective and you are no longer within the region of acceptable quality that you set out.
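The arithmetic behind that wiggle room can be sketched as follows. This is an event-based example under assumed numbers (a 99% objective and 10,000 events); `error_budget_remaining` is a hypothetical helper, not part of any particular tool.

```python
def error_budget_remaining(good_events: int, total_events: int, target: float) -> float:
    """Fraction of the error budget still available.

    1.0 means the budget is untouched, 0.0 means it is fully spent,
    and a negative value means the objective has been broken.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if total_events == 0:
        return 1.0  # no events yet, so no budget has been consumed
    allowed_bad = (1.0 - target) * total_events  # bad events the objective tolerates
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


# A 99% objective over 10,000 events tolerates about 100 bad events.
healthy = error_budget_remaining(9_975, 10_000, target=0.99)  # 25 bad events
broken = error_budget_remaining(9_850, 10_000, target=0.99)   # 150 bad events
```

In the first case roughly three quarters of the budget is still available; in the second the result is negative, which corresponds to a broken objective.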
We won't go into much more detail about these concepts here. If you wish to have a deeper understanding, you can check out our article In-depth guide on SLIs, SLOs and Error Budgets. Over there we go over why service segmentation is important to reliability, what makes indicators and objectives meaningful, how they are calculated, and how to interpret them.
For now, just keep in mind the following: if your SLIs and SLOs are meaningful, then continuously monitoring error budgets provides a direct window into how your product looks from the perspective of your users. It allows you to, at any point in time, evaluate how you are doing regarding your own quality standards by simply looking at a few graphs. No effort, no omissions of events, no confusion, it's just there.
Implementing SLOs in an organization
In addition to what has already been mentioned, SLOs provide you with a new way of discussing your services internally in a way that anyone can understand, regardless of their role in the organization. With the right mindset - an SLO-driven mindset - everyone, whether from marketing, engineering or management, can understand statements such as: "We are closer to breaking our budget this quarter than we ever were! What can we do to prevent that from happening?" or "We haven't broken our budget once in the last year, we're becoming more and more reliable!" or even "We only have 20 more minutes of downtime before we break our budget for the month, let's take a step back from introducing new features."
This ease of communication about highly technical things allows you to tremendously reduce friction between different departments within your company. Even so, you probably won't be able to arrive at your desk one day and scream: “Let’s develop SLOs!” and have your entire team stop what they're doing and jump on that wagon. You'll have to start small, be patient, and yet relentless. It's a mindset that grows with time and becomes more and more valuable as everyone in the team gets involved. Don’t be discouraged if change doesn’t happen immediately.
Continue discussing these tools and concepts with your team, experiment with new ideas, and continuously move towards better monitoring and reliability agreements. Look for small wins as you go through this process and engage with everyone around you. More importantly, iterate over everything! Make adjustments to your SLO configurations and targets to keep them up to date with the evolution of your product and the feedback you receive from your colleagues.
Soon, we'll have another blog post describing how you can bring the SLO mindset into your organization, but for now, we'll leave you with the step-by-step priority guide created by Hidalgo et al.:
- Get buy-in: Communicate how SLOs work and get everyone in agreement that they provide value.
- Prioritize SLO work: Get the work on your roadmap, assign it to one or more people, and make it a priority.
- Implement your SLOs: Decide what SLIs to track, how to monitor them, and what level of reliability you want to provide, and learn how you’re performing against those targets.
- Use your SLOs: Decide as a team how to alert on your SLOs, how to use your error budget, and how to inform work priorities using your SLOs.
- Iterate on your SLOs: Discuss what is and isn’t working, add/remove/adjust your SLIs/SLOs, and continually revisit your SLOs to check that they reflect your stakeholders’ needs.
- Advocate for others to use SLOs: Use what you’ve learned to educate others about the benefits of SLOs.
Setting up SLIs and SLOs @ Detech.ai
We at detech.ai work every day to make your journey into SRE as convenient as possible, and we are always exploring and learning to continue to do so! By using our platform, you'll be able to centralize all your SRE-related tasks in a way that enforces best practices and is easy to manage and expand.
You may configure the Services and User Journeys that are of importance to you, on top of which you may define your SLOs, so that you are always aware of the state of each part of your system. When defining an SLO you may also add other useful parameters such as its criticality level, its review date, and the person, or team, in your organization responsible for monitoring its performance.
Define your SLIs using as many metrics as you'd like and a variety of evaluation methods to tune them to your exact needs.
Visualize how the SLI, error budget, and the rate of error budget consumption evolve over time in order to make the best decisions possible at any given time.
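For readers who want a feel for what a rate of error budget consumption means, here is a generic sketch of the commonly used burn rate calculation: the observed error rate divided by the error rate the objective allows. It is an illustration under assumed numbers, not a description of how the detech.ai platform computes it, and `burn_rate` is a hypothetical helper.

```python
def burn_rate(bad_events: int, total_events: int, target: float) -> float:
    """How fast the error budget is being consumed.

    A value of 1.0 means the budget would be spent exactly at the end of
    the SLO window; 2.0 means it would be gone in half that time.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if total_events == 0:
        return 0.0  # no traffic, so no budget is being consumed
    observed_error_rate = bad_events / total_events
    allowed_error_rate = 1.0 - target
    return observed_error_rate / allowed_error_rate


# 20 bad events out of 1,000 against a 99% objective: twice the allowed rate.
current_burn = burn_rate(20, 1_000, target=0.99)
```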
And we're just getting started! We are continuously improving our product offering and plan on adding many other features that will facilitate the implementation of SRE best practices within your company.
"Implementing Service Level Objectives: A Practical Guide to SLIs, SLOs, and Error Budgets" by Alex Hidalgo | 2020
"The Site Reliability Workbook: Practical Ways to Implement SRE" by Betsy Beyer, Niall Richard Murphy, et al. | 2018