Service level indicators (SLIs)
Data used to measure an SLO is called a Service level indicator (SLI), because it indicates whether the SLO is met.
For example, the SLO “The website landing/home page will take <5 seconds to load 99% of the time over a 7 day period” refers to website load times. To determine whether this SLO is met, you would need latency data capturing landing page load times for the past 7 days, detailed enough to show what share of loads finished in under 5 seconds. An average alone would not be enough, because it can hide a slow tail. This latency data would be the SLI for this SLO.
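As a rough illustration of how such latency data could be turned into a yes/no answer, the Python sketch below assumes the SLI is available as a list of individual load times in seconds, and reads "99% of the time" as "99% of page loads". The function name `fraction_within` and the sample values are hypothetical, chosen only for this example.

```python
def fraction_within(load_times_s, threshold_s):
    """Return the fraction of page loads that finished faster than threshold_s."""
    if not load_times_s:
        raise ValueError("no load-time samples in the window")
    fast = sum(1 for t in load_times_s if t < threshold_s)
    return fast / len(load_times_s)


# Hypothetical samples (seconds) standing in for 7 days of landing page loads.
samples = [1.2, 0.9, 4.8, 6.1, 2.3, 1.7, 3.0, 0.8, 1.1, 2.2]

sli = fraction_within(samples, threshold_s=5.0)  # share of loads under 5 seconds
slo_met = sli >= 0.99                            # target rate from the SLO

print(f"SLI: {sli:.1%}, SLO met: {slo_met}")
```

With these made-up samples, one load in ten exceeds 5 seconds, so the SLO would not be met.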
Because SLOs aim to quantify user experience, their associated SLIs are values that correspond to the quality of that user experience. This includes things like:
- The number of requests to an endpoint that complete successfully
- The number of requests to an endpoint that complete within 500ms
- Average load times for the specific pages and areas users rely on
SLIs are often formatted as percentages, representing the rate of “good” events out of all valid events.
For example, a valid event could be any user request to an endpoint, regardless of whether the request succeeds or fails. A “good” event would be a request to that endpoint that returns 200 OK. The SLI would be the percentage of all valid requests to the endpoint that returned 200 OK.
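The good-events-over-valid-events ratio can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the `Request` record and the `availability_sli` helper are hypothetical names, and every recorded request is treated as a valid event.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int  # HTTP status code returned by the endpoint


def availability_sli(requests):
    """Percentage of valid requests that returned 200 OK, or None if there were none."""
    valid = len(requests)  # assumption: every recorded request counts as valid
    if valid == 0:
        return None
    good = sum(1 for r in requests if r.status == 200)
    return 100.0 * good / valid


# Hypothetical traffic: 97 successes, 2 server errors, 1 not-found response.
traffic = [Request(200)] * 97 + [Request(500)] * 2 + [Request(404)]
sli_percent = availability_sli(traffic)

print(f"SLI: {sli_percent:.1f}% of valid requests were good")
```

Returning `None` for an empty window is one possible design choice; it avoids reporting 0% or 100% when there is no traffic to judge.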
An SLO is like a guarantee made to customers. By defining an SLO, an organization is promising to maintain a defined level of service, often related to the performance and availability of their systems.
Some examples:
- The website landing/home page will take <5 seconds to load 99% of the time over a 7 day period
- Checkout service will operate without error 99% of the time over each 30-day period
- The landing/home page will be able to successfully process at least 1,000 requests per second, 99% of the time, measured in 90-day increments
- Average response time in the cart service is <300 milliseconds, 98% of the time, measured in 7-day increments
Notice each includes the same components:
| Component | Description |
| --- | --- |
| Scope | The specific area an SLO relates to, such as “checkout service” or “landing page”. This is an area or function that impacts user experience. |
| Target Value | A measurable threshold for performance, like “1,000 requests per second” or “less than 5 seconds”. The data used to measure an SLO is referred to as a service level indicator (SLI), because it indicates whether the SLO is met. |
| Target Rate | The percentage of the time that performance will meet the target value. |
| Target Window | The period over which data is evaluated, such as “over 7 days”. |
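One way to see how the four components fit together is to model an SLO as a small record and evaluate it against measurements. The Python below is a sketch, not a reference implementation: the `SLO` class, its field names, and the sample data are hypothetical, and it assumes a "lower is better" target such as response time, with one measurement per evaluated interval inside the window.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SLO:
    scope: str           # area the SLO relates to, e.g. "cart service"
    target_value: float  # threshold each measurement must stay under
    target_rate: float   # required fraction of measurements meeting the threshold
    window_days: int     # length of the evaluation window in days

    def attainment(self, measurements: Sequence[float]) -> float:
        """Fraction of measurements in the window that beat the target value."""
        if not measurements:
            raise ValueError("no measurements in the window")
        good = sum(1 for m in measurements if m < self.target_value)
        return good / len(measurements)

    def is_met(self, measurements: Sequence[float]) -> bool:
        return self.attainment(measurements) >= self.target_rate


# "Average response time in the cart service is <300 milliseconds,
#  98% of the time, measured in 7-day increments"
cart_slo = SLO(scope="cart service", target_value=300.0, target_rate=0.98, window_days=7)

# Hypothetical response-time averages (ms) collected during one 7-day window.
good_week = [120.0] * 99 + [450.0]
bad_week = [120.0] * 97 + [450.0] * 3

print(cart_slo.is_met(good_week), cart_slo.is_met(bad_week))
```

A throughput SLO such as "at least 1,000 requests per second" would need the comparison reversed, which is why the direction of the threshold is called out as an assumption here.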
Notice also that each SLO is centered around user experience. SLOs should capture performance and availability levels that, even if barely met, would keep the average user satisfied. In the simplest terms: