SLAs (Service Level Agreements) have become a standard element of the contracts legal teams draw up with new customers to represent the level of service being offered. Failing to live up to an agreement can have multiple consequences, including financial penalties, service credits, or license extensions. Yet while legal teams continue to draw up these agreements, technical departments are too often oblivious to them and to how to measure their own products or services against customer expectations.
So, what is Service Assurance Monitoring, and why should your business care about it? It is safe to assume your business has users. Whether your business is customer facing or business facing, your technology is designed to serve users, and in a world that continues to demand 24/7 uptime and reliability, the challenge of delivering on user expectations continues to grow.
Service Assurance Monitoring is designed to help businesses deliver on their agreements with users. It is made up of SLAs (service level agreements), SLOs (service level objectives), and SLIs (service level indicators). SLAs are the agreements a business makes with users, SLOs are the objectives technical teams must maintain to meet those agreements, and SLIs are the actual metrics based on performance.
The benefits of using service assurance monitoring can be boiled down to a few primary use cases:
Universal monitoring – All metrics are measured the same way, which keeps everyone, internally and externally, on the same page about system performance. That way, everyone from customers to customer support to engineering can understand your system's availability, response times to system failures, the promises made to users, and so on.
Focus on customer success – SLIs are designed to isolate customer pain. A broken SLO or SLI shows exactly where customers are having a bad experience with your service, cutting through the noise to what is important – customer success.
Error budgets – Engineering teams and product managers can agree on when feature development takes priority versus paying down tech debt by using error budgets. Error budgets expose when agreements are being broken and time needs to be allocated to improving the service's performance.
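The arithmetic behind an error budget is simple. A minimal sketch, assuming an availability SLO over a 30-day window (the function names and window length here are illustrative, not from any particular tool):

```python
# Sketch of error-budget arithmetic for an availability SLO.
# The 30-day window and function names are illustrative assumptions.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for a given SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative = overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget

# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))  # 43.2
```

When the remaining budget approaches zero, that is the signal for teams to shift effort from new features toward reliability work.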
Service assurance monitoring comes with multiple challenges that have kept engineering organizations from adopting it completely. It's difficult to build universal monitoring when service metric data is siloed across different areas of the business and its existing tools. On average, business IT organizations use nine or more different monitoring tools, making it difficult to bring all that data together into a centralized view of a service's performance.
Some existing tools offer service assurance monitoring but fail to capture all the data needed to make it successful. Instead, one team may have SLOs around application performance while another has SLOs for infrastructure availability, yet neither is pulled together into a view the entire business can read. And because they are siloed, it becomes impossible to link them to the agreements they are meant to uphold.
Other challenges to service assurance monitoring are broken down in more detail below, alongside each level of monitoring: SLAs, SLOs, and SLIs.
Google's SRE handbook defines an SLA as "an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain." SLAs are defined and constructed at the business level and drawn up by legal teams. Because they are included in written agreements with users, breaching an SLA can have serious ramifications.
SLAs are measured by a collection of SLOs. It's important for a business to define the correct SLOs in order to know whether its SLAs, and therefore its customer agreements, are being met.
Free services, like Google Search, usually don't have an associated SLA because there is no direct contract with their users. Instead, they focus on SLOs to maintain a strong reputation and deliver a high-quality product or service.
Measuring SLAs is extremely challenging for a multitude of reasons:
As previously mentioned, service and product metrics are siloed. Businesses sit on swamps of data separated across different areas of the organization, and mapping all the necessary SLOs to the correct SLAs is impossible without the right tooling.
SLAs are often not written by the employees building or maintaining the service or product. This leads to SLA targets that are unachievable, SLAs that are difficult to measure, and agreements that don't take nuance into account.
Each user agreement can be unique, with different associated contract values. Measuring multiple SLAs and calculating the different consequences tied to each agreement can become burdensome.
An SLO is a specific target value, such as uptime or response time, that teams commit to within an SLA, and it is measured by a collection of SLIs. SLAs are legally drawn up in contracts with customers and users; SLOs are the targets necessary to meet those agreements. SLOs define customer expectations for products or services, and teams use them to measure their success.
Teams struggle to decide which SLOs to set. They tend to want to measure everything and overcomplicate the process. Simplicity is the best approach: if an SLO cannot be used in conversations about work priorities, it likely isn't necessary. SLOs should always focus on customer success and be easy to understand for anyone inside or outside the business. With SLOs, less is more, and a defined SLO around availability and reliability is usually enough for each service or product.
SLOs can be used for free and paid products or services, and for internal as well as external customers. SLOs are mostly used for customer success purposes, but they can also cover internal systems as long as they are not tied to an SLA. This is useful when the business wants to measure all teams with the same methodology.
Google's SRE handbook defines an SLI as "a carefully defined quantitative measure of some aspect of the level of service that is provided." SLIs are the actual numbers based on the performance of the product or service, and they measure compliance with an SLO. Teams can create as many SLIs as they need to accurately measure an SLO, but they should be careful not to over-complicate what is being measured and stick to what actually matters to users. It's important to keep it simple.
SLIs should all be measured the same way, making it easier for anyone to understand what's happening with the product or service and to isolate customer pain. An SLI is measured as: count of all successful transactions / count of all attempted transactions.
To put it all together: say the business agrees on an SLA with a customer of 99.9% uptime. The SLO should then be set at or above 99.9%. The SLI is the actual measurement of uptime, and to maintain compliance it must meet or exceed 99.9%. If the SLI dropped to 99.8%, the SLO and SLA would be out of compliance with the agreement.
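The formula and the worked example above can be sketched in a few lines. This is a minimal illustration, not a real monitoring implementation; the function names and transaction counts are assumptions made for the example:

```python
# Sketch of the SLI formula and an SLO compliance check.
# Function names and transaction counts are illustrative assumptions.

def sli(successful: int, attempted: int) -> float:
    """SLI = count of all successful transactions / count of all attempted transactions."""
    return successful / attempted

def in_compliance(sli_value: float, slo_target: float) -> bool:
    """An SLO (and the SLA it backs) holds while the SLI meets the target."""
    return sli_value >= slo_target

SLO_TARGET = 0.999  # matches the 99.9% uptime SLA in the example above

print(in_compliance(sli(998, 1000), SLO_TARGET))    # 99.8% -> False, out of compliance
print(in_compliance(sli(9995, 10000), SLO_TARGET))  # 99.95% -> True
```

The same ratio works for any transaction type, which is exactly why measuring every SLI the same way keeps the whole business on one page.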
In part 2 of this series covering Service Assurance Monitoring, we will go deeper into best practices for putting the customer first when setting SLAs, SLOs, and SLIs.
Ready to dive in?