A quick recap of Service Assurance Monitoring: it is designed to help businesses deliver on their agreements with users. Service assurance monitoring is made up of SLAs (service level agreements), SLOs (service level objectives), and SLIs (service level indicators). SLAs are the agreements a business makes with users, SLOs are the objectives technical teams must maintain to meet those agreements, and SLIs are the metrics that measure actual performance.
Now that we understand what Service Assurance Monitoring is, we can explore why it is important and how to best implement it into your organization.
Current monitoring set by IT measures deep system metrics, such as disk space on specific storage arrays and CPU utilization on individual systems. Measuring the wrong things because they were important to measure in the past is a burden on technical teams and leads to alert fatigue. Many of these systems now scale automatically with the business and no longer need alerting.
Alert fatigue is a serious problem facing most technology-driven organizations, with the average IT organization seeing more than 50,000 alerts a month. Too many non-contextualized alerts create noise and reduce productivity. Technology teams end up chasing bad alerts instead of building new features that drive new revenue streams and, ultimately, make customers happy.
It's very likely your organization is already creating SLAs with your customers. It's probably just as likely that your organization isn't actually measuring those SLAs, which means you are also probably dealing with noisy alerting. Google faced the same issue and set out to fix it, releasing the Site Reliability Engineering book in 2016 and the Site Reliability Workbook in 2018. With these books, Google introduced the world to the SRE role and the key principles for eliminating unwanted alerts with service level objectives and service level indicators. The goal: monitor and alert on what matters most to the business, the customer experience.
As we covered in Part 1, SLOs and SLIs are designed to measure the pain points experienced by the customer. This forces technology teams to think about the product from the customer's perspective and to start measuring their services in ways that matter to the customer. Not only can technology teams see how pain points in the product impact the customer, but everyone else in the business can easily see that same impact. SLOs and SLIs can unify the business behind a single goal: improving the customer experience.
At this stage, the value of SLOs/SLIs has been discussed at length. Let's create a roadmap for some best practices to transform your organization to adapt and implement Service Assurance Monitoring.
The journey of digital transformation for any organization is difficult, but the most difficult part is the shift of mindset and culture it requires to be successful. It's not enough to write code on the cloud and call it a digital transformation. The entire culture of the business has to shift to meet the new demands being placed on it by customers. The customer experience is owned by the business, not only by the product teams, which requires large changes to the way products are delivered, monitored, and maintained. It's not an easy feat to get an entire organization on board with the change, especially if teams are comfortable with the way they currently work.
Success will look different from one business to another. In the beginning, it makes sense to implement SLOs and SLIs for a few services and grow from there. Having buy-in from stakeholders is important, but it's equally important to have buy-in from the teams creating the SLOs and SLIs.
The roadmap will outline a general approach to defining and implementing SLOs and SLIs.
Because SLOs and SLIs are meant for more than just the technology teams, defining what should be measured takes a little more thought. We suggest creating a shared document where everyone on the team can give input. Product and development teams will need to drill down on the customer journey and the technology behind it that makes it successful. That by itself is not an easy task. One service's customer experience may be considered a success based on a given metric, while another service within the same product may view the same metric as a failure. For example, an API delivering the correct error code may count as a successful transaction, while that same error code could mean a failed transaction on a different service. You may want to include a solution architect in the discussion to ensure the service also meets nonfunctional requirements.
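The error-code example above can be sketched in code. This is a minimal, hypothetical illustration (the service names and status sets are invented for the example): the same response can count as a good event for one service and a bad event for another, so success criteria must be defined per service.

```python
# Hypothetical per-service success criteria. The same HTTP status code
# counts as a good event for one service and a bad event for another.
GOOD_STATUSES = {
    # For a lookup API, a 404 for a missing record is a correct,
    # successful answer to the caller.
    "lookup-api": {200, 404},
    # For a checkout service, anything other than 200 means the
    # customer failed to complete a purchase.
    "checkout": {200},
}

def is_good_event(service: str, status_code: int) -> bool:
    """Return True if this event counts as 'good' for the service's SLI."""
    return status_code in GOOD_STATUSES.get(service, {200})

print(is_good_event("lookup-api", 404))  # True: a valid "not found" reply
print(is_good_event("checkout", 404))    # False: a failed purchase
```

Writing the criteria down this explicitly is the point of the shared document: each team states, in terms everyone can read, which events count as good for their service.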
A service is anything that provides functionality to a user. To create effective SLOs and SLIs, you’ll need to first define your services. Some examples of services you may find within your product offerings:
Now for the hard part: defining the expectations users have while using the services, alongside the goals of the business. At this stage, lean on the product manager and customer success rep for customer insight. Determine two to four distinct behaviors for each service. Once these are defined, it's time to define the metrics that will accurately measure the service against the previously defined customer expectations.
Remember to keep it simple. Each service should have no more than three to four SLOs, and limit the SLIs only to metrics that define what is most valuable to the customer. A valid event for one service may not be valid for a different service. Each event is contextual and can vary. This is where the engineering team needs to have a strong voice in defining what gets measured.
As a reminder, an SLO is the target percentage of good events out of the total number of events. It represents a cluster of events defined by the SLIs. An SLI is measured as good events divided by valid events, and the SLIs roll up to the SLO.
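The arithmetic behind that definition is simple enough to sketch. The numbers below are made up for illustration; the only assumption is the definition above, SLI = good events / valid events, compared against an SLO target percentage.

```python
def sli(good_events: int, valid_events: int) -> float:
    """SLI = good events / valid events, expressed as a percentage."""
    return 100.0 * good_events / valid_events

# Example target: 99.9% of valid events must be good.
slo_target = 99.9

# Illustrative numbers: roughly a million requests in the window.
current_sli = sli(good_events=999_420, valid_events=1_000_000)
print(f"SLI: {current_sli:.2f}%")  # SLI: 99.94%
print("SLO met" if current_sli >= slo_target else "SLO breached")  # SLO met
```

Note that the denominator is *valid* events: events the team has decided don't count for this service (as discussed above) are excluded before the ratio is taken.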
Negotiating an SLO is an important part of the process, but it shouldn't be burdensome. SLOs are meant to be iterated upon and adjusted. 100% is not the correct target: it is expensive and leaves no room for innovation. Defining an SLO requires a trade-off between realistic user expectations and the effort required to meet them. In practice, you want to set the SLO at the worst possible level before a user would notice. This gives the product team as much error budget as possible to innovate, release more often, and make the service better. A good SLO strikes a balance between resources, time, and business objectives. If you cannot use an SLO while negotiating future work priorities, it's likely not an effective SLO.
SLO Decision Matrix:
Once your organization becomes mature in measuring SLOs and SLIs, the error budget will become an important piece to balancing feature velocity and service reliability. A very mature company should rely on the data from an error budget to determine how many bugs get worked on, how much tech debt to pay down, and how many new features to release. The error budget is the number of bad events that occur before the SLO is breached. As mentioned above, SLIs should have comfortable error budgets separating them from their target SLOs to allow for room to innovate.
In Part 3 of this series, we will dig into implementing SLIs to make them successful for your business: creating dashboards, alerts, reviews, and more.
Ready to dive in?