A Leap towards SRE Culture: Measuring SLOs, SLIs, and Error Budgets using Grafana Beyla

Sibaprasad Tripathy
7 min read · Apr 29, 2024

The field of Site Reliability Engineering (SRE) deals with bridging the gaps between Product, Engineering, and Operations. At its core, SRE takes a more development-centric approach to solving operational problems. As Google aptly puts it, “SRE is what you get when you treat operations as if it’s a software problem.”

With the emergence of distributed architectures, it becomes complex to manage each system and measure how well it performs. This is where SRE comes in. Instead of treating application reliability as an abstract term, SRE tries to quantify it with a data-driven approach.

Key SRE Practices:

  • SLA (Service Level Agreement): An SLA is an agreement between provider and client about measurable metrics like uptime, responsiveness, and responsibilities.
  • Example: The service should be available 99.9% of the time in a given month, and 95% of the requests should have a response time within 100 milliseconds.
  • SLO (Service Level Objective): An SLO is an internal target for how reliably a service should perform for its users. Typically, the SLO target is more aggressive than the SLA.
  • Example: If your SLA for availability is 99.9%, the ideal SLO should be in the range of 99.9–99.95%.
  • SLI (Service Level Indicator): A “Service Level Indicator” is a metric that tracks how your users perceive your service, based on their usage. These are the key indicators that drive your decisions about making the service reliable. In the example above, the key indicators are HTTP request success and response time, so we need to plot them to get a fair idea of system performance.
  • Error Budget: An error budget is the amount of acceptable unreliability a service can have before customer happiness is impacted. If a service is well within its budget, developers can take more risks in their releases. If not, developers need to make safer choices.
  • Example: With a 99.9% SLA, the error budget comes down to 100 - 99.9 = 0.1, which means the service can be down 0.1% of the time in a given period. For a 7-day period, the available limit is 10.08 minutes; for a 30-day window, the limit is 43.2 minutes, and so on (see the quick calculation below).
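As a quick sanity check, the 7-day figure works out as follows:

7 days × 24 hours × 60 minutes = 10,080 minutes
10,080 minutes × 0.1% = 10.08 minutes of allowed downtime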

While all of these principles sound great, one key challenge in implementing them is ensuring that the right metrics are readily available. Metrics such as requests, errors, and duration require a certain level of instrumentation of the service using OpenTelemetry (OTel) or a similar approach, and often require developer involvement. While many organisations have started adopting the practice of instrumenting services from day zero, it still remains a significant challenge at large. This is where technologies like eBPF can be leveraged.

eBPF, or Extended Berkeley Packet Filter, is a Linux kernel technology that enables developers to build programs that run safely in kernel space. Its ability to hook into the kernel lets eBPF trace traffic flowing through it and export that information as telemetry data, and it also enables auto-instrumentation of services with OTel. Today, all major observability vendors are harnessing the power of eBPF. Examples include Cilium Hubble from Isovalent (now part of Cisco), Pixie from New Relic, Beyla from Grafana, and Odigos, alongside proprietary solutions such as GroundCover and Senser.

Setup:

For our demo we will use Grafana Beyla, as it fits well with our existing ecosystem of Grafana products. Beyla can auto-instrument an application using eBPF and OpenTelemetry and emit telemetry data as both Prometheus metrics and traces. For the purpose of this blog we are only going to focus on the metrics part.

Prerequisites:

A Kubernetes cluster with Prometheus and Grafana running.

Beyla can be deployed as a standalone process, as a Docker container, or in Kubernetes as a sidecar or a DaemonSet (https://grafana.com/docs/beyla/latest/setup/). For our case we deployed it as a DaemonSet in Kubernetes. While the Beyla team is still working on an official Helm chart, I have created a sample Helm chart, which you can find in the repo referenced at the end of this post. The chart deploys Beyla as a DaemonSet and exposes its metrics on port 9092. The chart also contains a Prometheus ServiceMonitor so that Prometheus can scrape the metrics, as sketched below. The repo also includes a sample application which can be deployed for Beyla to instrument.
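For reference, the ServiceMonitor shipped with the chart has roughly this shape; the names, labels, and port here are illustrative, so check the chart for the exact values:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: beyla
  namespace: beyla
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: beyla   # label assumed from the sample chart
  endpoints:
    - port: metrics                   # service port pointing at Beyla's 9092 metrics endpoint
      interval: 30s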

Installation:

Sample App: Clone the sample repo and deploy the sample app using Helm. It's going to bring up the apps in the default namespace.


helm install sample-app sample-app
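A quick way to confirm the apps are up (the exact pod names will vary with the chart):

kubectl get pods -n default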

Beyla: In the Beyla chart, modify values.yaml, specifically the service discovery section of the Beyla config. The existing configuration is going to instrument services in the default namespace. Refer to https://grafana.com/docs/beyla/latest/configure/options/#service-discovery for more discovery options.

discovery:
  services:
    - k8s_namespace: default
    # uncomment the following line to also instrument the website server
    # - k8s_deployment_name: "^website$"
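For completeness, the exporter side of the Beyla config is what puts the metrics on port 9092; it looks roughly like this (see the Beyla configuration docs for the exact options):

prometheus_export:
  port: 9092   # matches the port exposed by the sample chart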

Create a namespace named beyla and run helm install; it should bring up Beyla as a DaemonSet, with metrics exposed on port 9092. You can access the metrics by port-forwarding one of the DaemonSet pods, as shown after the install commands.

kubectl create ns beyla
helm install beyla beyla -n beyla
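A port forward along these lines lets you inspect the metrics endpoint; the label selector is an assumption based on the sample chart, so adjust it to match your deployment:

kubectl -n beyla port-forward $(kubectl -n beyla get pod -l app.kubernetes.io/name=beyla -o name | head -n 1) 9092:9092
curl -s http://localhost:9092/metrics | head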

Beyla exposes metrics in both OpenTelemetry and Prometheus formats, tracking HTTP and gRPC calls for both ingress and egress scenarios.
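The queries below use the Prometheus flavour of the HTTP server metric, which follows the usual histogram convention of _bucket, _sum, and _count series (label sets shortened here for readability):

http_server_request_duration_seconds_bucket{service_name="...", http_response_status_code="...", le="..."}
http_server_request_duration_seconds_sum{service_name="...", http_response_status_code="..."}
http_server_request_duration_seconds_count{service_name="...", http_response_status_code="..."}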

By this time you should also see Prometheus discovering the Beyla targets through the ServiceMonitor.
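A quick sanity check in Grafana Explore (or the Prometheus UI) is to confirm the series are arriving, for example:

count(http_server_request_duration_seconds_count) by (job, service_name)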

Okay, now that we have the required metrics, let's work on building the dashboard queries. For this demo we will set the availability SLO to 99.9% and the latency SLO to 100ms (p95).

Availability SLI:

Availability of an app is the percentage of requests that the application successfully served out of the total number of valid requests over a time period. Here is the PromQL:

sum(rate(http_server_request_duration_seconds_count{service_name="${Service}",http_response_status_code=~"2.*",instance=~"$instance",job=~"$job"}[$__range]))
/
sum(rate(http_server_request_duration_seconds_count{service_name="${Service}",http_response_status_code=~"(2|5).*",instance=~"$instance",job=~"$job"}[$__range]))
* 100

Note that we have not included 4xx requests in the total request count: most of the time a 4xx status code is returned because of an issue with the user's request, so it is not considered a valid request for this calculation.

Latency SLI:

The latency SLI is calculated by measuring the 95th percentile (p95) of request duration, as our SLO states that 95% of requests should be served within 100ms.

histogram_quantile(0.95,
  sum(rate(http_server_request_duration_seconds_bucket{service_name="$Service",instance=~"$instance",job=~"$job"}[$__range])) by (service_name, le)
)

Available Error Budget:

As per our SLO, the error budget is 0.1% of requests, i.e. an error ratio of 0.001. The percentage of error budget still available is calculated as below (the 5xx-to-total ratio from rate() is a fraction, so we divide by 0.001 rather than 0.1):

100 - (
  (
    sum(rate(http_server_request_duration_seconds_count{service_name="${Service}",http_response_status_code=~"5.*",instance=~"$instance",job=~"$job"}[$__range]))
    /
    sum(rate(http_server_request_duration_seconds_count{service_name="${Service}",instance=~"$instance",job=~"$job"}[$__range]))
  )
  / 0.001 * 100
)
or vector(100)

The vector(100) at the end of the query is useful when you don't have any 5xx errors: when no 5xx series exist, the main expression returns no data, and the vector(100) fallback kicks in to report a full error budget.

Available Error Budget in time for a time period:

Here we calculate the error budget for a period of 7 days, but it can be calculated for any duration using the same expression. It returns the available error budget in minutes: the 7 × 24 × 60 minutes in the window, multiplied by the allowed error ratio (0.001) minus the observed 5xx ratio.

7 * 24 * 60 * (0.001 - sum by (type,service_name,teamId) (rate(http_server_request_duration_seconds_count{http_response_status_code=~"5.*"}[7d])) / sum by (type,service_name,teamId) (rate(http_server_request_duration_seconds_count[7d])))

Evaluating this query over long ranges can be slow, so it's better to create a recording rule from the above expression for faster execution.

- name: error_budget.rules
  rules:
    - record: error_budget:7d
      expr: |
        7 * 24 * 60 * (0.001 - sum by (type,service_name,teamId) (rate(http_server_request_duration_seconds_count{http_response_status_code=~"5.*"}[7d])) / sum by (type,service_name,teamId) (rate(http_server_request_duration_seconds_count[7d])))
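Once the rule is loaded, the dashboard panel can query the pre-computed series directly, for example:

error_budget:7d{service_name="$Service"}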

Demo:

Okay, now that we have the queries ready, let's plot them in Grafana. I have created a sample dashboard in the repo that you can use. Along with the SRE metrics, it also has some useful panels to get deeper insight into ingress/egress traffic.

Our sample app consists of two services: a Go-based API running behind an nginx reverse proxy. The API expects a name query string like “name=your input” and returns the response “Hello, your input”. If no query string is passed in the request, it returns “Hello Guest” by default.

First, let's generate some traffic by making API calls to the sample app. Do a port forward on the nginx service (see the sketch below) and run the command after it to generate 2xx responses.
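The port forward might look like this; the service name and port 80 are assumptions based on the sample chart, and 59038 is simply the local port used in the commands below. The second command shows the name query string in action:

kubectl port-forward svc/nginx 59038:80      # service name and port assumed from the sample chart
curl "http://localhost:59038/?name=Beyla"    # should return "Hello, Beyla" per the API described above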

for i in `seq 1 20`; do curl http://localhost:59038; done

Now let's generate some 4xx responses.

for i in `seq 1 20`; do curl http://localhost:59038/12345; done

Now let's scale the Go app replicas down to zero (a sketch of the command is below) and make the requests again. We should see 5xx responses from nginx.
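The scale-down might look like this; the deployment name here is a placeholder, so check kubectl get deployments for the actual name created by the sample chart:

kubectl scale deployment/go-api -n default --replicas=0   # "go-api" is a placeholder name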

for i in `seq 1 20`; do curl http://localhost:59038; done

Now that we have generated the traffic, let's analyse it in Grafana. We can see availability and the error budget being impacted by the downtime we induced earlier.

Ref:

Git repo: https://github.com/sibu105636/beyla_ebpf/tree/main

Beyla: https://grafana.com/oss/beyla-ebpf/

eBPF: https://ebpf.io/
