Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets. Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity. - Define service reliability targets
SLA (Service Level Agreement) ↓ Contract with customers SLO (Service Level Objective) ↓ Internal reliability target SLI (Service Level Indicator) ↓ Actual measurement
# Successful requests / Total requests sum(rate(http_requests_total{status!~"5.."}[28d])) / sum(rate(http_requests_total[28d])) `#### 2\. Latency SLI` # Requests below latency threshold / Total requests sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d])) / sum(rate(http_request_duration_seconds_count[28d])) `#### 3\. Durability SLI` # Successful writes / Total writes sum(storage_writes_successful_total) / sum(storage_writes_total)
references/slo-definitions.mdslos: - name: api_availability target: 99.9 window: 28d sli: | sum(rate(http_requests_total{status!~"5.."}[28d])) / sum(rate(http_requests_total[28d])) - name: api_latency_p95 target: 99 window: 28d sli: | sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d])) / sum(rate(http_request_duration_seconds_count[28d]))
Error Budget = 1 - SLO Targeterror_budget_policy: - remaining_budget: 100% action: Normal development velocity - remaining_budget: 50% action: Consider postponing risky changes - remaining_budget: 10% action: Freeze non-critical changes - remaining_budget: 0% action: Feature freeze, focus on reliability
references/error-budget.md# SLI Recording Rules groups: - name: sli_rules interval: 30s rules: # Availability SLI - record: sli:http_availability:ratio expr: | sum(rate(http_requests_total{status!~"5.."}[28d])) / sum(rate(http_requests_total[28d])) # Latency SLI (requests < 500ms) - record: sli:http_latency:ratio expr: | sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d])) / sum(rate(http_request_duration_seconds_count[28d])) - name: slo_rules interval: 5m rules: # SLO compliance (1 = meeting SLO, 0 = violating) - record: slo:http_availability:compliance expr: sli:http_availability:ratio >= bool 0.999 - record: slo:http_latency:compliance expr: sli:http_latency:ratio >= bool 0.99 # Error budget remaining (percentage) - record: slo:http_availability:error_budget_remaining expr: | (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100 # Error budget burn rate - record: slo:http_availability:burn_rate_5m expr: | (1 - ( sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m])) )) / (1 - 0.999) `### SLO Alerting Rules` groups: - name: slo_alerts interval: 1m rules: # Fast burn: 14.4x rate, 1 hour window # Consumes 2% error budget in 1 hour - alert: SLOErrorBudgetBurnFast expr: | slo:http_availability:burn_rate_1h > 14.4 and slo:http_availability:burn_rate_5m > 14.4 for: 2m labels: severity: critical annotations: summary: "Fast error budget burn detected" description: "Error budget burning at {{ $value }}x rate" # Slow burn: 6x rate, 6 hour window # Consumes 5% error budget in 6 hours - alert: SLOErrorBudgetBurnSlow expr: | slo:http_availability:burn_rate_6h > 6 and slo:http_availability:burn_rate_30m > 6 for: 15m labels: severity: warning annotations: summary: "Slow error budget burn detected" description: "Error budget burning at {{ $value }}x rate" # Error budget exhausted - alert: SLOErrorBudgetExhausted expr: slo:http_availability:error_budget_remaining < 0 for: 5m labels: severity: critical annotations: summary: "SLO error budget exhausted" description: "Error budget remaining: {{ $value }}%"
┌────────────────────────────────────┐ │ SLO Compliance (Current) │ │ ✓ 99.95% (Target: 99.9%) │ ├────────────────────────────────────┤ │ Error Budget Remaining: 65% │ │ ████████░░ 65% │ ├────────────────────────────────────┤ │ SLI Trend (28 days) │ │ [Time series graph] │ ├────────────────────────────────────┤ │ Burn Rate Analysis │ │ [Burn rate by time window] │ └────────────────────────────────────┘ `**Example Queries:**` # Current SLO compliance sli:http_availability:ratio * 100 # Error budget remaining slo:http_availability:error_budget_remaining # Days until error budget exhausted (at current burn rate) (slo:http_availability:error_budget_remaining / 100) * 28 / (1 - sli:http_availability:ratio) * (1 - 0.999) `## Multi-Window Burn Rate Alerts` # Combination of short and long windows reduces false positives rules: - alert: SLOBurnRateHigh expr: | ( slo:http_availability:burn_rate_1h > 14.4 and slo:http_availability:burn_rate_5m > 14.4 ) or ( slo:http_availability:burn_rate_6h > 6 and slo:http_availability:burn_rate_30m > 6 ) labels: severity: critical
assets/slo-template.md - SLO definition templatereferences/slo-definitions.md - SLO definition patternsreferences/error-budget.md - Error budget calculationsprometheus-configuration - For metric collectiongrafana-dashboards - For SLO visualization