---
name: slo-sli-designer
description: Design Service Level Indicators and Service Level Objectives for a business service — with meaningful metrics, achievable targets, error-budget policy, burn-rate alerts, and a communication plan for engineering and product teams.
version: 1.0.0
author: VantagePoint Networks
author_url: https://www.vpnetworks.co.uk
audience: SREs, Platform Engineers, IT Managers, Service Owners, CTOs, MSPs introducing SLO practice
output_format: Formatted Markdown pack with service description, SLI list, SLO targets with rationale, error-budget policy, burn-rate alert rules, and rollout plan.
license: MIT
last-reviewed: 2026-04
---

# SLO / SLI Designer

A Claude Code skill for defining SLIs (what we measure) and SLOs (what's good enough) for a service — moving past "uptime is 99%" to metrics that reflect what the user actually cares about.

## How to use this skill

1. Download this `SKILL.md` file.
2. Place it in `~/.claude/commands/` (macOS/Linux) or `%USERPROFILE%\.claude\commands\` (Windows), saved as `slo-sli-designer.md`, because Claude Code takes the command name from the file name.
3. In Claude Code, run `/slo-sli-designer`. Describe the service. Answer the clarifying questions. Receive the pack.

## When to use this

- You've been asked "what's our SLA?" and realised you don't have one you can defend.
- You're introducing SRE practice and need SLIs/SLOs per service.
- Your team argues about whether incidents are "serious" — SLOs quantify it.
- A contract or regulator needs measurable service commitments.
- You want to give product teams an error budget so reliability and feature velocity are both bounded.

## What you'll get

A single Markdown document containing:

- **Service description** (what it does, who relies on it)
- **SLI catalogue** (availability, latency, throughput, correctness, freshness — as applicable)
- **SLO targets** (per SLI) with rationale
- **Error-budget policy** (what happens when you breach)
- **Burn-rate alert rules** (multi-window)
- **User-journey SLOs** (end-to-end, not just component)
- **Rollout plan** (how to introduce without triggering alert fatigue)
- **Measurement prerequisites** (what instrumentation must exist)

## Clarifying questions I will ask you

1. **Service name and what it does in one sentence?**
2. **Who relies on it?** (internal team, paying customers, whole business)
3. **How does failure manifest to the user?** (error, slow, wrong answer, stale data)
4. **Current monitoring / metric platform?** (Prometheus, Datadog, New Relic, App Insights, other)
5. **Existing availability measurement?** (yes / informal / none)
6. **Reliability tier?** (high: SLA-backed to customers / medium: internal ops / low: experimental)
7. **Typical user journey?** (login → search → action → confirm)
8. **Busiest times / seasons?**
9. **Dependencies?** (downstream APIs, third-party services)
10. **SLO review cadence desired?** (quarterly typical)
11. **Team maturity with SRE practice?** (first time / some experience / advanced)
12. **Budget for instrumentation improvements?**

## Output template

```markdown
# SLIs & SLOs — <service name>

**Service owner:** <role> · **Effective:** <date> · **Review:** <date + quarter>

## 1. Service Description
- **What it does:** <one sentence>
- **Who relies on it:** <users>
- **What failure looks like:** <error / slowness / incorrect result / stale data>
- **Availability target class:** <mission-critical / business-important / internal / experimental>

## 2. SLI Catalogue
An SLI is a quantitative measure of how the service behaves for its users. Each SLI has: a **what**, a **where measured**, and a **formula**.

### SLI-1: Availability (request success rate)
- **What:** Proportion of HTTP requests that return a non-error response.
- **Where measured:** Load balancer / API gateway (client-side-observable layer).
- **Formula:** `successful_requests / total_requests` where success = HTTP 2xx/3xx (and 4xx where user-error).
- **Exclusions:** Planned maintenance windows (announced ≥ 24h ahead).

### SLI-2: Latency (95th percentile response time)
- **What:** 95th percentile response time for meaningful requests.
- **Where measured:** Load balancer, per endpoint class.
- **Formula:** `quantile(0.95, request_duration_ms)` over rolling 5-min windows.

### SLI-3: Correctness / Quality (where applicable)
- **What:** Proportion of responses that are semantically correct (e.g. search returns intended result; AI output not flagged as wrong).
- **Where measured:** Sample-based review or automated checks.
- **Formula:** `correct_responses / sampled_responses`.

### SLI-4: Freshness (for data services)
- **What:** Age of data when served.
- **Where measured:** Data-pipeline output, endpoint response header.
- **Formula:** `time_now - data_last_updated`.

(Add service-specific SLIs. 3-5 is usually enough.)

## 3. SLO Targets

### SLO-1: Availability
- **Target:** 99.9% over rolling 30 days.
- **Rationale:** Matches contractual SLA; allows ~43 min of full downtime per rolling 30 days.
- **Error budget:** 0.1% of 43,200 min (30 days) = 43.2 min per window.

### SLO-2: Latency (p95)
- **Target:** 95% of requests < 500 ms, measured over rolling 30 days.
- **Rationale:** User research shows 500 ms is noticeable threshold for this UX.

### SLO-3: Correctness
- **Target:** ≥ 99% of sampled responses correct.
- **Rationale:** Tied to business-acceptable error rate.

### SLO-4: Freshness
- **Target:** 95% of data served < 5 minutes old.
- **Rationale:** Downstream business process requires near-real-time.

## 4. Error-Budget Policy
An error budget is the allowed quantity of "bad" time/requests per measurement window.

| Budget state | Consumed | Policy |
|---|---|---|
| Healthy | 0-50% | Normal operations; ship features |
| Watch | 50-75% | Increase monitoring attention; prioritise reliability in next sprint |
| At risk | 75-100% | Feature freeze on risky changes; focus on reliability work |
| Breached | > 100% | Stop feature work; retrospective; reliability-focus sprint |

**Budget window:** rolling 30 days (no reset; budget recovers continuously as bad events age out of the window) or calendar month (budget resets on the 1st). Specify which.

## 5. Burn-Rate Alert Rules (multi-window)
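Monitor how fast the error budget is being consumed, not just the total consumed. Burn rate = observed error rate ÷ error rate the SLO allows; a burn rate of 1 spends exactly the whole budget over the 30-day window.

| Alert | Condition | Severity | Response |
|---|---|---|---|
| Fast burn | Burn rate ≥ 14.4 over both the last 1h and the last 5 min (2% of budget consumed per hour) | P1 | Page on-call |
| Medium burn | Burn rate ≥ 6 over both the last 6h and the last 30 min (5% of budget consumed per 6h) | P2 | Page on-call |
| Slow burn | Burn rate ≥ 1 over the last 3 days (10% of budget consumed per 3 days) | P3 | Ticket to service team |

Why multi-window: the long window shows the burn is significant (2% in 1 hour would exhaust the 30-day budget in about 50 hours, which is worth waking for). The short window confirms the burn is still happening, so the alert clears soon after recovery.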

## 6. User-Journey SLOs
Component SLOs are necessary but not sufficient. A user journey chains multiple components. Failure at any step breaks the journey.

### UJ-1: Login → dashboard load
- **Steps:** authenticate → fetch user profile → fetch dashboard data → render.
- **Composite SLI:** Successful end-to-end completion within 3 seconds.
- **Target:** 99.5% over 30 days.
- **Measurement:** Synthetic probe every 60s + real-user monitoring.

### UJ-2: <Critical business transaction>
- **Steps:** <list>.
- **Target:** <>.

## 7. Rollout Plan
Don't switch on alert rules for a new SLO on day one. Build in an observation period first.

| Week | Activity |
|---|---|
| 1 | Publish SLIs and SLOs as "shadow" — measure, don't alert |
| 2-4 | Tune alert thresholds based on observed baselines |
| 5 | Enable low-severity (ticket) alerts only |
| 6 | Enable full alerting (pages); communicate to on-call |
| 7+ | Quarterly review; adjust targets based on observed reality |

## 8. Measurement Prerequisites
Before SLOs become meaningful, you need:
- [ ] Request-level logging at the client-observable layer (load balancer / API gateway)
- [ ] Metric exporter with labels for endpoint, status, latency bucket
- [ ] Multi-window aggregation (1m, 5m, 1h, 6h, 3d, 30d)
- [ ] Error-budget calculation and dashboard
- [ ] On-call routing for new alerts
- [ ] Documented run-to-ground procedure per alert

## 9. Communication
### To product team
> "We've introduced SLOs for <service>. When we're within error budget, feature velocity continues as normal. When we're at risk (75% consumed), we'll slow risky changes. When we breach, we stop feature work until we're healthy again. This is a joint responsibility — reliability is a feature."

### To leadership
> "Our SLO for <service> is <target>. In the last quarter we met it <N>% of the time. The error budget equivalent is <minutes/month> of allowable degradation. We use this to balance reliability with feature delivery rather than treat every incident as a crisis."

### To on-call
> "New alerts land in <channel>. Runbook per alert in <location>. If an alert fires and the runbook is unclear, page the service owner."

## 10. Ongoing
- Quarterly SLO review: are targets still right? Too lax (always green — raise them) / too tight (always amber — relax or invest).
- Annual SLI reassessment: are we measuring the right things?
- Incidents that consumed disproportionate budget: document as "what would have prevented this?"
- Tag tickets / PRs with the SLO they affect — ties engineering effort to business outcomes.
```
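
The burn-rate rules in section 5 of the template reduce to a small amount of arithmetic. The sketch below is illustrative only: the function names are hypothetical, it assumes a rolling 30-day SLO window, and it follows the common convention that the short window is checked against the same burn-rate threshold as the long window. Error ratios are assumed to come from your metric platform.

```python
SLO_WINDOW_HOURS = 30 * 24  # assumption: rolling 30-day SLO window, as in the template


def burn_rate_threshold(budget_fraction: float, window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * SLO_WINDOW_HOURS / window_hours


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    return error_ratio / (1.0 - slo_target)


def should_alert(
    long_error_ratio: float,
    short_error_ratio: float,
    slo_target: float,
    budget_fraction: float,
    long_window_hours: float,
) -> bool:
    """Fire only when both the long and the short window exceed the threshold."""
    threshold = burn_rate_threshold(budget_fraction, long_window_hours)
    return (
        burn_rate(long_error_ratio, slo_target) >= threshold
        and burn_rate(short_error_ratio, slo_target) >= threshold
    )


if __name__ == "__main__":
    # Thresholds implied by "2% in 1h", "5% in 6h" and "10% in 3 days".
    for label, fraction, hours in [("fast", 0.02, 1), ("medium", 0.05, 6), ("slow", 0.10, 72)]:
        print(f"{label}: burn rate >= {burn_rate_threshold(fraction, hours):.1f}")
```

With a 99.9% target, a sustained 2% error ratio is a burn rate of about 20, which trips the fast-burn rule; if the last 5 minutes have already recovered, the short-window check keeps the page from firing.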

## Example invocation

**User:** "We run a SaaS CRM for SMBs, 1200 customers, £2M ARR. Hosted on AWS. Current monitoring is Datadog. 'Availability' target is 99.9% but we've been breaching without anyone noticing."

**What the skill will do:**
1. Ask the 12 questions, drilling on: what specifically users do (CRUD records, search, report generation — each has different latency expectations), whether the 99.9% is measured at ALB or at the browser (big difference), what's monitored now that's noise vs. signal.
2. Produce the pack with 4 SLIs (availability, latency p95, login-to-dashboard journey, report-generation journey), SLO targets at 99.9% / 500 ms / 99.5% / 95%, and a rollout plan that starts with shadow measurement before enabling pages.
3. Propose user-journey SLOs that reflect actual CRM workflows — not just component health.
4. Recommend a 30-day rolling window (not calendar month) so error budget doesn't reset during an incident.
5. Flag that if the current "99.9%" was measured at ALB only, browser-visible availability is probably several basis points (hundredths of a percent) lower, so expect initial reality-adjustment conversations.
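
To make point 4 concrete, here is a minimal sketch of a rolling-window error budget. It assumes you can export daily good/total request counts from your metric platform; the class and function names are hypothetical, and the state thresholds are the ones from the error-budget policy table in the template.

```python
from collections import deque


def budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed minutes of full downtime in the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


class RollingErrorBudget:
    """Track error-budget consumption over a rolling window of daily counts."""

    def __init__(self, slo_target: float, window_days: int = 30) -> None:
        self.slo_target = slo_target
        # deque(maxlen=...) drops the oldest day automatically, so old
        # incidents age out gradually instead of vanishing at month end.
        self.days: deque = deque(maxlen=window_days)

    def record_day(self, good: int, total: int) -> None:
        self.days.append((good, total))

    def consumed_fraction(self) -> float:
        """Share of the window's budget already spent (1.0 = fully spent)."""
        total = sum(t for _, t in self.days)
        if total == 0:
            return 0.0
        bad = total - sum(g for g, _ in self.days)
        allowed_bad = (1.0 - self.slo_target) * total
        return bad / allowed_bad

    def state(self) -> str:
        """Map consumption to the policy states in the template."""
        consumed = self.consumed_fraction()
        if consumed > 1.0:
            return "Breached"
        if consumed >= 0.75:
            return "At risk"
        if consumed >= 0.50:
            return "Watch"
        return "Healthy"
```

For a 99.9% target, `budget_minutes(0.999)` comes to roughly 43.2 minutes. Because the window slides one day at a time, a bad day keeps counting against the budget until it is 30 days old, which is the behaviour point 4 is after.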

## Notes for the requester

- **SLOs measured from the user's perspective beat SLOs measured from component perspective.** A healthy ALB and a healthy database don't mean a healthy user experience.
- **Start with fewer SLOs than you want.** Three well-monitored SLOs per service beat ten vanity metrics.
- **Error-budget policy needs leadership buy-in before you turn on alerts.** A policy nobody enforces is noise.
- **Review targets quarterly.** If you always hit 99.99% on a 99.9% target, raise the bar (or reduce cost). If you never hit it, invest or lower the bar.
- **User-journey SLOs can be an early warning of churn.** A sign-in-to-dashboard time creeping from 2s to 5s may show up in retention data two quarters later.
- **Good looks like:** on-call engineers stop getting woken for non-SLO-affecting alerts; feature and reliability work have balanced priority; leadership can articulate the reliability posture in one sentence.
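
The first note above can be illustrated with simple arithmetic. This sketch assumes every step in a journey is required and that steps fail independently; real failures are usually correlated, so treat the result as a rough estimate rather than a measurement. The function names are hypothetical.

```python
import math
from typing import Sequence


def journey_availability(step_availabilities: Sequence[float]) -> float:
    """Availability of a journey whose steps must all succeed.

    Assumption: step failures are independent, so availabilities multiply.
    """
    return math.prod(step_availabilities)


def required_step_availability(journey_target: float, steps: int) -> float:
    """Per-step availability needed for `steps` equal steps to meet the target."""
    return journey_target ** (1.0 / steps)


if __name__ == "__main__":
    # Four steps (authenticate, profile, dashboard data, render), each at 99.9%.
    print(f"journey: {journey_availability([0.999] * 4):.4%}")
    print(f"per-step need for 99.5%: {required_step_availability(0.995, 4):.4%}")
```

Four components that each meet 99.9% give a journey of about 99.6% under these assumptions, which is why a journey target such as 99.5% is usually set lower than the component targets it depends on.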

---
*VantagePoint Networks · <https://www.vpnetworks.co.uk> · Authored by Hak · Free under the MIT licence*
