---
name: alert-triage-reviewer
description: Review a noisy alert list, classify each alert by actionability and signal quality, propose suppression/tuning/deletion, and produce a 30-day alert-hygiene backlog that kills the noise without losing the signal.
version: 1.0.0
author: VantagePoint Networks
author_url: https://www.vpnetworks.co.uk
audience: SREs, On-call Engineers, IT Operations teams, Platform Leads, MSPs cleaning up inherited environments
output_format: Formatted Markdown audit with per-alert verdict, suppression/tuning/deletion recommendations, 30-day backlog, and alert-hygiene policy draft.
license: MIT
last-reviewed: 2026-04
---

# Alert Triage Reviewer

A Claude Code skill for reducing alert fatigue — taking a sprawling list of monitoring alerts and classifying each one so only actionable ones page a human.

## How to use this skill

1. Download this `SKILL.md` file and save it as `alert-triage-reviewer.md` (Claude Code names a slash command after its file name).
2. Place it in `~/.claude/commands/` (macOS/Linux) or `%USERPROFILE%\.claude\commands\` (Windows).
3. In Claude Code, run `/alert-triage-reviewer`. Paste the alert list (names plus a line of context each), answer the clarifying questions, and receive the audit.

## When to use this

- On-call engineers are burnt out from noise and threatening to quit the rota.
- You've inherited a monitoring stack with 100+ alert rules and no idea which matter.
- You want to move to SLO-based alerting and need to prune the non-SLO noise.
- An incident retrospective found that "we had an alert for that but ignored it", and the finding needs a systemic fix.
- You want to baseline your alert quality before adding more.

## What you'll get

A single Markdown document containing:

- **Per-alert verdict table** (keep / tune / suppress / delete)
- **Reasoning** for each verdict
- **Suppression patterns** (time-based, condition-based, correlation-based)
- **30-day backlog** (quick wins in week 1, deeper work later)
- **Alert-hygiene policy** (standards for new alerts going forward)
- **Before/after projection** (alert volume, page volume, MTTR expectation)

## Clarifying questions I will ask you

1. **How many alerts are currently configured across your monitoring platforms?**
2. **How many pages per week does on-call receive?**
3. **What % of those are actionable (the engineer did something specific to fix)?**
4. **Primary monitoring platforms?** (Datadog, Grafana, CloudWatch, Prometheus, Nagios, mixed)
5. **Platforms you can silence from?** (PagerDuty, Opsgenie, direct from monitoring)
6. **Do you have SLOs defined?** (yes — base alerts on those / no)
7. **Most-common "noise" types?** (disk-space warnings, transient network blips, cert warnings, scheduled-job alerts)
8. **Any incidents where an alert DID catch something real?** (don't delete those)
9. **Tolerance for false-negatives?** (safety-critical vs. internal)
10. **Paste alert list (or subset) for me to evaluate.**
11. **Who owns each alert?** (can we find out?)
12. **Last time anyone did alert hygiene?**

## Output template

```markdown
# Alert Triage Audit — <environment/team> — <date>

**Prepared by:** <role> · **Scope:** <N> alerts across <platforms>
**Baseline:** <current pages/week>, <actionable %>
**Target:** <target pages/week>, <target actionable %>

## 1. Summary
Of <N> alerts reviewed:
- **Keep as-is:** <N> (<%>)
- **Tune threshold or conditions:** <N>
- **Suppress (time / condition / correlation):** <N>
- **Delete:** <N>

If the proposed changes land, weekly pages drop from <N> to ~<N>, actionable percentage rises from <N>% to ~<N>%.

## 2. Per-Alert Verdict Table
| # | Alert | Current behaviour | Last fired | Verdict | Reasoning | Action |
|---|---|---|---|---|---|---|
| 1 | Disk > 80% on <host> | Pages on-call | Weekly | **Tune** | Predictable, not actionable at night | Page only if > 90% OR growing > 5%/day |
| 2 | CPU > 90% for 5 min | Pages on-call | 3×/week | **Tune** | Symptom not cause; often transient | Page only if sustained > 15 min AND requests failing |
| 3 | Certificate expiring in 30 days | Emails team | Monthly | **Keep** | Actionable, correctly non-page | — |
| 4 | Backup job log contains "WARNING" | Pages on-call | 2×/week | **Suppress** | Informational only; real failures fire a different alert | Route to ticket, not page |
| 5 | Database connections > 80% of pool | Pages on-call | Rarely | **Keep** | Early indicator, actionable | — |
| 6 | HTTP 500 rate | Ticket | — | **Tune** | Under-alerting: should page if the SLO is at risk | Promote to page on burn-rate |
| 7 | Legacy host x down | Pages on-call | Never | **Delete** | Host decommissioned 6 months ago | Remove |
| 8 | ... | | | | | |

## 3. Suppression Patterns (to apply)

### Time-based
- **Scheduled-maintenance window suppression:** all alerts from hosts in maintenance are auto-suppressed during declared windows.
- **Off-hours non-critical:** P2 alerts route to ticket between 18:00 and 08:00 and page only in business hours; P3 and below never page.

### Condition-based
- **"Already on bridge":** if incident is open, duplicate alerts suppressed.
- **Flapping:** alerts that fire-clear-fire-clear within 10 min treated as a single event.

### Correlation-based
- **Dependency-aware:** if upstream dependency is down, downstream alerts suppress to reduce noise.
- **Parent/child:** host-down suppresses all service alerts on that host.

## 4. 30-Day Backlog

### Week 1 — Quick wins (reduces noise today)
- [ ] Delete alerts for decommissioned hosts/services (list: <N>)
- [ ] Route "informational" alerts from page to ticket (list: <N>)
- [ ] Add maintenance-window suppression
- [ ] Fix obviously broken thresholds (CPU 50% → 90% etc.)
- [ ] Document alert owner for every remaining alert (anything without an owner → investigate)

### Week 2 — Tuning
- [ ] Move disk-space from static threshold to rate-based
- [ ] Move CPU alerts from symptom (high %) to indicator (request failures + high CPU)
- [ ] Add flapping suppression
- [ ] Add dependency-aware correlation for top 5 services

### Weeks 3-4 — Move to SLO-based alerting
- [ ] Identify top 3 services with SLOs defined
- [ ] Implement burn-rate alerts per service (replaces multiple symptom alerts)
- [ ] Retire symptom alerts replaced by SLO alerts
- [ ] Measure: pages per week, actionability %

### Ongoing — Weekly hygiene
- [ ] Quick review of alerts fired in past 7 days: any new noise patterns?
- [ ] Post-incident: if an alert should have existed but didn't, add it; if one fired noisily during incident, suppress it.

## 5. Alert-Hygiene Policy (for all new / modified alerts)
Every alert must have:

1. **Actionability statement:** "When this fires, an engineer should <specific action> within <time>."
2. **Severity rating:** P1 (page) / P2 (page in business hours) / P3 (ticket) / P4 (log only).
3. **Runbook link** (even if one paragraph).
4. **Named owner** (a person or a team, not "unassigned").
5. **Expected frequency** (how often should this fire in normal operations? If "never" and it fires, that's a real signal; if "daily," reconsider threshold).
6. **Auto-resolution criteria** (when has the condition cleared?).

Monthly review:
- Alerts firing > 2×/week without action: tune or delete.
- Alerts never firing over 6 months: validate the condition can still happen; if not, delete.

## 6. Before / After Projection
| Metric | Today | After week 1 | After week 4 |
|---|---|---|---|
| Total alerts | <N> | ~<N-?> | ~<N-??> |
| Pages per week | <N> | ~<N/2> | ~<N/4> |
| Actionable page % | <N>% | ~<N+15>% | ~<N+30>% |
| False positives % | <N>% | ~<N-20>% | ~<N-35>% |

## 7. Risks & Mitigations
| Risk | Mitigation |
|---|---|
| We suppress a real incident | All suppressed alerts still route to ticket queue reviewed daily; none go to /dev/null |
| New blind spots after pruning | Review every incident retrospective for 3 months after pruning; any "we should have had an alert" → add back |
| Team resistance to changes | Involve on-call in classification; explicit approvals per deletion |
| SLO-based alerts replace symptom alerts before SLOs are trusted | Run in parallel for 2 weeks before retiring symptom alerts |
```
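
The flapping and parent/child patterns in section 3 of the template can be sketched in a few lines of Python. This is a minimal illustration, not a feature of any particular monitoring platform: the `Event` record, its field names, the `host-down` alert name, and the sample hosts are all assumptions made for the example. Only the 10-minute window comes from the template. The parent/child step also ignores timing for brevity; a real rule would only suppress children while the parent alert is active.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


# Hypothetical event record; the field names are assumptions for this sketch.
@dataclass(frozen=True)
class Event:
    alert: str          # alert rule name, e.g. "disk-usage"
    host: str           # host the alert fired on
    fired_at: datetime  # when the alert fired


FLAP_WINDOW = timedelta(minutes=10)  # window taken from the template


def collapse_flapping(events: list[Event]) -> list[Event]:
    """Keep the first firing of each (alert, host) pair and drop re-fires
    that arrive within FLAP_WINDOW of the previous firing of that pair."""
    last_seen: dict[tuple[str, str], datetime] = {}
    kept: list[Event] = []
    for ev in sorted(events, key=lambda e: e.fired_at):
        key = (ev.alert, ev.host)
        prev = last_seen.get(key)
        if prev is None or ev.fired_at - prev > FLAP_WINDOW:
            kept.append(ev)
        # Sliding window: an alert that keeps flapping stays collapsed.
        last_seen[key] = ev.fired_at
    return kept


def suppress_children(events: list[Event], parent_alert: str = "host-down") -> list[Event]:
    """Drop service alerts on any host that also has a host-down event."""
    down_hosts = {ev.host for ev in events if ev.alert == parent_alert}
    return [ev for ev in events if ev.alert == parent_alert or ev.host not in down_hosts]


# Usage with made-up events: one flap on web-1, one child alert on db-1.
t0 = datetime(2026, 4, 1, 2, 0)
events = [
    Event("disk-usage", "web-1", t0),
    Event("disk-usage", "web-1", t0 + timedelta(minutes=4)),         # flap
    Event("host-down", "db-1", t0),
    Event("pg-replication-lag", "db-1", t0 + timedelta(minutes=1)),  # child of host-down
]
pages = suppress_children(collapse_flapping(events))
print([(e.alert, e.host) for e in pages])
# [('disk-usage', 'web-1'), ('host-down', 'db-1')]
```

Four raw events become two pages: the re-fire on `web-1` folds into the first firing, and the service alert on `db-1` is covered by the host-down page.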

## Example invocation

**User:** "Our SRE team inherited an environment with 340 alert rules across Datadog + Grafana + CloudWatch. On-call gets ~50 pages a week, maybe 8 are actually actionable. Morale is terrible. We want to clean this up but nobody's sure what's safe to delete."

**What the skill will do:**
1. Ask the 12 questions, drilling especially on the 42 non-actionable pages per week (patterns: what fires most?), whether any "informational" alerts could actually catch something real, and whether ownership can be established per alert.
2. Produce the triage audit with prioritised verdicts:
   - ~80 delete (decommissioned, orphaned, unnamed)
   - ~120 suppress/tune (noise, wrong threshold, informational masquerading as actionable)
   - ~140 keep (proven signal)
3. Suggest starting with the 80 deletions in week 1 (quickest morale win), parallel-running SLO-based alerts from week 3 before retiring symptom alerts.
4. Flag that projected end-state is ~150 alerts, ~12 pages/week, ~75% actionable — achievable in 30 days without new tooling.
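
The arithmetic behind this example can be checked with a short Python sketch. Every number below is taken from the example invocation above; the variable names are illustrative only.

```python
# Figures quoted in the example invocation.
pages_per_week = 50
actionable_pages = 8
total_alerts = 340
verdicts = {"delete": 80, "suppress_or_tune": 120, "keep": 140}

# Baseline: how much of the current paging load is noise?
non_actionable = pages_per_week - actionable_pages
baseline_pct = 100 * actionable_pages / pages_per_week

# Projected end state from step 4.
target_pages = 12
target_pct = 75
target_actionable = round(target_pages * target_pct / 100)

# The three verdict buckets should account for every alert rule.
assert sum(verdicts.values()) == total_alerts

print(f"baseline: {non_actionable} noisy pages/week, {baseline_pct:.0f}% actionable")
print(f"target: {target_actionable} of {target_pages} pages actionable")
# baseline: 42 noisy pages/week, 16% actionable
# target: 9 of 12 pages actionable
```

The number of actionable pages barely moves (8 to about 9). The improvement comes almost entirely from removing the 42 noisy pages, which is why the projection is plausible without new tooling.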

## Notes for the requester

- **Every alert you can't name an action for is a candidate for deletion.** "Interesting data" belongs in a dashboard, not a pager.
- **Route, don't delete, ambiguous alerts.** Send them to a ticket queue first; if nobody cares for 90 days, delete.
- **The on-call team must co-own the pruning.** They'll undo your cleanup if they don't trust it.
- **Add before you delete.** If you're replacing symptom alerts with SLO-based ones, run both for 2 weeks to validate.
- **Track the burn-down.** Weekly metric: pages per week, actionability %. Visible improvement sustains the programme.
- **Good looks like:** 30 days in, on-call engineers sleep; actionability % doubles; incident-review findings stop saying "we had alerts firing but ignored them."
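
The "route, don't delete" rule above can be sketched as a small Python check that runs against the ticket queue. This is a sketch under stated assumptions: the `routed` mapping, its shape, and the alert names are hypothetical, and "nobody cares" is interpreted here as "no recorded action since the alert was routed, or since the last action on it". Only the 90-day grace period comes from the note.

```python
from datetime import date, timedelta
from typing import Optional

GRACE = timedelta(days=90)  # "if nobody cares for 90 days, delete"


def deletion_candidates(
    routed: dict[str, tuple[date, Optional[date]]], today: date
) -> list[str]:
    """Return alerts that have sat in the ticket queue untouched for GRACE.

    `routed` maps alert name -> (date routed to tickets, date of last
    engineer action or None). This shape is an assumption for the sketch.
    """
    candidates = []
    for name, (routed_on, last_action) in routed.items():
        reference = last_action or routed_on
        if today - reference >= GRACE:
            candidates.append(name)
    return sorted(candidates)


# Usage with made-up data.
routed = {
    "backup-log-warning": (date(2026, 1, 5), None),            # never acted on
    "cert-expiry-30d": (date(2026, 1, 5), date(2026, 3, 20)),  # acted on recently
}
print(deletion_candidates(routed, today=date(2026, 4, 10)))
# ['backup-log-warning']
```

The function only proposes candidates; the deletion itself should still go through the on-call team, in line with the co-ownership note.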

---
*VantagePoint Networks · <https://www.vpnetworks.co.uk> · Authored by Hak · Free under the MIT licence*
