---
name: runbook-qa-reviewer
description: Review an existing runbook by walking through it step-by-step — flagging unclear instructions, outdated commands, missing rollback, untested decision points, and producing a dated QA report with remediation priorities.
version: 1.0.0
author: VantagePoint Networks
author_url: https://www.vpnetworks.co.uk
audience: IT Managers, SREs, Platform Engineers, MSP Technical Consultants, Auditors verifying operational readiness
output_format: Formatted Markdown QA report with section-by-section findings, severity per issue, remediation checklist, and a sign-off-ready revised version outline.
license: MIT
last-reviewed: 2026-04
---

# Runbook QA Reviewer

A Claude Code skill for stress-testing a runbook before an on-call engineer has to follow it at 3am — catching the "wait, which server?" and "what's the password for that?" moments in daylight.

## How to use this skill

1. Download this `SKILL.md` file.
2. Place it in `~/.claude/commands/` (macOS/Linux) or `%USERPROFILE%\.claude\commands\` (Windows).
3. In Claude Code, run `/runbook-qa-reviewer`. Paste the runbook (or link to it + excerpts). Answer the clarifying questions. Receive the QA report.

## When to use this

- A runbook hasn't been reviewed in 12+ months.
- An incident retrospective flagged the runbook as unclear or out of date.
- You're auditing runbooks for ISO 27001, SOC 2, or Cyber Essentials Plus.
- A new team member followed the runbook and got stuck — you want to close the gap.
- You're rolling out on-call rotation and need each runbook pressure-tested.

## What you'll get

A single Markdown document containing:

- **Runbook metadata review** (owner, last-reviewed, version, scope clarity)
- **Section-by-section findings** with severity
- **Step-level gaps** (missing preconditions, unclear commands, missing rollback steps)
- **Decision-point evaluation** (are go/no-go gates clear and actionable?)
- **Test-your-understanding questions** (to validate the runbook would work under pressure)
- **Command / version / tooling currency check**
- **Remediation checklist** with severity
- **Revised-version outline** (structure to land the fixes)

## Clarifying questions I will ask you

1. **Runbook name and what it covers?**
2. **Last reviewed / updated?**
3. **When last used in anger?** (real incident, real change)
4. **Who's the stated owner?** (actual, not just named)
5. **Who would follow it at 3am?** (assumed reader level)
6. **Paste the runbook content or link + key excerpts.**
7. **Any known issues from last use?**
8. **Any environmental changes since last review?** (new tooling, upgraded OS, new vendor)
9. **Does it have a rollback section?** (yes / partial / no)
10. **Does it have a verification section?** (how do you know it worked)
11. **Tolerance for invasive edits?** (full rewrite OK / minor polish only)
12. **When would a revised version be tested?** (tabletop or real next time)

## Output template

```markdown
# Runbook QA Report — <runbook name>

**Reviewed by:** <role> · **Date:** <date> · **Status:** Draft / Signed-off

## 1. Metadata Review
| Element | Current | Assessment |
|---|---|---|
| Owner named | <yes / vague / no> | |
| Last reviewed | <date> | Stale? > 12 months = yes |
| Version number | <N / no> | |
| Scope statement | <yes / implicit / no> | |
| Intended audience | <explicit / implicit / unclear> | |
| Prerequisites section | <yes / no> | |
| Contact for questions | <yes / no> | |

## 2. Overall Assessment
**Verdict:** Fit for use / Fit-with-patches / Major rewrite needed / Retire.
**Top 3 risks if followed as-is:**
1. <risk>
2. <risk>
3. <risk>

## 3. Section-by-Section Findings

### <Section 1 title>
**Strengths:**
- <strength>

**Issues:**
| # | Severity | Issue | Recommendation |
|---|---|---|---|
| 1.1 | High | Step 2 says "run the backup script" — no path, no user context | Specify absolute path, user to run as, expected runtime |
| 1.2 | Medium | No expected output described | Add "expected: 'Backup complete' within 90 seconds" |
| 1.3 | Low | Command uses deprecated flag | Update to current syntax |

### <Section 2 title>
…

## 4. Step-Level Gaps (across the runbook)
Patterns:
- **Unspecified commands:** <N> — e.g. "reset the service" without which service / how.
- **Missing preconditions:** <N> — e.g. "make sure DB is in read-only mode" without how to verify.
- **No expected output:** <N> — steps that run commands but don't state what success looks like.
- **Undefined terms:** <N> — e.g. "primary" / "secondary" / "green" without definition.
- **Missing credentials references:** <N> — e.g. "log in to the portal" without which portal / where credentials live.

## 5. Decision-Point Evaluation
| Decision | Criteria stated? | Named authority? | Fallback documented? |
|---|---|---|---|
| Go / no-go at T-0 | yes / no | | |
| Rollback trigger | | | |
| Escalation to vendor | | | |
| Emergency override | | | |

## 6. Rollback Section
- [ ] Rollback is defined
- [ ] Rollback is step-by-step (not "revert the change")
- [ ] Point-of-no-return is explicit
- [ ] Rollback tested or documented as untested
- [ ] Verification after rollback is defined

Severity of gaps: <High / Medium / Low>

## 7. Verification Section
- [ ] Post-change verification is step-by-step
- [ ] Criteria for "done" are measurable
- [ ] Owner for monitoring next 24 hours is named
- [ ] Rollback is triggered by specific failing check

## 8. Command / Version / Tooling Currency
| Finding | Location | Severity |
|---|---|---|
| Uses deprecated Ansible 2.9 syntax; tested on 2.16 | Step 12 | Medium |
| Azure CLI command requires --resource-group flag now (breaking change 2024) | Step 7 | High |
| References "LegacyTool v3.x"; current is v5.2, different command set | Step 4 | High |
| SSH as root — newer builds block this by policy | Step 2 | Medium |

## 9. Test-Your-Understanding (to validate readability)
If a new on-call engineer followed this runbook with no prior exposure, would they be able to answer:
- [ ] What will this runbook do, in one sentence?
- [ ] Who runs each step?
- [ ] When is this runbook invoked vs. something else?
- [ ] What should they NOT do if unsure?
- [ ] Who do they call if the runbook doesn't work?

If any are "no" or "probably not," those answers need to be added to the runbook.

## 10. Remediation Checklist (prioritised)

### High (must fix before next use)
- [ ] <specific fix 1>
- [ ] <specific fix 2>
- [ ] <specific fix 3>

### Medium (fix in next review cycle)
- [ ] <fix>
- [ ] <fix>

### Low (nice to have)
- [ ] <fix>

## 11. Revised-Version Outline
Proposed structure for the next revision:

```
1. Metadata header (owner, last-reviewed, version, audience, scope)
2. Prerequisites + preconditions
3. Roles on the bridge
4. Decision gate (go/no-go)
5. Step-by-step execution
6. Verification
7. Rollback
8. Post-incident / post-change
9. Glossary
10. Contact + escalation
11. Change log
```

## 12. Sign-Off
**Current runbook:** Unfit / Fit-with-patches / Fit.
**Proposed owner action:** patch and retest / rewrite and retest / retire and replace.
**Next review due:** <date>.
```

## Example invocation

**User:** "Review our database failover runbook. It's 3 years old. Last used 14 months ago during a planned maintenance — took 2 hours longer than expected. [Pastes 2-page runbook]"

**What the skill will do:**
1. Ask the 12 questions, drilling on: what specifically added 2 hours (likely an outdated command or missing verification step), who the actual day-to-day owner is (not just the named one), whether the rollback has ever been executed.
2. Produce the QA report with:
   - Metadata flagging "last-reviewed 3 years ago" as high-severity (database tooling has changed meaningfully in that time)
   - Section findings flagging: ambiguous "failover" term (primary direction? both?), missing pre-check that the standby is synchronised, rollback described as "manual" rather than step-by-step
   - Verdict: Fit-with-patches if 6 specific fixes land before next use; Major rewrite if failover mechanism has changed
   - Test-your-understanding section identifying three "probably not" answers from a new engineer
3. Flag that the 2-hour delay correlates with likely-outdated commands in steps 7-9; these should be verified against current tooling before the next use.

## Notes for the requester

- **Runbooks decay fast.** Every change to the underlying system ages the runbook by more than the clock does.
- **"Revert the change" is not a rollback plan.** Step-by-step, with expected outputs, or it's not a rollback.
- **A runbook nobody has executed in 12 months is a hypothesis.** Schedule tabletop reviews quarterly.
- **The audience matters.** A runbook for the senior engineer who wrote it differs from a runbook for the on-call rotation. Write for the least-experienced intended reader.
- **Expected-output lines save lives.** "Run the command, expected output X within Y seconds" tells the reader immediately whether to continue or escalate.
- **Good looks like:** 95% of steps have expected outputs; the runbook has been walked through (tabletop) in the last 6 months; a new engineer can execute it with minimal supervision; the review log shows living maintenance.

---
*VantagePoint Networks · <https://www.vpnetworks.co.uk> · Authored by Hak · Free under the MIT licence*
