Runbook Template
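Copy this template into `docs/architecture/runbook-SERVICE-SCENARIO.md`, replacing `SERVICE` and `SCENARIO` with the real names, then fill in every placeholder below.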
---
id: runbook-service-scenario
title: "Runbook: Service — Scenario"
description: "How to handle [Scenario] for [Service]"
status: approved
owner: "@devops-handle"
last_tested: YYYY-MM-DD
severity: P1 # P1 | P2 | P3
tags: [runbook, incident]
---
# Runbook: Service — Scenario
:::danger[Incident?]
**PagerDuty:** [Link] | **Slack:** #incidents | **Escalation:** @oncall-lead
:::
## Symptoms
What does this look like when it's happening?
- CloudWatch alarm: `AlarmName`
- Error in logs: `ERROR message pattern`
- User report: "Users seeing X error"
## Impact
- **Affected:** Which users / which features
- **Severity:** P1/P2/P3
- **SLA:** X minutes to resolve
## Diagnosis
### Step 1 — Confirm the issue
```bash
# Check service health
aws ecs describe-services --cluster prod --services service-name
# Check recent logs
aws logs tail /ecs/service-name --follow --since 5m
```
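The JSON from `describe-services` is long, so it can help to compare the desired and running task counts directly. The following is a minimal sketch, assuming the same `prod` cluster and `service-name` placeholders used above. The helper name `ecs_counts_match` is hypothetical and not part of the AWS CLI.

```bash
# Hypothetical helper: succeeds when the running task count equals the desired count.
# Usage: ecs_counts_match <cluster> <service>
ecs_counts_match() {
  local cluster="$1" service="$2" counts desired running
  # With --output text, the two counts are printed on one line, tab-separated.
  counts="$(aws ecs describe-services \
    --cluster "$cluster" \
    --services "$service" \
    --query 'services[0].[desiredCount,runningCount]' \
    --output text)" || return 2
  desired="$(printf '%s\n' "$counts" | awk '{print $1}')"
  running="$(printf '%s\n' "$counts" | awk '{print $2}')"
  echo "desired=${desired} running=${running}"
  [ "$desired" = "$running" ]
}
```

Example: `ecs_counts_match prod service-name || echo "tasks are missing"`. A mismatch only confirms that tasks are down. It does not say why, so continue with Step 2.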
### Step 2 — Identify root cause
```bash
# Check DB connections (quote the URL so special characters are not split or globbed)
psql "$DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity;"
# Check Redis (a healthy server replies PONG)
redis-cli -h "$REDIS_HOST" ping
```
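A raw connection count is easier to interpret next to the server's limit. The sketch below assumes `DATABASE_URL` points at the affected PostgreSQL instance. The helper name `pg_connections_near_limit` and the 80% default threshold are illustrative assumptions, not values taken from this runbook.

```bash
# Hypothetical helper: succeeds when connection usage is at or above a threshold (percent).
# Usage: pg_connections_near_limit [threshold_percent]
pg_connections_near_limit() {
  local threshold="${1:-80}" usage used max pct
  # -A (unaligned) and -t (tuples only) make psql print a bare "used|max" line.
  usage="$(psql "$DATABASE_URL" -At -c \
    "SELECT count(*), current_setting('max_connections') FROM pg_stat_activity;")" || return 2
  used="${usage%%|*}"
  max="${usage##*|}"
  pct=$(( used * 100 / max ))
  echo "connections: ${used}/${max} (${pct}%)"
  [ "$pct" -ge "$threshold" ]
}
```

Example: `pg_connections_near_limit 90 && echo "connection pool is close to exhaustion"`. Note that `pg_stat_activity` also lists background processes, so the count slightly overstates client connections.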
## Remediation
### Option A — [Most common fix]
```bash
# Restart the service
aws ecs update-service \
  --cluster prod \
  --service service-name \
  --force-new-deployment
```
Expected result: the service returns to a healthy state within ~3 minutes.
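Rather than re-running the health check by hand, the restart can be followed by the AWS CLI waiter `aws ecs wait services-stable`. The CLI documentation describes this waiter as polling every 15 seconds and giving up with exit code 255 after 40 failed checks. The sketch below assumes the same placeholders as Option A; the wrapper name `restart_and_wait` is hypothetical.

```bash
# Hypothetical wrapper: force a new deployment, then block until ECS reports the service stable.
# Usage: restart_and_wait <cluster> <service>
restart_and_wait() {
  local cluster="$1" service="$2"
  aws ecs update-service \
    --cluster "$cluster" \
    --service "$service" \
    --force-new-deployment >/dev/null || return 1
  # The waiter exits non-zero if the service does not stabilise in time.
  if aws ecs wait services-stable --cluster "$cluster" --services "$service"; then
    echo "stable"
  else
    echo "not stable: try Option B or escalate"
    return 1
  fi
}
```

Example: `restart_and_wait prod service-name`. The waiter's timeout is longer than the 15-minute escalation window in this template only if combined with other delays, so keep an eye on the clock while it runs.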
### Option B — [If Option A fails]
Steps...
## Escalation
| Condition | Escalate to | Contact |
|---|---|---|
| Not resolved in 15 min | @eng-lead | PagerDuty |
| Data loss suspected | @cto | Phone |
## Post-incident
- [ ] Open post-mortem doc (use [post-mortem template])
- [ ] Update this runbook if steps were incorrect
- [ ] File Jira ticket for permanent fix
---
*Last tested by @handle on YYYY-MM-DD.*