Runbook Template

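Copy this template into `docs/architecture/runbook-SERVICE-SCENARIO.md`, replacing `SERVICE` and `SCENARIO` with the service name and the failure scenario the runbook covers.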


---
id: runbook-service-scenario
title: "Runbook: Service — Scenario"
description: "How to handle [Scenario] for [Service]"

status: approved
owner: "@devops-handle"
last_tested: YYYY-MM-DD
severity: P1 # P1 | P2 | P3
tags: [runbook, incident]
---

# Runbook: Service — Scenario

:::danger[Incident?]
**PagerDuty:** [Link] | **Slack:** #incidents | **Escalation:** @oncall-lead
:::

## Symptoms

What does this look like when it's happening?

- CloudWatch alarm: `AlarmName`
- Error in logs: `ERROR message pattern`
- User report: "Users seeing X error"

## Impact

- **Affected:** Which users / which features
- **Severity:** P1/P2/P3
- **SLA:** X minutes to resolve

## Diagnosis

### Step 1 — Confirm the issue

```bash
# Check service health
aws ecs describe-services --cluster prod --services service-name

# Check recent logs
aws logs tail /ecs/service-name --follow --since 5m
```
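
The raw `describe-services` output is long, so during an incident it can help to reduce it to the desired and running task counts. The following is a minimal sketch, assuming the `prod` cluster and `service-name` placeholders used above; `service_state` is a hypothetical helper written for this template, not part of the AWS CLI.

```bash
# Hypothetical helper: classify a service from its desired and running task counts.
service_state() {
  local desired="$1" running="$2"
  if [ "$desired" -eq 0 ]; then
    echo "scaled-to-zero"
  elif [ "$running" -ge "$desired" ]; then
    echo "healthy"
  elif [ "$running" -gt 0 ]; then
    echo "degraded"
  else
    echo "down"
  fi
}

# Usage (requires AWS credentials; cluster and service names are placeholders):
# read -r desired running < <(aws ecs describe-services \
#   --cluster prod --services service-name \
#   --query 'services[0].[desiredCount,runningCount]' --output text)
# service_state "$desired" "$running"
```

A `degraded` or `down` result confirms the symptom; `healthy` suggests the problem lies downstream of the ECS service itself.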

### Step 2 — Identify root cause

```bash
# Check DB connections
psql "$DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity;"

# Check Redis
redis-cli -h "$REDIS_HOST" ping
```
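
A bare connection count is hard to interpret without the configured limit. One way to put it in context, sketched here under the assumption that `DATABASE_URL` is set and the role may read `max_connections`, is to express it as a percentage; `conn_usage_pct` is a hypothetical helper.

```bash
# Hypothetical helper: integer percentage of max_connections currently in use.
conn_usage_pct() {
  local active="$1" max="$2"
  echo $(( active * 100 / max ))
}

# Usage (assumes DATABASE_URL is set and psql can reach the database):
# active=$(psql "$DATABASE_URL" -Atc "SELECT count(*) FROM pg_stat_activity;")
# max=$(psql "$DATABASE_URL" -Atc "SHOW max_connections;")
# echo "connections in use: $(conn_usage_pct "$active" "$max")%"
```

A value close to 100 points toward connection exhaustion as the likely root cause; what counts as "close" is a judgment call for each service.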

## Remediation

### Option A — [Most common fix]

```bash
# Restart the service
aws ecs update-service \
  --cluster prod \
  --service service-name \
  --force-new-deployment
```

Expected result: Service returns healthy within ~3 minutes.
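
Rather than re-running the health check by hand, the expected result can be verified by polling. This is a minimal sketch: `retry_until` is a hypothetical helper, and the health URL in the usage comment is a placeholder, not a real endpoint.

```bash
# Hypothetical helper: run a check command until it succeeds or attempts run out.
# Arguments: <attempts> <delay-seconds> <command...>
retry_until() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$(( i + 1 ))
    # Sleep only if another attempt remains.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Usage: poll every 10 s for up to 18 attempts (about 3 minutes).
# The URL is a placeholder; substitute the service's own health check.
# retry_until 18 10 curl -fsS https://service.example.com/health
#
# Alternative built into the AWS CLI (blocks until the service is stable):
# aws ecs wait services-stable --cluster prod --services service-name
```

If the helper returns non-zero, treat Option A as failed and move on to Option B.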

### Option B — [If Option A fails]

Steps...

## Escalation

| Condition | Escalate to | Contact |
|---|---|---|
| Not resolved in 15 min | @eng-lead | PagerDuty |
| Data loss suspected | @cto | Phone |

## Post-incident

- [ ] Open post-mortem doc (use [post-mortem template])
- [ ] Update this runbook if steps were incorrect
- [ ] File Jira ticket for permanent fix

---

*Last tested by @handle on YYYY-MM-DD.*