Workflow Incident Replays (Monthly Fire Drill)
Scenarios
- Queue congestion (p95 wait >= 120s)
- Provider outage (local queue unavailable)
- Auth failure (cloud unauthorized)
- Cron drift/missed runs
Drill format
- Trigger condition
- Detection source
- Immediate response (first 10 min)
- Fallback path
- Recovery verification
- Postmortem note
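One way to keep drill write-ups uniform is to capture the six items above as a record and flag any section left blank. This is a minimal sketch, not an established tool: the `DrillRecord` class, its field names, and `missing_sections` are hypothetical names chosen to mirror the list.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass
class DrillRecord:
    """One drill write-up; fields mirror the six drill-format items (hypothetical names)."""
    trigger_condition: str = ""
    detection_source: str = ""
    immediate_response: str = ""      # actions taken in the first 10 minutes
    fallback_path: str = ""
    recovery_verification: str = ""
    postmortem_note: str = ""


def missing_sections(record: DrillRecord) -> list[str]:
    """Return the names of sections that are empty or whitespace-only."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]
```

Calling `missing_sections` before closing a drill gives a quick completeness check without prescribing what each section must say.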
Scenario 1: Queue congestion
- Trigger: SLO alert shows p95 wait >= 120s
- Immediate: pause experimental workloads; keep only prod-critical high/urgent work running
- Verify: p95 < 30s for 3 consecutive checks
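The trigger and verify conditions for this scenario could be checked along these lines. This is a sketch under assumptions: wait times are taken as a plain list of seconds, p95 uses the nearest-rank method (the source does not say how p95 is computed), and the function names are hypothetical.

```python
def p95(wait_seconds):
    """Nearest-rank 95th percentile of a non-empty list of wait times."""
    ordered = sorted(wait_seconds)
    # 1-based nearest rank = ceil(0.95 * n), done in integers to avoid float rounding
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def congestion_triggered(p95_value, threshold=120.0):
    """Trigger condition: p95 wait at or above the threshold (seconds)."""
    return p95_value >= threshold


def recovered(p95_history, threshold=30.0, required=3):
    """Verify condition: the latest `required` checks are all below the threshold."""
    if len(p95_history) < required:
        return False
    return all(value < threshold for value in p95_history[-required:])
```

Requiring consecutive checks means a single good reading after a bad one does not count as recovery; the history must end with three readings under 30s.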
Scenario 2: Local provider outage
- Trigger: queue status errors for local model route
- Immediate: activate approved API fallback for critical classes
- Verify: success rate restored to >= 98%
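The recovery check for this scenario reduces to a ratio test. The sketch below assumes success is counted per request over some sampling window; the source specifies neither the window nor what counts as a request, so both are left to the caller, and `success_restored` is a hypothetical name.

```python
def success_restored(successes, total, target_percent=98):
    """True when successes / total meets the target, compared in integers."""
    if total <= 0:
        raise ValueError("total must be positive")
    # successes / total >= target_percent / 100, rearranged to avoid float division
    return successes * 100 >= target_percent * total
```

For example, 98 successes out of 100 requests passes, while 97 out of 100 does not.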
Scenario 3: Cloud auth failure
- Trigger: unauthorized errors on cloud lane
- Immediate: route premium tasks to local fallback and log impact
- Verify: cloud lane recovery test passes
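The immediate response for this scenario (reroute and log impact) might look like the following. This is an illustrative sketch only: the lane names `"cloud"` and `"local"`, the `route_task` function, and the shape of the impact log entries are assumptions, not the actual router.

```python
def route_task(task_id, preferred_lane, cloud_authorized, impact_log):
    """Return the lane to use; record an impact entry whenever a task is rerouted."""
    if preferred_lane == "cloud" and not cloud_authorized:
        # Cloud lane is rejecting auth: fall back to local and note what was affected
        impact_log.append({"task_id": task_id, "from": "cloud", "to": "local"})
        return "local"
    return preferred_lane
```

Keeping the impact log as plain entries makes it easy to attach to the postmortem note and to count how many tasks ran on the fallback lane.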
Scenario 4: Cron drift
- Trigger: daily outputs missing after their expected windows
- Immediate: run critical scripts manually; inspect the crontab and job logs
- Verify: next scheduled cycle passes
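Detecting the trigger for this scenario amounts to comparing what has been produced against per-job deadlines. The sketch below assumes each daily output has a single time-of-day deadline and that the set of outputs produced today is already known; the job names and the `missing_outputs` function are hypothetical.

```python
from datetime import datetime, time


def missing_outputs(expected, produced, now):
    """Return sorted names of outputs whose deadline has passed without being produced.

    expected: dict mapping output name -> deadline as datetime.time
    produced: set of output names already produced today
    now:      current datetime
    """
    return sorted(
        name
        for name, deadline in expected.items()
        if now.time() >= deadline and name not in produced
    )
```

Usage, with made-up job names:

```python
expected = {"daily_report": time(6, 0), "nightly_backup": time(23, 0)}
overdue = missing_outputs(expected, set(), datetime(2024, 1, 1, 7, 0))
# overdue contains only "daily_report"; the backup window has not closed yet
```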