Incident Response


Severity Levels

| Level | Description | Example |
| --- | --- | --- |
| P1 (Critical) | Total loss of core service or security breach | Host down, VPN broken |
| P2 (High) | Degraded service, data risk | Container crash loop, disk near full |
| P3 (Medium) | Non-critical service down | Non-essential container down |
| P4 (Low) | Minor issue, no user impact | Config drift, outdated image |

Response Procedure

Step 1 — Detect & Assess

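docker ps -a                              # Container states, including exited ones
htop                                      # CPU / RAM (interactive; press q to quit)
df -h                                     # Disk usage per filesystem
docker logs --tail 100 <container>        # Last 100 log lines of one container
journalctl -xe --since "30 minutes ago"   # Recent system journal entries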
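
When many filesystems are mounted, the disk check above can be narrowed to the ones that are nearly full. The following is a minimal sketch, not part of the original runbook: the function name `disk_over_threshold` and the default 90% threshold are assumptions, and it expects POSIX-format `df -P` output with mount points that contain no spaces.

```bash
# Print "<mount point> <usage>" for every filesystem at or above a threshold.
# Reads `df -P` output on stdin, so it can be exercised without touching the host.
# Hypothetical helper; the default threshold of 90 is an assumption.
disk_over_threshold() {
  local threshold="${1:-90}"
  awk -v t="$threshold" '
    NR > 1 {                      # skip the header line
      use = $5                    # capacity column, e.g. "95%"
      sub(/%/, "", use)
      if (use + 0 >= t) print $6 " " $5
    }'
}

# Example: df -P | disk_over_threshold 85
```

Reading from stdin rather than calling `df` inside the function keeps the filter usable on saved output from a host that is no longer reachable.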

Step 2 — Isolate

ping -c 4 <host-ip>            # Is the host reachable? (-c 4 stops after four probes)
sudo systemctl status docker   # Is the Docker daemon running?
sudo wg show                   # WireGuard tunnel status
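
The isolation checks above can be run as one pass that reports each layer separately instead of stopping at the first failure. This is a sketch under stated assumptions: the function names `check` and `isolate` are hypothetical, the WireGuard interface is assumed to be `wg0`, and the `ping -W` timeout flag is the Linux (iputils) form.

```bash
# Run a labelled command quietly and report OK/FAIL without aborting the caller.
check() {
  local label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    printf 'OK    %s\n' "$label"
  else
    printf 'FAIL  %s\n' "$label"
    return 1
  fi
}

# Hypothetical wrapper: host reachability, Docker daemon, WireGuard interface.
# $1 = host IP, $2 = WireGuard interface name (assumed default: wg0).
isolate() {
  local host_ip="$1" iface="${2:-wg0}" failed=0
  check "host reachable"       ping -c 3 -W 2 "$host_ip"          || failed=1
  check "docker daemon active" systemctl is-active --quiet docker || failed=1
  check "wireguard interface"  sudo wg show "$iface"              || failed=1
  return "$failed"
}

# Example: isolate 10.0.0.5 wg0
```

The WireGuard check only confirms that the interface exists; a stale handshake still needs a manual look at the `sudo wg show` output.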

Step 3 — Remediate (in order)

  1. Restart the container → docker restart <container>
  2. Fix config → docker compose up -d --force-recreate
  3. Pull fresh image → docker compose pull && docker compose up -d
  4. Restore from backup → see Backup & Restore
  5. Rebuild from scratch using runbooks
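
The "in order" escalation above can be sketched as a loop that runs a health check after each step and stops at the first step that restores the service. This is an illustration only: the function name `remediate` is hypothetical, and the health-check command, URL, and container name in the example are assumptions, not values from this runbook.

```bash
# Try remediation steps in order; stop after the first one that makes the
# health check pass. $1 = health-check command, remaining args = step commands.
# Each command is a string executed with `bash -c`.
remediate() {
  local health="$1"; shift
  local step
  for step in "$@"; do
    printf 'Trying: %s\n' "$step"
    bash -c "$step" || printf 'Step failed: %s\n' "$step"
    if bash -c "$health"; then
      printf 'Recovered after: %s\n' "$step"
      return 0
    fi
  done
  printf 'Not recovered; escalate to restore or rebuild\n'
  return 1
}

# Example (URL and container name are placeholders):
# remediate 'curl -fsS http://localhost:3000/api/health >/dev/null' \
#   'docker restart grafana' \
#   'docker compose up -d --force-recreate' \
#   'docker compose pull && docker compose up -d'
```

Restoring from backup and rebuilding are left out of the loop on purpose, since both are destructive enough to warrant a human decision.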

Step 4 — Verify & Log

  • Service accessible
  • Monitoring shows green
  • Logs clean
  • Document in Incident Log below
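
One rough way to check "logs clean" is to count recent log lines that look like errors. The sketch below assumes that the words error, fatal, or panic mark a problem, which will not hold for every service. The function name `count_error_lines` is hypothetical.

```bash
# Count lines on stdin that contain "error", "fatal", or "panic" (case-insensitive).
# grep -c exits non-zero when the count is 0, so `|| true` keeps the function's
# exit status clean while still printing the count.
count_error_lines() {
  grep -ciE 'error|fatal|panic' || true
}

# Example: docker logs --since 10m <container> 2>&1 | count_error_lines
```

A count of 0 is a hint, not proof; a service can fail quietly, so the monitoring check still applies.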

Incident Log

| Date | Severity | Service | Root Cause | Resolution | Duration |
| --- | --- | --- | --- | --- | --- |
| YYYY-MM-DD | P2 | Grafana | Disk 100% full | Cleared old Prometheus data | 45 min |
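
A small helper keeps log rows consistent. This sketch assumes the log is a pipe-delimited table in a file; the function name `log_row` and the file name `incident-log.md` are hypothetical.

```bash
# Format one incident-log row.
# Args: date, severity, service, root cause, resolution, duration in minutes.
log_row() {
  local date_str="$1" severity="$2" service="$3" cause="$4" resolution="$5" minutes="$6"
  printf '| %s | %s | %s | %s | %s | %s min |\n' \
    "$date_str" "$severity" "$service" "$cause" "$resolution" "$minutes"
}

# Example (the file name is a placeholder):
# log_row "$(date +%F)" P2 Grafana "Disk 100% full" "Cleared old Prometheus data" 45 >> incident-log.md
```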

Common Issues & Fixes

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| Container restart loop | Bad env var or config | Check logs, fix .env |
| Port already in use | Conflicting service | ss -tulnp \| grep <PORT> |
| DNS not resolving | Pi-hole down | Restart pihole container |
| WireGuard not connecting | Key mismatch or firewall | sudo wg show, check NSG/UFW |
| Disk full | Log or volume growth | docker system prune |
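
For the "port already in use" case, a plain grep on the port number also matches longer ports (80 matches 8080). The sketch below filters on the exact local port instead. The function name `port_listeners` is hypothetical, and it assumes the column layout of `ss -tulnp`, where the local address is the fifth field.

```bash
# Print the ss rows whose local address ends in exactly ":<port>".
# Reads `ss -tulnp` output on stdin; assumes the Netid column is present,
# which makes "Local Address:Port" the fifth whitespace-separated field.
port_listeners() {
  local port="$1"
  awk -v p=":$port" '
    NR > 1 {                                  # skip the header line
      n = length($5) - length(p)
      if (n >= 0 && substr($5, n + 1) == p) print
    }'
}

# Example: sudo ss -tulnp | port_listeners 8080
```

Running `ss` with `sudo` matters here: without it, the process column is empty for sockets owned by other users.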