Incident Response

Severity Levels

| Level | Description | Example |
|-------|-------------|---------|
| P1: Critical | Total loss of core service or security breach | Host down, VPN broken |
| P2: High | Degraded service, data risk | Container crash loop, disk near full |
| P3: Medium | Non-critical service down | Non-essential container down |
| P4: Low | Minor issue, no user impact | Config drift, outdated image |

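Response Procedure

Step 1: Detect & Assess

docker ps -a                              # Container states, including stopped ones
htop                                      # CPU / RAM (interactive; press q to quit)
df -h                                     # Disk usage per filesystem
docker logs --tail 100 <container>        # Last 100 log lines of one container
journalctl -xe --since "30 minutes ago"   # Recent system journal entries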
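The checks above are run by hand. During an incident it can help to capture them in one file that can later be attached to the incident log. The following is a minimal sketch, not part of the original runbook: `triage_snapshot` and its default output directory are hypothetical, and `uptime` and `free -h` stand in for `htop` because `htop` is interactive. Tools that are not installed are skipped.

```shell
# Hypothetical helper: write a one-shot triage snapshot to a timestamped file
# and print the file's path.
# Usage: triage_snapshot [output-dir]   (default: /tmp, an assumption)
triage_snapshot() {
  local dir="${1:-/tmp}"
  local out="$dir/triage-$(date +%Y%m%d-%H%M%S).txt"

  # Run a command under a labelled header; skip it if the tool is missing.
  section() {
    local title="$1"; shift
    echo "== $title =="
    if command -v "$1" >/dev/null 2>&1; then
      "$@" 2>&1 || echo "(command failed: $*)"
    else
      echo "(skipped: $1 not installed)"
    fi
    echo
  }

  {
    section "containers" docker ps -a
    section "load"       uptime
    section "memory"     free -h
    section "disk"       df -h
    section "journal"    journalctl -xe --no-pager --since "30 minutes ago"
  } > "$out"

  echo "$out"
}
```

Calling `triage_snapshot /var/tmp` would write something like `/var/tmp/triage-<timestamp>.txt`. A failing command is recorded in the file instead of aborting the snapshot.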

Step 2: Isolate

ping <host-ip>                  # Is the host reachable?
sudo systemctl status docker    # Is the Docker daemon running?
sudo wg show                    # WireGuard tunnel status
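`sudo wg show` prints handshake times for a human reader. For a scripted check, `wg show <interface> latest-handshakes` prints one public key and one Unix timestamp per line. The sketch below filters that output for peers that look dead. The helper name `stale_peers`, the interface name `wg0`, and the 180-second threshold are all assumptions, not values taken from this runbook.

```shell
# Hypothetical helper: read `wg show <iface> latest-handshakes` output on stdin
# and print the public key of every peer whose last handshake is older than
# max-age seconds, or that has never completed one (timestamp 0).
# Usage: stale_peers <now-epoch> <max-age-seconds>
stale_peers() {
  awk -v now="$1" -v max="$2" '$2 == 0 || (now - $2) > max { print $1 }'
}

# Example (assumes the tunnel interface is named wg0):
#   sudo wg show wg0 latest-handshakes | stale_peers "$(date +%s)" 180
```

Empty output means every peer has completed a handshake within the window. This suggests the tunnel itself is up, so the fault is more likely in Docker or the service.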

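Step 3: Remediate (in order)

Move to the next step only if the previous one did not restore service.

1. Restart the container: `docker restart <container>`
2. Fix the config, then recreate: `docker compose up -d --force-recreate`
3. Pull a fresh image: `docker compose pull && docker compose up -d`
4. Restore from backup: see Backup & Restore
5. Rebuild from scratch using the runbooks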
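The first three rungs of this ladder can be scripted: run one fix, re-check the service, and escalate only if it is still down. The sketch below is one way to do that. The function name `remediate`, the `SETTLE_SECONDS` variable, and the Grafana health URL in the example are assumptions for illustration.

```shell
# Hypothetical helper: run remediation commands in order and stop at the first
# one after which the health check succeeds.
# Usage: remediate "<health-check command>" "<step 1>" "<step 2>" ...
# SETTLE_SECONDS (default 10, an assumption) is the wait before re-checking.
remediate() {
  local check="$1"; shift
  local step
  for step in "$@"; do
    echo "trying: $step"
    sh -c "$step" || echo "step failed: $step"
    sleep "${SETTLE_SECONDS:-10}"
    if sh -c "$check" >/dev/null 2>&1; then
      echo "recovered after: $step"
      return 0
    fi
  done
  echo "still failing: restore from backup or rebuild from the runbooks"
  return 1
}

# Example (service name, port, and health URL are assumptions):
#   remediate "curl -fsS http://localhost:3000/api/health" \
#     "docker restart grafana" \
#     "docker compose up -d --force-recreate grafana" \
#     "docker compose pull grafana && docker compose up -d grafana"
```

Restoring from backup and rebuilding are deliberately left out of the loop. Both are destructive enough that a person should decide to run them.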

Step 4: Verify & Log

- Service is accessible
- Monitoring shows green
- Logs are clean
- Incident is documented in the Incident Log below
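"Logs clean" is easier to confirm with a number than by eye. The sketch below counts recent log lines that look like errors. The helper name `count_errors` and the match pattern are assumptions, so tune the pattern to what your services actually log.

```shell
# Hypothetical helper: count log lines on stdin that look like errors.
# The pattern is an assumption; it matches case-insensitively.
count_errors() {
  grep -Eic 'error|fatal|panic' || true   # grep exits 1 on zero matches
}

# Example: errors logged by a container in the last 10 minutes.
#   docker logs --since 10m <container> 2>&1 | count_errors
```

A result of `0` right after remediation is a reasonable signal that the fix held. A non-zero count is a prompt to read the lines and does not prove the service has failed.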

Incident Log

| Date | Severity | Service | Root Cause | Resolution | Duration |
|------|----------|---------|------------|------------|----------|
| YYYY-MM-DD | P2 | Grafana | Disk 100% full | Cleared old Prometheus data | 45 min |
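Entries are more likely to be written if adding one takes a single command. The sketch below appends a row in the same column order as the log above. It assumes the log is kept as a Markdown pipe table in a file of your choosing, and `log_incident` is a hypothetical helper name.

```shell
# Hypothetical helper: append one incident as a Markdown table row,
# dated today (YYYY-MM-DD).
# Usage: log_incident <file> <severity> <service> <root-cause> <resolution> <duration>
log_incident() {
  local file="$1" severity="$2" service="$3" cause="$4" resolution="$5" duration="$6"
  printf '| %s | %s | %s | %s | %s | %s |\n' \
    "$(date +%F)" "$severity" "$service" "$cause" "$resolution" "$duration" >> "$file"
}

# Example (the file name is an assumption):
#   log_incident incidents.md P2 Grafana "Disk 100% full" "Cleared old Prometheus data" "45 min"
```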

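Common Issues & Fixes

| Symptom | Likely Cause | Fix |
|---------|--------------|-----|
| Container restart loop | Bad env var or config | Check `docker logs <container>`, then fix `.env` |
| Port already in use | Conflicting service | Find the owner with `sudo ss -tulnp \| grep <PORT>`, then stop it or remap the port |
| DNS not resolving | Pi-hole down | Restart the Pi-hole container |
| WireGuard not connecting | Key mismatch or firewall | Check peers with `sudo wg show`, then check NSG/UFW rules |
| Disk full | Log or volume growth | `docker system prune` (removes stopped containers, unused networks, dangling images, and build cache; volumes are kept unless `--volumes` is given) |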
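For the disk-full case, it helps to know which filesystem is affected before pruning anything. The sketch below filters `df -P` output for mount points at or above a usage threshold. The helper name `over_threshold` and the 90% figure in the example are assumptions, and mount points containing spaces are not handled.

```shell
# Hypothetical helper: read `df -P` output on stdin and print every mount point
# whose usage is at or above the given percentage, followed by that usage.
# Usage: df -P | over_threshold <percent>
over_threshold() {
  awk -v limit="$1" 'NR > 1 {
    use = $5; sub(/%/, "", use)          # strip the trailing % from Capacity
    if (use + 0 >= limit + 0) print $6, $5
  }'
}

# Example: list filesystems that are at least 90% full.
#   df -P | over_threshold 90
```

If the full filesystem is the one holding Docker's data directory, `docker system df` shows how much of it is images, containers, volumes, and build cache.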