On-call runbook
Quick-reference playbook for the four most common PagerDuty alerts against a production TRUSCA stack. Each scenario lists:
- Symptom — what triggered the page
- Customer impact — what users can/cannot do right now
- Diagnose — exact commands to run (host + container)
- Recover — ordered remediation steps
- Escalate — when to wake the portal dev team
All commands assume docker-compose V1 (hyphen) and a bash host shell.
# Replace EMAIL/PASSWORD with the super-admin you created at install.
EMAIL=admin@example.com
PASSWORD=...
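# Build the JSON body with jq so quotes or backslashes in the password are escaped.
ACCESS_TOKEN=$(curl -fsS -X POST "https://<your-host>/api/auth/login" \
-H "Content-Type: application/json" \
-d "$(jq -n --arg email "$EMAIL" --arg password "$PASSWORD" '{email: $email, password: $password}')" | jq -r '.access_token')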
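Because jq -r prints the literal string null when the response has no access_token field, a failed login can leave a useless token in the variable. A minimal sketch of a guard follows; token_ok is a hypothetical helper name, not something shipped with TRUSCA.

```shell
# token_ok is a hypothetical helper, not shipped with TRUSCA.
# Succeeds only if the argument is a non-empty string other than "null"
# (jq -r prints "null" when the .access_token key is absent).
token_ok() {
  [ -n "${1:-}" ] && [ "$1" != "null" ]
}

# Usage after the login call above:
#   token_ok "$ACCESS_TOKEN" || echo "login failed: no access token" >&2
```

Running the guard once before the first API call avoids chasing 401 responses in the middle of an incident.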
Scenario 1 — Trivy DB stale or missing
Symptom
PagerDuty: TRUSCA Trivy DB last refresh > 14 days or TRUSCA Trivy DB missing on worker. The upcoming /admin/health → Vulnerability data card drives this.
Customer impact
- New scans CAN still be queued: cdxgen + scancode still produce SBOMs and licence findings.
- New CVE detections stop landing until the DB refresh succeeds.
- Existing vulnerability_findings rows are unchanged; the gap is forward-only.
Diagnose
# 1. Is the DB on disk?
docker-compose -f docker-compose.yml exec worker \
ls -lh /var/lib/trivy/db/
# 2. DB metadata (Created timestamp)
docker-compose -f docker-compose.yml exec worker \
cat /var/lib/trivy/db/metadata.json
# 3. Recent download / refresh logs
docker-compose -f docker-compose.yml logs --tail=500 worker | grep trivy_db
docker-compose -f docker-compose.yml logs --tail=500 beat | grep trivy_db_refresh
# 4. Outbound HTTPS to ghcr.io reachable?
docker-compose -f docker-compose.yml exec worker \
curl -fsS https://ghcr.io/v2/ -o /dev/null -w "%{http_code}\n"
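To compare the metadata timestamp against the 14-day alert threshold without doing date arithmetic by hand, a small helper along these lines may help. It is a sketch only: db_age_days is a hypothetical name, it assumes GNU date on the host, and it expects the timestamp you read from metadata.json in step 2 (this runbook refers to that field as Created).

```shell
# db_age_days is a hypothetical helper, not part of TRUSCA or Trivy.
# Prints the whole number of days between an ISO 8601 timestamp ($1)
# and "now" ($2, optional; defaults to the current time).
# Requires GNU date (date -d), which is an assumption about the host.
db_age_days() {
  local ts now_ts
  # Drop fractional seconds so older GNU date versions can parse the value.
  ts=$(printf '%s' "$1" | sed -E 's/\.[0-9]+//')
  now_ts=$(date -u -d "${2:-now}" +%s)
  echo $(( (now_ts - $(date -u -d "$ts" +%s)) / 86400 ))
}

# Usage, feeding in the timestamp read in step 2 (the 14-day limit
# mirrors the alert threshold):
#   [ "$(db_age_days "$CREATED")" -gt 14 ] && echo "Trivy DB is stale"
```

The optional second argument exists only to make the calculation reproducible, for example when replaying an old incident.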
Recover (in order)
- Force a one-shot refresh (preferred — single command, no restart):
docker-compose -f docker-compose.yml exec worker \
celery -A apps.backend.tasks.celery_app call tasks.trivy_db.refresh
sleep 30
docker-compose -f docker-compose.yml exec worker \
cat /var/lib/trivy/db/metadata.json | jq '.Created'
- Wipe + re-download (if metadata is corrupted):
docker-compose -f docker-compose.yml exec worker \
rm -rf /var/lib/trivy/db
docker-compose -f docker-compose.yml restart worker
The boot-time trivy --download-db-only runs and re-populates the directory within 1–3 minutes.
- Mirror fallback (if ghcr.io is unreachable from the worker): point TRIVY_DB_REPOSITORY at your internal mirror; see the "Air-gapped operation" section of Vulnerability data.
After recovery, the automatic re-match beat picks up missed CVEs against existing scans on its next cycle — no operator action.
Escalate
- If two refresh attempts fail with the same error, OR
- If the internal mirror itself reports unauthorized despite a recent trivy registry login, OR
- If metadata.json exists but Results on a spot scan is empty across multiple ecosystems (suggests a schema mismatch).
Page the portal dev team with: worker logs (docker-compose logs --tail=2000 worker), the metadata.json content, and the output of trivy --version from inside the worker.
Scenario 2 — Auto-backup failed for 3 days
Symptom
PagerDuty: TRUSCA auto-backup task failure count = 3.
Customer impact
- All in-portal data is at risk if the host crashes (no recent backup to restore from). Plan downstream tasks (compliance freezes, etc.) accordingly until a fresh backup lands.
Diagnose
# 1. Celery Beat schedule heartbeat
docker-compose logs --tail=500 beat | grep daily-auto-backup
# 2. Worker logs for backup task runs
docker-compose logs --tail=2000 worker | grep -E 'backup\.(completed|failed)' | tail -20
# 3. Most recent backup row + status
curl -fsS "https://<your-host>/v1/admin/backup/list" \
-H "Authorization: Bearer $ACCESS_TOKEN" | jq '.items[0:5]'
# 4. Disk free on the backup volume (BACKUPS_ROOT is mounted at
# /opt/trustedoss/backups in the backend container)
docker-compose -f docker-compose.yml exec backend df -h /opt/trustedoss/backups
Recover
- Manual trigger (UI: /admin/backup → Run manual backup now, or):
curl -fsS -X POST "https://<your-host>/v1/admin/backup/trigger" \
-H "Authorization: Bearer $ACCESS_TOKEN"
- If manual also fails, run the host backup script directly:
scripts/backup.sh is a host script: it shells out to docker-compose ... exec for pg_dump and tars the workspace mount, so run it on the host (not inside a container). It writes to BACKUP_DIR when set, otherwise backups/<stamp> under the repo root (mounted at /opt/trustedoss/backups).
# From the deploy directory on the host (where docker-compose.yml + .env live).
BACKUP_DIR=backups/debug-$(date +%Y%m%d-%H%M%S) bash scripts/backup.sh --no-prune 2>&1
  - .env not found → run from the deploy directory, or the install is incomplete.
  - Server version mismatch → postgresql-client-17 missing in the postgres image (regression; escalate).
  - Disk full → see Scenario 4.
Escalate
- If bash scripts/backup.sh fails for non-disk, non-permission reasons, OR
- If the most recent successful backup is older than 7 days (the auto-purge window, so restore options are narrowing).
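When the API is unreachable, the 7-day condition can be approximated from the host by looking at modification times in the backups directory. The sketch below is an illustration under stated assumptions: has_recent_backup is a hypothetical helper, each backup is assumed to be a direct child of the directory, and it cannot tell a successful backup from a partial one, so the backup list in the portal remains the authoritative source.

```shell
# has_recent_backup is a hypothetical helper, not part of scripts/backup.sh.
# Succeeds if directory $1 contains at least one direct child (file or
# directory) modified within the last $2 days (default 7).
# Assumption: each backup is a direct child of the backups directory and
# its modification time reflects when the backup finished.
has_recent_backup() {
  local dir days
  dir=$1
  days=${2:-7}
  [ -n "$(find "$dir" -mindepth 1 -maxdepth 1 -mtime "-$days" -print -quit)" ]
}

# Usage from the deploy directory on the host:
#   has_recent_backup backups 7 || echo "no backup newer than 7 days"
```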
Scenario 3 — Scan stuck in running for ≥ 4 hours
Symptom
PagerDuty: TRUSCA scan running > 4h for project X.
Customer impact
- That project: blocked from new scans (one-running-at-a-time).
- Other projects: unaffected unless worker concurrency = 1 (default 2).
Diagnose
# 1. Which stage is it stuck at?
curl -fsS "https://<your-host>/v1/scans/<scan_id>" \
-H "Authorization: Bearer $ACCESS_TOKEN" | jq '.progress_payload, .latest_log_frame'
# 2. Celery active tasks
docker-compose exec worker celery -A apps.backend.tasks.celery_app inspect active
# 3. Worker process tree (look for orphaned subprocesses)
docker-compose exec worker ps -ef | grep -E 'cdxgen|ort|trivy'
Recover
- Force-cancel the scan (preferred — no worker-wide impact):
curl -fsS -X POST "https://<your-host>/v1/admin/scans/<scan_id>/cancel" \
-H "Authorization: Bearer $ACCESS_TOKEN"
- If cancel doesn't release the task (worker truly hung):
# Last resort: kills all in-flight tasks on this worker.
docker-compose restart worker
Other in-flight scans on the same worker will be marked failed and require manual re-run.
Escalate
- If the same project hangs at the same stage twice in a row (suggests a content-side issue: large git history, malformed lockfile, or trivy sbom timeout). Page portal dev team with <scan_id> and the last 200 lines of worker logs filtered to that task.
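One way to produce the filtered log excerpt for the page is sketched below. filter_task_logs is a hypothetical helper, and it assumes worker log lines for a scan contain the scan id verbatim; if the worker logs a separate Celery task id instead, pass that string.

```shell
# filter_task_logs is a hypothetical helper, not part of TRUSCA.
# Reads log lines on stdin and prints the last $2 lines (default 200)
# that contain the literal string $1.
# Assumption: worker log lines for a scan include its scan id verbatim.
filter_task_logs() {
  grep -F -- "$1" | tail -n "${2:-200}"
}

# Usage on the host (the --tail value is an arbitrary starting point):
#   docker-compose logs --no-color --tail=5000 worker | filter_task_logs "<scan_id>" > scan.log
```

Using grep -F treats the id as a fixed string, so ids containing dots or dashes are not misread as regular-expression syntax.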
Scenario 4 — Host disk ≥ 95%
Symptom
PagerDuty: TRUSCA disk = 95%+.
Customer impact
- In-flight scans continue. New scans are blocked at the DISK_HARD_LIMIT_PCT threshold (default 95%); /admin/scans shows them as queued indefinitely.
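A quick host-side check against that threshold could look like the sketch below. disk_use_pct and disk_over_limit are hypothetical helpers. Treating "at the threshold" as greater than or equal to the limit is an assumption about how the portal compares the values.

```shell
# disk_use_pct and disk_over_limit are hypothetical helpers, not part of TRUSCA.
# disk_use_pct reads `df -P <path>` output on stdin and prints the
# integer use percentage of the first filesystem line.
disk_use_pct() {
  awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# disk_over_limit succeeds when usage ($1) is at or above the limit
# ($2, default 95 to match the documented DISK_HARD_LIMIT_PCT default).
# Assumption: the portal blocks at >= the limit rather than strictly above it.
disk_over_limit() {
  [ "$1" -ge "${2:-95}" ]
}

# Usage on the host:
#   pct=$(df -P /opt/trustedoss | disk_use_pct)
#   disk_over_limit "$pct" && echo "new scans will be blocked (${pct}% used)"
```

The -P flag keeps each filesystem on a single line, which is what makes the fixed column position safe to parse.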
Diagnose
# 1. Host-wide
df -h /opt/trustedoss
docker system df
# 2. Per-card breakdown via the portal
curl -fsS "https://<your-host>/v1/admin/disk" \
-H "Authorization: Bearer $ACCESS_TOKEN" | jq
# 3. Workspace breakdown (most common offender)
# (sh -c makes the glob expand inside the container, not on the host)
docker-compose exec -T worker sh -c 'du -sh /workspace/*' | sort -h | tail -10
# 4. Postgres database size
docker-compose exec postgres psql -U trustedoss -d trustedoss \
-c "SELECT pg_size_pretty(pg_database_size('trustedoss'));"