Incidents and runbooks — overview

How incident response works on the fleet: the loop from alert to postmortem, the twelve incident classes the lab has paid for, and how to use this section under pressure.

This subsection is the operator’s incident library. Twelve incidents have been documented in detail because the lab has paid for each at least once. Reading them ahead of time is the cheapest insurance the fleet has.

Every runbook follows the same five-section schema, mandated by the section tracker (#229) and by the workspace’s runbook convention:

Section	What it answers
Symptom	What does the operator see? What does the alert / log line / `oc` output look like?
Root cause	Why does this happen? What is the underlying contract that broke?
Fix	Concrete commands — copy-paste-able — to recover the fleet.
Prevention	What guardrail, alert, or design change reduces the chance of recurrence?
References	The runbook file, ADRs, and incident issues that back the page.
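The five-section schema lends itself to a mechanical check before a runbook is merged. The following is a minimal sketch, not the lab's actual tooling: it assumes runbooks are Markdown files whose section titles appear as headings named exactly as in the table above, and the function name `missing_sections` is hypothetical.

```python
import re

# The five section names from the schema table, in the order the table lists them.
REQUIRED_SECTIONS = ["Symptom", "Root cause", "Fix", "Prevention", "References"]


def missing_sections(runbook_text: str) -> list[str]:
    """Return the required sections that have no matching heading.

    Assumption: each section is a Markdown heading ("#" to "######") whose
    title equals the section name, compared case-insensitively.
    """
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(
            r"^#{1,6}[ \t]+(.+?)[ \t]*$", runbook_text, re.MULTILINE
        )
    }
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]


if __name__ == "__main__":
    sample = "# Example runbook\n\n## Symptom\n...\n## Root cause\n...\n## Fix\n...\n"
    # Lists whichever of the five sections the sample lacks.
    print(missing_sections(sample))
```

An empty list means all five sections are present; anything else names what still has to be written before the page meets the schema.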

The incident loop

Detect (alert / drift / user report) -> Triage (symptom -> runbook) -> Fix (GitOps MR or break-glass) -> Validate (Argo Synced/Healthy) -> Postmortem (session report) -> Runbook update

Operator note

The loop runs left to right and bottom-up:

Detect. An alert fires, a kustomize build fails, Argo CD shows ComparisonError, or a user reports a 404. Open an incident issue on zeshaq/opp-full-plat with the label incident and the cluster name.
Triage. Match the symptom to one of the runbooks below. The diagnostic is usually a single read-only command (e.g., oc get --raw /openapi/v2 for the Routes CRD incident).
Fix. Most fixes are GitOps changes. Some — node cordons, a CRD delete to recover /openapi/v2, a stuck-node desiredConfig patch — are break-glass actions per ADR 0025 and produce an audit record.
Validate. Argo CD returns to Synced / Healthy (or the incident-specific success indicator). Update the issue with validation evidence.
Postmortem. Write a dated session report under opp-full-plat/reports/sessions/. If the lesson is durable, update the relevant runbook, and (if architectural) open an ADR amendment review issue.
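The triage step above relies on a single read-only command. As a sketch of how that probe could be scripted for the Routes CRD case, the helper below wraps `oc get --raw /openapi/v2`. The function name and the pass/fail rule are assumptions: it treats a zero exit code plus a JSON body as "endpoint serving" and anything else as a reason to open the Routes CRD runbook, which remains the authority on what the failure actually looks like.

```python
import json
import subprocess


def openapi_v2_readable(run=subprocess.run) -> bool:
    """Run the read-only diagnostic `oc get --raw /openapi/v2`.

    Assumption: exit code 0 and a JSON body mean the endpoint is serving.
    The `run` parameter exists only so the probe can be exercised without
    a cluster; by default it is subprocess.run.
    """
    result = run(
        ["oc", "get", "--raw", "/openapi/v2"],
        capture_output=True,
        text=True,
        check=False,  # a failing probe is an answer, not an exception
    )
    if result.returncode != 0:
        return False
    try:
        json.loads(result.stdout)
    except json.JSONDecodeError:
        return False
    return True
```

The probe never mutates the cluster, so it is safe to run before any break-glass decision has been made.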

The twelve incident classes

#	Page	Class	Severity
02	ACM gitops-addon Routes CRD	All Argo sync silently stalls; recurring on ACM + OpenShift GitOps pull model clusters	High — fleet-wide GitOps pause
03	ESO egress blocked to Vault	Red Hat ESO ships default-deny NetworkPolicies; ClusterSecretStore hangs	Medium — affects secret delivery
04	IPv6 disable breaks OVN-K	“Disable IPv6 at host” MachineConfig variants both break OVN-K; nodes go `Ready=False`	Critical — cluster network down
05	OBC -> operand storage-Secret bridge	NooBaa OBC Secret key shape doesn’t match LokiStack/TempoStack expectation	Medium — observability stack stuck `Warning Degraded`
06	SigNoz v0.122 auth API shape	v0.121 -> v0.122 login API moved; old code silently fails	Low — affects automation, not user UI
07	WSO2 APIM JMS URL encoding	HTML-escaped `&` in JMS URL where `%26` is required; published APIs don’t reach gateway	Medium — silent app-publish failure
08	Break-glass procedure	Operator-facing companion to ADR 0025: when GitOps can’t recover in time	Procedure — gates every direct mutation
09	DefectDojo Jenkins import	Trivy -> DefectDojo import RED/UNSTABLE: token drift, parser bumps, dedup gotchas	Medium — supply-chain visibility
10	Jenkins + GitLab webhook pollSCM bootstrap	New job: webhook `HTTP 200` but no builds; declarative triggers register at build time	Low — first-push gotcha
11	MCO stuck-node recovery	Procedural companion to #135: `desiredConfig` annotation patch unsticks a node MCO refuses to roll	Critical — cluster network down
12	Secrets custody drift check	Local-mirror credential silently mismatches the live VM; probe-before-you-work protocol	Procedure — prevents 10-minute mid-task fights
13	PCI-DSS remediation and evidence	Playbook for the PCI-0 -> PCI-5 chain: scan interpretation, exception process, evidence pack	Procedure — quarterly audit cycle
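Row 07 turns on two different encodings of the same character. The sketch below shows the difference using only the Python standard library and flags HTML-escaped ampersands in a URL string. The helper name and the sample URL are hypothetical, and deciding which occurrences must become `%26` is left to the WSO2 runbook.

```python
import html
from urllib.parse import quote

# The two encodings of "&" that the incident confuses.
HTML_ESCAPED = html.escape("&")        # "&amp;"  (HTML/XML entity)
PERCENT_ENCODED = quote("&", safe="")  # "%26"    (URL percent-encoding)


def html_escaped_ampersand_offsets(url: str) -> list[int]:
    """Return the character offset of every "&amp;" in the given URL string."""
    offsets = []
    start = url.find(HTML_ESCAPED)
    while start != -1:
        offsets.append(start)
        start = url.find(HTML_ESCAPED, start + len(HTML_ESCAPED))
    return offsets


if __name__ == "__main__":
    # Hypothetical URL fragment, used only to exercise the detector.
    sample = "tcp://broker.example:5672?retries='5'&amp;connectdelay='50'"
    print(HTML_ESCAPED, PERCENT_ENCODED)
    print(html_escaped_ampersand_offsets(sample))
```

The detector only reports; it does not rewrite the URL, because a blanket replacement could change ampersands that are correct where they are.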

How to use this section under pressure

At 3 AM, start at the symptom table at the top of each runbook. Each page leads with “When this runbook applies” — match your situation against the bullets. If your situation does not match, do not force-fit.
Capture before-state first. Every fix that involves a live mutation requires a before-<kind>-<name>.yaml capture per the break-glass procedure. The capture is the audit record.
Make the smallest change. A single oc delete crd is preferable to an oc apply -k of the whole cluster overlay. Argo will own the resource going forward; you are unblocking, not reconfiguring.
Walk the validation block. Each runbook ends with a “Validation” or “When this is resolved” block. Run every command in it. Do not declare done because the symptom went away — the symptom going away can be a side effect.
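The before-state capture can be made routine with a small helper. This is a sketch under stated assumptions: it follows the `before-<kind>-<name>.yaml` naming from the text, and it assumes the capture itself is an `oc get <kind> <name> -o yaml` dump, which the break-glass procedure may define differently. All function names are hypothetical.

```python
import subprocess
from typing import List, Optional


def before_state_filename(kind: str, name: str) -> str:
    """Build the capture filename in the before-<kind>-<name>.yaml shape."""
    return f"before-{kind}-{name}.yaml"


def capture_argv(kind: str, name: str, namespace: Optional[str] = None) -> List[str]:
    """Build the read-only `oc get ... -o yaml` command for the capture.

    Assumption: a plain YAML dump of the live object is an acceptable
    before-state record.
    """
    argv = ["oc", "get", kind, name, "-o", "yaml"]
    if namespace is not None:
        argv += ["-n", namespace]
    return argv


def capture_before_state(kind: str, name: str, namespace: Optional[str] = None) -> str:
    """Run the capture command, write its output to the file, return the path."""
    result = subprocess.run(
        capture_argv(kind, name, namespace),
        capture_output=True,
        text=True,
        check=True,  # refuse to continue if the capture itself failed
    )
    path = before_state_filename(kind, name)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(result.stdout)
    return path
```

Calling `capture_before_state("node", "worker-0")` (a hypothetical node name) before the mutation leaves `before-node-worker-0.yaml` in the working directory, ready to attach to the incident issue.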

How runbook entries land here

A runbook becomes a published page when all of the following are true:

The incident has occurred at least once in the lab with a recorded session-report or GitHub-issue post-mortem.
The fix has been validated in production by walking the documented commands.
A future operator (or the same operator, six months from now) could follow it without re-deriving the recipe.

If you experience an incident that does not fit one of the twelve classes above, open an incident issue, write the postmortem, then propose a new runbook MR against this site and a parallel runbook under opp-full-plat/runbooks/. The same lesson lives in two places: the operator-facing runbook (procedure under pressure) and this section’s published page (the public-facing summary).

What is not in this section

Generic OpenShift troubleshooting. This section covers incident classes that the lab has specifically paid for. For generic CrashLoopBackOff / ImagePullBackOff / PodSchedulingFailed troubleshooting, the OpenShift docs and the cluster’s Events stream are the right entry points.
Application-side incidents. Workload-specific incidents (CNPG primary failover, Open Liberty health-check fail) belong in the application’s own runbook surface under developer-handbook/, not here.
Compliance scan failures. These have their own audit / remediation cycle under connection-details/compliance-implementor-handbook.md — they are operator-followed but not incident-shaped.

References

opp-full-plat/runbooks/break-glass-procedure.md — overall break-glass policy
opp-full-plat/runbooks/mco-stuck-node-recovery.md
opp-full-plat/runbooks/openshift-ipv6-disable-correct-approach.md
opp-full-plat/runbooks/wso2-apim-jms-url-encoding-trap.md
opp-full-plat/runbooks/secrets-custody-drift-check.md
opp-full-plat/runbooks/jenkins-gitlab-webhook-pollscm.md
opp-full-plat/runbooks/defectdojo-jenkins-import.md
ADRs: 0018 (pull model), 0019 (image supply), 0025 (break-glass)
Issues: #135 (IPv6/OVN incident), #142 (MCO runbook), #153 (Routes CRD permanent fix), #229 (this section)