Incidents and runbooks — overview
How incident response works on the fleet: the loop from alert to postmortem, the twelve incident classes the lab has paid for, and how to use this section under pressure.
This subsection is the operator’s incident library. Twelve incidents have been documented in detail because the lab has paid for each at least once. Reading them ahead of time is the cheapest insurance the fleet has.
Every runbook follows the same five-section schema, mandated by the section tracker (#229) and by the workspace’s runbook convention:
| Section | What it answers |
|---|---|
| Symptom | What does the operator see? What does the alert / log line / oc output look like? |
| Root cause | Why does this happen? What is the underlying contract that broke? |
| Fix | Concrete commands — copy-paste-able — to recover the fleet. |
| Prevention | What guardrail, alert, or design change reduces the chance of recurrence? |
| References | The runbook file, ADRs, and incident issues that back the page. |
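The five-section schema lends itself to a mechanical check before a runbook is proposed. The sketch below is one possible way to do that, not the tracker's actual tooling: the function name `missing_sections` is hypothetical, and it assumes each section name starts a line, optionally preceded by Markdown heading markers.

```shell
#!/usr/bin/env bash
# Sketch: report which of the five schema sections are missing from a runbook.
# Assumption: each section name starts a line, optionally after Markdown
# heading markers (for example "## Symptom"). Adjust to the real layout.

missing_sections() {
  local file="$1" section missing=0
  for section in "Symptom" "Root cause" "Fix" "Prevention" "References"; do
    # -q: exit status only, -i: case-insensitive, -E: extended regex.
    # The trailing group stops "Fix" from matching a word such as "Fixes".
    if ! grep -qiE "^#* *${section}([^[:alnum:]]|\$)" "$file"; then
      echo "missing: ${section}"
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `missing_sections path/to/runbook.md` prints one line per absent section and returns non-zero if any are missing, so it can gate a review step.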
The incident loop
The loop runs left to right and bottom-up:
- Detect. An alert fires, a `kustomize` build fails, Argo CD shows `ComparisonError`, a user reports a 404. Open an incident issue on `zeshaq/opp-full-plat` with the label `incident` and the cluster name.
- Triage. Match the symptom to one of the runbooks below. The diagnostic is usually a single read-only command (e.g., `oc get --raw /openapi/v2` for the Routes CRD incident).
- Fix. Most fixes are GitOps changes. Some (node cordons, a CRD delete to recover `/openapi/v2`, a stuck-node `desiredConfig` patch) are break-glass actions per ADR 0025 and produce an audit record.
- Validate. Argo CD returns to `Synced / Healthy` (or the incident-specific success indicator). Update the issue with validation evidence.
- Postmortem. Write a dated session report under `opp-full-plat/reports/sessions/`. If the lesson is durable, update the relevant runbook and, if it is architectural, open an ADR amendment review issue.
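The read-only ends of the loop (Triage and Validate) can be sketched as two small shell helpers. This is an illustration under stated assumptions, not the lab's tooling: the function names are hypothetical, `oc` is assumed to be logged in to the affected cluster, and the Application name and namespace are placeholders supplied by the caller. The status fields read here (`.status.sync.status`, `.status.health.status`) are the standard Argo CD Application status fields.

```shell
#!/usr/bin/env bash
# Sketch of the read-only triage and validation steps of the incident loop.
# Assumptions: `oc` is logged in to the affected cluster; the Application
# name and namespace passed to argo_app_ok are placeholders.

# Triage: does the aggregated OpenAPI document still serve?
openapi_ok() {
  if oc get --raw /openapi/v2 >/dev/null 2>&1; then
    echo "openapi/v2: ok"
  else
    echo "openapi/v2: FAILED (compare against the Routes CRD runbook)"
    return 1
  fi
}

# Validate: print "<sync>/<health>" for one Argo CD Application and
# succeed only when it reads Synced/Healthy.
argo_app_ok() {
  local app="$1" ns="$2" state
  state="$(oc get applications.argoproj.io "$app" -n "$ns" \
    -o jsonpath='{.status.sync.status}/{.status.health.status}')" || return 1
  echo "$state"
  [ "$state" = "Synced/Healthy" ]
}
```

Both helpers only read cluster state, so they are safe to run before deciding whether a break-glass action is needed. The printed output is suitable for pasting into the incident issue as validation evidence.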
The twelve incident classes
| # | Page | Class | Severity |
|---|---|---|---|
| 02 | ACM gitops-addon Routes CRD | All Argo sync silently stalls; recurring on ACM + OpenShift GitOps pull model clusters | High — fleet-wide GitOps pause |
| 03 | ESO egress blocked to Vault | Red Hat ESO ships default-deny NetworkPolicies; ClusterSecretStore hangs | Medium — affects secret delivery |
| 04 | IPv6 disable breaks OVN-K | “Disable IPv6 at host” MachineConfig variants both break OVN-K; nodes go Ready=False | Critical — cluster network down |
| 05 | OBC -> operand storage-Secret bridge | NooBaa OBC Secret key shape doesn’t match LokiStack/TempoStack expectation | Medium — observability stack stuck Warning Degraded |
| 06 | SigNoz v0.122 auth API shape | v0.121 -> v0.122 login API moved; old code silently fails | Low — affects automation, not user UI |
| 07 | WSO2 APIM JMS URL encoding | HTML-escaped & in JMS URL where %26 is required; published APIs don’t reach gateway | Medium — silent app-publish failure |
| 08 | Break-glass procedure | Operator-facing companion to ADR 0025: when GitOps can’t recover in time | Procedure — gates every direct mutation |
| 09 | DefectDojo Jenkins import | Trivy -> DefectDojo import RED/UNSTABLE: token drift, parser bumps, dedup gotchas | Medium — supply-chain visibility |
| 10 | Jenkins + GitLab webhook pollSCM bootstrap | New job: webhook HTTP 200 but no builds; declarative triggers register at build time | Low — first-push gotcha |
| 11 | MCO stuck-node recovery | Procedural companion to #135: desiredConfig annotation patch unsticks a node MCO refuses to roll | Critical — cluster network down |
| 12 | Secrets custody drift check | Local-mirror credential silently mismatches the live VM; probe-before-you-work protocol | Procedure — prevents 10-minute mid-task fights |
| 13 | PCI-DSS remediation and evidence | Playbook for the PCI-0 -> PCI-5 chain: scan interpretation, exception process, evidence pack | Procedure — quarterly audit cycle |
How to use this section under pressure
- At 3 AM, start at the symptom table at the top of each runbook. Each page leads with “When this runbook applies” — match your situation against the bullets. If your situation does not match, do not force-fit.
- Capture before-state first. Every fix that involves a live mutation requires a `before-<kind>-<name>.yaml` capture per the break-glass procedure. The capture is the audit record.
- Make the smallest change. A single `oc delete crd` is preferable to an `oc apply -k` of the whole cluster overlay. Argo will own the resource going forward; you are unblocking, not reconfiguring.
- Walk the validation block. Each runbook ends with a “Validation” or “When this is resolved” block. Run every command in it. Do not declare done because the symptom went away; the symptom going away can be a side effect.
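The before-state capture can be wrapped in a helper so it is hard to skip under pressure. The sketch below is a minimal illustration: the function name `capture_before` and the `CAPTURE_DIR` variable are hypothetical, and the break-glass procedure remains the authority on where captures are stored.

```shell
#!/usr/bin/env bash
# Sketch: save the live object before a break-glass mutation.
# Assumptions: `oc` is logged in; CAPTURE_DIR is a hypothetical variable
# naming the output directory (defaults to the current directory).

capture_before() {
  local kind="$1" name="$2"
  shift 2   # any remaining arguments (for example: -n <namespace>) go to oc
  local out="${CAPTURE_DIR:-.}/before-${kind}-${name}.yaml"
  # Write to a temp file first so a failed `oc get` never leaves an
  # empty or partial capture that could be mistaken for the audit record.
  if oc get "$kind" "$name" "$@" -o yaml > "${out}.tmp"; then
    mv "${out}.tmp" "$out"
    echo "$out"
  else
    rm -f "${out}.tmp"
    echo "capture failed for ${kind}/${name}; do not mutate yet" >&2
    return 1
  fi
}
```

A typical call would be `capture_before node <node-name>` for a cluster-scoped object, or `capture_before secret <name> -n <namespace>` for a namespaced one. The function prints the capture path on success so it can be attached to the incident issue.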
How runbook entries land here
A runbook becomes a published page when all of the following are true:
- The incident has occurred at least once in the lab with a recorded session-report or GitHub-issue post-mortem.
- The fix has been validated in production by walking the documented commands.
- A future operator (or the same operator, six months from now) could follow it without re-deriving the recipe.
If you experience an incident that does not fit one of the twelve above, open an incident issue, write the postmortem, then propose a new runbook MR against this site and a parallel runbook under `opp-full-plat/runbooks/`. The same lesson lives in two places: the operator-facing runbook (procedure under pressure) and this section’s published page (the public-facing summary).
What is not in this section
- Generic OpenShift troubleshooting. This section covers incident classes that the lab has specifically paid for. For generic CrashLoopBackOff / ImagePullBackOff / PodSchedulingFailed troubleshooting, the OpenShift docs and the cluster’s Events stream are the right entry points.
- Application-side incidents. Workload-specific incidents (CNPG primary failover, Open Liberty health-check fail) belong in the application’s own runbook surface under `developer-handbook/`, not here.
- Compliance scan failures. These have their own audit / remediation cycle under `connection-details/compliance-implementor-handbook.md`; they are operator-followed but not incident-shaped.
References
- `opp-full-plat/runbooks/break-glass-procedure.md`: overall break-glass policy
- `opp-full-plat/runbooks/mco-stuck-node-recovery.md`
- `opp-full-plat/runbooks/openshift-ipv6-disable-correct-approach.md`
- `opp-full-plat/runbooks/wso2-apim-jms-url-encoding-trap.md`
- `opp-full-plat/runbooks/secrets-custody-drift-check.md`
- `opp-full-plat/runbooks/jenkins-gitlab-webhook-pollscm.md`
- `opp-full-plat/runbooks/defectdojo-jenkins-import.md`
- ADRs: 0018 (pull model), 0019 (image supply), 0025 (break-glass)
- Issues: #135 (IPv6/OVN incident), #142 (MCO runbook), #153 (Routes CRD permanent fix), #229 (this section)