Incidents and runbooks — overview

How incident response works on the fleet: the loop from alert to postmortem, the twelve incident classes the lab has paid for, and how to use this section under pressure.

This subsection is the operator’s incident library. Twelve incidents have been documented in detail because the lab has paid for each at least once. Reading them ahead of time is the cheapest insurance the fleet has.

Every runbook follows the same five-section schema, mandated by the section tracker (#229) and by the workspace’s runbook convention:

| Section | What it answers |
| --- | --- |
| Symptom | What does the operator see? What does the alert / log line / oc output look like? |
| Root cause | Why does this happen? What is the underlying contract that broke? |
| Fix | Concrete, copy-pasteable commands to recover the fleet. |
| Prevention | What guardrail, alert, or design change reduces the chance of recurrence? |
| References | The runbook file, ADRs, and incident issues that back the page. |
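
A minimal sketch of how the five-section schema could be checked mechanically. It assumes each section name appears on its own line as a heading (optionally prefixed with `#`); that layout and the function name `schema_problems` are hypothetical, not something the section tracker prescribes.

```python
# Section names come from the schema table above. The "one heading per
# line" layout is an assumption made for this sketch.
REQUIRED_SECTIONS = ["Symptom", "Root cause", "Fix", "Prevention", "References"]


def schema_problems(runbook_text: str) -> list:
    """Return a list of schema problems; an empty list means the runbook passes."""
    # Normalise each line: drop surrounding whitespace and any leading '#'.
    lines = [line.strip().lstrip("#").strip() for line in runbook_text.splitlines()]

    # Record the first line index at which each required heading appears.
    positions = {}
    for index, line in enumerate(lines):
        if line in REQUIRED_SECTIONS and line not in positions:
            positions[line] = index

    problems = [
        f"missing section: {name}"
        for name in REQUIRED_SECTIONS
        if name not in positions
    ]

    # The headings that are present must appear in schema order.
    found = [name for name in REQUIRED_SECTIONS if name in positions]
    as_written = sorted(found, key=lambda name: positions[name])
    if found != as_written:
        problems.append("sections out of order: " + ", ".join(as_written))
    return problems
```

A check like this could run in CI against a runbook file before it is proposed for publication; it only looks at structure and does not verify that the commands under Fix are correct.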

The incident loop

The loop runs through five steps, in order:

  1. Detect. An alert fires, a kustomize build fails, Argo CD shows ComparisonError, a user reports a 404. Open an incident issue on zeshaq/opp-full-plat with the label incident and the cluster name.
  2. Triage. Match the symptom to one of the runbooks below. The diagnostic is usually a single read-only command (e.g., oc get --raw /openapi/v2 for the Routes CRD incident).
  3. Fix. Most fixes are GitOps changes. Some — node cordons, a CRD delete to recover /openapi/v2, a stuck-node desiredConfig patch — are break-glass actions per ADR 0025 and produce an audit record.
  4. Validate. Argo CD returns to Synced / Healthy (or the incident-specific success indicator). Update the issue with validation evidence.
  5. Postmortem. Write a dated session report under opp-full-plat/reports/sessions/. If the lesson is durable, update the relevant runbook, and (if architectural) open an ADR amendment review issue.
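
The five steps above can be sketched as a small state machine that refuses to skip a stage, for example jumping from Fix straight to Postmortem without Validate. The `Incident` class, its fields, and the cluster name in the usage note are hypothetical; the lab tracks incidents in GitHub issues, not in code like this.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Stage(IntEnum):
    """The five steps of the incident loop, in order."""
    DETECT = 1
    TRIAGE = 2
    FIX = 3
    VALIDATE = 4
    POSTMORTEM = 5


@dataclass
class Incident:
    # Hypothetical record; in practice this state lives on the incident issue.
    cluster: str
    stage: Stage = Stage.DETECT
    notes: list = field(default_factory=list)

    def advance(self, to: Stage, note: str) -> None:
        """Move to the immediately following stage, recording a note."""
        if int(to) != int(self.stage) + 1:
            raise ValueError(
                f"cannot move from {self.stage.name} to {to.name}"
            )
        self.notes.append(f"{to.name}: {note}")
        self.stage = to
```

Usage: `Incident(cluster="lab-1").advance(Stage.TRIAGE, "matched a runbook")` succeeds, while advancing the same fresh incident directly to `Stage.VALIDATE` raises `ValueError`. The design choice is that validation evidence cannot be skipped on the way to the postmortem.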

The twelve incident classes

| # | Page | Class | Severity |
| --- | --- | --- | --- |
| 02 | ACM gitops-addon Routes CRD | All Argo sync silently stalls; recurring on ACM + OpenShift GitOps pull model clusters | High: fleet-wide GitOps pause |
| 03 | ESO egress blocked to Vault | Red Hat ESO ships default-deny NetworkPolicies; ClusterSecretStore hangs | Medium: affects secret delivery |
| 04 | IPv6 disable breaks OVN-K | “Disable IPv6 at host” MachineConfig variants both break OVN-K; nodes go Ready=False | Critical: cluster network down |
| 05 | OBC -> operand storage-Secret bridge | NooBaa OBC Secret key shape doesn’t match LokiStack/TempoStack expectation | Medium: observability stack stuck Warning Degraded |
| 06 | SigNoz v0.122 auth API shape | v0.121 -> v0.122 login API moved; old code silently fails | Low: affects automation, not user UI |
| 07 | WSO2 APIM JMS URL encoding | HTML-escaped & in JMS URL where %26 is required; published APIs don’t reach gateway | Medium: silent app-publish failure |
| 08 | Break-glass procedure | Operator-facing companion to ADR 0025: when GitOps can’t recover in time | Procedure: gates every direct mutation |
| 09 | DefectDojo Jenkins import | Trivy -> DefectDojo import RED/UNSTABLE: token drift, parser bumps, dedup gotchas | Medium: supply-chain visibility |
| 10 | Jenkins + GitLab webhook pollSCM bootstrap | New job: webhook HTTP 200 but no builds; declarative triggers register at build time | Low: first-push gotcha |
| 11 | MCO stuck-node recovery | Procedural companion to #135: desiredConfig annotation patch unsticks a node MCO refuses to roll | Critical: cluster network down |
| 12 | Secrets custody drift check | Local-mirror credential silently mismatches the live VM; probe-before-you-work protocol | Procedure: prevents 10-minute mid-task fights |
| 13 | PCI-DSS remediation and evidence | Playbook for the PCI-0 -> PCI-5 chain: scan interpretation, exception process, evidence pack | Procedure: quarterly audit cycle |

How to use this section under pressure

  • At 3 AM, start at the symptom table at the top of each runbook. Each page leads with “When this runbook applies” — match your situation against the bullets. If your situation does not match, do not force-fit.
  • Capture before-state first. Every fix that involves a live mutation requires a before-<kind>-<name>.yaml capture per the break-glass procedure. The capture is the audit record.
  • Make the smallest change. A single oc delete crd is preferable to an oc apply -k of the whole cluster overlay. Argo will own the resource going forward; you are unblocking, not reconfiguring.
  • Walk the validation block. Each runbook ends with a “Validation” or “When this is resolved” block. Run every command in it. Do not declare done because the symptom went away — the symptom going away can be a side effect.
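
As an illustration of the "capture before-state first" rule, the sketch below builds the read-only `oc get ... -o yaml` command and the matching `before-<kind>-<name>.yaml` filename without running anything. The helper names and the example resource `widgets.example.com` are hypothetical; the filename pattern is the one stated above, and the break-glass runbook remains the authority on what must be captured.

```python
import shlex
from typing import List, Optional, Tuple


def before_state_capture(
    kind: str, name: str, namespace: Optional[str] = None
) -> Tuple[List[str], str]:
    """Return (argv, filename) for a before-state capture; nothing is executed."""
    # 'oc get <kind> <name> -o yaml' is read-only and prints the live object.
    argv = ["oc", "get", kind, name, "-o", "yaml"]
    if namespace:
        # Namespaced resources need -n; cluster-scoped ones (e.g. CRDs) do not.
        argv += ["-n", namespace]
    filename = f"before-{kind}-{name}.yaml"
    return argv, filename


def as_shell_line(argv: List[str], filename: str) -> str:
    """Render the capture as one copy-pasteable shell line."""
    return f"{shlex.join(argv)} > {shlex.quote(filename)}"


if __name__ == "__main__":
    # Hypothetical resource name, used only for illustration.
    argv, filename = before_state_capture("crd", "widgets.example.com")
    print(as_shell_line(argv, filename))
    # prints: oc get crd widgets.example.com -o yaml > before-crd-widgets.example.com.yaml
```

Building the command as an argument list and quoting it only when rendering keeps resource names with unusual characters from being misread by the shell.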

How runbook entries land here

A runbook becomes a published page when all of the following are true:

  • The incident has occurred at least once in the lab with a recorded session-report or GitHub-issue post-mortem.
  • The fix has been validated in production by walking the documented commands.
  • A future operator (or the same operator, six months from now) could follow it without re-deriving the recipe.

If you experience an incident that does not fit one of the twelve above, open an incident issue, write the postmortem, then propose a new runbook MR against this site and a parallel runbook under opp-full-plat/runbooks/. The same lesson lives in two places: the operator-facing runbook (procedure under pressure) and this section’s published page (the public-facing summary).

What is not in this section

  • Generic OpenShift troubleshooting. This section covers incident classes that the lab has specifically paid for. For generic CrashLoopBackOff / ImagePullBackOff / PodSchedulingFailed troubleshooting, the OpenShift docs and the cluster’s Events stream are the right entry points.
  • Application-side incidents. Workload-specific incidents (CNPG primary failover, Open Liberty health-check fail) belong in the application’s own runbook surface under developer-handbook/, not here.
  • Compliance scan failures. These have their own audit / remediation cycle under connection-details/compliance-implementor-handbook.md — they are operator-followed but not incident-shaped.

References

  • opp-full-plat/runbooks/break-glass-procedure.md — overall break-glass policy
  • opp-full-plat/runbooks/mco-stuck-node-recovery.md
  • opp-full-plat/runbooks/openshift-ipv6-disable-correct-approach.md
  • opp-full-plat/runbooks/wso2-apim-jms-url-encoding-trap.md
  • opp-full-plat/runbooks/secrets-custody-drift-check.md
  • opp-full-plat/runbooks/jenkins-gitlab-webhook-pollscm.md
  • opp-full-plat/runbooks/defectdojo-jenkins-import.md
  • ADRs: 0018 (pull model), 0019 (image supply), 0025 (break-glass)
  • Issues: #135 (IPv6/OVN incident), #142 (MCO runbook), #153 (Routes CRD permanent fix), #229 (this section)

Last reviewed: 2026-05-12