On-call and escalation

How the lab's small-team on-call rotation works: the paging path, the escalation matrix, who owns hub vs spoke vs supporting VMs, and the break-glass triggers.

This page documents how on-call works on the lab. The lab is operated by a small team (currently a solo operator), so “on-call” reads more as “the convention you follow when something is on fire” than as a multi-tier PagerDuty rota. The conventions still matter: they keep recovery deterministic when the operator is tired, paged at 3 AM, or context-switching from another task.

Per section tracker #229, this page does not expose pager numbers, on-call phone numbers, or the rotation calendar; those live in the operator’s local notes and the team-internal channels.

The on-call posture

| Item | Value |
| --- | --- |
| Active rotation | Solo operator (workspace owner) |
| Off-hours | Best-effort; the lab has no 24x7 SLO |
| Paging surface | GitHub issue with incident label + workspace owner notification |
| Escalation | Workspace owner -> infrastructure team -> network team (external) |
| Acceptable response window | Ninety minutes for incident label; same-business-day for bug label |

The lab does not carry a customer-facing SLA. The on-call posture is appropriate for an internal platform: fast enough that incidents do not compound, deliberate enough that recovery has an audit record.

The paging path

What triggers a page (and who/what does the paging):

| Trigger | Source | Page target | Severity |
| --- | --- | --- | --- |
| OCP Critical alert fires on the hub or spoke (Watchdog excepted) | Prometheus alertmanager -> on-call channel | On-call operator | High |
| Argo CD Application becomes OutOfSync or Degraded | Cluster controller -> alert -> on-call channel | On-call operator | High |
| ClusterOperator goes Available=False or Degraded=True | Prometheus -> alertmanager | On-call operator | High |
| MachineConfigPool Updating=True outside a known rollout window | Periodic check; not an alertmanager rule today | On-call operator (read-only sweep catches it) | Medium |
| MinIO bucket inaccessible | Periodic probe from operator workstation | On-call operator | Medium |
| Vault sealed=true | Periodic probe; alertmanager rule recommended | On-call operator | Critical |
| RHACS Central reports centralStatus.administered=false | Central API probe | On-call operator | Medium |

The “page” itself is currently:

  1. An issue is opened on zeshaq/opp-full-plat with the incident label (manually if the operator notices first; automatically as the alerting integrations mature).
  2. The workspace owner is notified (channel TBD per local convention).
  3. The operator picks up the issue, follows the relevant runbook, and updates the issue with progress.

The convention is “one issue per incident”. Sub-tasks are comments; spinoff issues are linked.

The escalation matrix

Who owns what, in escalation order:

| Surface | Owner | Backup |
| --- | --- | --- |
| hub-dc-v6 cluster lifecycle | Workspace owner | (none; solo) |
| spoke-dc-v6 cluster lifecycle | Workspace owner | (none; solo) |
| Hub OpenShift GitOps (Argo CD on hub) | Workspace owner | (none; solo) |
| Spoke OpenShift GitOps (Argo CD on spoke) | Workspace owner | (none; solo) |
| Nexus VM (nexus-mirror.sub.comptech-lab.com) | Workspace owner | Infrastructure team |
| Vault VM (vault.sub.comptech-lab.com) | Workspace owner | Infrastructure team |
| MinIO VM (minio.sub.comptech-lab.com) | Workspace owner | Infrastructure team |
| GitLab VM (gitlab.sub.comptech-lab.com) | Workspace owner | Infrastructure team |
| SigNoz VM (signoz.sub.comptech-lab.com) | Workspace owner | Infrastructure team |
| Jenkins VM (jenkins.sub.comptech-lab.com) | Workspace owner | Infrastructure team |
| HAProxy edge VM | Infrastructure team | Network team |
| PowerDNS auth + recursor VM | Infrastructure team | Network team |
| Cluster network (CNI, OVN-K, NetworkPolicy) | Workspace owner | Network team for upstream |
| Cluster storage (ODF, LSO) | Workspace owner | (none; solo) |
| Compliance auditor / implementor split | See compliance-implementor-handbook.md | n/a |
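
The matrix can also be kept as a small lookup so that a script or a tired operator gets the same answer every time. This is an illustrative sketch only: the dictionary keys are shortened, hypothetical labels for a subset of the rows above, and `escalation_chain` is not an existing tool in the lab.

```python
from typing import Dict, List, Optional, Tuple

# Subset of the escalation matrix: surface -> (owner, backup). None means "no backup (solo)".
# Keys are hypothetical short labels chosen for this example.
ESCALATION: Dict[str, Tuple[str, Optional[str]]] = {
    "hub-dc-v6": ("Workspace owner", None),
    "spoke-dc-v6": ("Workspace owner", None),
    "vault-vm": ("Workspace owner", "Infrastructure team"),
    "minio-vm": ("Workspace owner", "Infrastructure team"),
    "haproxy-edge-vm": ("Infrastructure team", "Network team"),
    "powerdns-vm": ("Infrastructure team", "Network team"),
    "cluster-storage": ("Workspace owner", None),
}


def escalation_chain(surface: str) -> List[str]:
    """Return the owner followed by the backup (if any) for a surface."""
    owner, backup = ESCALATION[surface]
    return [owner] if backup is None else [owner, backup]


print(escalation_chain("haproxy-edge-vm"))
print(escalation_chain("hub-dc-v6"))
```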

The lab’s “infrastructure team” and “network team” are external teams (not the OpenShift admin). When escalating to them:

  • Capture the symptom and the diagnostic evidence in the GitHub issue first.
  • Translate cluster-side observations into infrastructure-side terms (e.g., “spoke node X cannot resolve vault.sub.comptech-lab.com” rather than “ESO is failing”).
  • Avoid asking external teams to log into the cluster. They do not have OpenShift access.

Break-glass triggers

Break-glass is the explicit policy in ADR 0025 governing live changes that bypass the GitOps MR flow. The full procedure is in opp-full-plat/runbooks/break-glass-procedure.md; this page is the operator-facing summary.

All four of these conditions MUST be true at the moment you decide to act:

  1. There is a declared production incident or imminent production impact.
  2. Argo CD on the owning cluster cannot recover the state on its own within the time budget.
  3. A corrective action is needed in under fifteen minutes.
  4. No GitOps PR path can complete inside that window.

If any of those is false, this is not break-glass. Open an MR.
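
The four-condition gate is a plain logical AND. The sketch below shows one way to make that explicit, for example in a pre-flight checklist script; the class and field names are hypothetical and are not taken from ADR 0025.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class BreakGlassCheck:
    """One boolean per condition, in the order listed above (hypothetical names)."""
    declared_incident_or_imminent_impact: bool
    argocd_cannot_recover_in_budget: bool
    fix_needed_in_under_15_minutes: bool
    no_gitops_path_in_window: bool


def failed_conditions(check: BreakGlassCheck) -> List[str]:
    """Return the names of the conditions that are not met."""
    return [name for name, ok in asdict(check).items() if not ok]


def is_break_glass(check: BreakGlassCheck) -> bool:
    """Break-glass applies only when every condition holds."""
    return not failed_conditions(check)


check = BreakGlassCheck(
    declared_incident_or_imminent_impact=True,
    argocd_cannot_recover_in_budget=True,
    fix_needed_in_under_15_minutes=False,  # there is time to go through GitOps
    no_gitops_path_in_window=True,
)
print(is_break_glass(check), failed_conditions(check))
```

Listing the failed conditions, rather than returning only a boolean, gives the operator the sentence to write in the issue when the answer is "open an MR".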

Acceptable break-glass examples

  • Emergency node cordon + drain because of kernel errors threatening pods.
  • oc delete crd routes.route.openshift.io to recover /openapi/v2 (the Routes CRD incident).
  • oc annotate node ... desiredConfig=... to unstick an MCP after a failed rollout (the MCO stuck-node procedure).
  • A stale finalizer is blocking an operator-managed deletion that GitOps cannot resolve.

Forbidden break-glass actions

Per ADR 0025 §4, these need code-owner review even mid-incident:

  • Bypassing or disabling an RHACS image, deployment, runtime, or admission policy.
  • Disabling, deleting, or modifying security operators directly (RHACS, compliance, cert-manager, External Secrets, oauth).
  • Granting cluster-admin or expanded AppProject destinations to a user or service account.
  • Direct edits to rendered-* MachineConfig objects.
  • Direct edits to etcd, etcd Secrets, or kube-system cluster signing keys.
  • Silent rotation of Vault / Nexus / GitLab credentials without updating Vault and the local mirror in the same change window.

If one of these forbidden actions IS the incident remediation, a second platform admin must be named in the audit record per ADR 0025 §4.

The audit record

Every break-glass action produces a record under opp-full-plat/reports/break-glass/YYYY-MM-DD-<incident-slug>.md. The template lives in runbooks/break-glass-procedure.md. The record captures:

  • Incident issue, cluster, namespace, resource(s), owning Argo CD Application.
  • Actor, start / end / expiry timestamps (UTC).
  • Trigger (one paragraph: why GitOps could not recover).
  • Conditions verified (the four conditions above).
  • Prohibited-action exception (yes/no + second reviewer if yes).
  • Commands executed (verbatim — paste from /tmp/break-glass-*/commands.log).
  • Before/after state captures (redacted for secrets).
  • SSH session log if SSH was used.
  • Validation: immediate symptom cleared, Argo CD post-backport status, backport PR link.
  • Follow-up: issues opened, pattern to promote.

The audit record is the FG-6 evidence per plans/federated-gitops-readiness-gates.md. Its existence is what makes the live change accountable.
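
As an illustration of the record naming convention, the sketch below builds the path `opp-full-plat/reports/break-glass/YYYY-MM-DD-<incident-slug>.md` from a date and a free-text incident title. The slug rule (lowercase, non-alphanumeric runs collapsed to a hyphen) is an assumption of this example; the runbook template may prescribe something different.

```python
import re
from datetime import date
from pathlib import PurePosixPath

RECORD_DIR = PurePosixPath("opp-full-plat/reports/break-glass")


def audit_record_path(incident_date: date, title: str) -> PurePosixPath:
    """Build the audit record path for an incident (slug rule is an assumption)."""
    # Lowercase the title and collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    if not slug:
        raise ValueError("incident title must contain at least one letter or digit")
    return RECORD_DIR / f"{incident_date.isoformat()}-{slug}.md"


print(audit_record_path(date(2026, 5, 11), "Routes CRD: /openapi/v2 broken"))
```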

The clock

Twenty-four hours after the live change, one of the following must be true:

  1. A backport MR has merged into the owning GitOps repo and Argo CD is Synced / Healthy on the merge SHA. The live change is durable desired state.
  2. The live change has been reverted (the captured before-...yaml re-applied) and the cluster has returned to the pre-incident state.

The twenty-four-hour clock is absolute. Drifting past it without backport or revert is the FG-6 audit failure mode.
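
The clock is simple enough to compute by hand, but a small helper avoids timezone mistakes at 3 AM. The following is a minimal sketch, assuming timezone-aware UTC timestamps; the status strings and the function name are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

BACKPORT_WINDOW = timedelta(hours=24)


def clock_status(changed_at: datetime, now: datetime,
                 resolved_at: Optional[datetime] = None) -> str:
    """Classify a live change against the 24-hour backport-or-revert deadline.

    resolved_at is when the backport merged (and Argo CD synced) or the revert
    was applied; None means neither has happened yet.
    """
    deadline = changed_at + BACKPORT_WINDOW
    if resolved_at is not None:
        return "resolved" if resolved_at <= deadline else "resolved-late"
    return "open" if now < deadline else "overdue"


changed = datetime(2026, 5, 11, 3, 0, tzinfo=timezone.utc)
print(clock_status(changed, changed + timedelta(hours=6)))
print(clock_status(changed, changed + timedelta(hours=25)))
```

Both "overdue" and "resolved-late" correspond to drifting past the deadline and would need to be noted in the audit record.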

What every operator should keep on the desk

  • The kubeconfigs for both clusters (K_HUB, K_SPOKE) in shell env vars.
  • The GitLab PAT (operator PAT, local-only, read from the file named by $LOCAL_GITLAB_PAT_FILE) loaded into the curl wrapper.
  • A scratch directory for break-glass captures: /tmp/break-glass-$(date -u +%Y%m%dT%H%M%SZ).
  • The day-1 handoff and break-glass procedure runbook (the runbook lives in the workspace; this site has the operator-facing summary).
  • An open browser tab to the hub OpenShift console -> Observe -> Alerts.

Postmortem cadence

After every incident with the incident label:

  1. Write a dated session report under opp-full-plat/reports/sessions/.
  2. Close the incident issue with: timeline, root cause, fix, validation evidence, follow-up issues opened.
  3. If the lesson is durable: update the relevant runbook in opp-full-plat/runbooks/ and the relevant published page in this section.
  4. If the lesson is architectural: open an ADR amendment review issue (the pattern that produced ADR 0026 from ADR 0005 — see IPv6 incident).

The postmortem is not a separate scheduled meeting in the lab’s solo posture; it is the close-out of the incident issue. The deliverables are the same.

References

  • opp-full-plat/adr/0025-gitops-only-operations-break-glass.md
  • opp-full-plat/runbooks/break-glass-procedure.md
  • opp-full-plat/connection-details/platform-admin-handoff.md §“Break-Glass Rules”
  • opp-full-plat/plans/federated-gitops-readiness-gates.md (FG-6)

Last reviewed: 2026-05-11