Operations — overview
How the platform actually runs day to day: the handoff path, the MR pipeline, the incident loop, and the on-call rotation that ties them together.
This section is the operator’s handbook for the active fleet — what the day looks like, what changes when something breaks, how a change gets from a thought to a running cluster, and the seven incidents the lab has actually paid for. It is the page you read on day one and the page you re-skim on every on-call shift.
If you are a brand-new platform admin, read this page, then day-1 admin handoff, then MR mechanics. Everything else is reference material organised by what just happened.
The operating loop
The loop has six steps, and each step is a rule:
- Every non-trivial change starts with a GitHub issue in zeshaq/opp-full-plat. The issue carries the milestone, the phase, the governing ADR set, and the validation/rollback plan. This is the durable record per the workspace’s GitHub-first tracking convention; chat is ephemeral, the issue is authoritative.
- Changes land as MRs against the internal GitLab platform-gitops repo. The repo is GitLab (not GitHub), so the gh CLI does not work for opening MRs; you POST to the GitLab API with a PAT. Branch names embed the issue key, commits carry a tracking header, and the MR description follows the documented template. See MR mechanics.
- Argo CD pulls. Hub and spoke each run their own OpenShift GitOps instance; the spoke is the routine reconciler, the hub coordinates placement. The operator never runs oc apply as a routine; oc apply is break-glass.
- The cluster emits signal. Prometheus on each cluster ships a curated subset of metrics to the hub’s federated observability stack; SigNoz on its own VM is the trace/log backend the platform leans on for app-side investigations.
- Alerts open incident issues. The incident issue is the artefact of an on-call shift; it captures the symptom, the diagnosis, the action (and the break-glass record if a live change happened), and the postmortem.
- Postmortems land as runbooks and operator notes. A runbook is the literal recipe the next operator follows under pressure; the operator note is the one-paragraph “what this taught us” so the lesson survives session boundaries.
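The MR step in the loop above can be sketched as follows. This is a minimal illustration, not the documented procedure: it builds (without sending) a request to GitLab's merge request endpoint, authenticated with a PAT in the PRIVATE-TOKEN header. The hostname, the branch name format, and the `main` target branch are assumptions; the project path is the one listed under the GitOps source of truth. The authoritative flow is in MR mechanics.

```python
import json
import urllib.parse
import urllib.request

# Assumption: placeholder host; the real internal GitLab hostname is not stated on this page.
GITLAB_HOST = "gitlab.example.internal"
# Project path as listed for the GitOps source of truth.
PROJECT_PATH = "comptech-platform/openshift-ops/openshift-platform-gitops"


def build_mr_request(token, source_branch, title, description, target_branch="main"):
    """Build (but do not send) a POST to GitLab's merge request endpoint.

    target_branch="main" is an assumption; use whatever the repo actually targets.
    """
    # GitLab accepts the URL-encoded project path in place of a numeric project ID.
    project_id = urllib.parse.quote(PROJECT_PATH, safe="")
    url = f"https://{GITLAB_HOST}/api/v4/projects/{project_id}/merge_requests"
    payload = {
        "source_branch": source_branch,
        "target_branch": target_branch,
        "title": title,
        "description": description,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical usage; the branch name only illustrates "embeds the issue key".
# req = build_mr_request(pat, "issue-229-ops-overview", "Ops overview", "Refs #229")
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["web_url"])
```

Separating request construction from sending keeps the PAT handling and payload easy to inspect before anything reaches the GitLab API.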
The four operator surfaces
| Surface | What lives there | Authoritative reference |
|---|---|---|
| Active clusters | hub-dc-v6 (3-master compact, all-in-one, management) and spoke-dc-v6 (3 control-plane + 3 physical workers, ODF) | connection-details/openshift-hub-dc-v6.md, connection-details/openshift-spoke-dc-v6.md |
| GitOps source of truth | Internal GitLab repo comptech-platform/openshift-ops/openshift-platform-gitops plus the operator working copy under clones/platform-gitops/ | connection-details/platform-admin-handoff.md, connection-details/gitlab-operator-guide.md |
| Image supply | Nexus three-endpoint split: mirror-registry.* (oc-mirror), docker-group.* (dev pull), app-registry.* (CI push) | connection-details/nexus.md |
| Secrets & evidence | Vault VM (vault.sub.comptech-lab.com:8200) for app/platform credentials, MinIO for backups and CI evidence, opp-full-plat/secrets/ for the local mirror of platform creds not yet in Vault | connection-details/vault-app-secrets.md, connection-details/minio.md |
Each surface has its own access plane, its own rotation schedule, and its own break-glass procedure. Day-1 handoff covers the access details; routine tasks cover the rotations; incidents cover the breakage.
Day-to-day, by clock time
- At session start: read CURRENT_STATE.md, SESSION_LOG.md, TODO.md in opp-full-plat. They are the snapshot from the last operator (often yourself, the day before). Then run the 10-second cluster health snapshot from the day-1 handoff.
- During focused work: open or refresh the GitHub issue for the work, edit GitOps under clones/platform-gitops/, run kubectl kustomize against both cluster overlays, commit on a branch matching the issue key, open the MR via the GitLab API, watch Argo reconcile, validate, comment on the issue with the validation evidence.
- On alerts: open an incident issue, run the symptom-matching part of the relevant runbook from the incidents index, capture the before-state, make the smallest correct change, capture the after-state, write the audit record.
- At session end: add a dated session report under reports/sessions/, update CURRENT_STATE.md / SESSION_LOG.md / TODO.md, commit the workspace, do not push live state without a backport.
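The on-alert step above (before-state, smallest correct change, after-state, audit record) can be captured in a small structure so nothing is skipped under pressure. This is a sketch only: the class name, field names, and output layout are hypothetical and do not reflect a documented template.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentRecord:
    """Hypothetical audit record for one incident issue."""
    symptom: str
    diagnosis: str
    before_state: str
    action: str
    after_state: str
    # Fill in only if a live (break-glass) change happened.
    break_glass: Optional[str] = None

    def to_issue_comment(self) -> str:
        """Render the record as plain text suitable for an issue comment."""
        lines = [
            f"Symptom: {self.symptom}",
            f"Diagnosis: {self.diagnosis}",
            f"Before-state: {self.before_state}",
            f"Action: {self.action}",
            f"After-state: {self.after_state}",
        ]
        if self.break_glass:
            lines.append(f"Break-glass record: {self.break_glass}")
        return "\n".join(lines)
```

Making every field except the break-glass record mandatory means an incomplete audit record fails at construction time rather than at postmortem time.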
What this section is and is not
This section is the operator handbook. It assumes:
- you have read section 1 — foundations (workspace conventions, ADR catalogue, ownership boundaries);
- you understand the hub-spoke pull model at the architectural level;
- you have shell access from a workspace host that can reach the lab 30.30.0.0/16 network.
It is not the developer handbook — apps targeting the Docker runtime VM (paused for OpenShift app delivery per the user’s 2026-05-09 decision) follow a separate path covered in section 4. It is also not the compliance auditor handbook — that lives at opp-full-plat/connection-details/compliance-implementor-handbook.md and is consumed by the auditor role.
What you will find in each subsection
- Day-1 admin handoff — the first 100 hours: access plane, what to read, must-watch dashboards.
- platform-gitops MR mechanics — the working-copy path, the GitLab API flow, sync-wave conventions, the operator-install pattern.
- Incidents and runbooks — seven incidents with symptom → root cause → fix → prevention.
- Routine tasks — six recurring workflows, including secret rotation, operator bumps, fleet onboarding, policy rollout, and evidence backfill.
- On-call and escalation — paging path, escalation matrix, who-owns-what.
- Known gotchas — index of every incident runbook plus the gotchas not severe enough to warrant their own page.
A note on scope
The active fleet is the v6 pair — hub-dc-v6 and spoke-dc-v6. The pre-v6 hub-dr / spoke-dr clusters are decommissioned. Any DR-related conversation is about hub-dr-v6 / spoke-dr-v6, which do not exist today. If a runbook mentions a cluster that is not in the active set, treat it as historical context, not as a live target.
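The scope rule above can be expressed as a small guard that a script checks before acting on a cluster name taken from a runbook. This is an illustrative sketch; the function names and status labels are hypothetical, and the cluster sets simply restate the paragraph above.

```python
# Cluster sets restated from the scope note; update them if the fleet changes.
ACTIVE = frozenset({"hub-dc-v6", "spoke-dc-v6"})
DECOMMISSIONED = frozenset({"hub-dr", "spoke-dr"})
NOT_BUILT = frozenset({"hub-dr-v6", "spoke-dr-v6"})


def cluster_status(name: str) -> str:
    """Classify a cluster name mentioned in a runbook or issue."""
    if name in ACTIVE:
        return "active"
    if name in DECOMMISSIONED:
        return "decommissioned"
    if name in NOT_BUILT:
        return "not-built"
    return "unknown"


def is_live_target(name: str) -> bool:
    """Only clusters in the active set are valid targets for live commands."""
    return name in ACTIVE
```

Anything that is not classified as active should be read as historical context rather than acted on.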
References
- opp-full-plat/connection-details/platform-admin-handoff.md
- opp-full-plat/runbooks/break-glass-procedure.md
- opp-full-plat/adr/0015, 0016, 0018, 0019, 0025
- Issues: #127 (handoff), #142 (MCO runbook), #143 (MR conventions doc), #229 (this section tracker)