Operations — overview

How the platform actually runs day to day: the handoff path, the MR pipeline, the incident loop, and the on-call rotation that ties them together.

This section is the operator’s handbook for the active fleet — what the day looks like, what changes when something breaks, how a change gets from a thought to a running cluster, and the seven incidents the lab has actually paid for. It is the page you read on day one and the page you re-skim on every on-call shift.

If you are a brand-new platform admin, read this page, then day-1 admin handoff, then MR mechanics. Everything else is reference material organised by what just happened.

The operating loop

[Diagram: the operating loop. Nodes, in order: Platform operator (human); GitHub issues (durable record); platform-gitops MR (GitLab); Argo CD (hub + spoke); hub-dc-v6 + spoke-dc-v6; Prometheus + SigNoz alerts; Incident issue (break-glass record); Postmortem + operator note; Runbook (under runbooks/); ADR amendment (if architectural).]

The arrows are the rules:

Every non-trivial change starts with a GitHub issue in zeshaq/opp-full-plat. The issue carries the milestone, the phase, the governing ADR set, and the validation/rollback plan. This is the durable record per the workspace’s GitHub-first tracking convention; chat is ephemeral, the issue is authoritative.
Changes land as MRs against the internal GitLab platform-gitops repo. The repo is hosted on GitLab (not GitHub), so the gh CLI cannot open MRs; you POST to the GitLab API with a personal access token (PAT). Branch names embed the issue key, commits carry a tracking header, and the MR description follows the documented template. See MR mechanics.
Argo CD pulls. Hub and spoke each run their own OpenShift GitOps instance; the spoke is the routine reconciler, the hub coordinates placement. The operator never runs oc apply as routine; oc apply is break-glass only.
The cluster emits signal. Prometheus on each cluster ships a curated subset of metrics to the hub’s federated observability stack; SigNoz on its own VM is the trace/log backend the platform leans on for app-side investigations.
Alerts open incident issues. The incident issue is the artefact of an on-call shift; it captures the symptom, the diagnosis, the action (and the break-glass record if a live change happened), and the postmortem.
Postmortems land as runbooks and operator notes. A runbook is the literal recipe the next operator follows under pressure; the operator note is the one-paragraph “what this taught us” so the lesson survives session boundaries.
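
The MR rule above (POST to the GitLab API with a PAT) can be sketched roughly as follows. This is a minimal illustration, not the documented MR mechanics: the host `gitlab.example.internal`, the default target branch `main`, and the function name `build_mr_request` are assumptions made for the example. The endpoint path and the `PRIVATE-TOKEN` header are those of the GitLab REST API v4 merge requests endpoint.

```python
import json
import urllib.parse
import urllib.request

# Assumptions (not from the handbook): the host and the target branch.
GITLAB_URL = "https://gitlab.example.internal"  # hypothetical host
PROJECT = "comptech-platform/openshift-ops/openshift-platform-gitops"


def build_mr_request(token, source_branch, title, description, target_branch="main"):
    """Build (but do not send) a POST to GitLab's create-merge-request endpoint."""
    # GitLab accepts the URL-encoded project path in place of a numeric ID.
    project_id = urllib.parse.quote(PROJECT, safe="")
    url = f"{GITLAB_URL}/api/v4/projects/{project_id}/merge_requests"
    payload = {
        "source_branch": source_branch,
        "target_branch": target_branch,
        "title": title,
        "description": description,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` sends it; the JSON response for a created MR includes a `web_url` field that can be pasted into the tracking issue. Building the request separately from sending it keeps the token handling and payload easy to inspect before anything reaches the server.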

The four operator surfaces

| Surface | What lives there | Authoritative reference |
| --- | --- | --- |
| Active clusters | `hub-dc-v6` (3-master compact, all-in-one, management) and `spoke-dc-v6` (3 control-plane + 3 physical workers, ODF) | `connection-details/openshift-hub-dc-v6.md`, `connection-details/openshift-spoke-dc-v6.md` |
| GitOps source of truth | Internal GitLab repo `comptech-platform/openshift-ops/openshift-platform-gitops` plus the operator working copy under `clones/platform-gitops/` | `connection-details/platform-admin-handoff.md`, `connection-details/gitlab-operator-guide.md` |
| Image supply | Nexus three-endpoint split: `mirror-registry.` (oc-mirror), `docker-group.` (dev pull), `app-registry.*` (CI push) | `connection-details/nexus.md` |
| Secrets & evidence | Vault VM (`vault.sub.comptech-lab.com:8200`) for app/platform credentials, MinIO for backups and CI evidence, `opp-full-plat/secrets/` for the local mirror of platform creds not yet in Vault | `connection-details/vault-app-secrets.md`, `connection-details/minio.md` |

Each surface has its own access plane, its own rotation schedule, and its own break-glass procedure. Day-1 handoff covers the access details; routine tasks cover the rotations; incidents cover the breakage.

Day-to-day, by clock time

At session start: read CURRENT_STATE.md, SESSION_LOG.md, TODO.md in opp-full-plat. They are the snapshot from the last operator (often yourself, the day before). Then run the 10-second cluster health snapshot from the day-1 handoff.
During focused work: open or refresh the GitHub issue for the work, edit GitOps under clones/platform-gitops/, render both cluster overlays with kubectl kustomize, commit on a branch matching the issue key, open the MR via the GitLab API, watch Argo reconcile, validate, and comment on the issue with the validation evidence.
On alerts: open an incident issue, run the symptom-matching part of the relevant runbook from the incidents index, capture the before-state, make the smallest correct change, capture the after-state, write the audit record.
At session end: add a dated session report under reports/sessions/, update CURRENT_STATE.md / SESSION_LOG.md / TODO.md, and commit the workspace. Do not push live state without a backport.
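
The "render both cluster overlays" step of focused work could be wrapped in a small pre-commit check along these lines. This is a sketch under stated assumptions: the overlay directories `clusters/hub-dc-v6` and `clusters/spoke-dc-v6` and the function names are hypothetical and should be replaced with the repo's real layout. Only `kubectl kustomize <dir>` itself is a standard command.

```python
import subprocess
from pathlib import Path

# Assumption: these overlay paths are hypothetical placeholders.
OVERLAYS = ("clusters/hub-dc-v6", "clusters/spoke-dc-v6")


def kustomize_command(repo_root, overlay):
    """Return the argv that renders one overlay with kubectl's built-in kustomize."""
    return ["kubectl", "kustomize", str(Path(repo_root) / overlay)]


def render_all(repo_root="clones/platform-gitops"):
    """Render every overlay; raises CalledProcessError on the first failure."""
    rendered = {}
    for overlay in OVERLAYS:
        result = subprocess.run(
            kustomize_command(repo_root, overlay),
            check=True,           # a broken overlay should stop the commit
            capture_output=True,
            text=True,
        )
        rendered[overlay] = result.stdout
    return rendered
```

Rendering both overlays before committing catches a change that builds for one cluster but breaks the other, before Argo CD ever sees it.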

What this section is and is not

This section is the operator handbook. It assumes:

you have read section 1 — foundations (workspace conventions, ADR catalogue, ownership boundaries);
you understand the hub-spoke pull model at the architectural level;
you have shell access from a workspace host that can reach the lab 30.30.0.0/16 network.
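
As a small aid for the last assumption, the sketch below checks whether an address belongs to the lab range. It tests membership in 30.30.0.0/16 only and says nothing about actual reachability from your host; the function name `in_lab_network` is an illustrative choice.

```python
import ipaddress

# The lab range named in this section.
LAB_NETWORK = ipaddress.ip_network("30.30.0.0/16")


def in_lab_network(address):
    """Return True if the address falls inside the lab range."""
    return ipaddress.ip_address(address) in LAB_NETWORK
```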

It is not the developer handbook — apps targeting the Docker runtime VM (paused for OpenShift app delivery per the user’s 2026-05-09 decision) follow a separate path covered in section 4. It is also not the compliance auditor handbook — that lives at opp-full-plat/connection-details/compliance-implementor-handbook.md and is consumed by the auditor role.

What you will find in each subsection

Day-1 admin handoff — the first 100 hours: access plane, what to read, must-watch dashboards.
platform-gitops MR mechanics — the working-copy path, the GitLab API flow, sync-wave conventions, the operator-install pattern.
Incidents and runbooks — seven incidents with symptom → root cause → fix → prevention.
Routine tasks – recurring workflows, including secret rotation, operator bumps, fleet onboarding, policy rollout, and evidence backfill.
On-call and escalation — paging path, escalation matrix, who-owns-what.
Known gotchas — index of every incident runbook plus the gotchas not severe enough to warrant their own page.

A note on scope

The active fleet is the v6 pair — hub-dc-v6 and spoke-dc-v6. The pre-v6 hub-dr / spoke-dr clusters are decommissioned. Any DR-related conversation is about hub-dr-v6 / spoke-dr-v6, which do not exist today. If a runbook mentions a cluster that is not in the active set, treat it as historical context, not as a live target.
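
The scope rule can be expressed as a guard that tooling might apply before acting on a cluster name found in a runbook. This is a hypothetical helper, not an existing script in the workspace; only the two active cluster names come from this section.

```python
# The active set as stated in this section.
ACTIVE_CLUSTERS = frozenset({"hub-dc-v6", "spoke-dc-v6"})


def classify_target(cluster):
    """Return 'active' for the v6 pair and 'historical' for anything else."""
    return "active" if cluster in ACTIVE_CLUSTERS else "historical"
```

Under this rule, names such as `hub-dr` or `spoke-dr-v6` classify as historical, so a script would refuse to treat them as live targets.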

References

opp-full-plat/connection-details/platform-admin-handoff.md
opp-full-plat/runbooks/break-glass-procedure.md
opp-full-plat/adr/0015, 0016, 0018, 0019, 0025
Issues: #127 (handoff), #142 (MCO runbook), #143 (MR conventions doc), #229 (this section tracker)