Day-1 admin handoff — the first 100 hours
Concrete checklist for a new platform operator: the access plane, what to read in what order, the MR conventions, and the dashboards/alerts that must be on-screen during a shift.
This page is the first-100-hours runbook for a platform admin who has just been handed the keys to the active OpenShift v6 fleet. It is a checklist, not a narrative. Tick the items in order; stop and ask before improvising. If you are returning to the role after a session break, you can skip the “before you start” block and resume at the “every-session warm-up”.
Sanitization note: all kubeadmin passwords, root tokens, robot pull-secrets, Vault root tokens, MAC addresses, hardware serials, and internal IPv4 addresses are kept in the operator’s local opp-full-plat/secrets/ mirror and in the corresponding entries in Vault. They are deliberately not published on this page. Where this page says “the X host” or “the Vault VM”, the address is in connection-details/<service>.md inside the workspace.
Before you start
- You have a writable workspace. The operator workspace boundary is /home/ze/ops-workspace, /home/ze/secrets, /home/ze/opp-full-plat. Other /home/ze/* paths are off-limits unless explicitly named. If your operator account does not have read access to opp-full-plat/secrets/ and write access to opp-full-plat/, stop and re-do onboarding.
- You have a GitHub identity in zeshaq/opp-full-plat. All issues, milestones, ADR review threads, and the operations tracking record live there.
- You have a GitLab identity with developer permission on comptech-platform/openshift-ops/openshift-platform-gitops and a Personal Access Token stored at opp-full-plat/the operator GitLab PAT (local-only). The PAT is used by the API-driven MR flow; it does not need to be in your shell environment.
- You have oc, kubectl, git, curl, jq, skopeo, and ssh on the workstation you operate from. The workspace assumes these as the minimal toolset; no oc-mirror is needed on the operator workstation (it runs on the mirror VM).
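The toolset requirement above can be checked mechanically before the first shift. The following is a minimal sketch, not part of the workspace's own scripts; the function name `missing_tools` and the `REQUIRED_TOOLS` variable are hypothetical.

```shell
# Print the name of every required tool that is not on PATH.
# Prints nothing when the toolset is complete.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Toolset named in the checklist above.
REQUIRED_TOOLS="oc kubectl git curl jq skopeo ssh"

# Example (prints one missing tool per line, or nothing):
#   missing_tools $REQUIRED_TOOLS
```

An empty result means the workstation matches the assumed minimal toolset; any name printed is something to install before continuing.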
Hour 1: read these, in this order
These are the source-of-truth documents. Skim every one front to back at least once; come back and read in detail when the relevant work surfaces.
| Order | Path | What you take away |
|---|---|---|
| 1 | connection-details/platform-admin-handoff.md | Cluster endpoints, kubeconfig custody, operator baseline, GitOps flow, image-supply rules, break-glass rules, “stop criteria” list |
| 2 | connection-details/gitlab-operator-guide.md | LAN-only automation, runner taxonomy (11 ct-* groups, 5 planned runner classes), GitLab roles |
| 3 | connection-details/compliance-implementor-handbook.md | PCI-DSS phase chain (PCI-0 -> PCI-5), auditor vs implementor roles, evidence-pack format |
| 4 | connection-details/openshift-hub-dc-v6.md, openshift-spoke-dc-v6.md, openshift-gitops-hub-dc-v6.md | Per-cluster API/console URLs, machine inventory, ODF status, GitOps Applications |
| 5 | connection-details/nexus.md, vault-app-secrets.md, minio.md | Three-endpoint Nexus split, Vault kv-v2 paths, MinIO buckets and lifecycle |
| 6 | connection-details/signoz.md, jenkins.md, docker-runtime-vm.md | Supporting VM topology |
| 7 | adr/0015, 0016, 0018, 0019, 0025 | The five ADRs that govern day-to-day operator behaviour |
| 8 | runbooks/break-glass-procedure.md | The procedure you follow when GitOps cannot recover within fifteen minutes |
Allow about 90 minutes for the first pass.
Hour 2: the access plane
Open shells against both clusters with the local kubeconfigs:
K_HUB=/home/ze/.kube/configs/hub-dc-v6.kubeconfig
K_SPOKE=/home/ze/.kube/configs/spoke-dc-v6.kubeconfig
oc --kubeconfig "$K_HUB" whoami
oc --kubeconfig "$K_SPOKE" whoami
Expected: each command returns the configured identity (kubeadmin for first-day bootstrap, a named cluster-admin identity after IdP is wired). If whoami fails, the kubeconfig is stale — re-fetch from the bootstrap host (ze@ocp-bootstrap) under /home/ze/ocp-clusters/<cluster>/auth/kubeconfig.
Confirm GitLab reachability via PAT:
PAT=$(tr -d '\n' < "$LOCAL_GITLAB_PAT_FILE")
curl -sSf -H "PRIVATE-TOKEN: $PAT" \
http://<gitlab-vm>/api/v4/projects/comptech-platform%2Fopenshift-ops%2Fopenshift-platform-gitops \
| jq '.name, .default_branch'
Expected: project name and main as default branch. Anything else is a permission or PAT problem — fix before continuing.
Confirm Vault reachability:
VAULT_ADDR=https://vault.sub.comptech-lab.com:8200
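curl -sS "$VAULT_ADDR/v1/sys/health" | jq '.initialized, .sealed, .standby'
Expected: true, false, false. The command omits curl’s -f flag on purpose: by default Vault answers a sealed node with HTTP 503 and a standby node with 429, and -f would discard the JSON body in exactly those cases. If sealed is true, escalate immediately; the platform’s secret plane is offline.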
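Vault also reports node state through the HTTP status of the health endpoint, which is useful when jq is not at hand. The sketch below maps Vault's documented default status codes to a state label; the function name `vault_health_state` is hypothetical, and the mapping does not hold if the lab overrides the codes with query parameters.

```shell
# Map the HTTP status of GET /v1/sys/health to a node state.
# Codes are Vault's documented defaults; query parameters such as
# sealedcode or standbycode can override them.
vault_health_state() {
  case "$1" in
    200) echo "active" ;;
    429) echo "standby" ;;
    472) echo "dr-secondary" ;;
    473) echo "performance-standby" ;;
    501) echo "uninitialized" ;;
    503) echo "sealed" ;;
    *)   echo "unknown" ;;
  esac
}

# Example:
#   code=$(curl -sS -o /dev/null -w '%{http_code}' "$VAULT_ADDR/v1/sys/health")
#   vault_health_state "$code"
```

A result of `sealed` or `unknown` is the same escalation trigger as `sealed: true` in the JSON body.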
Confirm Nexus three endpoints:
for ep in mirror-registry docker-group app-registry; do
curl -ksSI https://$ep.apps.sub.comptech-lab.com/v2/ | head -1
done
Expected: each returns 401 Unauthorized (reachable, requesting auth). Any 5xx response (503, for example) is a Nexus-side problem; Connection refused is a DNS / HAProxy problem.
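The three outcomes described above can be turned into a small classifier so the loop prints a verdict instead of a raw status line. This is a sketch under the assumptions stated on this page (401 is healthy, 5xx is Nexus, no response is network); the function name `classify_registry_status` is hypothetical.

```shell
# Classify the first line of a `curl -I` response from a registry /v2/ endpoint.
# An empty argument means curl produced no HTTP response at all.
classify_registry_status() {
  case "$1" in
    HTTP/*" 401"*)         echo "ok: reachable, auth required" ;;
    HTTP/*" 5"[0-9][0-9]*) echo "nexus: server-side error" ;;
    "")                    echo "network: no HTTP response (DNS / HAProxy)" ;;
    *)                     echo "unexpected: $1" ;;
  esac
}

# Example:
#   for ep in mirror-registry docker-group app-registry; do
#     line=$(curl -ksSI "https://$ep.apps.sub.comptech-lab.com/v2/" 2>/dev/null | head -1)
#     printf '%s: %s\n' "$ep" "$(classify_registry_status "$line")"
#   done
```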
Confirm SigNoz reachability:
curl -ksSf https://signoz.apps.sub.comptech-lab.com/api/v1/version | jq '.data.version'
Expected: v0.122.0 (or whatever the lab is running). Anything but 200 is either a VM-down event or the v0.122 API-shape gotcha — see SigNoz v0.122 auth runbook.
Hour 3: the cluster health snapshot
The “is the fleet OK?” snapshot is three commands per cluster:
oc --kubeconfig "$K_HUB" get nodes -o wide
oc --kubeconfig "$K_HUB" get clusterversion version
oc --kubeconfig "$K_HUB" get co \
| awk 'NR==1 || $3 != "True" || $4 != "False" || $5 != "False"'
oc --kubeconfig "$K_SPOKE" get nodes -o wide
oc --kubeconfig "$K_SPOKE" get clusterversion version
oc --kubeconfig "$K_SPOKE" get co \
| awk 'NR==1 || $3 != "True" || $4 != "False" || $5 != "False"'
Expected steady state, as of the latest fleet handoff:
- hub-dc-v6: three masters Ready, no worker machines, ClusterVersion 4.20.18, every ClusterOperator Available=True / Progressing=False / Degraded=False.
- spoke-dc-v6: three masters and three workers (gold-1, gold-2, gpu-01) Ready, ClusterVersion 4.20.18, COs clean.
Any deviation is the first item on your incident issue.
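Because the same three commands run against both clusters, it can help to wrap the filters as functions and loop over the kubeconfigs. The sketch below reuses the awk filter from the snapshot above and adds a Ready-node counter; the function names are hypothetical, and the node counter treats any status other than exactly `Ready` (for example `Ready,SchedulingDisabled`) as not ready.

```shell
# Keep the header plus every ClusterOperator that is not
# Available=True / Progressing=False / Degraded=False.
# Reads `oc get co` output on stdin.
co_unhealthy() {
  awk 'NR==1 || $3 != "True" || $4 != "False" || $5 != "False"'
}

# Count nodes whose STATUS column is exactly "Ready".
# Reads `oc get nodes` output on stdin.
ready_node_count() {
  awk 'NR>1 && $2 == "Ready" {n++} END {print n+0}'
}

# Example:
#   for kc in "$K_HUB" "$K_SPOKE"; do
#     oc --kubeconfig "$kc" get nodes | ready_node_count
#     oc --kubeconfig "$kc" get co | co_unhealthy
#   done
```

Against the steady state listed above, the node count would be 3 on the hub and 6 on the spoke, and `co_unhealthy` would print only the header.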
Add a one-shot Argo CD sanity check on each cluster:
oc --kubeconfig "$K_HUB" -n openshift-gitops get applications.argoproj.io -o wide
oc --kubeconfig "$K_SPOKE" -n openshift-gitops get applications.argoproj.io -o wide
Expected: every Application is Synced / Healthy. If something is OutOfSync or Degraded, the drift triage flow starts here.
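With many Applications, scanning the wide output by eye gets error-prone. A minimal sketch that prints only the Applications that are not Synced / Healthy follows; it assumes the input has been reduced to three columns with `-o custom-columns`, and the function name `argo_not_green` is hypothetical.

```shell
# Print Applications that are not Synced / Healthy.
# Expects three whitespace-separated columns on stdin: NAME SYNC HEALTH.
argo_not_green() {
  awk '$2 != "Synced" || $3 != "Healthy" {print $1 ": " $2 " / " $3}'
}

# Example:
#   oc --kubeconfig "$K_HUB" -n openshift-gitops get applications.argoproj.io --no-headers \
#     -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status \
#     | argo_not_green
```

No output means every Application is green; each printed line is a starting point for drift triage.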
Hour 4: image-supply sanity check
The platform is disconnected. Every image must come from an approved local prefix. Run the drift script:
cd /home/ze/opp-full-plat
scripts/openshift-image-supply-check.sh hub-dc-v6 "$K_HUB"
scripts/openshift-image-supply-check.sh spoke-dc-v6 "$K_SPOKE"
Expected: zero uncovered external runtime image references, default OperatorHub sources disabled, mirrored CatalogSources and ClusterCatalogs present. The latest known-clean reports live under reports/.
If a pod is pulling from registry.redhat.io or quay.io directly and no IDMS/ITMS covers the source, treat it as image-supply drift and open an issue. Do not re-enable upstream OperatorHub sources to “fix” it.
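For a quick manual look between runs of the drift script, the pod specs can be listed and filtered for upstream registry hosts. This is a rough sketch, not the logic of scripts/openshift-image-supply-check.sh: it inspects only spec.containers, and a hit is only a candidate, since an IDMS/ITMS rule may already redirect that pull to the mirror. The function name `upstream_image_refs` is hypothetical.

```shell
# Print unique image references whose registry is an upstream host.
# Reads one image reference per line on stdin.
upstream_image_refs() {
  grep -E '^(registry\.redhat\.io|quay\.io)/' | sort -u
}

# Example:
#   oc --kubeconfig "$K_SPOKE" get pods -A \
#     -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}{end}{end}' \
#     | upstream_image_refs
```

Each reference it prints still has to be checked against the cluster's IDMS/ITMS sources before it counts as image-supply drift.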
Day 1: write your first MR
Pick something tiny — a CHANGELOG fix, a typo in a kustomization.yaml, a comment in a Subscription. The goal is to walk the MR flow end to end while the failure modes are inexpensive:
- Open or comment on a GitHub issue describing the change (even one-line fixes get an issue under the GitHub-first tracking convention).
- Branch in /home/ze/ops-workspace/clones/platform-gitops/:
  cd /home/ze/ops-workspace/clones/platform-gitops
  git pull --ff-only origin main
  git checkout -b doc-1/first-mr
- Make the change. kubectl kustomize both cluster overlays:
  kubectl kustomize clusters/hub-dc-v6 > /dev/null
  kubectl kustomize clusters/spoke-dc-v6 > /dev/null
- Commit with the tracking-header block (see MR mechanics for the exact shape).
- Push and open the MR via the GitLab API. The gh CLI does not work for GitLab; use the documented curl + PAT path.
- Watch Argo reconcile after merge:
  oc --kubeconfig "$K_SPOKE" -n openshift-gitops get app spoke-dc-v6-cluster-config \
    -o jsonpath='{.status.sync.status}{" "}{.status.health.status}{" "}{.status.sync.revision}{"\n"}'
- Update the issue with the validation evidence and close it.
Day-1 success criterion: you have walked the loop once without surprises. If the MR API call was unfamiliar, repeat the loop with a second trivial change before tackling anything substantive.
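If the MR API call is the unfamiliar part, the generic shape of a GitLab merge-request creation looks roughly like the sketch below. It uses the standard GitLab REST endpoint (POST /projects/:id/merge_requests) and is not the workspace's documented flow; the MR mechanics page remains authoritative for the tracking header and any extra fields. The function names and the assumption that `$PAT` is already set are illustrative.

```shell
# URL-encode a GitLab project path for use as the :id in API URLs
# (only "/" is handled, which is enough for paths like the one on this page).
gitlab_project_id() {
  printf '%s' "$1" | sed 's|/|%2F|g'
}

# Open a merge request through the GitLab REST API.
# Arguments: base URL, project path, source branch, title.
# Assumes $PAT holds the token and the target branch is main.
open_mr() {
  curl -sSf -X POST \
    -H "PRIVATE-TOKEN: $PAT" \
    --data-urlencode "source_branch=$3" \
    --data-urlencode "target_branch=main" \
    --data-urlencode "title=$4" \
    "$1/api/v4/projects/$(gitlab_project_id "$2")/merge_requests"
}

# Example (base URL is a placeholder):
#   open_mr "http://<gitlab-vm>" \
#     comptech-platform/openshift-ops/openshift-platform-gitops \
#     doc-1/first-mr "docs: first MR"
```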
Day 2-3: shadow a real change
If a routine task is on the TODO — operator bump, policy rollout, evidence backfill — pair on it with the outgoing operator or with the workspace’s session-log history. Read the most recent five session reports under opp-full-plat/reports/sessions/ to see the conventions in practice.
The session-report convention is one report per session under reports/sessions/YYYYMMDD-HHMMSS-<slug>.md capturing what was done, what changed, what was validated, and any residual risk. Adopt the convention before you accumulate the first three sessions of debt.
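Generating the filename rather than typing it keeps the timestamp format consistent. The sketch below builds a path in the YYYYMMDD-HHMMSS-<slug>.md shape; the slug rule (lowercase, runs of non-alphanumerics collapsed to a single hyphen) is an assumption, not a stated workspace convention, and the function name `session_report_path` is hypothetical.

```shell
# Build reports/sessions/YYYYMMDD-HHMMSS-<slug>.md from a free-text title.
# An optional second argument supplies the timestamp (defaults to now).
session_report_path() {
  slug=$(printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cs 'a-z0-9' '-' \
    | sed 's/^-//; s/-$//')
  stamp=${2:-$(date +%Y%m%d-%H%M%S)}
  printf 'reports/sessions/%s-%s.md\n' "$stamp" "$slug"
}

# Example:
#   "$EDITOR" "$(session_report_path 'Operator bump: ODF')"
```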
Day 4-5: own one incident category
Pick one incident class from the seven runbooks under incidents and walk every step on a non-production reproducer if one is available, or paper-walk it otherwise. The aim is to own the recovery commands before you are paged at 3 AM.
The recommended starting category for a new operator is the Routes CRD incident because it is recurring, the diagnostic is one oc get --raw /openapi/v2, and the fix is one oc delete crd.
Must-watch dashboards and alerts
Keep these on a second monitor during a shift:
| Surface | What to watch | Why |
|---|---|---|
| Hub OpenShift console -> Observe -> Alerts | Watchdog + any Critical | The Watchdog confirms the alerting pipeline itself is alive |
| Spoke OpenShift console -> Observe -> Alerts | Watchdog + any Critical | Same |
| Argo CD UI (hub and spoke, port-forward or Route) | Sync + Health of every Application | The single fastest failure indicator |
| SigNoz UI -> Services | Throughput + error rate per service | Catches workload-side regressions ahead of pod-level alerts |
| oc get co (both clusters) | Anything not Available=True / Progressing=False / Degraded=False | The cluster-operator view aggregates CRDs/operators/webhooks |
| oc get mcp (spoke) | Updating=False outside of a known rollout window | MCP-stuck-on-revert is the #135 incident class |
If you find yourself watching oc get pods -A -w, you are watching the wrong layer. The pod view shows the symptom; the alerts and the Argo health view point to the cause.
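The MCP row in the table above can be reduced to a one-line filter for a terminal pane. This is a sketch that assumes the default `oc get mcp` column order (NAME, CONFIG, UPDATED, UPDATING, DEGRADED, ...) with a populated CONFIG column; the function name `mcp_not_settled` is hypothetical.

```shell
# Print MachineConfigPools that are updating or degraded.
# Reads `oc get mcp` output on stdin
# (columns: NAME CONFIG UPDATED UPDATING DEGRADED ...).
mcp_not_settled() {
  awk 'NR>1 && ($4 != "False" || $5 != "False") {
         print $1 ": updating=" $4 " degraded=" $5
       }'
}

# Example:
#   oc --kubeconfig "$K_SPOKE" get mcp | mcp_not_settled
```

Output during a known rollout window is expected; output at any other time is worth a look.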
Every-session warm-up (re-entry)
Once you have done the first-100-hours pass, every subsequent session opens with:
- Read opp-full-plat/CURRENT_STATE.md, SESSION_LOG.md, TODO.md (in that order).
- Read the two most recent dated reports under opp-full-plat/reports/sessions/.
- Run the cluster health snapshot (hour 3 above).
- Run the Argo sanity check.
- Run the image-supply drift check (weekly cadence is fine if there has been no install activity).
- Skim open incident issues (label incident on zeshaq/opp-full-plat).
- Pick the next item from TODO.md’s “Next Order Of Business” section.
The protocol is the opp-full-plat workspace’s stated convention (read CURRENT_STATE / SESSION_LOG / TODO at start, write a dated session record at end). It is the difference between operating a fleet and groping for context.
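Finding the most recent session reports is a one-liner because the filename prefix sorts chronologically. A minimal sketch follows, assuming every report uses the YYYYMMDD-HHMMSS- prefix; the function name `latest_reports` is hypothetical.

```shell
# List the N most recent session reports in a directory (default N=2).
# Relies on the YYYYMMDD-HHMMSS- filename prefix: lexical order is time order.
latest_reports() {
  ls -1 "$1" \
    | grep -E '^[0-9]{8}-[0-9]{6}-.*\.md$' \
    | sort \
    | tail -n "${2:-2}"
}

# Example:
#   latest_reports /home/ze/opp-full-plat/reports/sessions 2
```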
What MUST stay out of GitHub, GitLab, chat, and this page
Per the workspace’s “Do not paste …” rule in the handoff doc:
- kubeadmin passwords;
- pull secrets (registry.redhat.io, cloud.redhat.com, robot tokens);
- PATs (GitHub, GitLab, Nexus, Quay);
- Nexus admin password and per-bot passwords;
- Vault root token;
- full Secret manifests (the data: block);
- htpasswd hashes (RHACS Central admin, etc.);
- raw internal IPv4 addresses in public docs (this site is published to Cloudflare Pages — descriptive labels only).
If a credential lands in chat or an issue body by mistake, treat it as a rotation event: rotate the credential, update Vault and the local secrets/ mirror, and record the rotation in the active session report.
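A quick scan of text before it is pasted into an issue or MR can catch the most obvious slips. The sketch below is a heuristic only: it looks for commonly used token prefixes (glpat- for GitLab, ghp_ for GitHub, hvs. for Vault) and RFC 1918 IPv4 addresses, and it will miss passwords, pull secrets, and anything without a recognizable shape. The function name `scan_for_leaks` is hypothetical.

```shell
# Flag lines that look like they carry a token or a private IPv4 address.
# Heuristic patterns only; a clean result is not proof of a clean text.
# Prints matching lines with line numbers; exit status 0 if anything matched.
scan_for_leaks() {
  grep -nE 'glpat-[A-Za-z0-9_-]{20,}|ghp_[A-Za-z0-9]{30,}|hvs\.[A-Za-z0-9_-]{20,}|(^|[^0-9.])(10\.[0-9]{1,3}|192\.168|172\.(1[6-9]|2[0-9]|3[01]))\.[0-9]{1,3}\.[0-9]{1,3}([^0-9]|$)'
}

# Example:
#   scan_for_leaks < draft-issue-body.md && echo "review before posting"
```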
Stop criteria — when to escalate, not push through
The handoff doc’s “Stop Criteria” list is short and absolute. Stop and escalate before continuing if:
- default external OperatorHub sources must be re-enabled to proceed;
- a package is missing from the mirrored catalog;
- a pod is pulling from upstream without a mirror rule;
- MCPs are degraded or stuck updating;
- Argo CD is not Synced / Healthy after the expected reconciliation window;
- ODF or core storage is degraded;
- any workflow requires committing or printing a secret;
- two controllers or two Argo Applications would own the same cluster-scoped resource;
- a direct live change cannot be backported or justified.
“Escalate” in the lab means: post in the active GitHub issue, page the workspace owner, and wait. Do not improvise.
References
- opp-full-plat/connection-details/platform-admin-handoff.md (the canonical handoff doc)
- opp-full-plat/connection-details/openshift-hub-dc-v6.md, openshift-spoke-dc-v6.md, openshift-gitops-hub-dc-v6.md
- opp-full-plat/connection-details/nexus.md, vault-app-secrets.md, minio.md, signoz.md
- opp-full-plat/runbooks/break-glass-procedure.md, secrets-custody-drift-check.md
- opp-full-plat/adr/0015, 0016, 0018, 0019, 0025
- Issues: #127 (this handoff), #229 (this section)