Rebuild Timeline

A chronological account of the CompTech lab rebuild attempts — what was tried, when, what broke, and what the v6 fleet inherited from each pass.

This page is the rolled-up history of the CompTech lab rebuild. For each significant attempt it records what came out of it and what the next pass kept or threw away. Read it to orient before the detailed history pages: cloud-init failure, pre-v6 → v6 transition, site-replication readiness, and day-wrap archive.

The five eras at a glance

Era | Window | Headline | What survived
E0 — cloud-init scrap | pre-2026-05-08 | Earlier attempts that never reached a working OpenShift install; left credential blobs in /home/ze/cloud-init/ | Nothing operational. Folder is now read-only historical evidence (see cloud-init page).
E1 — pre-v6 fleet | 2026-05-04 → 2026-05-08 | hub-dc + spoke-dc + hub-dr + spoke-dr; ADR 0001 baseline; push-model GitOps from lab-gitops | Architectural lessons (cluster naming, Argo push pain), lab-gitops-full repo as desired-state reference
E2 — VM platform reset | 2026-05-08 (single intensive day) | Edge (PDNS + HAProxy) re-set; GitLab + MinIO + Nexus + oc-mirror + Vault + Jenkins + SigNoz + Trivy + DefectDojo + Monitoring + Redis + Kafka + WSO2 + docker-runtime VMs deployed | Every VM listed in Lab Infrastructure §2
E3 — v6 install | 2026-05-09 | hub-dc-v6 then spoke-dc-v6 installed on OCP 4.20.18 from the mirror; pull-model Argo bootstrap | The active fleet today
E4 — federation + compliance | 2026-05-10 → 2026-05-11 | Federated GitLab content model; pre-v6 purge (ADR 0022); 22-operator queue + ACM + ODF + ESO + RHACS + OADP; PCI-DSS v4 baseline ran end-to-end | Current operating state; see CURRENT_STATE.md

The rest of this page is the same story with dates, scope, outcomes, and lessons per major attempt.

Whiteboard

Three horizontal lanes — VM platform (amber), OpenShift fleet (gray, bold), GitOps / governance (dashed green) — read left to right by date. Solid arrows are sequential milestones within a lane; dashed amber arrows are transitions where the model changed (fresh start, de-scope, model swap); red purge is the destructive removal of pre-v6 names from active inventory. Cross-lane arrows show the load-bearing dependencies — the mirror VM feeding the OCP install, hub-dc-v6 hosting Argo, and the platform VMs hosting GitLab content.

E0 — cloud-init scrap (pre-2026-05-08)

Scope. Earlier rebuild attempts whose only surviving artifact is /home/ze/cloud-init/ — a folder of ~117 files including admin passwords, deploy keys, OIDC client secrets, and RKE2 join tokens.

Outcome. Never reached a working OpenShift fleet. The artifacts are not authoritative for current state. The workspace boundary rule (tracked in the REP-6 issue, which migrated it to the runbooks) forbids writing to /home/ze/cloud-init/; reading is allowed, but every value found there is treated as historical, not current truth.

Lesson kept. Secrets sprawl across cloud-init files is exactly what later led to:

  • ADR 0019 — Nexus-only image supply chain. Single tracked registry path, not registry credentials scattered through VM cloud-init.
  • Vault on a dedicated VM (ADR not numbered, see connection-details/vault-app-secrets.md). Centralized credential custody; per-tenant app-secret namespaces; ESO bridges to OpenShift.
  • Secrets custody runbook (runbooks/secrets-custody-drift-check.md, REP-6 output): the convention that every credential has exactly one authoritative location.
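
The one-authoritative-location convention lends itself to a mechanical check. The sketch below is illustrative only and does not reproduce runbooks/secrets-custody-drift-check.md; the function name, the inventory shape, and the credential and location strings are all hypothetical placeholders.

```python
from collections import defaultdict


def find_custody_drift(sightings):
    """Group (credential, location) sightings and return only the
    credentials seen in more than one location, with sorted locations."""
    locations = defaultdict(set)
    for credential, location in sightings:
        locations[credential].add(location)
    return {
        cred: sorted(locs)
        for cred, locs in locations.items()
        if len(locs) > 1
    }


# Hypothetical inventory: names and paths are placeholders, not lab values.
sightings = [
    ("gitlab-deploy-key", "vault:platform/gitlab"),
    ("gitlab-deploy-key", "file:cloud-init/gitlab.yaml"),
    ("nexus-admin", "vault:platform/nexus"),
]

drift = find_custody_drift(sightings)
print(drift)
```

A credential that shows up in two places is reported with both locations; an empty result means every credential in the inventory has a single home.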

See the dedicated cloud-init scrap page for the failure classes and the boundary rule’s exact text.

E1 — pre-v6 fleet (2026-05-04 → 2026-05-08)

Scope. Four-cluster OpenShift fleet defined in ADR 0001:

Cluster | Role | Status at end of E1
hub-dc | active management hub | running but on push-mode GitOps; storage and operator drift accumulating
spoke-dc | active workload cluster | running
hub-dr | management standby | running; DR drill exercises stalled with prune regressions
spoke-dr | workload standby | running; ACM governance crashloops

Outcome. This fleet got the lab to “a working hub + a working spoke + a working DR pair.” But fundamentals were brittle:

  • DR drills uncovered prune regressions (reports/sessions/20260505-093816-hub-dr-pull-gitops-prune-regression.md then 20260505-102325-hub-dr-pull-gitops-prune-recovery.md).
  • Storage on hub clusters was a permanent pain point: the 05-05 sessions ran a chain of keep hub LVMS, remove hub storage consumers → hub storage-light live apply → hub-dc LVMS repair.
  • spoke-dr ACM governance was crashlooping until manual cleanup (20260506-223558-spoke-dr-acm-governance-fix.md).
  • The push-model GitOps placement made the four-cluster scope a permanent registration overhead.

Decision: de-scope DR. On 2026-05-08 09:05 UTC the user instructed that hub-dr and spoke-dr be de-scoped from active operations. The session report at reports/sessions/20260508-090555-descope-dr-clusters.md records the safe (non-destructive) GitOps placement narrowing to env=dc, role=spoke; the DR clusters were left running but stopped receiving active work.
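
The effect of narrowing placement to env=dc, role=spoke can be sketched as a plain label match. This is a simplified model, not the lab's actual placement manifest: the per-cluster labels below are assumptions inferred from the cluster roles listed earlier, and select_clusters is a hypothetical helper.

```python
def select_clusters(clusters, required_labels):
    """Return names of clusters whose labels contain every required key/value."""
    return [
        name
        for name, labels in clusters.items()
        if all(labels.get(k) == v for k, v in required_labels.items())
    ]


# Assumed labels, inferred from the roles in the E1 cluster table;
# not copied from the lab's cluster inventory.
fleet = {
    "hub-dc": {"env": "dc", "role": "hub"},
    "spoke-dc": {"env": "dc", "role": "spoke"},
    "hub-dr": {"env": "dr", "role": "hub"},
    "spoke-dr": {"env": "dr", "role": "spoke"},
}

print(select_clusters(fleet, {"env": "dc", "role": "spoke"}))
```

Under these assumed labels only spoke-dc still matches, which is why the change is non-destructive: the DR clusters keep running and simply stop being selected.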

Lessons kept.

  • ADR 0001 was preserved as the historical anchor when the cluster list was later superseded by ADR 0022. The workspace concept survived even though the cluster names didn’t.
  • The push-model pain motivated ADR 0018 — ACM + OpenShift GitOps pull model for v6.
  • The DR storage pain motivated the management-only hub design (20260505-153113-management-only-hub-dr-design.md) — hub clusters keep LVMS but do not host ODF/storage consumers. This carried into hub-dc-v6.
  • lab-gitops-full repo (the full desired-state reference) survived and was used as the seed for the federated platform-gitops repo on internal GitLab.

E2 — VM platform reset (2026-05-08)

Scope. A single intensive day where the entire VM-platform tier was rebuilt or re-validated. Every VM that supports the v6 fleet was either deployed or hardened on this day.

Chronological log (UTC):

Time | Event
10:29 | GitLab and MinIO fresh-start reset
11:01 | Rebuild script import; scripts/rebuild/* moved into the workspace
11:39 | Network/ingress PKI decision recorded (ADR 0005)
11:42 | Gateway CIDR correction in the rebuild plan
11:59 | PowerDNS readiness check
12:06 | Roadmap PDNS + DNS gates added
12:14 | GitHub planning pack + tracker created
12:25 | Repeatable rebuild documentation model adopted
12:34 | Cluster names approved: hub-dc-v6, spoke-dc-v6 (base domain sub.comptech-lab.com)
12:46 | PDNS resolver + API cleanup
12:57 | Hub topology decision: compact 3-master, no separate workers
13:05 | NIC source rule + iLO MAC check
13:14 | Allocation accepted, DNS records applied
13:28 | Disconnected mirror baseline (ADR 0019 emerging)
13:44 | Standalone Nexus + oc-mirror VMs decided
14:09 | Mirror TLS, pull secret, dry-run
15:10 | Pinned fast mirror image set
15:44 | Expanded previous-platform mirror dry-run
15:55 | Full mirror download started
16:39 | hub-dc-v6 install inputs prepared
17:11 | Vault memory reset and VM plan
17:35 | Kafka KRaft tracker created
17:46 | Vault OSS VM deployed
18:07 | Redis Sentinel VM deployed
18:13 | Kafka KRaft VM deployed
18:17 | Redis hardening ADR + roadmap (ADR 0006)
18:24 | Kafka ADR + production readiness milestone (ADR 0007)
18:40 | Jenkins single-VM deployed (ADR 0009)
18:41 | SigNoz ADR + VM observability milestone (ADR 0010)
18:45 | Trivy ADR + VM scanner milestone (ADR 0011)
18:55 | WSO2 APIM/Identity VM deployed (ADR 0008)
18:58 | Monitoring observability VM ADR + milestone (ADR 0012)
19:18 | SigNoz VM deployed
19:24 | Trivy VM deployed
19:31 | SigNoz P1 hardening milestone
19:43 | Monitoring observability VM deployed
19:45 | Redis/Kafka WAL utility VM deployed
20:09 | DefectDojo VM deployed (ADR 0013)
20:18 | Trivy/DefectDojo import contracts milestone

Outcome. By the end of 2026-05-08 the lab had:

  • A fresh GitLab CE 18.11.1 instance with the user’s admin account restored.
  • A MinIO VM with buckets reserved for OADP backup, Loki logs, and Tempo traces.
  • A standalone Nexus VM acting as the docker-group dev mirror and the app-registry push target.
  • A standalone oc-mirror VM with the full OCP 4.20 platform mirror starting to download.
  • HashiCorp Vault (Raft, integrated storage, three nodes + transit seal).
  • Jenkins for the developer build path (single VM by ADR 0009).
  • SigNoz for VM-tier observability (OTLP traces + logs).
  • Trivy + DefectDojo for the security scan path.
  • Monitoring VM (Prometheus + Grafana for the VM tier).
  • WSO2 APIM + Identity VMs.
  • Redis Sentinel and Kafka KRaft VMs (out-of-scope for the OpenShift workload, retained as supporting infra).

Lessons kept.

  • The per-VM ADR convention (ADRs 0006–0013, one per VM tier) made each VM’s purpose, sizing, and config decision traceable. This pattern persisted; later ADRs (0014–0026) are all single-decision documents.
  • HAProxy edge serves VM hosts only (memory feedback_haproxy_scope.md): never put OpenShift routes behind it. This rule was set on 2026-05-08 and has held.
  • oc-mirror is reserved for OpenShift install only; all application image work goes through Nexus (memory feedback_oc_mirror_off_limits.md). This separation became ADR 0019.
  • GitHub-first tracking (memory feedback_github_first_tracking.md): every non-trivial change opens an issue first. This pattern was set on 2026-05-08 and remains the source-of-truth governance today.

E3 — v6 install (2026-05-09)

Scope. The two-cluster v6 install ran inside a single 24-hour window:

Time (UTC) | Event
06:37 | OCP mirror partial completion recorded
06:45 | hub-dc-v6 install preflight
06:53 | hub-dc-v6 install workdir prep
07:11 | hub-dc-v6 ISO + VM definitions
08:00 | hub-dc-v6 boot and install complete; OCP 4.20.18
08:15 | Local connection cleanup
08:23 | day-1 checkpoint + baseline
08:43 | Disconnected catalog baseline (mirrored Red Hat + certified catalogs only, default sources disabled)
08:47 | Vault readiness recheck
08:55 | Developer readiness track + ADR 0014
08:57 | OpenShift GitOps installed (openshift-gitops-operator.v1.20.3)
09:01 | Developer handbook scaffold
09:12 | hub-dc-v6 minimal GitLab GitOps bootstrap (Application hub-dc-v6-bootstrap Synced/Healthy)
09:28 | GitOps AppProject hardening
09:34 | docker-runtime VM deployed
09:39 | hub-dc-v6 bootstrap namespace baseline
10:08 | Federated GitOps architecture ADR (0015) + milestones
10:17 | Federated GitOps readiness gates
10:23 | GitLab repo group access matrix
10:35 → 10:58 | GitLab FG-1: preflight → group skeleton → project migration → access grants
11:17 → 11:54 | spoke-dc-v6 install preflight → inputs → workdir prep
13:02 | spoke-dc-v6 ISO and VM definitions
13:15 | spoke-dc-v6 PCI-DSS tracking initialized
13:25 | spoke-dc-v6 physical worker boot safety gate
15:20 | spoke-dc-v6 base install complete (3 VM masters + 3 physical workers)
15:34 → 16:27 | spoke-dc-v6 ODF preflight blocked → disk remediation → drift audit → manual ODF/LSO removal
16:32 | hub-dc-v6 management GitOps reset
16:49 | management GitOps pull-model baseline (ADR 0018)
20:10 | spoke-dc-v6 ACM registration
22:13 | spoke-dc-v6 ODF StorageConsumer cleanup
22:47 | spoke-dc-v6 ODF CSI mirror fix
23:27 | Nexus-only image supply baseline (ADR 0019)

Outcome. By the end of 2026-05-09 the v6 fleet was live:

  • hub-dc-v6 — 3-node compact cluster, all-in-one masters (control-plane + worker roles), OCP 4.20.18, kubelet v1.33.9.
  • spoke-dc-v6 — 3 VM masters + 3 physical workers (gold-1, gold-2, gpu-01), ACM-registered to hub-dc-v6, ODF data plane healthy.
  • OpenShift GitOps installed on both, pull-model registered (Argo on hub orchestrates, spoke Argo applies).
  • Nexus is the single image supply path for both clusters.

Lessons kept.

  • The ISO/agent-install path was the right choice over IPI/UPI for the rebuild — repeatable from input package, no PXE infra required.
  • Three-master control planes for both hub and spoke: on the hub the masters also carry the worker role (compact cluster); on the spoke, workloads run on the physical workers. This reduces node count without losing the role separation that 4.x enforces.
  • First Argo application must be tiny. hub-dc-v6-bootstrap only manages namespace platform-bootstrap, which validates the GitOps rail before any real workload.
  • Disconnected catalog baseline before any operator install — IDMS/ITMS, mirrored catalog, disableAllDefaultSources: true, validated before the operator queue starts.
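
The disconnected catalog gate in the last bullet can be expressed as a small pre-operator check. This is a minimal sketch under stated assumptions: catalog_baseline_gaps is a hypothetical helper, and its inputs are plain dictionaries standing in for the cluster OperatorHub object and a listing of mirror and catalog resources, however those are actually fetched in the lab.

```python
def catalog_baseline_gaps(operator_hub, resources):
    """Return a list of human-readable gaps; an empty list means the gate passes.

    operator_hub: dict form of the cluster OperatorHub object.
    resources: list of dicts, each with at least a "kind" key.
    """
    gaps = []
    # Default catalog sources must be switched off on a disconnected cluster.
    if operator_hub.get("spec", {}).get("disableAllDefaultSources") is not True:
        gaps.append("default catalog sources are still enabled")
    kinds = {r.get("kind") for r in resources}
    # At least one mirror set (digest- or tag-based) must be present.
    if not kinds & {"ImageDigestMirrorSet", "ImageTagMirrorSet"}:
        gaps.append("no IDMS or ITMS found")
    # At least one mirrored catalog must be present.
    if "CatalogSource" not in kinds:
        gaps.append("no CatalogSource found")
    return gaps


# Illustrative input only; real objects carry far more fields.
hub = {"spec": {"disableAllDefaultSources": True}}
found = [{"kind": "ImageDigestMirrorSet"}, {"kind": "CatalogSource"}]
print(catalog_baseline_gaps(hub, found))  # []
```

Returning a list of gaps rather than a boolean keeps the check usable as a gate while still saying what to fix before the operator queue starts.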

Failure modes that surfaced.

  • IPv6 vs OVN-K. The ipv6.disable=1 kernel arg AND the net.ipv6.conf.all.disable_ipv6=1 sysctl both break OVN-K (geneve uses IPv6 link-local even on IPv4-only clusters). Discovered during the v6 install; ADR 0026 amended ADR 0005 to record the correct posture. Runbook: runbooks/openshift-ipv6-disable-correct-approach.md.
  • ODF CSI image mirror gap. spoke-dc-v6 had only the release-payload IDMS, not the oc-mirror operator/operand IDMS. CSI pods failed external pulls. Fix: applied ImageDigestMirrorSet/idms-operator-0, mirrored missing digests into Nexus. Tracked under issue #120.
  • ACM gitops-addon ships a routes.route.openshift.io CRD that collides with the aggregated Route APIService and breaks /openapi/v2 on real OpenShift clusters. Quick fix: oc delete crd routes.route.openshift.io. Tracked under issue #153.
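
The IPv6 failure mode above names two specific settings, which makes it easy to guard against in a preflight step. The sketch below is an assumption about how such a guard could look, not the content of the runbook: ipv6_disable_findings is a hypothetical helper that takes a list of kernel arguments and a dictionary of sysctl values, however those are collected.

```python
# The two IPv6-disable forms recorded above as breaking OVN-K.
BREAKING_KERNEL_ARGS = {"ipv6.disable=1"}
BREAKING_SYSCTLS = {"net.ipv6.conf.all.disable_ipv6": "1"}


def ipv6_disable_findings(kernel_args, sysctls):
    """Return descriptions of any settings that match a known-breaking form."""
    findings = [
        f"kernel arg {arg}" for arg in kernel_args if arg in BREAKING_KERNEL_ARGS
    ]
    for key, bad_value in BREAKING_SYSCTLS.items():
        # Compare as strings so both 1 and "1" are caught.
        if str(sysctls.get(key, "")).strip() == bad_value:
            findings.append(f"sysctl {key}={bad_value}")
    return findings


# Illustrative inputs only.
print(ipv6_disable_findings(
    ["quiet", "ipv6.disable=1"],
    {"net.ipv6.conf.all.disable_ipv6": 1},
))
```

An empty result means neither breaking form is present; it says nothing about what the correct IPv6 posture is, which ADR 0026 and the runbook cover.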

E4 — federation + compliance (2026-05-10 → 2026-05-11)

Scope. With the v6 fleet running, the focus moved to federated content, the 22-operator queue, and the PCI-DSS baseline. Two ADRs accepted: 0020 (PCI-DSS baseline) and 0022 (v6 fleet membership).

2026-05-10 day-wrap (issue #159) — headline numbers:

Metric | Count
Operators installed end-to-end | 7 (Compliance, OADP, cert-manager × 2 clusters, FIO, SPO, CSO)
MRs merged to platform-gitops main | 16
MRs merged to opp-full-plat main | 2
Issues closed | 15
ADRs accepted | 0020, 0022
Project board #10 cards moved to Validated | 6 of 22
Cluster-breaking incidents (reverted) | 2 (both IPv6 forms vs OVN-K)

Issues closed on 2026-05-10: #109 PCI-1 day-zero · #110 PCI-2 Compliance Operator + GitOps · #125 IMG-SUPPLY2 ODF dep coverage · #129 SPOKE-GUARD1 · #130 PCI-HANDBOOK · #132 ADR-0020 review · #133 PCI-1.10 etcd encryption · #134 PCI-1.12 OAuth tokenConfig · #136 IMG-REVIEW1 · #138 IMG-CLEAN1 · #139 IMG-CNV1 OpenShift Virtualization mirror · #152 BACKUP-1 OADP Phase A · #156 OPS-V6-FLEET-1 pre-v6 purge · #157 CERT-MGR-1 cert-manager (both clusters) · #158 PCI-3.A operator-presence batch.

2026-05-11 milestones:

  • PCI-DSS baseline closed end-to-end on spoke-dc-v6 — PCI-0 through PCI-5 plus PCI-1.13 closed; sanitized auditor-facing evidence pack published at reports/pci-dss/spoke-dc-v6-pci-dss-v4-baseline-2026-05-11.md.
  • MR !53 merged the hardening manifests + TailoredProfile; MR !54 unblocked the spoke argocd-platform-extensions ClusterRole; MR !55 added the TailoredProfile exclusion for CSO + ingress-ciphers rules.
  • Final PCI-DSS counts: ocp4-pci-dss-4-0 FAILs dropped 17 → 10 → 8; node-master scan stayed at 3 FAILs; node-worker scan stayed at 0 FAILs. All 11 remaining FAILs map to follow-up sub-issues #246–#252.
  • RHOAI direct mirror retry running through the night; 572 successful image copies, 1 failed image copy at last check.
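
The figure of 11 remaining FAILs is the sum of the final per-scan counts quoted above. The snippet below only reproduces that arithmetic; the dictionary keys are labels chosen for this example, with only ocp4-pci-dss-4-0 taken verbatim from the scan name.

```python
# Final FAIL counts per scan, as reported for 2026-05-11.
final_fails = {
    "ocp4-pci-dss-4-0": 8,  # after dropping 17 -> 10 -> 8
    "node-master": 3,
    "node-worker": 0,
}

total = sum(final_fails.values())
print(total)  # 11
```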

Lessons kept.

  • GitHub-first tracking + MR-first delivery is the durable record. Chat is ephemeral; GitHub issues and MR history are the auditable trail.
  • Compliance Operator XCCDF rules have hardcoded namespace/operand expectations — TailoredProfile is how you reconcile lab reality with the rule baseline.
  • Federation needs ADRs as much as it needs code — 0015 (federated GitOps repo architecture), 0018 (pull model), 0023 (GitLab group ownership), 0024 (OpenShift-only platform-gitops boundary), 0025 (GitOps-only operations + break-glass) each closed one ambiguity that would otherwise leak into runtime drift.

What carried through every rebuild

Three things never broke and never had to be re-decided:

  1. Network plane. The 30.30.0.0/16 lab subnet allocation pattern, PDNS as the lab resolver, HAProxy as the edge for VM hosts (not for OpenShift), and the *.sub.comptech-lab.com DNS plane. Set in E2 and unchanged since.
  2. Image supply chain. Nexus for everything except OCP install mirrors (which go through oc-mirror). ADR 0019 codified this; ADR 0024 reinforced it; the rule has held through every rebuild attempt.
  3. Workspace at /home/ze/opp-full-plat. The decision in ADR 0001 to maintain a local operator workspace with AGENTS.md, CLUSTERS.md, plans/, scripts/rebuild/, ADRs, and reports/sessions/ is the reason this timeline exists. Without the workspace, every rebuild attempt would have been amnesic.

References

  • ADR 0001 — operator workspace (historical anchor; cluster list portion superseded by 0022)
  • ADR 0005 — OpenShift rebuild network ingress PKI
  • ADR 0014 — developer readiness platform contract
  • ADR 0015 — federated GitOps repo architecture
  • ADR 0018 — ACM + OpenShift GitOps pull model (v6)
  • ADR 0019 — Nexus-only image supply chain
  • ADR 0020 — PCI-DSS profile compliance baseline
  • ADR 0022 — v6 fleet membership (pre-v6 purge)
  • ADR 0023 — federated GitLab group/repo ownership
  • ADR 0024 — OpenShift-only platform-gitops boundary
  • ADR 0026 — IPv6 baseline for OVN-Kubernetes
  • opp-full-plat/SESSION_LOG.md — 229 entries from 2026-05-05 to 2026-05-11
  • opp-full-plat/reports/sessions/ — 244 timestamped session reports
  • Day-wrap issue #159 — 2026-05-10 closeout
  • Site Replication Readiness milestone (REP-0..REP-7) — see /docs/08-history-and-replay/04-site-replication-readiness/

Last reviewed: 2026-05-11