Site Replication Readiness
The REP-* tracker — promoting opp-full-plat + dc-lab into a replicable site-builder framework: profile schema, GitLab bootstrap, scripts refactor, template generators, and the dry-run gate.
Site Replication Readiness (milestone #31) is the long-running effort to promote the opp-full-plat workspace + the dc-lab instance into a replicable site-builder framework: another team takes the package, runs it against their own site profile, and ends up with the same platform shape. This page documents what the framework is, where it stands today, the REP-* tracker, and the gap between “framework exists” and “framework proven.”
The PCI-DSS baseline that closed on 2026-05-11 is the first repeatable end-to-end exercise on the v6 fleet. REP-* is what makes the next site repeat it without reaching for chat history or undocumented context.
The eight layers (L0–L7) a new site must cross
REP-0 (the meta issue) defines the layers any new lab build has to traverse, with each layer’s current state:
| Layer | What | Current state |
|---|---|---|
| L0 Site prereqs | Network, DNS resolver, IP/MAC plan, hardware BOM | Partial template under plans/disconnected-rebuild/environments/dc-lab/ |
| L1 VM platform tools | HAProxy, PDNS, MinIO, Nexus, Jenkins, Trivy, Vault, SigNoz, DefectDojo, docker-runtime, GitLab | Per-VM ADRs (0008–0014); scripts/rebuild/* per service — partial scripts, mostly dc-lab-specific until REP-4 refactor lands |
| L2 GitLab content bootstrap | Groups, role groups, projects, runners, CODEOWNERS, branch protection | gitlab-operator-guide.md describes target; REP-3 Phase-1 has terraform/gitlab-bootstrap/ design |
| L3 OpenShift install | oc-mirror, install-config, hub + spoke deploy | RUNBOOK.md + connection-details/ — partial, manual; agent-install path |
| L4 Day-2 OpenShift platform | ACM, GitOps, mirrors, catalogs, ODF | platform-gitops repo on GitLab — fully GitOps, replicable |
| L5 Operator queue | 22-operator install track | Project board #10 — replicable as template |
| L6 Compliance | PCI-0 → PCI-5 chain | Compliance handbook + 6 phase issues + ADR 0020 — methodology solid, parameterization missing |
| L7 App delivery | GitLab → Jenkins → Trivy → Nexus → runtime → MinIO evidence | ADR 0014 contract; demo-smoke / liberty-smoke / openliberty-readiness-probe exist — partial reference apps |
The point is: every layer has an ADR (or set of ADRs), a checked-in plan or template, and (eventually) a profile-driven script. The framework is coherent; it just isn’t complete yet.
REP-* tracker
REP-0 at the top is the parent meta. Solid arrows are sequence dependencies; dashed red arrows are “must close before” relationships that gate REP-7 (the proof gate at the bottom).
| Issue | State | Artifact (today) |
|---|---|---|
| REP-0 #144 | OPEN (long-running parent) | n/a |
| REP-1 #145 — profile schema + dc-lab extraction | CLOSED | plans/disconnected-rebuild/site-profile-schema.md + plans/disconnected-rebuild/environments/dc-lab/profile.yaml + scripts/validate-profile.py (295/295 PASS) |
| REP-2 #146 — parameterize handoffs | CLOSED (Phase 1) | scripts/render-handoffs.py + 3 handoffs as .j2 templates; dc-lab byte-clean render |
| REP-2.1 #241 — parameterize remaining docs | OPEN (in flight) | Jinja templates parameterized for 8 more handoff/connection-details docs; RUNBOOK.md still pending |
| REP-2.2 #243 — unify the two profile.yaml files | CLOSED | Single canonical profile at environments/<site>/profile.yaml; dc-lab profile.yaml unified into the single canonical schema; schema doc updated |
| REP-3 #147 — GitLab content bootstrap | OPEN (Phase 2) | Phase 1 design + scaffold; Phase 2 GitLab Terraform bootstrap module landing + runner-token helper |
| REP-3.2/3.3/3.4 #235/#236/#237 | OPEN (active) | GitLab Terraform bootstrap module covers groups + role groups + branch protection (#235); runner-token helper landed (#236); full repo seeding (#237) is the remaining work |
| REP-4 #148 — scripts/rebuild/ profile-aware refactor | OPEN (Phase 1) | scripts/rebuild/REFACTOR-PATTERN.md + lib/profile.sh + 3 reference refactors; 11/11 smoke PASS |
| REP-4.2 #244 — refactor remaining scripts/rebuild/* | CLOSED | Profile-aware refactor of every remaining scripts/rebuild/* script — the pattern is now uniform across Nexus, Jenkins, Vault, SigNoz, DefectDojo, Monitoring, docker-runtime, WSO2, Trivy, edge-DNS, MinIO, vm-provisioning, developer-smoke |
| REP-5 #149 — make new-site generator | OPEN (Phase 1) | Makefile + scripts/new-site.py + templates/issues/r-chain/ R0-R6 |
| REP-5.2/3/4/5 #238/#239/#240/#242 | OPEN (active) | L-chain (#238), PCI-chain (#239), FG-DH-chain (#240) issue templates progressing; GitHub Projects v2 GraphQL integration (#242) wired so make new-site populates the project board |
| REP-6 #150 — notes → runbooks | CLOSED | 5 operator notes migrated to runbooks/* + connection-details/signoz.md |
| REP-7 #151 — dry-run rebuild | OPEN (BLOCKED — proof gate) | Comment listing preconditions |
REP-1 — site profile schema (closed)
The first deliverable: define a single, authoritative site profile schema capturing every parameter that varies per site. Extract today’s dc-lab values out of scattered files into a single environments/dc-lab/profile.yaml instance of the schema.
A representative skeleton (excerpted from REP-1’s design):
site:
  name: dc-lab
  domain: sub.comptech-lab.com
  bootstrap_admin: zahid
  trust_domain: comptech-lab
network:
  machine_subnet: 30.30.0.0/16 # lab /24 allocation pattern
  reserved_node_subnet: 30.30.75.0/24
  dns_resolver: <DNS VM in lab /24>
  haproxy_edge: <HAProxy edge in lab /24>
  ipv6_disabled: true # superseded by ADR 0026
clusters:
  hub:
    name: hub-dc-v6
    api_vip: <hub api vip>
    ingress_vip: <hub ingress vip>
    masters:
      - { name: master-0, ip: <…> }
      - { name: master-1, ip: <…> }
      - { name: master-2, ip: <…> }
    workers: [] # compact
  spoke:
    name: spoke-dc-v6
    api_vip: <spoke api vip>
    ingress_vip: <spoke ingress vip>
    masters: [...]
    workers:
      - { name: worker-0, cpu: 128, mem_gib: 503 }
      - { name: worker-1, cpu: 128, mem_gib: 503 }
      - { name: worker-2, cpu: 128, mem_gib: 503 } # GPU worker
vm_platform:
  haproxy: { role: edge }
  pdns: { role: auth+recursor }
  minio: { role: object-storage }
  nexus: { role: image-supply, endpoints: [mirror-registry, docker-group, app-registry] }
  vault: { role: credential-custody, nodes: 3 }
  jenkins: { role: ci }
  signoz: { role: vm-observability }
  trivy: { role: scanner }
  defectdojo: { role: security-dashboard }
  docker-runtime: { role: developer-runtime }
gitops:
  hub_repo: openshift-platform-gitops
  governance_repo: openshift-governance
  webhook_secret_path: secret/apps/<tenant>/ci/webhook
governance:
  pci_phases: [PCI-0, PCI-1, PCI-1.13, PCI-2, PCI-3, PCI-4, PCI-5]
  fg_phases: [FG-0, FG-1, FG-2, FG-3, FG-4, FG-5, FG-6]
  dh_phases: [DH-1, DH-2, DH-3, DH-4, DH-5, DH-6, DH-7, DH-8, DH-9]
Specific IPs and node addresses are intentionally placeholdered for the public version — the live profile carries those concrete values. The schema shape is the public artifact.
A scripts/validate-profile.py exists and passed 295/295 checks against the dc-lab profile when REP-1 closed. The validator enforces schema shape, value types, cross-references (a cluster’s api_vip must be inside its subnet, etc.), and the no-overlap rule between cluster ranges.
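The two network rules named above (VIPs inside the subnet, no overlap between cluster ranges) can be sketched with the standard `ipaddress` module. This is an illustration only, not the contents of scripts/validate-profile.py: the `node_range` field, the function name, and every address below are assumptions made up for the example.

```python
import ipaddress


def check_cluster_network(profile):
    """Return a list of violations; an empty list means the checks pass."""
    errors = []
    subnet = ipaddress.ip_network(profile["network"]["machine_subnet"])
    seen = {}  # cluster name -> node range already accepted
    for name, cluster in profile["clusters"].items():
        # Cross-reference rule: each VIP must sit inside the machine subnet.
        for field in ("api_vip", "ingress_vip"):
            vip = ipaddress.ip_address(cluster[field])
            if vip not in subnet:
                errors.append(f"{name}.{field} {vip} is outside {subnet}")
        # No-overlap rule: cluster node ranges must be disjoint.
        # "node_range" is a hypothetical field used only in this sketch.
        node_range = ipaddress.ip_network(cluster["node_range"])
        for other, other_range in seen.items():
            if node_range.overlaps(other_range):
                errors.append(
                    f"{name} range {node_range} overlaps {other} range {other_range}"
                )
        seen[name] = node_range
    return errors


# Invented addresses, not the dc-lab values.
example = {
    "network": {"machine_subnet": "10.20.0.0/16"},
    "clusters": {
        "hub": {"api_vip": "10.20.1.5", "ingress_vip": "10.20.1.6",
                "node_range": "10.20.1.0/24"},
        "spoke": {"api_vip": "10.20.2.5", "ingress_vip": "10.21.2.6",
                  "node_range": "10.20.1.128/25"},
    },
}

for problem in check_cluster_network(example):
    print(problem)
```

The example profile deliberately breaks both rules once (one VIP outside the subnet, one overlapping range), so the loop reports two problems.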
REP-2 — parameterize handoff docs (closed Phase 1)
The three operator handoff docs (platform-admin-handoff.md, gitlab-operator-guide.md, compliance-implementor-handbook.md) were converted into Jinja-style .j2 templates with {{ profile.* }} references. A scripts/render-handoffs.py reads the profile and renders site-specific markdown into environments/<site>/rendered/.
Phase-1 acceptance: render the templates against dc-lab/profile.yaml and diff against today’s checked-in handoff docs. The diff must be empty (or only cosmetic whitespace). This was achieved.
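The acceptance check (render, then diff against the checked-in doc) can be sketched as follows. To stay dependency-free, a small regex substitution stands in for the real Jinja rendering, so this is not how scripts/render-handoffs.py works internally; the function names and the one-line template are assumptions for illustration.

```python
import difflib
import re

# Matches {{ profile.some.dotted.path }}; a stdlib stand-in for Jinja syntax.
PLACEHOLDER = re.compile(r"\{\{\s*profile\.([A-Za-z0-9_.]+)\s*\}\}")


def render(template, profile):
    """Substitute every {{ profile.a.b }} reference with the profile value."""
    def lookup(match):
        value = profile
        for key in match.group(1).split("."):
            value = value[key]  # KeyError here means the profile lacks the key
        return str(value)
    return PLACEHOLDER.sub(lookup, template)


def render_diff(template, profile, checked_in):
    """Return unified-diff lines; an empty list means a byte-clean render."""
    rendered = render(template, profile)
    return list(difflib.unified_diff(
        checked_in.splitlines(), rendered.splitlines(),
        fromfile="checked-in", tofile="rendered", lineterm=""))


profile = {"site": {"name": "dc-lab", "domain": "sub.comptech-lab.com"}}
template = "Site {{ profile.site.name }} uses domain {{ profile.site.domain }}.\n"
checked_in = "Site dc-lab uses domain sub.comptech-lab.com.\n"

diff = render_diff(template, profile, checked_in)
print("byte-clean" if not diff else "\n".join(diff))
```

Treating an empty diff as the pass condition keeps the gate binary: any non-empty output is either a template bug or a value that was never moved into the profile.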
REP-2.1 (#241) is now in flight: eight more handoff/connection-details documents have been converted to .j2 templates with {{ profile.* }} references and render byte-clean against the unified dc-lab profile. RUNBOOK.md remains the last large conversion target; the per-service connection-details docs (nexus.md, jenkins.md, minio.md, vault-app-secrets.md, and the rest) that haven’t already been converted are mechanical follow-ups against the same render pipeline.
REP-2.2 — profile unification (closed)
REP-1 and REP-2 each landed a different profile.yaml for dc-lab:
- REP-1 canonical at plans/disconnected-rebuild/environments/dc-lab/profile.yaml — 432 lines, flat operators list, top_level_groups / subgroups split, validated by scripts/validate-profile.py.
- REP-2 handoff at environments/dc-lab/profile.yaml — 223 lines, superset shape with gitops, governance, image_supply_reports sections.
Two YAMLs for the same site = drift trap. REP-2.2 unified them at environments/<site>/profile.yaml as the canonical path, extending the REP-1 schema to absorb REP-2’s extra sections; the dc-lab profile was migrated into the unified schema. The old path was retired with a deprecation note.
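One way to picture the unification step is a section-level merge that absorbs sections present in only one file and flags sections that exist in both with different content. This is a hedged sketch, not the REP-2.2 migration itself: the `unify` function and the sample values are invented for illustration.

```python
def unify(canonical, handoff):
    """Merge two profile dicts section by section.

    Returns (merged, drift): sections only in `handoff` are absorbed,
    sections present in both but with different content are reported
    as drift and left at the canonical value for a human to resolve.
    """
    merged = dict(canonical)
    drift = []
    for section, value in handoff.items():
        if section not in merged:
            merged[section] = value
        elif merged[section] != value:
            drift.append(section)
    return merged, drift


# Invented sample content; the real profiles are far larger.
canonical = {"site": {"name": "dc-lab"}, "operators": ["example-operator"]}
handoff = {"site": {"name": "dc-lab"}, "gitops": {"hub_repo": "openshift-platform-gitops"}}

merged, drift = unify(canonical, handoff)
print(sorted(merged), drift)
```

Reporting drift instead of silently picking a winner matches the concern stated above: two files that disagree are a trap, so the disagreement should surface rather than be merged away.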
REP-3 — GitLab content bootstrap (open, Phase 2)
The GitLab operator guide describes the target federated structure (groups, role groups, repos, runners, branch protection, CODEOWNERS). Without automation, new sites would have to click through GitLab UI for hours and likely diverge.
Phase-1 delivered the design + scaffold:
- plans/disconnected-rebuild/gitlab-bootstrap-design.md — the design.
- terraform/gitlab-bootstrap/ — Terraform module skeleton using the gitlabhq/gitlab provider.
- scripts/gitlab/render-codeowners.py — Python helper that renders CODEOWNERS files per repo from the profile’s repo_role_map.
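Rendering CODEOWNERS from a role map can be sketched in a few lines. The shape of repo_role_map is not shown in this page, so the structure below (repo, then path pattern, then a list of role groups) is an assumption, as are the group names and the function itself; it is not the code in scripts/gitlab/render-codeowners.py.

```python
def render_codeowners(repo, repo_role_map):
    """Render a CODEOWNERS file body for one repo.

    Pattern order from the map is preserved because GitLab applies the
    last matching pattern, so ordering carries meaning.
    """
    rules = repo_role_map[repo]
    lines = [f"# CODEOWNERS for {repo} (generated from the site profile)"]
    for pattern, groups in rules.items():
        owners = " ".join(f"@{group}" for group in groups)
        lines.append(f"{pattern} {owners}")
    return "\n".join(lines) + "\n"


# Hypothetical map: the ct-* group names are invented for this example.
repo_role_map = {
    "platform-gitops": {
        "*": ["ct-platform-admins"],
        "/clusters/": ["ct-platform-admins", "ct-cluster-ops"],
    }
}

print(render_codeowners("platform-gitops", repo_role_map), end="")
```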
Phase-2 (active) is the actual Terraform bootstrap module:
- REP-3.2 #235 — the GitLab Terraform bootstrap module landed: top-level groups + 11 ct-* role groups + group share permissions + project-level branch protection are now declarable from the profile.
- REP-3.3 #236 — runner-token helper landed: generates GitLab runner registration tokens and writes them to Vault so the planned runner classes can be brought up against the same custody path the rest of the framework uses.
- REP-3.4 #237 — full repo seeding (infra-ops, platform-services, tenant-registry, divisions) is the remaining gap before a new site can stand up its GitLab content end-to-end from the profile.
REP-4 — scripts/rebuild profile-aware refactor (open, Phase 1)
Each script under scripts/rebuild/* (Nexus, Jenkins, Vault, SigNoz, DefectDojo, Monitoring, docker-runtime, WSO2, Trivy, edge-DNS, MinIO, vm-provisioning, developer-smoke) originally took site-specifics via pwd-local files or hardcoded constants. REP-4 introduced a shared lib/profile.sh with profile_init, profile_get, profile_get_list, profile_require, and dry_run_or_run helpers, plus a 6-step refactor pattern.
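lib/profile.sh is a shell library and its exact signatures are not reproduced on this page. The Python sketch below only mirrors the behavior the helper names suggest (dotted-key lookup, a fail-fast requirement check, and a dry-run wrapper); argument order, defaults, and error handling are assumptions.

```python
import subprocess


def profile_get(profile, dotted_key, default=None):
    """Look up a value such as 'site.domain' in a nested profile dict."""
    node = profile
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


def profile_require(profile, *dotted_keys):
    """Fail fast, before any side effects, if a needed key is absent."""
    missing = [key for key in dotted_keys if profile_get(profile, key) is None]
    if missing:
        raise KeyError("missing required profile keys: " + ", ".join(missing))


def dry_run_or_run(argv, dry_run=True):
    """Print the command in dry-run mode; otherwise execute it."""
    if dry_run:
        print("DRY-RUN:", " ".join(argv))
        return 0
    return subprocess.run(argv, check=True).returncode


profile = {"site": {"name": "dc-lab", "domain": "sub.comptech-lab.com"}}
profile_require(profile, "site.name", "site.domain")
dry_run_or_run(["echo", "bootstrapping", profile_get(profile, "site.name")])
```

Checking required keys up front is what makes a script safe to point at a new site: a missing value stops the run before anything is changed, instead of partway through.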
Phase-1 reference refactors:
- scripts/rebuild/signoz/bootstrap-signoz.sh — converted.
- scripts/rebuild/minio/configure-ci-evidence-bucket.sh — converted.
- scripts/rebuild/edge-dns/setup-haproxy-rke2.sh — converted.
Smoke test scripts/rebuild/lib/test-profile-lib.sh passes 11/11. REP-4.2 #244 — the mechanical pass over the remaining scripts/rebuild/* — is closed. The refactor pattern is now uniform across Nexus, Jenkins, Vault, SigNoz, DefectDojo, Monitoring, docker-runtime, WSO2, Trivy, edge-DNS, MinIO, vm-provisioning, and developer-smoke. Every script reads its site-specifics from profile.yaml via lib/profile.sh; no hardcoded constants or pwd-local files remain.
REP-5 — make new-site generator (open, Phase 1)
A new site rollout needs the same GitHub structure dc-lab has: a milestone, a project board with a custom Status, and a long chain of phase issues (R0–R6, PCI-0–PCI-5, FG-0–FG-6, DH-1–DH-9, REP, OPS). Manually clicking them into existence is error-prone and inconsistent.
make new-site SITE=<name> reads the site profile and creates the full GitHub scaffold: milestone + project board + all phase issues + cross-references, with bodies templated from the existing dc-lab issues.
Phase-1 delivered:
- Makefile with the new-site target.
- scripts/new-site.py — the generator.
- templates/issues/r-chain/ — R0–R6 issue bodies as templates.
Open sub-issues (active):
- REP-5.2 #238 — L-chain (L0-L7) issue templates progressing; each layer the framework defines now has an issue body template the generator can populate.
- REP-5.3 #239 — PCI-chain (PCI-0..PCI-5) issue templates progressing, modeled on the closed dc-lab PCI chain.
- REP-5.4 #240 — FG-DH-chain (FG-0..FG-6 + DH-1..DH-9) issue templates progressing.
- REP-5.5 #242 — GitHub Projects v2 GraphQL integration wired so make new-site creates the milestone, the project board with a custom Status field, and links every generated phase issue onto the board as one batched operation.
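The chain scaffolding described above can be pictured as a pure planning step that turns a list of phases into issue titles and bodies with a back-reference to the previous phase. This sketch makes no GitHub API calls and is not scripts/new-site.py; the `plan_chain` function, the title format, and the body layout are assumptions for illustration.

```python
def plan_chain(site, phases):
    """Build issue drafts for one phase chain, each linked to its predecessor."""
    plan = []
    for index, phase in enumerate(phases):
        previous = phases[index - 1] if index else None
        body = f"Site: {site}\nPhase: {phase}\n"
        if previous:
            body += f"Depends on: {previous}\n"
        else:
            body += "Depends on: none (chain head)\n"
        plan.append({"title": f"[{site}] {phase}", "body": body})
    return plan


# Phase names taken from the governance section of the profile skeleton.
pci_phases = ["PCI-0", "PCI-1", "PCI-1.13", "PCI-2", "PCI-3", "PCI-4", "PCI-5"]

for draft in plan_chain("test-replication", pci_phases):
    print(draft["title"])
```

Keeping the planning step separate from the API calls means the generated chain can be reviewed, or diffed against the dc-lab issues, before anything is created.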
REP-6 — memory → runbooks (closed)
Per-session operator notes are invisible to another operator. For replicability, the lessons had to migrate into tracked opp-full-plat docs.
Closed at commit c27dd5f. Migrated artifacts:
| Operator note | Migrated to |
|---|---|
| project_jenkins_pollscm_bootstrap.md | runbooks/jenkins-gitlab-webhook-pollscm.md |
| project_lab_cred_drift.md | runbooks/secrets-custody-drift-check.md (+ tracking issues per stale credential) |
| project_signoz_v0122_auth.md | connection-details/signoz.md (Known gotchas) |
| project_wso2_jms_url_encoding.md | runbooks/wso2-apim-jms-url-encoding-trap.md |
| IPv6/OVN-K incident memory | runbooks/openshift-ipv6-disable-correct-approach.md |
| MCO stuck-node memory | runbooks/mco-stuck-node-recovery.md |
REP-6 codified the distinction maintained going forward: behavioral feedback (how the agent operates) stays in memory; operational knowledge (what any operator needs) goes to runbooks/ADRs/connection-details.
REP-7 — the dry-run gate (open, BLOCKED)
This is the proof gate for the entire framework. The framework is only “replication-ready” when it actually replicates. REP-7 plans a full end-to-end rebuild against parallel test infrastructure using only the artifacts produced by REP-1..REP-6. If anything requires reaching for chat history or undocumented memory, that’s a gap → close it before declaring REP-0 done.
Plan:
- Provision parallel test infra. Separate VM hypervisor / network segment / hardware corner. Minimum: 3 VMs for the management VM stack + 6 VMs for OCP hub+spoke. Less than production footprint; just enough to exercise every layer.
- Build a test-replication profile at environments/test-replication/profile.yaml with new domain, new IPs (different /24), new hostnames. Reuse the REP-1 schema unchanged.
- Execute layer-by-layer using only the framework artifacts:
  - L0: Site prereqs from rendered handoff (REP-2).
  - L1: VM platform via refactored scripts/rebuild/* (REP-4).
  - L2: GitLab content via REP-3 bootstrap automation.
  - L3: OpenShift install via rendered RUNBOOK + scripts.
  - L4: Day-2 platform via platform-gitops template + Argo bootstrap pattern.
  - L5: Operator queue via project board template (REP-5) + 22-operator install MRs.
  - L6: Compliance via PCI-0..PCI-5 phase issues (REP-5) + ADR 0020 baseline.
  - L7: App delivery via developer-smoke reference apps + ADR 0014 contract.
- Track gaps as REPLAY-N issues — anything requiring undocumented action gets a parallel sub-issue marked “blocking REP-7 closure.”
- Record full elapsed time — wall-clock from “fresh hardware” to “first PCI scan ran” is the framework-quality metric.
Success criteria:
- Test-site OpenShift cluster running, all 22+1 (compliance) operators installed.
- oc get nodes shows test-site nodes.
- platform-gitops fork on test-site GitLab reconciles cleanly.
- PCI-DSS scan ran and produced results (PASS/FAIL irrelevant; the scan happening is the proof).
- Developer pipeline did one round-trip: GitLab → Jenkins → Trivy → Nexus → docker-runtime → MinIO evidence.
- Zero ad-hoc lookups during execution; every step traceable to a checked-in artifact.
REP-7 is blocked on: REP-2.1 (remaining handoffs + RUNBOOK), REP-3.4 (full repo seeding; REP-3.2 + REP-3.3 have landed), REP-5.2/3/4/5 (remaining chain template content + Projects v2 end-to-end validation), and test infrastructure availability. REP-4.2 is no longer a blocker.
Hub + spoke site-replication strategy
Site replication isn’t only about the framework artifacts; it’s about what shape a replicated site takes. The strategy: every replicated site is one hub + one spoke (the dc-lab pattern), with the same role split:
- Hub: compact 3-master, all-in-one (control plane + worker roles), no ODF, no application workload, runs ACM + OpenShift GitOps + per-cluster Argo CD. The hub is the management plane, deliberately storage-light.
- Spoke: 3 VM masters + N physical workers, runs ODF (or equivalent), all application workloads, ACM-registered to the hub, runs its own Argo CD that reconciles its cluster directory in platform-gitops.
Future v6-era DR adds two more clusters (hub-dr-v6, spoke-dr-v6), each mirroring the role of its non-DR counterpart. ADR 0022 reserves the names; no manifests target them today. When DR is reintroduced, the work goes through a new tracked issue + ADR amendment.
MinIO replication patterns (forward-looking)
MinIO at each site backs three workloads:
- OADP backup target — oadp-backups bucket; cloud-credentials Secret on the spoke.
- LokiStack object storage — loki-chunks bucket via NooBaa OBC; ESO bridges OBC’s AWS_KEY keys + ConfigMap into the lowercase LokiStack storage-Secret keys (memory project_obc_to_operand_secret_bridge.md).
- TempoStack object storage — tempo-chunks bucket; ExternalSecret at clusters/spoke-dc-v6/platform-services/tracing/externalsecret-tempo-storage.yaml.
For multi-site replication, the canonical pattern is MinIO site replication — bi-directional async replication of object data between two MinIO deployments at separate sites. The framework doesn’t ship a MinIO replication config today; that’s part of the forward-looking DR design that comes with reintroducing hub-dr-v6 / spoke-dr-v6.
Current status: not configured. The dc-lab has one MinIO; the test-replication target will have its own. Cross-site replication is out of scope until DR is.
Vault DR pairing (forward-looking)
The lab’s Vault is a 3-node Raft cluster with HashiCorp transit auto-unseal on a fourth node. Performance replicas and DR replicas are Vault Enterprise features; the lab runs Vault OSS, so DR is a hot-standby pattern at the Raft level, not the Vault Enterprise DR-replication path.
For a replicated site:
- Each site stands up its own Vault Raft cluster (3 nodes + transit seal).
- Each site’s Vault is the source-of-truth for its site’s credentials; cross-site secret sync (if needed) is a separate concern (ESO with cross-site SecretStores, or a dedicated sync controller).
- Vault transit auto-unseal nodes can be kept independent per site — no cross-site dependency.
Current status: per-site Vault, no cross-site pairing. When DR is reintroduced, the choice is “per-DR-site independent Vault” vs “single Vault serving both sites with Vault Enterprise DR replication.” Today’s design assumes the former.
RHACS Central replication considerations
RHACS Central is hub-side and serves the entire fleet. For multi-site:
- Central can run on the hub of each site (independent Centrals per site) — clusters at each site report to their local Central. Useful when sites are operationally independent.
- Central can run on one hub and be the single Central for both sites — every cluster reports to that one Central. Useful when fleet-wide policy and view is required.
The init-bundle generation pattern (memory reference_rhacs_init_bundle_via_api.md) is site-independent: POST /v1/cluster-init/init-bundles with admin creds; bundle pushed to Vault for ESO delivery in the stackrox namespace. Each site that runs its own Central uses its own init bundle.
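The API call named above can be sketched as a request builder. The endpoint path and the use of admin credentials come from the text; the JSON body shape, basic-auth header, host name, bundle name, and function name are assumptions, and the request is built but deliberately not sent.

```python
import base64
import json
import urllib.request


def build_init_bundle_request(central_url, username, password, bundle_name):
    """Build (but do not send) the POST that asks Central for an init bundle."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{central_url.rstrip('/')}/v1/cluster-init/init-bundles",
        data=json.dumps({"name": bundle_name}).encode(),  # assumed body shape
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical host and bundle name; credentials would come from Vault.
request = build_init_bundle_request(
    "https://central.example.com/", "admin", "not-a-real-password", "spoke-init")
print(request.get_method(), request.full_url)
```

Because each site that runs its own Central generates its own bundle, only the Central URL and the Vault path the bundle is written to would vary per site.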
Current status: single Central on hub-dc-v6 covering the v6 fleet. Replicated-site decision is per-site Central by default, with an explicit ADR if a site wants to consume an existing Central from another site.
Ready vs pending
| Capability | Ready | Pending |
|---|---|---|
| Profile schema (REP-1) | Yes | Schema covers all observed sections (REP-2.2 unified); no known gaps |
| Profile validation (validate-profile.py) | Yes | 295/295 PASS on dc-lab |
| Handoff rendering (REP-2 Phase 1) | Yes for 3 handoffs | REP-2.1: +8 docs converted; RUNBOOK.md still pending |
| GitLab bootstrap automation | Design + module skeleton + Terraform bootstrap module + runner-token helper | Full repo seeding (REP-3.4) |
| scripts/rebuild/ profile-aware | Pattern + all scripts refactored (REP-4.2 closed) | none |
| make new-site | Skeleton + R-chain templates + L/PCI/FG-DH chains + Projects v2 GraphQL integration in flight | Final chain template content + cross-link validation |
| Memory → runbooks migration | Yes | none |
| End-to-end proof | No | REP-7 dry-run rebuild on test infra |
| MinIO replication | n/a | Forward-looking; requires DR |
| Vault DR pairing | n/a | Forward-looking; requires DR |
| RHACS multi-site | n/a | Forward-looking; requires DR |
Where the work continues
Next visible chunks (per REP-0 status, post-Wave 3, current as of 2026-05-12):
- REP-2.1 — finish RUNBOOK.md parameterization and the last connection-details holdouts. Eight more handoff docs have already been templated; the remaining work is mostly the long RUNBOOK.
- REP-3.4 — full repo seeding (infra-ops, platform-services, tenant-registry, divisions). The Terraform bootstrap module (REP-3.2) and runner-token helper (REP-3.3) are in. Seeding is the last gap before the GitLab content layer is reproducible end-to-end.
- REP-5.2 / 5.3 / 5.4 / 5.5 — finalize the chain template content (L, PCI, FG-DH) and validate the Projects v2 GraphQL integration end-to-end against a throwaway milestone.
- Test infra — REP-7 needs hardware. Could be budget-friendly Proxmox or cloud VMs.
REP-4.2 is now closed, which removes one of REP-7’s gating dependencies.
REP-0 itself remains open as the long-running parent issue.
References
- REP-0 #144 — meta + strategy
- REP-1 #145 — profile schema (CLOSED)
- REP-2 #146 — handoff templates (CLOSED Phase 1)
- REP-2.1 #241 — remaining handoff parameterization
- REP-2.2 #243 — profile unification (CLOSED)
- REP-3 #147 — GitLab bootstrap (OPEN Phase 2)
- REP-3.2/3.3/3.4 #235/#236/#237
- REP-4 #148 — scripts refactor (OPEN Phase 1)
- REP-4.2 #244 — remaining scripts (CLOSED)
- REP-5 #149 — make new-site (OPEN Phase 1)
- REP-5.2/3/4/5 #238/#239/#240/#242
- REP-6 #150 — memory → runbooks (CLOSED)
- REP-7 #151 — dry-run rebuild (BLOCKED)
- Milestone Site Replication Readiness #31
- ADR 0001, 0015, 0018, 0019, 0022, 0023, 0024 — the foundational decisions the framework relies on