Site Replication Readiness

The REP-* tracker — promoting opp-full-plat + dc-lab into a replicable site-builder framework: profile schema, GitLab bootstrap, scripts refactor, template generators, and the dry-run gate.

Site Replication Readiness (milestone #31) is the long-running effort to promote the opp-full-plat workspace and the dc-lab instance into a replicable site-builder framework: another team takes the package, runs it against their own site profile, and ends up with the same platform shape. This page documents what the framework is, where it stands today, the REP-* tracker, and the gap between “framework exists” and “framework proven.”

The PCI-DSS baseline that closed on 2026-05-11 is the first repeatable end-to-end exercise on the v6 fleet. REP-* is what makes the next site repeat it without reaching for chat history or undocumented context.

The eight layers (L0–L7) a new site must cross

REP-0 (the meta issue) defines the layers any new lab build has to traverse, with each layer’s current state:

Layer	What	Current state
L0 Site prereqs	Network, DNS resolver, IP/MAC plan, hardware BOM	Partial template under `plans/disconnected-rebuild/environments/dc-lab/`
L1 VM platform tools	HAProxy, PDNS, MinIO, Nexus, Jenkins, Trivy, Vault, SigNoz, DefectDojo, docker-runtime, GitLab	Per-VM ADRs (0008–0014); `scripts/rebuild/*` per service — partial scripts, mostly dc-lab-specific until REP-4 refactor lands
L2 GitLab content bootstrap	Groups, role groups, projects, runners, CODEOWNERS, branch protection	`gitlab-operator-guide.md` describes target; REP-3 Phase-1 has `terraform/gitlab-bootstrap/` design
L3 OpenShift install	oc-mirror, install-config, hub + spoke deploy	`RUNBOOK.md` + `connection-details/` — partial, manual; agent-install path
L4 Day-2 OpenShift platform	ACM, GitOps, mirrors, catalogs, ODF	`platform-gitops` repo on GitLab — fully GitOps, replicable
L5 Operator queue	22-operator install track	Project board #10 — replicable as template
L6 Compliance	PCI-0 → PCI-5 chain	Compliance handbook + 6 phase issues + ADR 0020 — methodology solid, parameterization missing
L7 App delivery	GitLab → Jenkins → Trivy → Nexus → runtime → MinIO evidence	ADR 0014 contract; demo-smoke / liberty-smoke / openliberty-readiness-probe exist — partial reference apps

The point is: every layer has an ADR (or set of ADRs), a checked-in plan or template, and (eventually) a profile-driven script. The framework is coherent; it just isn’t complete yet.

REP-* tracker

[Figure: REP-* dependency graph. REP-0 (#144) is the parent meta at the top and REP-7 (#151), the proof gate, is at the bottom. Solid arrows are sequence dependencies; dashed red arrows are “must close before” relationships that gate REP-7. Per-issue state is listed in the table below.]

Issue	State	Artifact (today)
REP-0 #144	OPEN (long-running parent)	n/a
REP-1 #145 — profile schema + dc-lab extraction	CLOSED	`plans/disconnected-rebuild/site-profile-schema.md` + `plans/disconnected-rebuild/environments/dc-lab/profile.yaml` + `scripts/validate-profile.py` (295/295 PASS)
REP-2 #146 — parameterize handoffs	CLOSED (Phase 1)	`scripts/render-handoffs.py` + 3 handoffs as `.j2` templates; dc-lab byte-clean render
REP-2.1 #241 — parameterize remaining docs	OPEN (in flight)	Jinja templates parameterized for 8 more handoff/connection-details docs; RUNBOOK.md still pending
REP-2.2 #243 — unify the two profile.yaml files	CLOSED	Single canonical profile at `environments/<site>/profile.yaml`; dc-lab `profile.yaml` unified into the single canonical schema; schema doc updated
REP-3 #147 — GitLab content bootstrap	OPEN (Phase 2)	Phase 1 design + scaffold; Phase 2 GitLab Terraform bootstrap module landing + runner-token helper
REP-3.2/3.3/3.4 #235/#236/#237	OPEN (active)	GitLab Terraform bootstrap module covers groups + role groups + branch protection (#235); runner-token helper landed (#236); full repo seeding (#237) is the remaining work
REP-4 #148 — `scripts/rebuild/` profile-aware refactor	OPEN (Phase 1)	`scripts/rebuild/REFACTOR-PATTERN.md` + `lib/profile.sh` + 3 reference refactors; 11/11 smoke PASS
REP-4.2 #244 — refactor remaining `scripts/rebuild/*`	CLOSED	Profile-aware refactor of every remaining `scripts/rebuild/*` script — the pattern is now uniform across Nexus, Jenkins, Vault, SigNoz, DefectDojo, Monitoring, docker-runtime, WSO2, Trivy, edge-DNS, MinIO, vm-provisioning, developer-smoke
REP-5 #149 — `make new-site` generator	OPEN (Phase 1)	`Makefile` + `scripts/new-site.py` + `templates/issues/r-chain/` R0-R6
REP-5.2/3/4/5 #238/#239/#240/#242	OPEN (active)	L-chain (#238), PCI-chain (#239), FG-DH-chain (#240) issue templates progressing; GitHub Projects v2 GraphQL integration (#242) wired so `make new-site` populates the project board
REP-6 #150 — notes → runbooks	CLOSED	5 operator notes migrated to `runbooks/*` + `connection-details/signoz.md`
REP-7 #151 — dry-run rebuild	OPEN (BLOCKED — proof gate)	Comment listing preconditions

REP-1 — site profile schema (closed)

The first deliverable: define a single, authoritative site profile schema capturing every parameter that varies per site. Extract today’s dc-lab values out of scattered files into a single environments/dc-lab/profile.yaml instance of the schema.

A representative skeleton (excerpted from REP-1’s design):

site:
  name: dc-lab
  domain: sub.comptech-lab.com
  bootstrap_admin: zahid
  trust_domain: comptech-lab

network:
  machine_subnet: 30.30.0.0/16      # lab /24 allocation pattern
  reserved_node_subnet: 30.30.75.0/24
  dns_resolver: <DNS VM in lab /24>
  haproxy_edge: <HAProxy edge in lab /24>
  ipv6_disabled: true               # superseded by ADR 0026

clusters:
  hub:
    name: hub-dc-v6
    api_vip: <hub api vip>
    ingress_vip: <hub ingress vip>
    masters:
      - { name: master-0, ip: <…> }
      - { name: master-1, ip: <…> }
      - { name: master-2, ip: <…> }
    workers: []                     # compact
  spoke:
    name: spoke-dc-v6
    api_vip: <spoke api vip>
    ingress_vip: <spoke ingress vip>
    masters: [...]
    workers:
      - { name: worker-0, cpu: 128, mem_gib: 503 }
      - { name: worker-1, cpu: 128, mem_gib: 503 }
      - { name: worker-2, cpu: 128, mem_gib: 503 }    # GPU worker

vm_platform:
  haproxy:    { role: edge }
  pdns:       { role: auth+recursor }
  minio:      { role: object-storage }
  nexus:      { role: image-supply, endpoints: [mirror-registry, docker-group, app-registry] }
  vault:      { role: credential-custody, nodes: 3 }
  jenkins:    { role: ci }
  signoz:     { role: vm-observability }
  trivy:      { role: scanner }
  defectdojo: { role: security-dashboard }
  docker-runtime: { role: developer-runtime }

gitops:
  hub_repo: openshift-platform-gitops
  governance_repo: openshift-governance
  webhook_secret_path: secret/apps/<tenant>/ci/webhook

governance:
  pci_phases: [PCI-0, PCI-1, PCI-1.13, PCI-2, PCI-3, PCI-4, PCI-5]
  fg_phases:  [FG-0, FG-1, FG-2, FG-3, FG-4, FG-5, FG-6]
  dh_phases:  [DH-1, DH-2, DH-3, DH-4, DH-5, DH-6, DH-7, DH-8, DH-9]

Specific IPs and node addresses are intentionally placeholdered for the public version — the live profile carries those concrete values. The schema shape is the public artifact.

scripts/validate-profile.py passed 295/295 checks against the dc-lab profile when REP-1 closed. The validator enforces schema shape, value types, cross-references (for example, a cluster’s api_vip must be inside its subnet), and the no-overlap rule between cluster ranges.
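The two network rules the validator is described as enforcing (VIP inside subnet, no overlap between cluster ranges) can be sketched with Python's standard `ipaddress` module. This is a minimal illustration, not the contents of `scripts/validate-profile.py`: the function name `check_cluster_network`, the `node_range` field, and the example addresses are assumptions made for the sketch.

```python
import ipaddress


def check_cluster_network(profile):
    """Return a list of violations; an empty list means the checks pass."""
    errors = []
    subnet = ipaddress.ip_network(profile["network"]["machine_subnet"])

    # Cross-reference rule: every cluster VIP must sit inside the machine subnet.
    for name, cluster in profile["clusters"].items():
        for key in ("api_vip", "ingress_vip"):
            vip = ipaddress.ip_address(cluster[key])
            if vip not in subnet:
                errors.append(f"{name}.{key} {vip} is outside {subnet}")

    # No-overlap rule: per-cluster node ranges must be pairwise disjoint.
    # `node_range` is a hypothetical field name used only for this sketch.
    ranges = [(n, ipaddress.ip_network(c["node_range"]))
              for n, c in profile["clusters"].items()]
    for i, (name_a, net_a) in enumerate(ranges):
        for name_b, net_b in ranges[i + 1:]:
            if net_a.overlaps(net_b):
                errors.append(f"{name_a} range {net_a} overlaps {name_b} range {net_b}")
    return errors


# Illustrative values only; the live profile carries the real addresses.
example = {
    "network": {"machine_subnet": "30.30.0.0/16"},
    "clusters": {
        "hub": {"api_vip": "30.30.10.5", "ingress_vip": "30.30.10.6",
                "node_range": "30.30.10.0/24"},
        "spoke": {"api_vip": "30.30.20.5", "ingress_vip": "30.31.20.6",
                  "node_range": "30.30.20.0/24"},
    },
}
for problem in check_cluster_network(example):
    print(problem)
```

Returning a list of violations rather than raising on the first one fits a validator that reports a pass count such as 295/295: every rule is evaluated, and the caller decides how to summarize.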

REP-2 — parameterize handoff docs (closed Phase 1)

The three operator handoff docs (platform-admin-handoff.md, gitlab-operator-guide.md, compliance-implementor-handbook.md) were converted into Jinja-style .j2 templates with {{ profile.* }} references. scripts/render-handoffs.py reads the profile and renders site-specific markdown into environments/<site>/rendered/.

Phase-1 acceptance: render the templates against dc-lab/profile.yaml and diff against today’s checked-in handoff docs. The diff must be empty (or only cosmetic whitespace). This was achieved.
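The acceptance check (render, then diff against the checked-in doc) can be sketched as follows. To stay dependency-free, the sketch substitutes `{{ profile.* }}` placeholders with a small regular expression instead of using Jinja itself, so it is a stand-in for the render step, not the logic of `scripts/render-handoffs.py`. The function names and the one-line template are assumptions.

```python
import difflib
import re

# Matches placeholders such as {{ profile.site.domain }}.
PLACEHOLDER = re.compile(r"\{\{\s*profile\.([A-Za-z0-9_.]+)\s*\}\}")


def lookup(profile, dotted):
    """Walk a dotted path through nested dicts; unknown keys raise KeyError."""
    value = profile
    for part in dotted.split("."):
        value = value[part]
    return value


def render(template, profile):
    """Replace every {{ profile.<path> }} placeholder with the profile value."""
    return PLACEHOLDER.sub(lambda m: str(lookup(profile, m.group(1))), template)


def render_diff(template, profile, checked_in):
    """Return unified-diff lines between the checked-in doc and the render."""
    rendered = render(template, profile)
    return list(difflib.unified_diff(
        checked_in.splitlines(), rendered.splitlines(),
        fromfile="checked-in", tofile="rendered", lineterm=""))


profile = {"site": {"domain": "sub.comptech-lab.com"},
           "clusters": {"hub": {"name": "hub-dc-v6"}}}
template = "Hub cluster: {{ profile.clusters.hub.name }} ({{ profile.site.domain }})\n"
checked_in = "Hub cluster: hub-dc-v6 (sub.comptech-lab.com)\n"

diff = render_diff(template, profile, checked_in)
print("byte-clean" if not diff else "\n".join(diff))
```

An empty diff list corresponds to the "byte-clean render" acceptance condition; any non-empty diff points at the exact line where the template and the checked-in doc disagree.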

REP-2.1 (#241) is now in flight: eight more handoff/connection-details documents have been converted to .j2 templates with {{ profile.* }} references and render byte-clean against the unified dc-lab profile. RUNBOOK.md remains the last large conversion target; the remaining per-service connection-details docs (nexus.md, jenkins.md, minio.md, vault-app-secrets.md, and the rest) are mechanical follow-ups against the same render pipeline.

REP-2.2 — profile unification (closed)

REP-1 and REP-2 each landed a different profile.yaml for dc-lab:

REP-1 canonical at plans/disconnected-rebuild/environments/dc-lab/profile.yaml — 432 lines, flat operators list, top_level_groups / subgroups split, validated by scripts/validate-profile.py.
REP-2 handoff at environments/dc-lab/profile.yaml — 223 lines, superset shape with gitops, governance, image_supply_reports sections.

Two YAMLs for the same site = drift trap. REP-2.2 unified them at environments/<site>/profile.yaml as the canonical path, extending the REP-1 schema to absorb REP-2’s extra sections; the dc-lab profile was migrated into the unified schema. The old path was retired with a deprecation note.

REP-3 — GitLab content bootstrap (open, Phase 2)

The GitLab operator guide describes the target federated structure (groups, role groups, repos, runners, branch protection, CODEOWNERS). Without automation, each new site would have to click through the GitLab UI for hours and would likely diverge from that target.

Phase-1 delivered the design + scaffold:

plans/disconnected-rebuild/gitlab-bootstrap-design.md — the design.
terraform/gitlab-bootstrap/ — Terraform module skeleton using the gitlabhq/gitlab provider.
scripts/gitlab/render-codeowners.py — Python helper that renders CODEOWNERS files per repo from the profile’s repo_role_map.
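As a rough illustration of the CODEOWNERS helper, the sketch below renders one file from a role map. The shape of `repo_role_map` (repo name, then path pattern, then a list of role groups) and the group names are assumptions; the real profile may structure this differently.

```python
def render_codeowners(repo, repo_role_map):
    """Render CODEOWNERS text for one repo from a {pattern: [groups]} map."""
    rules = repo_role_map[repo]
    lines = [f"# CODEOWNERS for {repo} (generated from the site profile)"]
    # Patterns are emitted in profile order: put the catch-all first so that
    # more specific patterns listed later can override it.
    for pattern, groups in rules.items():
        owners = " ".join(f"@{group}" for group in groups)
        lines.append(f"{pattern} {owners}")
    return "\n".join(lines) + "\n"


# Hypothetical role map; group names are placeholders, not the real ct-* groups.
repo_role_map = {
    "platform-gitops": {
        "*": ["ct-platform-admins"],
        "/clusters/": ["ct-platform-admins", "ct-cluster-operators"],
    },
}
print(render_codeowners("platform-gitops", repo_role_map), end="")
```

Generating the file from the profile keeps ownership rules reviewable in one place and avoids hand-edited CODEOWNERS files drifting between sites.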

Phase-2 (active) is the actual Terraform bootstrap module:

REP-3.2 #235 — the GitLab Terraform bootstrap module landed: top-level groups + 11 ct-* role groups + group share permissions + project-level branch protection are now declarable from the profile.
REP-3.3 #236 — runner-token helper landed: generates GitLab runner registration tokens and writes them to Vault so the planned runner classes can be brought up against the same custody path the rest of the framework uses.
REP-3.4 #237 — full repo seeding (infra-ops, platform-services, tenant-registry, divisions) is the remaining gap before a new site can stand up its GitLab content end-to-end from the profile.

REP-4 — scripts/rebuild profile-aware refactor (open, Phase 1)

Before REP-4, each script under scripts/rebuild/* (Nexus, Jenkins, Vault, SigNoz, DefectDojo, Monitoring, docker-runtime, WSO2, Trivy, edge-DNS, MinIO, vm-provisioning, developer-smoke) took its site-specifics from pwd-local files or hardcoded constants. REP-4 introduces a shared lib/profile.sh with profile_init, profile_get, profile_get_list, profile_require, and dry_run_or_run helpers, plus a 6-step refactor pattern.
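The real helpers live in shell (lib/profile.sh). To keep this page's examples in one language, the sketch below is a Python analogue of the contract those helper names suggest. The semantics (dotted-key lookup, fail-fast on missing keys, printing instead of executing in dry-run mode) are inferred from the names and are assumptions, not a description of the shell code.

```python
import shlex
import subprocess


class Profile:
    """Python analogue of the lib/profile.sh helper contract (assumed semantics)."""

    def __init__(self, data, dry_run=True):      # analogue of profile_init
        self.data = data
        self.dry_run = dry_run

    def get(self, dotted_key, default=None):     # analogue of profile_get
        node = self.data
        for part in dotted_key.split("."):
            if not isinstance(node, dict) or part not in node:
                return default
            node = node[part]
        return node

    def get_list(self, dotted_key):              # analogue of profile_get_list
        value = self.get(dotted_key, [])
        return list(value) if isinstance(value, (list, tuple)) else [value]

    def require(self, *dotted_keys):             # analogue of profile_require
        missing = [k for k in dotted_keys if self.get(k) is None]
        if missing:
            raise KeyError("profile is missing: " + ", ".join(missing))

    def dry_run_or_run(self, argv):              # analogue of dry_run_or_run
        if self.dry_run:
            return "DRY-RUN: " + shlex.join(argv)
        subprocess.run(argv, check=True)
        return "RAN: " + shlex.join(argv)


p = Profile({"site": {"name": "dc-lab"},
             "vm_platform": {"nexus": {"endpoints": ["mirror-registry",
                                                      "docker-group",
                                                      "app-registry"]}}})
p.require("site.name")
print(p.dry_run_or_run(["echo", "bootstrap", p.get("site.name")]))
```

Defaulting to dry-run mirrors the intent of a rebuild script that should show its plan before it touches a VM; the caller opts in to execution explicitly.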

Phase-1 reference refactors:

scripts/rebuild/signoz/bootstrap-signoz.sh — converted.
scripts/rebuild/minio/configure-ci-evidence-bucket.sh — converted.
scripts/rebuild/edge-dns/setup-haproxy-rke2.sh — converted.

Smoke test scripts/rebuild/lib/test-profile-lib.sh passes 11/11. REP-4.2 #244 — the mechanical pass over the remaining scripts/rebuild/* — is closed. The refactor pattern is now uniform across Nexus, Jenkins, Vault, SigNoz, DefectDojo, Monitoring, docker-runtime, WSO2, Trivy, edge-DNS, MinIO, vm-provisioning, and developer-smoke. Every script reads its site-specifics from profile.yaml via lib/profile.sh; no hardcoded constants or pwd-local files remain.

REP-5 — make new-site generator (open, Phase 1)

A new site rollout needs the same GitHub structure dc-lab has: a milestone, a project board with a custom Status, and a long chain of phase issues (R0–R6, PCI-0–PCI-5, FG-0–FG-6, DH-1–DH-9, REP, OPS). Manually clicking them into existence is error-prone and inconsistent.

make new-site SITE=<name> reads the site profile and creates the full GitHub scaffold: milestone + project board + all phase issues + cross-references, with bodies templated from the existing dc-lab issues.

Phase-1 delivered:

Makefile with the new-site target.
scripts/new-site.py — the generator.
templates/issues/r-chain/ — R0–R6 issue bodies as templates.

Open sub-issues (active):

REP-5.2 #238 — L-chain (L0-L7) issue templates progressing; each layer the framework defines now has an issue body template the generator can populate.
REP-5.3 #239 — PCI-chain (PCI-0..PCI-5) issue templates progressing, modeled on the closed dc-lab PCI chain.
REP-5.4 #240 — FG-DH-chain (FG-0..FG-6 + DH-1..DH-9) issue templates progressing.
REP-5.5 #242 — GitHub Projects v2 GraphQL integration wired so make new-site creates the milestone, the project board with a custom Status field, and links every generated phase issue onto the board as one batched operation.
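The scaffold that make new-site creates can be pictured as a pure planning step that runs before any GitHub API call. The sketch below only builds that plan; it does not talk to GitHub and is not the logic of `scripts/new-site.py`. The title format, the milestone name, and the initial "Todo" status are assumptions. The phase lists come from the governance section of the profile skeleton above.

```python
def plan_site_scaffold(site, governance):
    """Build the milestone and phase-issue plan for a new site (no API calls)."""
    milestone = f"{site} rollout"                     # hypothetical naming
    chains = {
        "R": [f"R{i}" for i in range(7)],             # R0..R6
        "PCI": governance.get("pci_phases", []),
        "FG": governance.get("fg_phases", []),
        "DH": governance.get("dh_phases", []),
    }
    issues = []
    for chain, phases in chains.items():
        for phase in phases:
            issues.append({
                "title": f"[{site}] {phase}",         # hypothetical title format
                "chain": chain,
                "milestone": milestone,
                "status": "Todo",                     # hypothetical initial Status
            })
    return {"milestone": milestone, "issues": issues}


governance = {
    "pci_phases": ["PCI-0", "PCI-1", "PCI-1.13", "PCI-2", "PCI-3", "PCI-4", "PCI-5"],
    "fg_phases": ["FG-0", "FG-1", "FG-2", "FG-3", "FG-4", "FG-5", "FG-6"],
    "dh_phases": ["DH-1", "DH-2", "DH-3", "DH-4", "DH-5", "DH-6", "DH-7", "DH-8", "DH-9"],
}
plan = plan_site_scaffold("test-replication", governance)
print(plan["milestone"], len(plan["issues"]))
```

Separating the plan from the API calls makes the generator easy to dry-run and to diff between sites before anything is created on the board.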

REP-6 — memory → runbooks (closed)

Per-session operator notes are invisible to another operator. For replicability, the lessons had to migrate into tracked opp-full-plat docs.

Closed at commit c27dd5f. Migrated artifacts:

Operator note	Migrated to
`project_jenkins_pollscm_bootstrap.md`	`runbooks/jenkins-gitlab-webhook-pollscm.md`
`project_lab_cred_drift.md`	`runbooks/secrets-custody-drift-check.md` (+ tracking issues per stale credential)
`project_signoz_v0122_auth.md`	`connection-details/signoz.md` (Known gotchas)
`project_wso2_jms_url_encoding.md`	`runbooks/wso2-apim-jms-url-encoding-trap.md`
IPv6/OVN-K incident memory	`runbooks/openshift-ipv6-disable-correct-approach.md`
MCO stuck-node memory	`runbooks/mco-stuck-node-recovery.md`

REP-6 codified the distinction maintained going forward: behavioral feedback (how the agent operates) stays in memory; operational knowledge (what any operator needs) goes to runbooks/ADRs/connection-details.

REP-7 — the dry-run gate (open, BLOCKED)

This is the proof gate for the entire framework. The framework is only “replication-ready” when it actually replicates. REP-7 plans a full end-to-end rebuild against parallel test infrastructure using only the artifacts produced by REP-1..REP-6. If anything requires reaching for chat history or undocumented memory, that’s a gap → close it before declaring REP-0 done.

Plan:

Provision parallel test infra. Separate VM hypervisor / network segment / hardware corner. Minimum: 3 VMs for the management VM stack + 6 VMs for OCP hub+spoke. Less than production footprint; just enough to exercise every layer.
Build a test-replication profile at environments/test-replication/profile.yaml with new domain, new IPs (different /24), new hostnames. Reuse the REP-1 schema unchanged.
Execute layer-by-layer using only the framework artifacts:
- L0: Site prereqs from rendered handoff (REP-2).
- L1: VM platform via refactored scripts/rebuild/* (REP-4).
- L2: GitLab content via REP-3 bootstrap automation.
- L3: OpenShift install via rendered RUNBOOK + scripts.
- L4: Day-2 platform via platform-gitops template + Argo bootstrap pattern.
- L5: Operator queue via project board template (REP-5) + 22-operator install MRs.
- L6: Compliance via PCI-0..PCI-5 phase issues (REP-5) + ADR 0020 baseline.
- L7: App delivery via developer-smoke reference apps + ADR 0014 contract.
Track gaps as REPLAY-N issues — anything requiring undocumented action gets a parallel sub-issue marked “blocking REP-7 closure.”
Record full elapsed time — wall-clock from “fresh hardware” to “first PCI scan ran” is the framework-quality metric.

Success criteria:

Test-site OpenShift cluster running, all 22+1 (compliance) operators installed.
oc get nodes shows test-site nodes.
platform-gitops fork on test-site GitLab reconciles cleanly.
PCI-DSS scan ran and produced results (PASS/FAIL irrelevant; the scan happening is the proof).
Developer pipeline did one round-trip: GitLab → Jenkins → Trivy → Nexus → docker-runtime → MinIO evidence.
Zero ad-hoc lookups during execution; every step traceable to a checked-in artifact.

REP-7 is blocked on: REP-2.1 (remaining handoffs + RUNBOOK), REP-3.4 (full repo seeding; REP-3.2 + REP-3.3 are landed), REP-5.2/3/4/5 (remaining chain template content + Projects v2 end-to-end validation), AND test infrastructure availability. REP-4.2 is no longer a blocker.
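The gating logic above reduces to a small check: REP-7 can start only when every listed sub-issue is closed and test infrastructure exists. A minimal sketch, with the function name and the state dictionary as assumptions:

```python
# Sub-issues that must close before REP-7, as listed on this page.
REQUIRED_CLOSED = ["REP-2.1", "REP-3.4", "REP-5.2", "REP-5.3", "REP-5.4", "REP-5.5"]


def rep7_blockers(issue_states, test_infra_available):
    """Return the list of things still blocking REP-7 (empty means unblocked)."""
    blockers = [issue for issue in REQUIRED_CLOSED
                if issue_states.get(issue, "OPEN") != "CLOSED"]
    if not test_infra_available:
        blockers.append("test infrastructure")
    return blockers


# States as described on this page: all gating sub-issues open, no test infra.
today = {issue: "OPEN" for issue in REQUIRED_CLOSED}
today["REP-4.2"] = "CLOSED"   # closed, and not part of the gate anymore
print(rep7_blockers(today, test_infra_available=False))
```

Treating an unknown issue state as OPEN keeps the gate conservative: a missing entry blocks the dry run instead of silently passing.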

Hub + spoke site-replication strategy

Site replication isn’t only about the framework artifacts; it’s about what shape a replicated site takes. The strategy: every replicated site is one hub + one spoke (the dc-lab pattern), with the same role split:

Hub: compact 3-master, all-in-one (control plane + worker roles), no ODF, no application workload, runs ACM + OpenShift GitOps + per-cluster Argo CD. The hub is the management plane, deliberately storage-light.
Spoke: 3 VM masters + N physical workers, runs ODF (or equivalent), all application workloads, ACM-registered to the hub, runs its own Argo CD that reconciles its cluster directory in platform-gitops.

Future v6-era DR adds two more clusters (hub-dr-v6, spoke-dr-v6), each mirroring the role of its non-DR counterpart. ADR 0022 reserves the names; no manifests target them today. When DR is reintroduced, the work goes through a new tracked issue + ADR amendment.

MinIO replication patterns (forward-looking)

MinIO at each site backs three workloads:

OADP backup target — oadp-backups bucket; cloud-credentials Secret on the spoke.
LokiStack object storage — loki-chunks bucket via NooBaa OBC; ESO bridges the OBC’s uppercase AWS_* Secret keys and its ConfigMap into the lowercase keys the LokiStack storage Secret expects (memory project_obc_to_operand_secret_bridge.md).
TempoStack object storage — tempo-chunks bucket; ExternalSecret at clusters/spoke-dc-v6/platform-services/tracing/externalsecret-tempo-storage.yaml.
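The OBC-to-LokiStack bridge mentioned in the list above is a key-renaming step. In the platform that renaming is done by ESO, not by Python; the sketch below only shows the data mapping. The specific key names on both sides (`AWS_ACCESS_KEY_ID`, `BUCKET_HOST`, `access_key_id`, and so on) and the `https` scheme are assumptions based on common ObjectBucketClaim and LokiStack S3 Secret layouts, so verify them against the operator versions in use.

```python
def bridge_obc_to_lokistack(obc_secret, obc_configmap):
    """Map OBC-style uppercase keys to LokiStack-style lowercase Secret keys."""
    host = obc_configmap["BUCKET_HOST"]
    port = obc_configmap["BUCKET_PORT"]
    return {
        "access_key_id": obc_secret["AWS_ACCESS_KEY_ID"],
        "access_key_secret": obc_secret["AWS_SECRET_ACCESS_KEY"],
        "bucketnames": obc_configmap["BUCKET_NAME"],
        "endpoint": f"https://{host}:{port}",   # scheme is an assumption
    }


# Placeholder values; real credentials never belong in a doc or a repo.
obc_secret = {"AWS_ACCESS_KEY_ID": "EXAMPLEKEY", "AWS_SECRET_ACCESS_KEY": "EXAMPLESECRET"}
obc_configmap = {"BUCKET_HOST": "s3.example.invalid", "BUCKET_PORT": "443",
                 "BUCKET_NAME": "loki-chunks"}
print(sorted(bridge_obc_to_lokistack(obc_secret, obc_configmap)))
```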

For multi-site replication, the canonical pattern is MinIO site replication — bi-directional async replication of object data between two MinIO deployments at separate sites. The framework doesn’t ship a MinIO replication config today; that’s part of the forward-looking DR design that comes with reintroducing hub-dr-v6 / spoke-dr-v6.

Current status: not configured. The dc-lab has one MinIO; the test-replication target will have its own. Cross-site replication is out of scope until DR is.

Vault DR pairing (forward-looking)

The lab’s Vault is a 3-node Raft cluster with transit auto-unseal provided by a fourth node. Performance replication and DR replication are Vault Enterprise features; the lab runs Vault OSS, so DR is a hot-standby pattern at the Raft level, not the Vault Enterprise DR-replication path.

For a replicated site:

Each site stands up its own Vault Raft cluster (3 nodes + transit seal).
Each site’s Vault is the source-of-truth for its site’s credentials; cross-site secret sync (if needed) is a separate concern (ESO with cross-site SecretStores, or a dedicated sync controller).
Vault transit auto-unseal nodes can be kept independent per site — no cross-site dependency.

Current status: per-site Vault, no cross-site pairing. When DR is reintroduced, the choice is “per-DR-site independent Vault” vs “single Vault serving both sites with Vault Enterprise DR replication.” Today’s design assumes the former.

RHACS Central replication considerations

RHACS Central is hub-side and serves the entire fleet. For multi-site:

Central can run on the hub of each site (independent Centrals per site) — clusters at each site report to their local Central. Useful when sites are operationally independent.
Central can run on one hub and be the single Central for both sites — every cluster reports to that one Central. Useful when fleet-wide policy and view is required.

The init-bundle generation pattern (memory reference_rhacs_init_bundle_via_api.md) is site-independent: POST /v1/cluster-init/init-bundles with admin creds; bundle pushed to Vault for ESO delivery in the stackrox namespace. Each site that runs its own Central uses its own init bundle.
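The init-bundle call described above can be sketched by constructing the HTTP request without sending it. The endpoint path comes from the text; the Central URL, the basic-auth scheme for the admin credentials, and the `name` body field are assumptions to confirm against the RHACS API reference for the deployed version. Pushing the returned bundle to Vault is out of scope for the sketch.

```python
import base64
import json
import urllib.request


def build_init_bundle_request(central_url, admin_user, admin_password, bundle_name):
    """Build (but do not send) the POST that asks Central for a new init bundle."""
    token = base64.b64encode(f"{admin_user}:{admin_password}".encode()).decode()
    body = json.dumps({"name": bundle_name}).encode()   # body shape is an assumption
    return urllib.request.Request(
        url=f"{central_url.rstrip('/')}/v1/cluster-init/init-bundles",
        data=body,
        method="POST",
        headers={"Authorization": f"Basic {token}",
                 "Content-Type": "application/json"},
    )


# Hypothetical Central URL and placeholder credentials.
req = build_init_bundle_request("https://central.example.invalid/", "admin",
                                "not-a-real-password", "test-replication-spoke")
print(req.get_method(), req.full_url)
```

Building the request separately from sending it keeps the site-independent part (path, body, auth header) testable without a live Central.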

Current status: single Central on hub-dc-v6 covering the v6 fleet. Replicated-site decision is per-site Central by default, with an explicit ADR if a site wants to consume an existing Central from another site.

Ready vs pending

Capability	Ready	Pending
Profile schema (REP-1)	Yes	Schema covers all observed sections (REP-2.2 unified); no known gaps
Profile validation (`validate-profile.py`)	Yes	295/295 PASS on dc-lab
Handoff rendering (REP-2 Phase 1)	Yes for 3 handoffs	REP-2.1: +8 docs converted; RUNBOOK.md still pending
GitLab bootstrap automation	Design + module skeleton + Terraform bootstrap module + runner-token helper	Full repo seeding (REP-3.4)
`scripts/rebuild/` profile-aware	Pattern + all scripts refactored (REP-4.2 closed)	none
`make new-site`	Skeleton + R-chain templates + L/PCI/FG-DH chains + Projects v2 GraphQL integration in flight	Final chain template content + cross-link validation
Memory → runbooks migration	Yes	none
End-to-end proof	No	REP-7 dry-run rebuild on test infra
MinIO replication	n/a	Forward-looking; requires DR
Vault DR pairing	n/a	Forward-looking; requires DR
RHACS multi-site	n/a	Forward-looking; requires DR

Where the work continues

Next visible chunks (per REP-0 status, post-Wave 3, current as of 2026-05-12):

REP-2.1 — finish RUNBOOK.md parameterization and the last connection-details holdouts. Eight more handoff docs have already been templated; the remaining work is mostly the long RUNBOOK.
REP-3.4 — full repo seeding (infra-ops, platform-services, tenant-registry, divisions). The Terraform bootstrap module (REP-3.2) and runner-token helper (REP-3.3) are in. Seeding is the last gap before the GitLab content layer is reproducible end-to-end.
REP-5.2 / 5.3 / 5.4 / 5.5 — finalize the chain template content (L, PCI, FG-DH) and validate the Projects v2 GraphQL integration end-to-end against a throwaway milestone.
Test infra — REP-7 needs hardware. Could be budget-friendly Proxmox or cloud VMs.

REP-4.2 is now closed, which removes one of REP-7’s gating dependencies.

REP-0 itself remains open as the long-running parent issue.

References

REP-0 #144 — meta + strategy
REP-1 #145 — profile schema (CLOSED)
REP-2 #146 — handoff templates (CLOSED Phase 1)
REP-2.1 #241 — remaining handoff parameterization
REP-2.2 #243 — profile unification (CLOSED)
REP-3 #147 — GitLab bootstrap (OPEN Phase 2)
REP-3.2/3.3/3.4 #235/#236/#237
REP-4 #148 — scripts refactor (OPEN Phase 1)
REP-4.2 #244 — remaining scripts (CLOSED)
REP-5 #149 — make new-site (OPEN Phase 1)
REP-5.2/3/4/5 #238/#239/#240/#242
REP-6 #150 — memory → runbooks (CLOSED)
REP-7 #151 — dry-run rebuild (BLOCKED)
Milestone Site Replication Readiness #31
ADR 0001, 0015, 0018, 0019, 0022, 0023, 0024 — the foundational decisions the framework relies on