Backup and Disaster Recovery
Back up the hub, back up the workloads, and design for a real datacenter loss — RPO, RTO, the cluster-backup operator, OADP, and the drills that actually validate the plan.
The interview question for a multicluster platform team is “what happens when the hub goes.” The polished answer talks about ApplicationSet reconciliation and a cold restore target. The honest answer is that most teams have never actually tested it, and the day they do, the restore fails because the CSI snapshotter on the destination cluster is from a different vendor.
This module works at the design level: what to protect, what tools to use, what RPO and RTO targets to set, and how to validate the whole thing with a drill that isn’t theatre.
What you’re actually protecting
There are two distinct things, and they need separate stories.
(a) The hub itself. The hub does not run customer workloads, but it owns a great deal of state. The list:
- ManagedCluster CRs (the inventory of who is in the fleet).
- ManagedClusterSet, ManagedClusterSetBinding, Placement, PlacementDecision (the segmentation).
- Policy, PolicyTemplate, PolicySet (the governance baseline).
- ApplicationSet, Application, AppProject (the GitOps fan-out).
- KlusterletAddonConfig (the agent enablement).
- The Vault custody for init-bundles, ESO ClusterSecretStore configs, the hub’s own argocd-server-tls.
- Argo CD’s argocd-secret and the GitOps repo credentials.
If the hub disappears, the spokes carry on running — their workloads do not care — but drift, policy enforcement, image scanning fan-out, and observability collection all stop. The fleet becomes a herd of unmanaged clusters until a new hub comes back.
(b) Each managed cluster’s stateful workload data. This is the per-application story: the Deployments, the StatefulSets, the ConfigMaps and Secrets, and the persistent volumes attached to databases and message queues. Most of the config of a managed cluster is rebuilt from GitOps. Most of the data is on disks that GitOps does not know about.
These two stories use different tools, even though under the hood it is Velero either way.
The cluster-backup operator (ACM)
ACM ships a bundled operator called cluster-backup. It is OADP under the hood — same Velero, same plugins — but pre-configured with the schedules and resource selectors that match the hub’s CR inventory. You enable it on the hub, point it at object storage, and it produces three rolling backups:
- acm-credentials-schedule — Secret material the hub depends on.
- acm-resources-schedule — the ACM CRs and the policy and application objects.
- acm-managed-clusters-schedule — the registration data needed to re-establish spoke connections after a restore.
For the lab’s enable/disable runbook, see /docs/openshift-platform/openshift-platform/acm-multicluster/acm-cluster-backup/.
It is disabled on hub-dc-v6 today. The doc explains why — the lab is single-DC and the restore target (hub-dr-v6) has not been built yet — but the operator itself is a 30-second install once the DR pair lands.
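For orientation, once the operator is enabled the three schedules listed earlier are driven by a single BackupSchedule CR. The following is a minimal sketch, not the lab's configuration: it assumes the operator's usual namespace (open-cluster-management-backup), an hourly cadence, and a Velero storage location that already exists in that namespace. The resource name and the TTL are illustrative.

```yaml
# Sketch only: name, cadence, and TTL are assumptions, not lab values.
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: BackupSchedule
metadata:
  name: schedule-acm                       # hypothetical name
  namespace: open-cluster-management-backup
spec:
  veleroSchedule: "0 * * * *"              # cron: run at the top of every hour
  veleroTtl: 120h                          # how long each backup is kept before expiry
```

The cadence here is what bounds the hub's RPO, so it is worth choosing it from the target rather than from a default.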
OADP — the per-cluster story
OADP is Red Hat’s productised Velero. Same APIs, same CRDs, with a small wrapper that integrates with the OpenShift operator catalogue and bundles the OpenShift plugin. It is what you put on each managed cluster to back up workloads.
The unit of configuration is a DataProtectionApplication CR. It declares one or more BackupStorageLocation blocks (where to write — usually an S3-compatible endpoint) and one or more VolumeSnapshotLocation blocks (which CSI snapshotter to use for PVCs). Once that CR is reconciled, you create normal Backup and Restore CRs and Velero does the work.
The two failure modes that always bite:
- You forgot the OpenShift plugin. Vanilla Velero does not know about Route, ImageStream, BuildConfig, or SecurityContextConstraints. The velero-plugin-for-openshift plugin teaches it; without it, your backup is missing the OpenShift-specific surface.
- You forgot CSI snapshotting. Without EnableCSI set and a VolumeSnapshotLocation, Velero falls back to file-by-file restic copies of PVs. That works but is slow and brittle on large volumes. CSI snapshots are atomic, fast, and the only reasonable choice for stateful workloads.
For the lab’s install recipe and the DataProtectionApplication pattern, see /docs/openshift-platform/openshift-platform/backup-oadp/operator-install/.
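To make the shape of the CR concrete, here is a hedged sketch of a DataProtectionApplication that guards against both failure modes above by listing the openshift and csi plugins explicitly. The bucket, prefix, endpoint URL, and Secret name are hypothetical placeholders, and the exact fields accepted can vary between OADP releases, so treat it as a starting point and compare it with the lab's install recipe.

```yaml
# Sketch only: bucket, prefix, s3Url, and credential names are hypothetical.
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: dpa-sample                         # hypothetical name
  namespace: openshift-adp
spec:
  configuration:
    velero:
      defaultPlugins:
        - openshift                        # Route, ImageStream, BuildConfig, SCC handling
        - aws                              # S3-compatible object storage driver
        - csi                              # turns on CSI snapshot support for PVCs
  backupLocations:
    - velero:
        provider: aws
        default: true
        objectStorage:
          bucket: oadp-backups             # hypothetical bucket
          prefix: spoke-dc-v6              # hypothetical prefix
        config:
          region: minio                    # placeholder; S3-compatible stores often ignore it
          s3ForcePathStyle: "true"         # path-style addressing, commonly needed off AWS
          s3Url: https://minio.example.internal:9000   # hypothetical endpoint
        credential:
          name: cloud-credentials          # Secret holding the access key pair
          key: cloud
```

The snapshotLocations block is left out on purpose because its contents depend on the storage provider; add it to match the cluster's snapshotter.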
The picture
Reading the diagram:
- The hub runs cluster-backup, which writes ACM-specific CR dumps and registration data to the object store.
- Each managed cluster runs OADP, which uses the CSI snapshotter for PVs and writes workload manifests (plus snapshot metadata) to the same object store.
- On disaster, a cold DR hub is built; it reads the latest backup from object store and reconstitutes the ACM control plane. The spokes — which never went down — re-register against the new hub.
- The dashed-green edges are the active backup streams. The dashed-grey edge is the rare-path restore.
The hub-restore process
This is the part that is usually written down in a runbook and never tested. The steps:
- Provision a new OCP cluster. This is the long step — 30 minutes to an hour, depending on whether you have a pre-installed cold hub or are installing from scratch.
- Install the ACM operator. Subscribe; wait for the MCH (MultiClusterHub) CR; install the cluster-backup operator on top of it.
- Apply the BackupStorageLocation pointing at the same object store that the lost hub was writing to. This is what lets cluster-backup find the latest backup.
- Apply a Restore CR: oc create -f restore.yaml. The Restore references the latest acm-credentials, acm-resources, and acm-managed-clusters backups.
- Wait. The Restore reconciles in stages: credentials first, then resources, then managed-cluster registration data. Total time on a small fleet: ~10 minutes.
- The spokes re-register. The klusterlet agents on each managed cluster were configured with a hub URL and a bootstrap kubeconfig. They keep trying to connect. The new hub presents a different signing CA than the old one, so the agents’ existing certificates are no longer trusted; the agents fall back to the bootstrap kubeconfig and resubmit a CSR; the CSR auto-approve loop on the new hub signs them; the spoke is back.
RTO budget: 30 minutes if a cold hub is already provisioned and the restore is well-tested. 2 hours if you are building the hub from scratch. Whatever it is, write it down before the disaster, not during.
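The restore.yaml applied in the steps above could look roughly like the following. This is a sketch under assumptions: the resource name is made up, the namespace is the cluster-backup operator's usual one, and cleanupBeforeRestore is set to None on the premise that the new hub is empty. Check the field values against the ACM version actually installed.

```yaml
# Sketch only: restore all three backup types onto a fresh, empty hub.
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Restore
metadata:
  name: restore-acm                        # hypothetical name
  namespace: open-cluster-management-backup
spec:
  cleanupBeforeRestore: None               # nothing to clean up on a new hub
  veleroCredentialsBackupName: latest      # newest acm-credentials backup
  veleroResourcesBackupName: latest        # newest acm-resources backup
  veleroManagedClustersBackupName: latest  # newest acm-managed-clusters backup
```

Using latest for all three keeps the runbook free of backup names that would have to be looked up mid-incident; pinning explicit names is the alternative when a known-good backup must be used instead.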
RPO and RTO targets
Pick targets first, then verify the tooling supports them. Defaults that are sane for most labs:
| Surface | RPO | RTO |
|---|---|---|
| Hub state (ACM CRs, GitOps) | 1 hour | 30 min (warm DR), 4 hr (cold DR) |
| Workload manifests | 1 hour | 30 min |
| PV data (databases, queues) | 1 hour | 1 to several hours (volume copy time) |
| Object storage itself | governed by storage replication | depends on cross-region setup |
For a BFSI-scoped tenant the bar is higher: RPO <= 15 minutes, RTO <= 4 hours. That changes the design: hub backups have to run every 15 minutes, the DR pair has to be warm (a running OCP cluster that periodically applies the latest backup, even if it is not the active hub), and the object store has to be cross-region replicated. See the BFSI scoring at /docs/openshift-platform/foundations/bfsi-readiness-review/ for the full criteria.
The mistake to avoid is declaring RPO and RTO based on what you wish the tooling could do. Run a real drill; measure; adjust the targets to the truth.
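One way to keep a target honest is to encode it directly in the backup cadence. The sketch below uses a Velero Schedule for a single workload namespace at a 1-hour RPO; the schedule name, the demo-app namespace, and the TTL are illustrative assumptions. For a 15-minute target the cron expression would become "*/15 * * * *", which only holds if a backup reliably finishes inside that window.

```yaml
# Sketch only: names and TTL are assumptions chosen for illustration.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: demo-app-hourly                    # hypothetical name
  namespace: openshift-adp
spec:
  schedule: "0 * * * *"                    # hourly, so worst-case data loss is about 1 hour
  template:                                # same fields as a Backup spec
    includedNamespaces:
      - demo-app
    ttl: 168h0m0s                          # expire each backup after 7 days
```

Measuring how long each scheduled backup takes during a drill is what tells you whether the interval, and therefore the RPO, is achievable.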
Storage backend matters
Velero writes to S3-compatible object storage. Two ways that fact becomes a quiet disaster:
- Same fault domain. You put the object store on the same Ceph cluster (or the same NooBaa instance, or the same DC’s MinIO) as the data being backed up. The fault that takes out your workloads also takes out their backups. Backups should live somewhere genuinely independent: a different DC, a different cloud region, an off-site appliance.
- No replication. Even a separate object store is not enough if it itself can fail. Production-grade BFSI setups put the backup destination on a bucket that replicates to a second region, either via the S3 vendor’s own replication or via a separate cross-region copy pipeline.
The lab today uses MinIO on a dedicated VM (30.30.30.14). It is single-instance. That is fine for the lab; it is not fine for production. The work to add a replicated second-DC destination is on the roadmap.
Validating restores
A backup that has never been restored is not a backup; it is a hopeful object in a bucket. The discipline that separates real DR from theatre is the quarterly drill:
- Spin up a sandbox OCP cluster.
- Apply the BackupStorageLocation pointing at the production object store (read-only credentials).
- Create a Restore from last week’s hub backup, into the sandbox.
- Verify: do the ManagedCluster CRs come back? Does the Policy framework reconcile? Does a sample Application sync? Does the SecuredCluster show up under Central?
- Write down what failed. The failures will be small at first — a missing label, a Secret that did not get backed up, a Route hostname that conflicts — and they compound if not fixed.
The first drill takes a full day. The fifth drill takes an hour. The day a real disaster lands, you want to be on drill number five, not drill number one.
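For the second drill step, Velero's BackupStorageLocation has an accessMode field that can back up the read-only credentials with a read-only location, so a sandbox cannot write into or prune the production bucket by accident. A minimal sketch, assuming an S3-compatible store; the location name, bucket, endpoint, and Secret are hypothetical, and on OADP-managed clusters the location may instead be declared through the DataProtectionApplication.

```yaml
# Sketch only: bucket, endpoint, and credential names are hypothetical.
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: prod-readonly                      # hypothetical name
  namespace: open-cluster-management-backup
spec:
  provider: aws
  accessMode: ReadOnly                     # Velero will not create or delete backups here
  objectStorage:
    bucket: acm-hub-backups                # hypothetical production bucket
  config:
    region: minio                          # placeholder for S3-compatible stores
    s3ForcePathStyle: "true"
    s3Url: https://minio.example.internal:9000   # hypothetical endpoint
  credential:
    name: drill-readonly-credentials       # Secret with read-only keys
    key: cloud
```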
RHACS Central backup-and-restore
The ACM DR story is well-trodden. RHACS is a different shape — Central is a stateful service with its own PostgreSQL, its own certificate authority, and a small handful of credentials that ripple through the fleet’s security posture if lost. It’s not in scope for the ACM cluster-backup operator covered earlier; wire it separately.
What gets backed up
Central is, under the hood, a PostgreSQL database plus a small set of certificates and configuration. A backup is a logical dump of that PostgreSQL and captures the entire RHACS state: every Policy (defaults plus your custom ones), every integration configuration (Notifiers, scanners, SIEM forwarders), every active violation and exception with its expiry, the init-bundle materials for each SecuredCluster, role bindings, auth-provider configuration, and the Central CA with its derived certificates. What does not get backed up is image scan data — Scanner V4 treats its vulnerability cache as regeneratable and re-scans on demand after restore.
The two backup paths
roxctl central backup is the manual path: from a host that can reach Central with an admin API token, produce a tarball locally and upload to S3 yourself. Fine for ad-hoc snapshots before a risky operation.
Central’s built-in scheduled backup is the production path. Configure a backup integration on Central (Amazon S3, S3-compatible storage, or Google Cloud Storage), set a daily cron, and Central writes the same tarball to the configured destination on its own. Configuration lives in Central, so it survives upgrades; destination credentials follow the same Notifier+ESO custody flow as every other integration. Red Hat formally supports AWS S3 and GCS; S3-API-compatible third-party stores (MinIO, NooBaa, Wasabi) usually work but aren’t guaranteed — test the restore path before you trust it.
Retention
A reasonable lab default: 30 days hot, 1 year cold — daily backups in the primary bucket, monthly snapshots archived to cheaper storage. The bucket lifecycle policy enforces this; Central doesn’t manage retention itself.
For PCI-DSS audit trails, the regulator-driven number matters more than the operational one — some banking regulators require 7 years of policy-change and enforcement event history. The RHACS backup itself doesn’t have to be retained that long; the violations export forwarded to your SIEM (via the integrations section earlier) does. Treat SIEM as the audit-grade record and Central backup as the recovery-grade record. Keep them separate.
The restore procedure
A restore is a sequence, not a single command. Skip a step and the result will look healthy but have subtle gaps. The strict order:
- Stop Central. Scale deploy/central to zero replicas. The database must not be in use.
- Restore PostgreSQL from the backup dump into Central’s database (pg_restore for external Postgres, or per the Operator procedure for the bundled StatefulSet).
- Restore the Central encryption key. Load-bearing and easy to forget. Central encrypts sensitive integration credentials with a key in the central-encryption-key Secret; without it, every integration’s stored credential decrypts to garbage.
- Restore the Central TLS certificates (central-tls). This is the CA the SecuredCluster sensors validate against when they reconnect.
- Restart Central. Watch the startup logs; decryption errors here mean a missed or wrong encryption key.
- Re-register every SecuredCluster. The init-bundles in the restored database may have rotated since the backup was taken; sensors on each spoke may have certificates that no longer validate against the restored CA. Regenerate init-bundles via the Central API and reapply them to each spoke’s stackrox namespace.
A practiced restore on a small fleet runs 30–60 minutes. Under pressure for the first time, longer.
The init-bundle gotcha
RHACS init-bundles are time-limited credentials — one year is the convention. After restoring from a backup older than the bundle’s expiry, every sensor with an expired bundle fails to reconnect. Symptom: TLS handshake failures in oc -n stackrox logs deploy/sensor; the SecuredCluster CR on the spoke shows Unhealthy. The fix is the same flow as fresh onboarding: regenerate via the Central API, materialise through Vault and ESO into each spoke’s stackrox namespace, let the sensor pick up the new cert. See the lab’s init-bundle-via-ESO documentation. Document this as load-bearing — the first instinct after a restore is to blame the network; it’s usually the bundles.
The lab’s posture
RHACS Central runs on hub-dc-v6 in the stackrox namespace. Scheduled backup is not yet wired — the same MinIO and DR work that gates ACM cluster-backup gates this. The path forward is a backup integration pointed at MinIO with a daily cron, plus a second-DC replication target when that infrastructure lands. The hub-restore runbook needs an explicit “restore RHACS Central” step, after the ACM control plane is back and before the SecuredClusters are expected to reconnect. For credential rotation, see /docs/openshift-platform/operations/routine-tasks/rotate-rhacs-central-admin/ — the Vault+ESO custody flow that handles the admin password handles the backup-destination credentials too.
References
- docs.redhat.com — RHACS Backup and restore
- docs.redhat.com — RHACS Configuring automatic backups
- docs.redhat.com — RHACS init-bundle generation
The lab’s current state
Pragmatically, where the lab is today:
- ACM cluster-backup is disabled on hub-dc-v6. It will be enabled when the DR pair is built. The runbook in /docs/openshift-platform/openshift-platform/acm-multicluster/acm-cluster-backup/ walks through enabling it.
- OADP is installed on spoke-dc-v6 with a DataProtectionApplication CR pointing at the lab’s MinIO. It is exercised periodically for the workloads that have real PVs.
- hub-dr-v6 is reserved in the fleet design but not yet provisioned. It is the planned restore target. See /docs/openshift-platform/architecture-decisions/adr-0022-v6-fleet-purge/ for the architectural decision.
- MinIO is single-instance on a dedicated VM. Adequate for the lab; the replication work to a second DC’s object store is tracked separately.
The BFSI readiness review rates DR as 🔴 High — the missing DR pair is the load-bearing gap. See /docs/openshift-platform/foundations/bfsi-readiness-review/.
Try this
- Read the DataProtectionApplication CR on spoke-dc-v6.

      oc --context spoke-dc-v6 -n openshift-adp get dpa -o yaml

  Note the backupLocations (BackupStorageLocation, pointing at MinIO), the snapshotLocations, and the defaultPlugins list. The OpenShift plugin should be in there.

- Trigger a manual backup of one namespace and restore it to a different namespace.

      oc -n openshift-adp create -f - <<EOF
      apiVersion: velero.io/v1
      kind: Backup
      metadata: { name: drill-1, namespace: openshift-adp }
      spec:
        includedNamespaces: [ "demo-app" ]
      EOF

  Wait until oc -n openshift-adp get backup drill-1 shows Completed. Then create a Restore with namespaceMapping: demo-app: demo-app-restored. The restored namespace should appear with the same workloads.

- Sketch the runbook for “hub-dc-v6 datacenter lost: recover on hub-dr-v6 within 4 hours.” Include: who has access to the object store, who runs the restore command, how the spokes find the new hub, and what the verification checklist is. Write it down before there is a disaster.
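For the second exercise, the Restore that remaps the namespace might look like this. It is a minimal sketch that assumes the drill-1 backup from that exercise has completed; the Restore's own name is arbitrary.

```yaml
# Sketch only: restores the drill-1 backup into a differently named namespace.
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: drill-1-restore                    # hypothetical name
  namespace: openshift-adp
spec:
  backupName: drill-1                      # the Backup created in the exercise
  includedNamespaces:
    - demo-app
  namespaceMapping:
    demo-app: demo-app-restored            # source namespace: target namespace
```

Restoring into a separate namespace keeps the drill from colliding with the live workload, which is what makes it safe to repeat.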
Common failure modes
Backup succeeds but restore fails because the destination cluster’s CSI is different. CSI snapshots are tied to a specific provisioner. A snapshot taken by csi.ovirt.org does not restore to a cluster that only has cephfs.csi.ceph.com. Mitigation: keep the DR cluster’s storage classes aligned with the source’s, or use restic file-level copies for cross-storage portability (slower, but vendor-agnostic).
Velero plugin version mismatch between source and destination. Velero serialises object-store schemas, and a Velero 1.13 plugin reading 1.11 data sometimes throws on a previously valid field. Mitigation: keep source and destination on the same OADP/Velero versions; bump them together.
PVs come back bound to old node selectors. If a PV was provisioned with nodeAffinity against specific node names, restoring it to a cluster with different node names leaves it Pending forever. Mitigation: use storage classes that do not pin to nodes, or strip nodeAffinity on restore via a Velero resource modifier.
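The resource-modifier mitigation can be sketched as a ConfigMap of JSON-patch rules referenced from the Restore. This assumes the Velero version in use supports resource modifiers (upstream added them in 1.12); the ConfigMap name, the regex, and the backup name are hypothetical. A remove patch may fail on PVs that carry no nodeAffinity, so the regex should be narrowed to the volumes that actually need it.

```yaml
# Sketch only: names and regex are hypothetical; verify support in your OADP release.
apiVersion: v1
kind: ConfigMap
metadata:
  name: strip-pv-node-affinity             # hypothetical name
  namespace: openshift-adp
data:
  rules.yaml: |
    version: v1
    resourceModifierRules:
      - conditions:
          groupResource: persistentvolumes
          resourceNameRegex: "^pvc-.*$"    # narrow this to the affected PVs
        patches:
          - operation: remove              # JSON patch: drop the node pinning
            path: "/spec/nodeAffinity"
---
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: drill-restore-no-affinity          # hypothetical name
  namespace: openshift-adp
spec:
  backupName: drill-1                      # hypothetical backup to restore from
  resourceModifier:
    kind: ConfigMap
    name: strip-pv-node-affinity
```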
ImageStreams and BuildConfigs do not come back. Vanilla Velero does not know about OpenShift-specific types. Mitigation: install the OpenShift plugin (velero-plugin-for-openshift), which is the default in OADP — but verify it is in the defaultPlugins list of the DPA.
Object-store quota silently fills. Velero retention defaults are conservative; without rotation, the bucket grows unbounded. Mitigation: set TTLs on Backup CRs, and configure a lifecycle policy on the bucket itself.
BFSI cross-link
The BFSI readiness review treats DR as a 🔴 High risk so long as the reserved DR pair (hub-dr-v6) is not built. The criteria are spelled out at /docs/openshift-platform/foundations/bfsi-readiness-review/. Until then, the lab’s posture is single-DC with rebuild-from-GitOps as the fallback — adequate for a lab, not adequate for a regulated tenant.
References
- Red Hat ACM cluster-backup-and-restore documentation: https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/latest/html/business_continuity/
- OADP project: https://www.oadp-operator.com/
- Velero upstream: https://velero.io/
- Velero docs (backup, restore, troubleshooting): https://velero.io/docs/
- velero-plugin-for-openshift: https://github.com/openshift/openshift-velero-plugin
- Open Cluster Management, backup-restore design: https://open-cluster-management.io/