Architecture: hub, managed clusters, addons

The components running on the hub, the agents running on each managed cluster, how they talk, the identity model, the addon framework, sizing constraints, and where things break.

This module is the map. It names every component you’ll see in logs and oc get output, says what each one does, and shows where they live. Once you can sketch this diagram from memory, ACM debugging gets a lot easier.

Component map

[Diagram: component map]
Hub cluster: multiclusterhub-operator; registration controller (cluster-manager); work controller; observability controller + Thanos receive; search-collector + PostgreSQL; governance-policy-propagator.
Managed cluster: klusterlet; work-agent; registration-agent; observability-addon; application-manager + policy-controllers.

Reading the diagram:

Solid black edges are control-plane flows happening on the hub. The governance-policy-propagator, for example, takes a Policy and writes a ManifestWork for each placed cluster.
Dashed green animated edges are spoke-initiated: registration heartbeats, the work-agent watching for ManifestWork. The spoke reaches the hub; not the other way around.
Dashed grey edges are telemetry: metrics pushed to Thanos receive on the hub, resource events fed to the search index.

The hub has more components than the spoke. That’s by design: the hub holds desired state and the cross-cluster view; the spoke runs a thin, mostly stateless agent layer.

The Hub

When you install the Advanced Cluster Management for Kubernetes operator from OperatorHub and create a MultiClusterHub CR, you get the following components, in roughly this order of importance:

Component	Role
multiclusterhub-operator	Watches the `MultiClusterHub` CR and reconciles every other component to it. The “operator of operators” for ACM on the hub.
cluster-manager (registration controller)	Watches `ManagedCluster` CRs; approves the CSRs that spokes submit; tracks join state, available/unavailable, and conditions.
klusterlet-addon-controller	Decides which addons each managed cluster gets, and writes the matching `ManagedClusterAddOn` resources.
work controller	Owns `ManifestWork` CRs in each cluster’s namespace on the hub. Spokes pull from these.
governance-policy-propagator	Watches `Policy` CRs in policy namespaces; for each binding+placement, writes a per-cluster copy of the policy as a `ManifestWork` to be applied on the target.
search-collector + PostgreSQL	The search index. PostgreSQL holds the indexed events; the collector consumes from the managed clusters and keeps the index fresh. The browser console’s Search page queries this.
observability controller + Thanos receive	If you enable observability, ACM installs Thanos receive on the hub plus a stock Grafana. Every managed cluster’s observability-addon pushes metrics here.
console (MCH console plugin)	A console plugin that adds the All Clusters, Search, Governance, and Applications views to the OpenShift web console.
MultiClusterEngine (MCE)	A subset of ACM exposed as its own operator. Provides the registration + work + Hive + HyperShift bits. ACM installs MCE if it isn’t already present.

In a default install, the hub namespaces you’ll spend time in are open-cluster-management, open-cluster-management-hub, multicluster-engine (where the MCE controllers run), and a namespace per managed cluster named for that cluster (where its ManifestWork and addon-leases live).

What’s optional, what isn’t

Some hub components only show up if you opt in. The defaults in the MultiClusterHub CR (spec.overrides.components) tell you which:

Required-always: multiclusterhub-operator, cluster-manager, work controller, console plugin.
Default-on: policy propagator, application-manager, search.
Opt-in: observability (needs an object store like NooBaa, ODF, or S3-compatible), submariner, gitops-addon, hypershift-addon.
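As a rough illustration of the toggle mechanism, the fragment below sketches what entries under spec.overrides.components can look like. The component keys used here (search, cluster-backup) are assumptions based on upstream naming and can differ between releases, so read the keys your installed MultiClusterHub actually reports before copying them.

```yaml
# Sketch only: a MultiClusterHub fragment that toggles two components.
# The component keys are assumptions; list the real ones on your hub with
#   oc get multiclusterhub -n open-cluster-management -o yaml
apiVersion: operator.open-cluster-management.io/v1
kind: MultiClusterHub
metadata:
  name: multiclusterhub
  namespace: open-cluster-management
spec:
  overrides:
    components:
      - name: search          # a default-on component, switched off here
        enabled: false
      - name: cluster-backup  # an off-by-default component, switched on here
        enabled: true
```

Any component you do not list keeps whatever default the operator ships with, so the list only needs to carry your deviations.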

The lab keeps the hub storage-light deliberately and runs observability off NooBaa-backed object storage rather than full ODF. See /docs/openshift-platform/openshift-platform/cluster-topology/hub-dc-v6.

Managed clusters

Each managed cluster, after it joins the hub, runs a set of agents in the open-cluster-management-agent namespace plus a per-addon namespace for each enabled addon:

Component	Role
klusterlet	The bootstrap controller. Lives in `open-cluster-management-agent`. Creates the rest of the agent components and manages their lifecycle.
registration-agent	Submits a CSR to the hub at join time; once approved, owns the rotating client cert that the agents use to talk to the hub. Sends heartbeats.
work-agent	Watches `ManifestWork` resources in this cluster’s namespace on the hub and applies them locally.
observability-addon	If observability is enabled, scrapes the cluster’s own Prometheus and pushes metrics to the hub’s Thanos receive.
application-manager	The “old-style” subscription-channel application lifecycle controller. Used by older ACM apps; you can disable it if you only ship via ApplicationSets.
cert-policy-controller	Implements cert-related policies (certificate expiration, key length, etc.) for GRC.
config-policy-controller	The big one. Implements `ConfigurationPolicy` — enforcement/inform across arbitrary Kubernetes resources. Most of GRC runs through here.
iam-policy-controller	Implements identity-related policies.
policy-controller	The framework that wires the policy-* controllers together.
gitops-addon	If enabled by the hub, installs OpenShift GitOps onto the managed cluster automatically. (This is the component that shipped the rogue Route CRD discussed in Module 01.)

The agents are deliberately small. On a managed cluster, the total ACM footprint is on the order of half a dozen Deployments and a few hundred MiB of memory. The hub is where the volume sits.
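Which of these agents land on a given spoke is controlled from the hub. A minimal sketch, assuming the per-cluster KlusterletAddonConfig resource is the switchboard and that the cluster is registered as spoke-dc-v6; the exact set of fields varies by release, so treat the keys below as illustrative rather than complete.

```yaml
# Sketch only: per-cluster addon switches, stored in the cluster's namespace on the hub.
# Field names are assumptions that vary by release; verify them with
#   oc explain klusterletaddonconfig.spec
apiVersion: agent.open-cluster-management.io/v1
kind: KlusterletAddonConfig
metadata:
  name: spoke-dc-v6        # same as the ManagedCluster name
  namespace: spoke-dc-v6   # the per-cluster namespace on the hub
spec:
  applicationManager:
    enabled: false         # off if you only ship via ApplicationSets
  certPolicyController:
    enabled: true
  policyController:
    enabled: true
  searchCollector:
    enabled: true
```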

Communication patterns

Three communication patterns matter, and they’re all spoke-initiated:

Spoke -> Hub: registration. At join time, the registration-agent uses the bootstrap kubeconfig to reach the hub, submits a CSR, and waits for the hub’s cluster-manager to approve it. Once approved, the agent receives a working client cert that’s scoped to a single per-cluster identity on the hub. From this point forward, the spoke uses its own identity; the bootstrap kubeconfig is not used again in normal operation.

Spoke -> Hub: work pull. The work-agent watches ManifestWork resources in its per-cluster namespace on the hub. When the hub creates new ManifestWork (because, say, the policy propagator placed a Policy here), the agent sees it via a watch, fetches the embedded manifests, applies them locally, and writes status back into the ManifestWork.status. The hub then sees that status.

Spoke -> Hub: telemetry. The observability-addon pushes metrics via remote-write to Thanos receive on the hub. The search-collector likewise pushes resource-event data to the hub’s search index. Heartbeats from the registration-agent keep the spoke marked Available.

mTLS everywhere. Every spoke-to-hub call uses mTLS. The certs are auto-rotated by the registration loop, and hub-side RBAC limits each spoke’s identity to its own namespace plus a small set of read-only cluster-scope resources.

When you debug “the spoke can’t reach the hub,” the layers to check are, in order: DNS, whether the hub API endpoint is reachable from the spoke, the bootstrap secret (bootstrap-hub-kubeconfig) in open-cluster-management-agent, and finally the renewed client cert (hub-kubeconfig-secret) in the same namespace.
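To make the work-pull pattern concrete, here is a minimal ManifestWork sketch. It assumes a managed cluster registered as spoke-dc-v6; the ConfigMap payload and its name are hypothetical and only there to show where the embedded manifests go.

```yaml
# Sketch only: the hub-side object the work-agent watches.
# It lives in the per-cluster namespace on the hub, not on the spoke.
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  name: example-greeting       # hypothetical name
  namespace: spoke-dc-v6       # per-cluster namespace on the hub
spec:
  workload:
    manifests:
      # Each entry is a complete manifest the work-agent applies on the spoke.
      - apiVersion: v1
        kind: ConfigMap
        metadata:
          name: greeting
          namespace: default
        data:
          message: hello from the hub
```

After the work-agent applies it, the outcome shows up as conditions under status on the same object, which is what the hub reads back.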

Identity model

The identity model is small but easy to get wrong if you’ve never written one of these:

ManagedCluster is a cluster-scoped resource on the hub, one per real Kubernetes cluster. The name doesn’t have to match the cluster’s actual name; it’s whatever you registered it as.
ManagedClusterSet groups ManagedClusters. Sets are how you say “the prod-eu fleet” or “the dc-v6 spokes.” Sets are also the RBAC boundary for who can Placement-target which clusters.
ManagedClusterSetBinding binds a set into a specific namespace. Without a binding, you cannot Placement-target a set from that namespace. This is the gate that stops a tenant from accidentally targeting clusters they shouldn’t see.
Placement is a CR that says “select these ManagedClusters from the bound sets, applying these predicates.” A Placement produces a PlacementDecision (just the list of cluster names it resolved to).
PlacementDecision is the read side. Most consumers (ApplicationSet’s clusterDecisionResource generator, the policy propagator) watch PlacementDecision, not Placement.

The clean way to think about it: ManagedClusterSet is your grouping, Placement is your query, PlacementDecision is the result. Everything downstream consumes the result.
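The grouping, query, and result chain can be sketched as four objects. This is an illustration under assumptions: the set name dc-v6, the tenant namespace team-a, the environment label, and the API versions shown are examples and should be checked against the CRDs installed on your hub.

```yaml
# Sketch only: grouping -> binding -> query -> result.
apiVersion: cluster.open-cluster-management.io/v1beta2
kind: ManagedClusterSet
metadata:
  name: dc-v6                  # the grouping; clusters join via the label
                               # cluster.open-cluster-management.io/clusterset=dc-v6
---
apiVersion: cluster.open-cluster-management.io/v1beta2
kind: ManagedClusterSetBinding
metadata:
  name: dc-v6                  # must match the set name
  namespace: team-a            # hypothetical tenant namespace
spec:
  clusterSet: dc-v6
---
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: dc-v6-prod
  namespace: team-a            # can only see sets bound into this namespace
spec:
  clusterSets:
    - dc-v6
  predicates:
    - requiredClusterSelector:
        labelSelector:
          matchLabels:
            environment: prod  # hypothetical label
---
# Written by the placement controller, shown here only to illustrate the result.
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: PlacementDecision
metadata:
  name: dc-v6-prod-decision-1
  namespace: team-a
  labels:
    cluster.open-cluster-management.io/placement: dc-v6-prod
status:
  decisions:
    - clusterName: spoke-dc-v6
      reason: ""
```

You author the first three; the fourth is output, and it is the one downstream consumers watch.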

Addon framework

The pluggable system that makes ACM extensible: a pair of CRDs.

ClusterManagementAddOn is a cluster-scoped resource on the hub, one per addon (e.g., one for observability, one for application-manager, one for governance-policy-framework). It describes the addon’s name and where its config lives.
ManagedClusterAddOn is namespaced — one per (addon × managed cluster). When the hub places an addon onto a cluster, the klusterlet-addon-controller writes one of these into the per-cluster namespace. The spoke’s klusterlet picks it up and installs the addon’s agent locally.
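A minimal sketch of the pair, assuming a hypothetical addon called example-addon and a managed cluster registered as spoke-dc-v6. The API version and the installNamespace field reflect the upstream addon API as an assumption; newer releases may express some of this differently.

```yaml
# Sketch only: the hub-wide declaration of an addon...
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: ClusterManagementAddOn
metadata:
  name: example-addon            # hypothetical addon name
spec:
  addOnMeta:
    displayName: Example Addon
    description: Illustrative addon used only in this sketch.
---
# ...and its per-cluster instance, one per (addon x managed cluster).
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: ManagedClusterAddOn
metadata:
  name: example-addon            # must match the ClusterManagementAddOn name
  namespace: spoke-dc-v6         # per-cluster namespace on the hub
spec:
  installNamespace: open-cluster-management-agent-addon  # where the agent runs on the spoke
```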

Most ACM features ship as addons under the covers: governance-policy-framework, application-manager, observability-controller, search-collector, work-manager, cert-policy-controller, hypershift-addon, gitops-addon, submariner. When you toggle a component in the MultiClusterHub CR’s overrides.components, you’re really toggling the underlying addon.

Custom addons exist too: any third party (or you) can write a ClusterManagementAddOn and a controller that watches ManagedClusterAddOn resources. This is the contract ACS, OpenShift GitOps, and the lab’s own internal tooling use to plug in.

Sizing and constraints

Some numbers worth carrying in your head. These are rules of thumb, not contracts — read the official sizing guidance before you commit to a SKU:

A small hub managing up to ~25 clusters: 2 vCPU, 8 GiB RAM, and ~50 GiB SSD on each of three control-plane nodes is enough. Observability off, search on.
A medium hub managing ~100 clusters: 4-8 vCPU, 16-32 GiB RAM, and ~100 GiB SSD on each of three control-plane nodes. Observability on.
A large hub (~1000+ clusters): production sizing, separate infra nodes, ODF block storage, dedicated workers for Thanos. At this size, observability metric volume is the dominant constraint.

What scales linearly: registration (each new cluster adds a CSR and a watch). Manageable up to a few hundred clusters per hub.

What scales sub-linearly: work distribution. ManifestWork is pulled, not pushed; adding clusters adds watch overhead but the per-spoke cost is small.

What scales poorly: the search index past a few hundred clusters. PostgreSQL with the indexed-event volume becomes the bottleneck. The official Red Hat guidance is to consider splitting fleets across hubs or accepting search degradation past that point.

Storage is the most common surprise. The PostgreSQL search backend wants SSD; spinning disks make queries miserable. Thanos receive wants object storage — for the lab, that’s NooBaa over MinIO; for cloud hubs, it’s whatever S3 you have.

The lab’s actual architecture

The conceptual model above is portable. The lab’s wiring is specific:

hub-dc-v6 — compact 3-AIO management cluster, deliberately storage-light. NooBaa-backed observability; no full ODF on the hub. See /docs/openshift-platform/openshift-platform/cluster-topology/hub-dc-v6.
spoke-dc-v6 — workload cluster, 3 VM masters + 3 physical workers with ODF for tenant data. See /docs/openshift-platform/openshift-platform/cluster-topology/spoke-dc-v6.
Pull-model GitOps. The hub runs ACM and an ApplicationSet controller that fans out via clusterDecisionResource; each spoke runs OpenShift GitOps locally and reconciles from internal GitLab. The reasoning is in ADR 0018 and ADR 0019.
Storage-light hub isn’t a default — it’s a sized decision for a lab where the spoke is the data plane, not the hub. If you copy this design, do the sizing math first; observability is the constraint that bites.

For the lab-specific ACM component wiring, see the platform docs at /docs/openshift-platform/openshift-platform/acm-multicluster/multiclusterhub and /docs/openshift-platform/openshift-platform/acm-multicluster/managedcluster-registration.

Day-1 install walkthrough

Knowing the components is half the job; the other half is putting them on a cluster. Day-1 means the hour where you turn an empty OpenShift cluster into a working ACM hub. Two paths exist, the install has a few load-bearing prerequisites, and the post-install checks are short enough to memorise.

Prerequisites on the hub

Before you start, the hub OpenShift cluster has to satisfy a small list:

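OpenShift version. Each RHACM release supports a specific range of OpenShift versions; RHACM 2.13, for example, supports OpenShift 4.14 and later. Confirm the range for the channel you plan to install against the support matrix. Older 4.x will refuse to install the operator, and the failure is usually a confusing Subscription event rather than an explicit version error. Check oc get clusterversion first.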
Storage class for search. If you enable the search component, its PostgreSQL needs a default StorageClass that gives it SSD-class IO. Spinning disks make the index sluggish and queries time out under load. The lab disables search outright on hub-dc-v6, but if you keep it on, point it at NVMe-backed ODF or an equivalent.
Object storage for observability. Enabling observability requires an S3-compatible bucket — Thanos receive writes blocks there. The lab uses NooBaa over MinIO; on cloud, plain S3 or Azure Blob via the S3 gateway works. No object store, no observability.
Network reachability. Every managed cluster must be able to reach the hub’s kube-apiserver on TCP 6443. The hub never connects to managed clusters; everything is spoke-initiated. DNS, firewall rules, and the hub’s API endpoint (api.<hub>.<domain>) need to be resolvable and routable from every spoke before you import anything.

Skipping the storage and reachability checks is the most common cause of an install that looks healthy on the hub but is dead the moment you try to import a cluster.

Two install paths

You can install ACM through the OpenShift console or through GitOps. The two paths produce identical resources; the difference is who owns the manifests.

Path A: OperatorHub. Open the OpenShift web console, go to OperatorHub, and filter for Advanced Cluster Management for Kubernetes. Pick the certified Red Hat tile, install cluster-scoped, choose the release-2.16 channel (or whichever minor your fleet has standardised on), and let the Subscription land in open-cluster-management. After the operator install completes, go to Installed Operators → Advanced Cluster Management → MultiClusterHub, click Create MultiClusterHub, and submit the default form. This is the fastest path for a one-off hub.

Path B: GitOps (recommended for fleet ops). Author a Namespace, OperatorGroup, Subscription, and MultiClusterHub manifest, commit them to platform-gitops under clusters/<hub>/platform/acm/, and let Argo CD apply them with sync waves so the Subscription lands before the MCH CR. This is what the lab does; the actual manifests are documented at /docs/openshift-platform/openshift-platform/acm-multicluster/multiclusterhub. The advantage is reproducibility: tear the hub down, re-install, and the same component set comes back.

Pick GitOps if you operate more than one hub or expect the install to be re-run.
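For orientation, the operator half of Path B might look like the sketch below. The sync-wave numbers, the OperatorGroup name, and the redhat-operators catalog source are assumptions for illustration; the lab’s real manifests live at the platform docs page referenced above and may differ.

```yaml
# Sketch only: Namespace -> OperatorGroup -> Subscription, ordered by Argo CD sync waves.
apiVersion: v1
kind: Namespace
metadata:
  name: open-cluster-management
  annotations:
    argocd.argoproj.io/sync-wave: "0"
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: open-cluster-management
  namespace: open-cluster-management
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  targetNamespaces:
    - open-cluster-management
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: advanced-cluster-management
  namespace: open-cluster-management
  annotations:
    argocd.argoproj.io/sync-wave: "2"
spec:
  name: advanced-cluster-management
  channel: release-2.16                 # pin to the minor your fleet standardised on
  source: redhat-operators              # assumed catalog source
  sourceNamespace: openshift-marketplace
  installPlanApproval: Automatic
```

The MultiClusterHub manifest would then sit in a later wave. Because its CRD only exists once the operator is installed, it is common to also give it the Argo CD sync option SkipDryRunOnMissingResource=true so the first sync does not fail on the unknown kind.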

MultiClusterHub CR — the shape

The MCH CR is the single declaration that turns RHACM on. The minimal shape:

apiVersion: operator.open-cluster-management.io/v1
kind: MultiClusterHub
metadata:
  name: multiclusterhub
  namespace: open-cluster-management
spec:
  availabilityConfig: High
  disableHubSelfManagement: false
  imagePullSecret: open-cluster-management-image-pull-credentials

Field by field: availabilityConfig is High (two replicas per controller) or Basic (single replica); pick Basic only for a sandbox. disableHubSelfManagement: false keeps the hub registered as a ManagedCluster named local-cluster, which the lab needs so the hub’s GitOpsCluster can target the local Argo destination. imagePullSecret is a pull secret in open-cluster-management that the operator can use to fetch ACM images — usually the same one the hub’s global pull secret uses. Optional fields worth knowing: nodeSelector and tolerations to pin controllers onto infra nodes, and customCAConfigmap to inject your private CA bundle into ACM components when the hub trusts a non-public PKI.
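The optional fields mentioned above can be sketched as follows. The infra node label and taint, and the ConfigMap name custom-ca-bundle, are hypothetical values chosen for illustration; substitute whatever your cluster uses.

```yaml
# Sketch only: the minimal MCH plus the optional scheduling and CA fields.
apiVersion: operator.open-cluster-management.io/v1
kind: MultiClusterHub
metadata:
  name: multiclusterhub
  namespace: open-cluster-management
spec:
  availabilityConfig: High
  disableHubSelfManagement: false
  imagePullSecret: open-cluster-management-image-pull-credentials
  nodeSelector:
    node-role.kubernetes.io/infra: ""      # assumes nodes labelled as infra
  tolerations:
    - key: node-role.kubernetes.io/infra   # assumes infra nodes carry this taint
      operator: Exists
      effect: NoSchedule
  customCAConfigmap: custom-ca-bundle      # hypothetical ConfigMap holding the private CA bundle
```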

Watching the install

Once the MCH CR exists, the operator unrolls the rest. The oc get multiclusterhub -A output progresses from Installing to Running in roughly five to ten minutes, depending on how many components you enabled. While it’s installing, oc get pods -n open-cluster-management shows the cascade — roughly fifteen controllers come up across the registration, work, governance, and console pillars. The ones to watch by name are the cluster-manager (registration), the governance-policy-propagator, the console plugin (acm-console-*), and — if you enabled them — the search-collector and observability controller. If any one of these is stuck ContainerCreating for more than a couple of minutes, look at events on the pod; it’s usually a missing secret or storage class.

The lab’s install posture

hub-dc-v6 runs MCH with disableHubSelfManagement: false, the opposite of what you might expect for a control-plane-only hub. The reason is GitOps: the hub’s Argo CD instance targets local-cluster as a destination, so the hub must be a ManagedCluster of itself. The hub’s enabled-component set is deliberately small — registration, work, placement, app-lifecycle, GRC, console, cluster-permission, cluster-lifecycle — with search, observability, insights, and the backup operator left off until they’re needed. The full component table and the actual Subscription + MCH YAML are at /docs/openshift-platform/openshift-platform/acm-multicluster/multiclusterhub.

Post-install checks

Three checks tell you the hub is genuinely ready, not just Running:

oc get multiclusterhub -n open-cluster-management -o jsonpath='{.items[0].status.phase}{"\n"}'
oc get managedcluster local-cluster
oc get crd | grep 'open-cluster-management\.io' | wc -l

The MCH phase should read Running. The local-cluster ManagedCluster should exist with HUB ACCEPTED=true (or be absent entirely if you set disableHubSelfManagement: true). The third command counts CRDs in every open-cluster-management.io API group; the result should be in the dozens, because the install registers a couple of dozen CRDs across registration, work, placement, addon, and governance APIs. If you see fewer than fifteen, something stalled mid-install. The console plug-in is the last thing to register; reload the OpenShift web console and the All Clusters dropdown in the top-left header is your visual confirmation.

Multicluster Engine — the underlying lifecycle controller

The single most useful thing you can internalise about ACM is that it isn’t one product — it’s two operators stacked. The lower one is multicluster engine for Kubernetes (MCE); the upper one is RHACM. Once you know which CRD belongs to which operator, half of “why is X broken” answers itself.

MCE ships from OperatorHub as the multicluster-engine operator and reconciles a MultiClusterEngine CR. Everything in MCE is about getting a cluster, registering it, and keeping it joined: the ManagedCluster API, the klusterlet, Hive’s provisioning controllers, the Assisted Service, ClusterPools, ClusterClaims, and the hosted-control-plane bits (the HypershiftAddon and the AgentServiceConfig that drives Assisted). RHACM, installed as the advanced-cluster-management operator and reconciled via a MultiClusterHub CR, sits on top and layers in the fleet-wide product surface: GRC policies, ApplicationSets and Subscriptions, multicluster observability, the Search index, the Insights pull, and the console plug-in that gives you the All Clusters dropdown. When you install ACM on a hub that doesn’t already have MCE, the ACM operator installs MCE first and then proceeds.

Why the split exists

The split is a product-packaging decision, not an architectural accident. Some customers want only cluster lifecycle — a centralised place to provision and register OpenShift clusters with no opinion about governance or applications — and Red Hat sells that to them as MCE standalone. Other customers want the full fleet management story; they buy ACM, which subsumes MCE. The boundary lets the lifecycle code ship and stabilise on its own cadence, and lets a smaller team adopt MCE without inheriting the surface area of policies and observability.

For operators of an ACM hub, the practical consequence is that two operators upgrade independently and two operators can fail independently. The MCE channel and the ACM channel are separate Subscriptions; you pin them separately and you read their release notes separately.

CRD ownership

Knowing which operator owns which API is the fastest way to triage a degraded hub. A rough split:

Owned by MCE	Owned by ACM (multiclusterhub)
`ManagedCluster`, `ManagedClusterSet`, `ManagedClusterSetBinding`	`Policy`, `PolicySet`, `PlacementBinding`
`Klusterlet`, `KlusterletConfig`, `KlusterletAddonConfig`	`Subscription`, `Channel`, `PlacementRule` (legacy app model)
`ClusterDeployment`, `ClusterImageSet`, `MachinePool` (Hive)	`ApplicationSet` (wrapper integrations for pull-model)
`ClusterPool`, `ClusterClaim`	`MultiClusterObservability`, `ObservabilityAddon`
`AgentServiceConfig`, `InfraEnv`, `BareMetalHost` (Assisted)	`SearchOperator`, `SearchCustomization`
`HypershiftAddon`, `HostedCluster`, `NodePool`	`Insights` integration plumbing
`Placement`, `PlacementDecision`	Console plug-in and the All Clusters surfaces
`ManifestWork`, `ManifestWorkReplicaSet`

Placement lives on the MCE side because it’s the primitive for cluster-lifecycle work (where to provision, where to attach an addon). ACM components consume PlacementDecision but don’t define the Placement CRD.

What this means in practice

When ACM looks unhealthy, the first triage is “which operator?” Two oc get pods calls disambiguate it:

oc get pods -n multicluster-engine
oc get pods -n open-cluster-management

If multicluster-engine is degraded, cluster lifecycle is broken. New imports stall, ManagedCluster reconciliation halts, Hive can’t provision, and the Placement controller stops producing decisions — which transitively breaks any ACM component that reads PlacementDecision. The fleet is frozen at whatever shape it had at the moment MCE went sideways.

If open-cluster-management is degraded, governance and applications are broken, but lifecycle continues. ManagedClusters stay joined, klusterlets keep heartbeating, but new Policies don’t propagate, ApplicationSets don’t generate, and the console plug-in may go dark. Existing workloads on managed clusters keep running because the spoke side doesn’t depend on the hub-side governance controllers staying up.

The same split shows up in upgrades. The MCE Subscription can roll forward on its own minor (for example, MCE 2.10 → 2.11) while ACM stays pinned to a compatible window. Read the release-notes compatibility matrix every upgrade — ACM’s multiclusterhub operator declares an MCE version range it works with, and skipping it is how you get a permanently-Installing MCH CR.

MCE as a standalone product

You can install MCE without ACM. The multicluster-engine Subscription on a fresh OpenShift cluster gives you ManagedCluster, the klusterlet bits, Hive, Assisted Service, HyperShift hosting, and cluster pools, but no Policy framework, no Applications, no Search, no Observability. The Subscription gives you the same channel/version pinning you’d expect, and a small spec.overrides.components list on the MultiClusterEngine CR lets you toggle the addons (HypershiftPreview, ClusterProxyAddon, ManagedServiceAccount, etc.) without ACM ever being in the picture.

For a team that only wants centralised provisioning and registration (say, an SRE group that wants a single API to spin up dev clusters but already runs governance through a different tool), MCE standalone is the cheaper buy. For everyone else, MCE-under-ACM is the path.
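A standalone MCE install comes down to one cluster-scoped CR once the operator is in place. The sketch below is an assumption-laden illustration: the component keys (hypershift, cluster-proxy-addon, managedserviceaccount) follow upstream naming as best understood and may not match your release, so read them back from the live CR before relying on them.

```yaml
# Sketch only: a MultiClusterEngine CR with a few component toggles.
# Component keys are assumptions; verify them with
#   oc get multiclusterengine -o yaml
apiVersion: multicluster.openshift.io/v1
kind: MultiClusterEngine
metadata:
  name: multiclusterengine       # cluster-scoped, so no namespace
spec:
  targetNamespace: multicluster-engine
  availabilityConfig: High
  overrides:
    components:
      - name: hypershift
        enabled: true
      - name: cluster-proxy-addon
        enabled: true
      - name: managedserviceaccount
        enabled: false
```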

The lab’s posture

hub-dc-v6 runs both. The multicluster-engine operator and the advanced-cluster-management operator are installed via GitOps as separate Subscriptions; the MCE CR is reconciled first (sync wave 10), the MCH CR is reconciled in a later wave so the multiclusterhub-operator finds MCE already healthy. The actual Subscription and CR manifests are documented at /docs/openshift-platform/openshift-platform/acm-multicluster/multiclusterhub — read that page after this section to see how the lab pins both channels and which MCE components it disables.

Where things go wrong, architecturally

If you’re going to debug this, you’ll be looking at three places.

Klusterlet bootstrap secrets in the open-cluster-management-agent namespace on the managed cluster. When a spoke can’t join, the answer is almost always here. The bootstrap secret holds the hub kubeconfig the spoke uses for its first CSR; if it’s stale, missing, or points to the wrong hub API endpoint, registration silently retries forever. Look at bootstrap-hub-kubeconfig, the events on the Klusterlet CR, and the registration-agent logs.

ManifestWork sync status. Once registered, the spoke pulls ManifestWork. If a particular workload isn’t appearing on the cluster, the chain to walk is: Placement resolved correctly? → PlacementDecision has this cluster’s name? → ManifestWork exists in the cluster’s hub namespace? → ManifestWork.status says it was applied? Each link tells you exactly which controller dropped the ball.

The third place is the registration-agent CSR loop. The client cert the agents use has a lifetime; the agent rotates it automatically. If the rotation breaks (clock skew, hub unreachable for the rotation window, the spoke’s identity got removed on the hub), the spoke goes Unavailable and work-agent stops working with no obvious crash. The hub’s csr resources and the registration-agent’s logs tell you what happened.

These three are roughly half of all ACM tickets in practice. Internalising them now will save you hours later.

What’s next

You now have the map. The next module is Module 03 — Cluster lifecycle: import a managed cluster, provision one with Hive, and operate the upgrade flow.
