Observability and Search
Aggregate metrics, alerts, and resource queries across a fleet — Multicluster Observability Operator, the Search index, Insights, and the lab's deliberate hub/spoke split.
The application landed on every spoke. The next problem is the inverse: how do you see all of them at once? Module 06 was about fan-out; this module is about fan-in. Metrics, alerts, and ad-hoc resource queries — one pane, many clusters.
Why a separate addon
Every managed OpenShift cluster already runs the in-cluster monitoring stack — Prometheus, AlertManager, node-exporter, kube-state-metrics. That’s enough for a single-cluster operator looking at one cluster’s health.
The multicluster job is different: long-retention metrics across all clusters, fleet-level dashboards, fleet-level alerts (“3 clusters have ETCD member health degraded”). For that you need an aggregator. ACM provides one: the Multicluster Observability Operator (MCO). It’s an addon precisely because the in-cluster stack stays useful on its own.
The Multicluster Observability Operator
MCO is deployed on the hub. From a single MultiClusterObservability CR it stands up:
| Component | Role |
|---|---|
| Thanos Receive | Long-lived write endpoint that managed clusters push to. |
| Thanos Store | Reads historical blocks from object storage. |
| Thanos Query / Query Frontend | Deduplicates + merges live and historical data for queries. |
| Thanos Compactor | Downsamples old blocks for cheap long-retention queries. |
| Grafana | Pre-shipped dashboards plus a multi-tenant Thanos data source. |
| AlertManager | Hub-level alert routing (Slack/PagerDuty/etc.). |
Managed clusters get the matching half: an observability-addon deploys a metrics-collector Pod that scrapes the spoke’s in-cluster Prometheus and remote_writes a curated subset to the hub’s Thanos Receive. No hub→spoke pull; the spoke initiates the push.
Reading the diagram: spoke Prometheus is the source of truth on each cluster; the addon scrapes and forwards (dashed green = spoke-initiated push) to the hub; Thanos writes blocks to object storage; query reads back through Store; Grafana and AlertManager are the read-side surfaces.
Storage and retention
MCO needs S3-compatible object storage for Thanos blocks. The lab uses MinIO (30.30.30.14:9000) for this; in cloud installs you’d point at S3 or a NooBaa OBC. Default retention is months, not days — that’s the whole point of separating ingest from storage.
Rough sizing numbers: at MCO’s default metric set (~150 series per Pod, ~2000 Pods per medium cluster, 30-second scrape) you’re looking at ~20 GiB per cluster per month of compressed block storage. A fleet of 50 clusters retained for a full year therefore lands at about 50 × 20 GiB × 12 ≈ 12,000 GiB, or roughly 12 TiB, before downsampling. Compactor downsampling buys some of that back for data older than the first 40 hours.
What kills you is not storage volume; it’s Thanos Receive memory. Each receiver replica holds an in-memory tail of ~2 hours of TSDB data per tenant. Scale receiver replicas before you scale storage.
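The object-storage wiring described in this section can be sketched as a Thanos object-store Secret plus a MultiClusterObservability CR that references it. This is a minimal sketch, not the lab's actual manifest: the bucket name and credentials are placeholders, the plain-HTTP setting is an assumption about the lab's MinIO, and the resource names follow the conventions in the RHACM documentation, so verify them against the release you run.

```yaml
# Secret holding the Thanos object-store config (S3-compatible, here MinIO).
apiVersion: v1
kind: Secret
metadata:
  name: thanos-object-storage
  namespace: open-cluster-management-observability
type: Opaque
stringData:
  thanos.yaml: |
    type: s3
    config:
      bucket: acm-observability      # hypothetical bucket name
      endpoint: 30.30.30.14:9000     # the lab's MinIO endpoint
      insecure: true                 # assumption: MinIO is served over plain HTTP
      access_key: REPLACE_ME         # placeholder credential
      secret_key: REPLACE_ME         # placeholder credential
---
# The single CR from which MCO stands up Thanos, Grafana, and AlertManager.
apiVersion: observability.open-cluster-management.io/v1beta2
kind: MultiClusterObservability
metadata:
  name: observability
spec:
  observabilityAddonSpec: {}
  storageConfig:
    metricObjectStorage:
      name: thanos-object-storage    # must match the Secret above
      key: thanos.yaml
```

In a cloud install the same Secret shape would point at S3 or at the endpoint and credentials from a NooBaa OBC instead of MinIO.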
What gets shipped vs what stays local
By default MCO ships a curated allowlist — cluster, node, kube-state, ETCD, API server, ingress controller — not raw cAdvisor or per-container CPU. That curated set is what the shipped dashboards consume.
You can extend the list in two ways:
- The MCO MultiClusterObservability CR’s allowlist ConfigMap: add specific metric names.
- A MetricsCustomCollector for recording-rule output you want fanned in.
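For the first option, a minimal sketch of a custom allowlist entry follows. It assumes the ConfigMap name, namespace, and data key used in the RHACM documentation (observability-metrics-custom-allowlist, open-cluster-management-observability, metrics_list.yaml); the metric name is hypothetical and stands in for a series your applications actually emit.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: observability-metrics-custom-allowlist
  namespace: open-cluster-management-observability
data:
  metrics_list.yaml: |
    names:
      # Hypothetical application SLI metric; replace with a real series name.
      - checkout_request_duration_seconds_count
```

Each name added here is collected from every spoke, so the list is best kept to the handful of SLI series that need a fleet view.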
App workload metrics typically stay on the spoke unless you explicitly federate them. The “right” line: SLI metrics you need fleet views of go to the hub; spoke-local debugging metrics stay on the spoke. Federating everything is how teams end up with terabyte-class hub storage bills.
Alerts
Hub-level alerts evaluate against aggregated metrics in Thanos Query. Spoke-level alerts continue firing on the spoke’s local Prometheus and AlertManager. Both are useful; they answer different questions.
The pragmatic split:
- Hub: fleet-shaped SLIs. “More than 2 clusters are reporting KubeAPIErrorBudgetBurn.” “ETCD leader changes exceed threshold across the fleet.”
- Spoke: high-noise local alerts. “A specific Pod is CrashLooping.” “Disk pressure on a specific node.”
Putting noisy spoke-internal alerts on the hub turns the hub AlertManager into a fire hose. Putting fleet SLI alerts on a single spoke means you don’t see the fleet-wide pattern. The lab keeps Loki and Tempo local-per-spoke for the same reason — high-volume signals stay close to where they’re produced (/docs/openshift-platform/openshift-platform/platform-services/cluster-logging-and-loki/).
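A fleet-shaped hub alert of the kind described in this section could be sketched as a custom Thanos Ruler rule. The sketch assumes the thanos-ruler-custom-rules ConfigMap and custom_rules.yaml key from the RHACM documentation, that etcd_server_has_leader is part of the forwarded allowlist, and that forwarded series carry a cluster label. The group name, alert name, threshold, and duration are illustrative only.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-ruler-custom-rules
  namespace: open-cluster-management-observability
data:
  custom_rules.yaml: |
    groups:
      - name: fleet-etcd-health                      # hypothetical group name
        rules:
          - alert: FleetEtcdMembersWithoutLeader     # hypothetical alert name
            # Inner count collapses leaderless members to one series per cluster;
            # outer count is the number of affected clusters.
            expr: count(count by (cluster) (etcd_server_has_leader == 0)) > 2
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "More than 2 clusters report an etcd member without a leader."
```

The rule only fires on a fleet-wide pattern; a single leaderless member on one spoke stays the business of that spoke's local Prometheus and AlertManager.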
Cluster health metrics — fleet status not workload status
So far this module has been about workload observability: scrape Prometheus on each spoke, push curated metrics to the hub, draw dashboards. That’s the what’s running, how busy is it question. The other question — is the cluster itself still alive? — is observed with a different set of metrics, runs through a different alert path, and most of the time matters more.
The distinction
A pod-level CrashLoop is a workload event. A spoke whose kube-apiserver has stopped serving for 30 seconds is a cluster event: every workload on it is collateral damage, and you’d rather know about the apiserver outage than wade through the 200 pod alerts that follow. RHACM treats cluster health as a first-class signal, separate from the workload metrics MCO fans in.
| Question | Where the signal comes from |
|---|---|
| Is the spoke’s kube-apiserver up? | Spoke’s in-cluster monitoring exports up for the apiserver target |
| Are all ClusterOperators Available=True? | Spoke’s cluster-version-operator exports cluster_operator_up and cluster_operator_conditions |
| Is ClusterVersion healthy? | Spoke’s CVO; surfaced as cluster_version_payload and the Available condition |
| Is etcd quorate? | Spoke’s etcd-operator exports etcd_server_has_leader, member counts |
| Are certificates close to expiring? | Spoke’s kube_apiserver_pki_* series and apiserver_client_certificate_expiration_seconds |
| How many nodes are Ready? | Spoke’s kube_node_status_condition from kube-state-metrics |
| Is the spoke heartbeating to the hub? | Hub-side: acm_managed_cluster_status_condition_available |
The first six come from each spoke’s own openshift-monitoring stack and are forwarded to the hub by the same observability-addon that ships workload metrics — they’re a small subset of the curated allowlist. The last one is hub-native, emitted by the registration controller on the hub and not present on any spoke.
Alerts the hub ships
RHACM installs a small set of cluster-health alert rules into the hub’s Thanos Ruler when observability is enabled. The names worth knowing:
- ManagedClusterConditionUnknown: a spoke hasn’t heartbeated for ~5 minutes. The registration controller flipped its Available condition from True to Unknown.
- ManagedClusterImportUnavailable: the import process for a new spoke stalled. Common during fleet expansion if klusterlet rollout is blocked.
- MultiClusterObservabilityClusterDown: the metrics-collector on a spoke has stopped pushing. Either the addon is unhealthy or the spoke’s egress to the hub’s Thanos Receive is broken.
- PolicyGovernanceInfo-shaped alerts: RHACM’s GRC operator emits a counter for each non-compliant policy; alerts evaluate that counter against thresholds.
These need an Alertmanager route to your on-call channel. The default route in the hub’s alertmanager-main Secret usually sends everything to a generic webhook; for real on-call coverage, write a route block that splits cluster-health alerts to the platform-team channel and policy alerts to the security-team channel.
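The platform/security split described above might look like the following Alertmanager configuration fragment. It is a sketch under stated assumptions: the receiver names and webhook URLs are hypothetical, and the regular expression for policy alerts assumes your policy alert names start with "Policy", which you should adjust to the rules you actually define.

```yaml
route:
  receiver: default-webhook            # catch-all for anything unmatched
  group_by: ["alertname", "cluster"]
  routes:
    # Cluster-health alerts go to the platform team.
    - receiver: platform-oncall
      matchers:
        - 'alertname =~ "ManagedClusterConditionUnknown|ManagedClusterImportUnavailable|MultiClusterObservabilityClusterDown"'
    # Policy alerts go to the security team (name pattern is an assumption).
    - receiver: security-oncall
      matchers:
        - 'alertname =~ "Policy.*"'
receivers:
  - name: default-webhook
    webhook_configs:
      - url: https://alerts.example.internal/default     # hypothetical URL
  - name: platform-oncall
    webhook_configs:
      - url: https://alerts.example.internal/platform    # hypothetical URL
  - name: security-oncall
    webhook_configs:
      - url: https://alerts.example.internal/security    # hypothetical URL
```

Grouping by cluster keeps one notification per affected cluster rather than one per alert instance, which matters when a control-plane outage triggers several alerts at once.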
The “ManagedCluster Unknown” trap
The single most important alert in this set is ManagedClusterConditionUnknown. It fires whenever a managed cluster’s registration agent stops heartbeating for the staleness window. The causes, in rough order of frequency:
- Spoke kube-apiserver is down. The agent can’t authenticate to the hub if its own kube-apiserver isn’t serving. This is the case where the alert is doing its real job — telling you the cluster has a control-plane outage before any of the spoke’s own alerts can reach you (they can’t reach you; the spoke is down).
- Network partition between spoke and hub. The agent is healthy on the spoke but can’t open TCP/6443 outbound to the hub. WAN flaps, firewall changes, an Argo sync that nuked the hub’s Route — all candidates.
- Registration agent crashed. The Pod itself OOMed or hit a CrashLoop. Spoke is fine; the messenger is sick. Look at the klusterlet namespace on the spoke.
- Spoke’s client cert rotation broke. The agent rotates its identity cert periodically; if rotation fails and the existing cert expires, the agent goes Unknown silently. Module 02 covers the bootstrap-secret recovery flow.
Crucially, when this alert fires, none of the spoke’s own alerts can reach the hub for the duration of the partition — the same path that ships metrics is the path that ships alerts. The hub-side ManagedClusterConditionUnknown is your only fleet-level indicator that something is wrong cluster-wide. Treat it as a paging alert, not an email-best-effort one.
The lab’s setup
The lab keeps the workload-vs-health split deliberate. Workload observability lives in two places:
- Spoke-local Loki and Tempo, for logs and traces with cluster-local correlation. See /docs/openshift-platform/openshift-platform/platform-services/cluster-logging-and-loki/.
- SigNoz on a separate VM, the long-term observability sink for out-of-cluster workloads. See /docs/openshift-platform/lab-infrastructure/observability-vms/signoz-overview.
Cluster-health observability is hub-side. When MCO is enabled, the curated allowlist includes the cluster-health series and the hub’s Thanos Ruler evaluates the RHACM-shipped alert rules. The on-call routing for cluster-health alerts is documented at /docs/openshift-platform/operations/incidents-and-runbooks/ — that’s the home for the runbooks that the alerts page into.
For the lab today, MCO is not yet enabled on hub-dc-v6 (the object-store wiring is parked behind the same NooBaa work that gates several other features). The ManagedClusterConditionUnknown alert still fires through the hub’s cluster-monitoring-operator because the registration controller exports its conditions to the in-cluster Prometheus regardless of whether MCO is on. That’s the minimum viable alerting posture; the full MCO-backed setup is a near-term upgrade.
Try this
Three exercises, each ~10 minutes:
1. List every ClusterOperator status across the fleet via ACM Search. From the ACM Search UI: kind:ClusterOperator. Add a column for the Available condition. Spot any operator showing Degraded=True across the fleet — these are the candidates for the next round of remediation policies.
2. Read the registration controller’s exported metrics. On the hub: oc -n open-cluster-management-hub port-forward svc/registration-controller 8443:443 and curl https://localhost:8443/metrics (you’ll need the in-cluster token). Find acm_managed_cluster_status_condition_available — the gauge that drives ManagedClusterConditionUnknown. Note the labels; they’re how you’d build a hub-Grafana panel of “current heartbeat state per cluster.”
3. Sketch the Alertmanager route. Write the YAML for an Alertmanager route block that:
   - Sends ManagedClusterConditionUnknown to the platform on-call channel with severity: critical.
   - Sends MultiClusterObservabilityClusterDown to the observability team with severity: warning.
   - Inhibits the second alert for a cluster when the first is already firing; there is no point waking observability when the cluster is down for unrelated reasons.
The exercise is the route design, not the apply.
References
- docs.redhat.com — RHACM Health metrics
- docs.redhat.com — RHACM Observability service
- docs.openshift.com — Cluster Operator monitoring
- open-cluster-management.io — Registration controller
Search
The second addon. Different job entirely: not metrics, but resource state. The Search addon runs a search-collector on each managed cluster that watches the etcd state via the kube-apiserver, then publishes resource lists to a hub PostgreSQL index. The ACM Search UI queries that index.
Use cases that pay for it on day one:
- “What version of cert-manager is installed on every cluster?”
- “Which namespaces have a Deployment named redis?”
- “Show every Pod with image containing nginx across the fleet.”
- “Which ManagedClusters have the compliance=pci-dss label?”
It’s oc get across the whole fleet without context-switching. The query language is small (key:value with simple boolean composition) but covers 80% of “where the hell is X” investigations.
Insights integration
ACM ingests Red Hat Insights findings per managed cluster — vulnerability scans, configuration recommendations, advisories. The Insights tab on each cluster surfaces these alongside the cluster’s health.
The signal-to-noise is mixed. Insights catches genuinely useful things — pending CVE upgrades, misconfigured CSRs, etcd recommendations — and also catches things that don’t apply to your environment (“upgrade to a SaaS feature you don’t use”). Treat it as a queue of suggestions to triage, not a list of must-fix items. The remediation links are the most useful part: most findings come with a one-click rollout to apply the fix as a Policy.
Performance traps
A few sharp edges that surface around fleet sizes of 25+ clusters:
- Search index scales linearly with object count. A fleet of 50 clusters × 10k objects per cluster = 500k rows. The hub PostgreSQL underneath needs SSD storage and enough RAM that its working set fits in shared buffers. Spinning rust will time out queries.
- search-collector OOMs. Big spokes with many Pods sometimes OOMKill the collector mid-sync. Bump its memory request before you debug a “stale index” mystery.
- Hub Grafana is single-tenant. Multi-tenancy isolation happens at the Thanos data source level (cluster label filtering), not via Grafana user permissions. Don’t expect Grafana folder-level RBAC to gatekeep tenants.
- Thanos Query timeouts. Long time ranges against many clusters hit the default 2-minute timeout. Either narrow the range, add recording rules to pre-aggregate, or bump the query-frontend timeout and accept the latency.
The lab’s deliberate hub/spoke split
The lab’s design uses MCO for metrics fan-in, but it keeps logs and traces local to each spoke:
- Loki runs on each spoke via the Cluster Logging Operator (/docs/openshift-platform/openshift-platform/platform-services/cluster-logging-and-loki/).
- Tempo runs on each spoke under the Tempo Operator, backed by per-spoke object storage.
- Cluster Observability Operator (COO) on the spoke runs Perses dashboards for the spoke-local view (/docs/openshift-platform/openshift-platform/platform-services/perses-dashboards/).
- SigNoz on a separate VM is the long-term observability sink for some out-of-cluster workloads (/docs/openshift-platform/lab-infrastructure/observability-vms/signoz-overview/).
Why split it? Metrics are aggregable and small per series; the fan-in pays off. Logs and traces are bulky and best correlated with cluster-local context (Pods, namespaces, request IDs). Federating them to the hub means moving terabytes a day and losing the per-cluster correlation you actually need when debugging.
The console hides the split well — Perses tabs on the spoke, Grafana on the hub — but the data paths are different by design.
Try this
- Open the ACM Search UI on the hub. Run a query for kind:Deployment image:nginx to list every Deployment in the fleet whose image references nginx. Add a cluster: filter to narrow to one cluster. The UI prints a count so you can spot-check against oc get.
- Read the MultiClusterObservability CR (if deployed): oc get mco observability -o yaml. Identify the allowlist ConfigMap reference. Add a single metric name to the allowlist (one you know your apps emit) and watch it appear in Thanos Query within a few minutes.
- In the hub Grafana, write a query: count by (cluster) (kube_pod_info). You should get one row per managed cluster with the Pod count on that cluster. Try the same for node instead of cluster; that’s your per-node pod-density chart for free.
Common failure modes
- observability-addon stuck Available=False on a spoke. The collector cannot reach the hub’s Thanos Receive route. Check DNS resolution for the receive route from inside the spoke, then network egress. Receive routes use OpenShift Route with re-encrypt TLS; cert trust on the spoke is the second-most common cause.
- Search index goes stale. The search-collector Pod on one or more spokes got OOMKilled. Check oc logs -n open-cluster-management-agent-addon deployment/search-collector --previous. Bump memory request, re-roll.
- Grafana dashboards empty. Thanos Query is timing out, usually because Thanos Store can’t reach object storage. From inside the store Pod, wget the bucket endpoint. The classic cause is an OBC where the AWS_KEY keys never got bridged to the operand’s lowercase variants; see the lab’s NooBaa OBC→operand bridge pattern.
- Hub AlertManager fire hose. Spoke-internal alerts were configured to forward to the hub. Move them back to the spoke’s local AlertManager and only forward the SLI-shaped fleet alerts.
- Insights tab empty for one cluster. The cluster lost its cloud.openshift.com egress, or the Telemetry pull secret is missing on the spoke. Insights piggybacks on the same support pull-secret flow as Telemetry.
References
- ACM Observability docs: https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/
- Thanos design: https://thanos.io/tip/thanos/design.md/
- OCM Search proposal: https://open-cluster-management.io/
- OpenShift Cluster Monitoring: https://docs.openshift.com/container-platform/latest/monitoring/monitoring-overview.html
- Red Hat Insights: https://docs.redhat.com/en/documentation/red_hat_insights/
- Lab: /docs/openshift-platform/openshift-platform/platform-services/cluster-logging-and-loki/
- Lab: /docs/openshift-platform/openshift-platform/platform-services/perses-dashboards/
- Lab: /docs/openshift-platform/lab-infrastructure/observability-vms/signoz-overview/
Next: Module 08 — Hosted Control Planes and Cluster Pools covers two ways to provision more clusters quickly — HyperShift’s hosted control planes and ACM’s warm-pool ClusterPool.