Observability and Search

Aggregate metrics, alerts, and resource queries across a fleet — Multicluster Observability Operator, the Search index, Insights, and the lab's deliberate hub/spoke split.

The application landed on every spoke. The next problem is the inverse: how do you see all of them at once? Module 06 was about fan-out; this module is about fan-in. Metrics, alerts, and ad-hoc resource queries — one pane, many clusters.

Why a separate addon

Every managed OpenShift cluster already runs the in-cluster monitoring stack — Prometheus, AlertManager, node-exporter, kube-state-metrics. That’s enough for a single-cluster operator looking at one cluster’s health.

The multicluster job is different: long-retention metrics across all clusters, fleet-level dashboards, fleet-level alerts (“3 clusters have etcd member health degraded”). For that you need an aggregator. ACM provides one: the Multicluster Observability Operator (MCO). It’s an addon precisely because the in-cluster stack stays useful on its own.

The Multicluster Observability Operator

MCO is deployed on the hub. From a single MultiClusterObservability CR it stands up:

Component	Role
Thanos Receive	Long-lived write endpoint that managed clusters push to.
Thanos Store	Reads historical blocks from object storage.
Thanos Query / Query Frontend	Deduplicates + merges live and historical data for queries.
Thanos Compactor	Downsamples old blocks for cheap long-retention queries.
Grafana	Pre-shipped dashboards plus a multi-tenant Thanos data source.
AlertManager	Hub-level alert routing (Slack/PagerDuty/etc.).

Managed clusters get the matching half: an observability-addon deploys a metrics-collector Pod that scrapes the spoke’s in-cluster Prometheus and remote_writes a curated subset to the hub’s Thanos Receive. No hub→spoke pull; the spoke initiates the push.

[Diagram: metrics data flow. spoke Prometheus (OCP built-in, one per managed cluster) → observability-addon (metrics-collector) → Thanos Receive (hub ingress) → object storage (MinIO / NooBaa) → Thanos Store + Query → hub Grafana + console, and AlertManager (hub)]

Reading the diagram: spoke Prometheus is the source of truth on each cluster; the addon scrapes and forwards (dashed green = spoke-initiated push) to the hub; Thanos writes blocks to object storage; query reads back through Store; Grafana and AlertManager are the read-side surfaces.

Storage and retention

MCO needs S3-compatible object storage for Thanos blocks. The lab uses MinIO (30.30.30.14:9000) for this; in cloud installs you’d point at S3 or a NooBaa OBC. Default retention is months, not days — that’s the whole point of separating ingest from storage.

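Rough sizing: at MCO’s default metric set (~150 series per Pod, ~2000 Pods per medium cluster, 30-second scrape) you’re looking at ~20 GiB per cluster per month of compressed block storage. A fleet of 50 clusters over a year is therefore about 12,000 GiB, or roughly 12 TiB, of raw-resolution blocks. Compactor downsampling applies to blocks older than 40 hours; it makes long-range queries cheaper, but it only reduces storage if you retain raw-resolution blocks for a shorter period than the downsampled ones.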
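The sizing figures above can be reproduced as back-of-envelope arithmetic. The sketch below is only that: the bytes-per-sample value is an assumption chosen to land near the ~20 GiB figure, not a measured number, and the function names are hypothetical.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month


def monthly_block_gib(series_per_pod=150, pods=2000, scrape_interval_s=30,
                      bytes_per_sample=0.8):
    """Estimate compressed block storage per cluster per month, in GiB.

    bytes_per_sample is an ASSUMED average after TSDB compression.
    """
    active_series = series_per_pod * pods
    samples_per_series = SECONDS_PER_MONTH / scrape_interval_s
    total_bytes = active_series * samples_per_series * bytes_per_sample
    return total_bytes / 2**30


def fleet_tib(clusters, months, gib_per_cluster_month):
    """Scale the per-cluster monthly figure to a fleet, in TiB."""
    return clusters * months * gib_per_cluster_month / 1024


per_cluster = monthly_block_gib()
print(f"{per_cluster:.1f} GiB per cluster per month")
print(f"{fleet_tib(50, 12, per_cluster):.1f} TiB for 50 clusters over 12 months")
```

Halving the scrape interval or doubling the Pod count doubles the result, which is why the allowlist discussed in the next section matters more than the storage backend.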

What kills you is not storage volume; it’s Thanos Receive memory. Each receiver replica holds an in-memory tail of ~2 hours of TSDB data per tenant. Scale receiver replicas before you scale storage.

What gets shipped vs what stays local

By default MCO ships a curated allowlist (cluster, node, kube-state, etcd, API server, ingress controller), not raw cAdvisor or per-container CPU. That curated set is what the shipped dashboards consume.

You can extend the list in two ways:

The custom allowlist ConfigMap (observability-metrics-custom-allowlist, in the namespace where observability is deployed): add specific metric names.
Recording rules defined in that same ConfigMap, for pre-aggregated output you want fanned in.

App workload metrics typically stay on the spoke unless you explicitly federate them. The “right” line: SLI metrics you need fleet views of go to the hub; spoke-local debugging metrics stay on the spoke. Federating everything is how teams end up with terabyte-class hub storage bills.
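To make the ship-versus-stay decision concrete, here is a minimal sketch of allowlist filtering. It is a toy model, not the metrics-collector's implementation: the two-entry default allowlist is an illustrative subset, and checkout_request_errors_total is a hypothetical app SLI.

```python
# Illustrative subset only; the real default allowlist is much longer.
DEFAULT_ALLOWLIST = {"etcd_server_has_leader", "kube_node_status_condition"}


def split_series(series, allowlist):
    """Partition scraped series into (shipped_to_hub, kept_on_spoke)."""
    shipped = [s for s in series if s["__name__"] in allowlist]
    local = [s for s in series if s["__name__"] not in allowlist]
    return shipped, local


scraped = [
    {"__name__": "etcd_server_has_leader", "value": 1},
    {"__name__": "container_cpu_usage_seconds_total", "value": 1234.5},
    {"__name__": "checkout_request_errors_total", "value": 7},  # hypothetical
]

# With the default list, only the cluster-level series leaves the spoke.
shipped, local = split_series(scraped, DEFAULT_ALLOWLIST)

# Extending the allowlist by one name federates exactly one more series.
extended = DEFAULT_ALLOWLIST | {"checkout_request_errors_total"}
shipped_ext, local_ext = split_series(scraped, extended)

print([s["__name__"] for s in shipped])
print([s["__name__"] for s in shipped_ext])
```

The point of the exercise is that the hub's storage bill grows with the allowlist, not with what the spoke scrapes.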

Alerts

Hub-level alerts evaluate against aggregated metrics in Thanos Query. Spoke-level alerts continue firing on the spoke’s local Prometheus and AlertManager. Both are useful; they answer different questions.

The pragmatic split:

Hub: fleet-shaped SLIs. “More than 2 clusters are reporting KubeAPIErrorBudgetBurn.” “etcd leader changes exceed threshold across the fleet.”
Spoke: high-noise local alerts. “A specific Pod is CrashLooping.” “Disk pressure on a specific node.”

Putting noisy spoke-internal alerts on the hub turns the hub AlertManager into a fire hose. Putting fleet SLI alerts on a single spoke means you don’t see the fleet-wide pattern. The lab keeps Loki and Tempo local-per-spoke for the same reason — high-volume signals stay close to where they’re produced (/docs/openshift-platform/openshift-platform/platform-services/cluster-logging-and-loki/).
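What makes an alert "fleet-shaped" is that it counts clusters rather than Pods. A small sketch of that evaluation follows; the alert records and the cluster label name are assumptions for illustration, and on a real hub this would be a rule evaluated against Thanos, not Python.

```python
def clusters_firing(alerts, alertname):
    """Return the set of clusters on which the named alert is firing."""
    return {a["cluster"] for a in alerts if a["alertname"] == alertname}


def fleet_alert_fires(alerts, alertname, more_than=2):
    """Fleet-shaped condition: the alert fires on more than N distinct clusters."""
    return len(clusters_firing(alerts, alertname)) > more_than


# Hypothetical snapshot of firing alerts, one record per (cluster, alert).
spoke_alerts = [
    {"cluster": "spoke-a", "alertname": "KubeAPIErrorBudgetBurn"},
    {"cluster": "spoke-b", "alertname": "KubeAPIErrorBudgetBurn"},
    {"cluster": "spoke-c", "alertname": "KubeAPIErrorBudgetBurn"},
    {"cluster": "spoke-a", "alertname": "KubePodCrashLooping"},
]

print(fleet_alert_fires(spoke_alerts, "KubeAPIErrorBudgetBurn"))
print(fleet_alert_fires(spoke_alerts, "KubePodCrashLooping"))
```

A single crash-looping Pod never crosses the cluster-count threshold, so it stays a spoke concern.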

Cluster health metrics — fleet status not workload status

So far this module has been about workload observability: scrape Prometheus on each spoke, push curated metrics to the hub, draw dashboards. That answers the “what’s running, and how busy is it?” question. The other question, “is the cluster itself still alive?”, is observed with a different set of metrics, runs through a different alert path, and most of the time matters more.

The distinction

A pod-level CrashLoop is a workload event. A spoke whose kube-apiserver has stopped serving for 30 seconds is a cluster event: every workload on it is collateral damage, and you’d rather know about the apiserver outage than wade through the 200 pod alerts that follow. RHACM treats cluster health as a first-class signal, separate from the workload metrics MCO fans in.

Question	Where the signal comes from
Is the spoke’s kube-apiserver up?	Spoke’s in-cluster monitoring exports `up` for the apiserver target
Are all `ClusterOperator`s `Available=True`?	Spoke’s `cluster-version-operator` exports `cluster_operator_up` and `cluster_operator_conditions`
Is `ClusterVersion` healthy?	Spoke’s CVO; surfaced as `cluster_version_payload` and the `Available` condition
Is etcd quorate?	Spoke’s etcd-operator exports `etcd_server_has_leader`, member counts
Are certificates close to expiring?	Spoke’s `kube_apiserver_pki_*` series and `apiserver_client_certificate_expiration_seconds`
How many nodes are Ready?	Spoke’s `kube_node_status_condition` from kube-state-metrics
Is the spoke heartbeating to the hub?	Hub-side: `acm_managed_cluster_status_condition_available`

The first six come from each spoke’s own openshift-monitoring stack and are forwarded to the hub by the same observability-addon that ships workload metrics — they’re a small subset of the curated allowlist. The last one is hub-native, emitted by the registration controller on the hub and not present on any spoke.

Alerts the hub ships

RHACM installs a small set of cluster-health alert rules into the hub’s Thanos Ruler when observability is enabled. The names worth knowing:

ManagedClusterConditionUnknown — a spoke hasn’t heartbeated for ~5 minutes. The registration controller flipped its Available condition from True to Unknown.
ManagedClusterImportUnavailable — the import process for a new spoke stalled. Common during fleet expansion if klusterlet rollout is blocked.
MultiClusterObservabilityClusterDown — the metrics-collector on a spoke has stopped pushing. Either the addon is unhealthy or the spoke’s egress to the hub’s Thanos Receive is broken.
PolicyGovernanceInfo-shaped alerts — RHACM’s GRC operator emits a counter for each non-compliant policy; alerts evaluate that counter against thresholds.

These need an Alertmanager route to your on-call channel. The default route in the hub’s alertmanager-main Secret usually sends everything to a generic webhook; for real on-call coverage, write a route block that splits cluster-health alerts to the platform-team channel and policy alerts to the security-team channel.

The “ManagedCluster Unknown” trap

The single most important alert in this set is ManagedClusterConditionUnknown. It fires whenever a managed cluster’s registration agent stops heartbeating for the staleness window. The causes, in rough order of frequency:

Spoke kube-apiserver is down. The agent can’t authenticate to the hub if its own kube-apiserver isn’t serving. This is the case where the alert is doing its real job — telling you the cluster has a control-plane outage before any of the spoke’s own alerts can reach you (they can’t reach you; the spoke is down).
Network partition between spoke and hub. The agent is healthy on the spoke but can’t open TCP/6443 outbound to the hub. WAN flaps, firewall changes, an Argo sync that nuked the hub’s Route — all candidates.
Registration agent crashed. The Pod itself OOMed or hit a CrashLoop. Spoke is fine; the messenger is sick. Look at the klusterlet namespace on the spoke.
Spoke’s client cert rotation broke. The agent rotates its identity cert periodically; if rotation fails and the existing cert expires, the agent goes Unknown silently. Module 02 covers the bootstrap-secret recovery flow.

Crucially, when the cause is a spoke outage or a network partition, none of the spoke’s own alerts can reach the hub for the duration: the same path that ships metrics is the path that ships alerts. The hub-side ManagedClusterConditionUnknown is then your only fleet-level indicator that something is wrong cluster-wide. Treat it as a paging alert, not an email-best-effort one.
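The heartbeat logic behind this alert can be sketched in a few lines. This is a simplified model, not the registration controller's code: the 5-minute staleness window comes from the description above, and the cluster names and function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

STALENESS = timedelta(minutes=5)  # approximate window described above


def available_condition(last_heartbeat, now, staleness=STALENESS):
    """Return the Available condition status the hub would record."""
    return "Unknown" if now - last_heartbeat > staleness else "True"


def clusters_to_page(heartbeats, now):
    """List clusters whose heartbeat is stale enough to page on."""
    return sorted(
        name for name, seen in heartbeats.items()
        if available_condition(seen, now) == "Unknown"
    )


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
heartbeats = {
    "spoke-a": now - timedelta(seconds=40),  # healthy
    "spoke-b": now - timedelta(minutes=7),   # stale: any of the four causes
}

print(clusters_to_page(heartbeats, now))
```

Note that the hub only sees the absence of a heartbeat. It cannot distinguish the four causes listed above; that triage happens on the spoke.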

The lab’s setup

The lab keeps the workload-vs-health split deliberate. Workload observability lives in two places:

Spoke-local Loki and Tempo, for logs and traces with cluster-local correlation. See /docs/openshift-platform/openshift-platform/platform-services/cluster-logging-and-loki/.
SigNoz on a separate VM, the long-term observability sink for out-of-cluster workloads. See /docs/openshift-platform/lab-infrastructure/observability-vms/signoz-overview.

Cluster-health observability is hub-side. When MCO is enabled, the curated allowlist includes the cluster-health series and the hub’s Thanos Ruler evaluates the RHACM-shipped alert rules. The on-call routing for cluster-health alerts is documented at /docs/openshift-platform/operations/incidents-and-runbooks/ — that’s the home for the runbooks that the alerts page into.

For the lab today, MCO is not yet enabled on hub-dc-v6 (the object-store wiring is parked behind the same NooBaa work that gates several other features). The ManagedClusterConditionUnknown alert still fires through the hub’s cluster-monitoring-operator because the registration controller exports its conditions to the in-cluster Prometheus regardless of whether MCO is on. That’s the minimum viable alerting posture; the full MCO-backed setup is a near-term upgrade.

Try this

Three exercises, each ~10 minutes:

1. List every ClusterOperator status across the fleet via ACM Search. From the ACM Search UI: kind:ClusterOperator. Add a column for the Available condition. Spot any operator showing Degraded=True across the fleet — these are the candidates for the next round of remediation policies.

2. Read the registration controller’s exported metrics. On the hub: oc -n open-cluster-management-hub port-forward svc/registration-controller 8443:443 and curl https://localhost:8443/metrics (you’ll need the in-cluster token). Find acm_managed_cluster_status_condition_available — the gauge that drives ManagedClusterConditionUnknown. Note the labels; they’re how you’d build a hub-Grafana panel of “current heartbeat state per cluster.”
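For exercise 2, once you have the /metrics output, a short parser helps you read the gauge per cluster. The sketch below handles only simple Prometheus text-format lines; the sample payload, the managed_cluster_name label, and the "1 means available" encoding are all assumptions, so check them against what your hub actually returns.

```python
import re

# ASSUMED sample payload; label names and values are hypothetical.
SAMPLE = """\
# TYPE acm_managed_cluster_status_condition_available gauge
acm_managed_cluster_status_condition_available{managed_cluster_name="spoke-a"} 1
acm_managed_cluster_status_condition_available{managed_cluster_name="spoke-b"} 0
"""

LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>.*)\})?'               # optional label set
    r'\s+(?P<value>\S+)(?:\s+\d+)?$'         # value, optional timestamp
)
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metric(text, metric):
    """Return [(labels_dict, value)] for every sample of the named metric."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        m = LINE.match(line)
        if m and m.group("name") == metric:
            labels = dict(LABEL.findall(m.group("labels") or ""))
            samples.append((labels, float(m.group("value"))))
    return samples


samples = parse_metric(SAMPLE, "acm_managed_cluster_status_condition_available")
unavailable = [lbl["managed_cluster_name"] for lbl, value in samples if value != 1]
print(unavailable)
```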

3. Sketch the Alertmanager route. Write the YAML for an Alertmanager route block that:

Sends ManagedClusterConditionUnknown to the platform on-call channel with severity: critical.
Sends MultiClusterObservabilityClusterDown to the observability team with severity: warning.
Inhibits the second alert for a cluster when the first is already firing — no point waking observability when the cluster is down for unrelated reasons.

The exercise is the route design, not the apply.
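Before writing the YAML, it can help to model the intended behaviour. The sketch below simulates first-match routing and one inhibition rule in plain Python; it mirrors the idea behind Alertmanager's child routes and its inhibit rule fields (source_matchers, target_matchers, equal), but it is not Alertmanager code. The receiver names and the cluster label are assumptions.

```python
# (label matchers, receiver); first match wins, as with non-continuing routes.
ROUTES = [
    ({"alertname": "ManagedClusterConditionUnknown", "severity": "critical"},
     "platform-oncall"),
    ({"alertname": "MultiClusterObservabilityClusterDown", "severity": "warning"},
     "observability-team"),
]
DEFAULT_RECEIVER = "default-webhook"

# Mute the target alert when the source fires with the same `cluster` label.
INHIBIT = {
    "source": {"alertname": "ManagedClusterConditionUnknown"},
    "target": {"alertname": "MultiClusterObservabilityClusterDown"},
    "equal": ["cluster"],
}


def matches(labels, wanted):
    return all(labels.get(k) == v for k, v in wanted.items())


def receiver_for(alert):
    for wanted, receiver in ROUTES:
        if matches(alert, wanted):
            return receiver
    return DEFAULT_RECEIVER


def is_inhibited(alert, firing):
    if not matches(alert, INHIBIT["target"]):
        return False
    return any(
        matches(src, INHIBIT["source"])
        and all(src.get(l) == alert.get(l) for l in INHIBIT["equal"])
        for src in firing
    )


def notifications(firing):
    """Return (alertname, cluster, receiver) for alerts that actually notify."""
    return [(a["alertname"], a["cluster"], receiver_for(a))
            for a in firing if not is_inhibited(a, firing)]


firing = [
    {"alertname": "ManagedClusterConditionUnknown", "severity": "critical", "cluster": "spoke-a"},
    {"alertname": "MultiClusterObservabilityClusterDown", "severity": "warning", "cluster": "spoke-a"},
    {"alertname": "MultiClusterObservabilityClusterDown", "severity": "warning", "cluster": "spoke-b"},
]
print(notifications(firing))
```

The observability team is notified about spoke-b only; the spoke-a collector alert is suppressed because the whole cluster is already known to be unreachable.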

Search

The second addon. Different job entirely: not metrics, but resource state. The Search addon runs a search-collector on each managed cluster that watches the etcd state via the kube-apiserver, then publishes resource lists to a hub PostgreSQL index. The ACM Search UI queries that index.

Use cases that pay for it on day one:

“What version of cert-manager is installed on every cluster?”
“Which namespaces have a Deployment named redis?”
“Show every Pod with image containing nginx across the fleet.”
“Which ManagedClusters have the compliance=pci-dss label?”

It’s oc get across the whole fleet without context-switching. The query language is small (key:value with simple boolean composition) but covers 80% of “where the hell is X” investigations.
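The shape of a key:value query is easy to model. The sketch below is a toy matcher over an in-memory list, not the Search API: it ANDs all terms, matches image by substring (to mirror the "image containing nginx" example) and everything else exactly. The real Search matching semantics may differ, and the index rows are made up.

```python
SUBSTRING_KEYS = {"image"}  # assumption: only `image` matches by substring


def parse_query(query):
    """Turn 'kind:Deployment image:nginx' into {'kind': ..., 'image': ...}."""
    filters = {}
    for term in query.split():
        key, sep, value = term.partition(":")
        if not sep or not value:
            raise ValueError(f"expected key:value, got {term!r}")
        filters[key] = value
    return filters


def search(index, query):
    """Return rows matching every key:value term (implicit AND)."""
    filters = parse_query(query)

    def row_matches(row):
        for key, wanted in filters.items():
            have = str(row.get(key, "")).lower()
            if key in SUBSTRING_KEYS:
                if wanted.lower() not in have:
                    return False
            elif have != wanted.lower():
                return False
        return True

    return [row for row in index if row_matches(row)]


# Hypothetical index rows, one per resource per cluster.
index = [
    {"cluster": "spoke-a", "kind": "Deployment", "name": "web", "image": "quay.io/example/nginx:1.25"},
    {"cluster": "spoke-b", "kind": "Deployment", "name": "redis", "image": "quay.io/example/redis:7"},
    {"cluster": "spoke-b", "kind": "Pod", "name": "web-abc", "image": "quay.io/example/nginx:1.25"},
]

for row in search(index, "kind:Deployment image:nginx"):
    print(row["cluster"], row["name"])
```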

Insights integration

ACM ingests Red Hat Insights findings per managed cluster — vulnerability scans, configuration recommendations, advisories. The Insights tab on each cluster surfaces these alongside the cluster’s health.

The signal-to-noise is mixed. Insights catches genuinely useful things — pending CVE upgrades, misconfigured CSRs, etcd recommendations — and also catches things that don’t apply to your environment (“upgrade to a SaaS feature you don’t use”). Treat it as a queue of suggestions to triage, not a list of must-fix items. The remediation links are the most useful part: most findings come with a one-click rollout to apply the fix as a Policy.

Performance traps

A few sharp edges that surface around fleet sizes of 25+ clusters:

Search index scales linearly with object count. A fleet of 50 clusters × 10k objects per cluster = 500k rows. The hub PostgreSQL underneath needs SSD storage and enough RAM that its working set fits in shared buffers. Spinning rust will time out queries.
search-collector OOMs. Big spokes with many Pods sometimes OOMKill the collector mid-sync. Bump its memory request before you debug a “stale index” mystery.
Hub Grafana is single-tenant. Multi-tenancy isolation happens at the Thanos data source level (cluster label filtering), not via Grafana user permissions. Don’t expect Grafana folder-level RBAC to gatekeep tenants.
Thanos Query timeouts. Long time ranges against many clusters hit the default 2-minute timeout. Either narrow the range, add recording rules to pre-aggregate, or bump the query-frontend timeout and accept the latency.

The lab’s deliberate hub/spoke split

The lab’s design uses MCO for metrics fan-in (once it is enabled; as noted earlier, it is not yet running on hub-dc-v6). But it keeps logs and traces local per-spoke:

Loki runs on each spoke under the Loki Operator, fed by the Cluster Logging Operator’s collectors (/docs/openshift-platform/openshift-platform/platform-services/cluster-logging-and-loki/).
Tempo runs on each spoke under the Tempo Operator, backed by per-spoke object storage.
Cluster Observability Operator (COO) on the spoke runs Perses dashboards for the spoke-local view (/docs/openshift-platform/openshift-platform/platform-services/perses-dashboards/).
SigNoz on a separate VM is the long-term observability sink for some out-of-cluster workloads (/docs/openshift-platform/lab-infrastructure/observability-vms/signoz-overview/).

Why split it? Metrics are aggregable and small per series; the fan-in pays off. Logs and traces are bulky and best correlated with cluster-local context (Pods, namespaces, request IDs). Federating them to the hub means moving terabytes a day and losing the per-cluster correlation you actually need when debugging.

The console hides the split well — Perses tabs on the spoke, Grafana on the hub — but the data paths are different by design.

Try this

Open the ACM Search UI on the hub. Run a query for kind:Deployment image:nginx — every Deployment in the fleet whose image references nginx. Add a cluster: filter to narrow to one cluster. The UI prints a count so you can spot-check vs. oc get.
Read the MultiClusterObservability CR (if deployed): oc get mco observability -o yaml. Identify the allowlist ConfigMap reference. Add a single metric name to the allowlist (one you know your apps emit) and watch it appear in Thanos Query within a few minutes.
In the hub Grafana, write a query: count by (cluster) (kube_pod_info). You should get one row per managed cluster with the Pod count on that cluster. Try the same for node instead of cluster — that’s your per-node pod-density chart for free.
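If the count by aggregation in the last exercise is unfamiliar, the sketch below shows what it does to a set of kube_pod_info series: group by the chosen labels and count the series in each group. The series are made-up sample data, and the function is a stand-in for PromQL, not a PromQL engine.

```python
from collections import Counter


def count_by(series, *labels):
    """Group series by the given labels and count members of each group."""
    return Counter(tuple(s.get(label, "") for label in labels) for s in series)


# Hypothetical kube_pod_info series: one per Pod, carrying cluster and node labels.
kube_pod_info = [
    {"cluster": "spoke-a", "node": "worker-0", "pod": "web-1"},
    {"cluster": "spoke-a", "node": "worker-0", "pod": "web-2"},
    {"cluster": "spoke-a", "node": "worker-1", "pod": "db-1"},
    {"cluster": "spoke-b", "node": "worker-0", "pod": "web-1"},
]

print(count_by(kube_pod_info, "cluster"))
print(count_by(kube_pod_info, "node"))
print(count_by(kube_pod_info, "cluster", "node"))
```

One design note: grouping by node alone merges identically named nodes from different clusters (worker-0 above), so on a fleet you usually want count by (cluster, node) for the per-node density chart.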

Common failure modes

observability-addon stuck Available=False on a spoke. The collector cannot reach the hub’s Thanos Receive route. Check DNS resolution for the receive route from inside the spoke, then network egress. Receive routes use OpenShift Route with re-encrypt TLS; cert trust on the spoke is the second-most common cause.
Search index goes stale. The search-collector Pod on one or more spokes got OOMKilled. Check oc logs -n open-cluster-management-agent-addon deployment/search-collector --previous. Bump memory request, re-roll.
Grafana dashboards empty. Thanos Query is timing out, usually because Thanos Store can’t reach object storage. From inside the store Pod, wget the bucket endpoint. The classic cause is an OBC whose AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY Secret keys never got bridged to the operand’s lowercase variants (access_key and secret_key in the Thanos object-store config); see the lab’s NooBaa OBC→operand bridge pattern.
Hub AlertManager fire hose. Spoke-internal alerts were configured to forward to the hub. Move them back to the spoke’s local AlertManager and only forward the SLI-shaped fleet alerts.
Insights tab empty for one cluster. The cluster lost its cloud.openshift.com egress, or the Telemetry pull secret is missing on the spoke. Insights piggybacks on the same support pull-secret flow as Telemetry.
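The key-name mismatch behind the "Grafana dashboards empty" failure is easiest to see as a mapping. The sketch below converts the data an ObjectBucketClaim produces (a ConfigMap and a Secret, assumed already base64-decoded) into the shape of a Thanos S3 object-store config. It is an illustration of the idea, not the lab's actual bridge; the function name and sample values are hypothetical.

```python
def thanos_objstore_config(obc_configmap, obc_secret, insecure=False):
    """Map OBC-style keys to the lowercase keys a Thanos S3 config expects."""
    required_cm = ("BUCKET_NAME", "BUCKET_HOST", "BUCKET_PORT")
    required_secret = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    missing = [k for k in required_cm if k not in obc_configmap]
    missing += [k for k in required_secret if k not in obc_secret]
    if missing:
        # Failing loudly here beats an empty dashboard later.
        raise KeyError(f"OBC is missing keys: {', '.join(missing)}")
    return {
        "type": "S3",
        "config": {
            "bucket": obc_configmap["BUCKET_NAME"],
            "endpoint": f'{obc_configmap["BUCKET_HOST"]}:{obc_configmap["BUCKET_PORT"]}',
            "access_key": obc_secret["AWS_ACCESS_KEY_ID"],
            "secret_key": obc_secret["AWS_SECRET_ACCESS_KEY"],
            "insecure": insecure,
        },
    }


# Hypothetical OBC output.
configmap = {"BUCKET_NAME": "thanos-example", "BUCKET_HOST": "s3.example.svc", "BUCKET_PORT": "443"}
secret = {"AWS_ACCESS_KEY_ID": "EXAMPLEKEY", "AWS_SECRET_ACCESS_KEY": "examplesecret"}

config = thanos_objstore_config(configmap, secret)
print(config["config"]["endpoint"])
```

If the bridge step is skipped, the operand sees no access_key or secret_key at all, which surfaces as the object-storage failure described above rather than as an obvious credentials error.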

References

ACM Observability docs — https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/
Thanos design — https://thanos.io/tip/thanos/design.md/
OCM Search proposal — https://open-cluster-management.io/
OpenShift Cluster Monitoring — https://docs.openshift.com/container-platform/latest/monitoring/monitoring-overview.html
Red Hat Insights — https://docs.redhat.com/en/documentation/red_hat_insights/
Lab — /docs/openshift-platform/openshift-platform/platform-services/cluster-logging-and-loki/
Lab — /docs/openshift-platform/openshift-platform/platform-services/perses-dashboards/
Lab — /docs/openshift-platform/lab-infrastructure/observability-vms/signoz-overview/

Next: Module 08 — Hosted Control Planes and Cluster Pools covers two ways to provision more clusters quickly — HyperShift’s hosted control planes and ACM’s warm-pool ClusterPool.