ADR 0012 — Monitoring observability learning VM

A component-built observability stack on one Ubuntu VM — Grafana, Prometheus, Alertmanager, Loki, Tempo, Pyroscope, Grafana Alloy, Blackbox Exporter — so the lab can learn the primitives, not just the product bundle.

Date: 2026-05-08
Status: Accepted; base VM and M1–M7 stack deployed 2026-05-08.

Context

The operator requested a comprehensive, independent monitoring VM, both for learning and for monitoring lab applications, VMs, and future OpenShift-facing services. The VM is built as an Ubuntu cloud-init VM, follows the lab’s default zahid credential convention, and remains separate from the core disconnected OpenShift rebuild.

This is broader than the SigNoz VM tracker (ADR 0010). SigNoz remains a standalone self-hosted observability product track, kept for comparison or for a possible later merge of the two tracks. This ADR accepts an explicit component-based learning stack so the lab can learn the primitives directly: metrics, alerts, logs, traces, profiles, exporters, synthetic checks, dashboards, SLOs, and collector topologies. Building the stack one piece at a time forces the operator to understand each component; a single all-in-one product hides that detail.

Decision

Deploy monitoring-0 as one standalone Ubuntu 24.04 cloud-init VM on br30.

Property	Value
VM name	`monitoring-0`
Private DNS	`monitoring-0.sub.comptech-lab.com`, `monitoring.sub.comptech-lab.com`
Public edge DNS	`monitoring.apps.sub.comptech-lab.com` (optionally `grafana.apps.sub.comptech-lab.com` as UI alias)
Public edge TLS	existing Let’s Encrypt wildcard for `*.apps.sub.comptech-lab.com`
Default admin	`zahid`
Credential custody	`secrets/monitoring-vm/`
vCPU / RAM	8 vCPU / 32768 MiB (32 GiB)
OS disk	100 GiB
Data disk	1 TiB at `/var/lib/monitoring` (or component-specific bind mounts beneath it)

Component stack

Install as native Linux/systemd services (unless a later prerequisite phase records a better-supported packaging choice):

Component	Role
Grafana	dashboards, exploration
Prometheus	metrics scrape, rules, alert evaluation
Alertmanager	alert grouping, routing, silences, notifications
Loki	logs
Tempo	traces
Pyroscope	continuous profiling experiments
Grafana Alloy	preferred collector / agent runtime
Blackbox Exporter	HTTP / TCP / DNS / TLS probing
Node Exporter	Linux host metrics (on monitored VMs)
Service exporters / native metrics	Vault, Kafka, Redis, Jenkins, HAProxy, PowerDNS, Nexus, MinIO, Trivy, SigNoz, WSO2, future OpenShift endpoints

The first deployment (2026-05-08) installed Grafana, Prometheus, Alertmanager, Loki, Tempo, Pyroscope, Grafana Alloy, Blackbox Exporter, and Node Exporter as native systemd services. Fuller service-specific exporters and dashboards remain in M8; operational hardening remains in M9.
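
As an illustration of what "native systemd service" means for one component, the sketch below renders a minimal unit file from Python. It is not the unit actually deployed on monitoring-0. The binary path, the config path, the `prometheus` service account, and the `prometheus` subdirectory under `/var/lib/monitoring` are assumptions; `--config.file` and `--storage.tsdb.path` are standard Prometheus flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str        # systemd unit name without the .service suffix
    user: str        # hypothetical dedicated service account
    exec_start: str  # full command line written to ExecStart=


def render_unit(svc: Service) -> str:
    """Render a minimal systemd unit file for one stack component."""
    lines = [
        "[Unit]",
        f"Description={svc.name} (monitoring-0 learning stack)",
        "Wants=network-online.target",
        "After=network-online.target",
        "",
        "[Service]",
        f"User={svc.user}",
        f"ExecStart={svc.exec_start}",
        "Restart=on-failure",
        "",
        "[Install]",
        "WantedBy=multi-user.target",
    ]
    return "\n".join(lines) + "\n"


# Assumed install and data paths; adjust to the real layout.
prometheus = Service(
    name="prometheus",
    user="prometheus",
    exec_start=(
        "/usr/local/bin/prometheus "
        "--config.file=/etc/prometheus/prometheus.yml "
        "--storage.tsdb.path=/var/lib/monitoring/prometheus"
    ),
)

if __name__ == "__main__":
    print(render_unit(prometheus))
```

Writing the output to `/etc/systemd/system/prometheus.service` and running `systemctl daemon-reload` would be the usual next step. The same renderer could be reused for the other components by swapping the `ExecStart` line.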

OpenTelemetry topology

Use a two-layer OpenTelemetry model after the base metrics/logging stack is working:

Local agent layer on app/VM nodes.
- Alloy or OTel Collector receives app OTLP on localhost.
- Tails local logs, adds consistent resource labels, batches and retries exports, and forwards everything to the monitoring VM.
Central gateway layer on monitoring-0.
- Receives private OTLP gRPC/HTTP on 4317 / 4318.
- Applies shared filtering, redaction, sampling, routing, and future auth.
- Sends traces to Tempo, logs to Loki, and selected OTel metrics to the accepted metrics path.
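
The division of labour between the two layers can be modelled in a few lines of plain Python. This is a conceptual sketch only, not Alloy or OTel Collector configuration: the record shape, the redacted attribute keys, and the `accepted-metrics-path` placeholder are assumptions, and retries and sampling are left out.

```python
from typing import Dict, Iterable, List

# Gateway routing from the text: traces -> Tempo, logs -> Loki,
# metrics -> whichever metrics path has been accepted (placeholder name).
BACKENDS: Dict[str, str] = {
    "traces": "tempo",
    "logs": "loki",
    "metrics": "accepted-metrics-path",
}

# Hypothetical attribute keys the gateway should never forward verbatim.
REDACTED_KEYS = {"password", "authorization", "api_key"}


def agent_enrich(record: dict, host: str, service: str) -> dict:
    """Local agent step: attach consistent resource labels to one record."""
    enriched = dict(record)  # shallow copy so the caller's record is untouched
    enriched["resource"] = {"host": host, "service": service}
    return enriched


def agent_batch(records: Iterable[dict], size: int) -> List[List[dict]]:
    """Local agent step: group records into batches of at most `size`."""
    items = list(records)
    return [items[i:i + size] for i in range(0, len(items), size)]


def gateway_redact(record: dict) -> dict:
    """Gateway step: mask sensitive attribute values before storage."""
    attrs = record.get("attributes", {})
    clean = {
        key: ("[REDACTED]" if key.lower() in REDACTED_KEYS else value)
        for key, value in attrs.items()
    }
    return {**record, "attributes": clean}


def gateway_route(batch: List[dict]) -> Dict[str, List[dict]]:
    """Gateway step: redact each record and group it by destination backend."""
    routed: Dict[str, List[dict]] = {}
    for record in batch:
        backend = BACKENDS[record["signal"]]  # KeyError on an unknown signal
        routed.setdefault(backend, []).append(gateway_redact(record))
    return routed
```

Keeping enrichment and batching on the agent and redaction and routing on the gateway mirrors the split above: per-node concerns stay local, shared policy lives in one place on monitoring-0.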

Kiali

Kiali is explicitly excluded from this VM. If OpenShift Service Mesh / Istio is deployed later, Kiali should be installed inside the relevant OpenShift service-mesh environment, not on this independent VM.

Alternatives considered

Use SigNoz alone. Already tracked in ADR 0010. Rejected because the operator also wants to learn the primitives, which a product bundle hides. Both VMs exist intentionally.

Use the OpenShift Cluster Monitoring stack + Grafana federation. Rejected for the same reasons as in ADR 0010: OCP user-workload monitoring covers cluster observability but not the broader app/VM observability the lab needs.

Run all components under Docker Compose. Possible, but rejected for the learning angle: native systemd services force the operator to write systemd units, define ExecStart lines, and understand each component’s binary, configuration, and journal output without Docker abstracting them away.

Put each component in its own VM. Cleanest separation. Rejected because the lab is a single-operator environment and the resource cost (8+ VMs for one stack) is not justified. The single 8 vCPU / 32 GiB / 1 TiB VM holds the stack comfortably.

Phase gates

Scope, architecture, IP plan. Accept this ADR. Create the GitHub milestone (#19) and phase issues. Record proposed VM allocation and relationship to the SigNoz track.
Prerequisite validation. Confirm br30, Ubuntu base image, SSH keys, DNS resolver, available IP/MAC, PowerDNS, HAProxy, outbound package access, storage capacity. Confirm no live VM already uses monitoring-0 or the proposed MAC.
Cloud-init plan and credential custody. Prepare VM provisioning inputs, disk layout, firewall contract, local-only zahid admin custody.
VM provisioning. Create the Ubuntu VM. Validate cloud-init completion, SSH access, resolver config, data-disk mount, baseline firewalling.
Base metrics and alerting stack. Install Grafana, Prometheus, Alertmanager, Node Exporter, Blackbox Exporter; baseline dashboards, scrape configs, first alert rules.
Logging stack. Install Loki and Alloy log collection. Define journald / app-log labels, redaction guardrails, retention.
Tracing and OTel gateway. Install Tempo. Configure the central private OTLP gateway. Define local agent templates for app/VM nodes.
Profiling and advanced instrumentation. Install Pyroscope. Document optional Beyla / eBPF instrumentation as a learning path.
Service integrations and dashboards. Add exporters, probes, dashboards, alert rules for Vault, Kafka, Redis, Jenkins, HAProxy, PowerDNS, Nexus, MinIO, Trivy, SigNoz, WSO2, future OpenShift endpoints.
SLOs, hardening, backup, handoff. Define SLOs, alert routing, retention, backups, restore drills, restart validation, credential hardening, final handoff evidence.
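
The gate ordering can be expressed as a small check that a phase may start only once every earlier gate has passed. The sketch below is hypothetical tooling, not part of the tracked milestone. The M0 to M9 labels are an assumption inferred from this ADR's references to M8 and M9, and strictly sequential gating is also an assumption.

```python
from typing import List, Optional, Set

# Phase titles copied from the list above. The M0..M9 labels are an
# assumption inferred from the ADR's references to M8 and M9.
PHASES: List[str] = [
    "Scope, architecture, IP plan",
    "Prerequisite validation",
    "Cloud-init plan and credential custody",
    "VM provisioning",
    "Base metrics and alerting stack",
    "Logging stack",
    "Tracing and OTel gateway",
    "Profiling and advanced instrumentation",
    "Service integrations and dashboards",
    "SLOs, hardening, backup, handoff",
]


def may_start(index: int, passed: Set[str]) -> bool:
    """A phase may start only when every earlier gate has passed."""
    return all(f"M{i}" in passed for i in range(index))


def next_phase(passed: Set[str]) -> Optional[str]:
    """Return the first phase whose gate has not passed, or None when done."""
    for index, title in enumerate(PHASES):
        label = f"M{index}"
        if label not in passed:
            return f"{label}: {title}"
    return None
```

For example, with gates M0 through M7 recorded as passed, `next_phase` returns the service-integrations phase, which matches the state described for the first deployment.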

Guardrails

Do not make live VM, PowerDNS, HAProxy, OpenShift, or GitOps changes before prerequisite validation passes.
Keep this track separate from OpenShift rebuild, Kafka, Vault, Redis, WSO2, Jenkins, SigNoz, and Trivy milestones.
Do not duplicate metrics unintentionally. For each signal, define one accepted path before onboarding it.
Keep HAProxy exposure UI-focused. OTLP, Prometheus, Loki, Tempo, Pyroscope, and exporter ports should remain private unless an explicit phase approves a TLS/auth/source-restricted exposure path.
Do not store Grafana admin passwords, datasource credentials, Alertmanager notifier secrets, OTLP tokens, API keys, pull secrets, kubeconfigs, or scan target credentials in Git or chat.
Store the default zahid admin password and generated tokens only under Git-ignored local custody such as secrets/monitoring-vm/. Migrate durable secrets into Vault once the Vault consumer contract is ready.
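
The UI-only exposure guardrail lends itself to an automated check. The following sketch assumes the list of backend ports published through HAProxy has already been extracted by some other means. Apart from 4317 and 4318, which this ADR names, the port numbers are upstream defaults and may not match the actual deployment.

```python
from typing import Dict, Iterable, List

# Upstream default listen ports; the real deployment may differ.
# 4317/4318 are the OTLP ports named in this ADR.
PRIVATE_PORTS: Dict[int, str] = {
    4317: "OTLP gRPC",
    4318: "OTLP HTTP",
    9090: "Prometheus",
    9093: "Alertmanager",
    3100: "Loki",
    3200: "Tempo",
    4040: "Pyroscope",
    9100: "Node Exporter",
    9115: "Blackbox Exporter",
}

# Ports approved for edge exposure (Grafana's default HTTP port).
UI_PORTS: Dict[int, str] = {3000: "Grafana"}


def exposure_violations(edge_backend_ports: Iterable[int]) -> List[str]:
    """List every edge-published backend port that breaks the UI-only rule."""
    violations: List[str] = []
    for port in sorted(set(edge_backend_ports)):
        if port in PRIVATE_PORTS:
            violations.append(f"{port} ({PRIVATE_PORTS[port]}) must stay private")
        elif port not in UI_PORTS:
            violations.append(f"{port} is not an approved UI port")
    return violations
```

An empty result means the edge publishes nothing beyond the approved UI. A phase that approves a TLS/auth/source-restricted telemetry path would move the relevant port from `PRIVATE_PORTS` into an explicit allow-list.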

Consequences

The lab gets an independent learning platform for observability primitives rather than relying only on a product bundle or in-cluster OpenShift stack.
Future OpenShift clusters can use this VM as an external learning and bootstrap observability target. Production-grade OpenShift monitoring is still a separate cluster-design decision.
The first deployment is lab/bootstrap-oriented. Production readiness still requires retention sizing, backup/restore validation, Vault-backed credentials, TLS/auth/source controls for telemetry, alert routing, and upgrade rehearsal.
Signal-source choice matters. Because the same metric can arrive via Prometheus scrape, via an OTel Collector, or via SigNoz OTLP, the operator must record the accepted path for each service and then enforce it in the dashboards and alert rules.
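
One way to record the accepted path per service is a small registry that refuses a second, conflicting path for the same signal. This is a minimal sketch: the class name, the method names, and the three path labels are illustrative, not an existing tool in the lab.

```python
from typing import Dict, Tuple

# Illustrative labels for the three ingestion routes named above.
VALID_PATHS = {"prometheus-scrape", "otel-collector", "signoz-otlp"}


class AcceptedPaths:
    """Records exactly one accepted ingestion path per (service, signal)."""

    def __init__(self) -> None:
        self._paths: Dict[Tuple[str, str], str] = {}

    def accept(self, service: str, signal: str, path: str) -> None:
        """Register the accepted path; reject unknown or conflicting paths."""
        if path not in VALID_PATHS:
            raise ValueError(f"unknown path: {path}")
        key = (service, signal)
        current = self._paths.get(key)
        if current is not None and current != path:
            raise ValueError(
                f"{service}/{signal} already uses {current}; refusing {path}"
            )
        self._paths[key] = path

    def is_accepted(self, service: str, signal: str, path: str) -> bool:
        """True only if `path` is the recorded path for this service signal."""
        return self._paths.get((service, signal)) == path
```

Dashboards and alert rules could then be linted against the registry, so a panel that queries Vault metrics from a path other than the accepted one is flagged before it is merged.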

References

Source: opp-full-plat/adr/0012-monitoring-observability-learning-vm.md
Sister VM: ADR 0010 — SigNoz standalone VM
Edge wiring rules: ADR 0005
GitHub milestone: zeshaq/opp-full-plat #19