ADR 0012 — Monitoring observability learning VM

A component-built observability stack on one Ubuntu VM (Grafana, Prometheus, Alertmanager, Loki, Tempo, Pyroscope, Grafana Alloy, Blackbox Exporter) so the lab can learn the primitives, not just the product bundle.

Date: 2026-05-08
Status: Accepted; base VM and M1–M7 stack deployed 2026-05-08.

Context

The operator requested a comprehensive, independent monitoring VM, both for learning and for monitoring lab applications, VMs, and future OpenShift-facing services. The VM is provisioned from an Ubuntu cloud-init image, follows the lab's default zahid credential convention, and remains separate from the core disconnected OpenShift rebuild.

This is broader than the SigNoz VM tracker (ADR 0010). SigNoz remains a standalone self-hosted observability product track for comparison or later collapse. This ADR accepts an explicit component-based learning stack so the lab can learn the primitives directly: metrics, alerts, logs, traces, profiles, exporters, synthetic checks, dashboards, SLOs, and collector topologies. Building it one piece at a time forces the operator to understand each component, which a single all-in-one product hides.

Decision

Deploy monitoring-0 as one standalone Ubuntu 24.04 cloud-init VM on br30.

Property            Value
VM name             monitoring-0
Private DNS         monitoring-0.sub.comptech-lab.com, monitoring.sub.comptech-lab.com
Public edge DNS     monitoring.apps.sub.comptech-lab.com (optionally grafana.apps.sub.comptech-lab.com as UI alias)
Public edge TLS     existing Let’s Encrypt wildcard for *.apps.sub.comptech-lab.com
Default admin       zahid
Credential custody  secrets/monitoring-vm/
vCPU / RAM          8 / 32768 MiB
OS disk             100 GiB
Data disk           1 TiB at /var/lib/monitoring (or component-specific bind mounts beneath it)

Component stack

Install as native Linux/systemd services (unless a later prerequisite phase records a better-supported packaging choice):

Component                           Role
Grafana                             dashboards, exploration
Prometheus                          metrics scrape, rules, alert evaluation
Alertmanager                        alert grouping, routing, silences, notifications
Loki                                logs
Tempo                               traces
Pyroscope                           continuous profiling experiments
Grafana Alloy                       preferred collector / agent runtime
Blackbox Exporter                   HTTP / TCP / DNS / TLS probing
Node Exporter                       Linux host metrics (on monitored VMs)
Service exporters / native metrics  Vault, Kafka, Redis, Jenkins, HAProxy, PowerDNS, Nexus, MinIO, Trivy, SigNoz, WSO2, future OpenShift endpoints

The first deployment (2026-05-08) installed Grafana, Prometheus, Alertmanager, Loki, Tempo, Pyroscope, Grafana Alloy, Blackbox Exporter, and Node Exporter as native systemd services. Fuller service-specific exporters and dashboards remain in M8; operational hardening remains in M9.
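
As an illustration of what "native systemd service" means for one component, the sketch below renders a minimal unit file for Prometheus. It is a hedged example, not the unit deployed on monitoring-0: the binary path, the config path, the prometheus service account, and the ServiceSpec / render_unit names are all assumptions made for illustration. Only the /var/lib/monitoring data-disk root comes from this ADR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSpec:
    """Inputs for one component's unit file (all values are illustrative)."""
    name: str          # unit name without the .service suffix
    description: str   # free-text Description= line
    user: str          # dedicated service account (assumed to exist)
    exec_start: str    # full command line for ExecStart=


def render_unit(spec: ServiceSpec) -> str:
    """Render a minimal systemd unit file for one stack component."""
    lines = [
        "[Unit]",
        f"Description={spec.description}",
        "Wants=network-online.target",
        "After=network-online.target",
        "",
        "[Service]",
        f"User={spec.user}",
        f"ExecStart={spec.exec_start}",
        "Restart=on-failure",
        "",
        "[Install]",
        "WantedBy=multi-user.target",
    ]
    return "\n".join(lines) + "\n"


# Assumed paths: /usr/local/bin and /etc/prometheus are hypothetical;
# the TSDB directory sits beneath the data disk named in this ADR.
prometheus = ServiceSpec(
    name="prometheus",
    description="Prometheus metrics server",
    user="prometheus",
    exec_start=(
        "/usr/local/bin/prometheus "
        "--config.file=/etc/prometheus/prometheus.yml "
        "--storage.tsdb.path=/var/lib/monitoring/prometheus"
    ),
)

unit_text = render_unit(prometheus)
```

Writing unit_text to /etc/systemd/system/prometheus.service and running systemctl daemon-reload would be the usual next step. Each other component would follow the same pattern with its own binary and flags.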

OpenTelemetry topology

Use a two-layer OpenTelemetry model after the base metrics/logging stack is working:

  1. Local agent layer on app/VM nodes.
    • Alloy or OTel Collector receives app OTLP on localhost.
    • Tails local logs, adds consistent resource labels, batches, retries, and forwards to the monitoring VM.
  2. Central gateway layer on monitoring-0.
    • Receives private OTLP gRPC/HTTP on 4317 / 4318.
    • Applies shared filtering, redaction, sampling, routing, and future auth.
    • Sends traces to Tempo, logs to Loki, and selected OTel metrics to the accepted metrics path.
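
The two layers can be modelled in a few lines of plain Python to make the division of labour concrete. This is a conceptual sketch only, not Alloy or OTel Collector configuration: the Record type, the host and service label names, the redaction pattern, and the "accepted-metrics-path" placeholder are assumptions, and retries are not modelled.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Iterable, List

# Gateway routing table from the topology above; the metrics target is a
# placeholder because the accepted metrics path is decided per signal.
ROUTES = {"traces": "tempo", "logs": "loki", "metrics": "accepted-metrics-path"}

# Hypothetical redaction rule: mask values of password= / token= pairs.
SECRET_PATTERN = re.compile(r"(password|token)=\S+", re.IGNORECASE)


@dataclass
class Record:
    signal: str                      # "traces", "logs", or "metrics"
    body: str
    labels: Dict[str, str] = field(default_factory=dict)


def agent_enrich(record: Record, host: str, service: str) -> Record:
    """Local agent layer: attach consistent resource labels."""
    labels = {**record.labels, "host": host, "service": service}
    return Record(record.signal, record.body, labels)


def agent_batch(records: Iterable[Record], size: int) -> List[List[Record]]:
    """Local agent layer: group records into fixed-size batches."""
    items = list(records)
    return [items[i:i + size] for i in range(0, len(items), size)]


def gateway_process(batch: Iterable[Record]) -> Dict[str, List[Record]]:
    """Central gateway layer: filter, redact, and route one batch."""
    routed: Dict[str, List[Record]] = {}
    for record in batch:
        backend = ROUTES.get(record.signal)
        if backend is None:
            continue  # shared filtering: drop unknown signal types
        clean = SECRET_PATTERN.sub(
            lambda m: f"{m.group(1)}=[REDACTED]", record.body
        )
        routed.setdefault(backend, []).append(
            Record(record.signal, clean, record.labels)
        )
    return routed
```

Keeping enrichment and batching on the agent and redaction and routing on the gateway mirrors the split described above: per-node concerns stay local, and shared policy lives in one place on monitoring-0.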

Kiali

Kiali is explicitly excluded from this VM. If OpenShift Service Mesh / Istio is deployed later, Kiali should be installed inside the relevant OpenShift service-mesh environment, not on this independent VM.

Alternatives considered

Use SigNoz alone. Already on the menu (ADR 0010). Rejected because the operator wants to also learn the primitives, and a product bundle hides them. Both VMs exist intentionally.

Use the OpenShift Cluster Monitoring stack + Grafana federation. Rejected for the same reasons as in ADR 0010: OCP user-workload monitoring covers cluster observability but not the broader app/VM observability the lab needs.

Install all components inside Docker Compose. Possible. Rejected for the learning angle: native systemd services force the operator to write systemd units, define ExecStart lines, and understand each component’s binary, configuration, and journal output without Docker abstracting them away.

Put each component in its own VM. Cleanest separation. Rejected because the lab is a single-operator environment and the resource cost (8+ VMs for one stack) doesn’t pay back. The 8 vCPU / 32 GiB / 1 TiB single VM holds the stack comfortably.

Phase gates

  0. Scope, architecture, IP plan. Accept this ADR. Create the GitHub milestone (#19) and phase issues. Record proposed VM allocation and relationship to the SigNoz track.
  1. Prerequisite validation. Confirm br30, Ubuntu base image, SSH keys, DNS resolver, available IP/MAC, PowerDNS, HAProxy, outbound package access, storage capacity. Confirm no live VM already uses monitoring-0 or the proposed MAC.
  2. Cloud-init plan and credential custody. Prepare VM provisioning inputs, disk layout, firewall contract, local-only zahid admin custody.
  3. VM provisioning. Create the Ubuntu VM. Validate cloud-init completion, SSH access, resolver config, data-disk mount, baseline firewalling.
  4. Base metrics and alerting stack. Install Grafana, Prometheus, Alertmanager, Node Exporter, Blackbox Exporter; baseline dashboards, scrape configs, first alert rules.
  5. Logging stack. Install Loki and Alloy log collection. Define journald / app-log labels, redaction guardrails, retention.
  6. Tracing and OTel gateway. Install Tempo. Configure the central private OTLP gateway. Define local agent templates for app/VM nodes.
  7. Profiling and advanced instrumentation. Install Pyroscope. Document optional Beyla / eBPF instrumentation as a learning path.
  8. Service integrations and dashboards. Add exporters, probes, dashboards, alert rules for Vault, Kafka, Redis, Jenkins, HAProxy, PowerDNS, Nexus, MinIO, Trivy, SigNoz, WSO2, future OpenShift endpoints.
  9. SLOs, hardening, backup, handoff. Define SLOs, alert routing, retention, backups, restore drills, restart validation, credential hardening, final handoff evidence.
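
The gating idea can be expressed as a small check that refuses to start a phase until every earlier phase has passed. This is a sketch under the assumption that the gates are strictly sequential, which this ADR implies but does not state outright; the phase identifiers and function names are hypothetical shorthand for the list above.

```python
from typing import Optional, Set

# Phase identifiers in gate order (hypothetical slugs for the phases above).
PHASES = [
    "scope-architecture-ip-plan",
    "prerequisite-validation",
    "cloud-init-and-credential-custody",
    "vm-provisioning",
    "base-metrics-and-alerting",
    "logging",
    "tracing-and-otel-gateway",
    "profiling-and-advanced-instrumentation",
    "service-integrations-and-dashboards",
    "slos-hardening-backup-handoff",
]


def can_start(phase: str, passed: Set[str]) -> bool:
    """Return True only if every phase before `phase` has passed its gate."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase}")
    index = PHASES.index(phase)
    return all(earlier in passed for earlier in PHASES[:index])


def next_allowed_phase(passed: Set[str]) -> Optional[str]:
    """Return the first phase that has not passed yet, or None when done."""
    for phase in PHASES:
        if phase not in passed:
            return phase
    return None
```

For example, can_start("vm-provisioning", {"scope-architecture-ip-plan"}) is False because prerequisite validation and the cloud-init plan have not passed yet.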

Guardrails

  • Do not make live VM, PowerDNS, HAProxy, OpenShift, or GitOps changes before prerequisite validation passes.
  • Keep this track separate from OpenShift rebuild, Kafka, Vault, Redis, WSO2, Jenkins, SigNoz, and Trivy milestones.
  • Do not duplicate metrics unintentionally. For each signal, define one accepted path before onboarding it.
  • Keep HAProxy exposure UI-focused. OTLP, Prometheus, Loki, Tempo, Pyroscope, and exporter ports should remain private unless an explicit phase approves a TLS/auth/source-restricted exposure path.
  • Do not store Grafana admin passwords, datasource credentials, Alertmanager notifier secrets, OTLP tokens, API keys, pull secrets, kubeconfigs, or scan target credentials in Git or chat.
  • Store the default zahid admin password and generated tokens only under ignored local custody such as secrets/monitoring-vm/. Migrate durable secrets into Vault once the Vault consumer contract is ready.
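
The exposure guardrail lends itself to an automated check. The sketch below flags any edge backend that points at a port meant to stay private. Ports 4317 and 4318 come from this ADR; the remaining numbers are the upstream default ports for each component and are an assumption about how monitoring-0 is configured. The backend names and the find_violations helper are hypothetical.

```python
from typing import Dict, FrozenSet, List

# Ports that should not be published through HAProxy without explicit approval.
# 4317/4318 are stated in this ADR; the rest assume upstream default ports.
PRIVATE_PORTS = {
    4317: "OTLP gRPC",
    4318: "OTLP HTTP",
    9090: "Prometheus",
    3100: "Loki",
    3200: "Tempo",
    4040: "Pyroscope",
    9100: "Node Exporter",
    9115: "Blackbox Exporter",
}


def find_violations(
    edge_backends: Dict[str, int],
    approved_ports: FrozenSet[int] = frozenset(),
) -> List[str]:
    """List edge backends that expose a private port without approval.

    edge_backends maps a backend name to the monitoring-0 port it targets.
    approved_ports holds ports that an explicit phase has cleared for a
    TLS/auth/source-restricted exposure path.
    """
    violations = []
    for name, port in sorted(edge_backends.items()):
        if port in PRIVATE_PORTS and port not in approved_ports:
            violations.append(
                f"{name}: port {port} ({PRIVATE_PORTS[port]}) must stay private"
            )
    return violations
```

A Grafana UI backend on its default port 3000 passes, while a backend that forwards to 4317 is reported until that port is added to approved_ports.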

Consequences

  • The lab gets an independent learning platform for observability primitives rather than relying only on a product bundle or in-cluster OpenShift stack.
  • Future OpenShift clusters can use this VM as an external learning and bootstrap observability target. Production-grade OpenShift monitoring is still a separate cluster-design decision.
  • The first deployment is lab/bootstrap-oriented. Production readiness still requires retention sizing, backup/restore validation, Vault-backed credentials, TLS/auth/source controls for telemetry, alert routing, and upgrade rehearsal.
  • Signal-source choice matters. Because the same metric can arrive via a Prometheus scrape, an OTel Collector, or SigNoz OTLP, the operator must record the accepted path for each service and then enforce it in the dashboards and alert rules.
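
One way to make the accepted-path rule checkable is a small registry keyed by service and signal, compared against the paths actually observed. This is a minimal sketch: the path names, the registry shape, and the check_signal_paths helper are assumptions for illustration, not an existing tool in the lab.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str]  # (service, signal), e.g. ("vault", "metrics")


def check_signal_paths(
    accepted: Dict[Key, str],
    observed: Iterable[Tuple[str, str, str]],
) -> List[str]:
    """Report signals with no accepted path or with extra, unaccepted paths.

    accepted maps (service, signal) to the single accepted path name.
    observed yields (service, signal, path) for every path seen in practice.
    """
    seen: Dict[Key, set] = defaultdict(set)
    for service, signal, path in observed:
        seen[(service, signal)].add(path)

    problems = []
    for (service, signal), paths in sorted(seen.items()):
        want = accepted.get((service, signal))
        if want is None:
            problems.append(f"{service}/{signal}: no accepted path recorded")
            continue
        for path in sorted(paths - {want}):
            problems.append(
                f"{service}/{signal}: arrives via {path} "
                f"but accepted path is {want}"
            )
    return problems
```

Running such a check before onboarding a service would surface both a missing decision and an unintended duplicate before either reaches dashboards or alert rules.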

Last reviewed: 2026-05-11