ADR 0012 — Monitoring observability learning VM
A component-built observability stack on one Ubuntu VM — Grafana, Prometheus, Alertmanager, Loki, Tempo, Pyroscope, Grafana Alloy, Blackbox Exporter — so the lab can learn the primitives, not just the product bundle.
Date: 2026-05-08 Status: Accepted; base VM and M1–M7 stack deployed 2026-05-08.
Context
The operator requested a comprehensive, independent monitoring VM for learning and for monitoring lab applications, VMs, and future OpenShift-facing services. The VM is provisioned from an Ubuntu cloud-init image, follows the lab's default zahid credential convention, and remains separate from the core disconnected OpenShift rebuild.
This is broader than the SigNoz VM tracker (ADR 0010). SigNoz remains a standalone self-hosted observability product track for comparison or later collapse. This ADR accepts an explicit component-based learning stack so the lab can learn the primitives directly: metrics, alerts, logs, traces, profiles, exporters, synthetic checks, dashboards, SLOs, and collector topologies. Building it one piece at a time forces the operator to understand each component, which a single all-in-one product hides.
Decision
Deploy monitoring-0 as one standalone Ubuntu 24.04 cloud-init VM on br30.
| Property | Value |
|---|---|
| VM name | monitoring-0 |
| Private DNS | monitoring-0.sub.comptech-lab.com, monitoring.sub.comptech-lab.com |
| Public edge DNS | monitoring.apps.sub.comptech-lab.com (optionally grafana.apps.sub.comptech-lab.com as UI alias) |
| Public edge TLS | existing Let’s Encrypt wildcard for *.apps.sub.comptech-lab.com |
| Default admin | zahid |
| Credential custody | secrets/monitoring-vm/ |
| vCPU / RAM | 8 / 32768 MiB |
| OS disk | 100 GiB |
| Data disk | 1 TiB at /var/lib/monitoring (or component-specific bind mounts beneath it) |
Component stack
Install as native Linux/systemd services (unless a later prerequisite phase records a better-supported packaging choice):
| Component | Role |
|---|---|
| Grafana | dashboards, exploration |
| Prometheus | metrics scrape, rules, alert evaluation |
| Alertmanager | alert grouping, routing, silences, notifications |
| Loki | logs |
| Tempo | traces |
| Pyroscope | continuous profiling experiments |
| Grafana Alloy | preferred collector / agent runtime |
| Blackbox Exporter | HTTP / TCP / DNS / TLS probing |
| Node Exporter | Linux host metrics (on monitored VMs) |
| Service exporters / native metrics | Vault, Kafka, Redis, Jenkins, HAProxy, PowerDNS, Nexus, MinIO, Trivy, SigNoz, WSO2, future OpenShift endpoints |
The first deployment (2026-05-08) installed Grafana, Prometheus, Alertmanager, Loki, Tempo, Pyroscope, Grafana Alloy, Blackbox Exporter, and Node Exporter as native systemd services. Fuller service-specific exporters and dashboards remain in M8; operational hardening remains in M9.
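As an illustration of what "native systemd service" means per component, the following is a minimal sketch that renders a unit file from a small description. The binary path, config path, service user, and restart policy are assumptions made for the example; the ADR fixes only the /var/lib/monitoring data root. The two Prometheus flags shown (`--config.file`, `--storage.tsdb.path`) are standard upstream flags.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Service:
    """One native systemd service; every value here is illustrative."""
    name: str
    exec_start: str
    user: str
    after: List[str] = field(default_factory=lambda: ["network-online.target"])


def render_unit(svc: Service) -> str:
    """Render a minimal systemd unit file for one component."""
    after = " ".join(svc.after)
    return "\n".join([
        "[Unit]",
        f"Description={svc.name} (monitoring-0)",
        f"Wants={after}",
        f"After={after}",
        "",
        "[Service]",
        f"User={svc.user}",
        f"ExecStart={svc.exec_start}",
        "Restart=on-failure",
        "",
        "[Install]",
        "WantedBy=multi-user.target",
        "",
    ])


# Hypothetical paths and user: not taken from the ADR.
prometheus = Service(
    name="prometheus",
    exec_start=(
        "/usr/local/bin/prometheus"
        " --config.file=/etc/prometheus/prometheus.yml"
        " --storage.tsdb.path=/var/lib/monitoring/prometheus"
    ),
    user="prometheus",
)

unit_text = render_unit(prometheus)
print(unit_text)
```

Writing each unit by hand is the learning goal here; a generator like this is only useful later, once the operator already understands what each line does.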
OpenTelemetry topology
Use a two-layer OpenTelemetry model after the base metrics/logging stack is working:
- Local agent layer on app/VM nodes.
  - Alloy or OTel Collector receives app OTLP on localhost.
  - Tails local logs, adds consistent resource labels, batches, retries, and forwards to the monitoring VM.
- Central gateway layer on monitoring-0.
  - Receives private OTLP gRPC/HTTP on 4317/4318.
  - Applies shared filtering, redaction, sampling, routing, and future auth.
  - Sends traces to Tempo, logs to Loki, and selected OTel metrics to the accepted metrics path.
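The gateway's redact-then-route behaviour can be sketched in plain Python. This is a conceptual model only, not Alloy or OTel Collector configuration: the record shape, the `route` and `redact` helpers, and the secret-key pattern are all hypothetical, and the metrics destination is a placeholder because the ADR leaves that path to a per-service decision.

```python
import re
from typing import Dict, Tuple

# Backend per signal, as named in the topology. "accepted-metrics-path" is a
# placeholder: the ADR does not fix where OTel metrics land.
ROUTES = {
    "traces": "tempo",
    "logs": "loki",
    "metrics": "accepted-metrics-path",
}

# Hypothetical redaction rule: mask values whose attribute key looks secret.
SECRET_KEY = re.compile(r"(password|token|secret|api[_-]?key)", re.IGNORECASE)


def redact(attributes: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the attributes with secret-looking values masked."""
    return {
        key: ("[REDACTED]" if SECRET_KEY.search(key) else value)
        for key, value in attributes.items()
    }


def route(record: dict) -> Tuple[str, dict]:
    """Redact a telemetry record, then pick its backend by signal type."""
    signal = record["signal"]
    if signal not in ROUTES:
        raise ValueError(f"unknown signal type: {signal!r}")
    cleaned = dict(record, attributes=redact(record.get("attributes", {})))
    return ROUTES[signal], cleaned


backend, out = route({
    "signal": "traces",
    "attributes": {"service.name": "vault", "api_key": "abc123"},
})
print(backend, out["attributes"])
```

Doing redaction once at the central gateway, rather than in every local agent, keeps the rule set in one place, which is the motivation for the two-layer split.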
Kiali
Kiali is explicitly excluded from this VM. If OpenShift Service Mesh / Istio is deployed later, Kiali should be installed inside the relevant OpenShift service-mesh environment, not on this independent VM.
Alternatives considered
Use SigNoz alone. Already on the menu (ADR 0010). Rejected because the operator wants to also learn the primitives, and a product bundle hides them. Both VMs exist intentionally.
Use the OpenShift Cluster Monitoring stack + Grafana federation. Rejected for the same reasons as in ADR 0010: OCP user-workload monitoring covers cluster observability but not the broader app/VM observability the lab needs.
Install all components inside Docker Compose. Possible. Rejected for the learning angle: native systemd services force the operator to write systemd units, define ExecStart lines, and understand each component’s binary, configuration, and journal output without Docker abstracting them away.
Put each component in its own VM. Cleanest separation. Rejected because the lab is a single-operator environment and the resource cost (8+ VMs for one stack) is not justified. The single 8 vCPU / 32 GiB / 1 TiB VM holds the stack comfortably.
Phase gates
- Scope, architecture, IP plan. Accept this ADR. Create the GitHub milestone (#19) and phase issues. Record proposed VM allocation and relationship to the SigNoz track.
- Prerequisite validation. Confirm br30, Ubuntu base image, SSH keys, DNS resolver, available IP/MAC, PowerDNS, HAProxy, outbound package access, storage capacity. Confirm no live VM already uses monitoring-0 or the proposed MAC.
- Cloud-init plan and credential custody. Prepare VM provisioning inputs, disk layout, firewall contract, local-only zahid admin custody.
- VM provisioning. Create the Ubuntu VM. Validate cloud-init completion, SSH access, resolver config, data-disk mount, baseline firewalling.
- Base metrics and alerting stack. Install Grafana, Prometheus, Alertmanager, Node Exporter, Blackbox Exporter; baseline dashboards, scrape configs, first alert rules.
- Logging stack. Install Loki and Alloy log collection. Define journald / app-log labels, redaction guardrails, retention.
- Tracing and OTel gateway. Install Tempo. Configure the central private OTLP gateway. Define local agent templates for app/VM nodes.
- Profiling and advanced instrumentation. Install Pyroscope. Document optional Beyla / eBPF instrumentation as a learning path.
- Service integrations and dashboards. Add exporters, probes, dashboards, alert rules for Vault, Kafka, Redis, Jenkins, HAProxy, PowerDNS, Nexus, MinIO, Trivy, SigNoz, WSO2, future OpenShift endpoints.
- SLOs, hardening, backup, handoff. Define SLOs, alert routing, retention, backups, restore drills, restart validation, credential hardening, final handoff evidence.
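The gate ordering above can be expressed as a small check. This sketch assumes the gates are strictly sequential, which the ADR implies but does not state outright; the helper names `may_start` and `next_open_phase` are hypothetical.

```python
from typing import Optional, Set

# Phase names copied from the list above, in order.
PHASES = [
    "Scope, architecture, IP plan",
    "Prerequisite validation",
    "Cloud-init plan and credential custody",
    "VM provisioning",
    "Base metrics and alerting stack",
    "Logging stack",
    "Tracing and OTel gateway",
    "Profiling and advanced instrumentation",
    "Service integrations and dashboards",
    "SLOs, hardening, backup, handoff",
]


def may_start(phase: str, passed: Set[str]) -> bool:
    """A phase may start only when every earlier phase has passed."""
    idx = PHASES.index(phase)
    return all(earlier in passed for earlier in PHASES[:idx])


def next_open_phase(passed: Set[str]) -> Optional[str]:
    """Return the first phase that has not passed, or None when all have."""
    for phase in PHASES:
        if phase not in passed:
            return phase
    return None


passed = set(PHASES[:2])  # scope and prerequisite validation are done
print(next_open_phase(passed))
print(may_start("VM provisioning", passed))
```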
Guardrails
- Do not make live VM, PowerDNS, HAProxy, OpenShift, or GitOps changes before prerequisite validation passes.
- Keep this track separate from OpenShift rebuild, Kafka, Vault, Redis, WSO2, Jenkins, SigNoz, and Trivy milestones.
- Do not duplicate metrics unintentionally. For each signal, define one accepted path before onboarding it.
- Keep HAProxy exposure UI-focused. OTLP, Prometheus, Loki, Tempo, Pyroscope, and exporter ports should remain private unless an explicit phase approves a TLS/auth/source-restricted exposure path.
- Do not store Grafana admin passwords, datasource credentials, Alertmanager notifier secrets, OTLP tokens, API keys, pull secrets, kubeconfigs, or scan target credentials in Git or chat.
- Store the default zahid admin password and generated tokens only under ignored local custody such as secrets/monitoring-vm/. Migrate durable secrets into Vault once the Vault consumer contract is ready.
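The UI-only exposure guardrail lends itself to a mechanical check. The sketch below assumes each component listens on its upstream default port and that Grafana is the only UI intended for the edge; neither assumption comes from the ADR, which pins only the OTLP ports 4317/4318. The function name and the `approved` parameter are hypothetical.

```python
# Upstream default listen ports (an assumption; verify against the real configs).
DEFAULT_PORTS = {
    "grafana": 3000,
    "prometheus": 9090,
    "alertmanager": 9093,
    "loki": 3100,
    "tempo": 3200,
    "pyroscope": 4040,
    "otlp-grpc": 4317,
    "otlp-http": 4318,
    "blackbox-exporter": 9115,
    "node-exporter": 9100,
}

# Assumption: Grafana is the only component meant to be reachable via HAProxy.
UI_COMPONENTS = {"grafana"}


def exposure_violations(edge_backend_ports, approved=frozenset()):
    """Return components exposed at the edge that are neither UI nor approved."""
    port_to_name = {port: name for name, port in DEFAULT_PORTS.items()}
    violations = []
    for port in edge_backend_ports:
        name = port_to_name.get(port, f"unknown:{port}")
        if name not in UI_COMPONENTS and name not in approved:
            violations.append(name)
    return sorted(violations)


# Example: HAProxy backends pointing at Grafana, Prometheus, and OTLP gRPC.
found = exposure_violations([3000, 9090, 4317])
print(found)
```

The `approved` set models the "explicit phase approves" clause: a component moves into it only after a TLS/auth/source-restricted exposure path has been recorded.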
Consequences
- The lab gets an independent learning platform for observability primitives rather than relying only on a product bundle or in-cluster OpenShift stack.
- Future OpenShift clusters can use this VM as an external learning and bootstrap observability target. Production-grade OpenShift monitoring is still a separate cluster-design decision.
- The first deployment is lab/bootstrap-oriented. Production readiness still requires retention sizing, backup/restore validation, Vault-backed credentials, TLS/auth/source controls for telemetry, alert routing, and upgrade rehearsal.
- Signal-source choice matters. The same metric can arrive via Prometheus scrape, via an OTel Collector, or via SigNoz OTLP, so the operator must record the one accepted path for each service and signal, then enforce that choice in the dashboards and alert rules.
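One way to record the accepted path per signal is a small registry that refuses a second, conflicting path. This is a minimal sketch under stated assumptions: the path labels, the `SignalRegistry` class, and the example service are hypothetical and do not describe tooling that exists in the lab.

```python
# Hypothetical labels for the three ingestion paths named above.
ALLOWED_PATHS = {"prometheus-scrape", "otel-collector", "signoz-otlp"}


class SignalRegistry:
    """Tracks exactly one accepted ingestion path per (service, signal)."""

    def __init__(self):
        self._accepted = {}

    def accept(self, service: str, signal: str, path: str) -> None:
        """Record the accepted path; reject unknown or conflicting paths."""
        if path not in ALLOWED_PATHS:
            raise ValueError(f"unknown path: {path!r}")
        key = (service, signal)
        existing = self._accepted.get(key)
        if existing is not None and existing != path:
            raise ValueError(
                f"{service}/{signal} already accepted via {existing!r}"
            )
        self._accepted[key] = path

    def is_accepted(self, service: str, signal: str, path: str) -> bool:
        """True only if this path is the recorded one for the signal."""
        return self._accepted.get((service, signal)) == path


registry = SignalRegistry()
registry.accept("vault", "metrics", "prometheus-scrape")
print(registry.is_accepted("vault", "metrics", "prometheus-scrape"))
print(registry.is_accepted("vault", "metrics", "otel-collector"))
```

A dashboard or alert-rule review could call `is_accepted` for each datasource it queries, so that a panel built on a non-accepted path is flagged before it drifts from the alerting source.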
References
- Source: opp-full-plat/adr/0012-monitoring-observability-learning-vm.md
- Sister VM: ADR 0010, SigNoz standalone VM
- Edge wiring rules: ADR 0005
- GitHub milestone: zeshaq/opp-full-plat#19