Monitoring VM (LGTM Sandbox)
The monitoring-0 VM — native systemd LGTM stack (Grafana, Prometheus, Alertmanager, Loki, Tempo, Pyroscope, Alloy) deployed as a learning sandbox per ADR 0012, distinct from production SigNoz.
monitoring-0 is the lab’s observability learning sandbox: a single VM running the Grafana LGTM-plus stack (Loki, Grafana, Tempo, Mimir/Prometheus, with Alloy as the collector and Pyroscope for profiling) as native systemd services. Per ADR 0012 and the lab infrastructure memory, this VM exists to learn observability primitives directly — components, exporters, dashboards, alert rules, collector topologies — not to host production telemetry. SigNoz holds the production observability role.
This page covers monitoring-0, why it exists alongside SigNoz, and the *.mon.sub.comptech-lab.com exposure pattern unique to this VM.
What it is
| Property | Value |
|---|---|
| VM | monitoring-0 |
| Private FQDN | monitoring-0.sub.comptech-lab.com |
| Private alias | monitoring.sub.comptech-lab.com |
| Public hostnames | *.mon.sub.comptech-lab.com (Grafana etc.) |
| Public edge alias | monitoring.apps.sub.comptech-lab.com |
| OS | Ubuntu 24.04 LTS cloud-init |
| Packaging | Native systemd services (one unit per component) |
| Sizing | 8 vCPU, 32 GiB RAM, 100 GB OS disk, 1 TB data disk under /var/lib/monitoring |
| Default admin user | zahid (lab convention) |
| Admin password custody | secrets/monitoring-vm/admin.env |
The native-systemd choice is intentional. Rather than the Grafana LGTM Docker bundle, each component runs as a discrete unit with its own config under /etc/<component>/. The point is learning, and packaging components in containers hides too much.
Component inventory
| Component | Role | Port (private) |
|---|---|---|
| Grafana | Dashboards & exploration UI | 3000 |
| Prometheus | Metrics scrape, rules, alert evaluation | 9090 |
| Alertmanager | Alert grouping, routing, silences, notifications | 9093 |
| Loki | Log store + query | 3100 |
| Tempo | Trace store + query | 3200 |
| Pyroscope | Continuous profiling experiments | 4040 |
| Grafana Alloy | Collector (OTLP receiver, log tail, scrape fan-out) | OTLP 4317/4318, multiple internal |
| Blackbox Exporter | HTTP/TCP/DNS/TLS probing | 9115 |
| Node Exporter | Linux host metrics on monitored VMs | 9100 |
Service exporters or native metrics for Vault, Kafka, Redis, Jenkins, HAProxy, PowerDNS, Nexus, MinIO, Trivy, SigNoz, WSO2, and future OpenShift endpoints are progressively added under the M8 work track (per ADR 0012).
Why a separate VM from SigNoz
The two VMs serve different intents and the lab keeps them strictly separate:
| Concern | signoz | monitoring-0 |
|---|---|---|
| Intent | Production telemetry destination | Learning sandbox |
| Status | Stable, single-product | Evolving, multi-component |
| Backup posture | Open (roadmap) | Operator-managed; loss is acceptable |
| OTLP receiver | SigNoz Receiver directly | Alloy fans out to Tempo / Loki / Prometheus |
| Default audience | Application owners | Platform team learning |
The user directive (reference_lab_infrastructure.md): “Don’t propose production-grade work against monitoring-0; treat as a labbench.”
OTLP on monitoring-0
OTLP ingestion on :4317 (gRPC) and :4318 (HTTP) is owned by Grafana Alloy (v1.16+), not by Tempo directly. Alloy fans out:
- Traces → Tempo
- Logs → Loki
- Metrics → Prometheus remote_write
- Plus tails systemd journal to Loki
Alloy’s configuration is a single file at /etc/alloy/config.alloy. It is written in the HCL-inspired Alloy configuration syntax (previously called River) rather than the YAML used by the older Grafana Agent static mode.
Apps shipping OTLP to monitoring-0 use the same ports as SigNoz (:4317, :4318). Apps must therefore choose deliberately which target they’re shipping to. The two VMs are not interchangeable destinations — they accept the same protocol but store and present the data very differently.
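One way to make the choice of target explicit is to derive the exporter settings from a named target instead of pasting a host and port into each app. The sketch below is illustrative only: `Target`, `HOSTS`, and `otlp_env` are hypothetical names, the SigNoz hostname is a placeholder because this page does not state it, and the plain `http://` scheme assumes unencrypted OTLP on the private network. The two environment variable names are the standard OpenTelemetry SDK ones.

```python
from enum import Enum


class Target(Enum):
    """The two OTLP destinations an app has to choose between."""
    MONITORING_0 = "monitoring-0"
    SIGNOZ = "signoz"


# monitoring-0's private FQDN is taken from the property table on this page.
# The SigNoz host is a placeholder (assumption): replace it with the real one.
HOSTS = {
    Target.MONITORING_0: "monitoring-0.sub.comptech-lab.com",
    Target.SIGNOZ: "signoz.example.internal",  # hypothetical
}

# Both VMs accept OTLP on the same ports, keyed here by OTLP protocol name.
PORTS = {"grpc": 4317, "http/protobuf": 4318}


def otlp_env(target: Target, protocol: str = "grpc") -> dict[str, str]:
    """Return OpenTelemetry SDK environment variables for one explicit target."""
    if protocol not in PORTS:
        raise ValueError(f"unsupported OTLP protocol: {protocol!r}")
    # Assumption: plaintext OTLP inside the private network.
    endpoint = f"http://{HOSTS[target]}:{PORTS[protocol]}"
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
        "OTEL_EXPORTER_OTLP_PROTOCOL": protocol,
    }
```

Because the function has no default target, a caller cannot ship telemetry to either VM by accident; it has to name the destination.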
*.mon.sub.comptech-lab.com exposure pattern
Unique to monitoring-0: a separate DNS sub-zone and a separate LE wildcard, so that monitoring-0 services that need a public hostname don’t have to share the *.apps.* namespace with cluster apps.
The pattern (added 2026-05-09 per memory):
- PowerDNS: *.mon.sub.comptech-lab.com records point at the lab public ingress.
- Wildcard cert: /etc/haproxy/certs/wildcard-mon.pem issued via acme.sh --dns dns_pdns from the pdns VM. The acme.sh state lives under /root/.acme.sh/*.mon.sub.comptech-lab.com_ecc/.
- HAProxy bind: 127.0.0.1:8443 carries both wildcard-apps.pem and wildcard-mon.pem for SNI-based cert selection.
- Per-service route: each new <svc>.mon.* host requires 5 HAProxy spots: an SNI rule, a vm-tls host-header → backend mapping, a public-apps-http redirect ACL, a vm-tls deny-unless ACL, plus the new backend if needed.
Concrete example: grafana.mon.sub.comptech-lab.com routes through HAProxy and lands on the monitoring-0 VM’s Grafana port 3000. Adding prometheus.mon.sub.comptech-lab.com would follow the same five-step HAProxy edit pattern.
Why a separate sub-zone: the *.mon.* plane keeps experimental observability dashboards distinct from the *.apps.* plane, which is the application-routable plane. Each plane has its own wildcard certificate and its own backends, so a failed mon cert renewal or an overloaded mon backend can’t take down apps services.
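The SNI-based selection between the two wildcard certificates can be reasoned about with a small matcher. This is a minimal sketch of standard single-label wildcard matching, not HAProxy's implementation. It assumes the apps certificate covers *.apps.sub.comptech-lab.com, and the function names are hypothetical.

```python
from typing import Optional

# Wildcard name -> PEM file on the HAProxy bind.
# The apps wildcard name is an assumption inferred from the *.apps.* plane.
WILDCARD_CERTS = {
    "*.apps.sub.comptech-lab.com": "wildcard-apps.pem",
    "*.mon.sub.comptech-lab.com": "wildcard-mon.pem",
}


def matches_wildcard(hostname: str, pattern: str) -> bool:
    """Return True if hostname matches pattern; '*' covers exactly one label."""
    host = hostname.lower()
    pat = pattern.lower()
    if not pat.startswith("*."):
        return host == pat
    suffix = pat[1:]  # e.g. ".mon.sub.comptech-lab.com"
    if not host.endswith(suffix):
        return False
    left = host[: -len(suffix)]
    # The wildcard must stand for one non-empty label, so no dots are allowed.
    return left != "" and "." not in left


def cert_for(hostname: str) -> Optional[str]:
    """Return the PEM file whose wildcard covers hostname, or None."""
    for pattern, pem in WILDCARD_CERTS.items():
        if matches_wildcard(hostname, pattern):
            return pem
    return None
```

A name such as a.b.mon.sub.comptech-lab.com matches neither certificate, which is why each new service should sit exactly one label below the mon sub-zone.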
Alloy as the central collector
The collector layer matters more than any individual component in this architecture. Alloy is the seam between emitters and stores:
App pod / app process
↓ OTLP HTTP :4318 or gRPC :4317
Alloy (on monitoring-0 or co-located with app)
↓ traces → Tempo (3200)
↓ logs → Loki (3100)
↓ metrics → Prometheus remote_write
↓ journal tail → Loki (3100)
↓ scrape configs → Prometheus
Two-layer model per ADR 0012:
- Local agent layer on app/VM nodes: Alloy or OTel Collector receives app OTLP on localhost, tails local logs, adds consistent resource labels, batches, retries, and forwards to monitoring-0.
- Central gateway layer on monitoring-0: receives OTLP gRPC/HTTP on :4317/:4318, applies shared filtering, redaction, sampling, routing, and (future) auth, then sends each signal to its store.
In the current state, most lab VMs do not run a local Alloy agent; they send OTLP directly to monitoring-0. The two-layer model is the target; the single-layer model is the starting point.
What’s explicitly excluded
- Kiali. ADR 0012 explicitly excludes Kiali from this VM. If OpenShift Service Mesh/Istio is deployed later, Kiali should be installed inside the relevant OpenShift service-mesh environment, not on this independent VM.
- Production app telemetry destination. Send production traces/logs to SigNoz, not here.
- OpenShift cluster-internal monitoring. OpenShift has its own cluster monitoring stack; monitoring-0 doesn’t replace it.
Operational guidance
- One signal, one accepted path. For each metric/log/trace pipe, define one accepted destination before onboarding it. Don’t duplicate Prometheus scrape and Alloy OTLP for the same metric.
- Keep HAProxy exposure UI-focused. OTLP, Prometheus, Loki, Tempo, Pyroscope, and exporter ports stay private unless an explicit phase approves TLS/auth/source-restricted exposure.
- Custody on the VM. Grafana admin, datasource credentials, Alertmanager notifier secrets stay under secrets/monitoring-vm/. Migrate durable secrets into Vault once the Vault consumer contract is ready.
- Don’t promote monitoring-0 to production. The natural temptation when something works in a sandbox is to call it production. Resist. SigNoz holds the production role; promotion of components from monitoring-0 to production observability is a separate decision that goes through ADR.
Failure modes
Symptom: traces sent to monitoring-0 don’t show up in Grafana
Root cause. Alloy didn’t route them to Tempo, or the Tempo data source in Grafana isn’t configured correctly.
Fix. Check Alloy’s config (/etc/alloy/config.alloy) and journal. Check Tempo’s API directly (curl monitoring-0:3200/api/echo). Verify Grafana data source.
Prevention. Test each signal type end-to-end after Alloy config changes.
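An end-to-end trace check can be as small as posting one synthetic span to the OTLP/HTTP port and then looking the trace ID up in Tempo. The sketch below uses the OTLP JSON encoding and assumes Alloy's OTLP HTTP receiver accepts application/json on /v1/traces; `build_test_span` and `send_test_span` are hypothetical helper names.

```python
import json
import secrets
import time
import urllib.request


def build_test_span(service_name: str, now_ns: int) -> dict:
    """Build a minimal OTLP/JSON trace payload containing one 1 ms span."""
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "smoke-test"},
                "spans": [{
                    # OTLP/JSON encodes IDs as hex: 16-byte trace, 8-byte span.
                    "traceId": secrets.token_hex(16),
                    "spanId": secrets.token_hex(8),
                    "name": "alloy-smoke-test",
                    "kind": 1,  # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(now_ns - 1_000_000),
                    "endTimeUnixNano": str(now_ns),
                }],
            }],
        }]
    }


def send_test_span(base_url: str, service_name: str = "smoke-test") -> str:
    """POST one span to <base_url>/v1/traces and return its trace ID."""
    payload = build_test_span(service_name, time.time_ns())
    request = urllib.request.Request(
        f"{base_url}/v1/traces",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        response.read()  # a non-2xx status raises urllib.error.HTTPError
    return payload["resourceSpans"][0]["scopeSpans"][0]["spans"][0]["traceId"]
```

Calling send_test_span("http://monitoring-0.sub.comptech-lab.com:4318") returns a trace ID that can then be requested from Tempo's /api/traces/<traceID> endpoint on port 3200. If the span was accepted but the lookup fails, the break is between Alloy and Tempo rather than between the app and Alloy.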
Symptom: Grafana login fails
Root cause. Admin credential changed without updating local custody, or the systemd unit is restarting.
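Fix. Check the Grafana unit's status and journal first to rule out a restart loop. Confirm the admin user name in /etc/grafana/grafana.ini (admin_user under [security]), bearing in mind that Grafana applies this setting only when the admin account is first created. Reset the password via the Grafana CLI (admin reset-admin-password) if needed. Update custody.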
Prevention. Track credential rotation under the secret custody runbook.
Symptom: dashboards / alert rules disappear after VM reboot
Root cause. Provisioned dashboards are reloaded from the dashboard files referenced by the provider definitions in /etc/grafana/provisioning/dashboards/*.yaml, so they do not depend on the database. Manually created UI dashboards live only in /var/lib/grafana/grafana.db (SQLite). A reboot deletes neither, but an rm -rf /var/lib/grafana (or a volume-mount mishap) removes everything that exists only in the database.
Fix. Restore from snapshot, or recreate manually from documented intent.
Prevention. Snapshot /var/lib/grafana/grafana.db periodically. Provision important dashboards via config files in source control.
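Copying a live SQLite file with cp risks capturing a write in progress. A safer option is SQLite's online backup API, which Python exposes as Connection.backup. The following is a minimal sketch: `snapshot_sqlite` is a hypothetical helper, and the destination path in the usage note is an assumption.

```python
import sqlite3
from pathlib import Path


def snapshot_sqlite(src: Path, dst: Path) -> None:
    """Copy a SQLite database to dst using the online backup API."""
    if not src.exists():
        # sqlite3.connect would silently create an empty database otherwise.
        raise FileNotFoundError(src)
    source = sqlite3.connect(src)
    try:
        target = sqlite3.connect(dst)
        try:
            # backup() produces a consistent copy even if the source is in use.
            source.backup(target)
        finally:
            target.close()
    finally:
        source.close()
```

For example, snapshot_sqlite(Path("/var/lib/grafana/grafana.db"), Path("/var/lib/monitoring/backups/grafana.db")) could run from a systemd timer. The backups directory is hypothetical and must exist beforehand.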
Symptom: Prometheus is OOM-killed
Root cause. Series cardinality exceeded what the VM can hold in memory.
Fix. Identify the high-cardinality metric (often a misconfigured label like instance=<random-pod-id>). Drop the metric or relabel-drop the offending label.
Prevention. Monitor Prometheus’s own prometheus_tsdb_head_series metric; alert on growth.
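To find which metric names hold the most series, Prometheus's HTTP query API can be asked for a per-name series count. The sketch below is one possible approach: the PromQL expression is a common cardinality query, and `query_url`, `parse_series_counts`, and `top_series` are hypothetical helper names. The query itself is expensive on a large head block, so it is best run sparingly.

```python
import json
import urllib.parse
import urllib.request

# Count series per metric name and keep the ten largest.
TOP_SERIES_QUERY = 'topk(10, count by (__name__) ({__name__=~".+"}))'


def query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_series_counts(response: dict) -> list[tuple[str, int]]:
    """Turn an instant-vector response into (metric name, series count) pairs."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response.get('error', 'unknown error')}")
    rows = [
        (item["metric"].get("__name__", "<unnamed>"), int(float(item["value"][1])))
        for item in response["data"]["result"]
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def top_series(base_url: str) -> list[tuple[str, int]]:
    """Query a live Prometheus for its highest-cardinality metric names."""
    with urllib.request.urlopen(query_url(base_url, TOP_SERIES_QUERY), timeout=10) as resp:
        return parse_series_counts(json.load(resp))
```

Running top_series("http://monitoring-0.sub.comptech-lab.com:9090") from a host on the private network lists the candidates for a drop or relabel rule.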
Symptom: grafana.mon.sub.comptech-lab.com returns TLS error
Root cause. Wildcard mon cert expired or HAProxy isn’t loading both certs. ACME automation lives on the pdns VM; renewal might have failed.
Fix. Verify cert expiry. Force acme.sh renewal on the pdns VM, copy renewed cert to HAProxy, reload HAProxy.
Prevention. Monitor *.mon.* cert expiry; alert 14 days before expiry.
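Until a Blackbox Exporter probe covers this, the 14-day check can be approximated with Python's standard ssl module. This is a sketch under stated assumptions: the helper names are hypothetical, the host is assumed reachable on port 443, and a certificate that has already expired makes the handshake itself fail with ssl.SSLCertVerificationError instead of returning a date.

```python
import socket
import ssl
import time


def days_until(not_after: str, now: float) -> float:
    """Days between now (epoch seconds) and a certificate notAfter string."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter field of the certificate presented for host."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # server_hostname sets SNI, so the proxy selects the matching cert.
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def needs_alert(host: str, threshold_days: float = 14.0) -> bool:
    """True when the certificate for host expires within threshold_days."""
    return days_until(fetch_not_after(host), time.time()) < threshold_days
```

Checking grafana.mon.sub.comptech-lab.com is enough to cover the whole wildcard, since every *.mon.* hostname is served from the same PEM file.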
Roadmap
Per ADR 0012, the M1–M7 base stack is deployed. M8 (service integrations and dashboards) and M9 (operational hardening) remain open:
- M8: exporters, probes, dashboards, alert rules for Vault, Kafka, Redis, Jenkins, HAProxy, PowerDNS, Nexus, MinIO, Trivy, SigNoz, WSO2.
- M9: SLOs, alert routing, retention sizing, backups, restore drills, restart validation, Vault-backed credentials.
These are explicitly learning-track work; the goal is to develop intuition for production observability, not to declare production readiness on this VM.
References
- opp-full-plat/adr/0012-monitoring-observability-learning-vm.md: VM design, component list, phase gates.
- SigNoz Overview: the production-side counterpart.