Monitoring and audit
What Vault tells you about itself — audit devices for every request, Prometheus-format metrics on the listener, Raft autopilot state, and the lab's planned monitoring path.
Vault produces three streams of operational signal: audit logs (every request, every response), Prometheus metrics (process and Raft state), and the autopilot health state. This page covers what’s emitted, how to read it, and the gaps the lab still has.
Audit devices
Vault’s audit device records every request and response. Vault does not enforce this: it will store and serve secrets with no audit device enabled, so the operator must enable one before production use.
The lab enables a file-based audit device. The setting is stored in Raft, so every voter uses the same path on its own local disk:
vault audit enable file file_path=/var/log/vault/audit.log
After enable:
- Every API call is appended to /var/log/vault/audit.log as JSON lines: one request entry and one response entry.
- Each entry includes request data (method, path, namespace) and response data, but values are HMAC’d, not plaintext. You can see that a secret was read, not what its value was.
- Log rotation is required: the file grows unbounded otherwise.
Log rotation
/etc/logrotate.d/vault:
/var/log/vault/audit.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 0640 vault vault
    postrotate
        kill -HUP $(pidof vault) > /dev/null 2>&1 || true
    endscript
}
Vault re-opens the audit log on SIGHUP (kill -HUP), so logrotate can move the active file aside without losing entries.
Why audit is mandatory
| Requirement | Audit gives |
|---|---|
| Compliance — “show me who read which secret when” | Yes (HMAC’d values; identity from auth context) |
| Incident response — was this secret accessed after the suspected compromise? | Yes |
| Operator-error recovery — what did we touch in the last hour? | Yes |
| Performance debugging | Some — request rate by path, latency |
The gate in vault-oss-vm-plan.md is explicit: “At least one enabled audit device before production secret use.”
Reading the audit log
Each line is JSON. Three queries cover a typical investigation:
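# Count requests by path during the previous clock hour (UTC).
# Only "request" entries are counted so each call is not counted twice.
sudo grep -F "$(date -u -d '1 hour ago' '+%Y-%m-%dT%H')" /var/log/vault/audit.log \
  | jq -r 'select(.type == "request") | .request.path' \
  | sort | uniq -c | sort -rn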
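# Find all KV reads that were denied. The error is a top-level field on the
# response entry and is wrapped in extra text, so match it with contains().
sudo jq -r 'select(.type == "response" and .request.operation == "read" and ((.error // "") | contains("permission denied")) and ((.request.path // "") | startswith("secret/data/"))) | [.time, .request.path, .auth.display_name] | @tsv' /var/log/vault/audit.log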
# All distinct principals (display names) seen in the last 24h; entries without a display name are skipped
sudo jq -r 'select(.time > "'"$(date -u -d '24 hours ago' '+%Y-%m-%dT%H:%M:%SZ')"'") | .auth.display_name // empty' /var/log/vault/audit.log | sort -u
The audit log is the source of truth for “who-did-what.” Read access to it is operator-restricted: file mode 0640 with owner and group vault, so only root, the vault user, and members of the vault group can read it.
Future: shipping audit to a centralized store
Currently logs live on each voter’s local disk. A planned improvement is to ship them to Loki (so the lab observability VM can query them) and/or to a dedicated audit S3 bucket on MinIO with WORM-like retention. Neither is wired today.
Prometheus metrics
Vault exposes metrics in Prometheus format at /v1/sys/metrics?format=prometheus. Reading them without a token requires setting unauthenticated_metrics_access = true in the listener’s telemetry block, which the lab does not do, so scraping requires a token that carries the metrics policy.
export VAULT_TOKEN=<token-with-metrics-policy>
# Quote the URL so the shell does not treat "?" as a glob character
curl -ks --header "X-Vault-Token: $VAULT_TOKEN" \
  "https://vault.sub.comptech-lab.com:8200/v1/sys/metrics?format=prometheus" | head -40
The metrics policy:
path "sys/metrics" {
  capabilities = ["read", "list"]
}
The policy is granted to a dedicated metrics service token used by the lab’s Prometheus / Alloy scrape job.
Key metrics
| Metric | Why it matters |
|---|---|
| vault_core_active | Is this node the active node (leader)? 1 = yes |
| vault_core_unsealed | Is this node unsealed? 1 = yes |
| vault_runtime_num_goroutines | Goroutine count; abnormal growth indicates a bug |
| vault_runtime_alloc_bytes | Heap allocated; should be stable |
| vault_raft_leader_lastContact | How long since this node last heard from the leader (ms) |
| vault_raft_replication_appendentries_logs | Raft replication rate |
| vault_secret_kv_count | Total KV entries (after enable) |
| vault_token_create_count | Token issuance rate (auth load) |
| vault_audit_log_request_failure | Audit failures; should be zero |
| vault_consul_storage_* | Should be zero (we use Raft, not Consul) |
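For a quick manual check of one of these metrics, the Prometheus text output can be filtered on the command line. The sketch below is a hypothetical helper (metric_value is not part of Vault or Prometheus). It assumes plain exposition lines of the form `name value` or `name{labels} value` with no trailing timestamp, and it prints the value of the first matching sample.

```shell
# metric_value NAME: read Prometheus text exposition on stdin and print
# the value of the first sample whose metric name is exactly NAME.
# Hypothetical helper for illustration; not a Vault or Prometheus tool.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                    # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)       # drop any {label="..."} suffix
      if (metric == name) { print $NF; exit }
    }
  '
}
```

Piping the authenticated curl output from the metrics section into `metric_value vault_core_unsealed` should print 1 on an unsealed node.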
Where they’re scraped
Planned: monitoring-0 VM (Prometheus + Alloy) scrapes each Vault voter through the /v1/sys/metrics endpoint with the metrics token. Currently not wired in production.
Raft autopilot
Raft autopilot is Vault’s automated cluster-health helper. Read its state:
vault operator raft autopilot state
Healthy output looks like:
Healthy: true
Failure Tolerance: 1
Voters:
ID Address Active Healthy Last Contact
vault-0 vault-0.sub.comptech-lab.com:8201 true true 0s
vault-1 vault-1.sub.comptech-lab.com:8201 false true ...
vault-2 vault-2.sub.comptech-lab.com:8201 false true ...
What to watch:
- Healthy: true means the cluster is fine; false means at least one voter is in trouble.
- Failure Tolerance: 1 means that, with three voters, the cluster survives one failure. If this drops to 0, the next voter loss takes the cluster down.
- Last Contact shows how recently each voter’s heartbeat landed. Large gaps mean network or load issues.
vault operator raft list-peers is the simpler view (just the voter list).
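The two top-level fields can be turned into a simple pass/fail signal for scripts. This is a minimal sketch: it assumes the Healthy and Failure Tolerance lines look like the sample output above, and the function name and the OK/CRITICAL labels are illustrative only.

```shell
# autopilot_status: read `vault operator raft autopilot state` output on
# stdin and print OK or CRITICAL.
# Assumes unindented "Healthy:" and "Failure Tolerance:" lines as in the
# sample on this page. Missing fields are treated as CRITICAL.
autopilot_status() {
  awk -F': *' '
    { val = $2; sub(/[ \t\r]+$/, "", val) }   # trim trailing whitespace
    $1 == "Healthy"           { healthy = val }
    $1 == "Failure Tolerance" { tolerance = val }
    END {
      if (healthy != "true")      print "CRITICAL"
      else if (tolerance + 0 < 1) print "CRITICAL"
      else                        print "OK"
    }
  '
}
```

Usage would be `vault operator raft autopilot state | autopilot_status`, run with a token that is allowed to read the autopilot state.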
Health endpoint
/v1/sys/health returns a small JSON document with the node’s seal state, version, replication mode, and active/standby status. It needs no token, which makes it useful for cheap Blackbox probes:
curl -ks https://vault.sub.comptech-lab.com:8200/v1/sys/health
# {"initialized":true,"sealed":false,"standby":false,"performance_standby":false,"replication_performance_mode":"disabled","replication_dr_mode":"disabled","server_time_utc":...,"version":"1.21.1","cluster_name":"...","cluster_id":"..."}
HTTP status codes:
| Status | Means |
|---|---|
| 200 | Initialized, unsealed, active |
| 429 | Standby (HA passive) |
| 472 | DR replication secondary (not used in this lab) |
| 473 | Performance standby (Enterprise; not used) |
| 501 | Not initialized |
| 503 | Sealed |
Blackbox exporter probes should accept 200 and 429 (standby is a valid state).
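The status table can also be encoded directly for ad hoc scripts. The sketch below maps a status code to a short label and succeeds only for the two codes a probe should accept. The function name and labels are made up for illustration.

```shell
# health_state CODE: map a /v1/sys/health HTTP status to a short label.
# Returns 0 for states a probe should accept (200, 429), 1 otherwise.
# Example wiring (assumes curl and network access to the lab endpoint):
#   code=$(curl -ks -o /dev/null -w '%{http_code}' \
#     https://vault.sub.comptech-lab.com:8200/v1/sys/health)
#   health_state "$code"
health_state() {
  case "$1" in
    200) echo "active";              return 0 ;;
    429) echo "standby";             return 0 ;;
    472) echo "dr-secondary";        return 1 ;;
    473) echo "performance-standby"; return 1 ;;
    501) echo "uninitialized";       return 1 ;;
    503) echo "sealed";              return 1 ;;
    *)   echo "unknown";             return 1 ;;
  esac
}
```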
Synthetic probe (planned)
The lab plans a small synthetic check from the monitoring VM:
- Authenticate with a long-lived scoped token (monitoring-readonly policy).
- Read a known canary secret (secret/apps/platform/monitoring-canary/dev/value).
- Assert the value matches an expected fingerprint.
- Emit a Prometheus metric: vault_synthetic_read_ok{path="..."}.
Failure of the synthetic check is the strongest signal — it catches not just “Vault is up” but “Vault can serve a real secret to a real reader.”
Currently the synthetic check is not deployed; it’s on the monitoring backlog.
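Since the check is not deployed yet, the following is only a sketch of how the steps could fit together. Several details are assumptions: the function names, the use of SHA-256 as the fingerprint, reading the trailing `value` of the canary path as a KV field name, and the EXPECTED_FINGERPRINT variable.

```shell
# Sketch of the planned canary check. Function names, the SHA-256
# fingerprint, and the KV field name "value" are assumptions.

# fingerprint: print the SHA-256 hex digest of stdin.
fingerprint() {
  sha256sum | cut -d' ' -f1
}

# probe_metric PATH ACTUAL EXPECTED: print the Prometheus sample,
# 1 when the two fingerprints are non-empty and equal, 0 otherwise.
probe_metric() {
  local ok=0
  if [ -n "$2" ] && [ "$2" = "$3" ]; then ok=1; fi
  printf 'vault_synthetic_read_ok{path="%s"} %d\n' "$1" "$ok"
}

# Example wiring (needs VAULT_ADDR and VAULT_TOKEN in the environment):
#   actual=$(vault kv get -mount=secret -field=value \
#     apps/platform/monitoring-canary/dev | fingerprint)
#   probe_metric "secret/apps/platform/monitoring-canary/dev" \
#     "$actual" "$EXPECTED_FINGERPRINT"
```

Comparing fingerprints rather than raw values keeps the canary value itself out of the probe's configuration and logs.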
Alerting (planned)
Each of the following should trigger an alert when it’s wired:
| Condition | Severity |
|---|---|
| Any voter sealed for > 1m | Critical |
| Healthy: false from autopilot | Critical |
| Failure Tolerance = 0 | Critical |
| Last contact > 5s for any voter | Warning |
| Audit log file growth stops | Warning (audit device broken) |
| Snapshot job failure | Warning |
| Cert expiry in < 14 days | Warning |
| Goroutine count > 5x baseline | Warning |
Alerts route to Alertmanager on monitoring-0 → notification channel (TBD).
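Until alerting is wired, the “audit log file growth stops” condition can be approximated with a file-age check. This is a rough sketch assuming GNU stat and date. The function name and threshold are illustrative, and an idle cluster can also look stale, so the threshold should be longer than the quietest expected period.

```shell
# audit_log_stale FILE MAX_AGE_SECONDS: succeed (exit 0) when FILE has not
# been modified within MAX_AGE_SECONDS, which suggests the audit device
# has stopped writing. A missing file also counts as stale.
# Uses GNU stat; the name and threshold are illustrative.
audit_log_stale() {
  local mtime now
  mtime=$(stat -c %Y "$1") || return 0
  now=$(date +%s)
  [ $((now - mtime)) -gt "$2" ]
}
```

For example, `audit_log_stale /var/log/vault/audit.log 900 && echo "audit log stale"` flags a log that has been silent for 15 minutes.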
Operator visual: the UI
https://vault.sub.comptech-lab.com:8200/ui/ is the official UI. Useful for:
- Browsing the path tree (read-only operator use).
- Token introspection (vault token lookup equivalent in the UI).
- Re-running a UI auth flow against a specific mount.
Not useful for:
- Bulk operations.
- Anything Git-able (use the CLI + GitOps).
- Audit log review (the UI doesn’t render the audit log; use the file).
Audit + ESO interaction
When ESO syncs a secret, every read shows up in the audit log under the ESO SA’s display name. Pattern:
"display_name": "kubernetes-spoke-dc-v6-app-eso"
"path": "secret/data/apps/platform/eso-smoke/dev"
This means:
- ESO churn (a flapping ExternalSecret re-reading the same secret in a loop) is visible in audit and metrics.
- An accidental cross-tenant policy expansion shows up immediately as reads against unexpected paths.
The audit log is the lab’s main lever against silent policy drift.
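Both patterns can be surfaced by counting reads per principal and path. The helper below is a sketch, not an existing tool. It assumes the audit entry fields .type, .request.operation, .request.path, and .auth.display_name, and it counts only response entries so each read is counted once.

```shell
# reads_by_principal: read audit log lines on stdin and print
# "COUNT DISPLAY_NAME<TAB>PATH" for read operations, busiest first.
# Entries without a display name are grouped under "-".
# Function name is illustrative; requires jq.
reads_by_principal() {
  jq -r 'select(.type == "response" and .request.operation == "read")
         | [.auth.display_name // "-", .request.path] | @tsv' \
    | sort | uniq -c | sort -rn | sed 's/^ *//'
}
```

Running `sudo cat /var/log/vault/audit.log | reads_by_principal | head` puts a looping ExternalSecret at the top of the list, and a principal reading under another tenant's prefix stands out by its path.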
Failure modes
| Symptom | Root cause | Fix | Prevention |
|---|---|---|---|
| Audit log grew to fill /var/log | Logrotate not running or misconfigured | Verify systemctl status logrotate.timer; run logrotate manually | Test logrotate on every install |
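| Audit log empty on some voters after enable | Those voters are standbys: they forward requests to the active node, which writes the entries. The enable itself is cluster-wide, not per-node | Check the log on the active node; confirm /var/log/vault exists and is writable by vault on every voter | Create the log directory on every voter at install time so a newly active node can log |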
| Prometheus scrape returns 403 | Metrics token expired or revoked | Re-issue token with metrics policy | Long-lived metrics token + rotation runbook |
| Healthy: false after a planned reboot | Voter still rejoining Raft after restart | Wait a minute; check list-peers | Don’t alert on autopilot for the first 60s after a known restart |
| Synthetic probe fails | Canary path doesn’t exist, or canary secret was rotated without updating the probe | Re-seed canary, restart probe | Document canary value source-of-truth |
References
- opp-full-plat/plans/disconnected-rebuild/environments/dc-lab/vault-oss-vm-plan.md (production-readiness gates)
- HashiCorp audit devices: developer.hashicorp.com/vault/docs/audit
- HashiCorp Telemetry / Prometheus: developer.hashicorp.com/vault/docs/internals/telemetry
- HashiCorp Raft Autopilot: developer.hashicorp.com