Redis + Sentinel

A three-node Redis deployment with Sentinel quorum: private lab service with ACLs and AOF persistence, controlled failover validated, and hardening gates still open per ADR 0006.

Redis is deployed as a standalone private VM service for the clean rebuild: three Redis nodes plus Sentinel quorum, private on the lab br30 network, used today for lab/platform smoke testing. Per ADR 0006, Redis is not yet production-ready for BFSI applications — six hardening gates separate the current state from production-ready, and they must close before real workloads consume the cluster as a production dependency.

What it is

| Property | Value |
| --- | --- |
| Nodes | redis-0, redis-1, redis-2 |
| Sentinel endpoint | redis-sentinel.sub.comptech-lab.com:26379 |
| Non-Sentinel debug endpoint | redis.sub.comptech-lab.com:6379 |
| Persistence | AOF |
| Replication | Yes; controlled failover from redis-0 to redis-2 validated 2026-05-08 |
| ACLs | Yes, current state |
| Host firewall | Yes |
| Quorum size | 3 |
| Public exposure | None (private lab only) |
| TLS | No (gate) |
| Vault-backed creds | No (gate) |
| Backups | No (gate) |
| Monitoring | No (gate) |

Architecture

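Three Redis nodes run with one acting as master at any time and the other two as replicas. Sentinel processes co-located with the Redis nodes monitor the master, coordinate failover when it becomes unreachable, and tell clients which node is the current master. Production clients use a Sentinel-aware library that queries the Sentinel endpoint list, receives the current master's address, and reconnects to the new master after a failover.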

Why a VM cluster (not OpenShift)

Per ADR 0006: Redis is intended as a VM-hosted private platform service for the clean rebuild. The reasoning follows the broader ADR pattern: keep stateful platform dependencies on VMs until the OpenShift-side gates (ESO, namespace baseline, NetworkPolicy) close, then revisit. Running Redis under a Kubernetes Operator on OpenShift is a future decision; the current state is VM-hosted.

Client contract

Production clients must use a Sentinel-aware library and must connect through the Sentinel endpoint list. The redis.sub.comptech-lab.com DNS round-robin is allowed for diagnostics and break-glass access only; it is not the production client contract.

Sentinel client pattern (pseudocode):

sentinel_endpoints = [
  "redis-0.sub.comptech-lab.com:26379",
  "redis-1.sub.comptech-lab.com:26379",
  "redis-2.sub.comptech-lab.com:26379",
]
client = SentinelClient(
  sentinels=sentinel_endpoints,
  service_name="<configured-master-name>",
  user="<workload-acl-user>",
  password="<workload-acl-password-from-vault>",
)
# Client discovers current master from Sentinels, redirects on failover
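
As one concrete reading of the pseudocode above, a minimal Python sketch might look like the following. It assumes the redis-py library; the helper names and the timeout value are illustrative choices rather than part of the platform contract, and the master name and credentials remain placeholders to be filled from custody. The library is imported inside the function so the endpoint parsing can be exercised without it installed.

```python
from typing import List, Tuple

SENTINEL_ENDPOINTS = [
    "redis-0.sub.comptech-lab.com:26379",
    "redis-1.sub.comptech-lab.com:26379",
    "redis-2.sub.comptech-lab.com:26379",
]


def parse_endpoints(endpoints: List[str]) -> List[Tuple[str, int]]:
    """Split "host:port" strings into the (host, port) tuples redis-py expects."""
    parsed = []
    for endpoint in endpoints:
        host, _, port = endpoint.rpartition(":")
        parsed.append((host, int(port)))
    return parsed


def connect_to_master(master_name: str, username: str, password: str):
    """Return a client bound to whichever node Sentinel reports as master."""
    # Imported here so this module still loads where redis-py is absent.
    from redis.sentinel import Sentinel

    # Assumption: Sentinel itself needs no credentials. If it does, pass them
    # separately via sentinel_kwargs={"password": ...}.
    sentinel = Sentinel(parse_endpoints(SENTINEL_ENDPOINTS), socket_timeout=0.5)
    # master_for() resolves the master through Sentinel whenever it has to
    # reconnect, which is how the client follows a failover.
    return sentinel.master_for(master_name, username=username, password=password)
```

A caller would use `connect_to_master("<configured-master-name>", "<workload-acl-user>", "<password>")` and then issue ordinary commands on the returned client.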

The non-Sentinel redis.sub.comptech-lab.com:6379 endpoint is for diagnostics only: confirming reachability, checking the AOF tail, and reading replication state. It is not a production client path.

Hardening gates (per ADR 0006)

Six gates separate current state from production-ready:

| Gate | Status | What it requires |
| --- | --- | --- |
| 1. TLS and listener hardening | Open | Redis cert from internal CA; TLS on client, replication, and Sentinel traffic; validate failover through TLS |
| 2. Credential and ACL hardening | Open | Move app creds to Vault; replace app-default smoke user with workload-scoped ACLs; deny dangerous command categories to app users |
| 3. Network hardening | Open | Narrow firewall from lab /16 to approved client networks once OpenShift consumer nodes and app namespaces are known |
| 4. Backup and restore hardening | Open | Current-master backup script; checksumming; encryption; MinIO + offline copy; isolated restore drill |
| 5. Observability hardening | Open | Export Redis + Sentinel metrics; alert on master changes, quorum loss, replica link failure, replication lag, evictions, memory pressure, persistence failures, disk saturation |
| 6. Operational resilience | Open | Re-run controlled failover after TLS + ACL hardening; per-node restart tests; client reconnect tests; upgrade/rollback procedure |

Each gate has explicit acceptance criteria in ADR 0006, and a gate closes only when those criteria are met.

Operational guardrails

  • No public IPs, no public DNS, no HAProxy wildcard for Redis. Private only until a more specific allowlist is approved.
  • TLS is required before BFSI application onboarding. TLS implementation may use the lab internal CA first, then Vault PKI later after Vault PKI is accepted as an online intermediate.
  • app-default is a smoke-test user only. Production uses one ACL user per app or per trust boundary, with least privilege and rotation ownership recorded.
  • Persistence alone is not a backup strategy. AOF gives durability across restarts; it does not protect against logical corruption, accidental deletion, or VM-level loss.
  • Don’t re-expose publicly. Any request to expose Redis publicly or through HAProxy needs a new ADR. The current decision is private-only.
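
To make the per-app ACL guardrail above more concrete, here is a hedged sketch that builds the arguments for a Redis `ACL SETUSER` command for one workload-scoped user. The function name, the key-prefix convention, and the chosen command categories are assumptions for illustration; the actual rules per workload are for ADR 0006 gate 2 to define.

```python
from typing import List


def acl_setuser_args(user: str, password: str, key_prefix: str) -> List[str]:
    """Build ACL SETUSER arguments for a least-privilege workload user."""
    if not key_prefix or "*" in key_prefix:
        raise ValueError("key_prefix must be a literal, non-empty prefix")
    return [
        "ACL", "SETUSER", user,
        "reset",              # start clean: no passwords, keys, or commands
        "on",                 # enable the user
        f">{password}",       # add this password
        f"~{key_prefix}:*",   # restrict access to the workload's key namespace
        "+@read", "+@write",  # allow ordinary data commands
        "-@dangerous",        # then strip dangerous ones (rules apply left to right)
    ]
```

The resulting list can be sent through any client's generic command call, for example `client.execute_command(*args)` in redis-py, with the password taken from custody rather than hard-coded.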

Validation

# DNS
dig @<lab-dns> redis-0.sub.comptech-lab.com A +short
dig @<lab-dns> redis-1.sub.comptech-lab.com A +short
dig @<lab-dns> redis-2.sub.comptech-lab.com A +short
dig @<lab-dns> redis-sentinel.sub.comptech-lab.com A +short
dig @<lab-dns> redis.sub.comptech-lab.com A +short

# Sentinel discovery (if Sentinel auth is enabled, add --user/--pass with credentials from custody)
redis-cli -h redis-0.sub.comptech-lab.com -p 26379 \
  sentinel masters

redis-cli -h redis-0.sub.comptech-lab.com -p 26379 \
  sentinel get-master-addr-by-name <master-name>

Custody:

  • ACL user passwords: under secrets/redis-vm/ until Vault delivery is wired.
  • Sentinel service password: also under secrets/redis-vm/ until Vault delivery is wired.
  • Future TLS cert custody: TBD when gate 1 closes.

Failure modes

Symptom: client fails after a Redis master flap

Root cause. Client is using the non-Sentinel redis.sub.comptech-lab.com:6379 round-robin endpoint, not the Sentinel-aware path. The round-robin can return a non-master node, or a node that becomes non-master mid-connection.

Fix. Switch the client to a Sentinel-aware library and the Sentinel endpoint list.

Prevention. Document Sentinel-only as the production client contract. Reject non-Sentinel client configs in code review.
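
One way to back the review rule with automation is a small check that flags client configs pointing at the non-Sentinel endpoint. The sketch below assumes a hypothetical config shape (a dict with a `sentinels` list of "host:port" strings and an optional direct `host`); real client configs will differ and the check would need adapting.

```python
from typing import Dict, List

SENTINEL_PORT = 26379
DEBUG_HOST = "redis.sub.comptech-lab.com"  # diagnostics / break-glass only


def contract_violations(config: Dict) -> List[str]:
    """Return the reasons a client config breaks the Sentinel-only contract."""
    problems = []
    sentinels = config.get("sentinels") or []
    if not sentinels:
        problems.append("no Sentinel endpoint list configured")
    for endpoint in sentinels:
        host, _, port = endpoint.rpartition(":")
        if not port.isdigit() or int(port) != SENTINEL_PORT:
            problems.append(f"{endpoint}: not a Sentinel port")
        if host == DEBUG_HOST:
            problems.append(f"{endpoint}: debug endpoint used as Sentinel")
    if config.get("host") == DEBUG_HOST:
        problems.append("direct connection to the non-Sentinel debug endpoint")
    return problems
```

An empty list means the config passes; anything else can be surfaced as a review comment or a failed CI step.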

Symptom: replica falls behind master and lag alerts fire

Root cause. Replica disk pressure, network throughput, or memory pressure on the replica process.

Fix. Identify the lagging replica; check disk, network, and memory on that VM; restart the replica process or the VM if needed, then let the replica resynchronize with the master.

Prevention. Once monitoring is wired (gate 5), alert on replication lag before clients notice.
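
Until gate 5 delivers real metrics, replication lag can be estimated by hand from the master's `INFO replication` output. The sketch below parses that text and reports how many bytes each replica trails the master's replication offset. The sample text and its IP addresses are made up for illustration, and the function name is hypothetical.

```python
import re
from typing import Dict

# Illustrative INFO replication output from a master; addresses are invented.
SAMPLE = """\
# Replication
role:master
connected_slaves:2
slave0:ip=10.30.0.11,port=6379,state=online,offset=5000,lag=0
slave1:ip=10.30.0.12,port=6379,state=online,offset=4200,lag=3
master_repl_offset:5000
"""


def replica_offset_lag(info_replication: str) -> Dict[str, int]:
    """Map each replica "ip:port" to the bytes it trails the master by."""
    fields = {}
    for line in info_replication.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and ":" in line:
            key, _, value = line.partition(":")
            fields[key] = value
    master_offset = int(fields["master_repl_offset"])
    lag = {}
    for key, value in fields.items():
        if re.fullmatch(r"slave\d+", key):
            attrs = dict(pair.split("=", 1) for pair in value.split(","))
            replica = f"{attrs['ip']}:{attrs['port']}"
            lag[replica] = master_offset - int(attrs["offset"])
    return lag
```

The text can be captured with `redis-cli ... info replication` against the current master. A byte gap that keeps growing between samples is the signal worth investigating.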

Symptom: master and replicas all show “fail-over in progress” but never resolve

Root cause. The Sentinels cannot reach agreement: two Sentinels disagree about the master's state, or only two of the three Sentinels are reachable. A misconfigured master name on one or more Sentinels produces the same symptom.

Fix. Check Sentinel logs on each node. Verify each Sentinel can reach each Redis node and each other Sentinel. Restart the failed Sentinel, then re-validate quorum (for example with SENTINEL CKQUORUM <master-name>).

Prevention. Network monitoring across the three Redis VMs; alert on Sentinel-quorum drops.
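
The arithmetic behind a stuck failover is worth spelling out. As described in the Redis Sentinel documentation, `quorum` Sentinels must agree that the master is down, and a majority of all known Sentinels must authorize the failover; when the quorum is set above the majority, that many votes are needed. The helper below is an illustrative sketch of that rule, not code taken from the deployment.

```python
def failover_can_proceed(total_sentinels: int, reachable: int, quorum: int) -> bool:
    """True if enough Sentinels are reachable to detect failure and authorize failover."""
    majority = total_sentinels // 2 + 1
    # Detection needs `quorum` votes and authorization needs a majority,
    # so the larger of the two is the effective requirement.
    return reachable >= max(quorum, majority)
```

With three Sentinels and a configured quorum of 2, `failover_can_proceed(3, 2, 2)` is true, so one Sentinel can be lost. If the configured Sentinel quorum is 3, `failover_can_proceed(3, 2, 3)` is false, and losing any single Sentinel blocks failover.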

Symptom: backup question — can we recover after disk loss?

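Root cause. There is no backup to restore from today because the backup gate is open. AOF persistence on a single node does not survive the loss of that node's disk; the data survives only on the other replicated nodes.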

Fix. Replication to another node is the operational mitigation. Rebuild the lost node and let it catch up.

Prevention. Close gate 4 (backup) before production.
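
Gate 4 lists checksumming among its requirements. As a hedged sketch of that one piece only, the helpers below write and verify a SHA-256 sidecar file next to a backup artifact, using the same "<hex>  <name>" layout as `sha256sum`. The function names and the `.sha256` sidecar convention are assumptions; encryption, MinIO upload, and the restore drill are out of scope here.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backup files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum(path: Path) -> Path:
    """Write '<hex>  <name>' to a .sha256 sidecar beside the backup file."""
    sidecar = path.with_name(path.name + ".sha256")
    sidecar.write_text(f"{sha256_of(path)}  {path.name}\n")
    return sidecar


def verify_checksum(path: Path) -> bool:
    """Recompute the hash and compare it with the recorded sidecar value."""
    sidecar = path.with_name(path.name + ".sha256")
    recorded = sidecar.read_text().split()[0]
    return recorded == sha256_of(path)
```

Running `verify_checksum` before any restore attempt gives an early signal that a stored copy has been corrupted or truncated.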

References

  • opp-full-plat/adr/0006-redis-sentinel-hardening.md — hardening gates.
  • opp-full-plat/plans/disconnected-rebuild/environments/dc-lab/redis-sentinel-vm-plan.md — VM plan.

Last reviewed: 2026-05-11