Redis + Sentinel
The three-node Redis cluster with Sentinel quorum — private lab service, ACLs and AOF persistence, controlled-failover validated, hardening gates open per ADR 0006.
Redis is deployed as a standalone private VM service for the clean rebuild: three Redis nodes plus Sentinel quorum, private on the lab br30 network, used today for lab/platform smoke testing. Per ADR 0006, Redis is not yet production-ready for BFSI applications — six hardening gates separate the current state from production-ready, and they must close before real workloads consume the cluster as a production dependency.
What it is
| Property | Value |
|---|---|
| Nodes | redis-0, redis-1, redis-2 |
| Sentinel endpoint | redis-sentinel.sub.comptech-lab.com:26379 |
| Non-Sentinel debug endpoint | redis.sub.comptech-lab.com:6379 |
| Persistence | AOF |
| Replication | Yes; controlled failover from redis-0 to redis-2 validated 2026-05-08 |
| ACLs | Yes, current state |
| Host firewall | Yes |
| Quorum size | 3 |
| Public exposure | None — private lab only |
| TLS | No (gate) |
| Vault-backed creds | No (gate) |
| Backups | No (gate) |
| Monitoring | No (gate) |
Architecture
Three Redis nodes, with exactly one acting as master at any time. Sentinel processes running alongside the Redis nodes monitor the master, agree on a replacement when it fails, and tell clients where the current master is. Production clients use a Sentinel-aware library that queries the Sentinel endpoint list, gets back the current master address, and reconnects to the new master after a failover.
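The discovery step a Sentinel-aware library performs can be sketched as follows. This is a minimal illustration, not the behaviour of any specific library: `ask` is a hypothetical callback standing in for the network call that sends `SENTINEL get-master-addr-by-name` to one Sentinel, and the addresses used with it are placeholders.

```python
from typing import Callable, Iterable, Optional, Tuple

Address = Tuple[str, int]


def discover_master(
    sentinels: Iterable[Address],
    master_name: str,
    ask: Callable[[Address, str], Optional[Address]],
) -> Address:
    """Return the master address reported by the first Sentinel that answers."""
    for sentinel in sentinels:
        try:
            master = ask(sentinel, master_name)
        except OSError:
            # Sentinel unreachable (refused, timed out): try the next one.
            continue
        if master is not None:
            return master
    raise RuntimeError(f"no Sentinel returned a master for {master_name!r}")
```

Because the loop walks the whole endpoint list, a client configured with all three Sentinels can still find the master when one Sentinel is down, which is why the client contract asks for the list rather than a single address.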
Why a VM cluster (not OpenShift)
Per ADR 0006: Redis is intended as a VM-hosted private platform service for the clean rebuild. The reasoning follows the broader ADR pattern: keep stateful platform dependencies on VMs until the OpenShift-side gates (ESO, namespace baseline, NetworkPolicy) close, then revisit. Running Redis under a Kubernetes Operator in OpenShift is a future decision; the current state is VM.
Client contract
Production clients must use a Sentinel-aware library and must connect through the Sentinel endpoint list. The redis.sub.comptech-lab.com DNS round-robin is allowed for diagnostics and break-glass access only; it is not the production client contract.
Sentinel client pattern (pseudocode):
sentinel_endpoints = [
    "redis-0.sub.comptech-lab.com:26379",
    "redis-1.sub.comptech-lab.com:26379",
    "redis-2.sub.comptech-lab.com:26379",
]
client = SentinelClient(
    sentinels=sentinel_endpoints,
    service_name="<configured-master-name>",
    user="<workload-acl-user>",
    password="<workload-acl-password-from-vault>",
)
# Client discovers current master from Sentinels, redirects on failover
The non-Sentinel redis.sub.comptech-lab.com:6379 endpoint is for diagnostics: confirming reachability, checking the AOF tail, reading replication state. Not a production client path.
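For Python workloads, the pseudocode above maps fairly directly onto the `redis.sentinel.Sentinel` class from the third-party redis-py package. The sketch below is one possible shape, not the platform's mandated client: the function names are illustrative, the 0.5 second timeout is an arbitrary choice, and whether the Sentinels themselves require a password is an assumption exposed through the optional `sentinel_password` argument.

```python
from typing import List, Optional, Tuple

SENTINEL_ENDPOINTS = [
    "redis-0.sub.comptech-lab.com:26379",
    "redis-1.sub.comptech-lab.com:26379",
    "redis-2.sub.comptech-lab.com:26379",
]


def parse_endpoints(endpoints: List[str]) -> List[Tuple[str, int]]:
    """Turn 'host:port' strings into the (host, port) tuples redis-py expects."""
    parsed = []
    for endpoint in endpoints:
        host, _, port = endpoint.rpartition(":")
        parsed.append((host, int(port)))
    return parsed


def connect_master(
    master_name: str,
    username: str,
    password: str,
    sentinel_password: Optional[str] = None,
):
    """Return a client bound to whichever node the Sentinels report as master."""
    # Imported lazily so the helper above works without redis-py installed.
    from redis.sentinel import Sentinel  # pip install redis

    sentinel_kwargs = {"socket_timeout": 0.5}
    if sentinel_password:  # assumption: only needed if Sentinel auth is enabled
        sentinel_kwargs["password"] = sentinel_password

    sentinel = Sentinel(
        parse_endpoints(SENTINEL_ENDPOINTS),
        sentinel_kwargs=sentinel_kwargs,
    )
    # The workload ACL credentials apply to the Redis connection, not to Sentinel.
    return sentinel.master_for(
        master_name,
        username=username,
        password=password,
        socket_timeout=0.5,
    )
```

A caller would pass the configured master name and the workload ACL credentials, for example `connect_master("<configured-master-name>", user, secret)`, and then issue commands on the returned client as usual.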
Hardening gates (per ADR 0006)
Six gates separate current state from production-ready:
| Gate | Status | What it requires |
|---|---|---|
| 1. TLS and listener hardening | Open | Redis cert from internal CA; TLS on client, replication, Sentinel traffic; validate failover through TLS |
| 2. Credential and ACL hardening | Open | Move app creds to Vault; replace app-default smoke user with workload-scoped ACLs; deny dangerous command categories to app users |
| 3. Network hardening | Open | Narrow firewall from lab /16 to approved client networks once OpenShift consumer nodes and app namespaces are known |
| 4. Backup and restore hardening | Open | Current-master backup script; checksumming; encryption; MinIO + offline copy; isolated restore drill |
| 5. Observability hardening | Open | Export Redis + Sentinel metrics; alert on master changes, quorum loss, replica link failure, replication lag, evictions, memory pressure, persistence failures, disk saturation |
| 6. Operational resilience | Open | Re-run controlled failover after TLS + ACL hardening; per-node restart tests; client reconnect tests; upgrade/rollback procedure |
Each gate has explicit acceptance criteria in ADR 0006. A gate closes only when those criteria are met.
Operational guardrails
- No public IPs, no public DNS, no HAProxy wildcard for Redis. Private only until a more specific allowlist is approved.
- TLS is required before BFSI application onboarding. TLS implementation may use the lab internal CA first, then Vault PKI later after Vault PKI is accepted as an online intermediate.
- app-default is a smoke-test user only. Production uses one ACL user per app or per trust boundary, with least privilege and rotation ownership recorded.
- Persistence alone is not a backup strategy. AOF gives durability across restarts; it does not protect against logical corruption, accidental deletion, or VM-level loss.
- Don’t re-expose publicly. Any request to expose Redis publicly or through HAProxy needs a new ADR. The current decision is private-only.
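To make the per-app ACL guardrail concrete, the sketch below builds the argument list for an `ACL SETUSER` command that scopes a workload user to its own key prefix and removes the `@dangerous` command category. The user name, key prefix, and the choice of `+@read +@write` as the baseline are illustrative assumptions; the actual rule set is whatever ADR 0006 gate 2 ends up accepting.

```python
from typing import List


def acl_setuser_args(username: str, password: str, key_prefix: str) -> List[str]:
    """Build ACL SETUSER arguments for a least-privilege workload user."""
    return [
        "ACL", "SETUSER", username,
        "reset",              # start from an empty, disabled user
        "on",                 # enable the user
        f">{password}",       # set the password
        f"~{key_prefix}:*",   # restrict to keys under the app's prefix
        "+@read", "+@write",  # allow ordinary data commands
        "-@dangerous",        # then remove KEYS, FLUSHALL, CONFIG and similar
    ]
```

Redis applies ACL rules left to right, so `-@dangerous` has to come after the `+@` grants to take effect. The resulting list can be passed to any client's raw command interface, or joined with spaces for `redis-cli`.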
Validation
# DNS
dig @<lab-dns> redis-0.sub.comptech-lab.com A +short
dig @<lab-dns> redis-1.sub.comptech-lab.com A +short
dig @<lab-dns> redis-2.sub.comptech-lab.com A +short
dig @<lab-dns> redis-sentinel.sub.comptech-lab.com A +short
dig @<lab-dns> redis.sub.comptech-lab.com A +short
# Sentinel discovery (if Sentinel requires auth, add --user/--pass with credentials from custody)
redis-cli -h redis-0.sub.comptech-lab.com -p 26379 \
  sentinel masters
redis-cli -h redis-0.sub.comptech-lab.com -p 26379 \
  sentinel get-master-addr-by-name <master-name>
Custody:
- ACL user passwords: under secrets/redis-vm/ until Vault delivery is wired.
- Sentinel service password: same.
- Future TLS cert custody: TBD when gate 1 closes.
Failure modes
Symptom: client fails after a Redis master flap
Root cause. Client is using the non-Sentinel redis.sub.comptech-lab.com:6379 round-robin endpoint, not the Sentinel-aware path. The round-robin can return a non-master node, or a node that becomes non-master mid-connection.
Fix. Switch the client to a Sentinel-aware library and the Sentinel endpoint list.
Prevention. Document Sentinel-only as the production client contract. Reject non-Sentinel client configs in code review.
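The review rule can be backed by a small automated check. The sketch below assumes a hypothetical client config shape (a dict with `sentinels`, `master_name`, and optionally `host`); real application configs will differ, so treat the key names as placeholders.

```python
from typing import Dict, List

# The diagnostics-only round-robin name from the client contract.
DEBUG_ENDPOINT_HOST = "redis.sub.comptech-lab.com"


def lint_client_config(config: Dict) -> List[str]:
    """Return a list of contract violations; an empty list means the config passes."""
    problems = []
    if not config.get("sentinels"):
        problems.append("no Sentinel endpoint list configured")
    if not config.get("master_name"):
        problems.append("no Sentinel master name configured")
    if config.get("host") == DEBUG_ENDPOINT_HOST:
        problems.append(
            f"{DEBUG_ENDPOINT_HOST} is for diagnostics and break-glass access only"
        )
    return problems
```

Run in CI against each application's Redis settings, a non-empty result would fail the build before the config reaches a reviewer.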
Symptom: replica falls behind master and lag alerts fire
Root cause. Replica disk pressure, network throughput, or memory pressure on the replica process.
Fix. Identify which replica is lagging; check disk, network, and memory on that VM; restart the replica process or the VM if needed; then let the replica resynchronize with the master.
Prevention. Once monitoring is wired (gate 5), alert on replication lag before clients notice.
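Until gate 5 delivers real metrics, lag can be estimated by hand from the master's `INFO replication` output. The sketch below parses that text and reports, per replica, how many bytes of the replication stream it still has to apply. It assumes the output comes from the current master; any alert threshold applied to the result is a local choice.

```python
from typing import Dict


def replica_lag_bytes(info_replication: str) -> Dict[str, int]:
    """Map 'ip:port' of each replica to master_repl_offset minus its offset."""
    master_offset = None
    replica_offsets = {}
    for raw_line in info_replication.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        if key == "master_repl_offset":
            master_offset = int(value)
        elif key.startswith("slave") and key[5:].isdigit():
            # e.g. slave0:ip=...,port=6379,state=online,offset=1000,lag=0
            fields = dict(item.split("=", 1) for item in value.split(","))
            address = f"{fields['ip']}:{fields['port']}"
            replica_offsets[address] = int(fields["offset"])
    if master_offset is None:
        raise ValueError("master_repl_offset not found in INFO output")
    return {addr: master_offset - off for addr, off in replica_offsets.items()}
```

The input can be captured with `redis-cli ... info replication` against the node that Sentinel currently reports as master.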
Symptom: master and replicas all show “fail-over in progress” but never resolve
Root cause. Quorum is partial: two Sentinels disagree, or only two Sentinels are reachable. Alternatively, the master name is misconfigured.
Fix. Check Sentinel logs on each node. Verify each Sentinel can reach each Redis node. Restart the failed Sentinel; re-validate quorum.
Prevention. Network monitoring across the three Redis VMs; alert on Sentinel-quorum drops.
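A quick way to see what one Sentinel believes is to inspect its `SENTINEL MASTER <master-name>` reply, which `redis-cli` prints as alternating field and value lines when its output is piped. The sketch below turns those lines into a dict and flags two conditions relevant to this failure mode. It only reflects the view of the Sentinel that was queried, and it counts Sentinels that Sentinel knows about, not ones it can currently reach.

```python
from typing import Dict, List


def pairs_to_dict(lines: List[str]) -> Dict[str, str]:
    """Convert alternating field/value lines into a dict."""
    return dict(zip(lines[0::2], lines[1::2]))


def quorum_problems(master: Dict[str, str]) -> List[str]:
    """Return human-readable problems found in one Sentinel's view of the master."""
    problems = []
    known_sentinels = int(master["num-other-sentinels"]) + 1  # include this one
    quorum = int(master["quorum"])
    if known_sentinels < quorum:
        problems.append(
            f"only {known_sentinels} Sentinel(s) known, quorum is {quorum}"
        )
    if master["flags"].split(",") != ["master"]:
        problems.append(f"unexpected flags: {master['flags']}")
    return problems
```

Running the same check against each of the three Sentinels and comparing the results shows quickly whether they disagree.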
Symptom: backup question — can we recover after disk loss?
Root cause. Today, no — backup gate is open. AOF persistence on a single node doesn’t survive node loss.
Fix. Replication to another node is the operational mitigation. Rebuild the lost node and let it catch up.
Prevention. Close gate 4 (backup) before production.
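Gate 4 lists checksumming among the backup requirements. As a rough illustration of that one piece, the sketch below writes a SHA-256 sidecar file next to a backup artifact and verifies it later, for example before a restore drill. The sidecar naming and layout are assumptions, and encryption and off-host copies are out of scope here.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum(path: Path) -> Path:
    """Write '<hex digest>  <file name>' to a .sha256 sidecar and return its path."""
    sidecar = path.with_name(path.name + ".sha256")
    sidecar.write_text(f"{sha256_of(path)}  {path.name}\n")
    return sidecar


def verify_checksum(path: Path) -> bool:
    """Recompute the digest and compare it with the sidecar's recorded value."""
    sidecar = path.with_name(path.name + ".sha256")
    expected = sidecar.read_text().split()[0]
    return sha256_of(path) == expected
```

The sidecar uses the same two-space layout that `sha256sum -c` reads, so the check can also be run from a shell on the restore host.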
References
- opp-full-plat/adr/0006-redis-sentinel-hardening.md: hardening gates.
- opp-full-plat/plans/disconnected-rebuild/environments/dc-lab/redis-sentinel-vm-plan.md: VM plan.