Secrets custody drift check
A fast probe-before-you-work protocol that catches silent drift between local-mirror credentials and what the lab VMs actually accept, before you spend ten minutes fighting bad creds mid-task.
This page is the operator runbook for confirming, before any non-trivial work, that credentials stored in the local platform-credential mirror still match what the corresponding lab VM actually accepts. Local mirror files can drift from live state when a service’s password is rotated on the VM but not pushed back into the workspace, or when a file was provisioned before the corresponding service account existed.
Drift is silent: TCP-level reachability and protocol-level handshakes succeed, but authentication fails with WRONGPASS, 401, or equivalent. A single fast probe before the work starts prevents fighting bad creds mid-task. Pair this page with Rotate secrets and tokens — that one rotates a credential; this one decides whether rotation is needed at all.
Symptom
- A task that depends on a credential under opp-full-plat/secrets/<service>/... is queued.
- Authentication is failing against a live VM service and the credential file is the obvious next suspect.
- A session report or chat thread mentions that a service password was rotated.
- A credential has been flagged as broken in an admin-notes pile without a probe — confirm first; if it actually works, that is an auth-call shape issue, not a drift issue.
The diagnostic signature is service-specific:
| Service | Signature of drift |
|---|---|
| Redis | WRONGPASS or NOAUTH from redis-cli |
| Nexus | HTTP 401 from a REST probe; admin can still confirm the user exists |
| WSO2 | HTTP 401 from a Carbon UI service endpoint |
| Vault | HTTP 403 or permission denied from vault token lookup-self |
| MinIO | mc exits non-zero with Access Denied |
| Generic HTTPS basic-auth | HTTP 401 with a body indicating bad credentials |
Root cause
Three sources of drift:
- Password rotated on the VM but not written back to the local mirror. A sibling file like auth.env.pre-password-reset-YYYYMMDDHHMMSS is the giveaway.
- The service account in the file does not exist on the live service. The file was provisioned before the user was created, or the user was deleted later.
- The env-var names in the secret file do not map 1:1 to the actual ACL user names. Redis is the canonical case: REDIS_APP_PASSWORD is for ACL user app-default, not app. A probe with the wrong user name looks like full drift.
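The env-var versus ACL-user mismatch can be checked mechanically before any password is blamed. The following is a minimal sketch, assuming users.acl uses the usual `user <name> <rules...>` line format; `acl_users` and `acl_has_user` are hypothetical helper names, not existing workspace tooling.

```shell
# List ACL user names from Redis users.acl content read on stdin.
# Assumption: each rule line has the form "user <name> <rules...>".
acl_users() {
  awk '$1 == "user" { print $2 }'
}

# Succeed only if $1 is a literal ACL user name in the content on stdin.
acl_has_user() {
  acl_users | grep -Fxq -- "$1"
}
```

For example, `ssh ze@30.30.30.31 'sudo cat /etc/redis/users.acl' | acl_has_user app-default` succeeds only when app-default is a literal ACL user on the VM, which is worth confirming before reading a WRONGPASS as drift.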
The fix is not “rotate” — it is “probe correctly, then decide”. Reaching for rotation when the probe was wrong burns ten minutes of recovery work that did not need doing.
Fix
A probe MUST be:
- protocol-level (not just tcp/22 reachability);
- authentication-completing (not just a PING that goes through before auth is required);
- fast (a few seconds, no destructive write).
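One way to enforce the "fast" requirement is to run every probe under a hard time limit. This is a sketch only: it assumes the coreutils `timeout` command is available, and both `probe_fast` and the `PROBE_TIMEOUT` variable are hypothetical names introduced here.

```shell
# Run a probe command under a time limit (default 5 seconds).
# Assumption: the coreutils `timeout` command is installed.
probe_fast() {
  limit=${PROBE_TIMEOUT:-5}
  timeout "$limit" "$@"
  rc=$?
  if [ "$rc" -ne 0 ]; then
    echo "probe exited with status $rc (limit ${limit}s)" >&2
  fi
  return "$rc"
}
```

A probe that hangs is then reported as a failure instead of stalling the session; a timeout is a reachability question, not proof of drift.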
Pre-action checklist
- Identify the secret file in scope:
  ls -la /home/ze/opp-full-plat/secrets/<service>/
  Note any sibling files with names like auth.env.pre-password-reset-YYYYMMDDHHMMSS; they indicate a known prior rotation that may or may not have been written back.
- Identify which usernames the credential is supposed to authenticate. For services with ACLs (Redis, Nexus, WSO2), env-var names in the secret file do NOT always map 1:1 to actual ACL user names. Read the live VM’s authoritative config first:
  - Redis: /etc/redis/users.acl (the literal ACL users)
  - Nexus: REST /service/rest/v1/security/users (with admin creds)
  - WSO2: the super_admin block in repository/conf/deployment.toml
  - PostgreSQL / CNPG: pg_hba.conf or the operator’s user CR
- Treat the file as suspect until proven current. Do not yet propose architecture or admin work that depends on this credential.
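The sibling-file check in the first step can be scripted so it is never skipped. A minimal sketch, assuming rotation backups sit next to the secret file and follow the `<file>.pre-password-reset-<timestamp>` naming shown above; `rotation_markers` is a hypothetical helper name.

```shell
# Print sibling files that look like pre-rotation backups of a secret file.
# $1 = path to the secret file, e.g. .../secrets/<service>/auth.env
rotation_markers() {
  for f in "$1".pre-password-reset-*; do
    # An unmatched glob stays literal, so keep only paths that exist.
    [ -e "$f" ] && printf '%s\n' "$f"
  done
  return 0
}
```

Any output means a rotation happened at some point; it does not say whether the new value was written back, so the probe is still required.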
Probes
Redis with Sentinel (per-user ACL)
ENV_FILE=/home/ze/opp-full-plat/secrets/redis-sentinel/auth.env
# Load the REDIS_*_PASSWORD variables from the mirror file.
. "$ENV_FILE"
REDIS_HOST=30.30.30.31
# List the literal ACL user names on the live VM (second field of each "user" line).
ssh "ze@$REDIS_HOST" 'sudo cat /etc/redis/users.acl' | awk '$1 == "user" {print $2}'
# Probe each ACL user with its matching password variable.
redis-cli -h "$REDIS_HOST" --user app-default -a "$REDIS_APP_PASSWORD" PING
redis-cli -h "$REDIS_HOST" --user replica-user -a "$REDIS_REPLICA_PASSWORD" PING
redis-cli -h "$REDIS_HOST" --user sentinel-user -a "$REDIS_SENTINEL_PASSWORD" PING
redis-cli -h "$REDIS_HOST" --user admin -a "$REDIS_ADMIN_PASSWORD" PING
Expected: each probe returns PONG. NOAUTH or WRONGPASS indicates drift.
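The four probes can be wrapped so that each one yields a single verdict line. This is a sketch, not the runbook's official tooling: `probe_redis_user` and the `REDIS_CLI` override are hypothetical names, and the only redis-cli options used are `-h`, `--user`, `-a`, and `--no-auth-warning`.

```shell
# Client command; override REDIS_CLI to point at a different binary.
REDIS_CLI=${REDIS_CLI:-redis-cli}

# Probe one ACL user. $1 = host, $2 = ACL user, $3 = password.
# Prints "<user>: current" only when the whole reply is exactly PONG.
# stderr is captured too, so any extra auth error line counts as not current.
probe_redis_user() {
  reply=$("$REDIS_CLI" -h "$1" --user "$2" -a "$3" --no-auth-warning PING 2>&1)
  if [ "$reply" = "PONG" ]; then
    printf '%s: current\n' "$2"
  else
    printf '%s: drift (%s)\n' "$2" "$reply"
    return 1
  fi
}
```

After sourcing the env file, a call such as `probe_redis_user 30.30.30.31 app-default "$REDIS_APP_PASSWORD"` replaces one of the raw redis-cli lines. Requiring the reply to be exactly PONG is a deliberate choice: a PONG that arrives alongside an auth error is not treated as a pass.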
Nexus user (REST whoami)
USER_FILE=/home/ze/opp-full-plat/secrets/nexus-<user>-password
PASS=$(tr -d '\r\n' < "$USER_FILE")
curl -sS -u "<user>:$PASS" \
https://nexus-mirror.apps.sub.comptech-lab.com/service/rest/v1/security/users \
-o /dev/null -w "%{http_code}\n"
Expected: 200 (or 403 if the user lacks the API role but the credential is valid). 401 indicates drift OR the user does not exist on the live Nexus.
Disambiguate “wrong password” from “user does not exist”:
ADMIN_PASS=$(tr -d '\r\n' \
< /home/ze/opp-full-plat/secrets/nexus-admin-password)
curl -sS -u "admin:$ADMIN_PASS" \
https://nexus-mirror.apps.sub.comptech-lab.com/service/rest/v1/security/users \
| jq -r '.[].userId' | grep -Fx "<user>" \
&& echo "user exists" || echo "user does NOT exist"
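The two Nexus probes combine into a small decision table. The sketch below encodes that table as described above; `nexus_verdict` is a hypothetical helper, and the second argument is assumed to be the result of the admin-side lookup, passed in as yes or no.

```shell
# Map the user's probe status plus the admin-side lookup to a verdict.
# $1 = HTTP status from the user's probe
# $2 = "yes" or "no": did the admin listing contain the user?
nexus_verdict() {
  case "$1" in
    200) echo "current" ;;
    403) echo "current (credential valid, user lacks the API role)" ;;
    401)
      if [ "$2" = "yes" ]; then
        echo "drift: wrong password"
      else
        echo "drift: user does not exist"
      fi ;;
    *) echo "inconclusive: HTTP $1" ;;
  esac
}
```

Keeping "wrong password" and "user does not exist" as separate verdicts matters because the remediation differs: one needs a rotation, the other needs the account created first.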
WSO2 (APIM or IS) super admin
. /home/ze/opp-full-plat/secrets/wso2-apim-is/auth.env
curl -sS -k -u "admin:$WSO2_SUPER_ADMIN_PASS" \
"https://wso2-apim-is.sub.comptech-lab.com:9443/services/UserAdmin?wsdl" \
-o /dev/null -w "%{http_code}\n"
Expected: 200. Anything else, treat as drift (subject to the WSO2 URL encoding trap — see WSO2 APIM JMS URL encoding).
Generic HTTP(S) basic-auth
PASS=$(grep -E '^<KEY>=' "<env-file>" | cut -d= -f2-)
curl -sS -u "<user>:$PASS" "<service-url>/<auth-probing-endpoint>" \
-o /dev/null -w "%{http_code}\n"
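The grep and cut extraction works for plain KEY=value lines, but it keeps surrounding quotes and a trailing carriage return if the file has them. A slightly more defensive sketch follows; `env_value` is a hypothetical helper, and it assumes one `KEY=value` pair per line with no `export` prefix.

```shell
# Print the value of a key from a KEY=value env file without sourcing it.
# $1 = key, $2 = file. Strips a trailing CR and one pair of surrounding quotes.
# Exits non-zero when the key is absent.
env_value() {
  awk -v key="$1" '
    index($0, key "=") == 1 {
      val = substr($0, length(key) + 2)   # everything after the first "="
      sub(/\r$/, "", val)
      if (val ~ /^".*"$/ || val ~ /^\047.*\047$/)
        val = substr(val, 2, length(val) - 2)
      print val
      found = 1
      exit
    }
    END { exit found ? 0 : 1 }
  ' "$2"
}
```

Usage mirrors the original one-liner: `PASS=$(env_value '<KEY>' '<env-file>')`. Reading the file instead of sourcing it also avoids executing anything the file may contain.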
Validation
The secret file is current when the probe completes with one of:
- HTTP 200/204 from an authentication-completing endpoint, OR
- a protocol-level success token (PONG from Redis, 235 from SMTP after AUTH, etc.).
The secret file is stale (drift confirmed) when the probe returns one of:
- 401/403 with a body indicating bad credentials (NOT just lack of authorization for the action);
- WRONGPASS, NOAUTH, WRONG_PASSWORD, or the service-specific equivalent;
- the service-rejects-auth-and-closes pattern (connection reset or EOF after the auth step).
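These rules can be captured in one classifier so that every probe reports in the same vocabulary. The sketch below is an interpretation, not the authoritative rule set: `classify_probe` is a hypothetical name, and treating a bare 403 as "ambiguous" reflects the caveat that 403 may mean missing authorization instead of bad credentials.

```shell
# Classify the first token of a probe result (HTTP status or protocol reply).
classify_probe() {
  case "$1" in
    200|204|PONG) echo "current" ;;
    401|WRONGPASS*|NOAUTH*|WRONG_PASSWORD*) echo "stale" ;;
    403) echo "ambiguous" ;;       # inspect the body: bad creds or missing role?
    *) echo "inconclusive" ;;      # resets, EOF, timeouts: investigate by hand
  esac
}
```

For example, `classify_probe "$(curl -sS -u "<user>:$PASS" "<service-url>/<auth-probing-endpoint>" -o /dev/null -w '%{http_code}')"` turns the generic HTTP probe into a one-word verdict.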
When drift is confirmed
- Stop the task that depends on the credential. Do not work around with admin overrides; that only deepens the drift.
- Surface the drift in chat or the active issue immediately. Do not silently struggle.
- Rotate the secret correctly via Rotate secrets and tokens. Convention: 24 chars, alphanumeric, no &, @, %, ?, :, # (these characters trigger URL-encoding traps across multiple subsystems).
- Re-run the probe to confirm the new credential is current.
- Record the rotation in the active session report or close-out issue with file path, timestamp, who rotated on the VM side, and the probe evidence.
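The password convention in the rotation step is easy to generate and to verify. A minimal sketch, assuming /dev/urandom and `head -c` are available; `gen_password` and `is_convention_password` are hypothetical helpers, and the Rotate secrets and tokens page remains the authority on how the value is actually set on the VM.

```shell
# Generate a password that follows the 24-char alphanumeric convention.
gen_password() {
  LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c 24
  echo
}

# Succeed only if $1 is exactly 24 characters, all ASCII letters or digits.
# The body runs in a subshell so the LC_ALL override stays local.
is_convention_password() (
  LC_ALL=C
  case "$1" in
    *[!A-Za-z0-9]*) exit 1 ;;
  esac
  [ "${#1}" -eq 24 ]
)
```

Running `is_convention_password` against an existing secret before reuse flags values containing &, @, %, ?, :, or #, which are the ones that trigger the URL-encoding traps.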
Forbidden actions
- Do NOT use the admin credential to “punch through” a failing per-service auth probe. The point of the probe is to confirm the specific credential file in scope is current; admin override defeats that.
- Do NOT update the local mirror file with a password you did not also set on the live VM. The file is a mirror, not the source of truth.
- Do NOT commit secret files to any Git repository. Filesystem-only custody is the convention; the workspace .gitignore is the safety net.
- Do NOT propose architecture or admin work that depends on a specific credential without first verifying it.
Example: 2026-05-09 probe pass and false alarm
Three credential files probed at the start of a smoke-test session. The flow that worked, and the false alarm that cost ten minutes:
- secrets/redis-sentinel/auth.env. Initial probe used the env-var suffixes (app, replica, sentinel) as ACL user names. All four returned WRONGPASS, which looked like full drift. Reading /etc/redis/users.acl revealed the real ACL users were app-default, replica-user, sentinel-user, admin. Re-probed with the correct user names and all four authenticated. Not drift; probe was wrong. Lesson: read the live ACL/user table before trusting env-var names.
- secrets/nexus-mavenbot-password. Probe returned 401. Admin-side disambiguation showed the mavenbot user did not exist on Nexus. Real drift; remediated by creating the user with role nexus-maven-pull and regenerating a 24-char alphanumeric password (deliberately avoiding &, @, %, ?, :, #). Updated the secret file, re-probed, 200 on /v2/ and manifest pull.
- secrets/wso2-apim-is/auth.env. WSO2_SUPER_ADMIN_PASS rejected. A sibling .pre-password-reset-20260508200845 file confirmed a prior rotation that was never written back. Remediated by rewriting the working value into auth.env. See also the WSO2 URL-encoding trap for the downstream consequence of & characters in passwords.
Prevention
Three guardrails reduce drift over time:
- Probe before you work. Every session’s warm-up runs the probe matrix against any credential the session needs. The cost is seconds; the savings are minutes.
- 24-char alphanumeric convention. Generated passwords avoid URL-special characters, eliminating the WSO2-class downstream traps.
- Vault is the canonical source for application credentials. Per-division Vault paths (secret/apps/<division>/<app>/<env>/*) propagate to every consumer via ESO automatically. The local mirror remains the backstop for platform credentials not yet in Vault (kubeadmin passwords, platform PATs, RHACS Central admin), not a parallel source of truth.
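The warm-up probe matrix from the first guardrail could look like the sketch below. It is an illustration under stated assumptions: `run_probe_matrix` is a hypothetical name, and each probe is passed as a label plus a command string whose exit status is zero only when authentication completed.

```shell
# Run each named probe and print one status line per credential.
# Arguments come in pairs: <label> <command string>.
# Returns non-zero if any probe failed.
run_probe_matrix() {
  failures=0
  while [ "$#" -ge 2 ]; do
    # Each probe runs in a subshell with its output discarded.
    if ( eval "$2" ) >/dev/null 2>&1; then
      printf 'OK   %s\n' "$1"
    else
      printf 'FAIL %s\n' "$1"
      failures=$((failures + 1))
    fi
    shift 2
  done
  [ "$failures" -eq 0 ]
}
```

A session would call it with one pair per credential it needs, for example a label of redis-app paired with the redis-cli probe command. A FAIL line is a prompt to run the specific probe by hand, not a drift verdict on its own.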
References
- opp-full-plat/runbooks/secrets-custody-drift-check.md: operator-facing source.
- Rotate secrets and tokens: the rotation procedure once drift is confirmed.
- WSO2 APIM JMS URL encoding: downstream trap for passwords with URL-special characters.
- opp-full-plat/connection-details/vault-app-secrets.md: Vault is the canonical source for application credentials.
- ADR coverage for credential custody (filesystem mirror, no in-repo commit).