Secrets custody drift check

A fast probe-before-you-work protocol that catches silent drift between local-mirror credentials and what the lab VMs actually accept, before you spend ten minutes fighting bad creds mid-task.

This page is the operator runbook for confirming, before any non-trivial work, that credentials stored in the local platform-credential mirror still match what the corresponding lab VM actually accepts. Local mirror files can drift from live state when a service’s password is rotated on the VM but not pushed back into the workspace, or when a file was provisioned before the corresponding service account existed.

Drift is silent: TCP-level reachability and protocol-level handshakes succeed, but authentication fails with WRONGPASS, 401, or equivalent. A single fast probe before the work starts prevents fighting bad creds mid-task. Pair this page with Rotate secrets and tokens — that one rotates a credential; this one decides whether rotation is needed at all.

Symptom

  • A task that depends on a credential under opp-full-plat/secrets/<service>/... is queued.
  • Authentication is failing against a live VM service and the credential file is the obvious next suspect.
  • A session report or chat thread mentions that a service password was rotated.
  • A credential has been flagged as broken in admin notes without a probe. Confirm first: if the credential actually works, the problem is the shape of the auth call, not drift.

The diagnostic signature is service-specific:

Service                    Signature of drift
Redis                      WRONGPASS or NOAUTH from redis-cli
Nexus                      HTTP 401 from a REST probe; admin can still confirm the user exists
WSO2                       HTTP 401 from a Carbon UI service endpoint
Vault                      HTTP 403 or permission denied from vault token lookup-self
MinIO                      mc exits non-zero with Access Denied
Generic HTTPS basic-auth   HTTP 401 with a body indicating bad credentials

Root cause

Three sources of drift:

  1. Password rotated on the VM but not written back to the local mirror. A sibling file like auth.env.pre-password-reset-YYYYMMDDHHMMSS is the giveaway.
  2. The service account in the file does not exist on the live service. The file was provisioned before the user was created, or the user was deleted later.
  3. The env-var names in the secret file do not map 1:1 to the actual ACL user names. Redis is the canonical case — REDIS_APP_PASSWORD is for ACL user app-default, not app. The probe with the wrong user name looks like full drift.

The fix is not “rotate” — it is “probe correctly, then decide”. Reaching for rotation when the probe was wrong burns ten minutes of recovery work that did not need doing.

Fix

A probe MUST be:

  • protocol-level (not just tcp/22 reachability);
  • authentication-completing (not just a PING that goes through before auth is required);
  • fast (a few seconds, no destructive write).

Pre-action checklist

  1. Identify the secret file in scope:

    ls -la /home/ze/opp-full-plat/secrets/<service>/

    Note any sibling files with names like auth.env.pre-password-reset-YYYYMMDDHHMMSS — they indicate a known prior rotation that may or may not have been written back.

  2. Identify which usernames the credential is supposed to authenticate. For services with ACLs (Redis, Nexus, WSO2), env-var names in the secret file do NOT always map 1:1 to actual ACL user names. Read the live VM’s authoritative config first:

    • Redis: /etc/redis/users.acl (the literal ACL users)
    • Nexus: REST /service/rest/v1/security/users (with admin creds)
    • WSO2: the super_admin block in repository/conf/deployment.toml
    • PostgreSQL / CNPG: pg_hba.conf or the operator’s user CR
  3. Treat the file as suspect until proven current. Do not yet propose architecture or admin work that depends on this credential.
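
The sibling-file check in step 1 can be scripted. The following is a minimal sketch, assuming the backup naming shown above (a suffix of .pre-password-reset-YYYYMMDDHHMMSS); the function name find_reset_siblings is hypothetical. A hit only shows that a rotation happened at that time, not that the mirror is stale, so the probe is still required.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list pre-password-reset siblings in a secret directory
# and print the rotation timestamp embedded in each file name.
find_reset_siblings() {
  local dir=$1 f stamp
  for f in "$dir"/*.pre-password-reset-*; do
    [ -e "$f" ] || continue             # the glob matched nothing
    stamp=${f##*.pre-password-reset-}   # keep the trailing YYYYMMDDHHMMSS
    case $stamp in
      [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])
        # Output: <timestamp><TAB><file name>
        printf '%s\t%s\n' "$stamp" "${f##*/}" ;;
    esac
  done
}
```

Usage: find_reset_siblings /home/ze/opp-full-plat/secrets/<service>. An empty result means no recorded prior rotation was found in that directory.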

Probes

Redis with Sentinel (per-user ACL)

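# Load the mirrored passwords (REDIS_*_PASSWORD) into the current shell.
ENV_FILE=/home/ze/opp-full-plat/secrets/redis-sentinel/auth.env
. "$ENV_FILE"
REDIS_HOST=30.30.30.31

# List the literal ACL user names on the live VM (second field of each "user" line).
ssh "ze@$REDIS_HOST" 'sudo cat /etc/redis/users.acl' | awk '$1 == "user" {print $2}'

# Probe each ACL user with its mirrored password; redis-cli sends AUTH before PING.
redis-cli -h "$REDIS_HOST" --user app-default   -a "$REDIS_APP_PASSWORD"      PING
redis-cli -h "$REDIS_HOST" --user replica-user  -a "$REDIS_REPLICA_PASSWORD"  PING
redis-cli -h "$REDIS_HOST" --user sentinel-user -a "$REDIS_SENTINEL_PASSWORD" PING
redis-cli -h "$REDIS_HOST" --user admin         -a "$REDIS_ADMIN_PASSWORD"    PING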

Expected: each probe returns PONG. NOAUTH or WRONGPASS indicates drift.
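
Reading four replies by eye is error-prone, so the verdict can be derived mechanically. The sketch below is illustrative only: classify_redis_reply and probe_redis_user are hypothetical names. It also assumes that redis-cli reports a failed AUTH as an "AUTH failed" warning and then still sends the command, which means a passwordless default user could answer PONG after a failed login; verify that behavior on the installed redis-cli version before relying on it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify the combined stdout+stderr of one redis-cli probe.
# Drift markers are checked first so that a PONG following a failed AUTH
# is not mistaken for a valid credential.
classify_redis_reply() {
  case $1 in
    *WRONGPASS*|*NOAUTH*|*"AUTH failed"*) echo drift ;;
    *PONG*)                               echo current ;;
    *)                                    echo inconclusive ;;
  esac
}

# Hypothetical wrapper: probe one ACL user and print "<user><TAB><verdict>".
# Arguments: host, ACL user name (from users.acl, not the env-var suffix), password.
probe_redis_user() {
  local host=$1 user=$2 pass=$3 out
  out=$(redis-cli -h "$host" --user "$user" -a "$pass" --no-auth-warning PING 2>&1)
  printf '%s\t%s\n' "$user" "$(classify_redis_reply "$out")"
}
```

Usage: probe_redis_user "$REDIS_HOST" app-default "$REDIS_APP_PASSWORD". An inconclusive verdict (for example a connection error) is a reachability problem, not evidence of drift.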

Nexus user (REST whoami)

USER_FILE=/home/ze/opp-full-plat/secrets/nexus-<user>-password
PASS=$(tr -d '\r\n' < "$USER_FILE")
curl -sS -u "<user>:$PASS" \
  https://nexus-mirror.apps.sub.comptech-lab.com/service/rest/v1/security/users \
  -o /dev/null -w "%{http_code}\n"

Expected: 200 (or 403 if the user lacks the API role but the credential is valid). 401 indicates drift OR the user does not exist on the live Nexus.

Disambiguate “wrong password” from “user does not exist”:

ADMIN_PASS=$(tr -d '\r\n' \
  < /home/ze/opp-full-plat/secrets/nexus-admin-password)
curl -sS -u "admin:$ADMIN_PASS" \
  https://nexus-mirror.apps.sub.comptech-lab.com/service/rest/v1/security/users \
  | jq -r '.[].userId' | grep -Fx "<user>" \
  && echo "user exists" || echo "user does NOT exist"

WSO2 (APIM or IS) super admin

. /home/ze/opp-full-plat/secrets/wso2-apim-is/auth.env
curl -sS -k -u "admin:$WSO2_SUPER_ADMIN_PASS" \
  "https://wso2-apim-is.sub.comptech-lab.com:9443/services/UserAdmin?wsdl" \
  -o /dev/null -w "%{http_code}\n"

Expected: 200. Treat any other status as drift, unless the WSO2 URL-encoding trap applies (see WSO2 APIM JMS URL encoding).

Generic HTTP(S) basic-auth

PASS=$(grep -E '^<KEY>=' "<env-file>" | cut -d= -f2-)
curl -sS -u "<user>:$PASS" "<service-url>/<auth-probing-endpoint>" \
  -o /dev/null -w "%{http_code}\n"
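
The grep/cut extraction above keeps surrounding quotes and any trailing carriage return, either of which would make a valid password look stale. A slightly more defensive sketch follows; read_env_value is a hypothetical helper, and it assumes the key contains only letters, digits, and underscores and that the file uses simple KEY=value lines (optionally prefixed with export).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of KEY from a KEY=value env file
# without sourcing (and therefore executing) the file.
# Returns 1 if the key is not present.
read_env_value() {
  local file=$1 key=$2 line val
  # Take the last assignment, mirroring what sourcing the file would yield.
  line=$(grep -E "^(export[[:space:]]+)?${key}=" "$file" | tail -n 1)
  [ -n "$line" ] || return 1
  # Drop everything up to the first '=' and any CR from CRLF line endings.
  val=$(printf '%s' "${line#*=}" | tr -d '\r')
  # Strip one pair of matching surrounding quotes, if present.
  case $val in
    \"*\") val=${val#\"}; val=${val%\"} ;;
    \'*\') val=${val#\'}; val=${val%\'} ;;
  esac
  printf '%s' "$val"
}
```

Usage: PASS=$(read_env_value "<env-file>" "<KEY>") in place of the grep/cut line, followed by the same curl probe.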

Validation

The secret file is current when the probe returns one of:

  • HTTP 200 / 204 from an authentication-completing endpoint, OR
  • protocol-level success token (PONG from Redis, +OK from SMTP after AUTH, etc.).

The secret file is stale (drift confirmed) when the probe returns one of:

  • 401 / 403 with a body indicating bad credentials (NOT just lack of authorization for the action);
  • WRONGPASS, NOAUTH, WRONG_PASSWORD, or the service-specific equivalent;
  • the service-rejects-auth-and-closes pattern (connection reset or EOF after the auth step).
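
For HTTP probes, the rules above can be condensed into one decision function. This is a sketch under stated assumptions: classify_http_probe is a hypothetical name, and the body patterns ("bad credentials", "invalid credentials") are placeholders that should be replaced with whatever the service in scope actually returns.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map an HTTP status (and optional response body) from an
# authentication-completing endpoint onto the verdicts used in this section.
classify_http_probe() {
  local code=$1 body=${2:-}
  case $code in
    200|204) echo current ;;
    401)     echo stale ;;
    403)
      # 403 means stale only when the body blames the credentials; otherwise
      # the credential is valid and merely lacks authorization for this action.
      case $body in
        *[Bb]ad\ credentials*|*[Ii]nvalid\ credentials*) echo stale ;;
        *) echo current-but-unauthorized ;;
      esac ;;
    000)     echo unreachable ;;    # curl prints 000 when no HTTP response arrived
    *)       echo inconclusive ;;
  esac
}
```

Usage: classify_http_probe "$(curl -sS -u "<user>:$PASS" "<service-url>/<auth-probing-endpoint>" -o /dev/null -w '%{http_code}')". Only a stale verdict justifies moving on to rotation; unreachable and inconclusive call for a different investigation.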

When drift is confirmed

  1. Stop the task that depends on the credential. Do not work around it with admin overrides; that only deepens the drift.
  2. Surface the drift in chat or the active issue immediately. Do not silently struggle.
  3. Rotate the secret correctly via Rotate secrets and tokens. Convention: 24 chars, alphanumeric, no &, @, %, ?, :, # (these characters trigger URL-encoding traps across multiple subsystems).
  4. Re-run the probe to confirm the new credential is current.
  5. Record the rotation in the active session report or close-out issue with file path, timestamp, who rotated on the VM side, and the probe evidence.
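
The password convention in step 3 is easy to generate and to check mechanically. A minimal sketch, assuming /dev/urandom is available; generate_password and is_conventional_password are hypothetical names and are not part of the Rotate secrets and tokens procedure.

```shell
#!/usr/bin/env bash
# Hypothetical helper: generate a password matching the stated convention
# (24 characters, ASCII letters and digits only).
# 1024 random bytes yield far more than 24 alphanumeric characters in practice.
generate_password() {
  head -c 1024 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9' | head -c 24
}

# Hypothetical helper: succeed only if $1 is exactly 24 characters long and
# contains nothing outside A-Z, a-z, 0-9 (so none of & @ % ? : #).
is_conventional_password() {
  case $1 in
    *[!A-Za-z0-9]*) return 1 ;;
  esac
  [ "${#1}" -eq 24 ]
}
```

Usage: NEW_PASS=$(generate_password); is_conventional_password "$NEW_PASS" || echo "regenerate". The check is also useful on an existing mirror value to spot passwords that predate the convention.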

Forbidden actions

  • Do NOT use the admin credential to “punch through” a failing per-service auth probe. The point of the probe is to confirm the specific credential file in scope is current; admin override defeats that.
  • Do NOT update the local mirror file with a password you did not also set on the live VM. The file is a mirror, not the source of truth.
  • Do NOT commit secret files to any Git repository. Filesystem-only custody is the convention; the workspace .gitignore is the safety net.
  • Do NOT propose architecture or admin work that depends on a specific credential without first verifying it.
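
The .gitignore safety net mentioned above can be verified rather than assumed. The sketch below relies on git check-ignore, which exits 0 when a path is ignored and non-zero otherwise; secret_is_git_ignored is a hypothetical wrapper. A file that is already tracked is not reported as ignored, so a failure here can also mean the secret was committed at some point.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the given file is ignored by the Git
# repository that contains it. Fails for tracked files, for files not covered
# by any ignore rule, and for paths outside a repository.
secret_is_git_ignored() {
  local path=$1
  git -C "$(dirname "$path")" check-ignore -q -- "$(basename "$path")"
}
```

Usage: secret_is_git_ignored /home/ze/opp-full-plat/secrets/<service>/auth.env || echo "NOT ignored: fix .gitignore before continuing".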

Example: 2026-05-09 probe pass and false alarm

Three credential files probed at the start of a smoke-test session. The flow that worked, and the false alarm that cost ten minutes:

  1. secrets/redis-sentinel/auth.env. Initial probe used the env-var suffixes (app, replica, sentinel) as ACL user names. All four returned WRONGPASS — looked like full drift. Reading /etc/redis/users.acl revealed the real ACL users were app-default, replica-user, sentinel-user, admin. Re-probed with the correct user names — all four authenticated. Not drift; probe was wrong. Lesson: read the live ACL/user table before trusting env-var names.

  2. secrets/nexus-mavenbot-password. Probe returned 401. Admin-side disambiguation showed the mavenbot user did not exist on Nexus. Real drift; remediated by creating the user with role nexus-maven-pull and regenerating a 24-char alphanumeric password (deliberately avoiding &, @, %, ?, :, #). Updated the secret file, re-probed, 200 on /v2/ and manifest pull.

  3. secrets/wso2-apim-is/auth.env. WSO2_SUPER_ADMIN_PASS rejected. A sibling .pre-password-reset-20260508200845 file confirmed a prior rotation that was never written back. Remediated by rewriting the working value into auth.env. See also the WSO2 URL-encoding trap for the downstream consequence of & characters in passwords.

Prevention

Three guardrails reduce drift over time:

  1. Probe before you work. Every session’s warm-up runs the probe matrix against any credential the session needs. The cost is seconds; the savings are minutes.
  2. 24-char alphanumeric convention. Generated passwords avoid URL-special characters, eliminating the WSO2-class downstream traps.
  3. Vault is the canonical source for application credentials. Per-division Vault paths (secret/apps/<division>/<app>/<env>/*) propagate to every consumer via ESO automatically. The local mirror remains the backstop for platform credentials not yet in Vault — kubeadmin passwords, platform PATs, RHACS Central admin — not a parallel source of truth.
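
The warm-up probe matrix in guardrail 1 could look something like the following. This is a sketch, not the session tooling itself: run_probe_matrix is a hypothetical name, each entry is a "name=command" pair, and each command is assumed to exit 0 only when authentication completed (for curl, that means adding -f so HTTP errors become a non-zero exit).

```shell
#!/usr/bin/env bash
# Hypothetical runner: each argument is "name=command". The command runs via
# `sh -c`; output is discarded and only the exit status is used.
# Returns 1 if any probe failed, 0 if all passed.
run_probe_matrix() {
  local entry name cmd failed=0
  for entry in "$@"; do
    name=${entry%%=*}   # text before the first '='
    cmd=${entry#*=}     # text after the first '='
    if sh -c "$cmd" >/dev/null 2>&1; then
      printf 'OK     %s\n' "$name"
    else
      printf 'DRIFT? %s\n' "$name"
      failed=1
    fi
  done
  return "$failed"
}
```

A failing entry is reported as "DRIFT?" rather than "DRIFT" on purpose: the runner cannot tell a bad credential from an unreachable host, so each failure still needs the per-service probe before any rotation. Commands should carry their own timeouts (for example curl --max-time) so the warm-up stays within seconds.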

References

  • opp-full-plat/runbooks/secrets-custody-drift-check.md — operator-facing source.
  • Rotate secrets and tokens — the rotation procedure once drift is confirmed.
  • WSO2 APIM JMS URL encoding — downstream trap for passwords with URL-special characters.
  • opp-full-plat/connection-details/vault-app-secrets.md — Vault is the canonical source for application credentials.
  • ADR coverage for credential custody (filesystem mirror, no in-repo commit).

Last reviewed: 2026-05-12