Secrets management — Vault, External Secrets Operator, and the rotation patterns that actually work
Why Kubernetes Secrets aren't secret, the four delivery patterns, Vault + ESO as the dominant on-prem stack, and rotation patterns that survive contact with production.
Module 07 covered the runtime layer — what watches the workload after it ships. This module is about something the workload needs to ship: secrets. Database passwords, API keys, TLS keys, registry credentials, signing keys, OIDC client secrets. Get this wrong and every other layer becomes meaningless — a perfectly-scanned, perfectly-monitored container that reads its master DB password from a ConfigMap is not secure.
The argument here is that how a secret reaches a pod is itself a security design, and the right design has central custody, runtime delivery, per-namespace isolation, and a rotation story. The dominant on-prem pattern is HashiCorp Vault plus the External Secrets Operator (ESO); the cloud-native equivalent uses AWS/GCP/Azure secret stores. This module covers both and the failure modes both share.
The secrets-in-Kubernetes problem
A Kubernetes Secret is not encrypted. The value is base64-encoded, which is encoding, not encryption: echo "cGFzc3dvcmQ=" | base64 -d returns “password” with no key required. Anyone with get secrets RBAC on a namespace can read every secret in it; anyone with cluster-scoped get secrets can read every secret on the cluster.
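To make the point concrete, here is a minimal Python sketch that decodes the data field of a Secret the way any reader with get secrets RBAC could. The dictionary is a made-up example, not output from the lab.

```python
import base64

# Hypothetical "data" field of a Secret, shaped like `kubectl get secret -o json` output.
secret_data = {"username": "YXBw", "password": "cGFzc3dvcmQ="}

# No key is involved anywhere: base64 is a reversible encoding.
decoded = {key: base64.b64decode(value).decode("utf-8") for key, value in secret_data.items()}
print(decoded)
```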
Encryption-at-rest is a separate setting — EncryptionConfiguration on the kube-apiserver, which encrypts the secret bytes inside etcd. Without that config, etcd’s on-disk database contains every secret in plaintext-equivalent. With it, etcd is encrypted but the bytes are decrypted on every API read; the RBAC story is unchanged.
That leaves two unsolved problems even on a hardened cluster:
Where did the secret come from? Did someone kubectl create secret manually (no audit trail)? Was it committed to git in an encrypted form? Is it pulled from a central store at runtime?
How does it rotate? A static secret in a YAML file rotates when somebody remembers; the median rotation period in a fleet with that pattern is “never.” PCI-DSS Req 8.3.7 asks for max-90-day rotation; SOC 2 expects documented rotation procedures; no auditor accepts “we’ll get to it.”
The four patterns below are the universe of answers.
The four patterns — pick one
| Pattern | Custody | Rotation | When to use |
|---|---|---|---|
| Manual kubectl create secret | None | None | Toy clusters, throwaway demos. |
| Sealed Secrets | Per-cluster public key | Manual (re-seal) | Single-cluster GitOps, low blast-radius. |
| External Secrets Operator (ESO) | Central store (Vault, AWS/GCP/Azure SM) | Central rotation + refresh | Multi-cluster on-prem with central custody. The lab’s pattern. |
| CSI Secrets Store driver | Central store, mounted as files | Pod restart picks up new value | Cloud-managed secret stores; less on-prem maturity. |
Manual is the unmanaged path. It’s how almost every lab starts and how no production cluster should run. If your auditor asks “where do the secrets come from?”, “I created them manually” is not an answer.
Sealed Secrets (the Bitnami controller) encrypts secrets at rest in git with a per-cluster public key. You write a SealedSecret CR with the ciphertext; the controller on the cluster decrypts it with its private key into a regular Secret. The good: secrets can live in git like any other resource, and the encryption is strong (a random AES-256-GCM session key wrapped with the controller’s RSA public key). The bad: the private key is per-cluster, so the blast radius of a key compromise is “every secret on this cluster ever sealed”; rotation is a re-seal of every SealedSecret, which is painful at scale. Sealed Secrets is a fine choice for a small single-cluster deployment and not the right answer for a 30-cluster fleet.
External Secrets Operator keeps the secret material in a central store (Vault, AWS Secrets Manager, GCP Secret Manager, Azure Key Vault, 1Password, Akeyless, etc.) and pulls it into Kubernetes at runtime. No secret material in git, ever. Rotation is a write to the central store; ESO refreshes the materialised Kubernetes Secret on its next interval. This is the lab’s pattern — see /docs/openshift-platform/secrets-eso/architecture.
CSI Secrets Store driver mounts secrets directly as files inside the pod via the CSI subsystem — no Kubernetes Secret object is created at all (unless you opt in to syncing). Good for cloud-managed stores where the cloud provider has a CSI driver (AWS SSM, GCP Secret Manager, Azure Key Vault). Less mature on-prem; the Vault CSI provider works but the operational story is heavier than ESO.
For multi-cluster on-prem environments — the lab, BFSI, regulated industries — ESO + Vault is the dominant pattern and the rest of this module follows that path.
Vault — the canonical secrets store
HashiCorp Vault, available as OSS or Enterprise, is the default backing store for the ESO pattern. The lab runs Vault OSS on 30.30.30.20-23 (one seal-helper plus a three-node Raft cluster). The capabilities relevant here:
KV-v2 (key-value, versioned). The simplest engine. Writers run vault kv put secret/apps/payments/db username=app password=...; readers run vault kv get secret/apps/payments/db. Every write is versioned; you can read the previous N versions for emergency rollback. KV-v2 is what 80% of secret usage looks like.
Database secrets engine. Vault generates short-lived database credentials on demand. You configure a “role” that says “create a Postgres user with INSERT/SELECT on schema payments, expire in 2 hours.” A pod requests credentials at startup; Vault creates v-payments-pod42-a8d3e7 with the role, returns the password, sets a 2h TTL. When the lease expires, Vault revokes the user. This is the strongest pattern — there is no long-lived DB password anywhere; rotation is automatic.
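The request a pod makes at startup can be sketched with the standard library alone. This is a rough illustration, assuming the engine is mounted at the default database/ path; the role name payments, the token, and the sample response are all hypothetical, and the sketch only builds the request and parses a reply rather than contacting a real Vault.

```python
import urllib.request


def build_creds_request(vault_addr: str, role: str, token: str) -> urllib.request.Request:
    """Build the GET request for Vault's database/creds/<role> endpoint."""
    url = f"{vault_addr.rstrip('/')}/v1/database/creds/{role}"
    return urllib.request.Request(url, headers={"X-Vault-Token": token}, method="GET")


def parse_creds(body: dict) -> dict:
    """Pull out the fields a pod needs from a database/creds response."""
    return {
        "username": body["data"]["username"],
        "password": body["data"]["password"],
        "lease_id": body["lease_id"],
        "ttl_seconds": body["lease_duration"],
    }


# Hypothetical response, shaped like the reply Vault documents for this endpoint.
sample_response = {
    "lease_id": "database/creds/payments/abc123",
    "lease_duration": 7200,
    "renewable": True,
    "data": {"username": "v-payments-pod42-a8d3e7", "password": "example-only"},
}

req = build_creds_request("https://vault.local:8200", "payments", "hypothetical-token")
creds = parse_creds(sample_response)
print(req.get_method(), req.full_url)
print(creds["username"], creds["ttl_seconds"])
```

The lease_id is worth keeping: it is what the consumer presents when it renews or revokes the credential before the TTL runs out.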
PKI engine. Vault as a Certificate Authority. Pods request short-lived certs (TTL hours to days) via the PKI API; Vault issues them; the pod rotates before expiry. Cuts cert-manager’s manual ACME dance for internal mTLS.
Transit engine. Encryption-as-a-service. The application calls Vault to encrypt or decrypt a payload; the key never leaves Vault. Used for things like row-level encryption in a database where the application has the ciphertext and Vault has the key.
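A Transit encrypt call is a small JSON POST. The sketch below only builds the request, assuming the engine is mounted at the default transit/ path; the key name payments-rows, the token, and the payload are hypothetical.

```python
import base64
import json
import urllib.request


def build_encrypt_request(vault_addr: str, key_name: str, token: str,
                          plaintext: bytes) -> urllib.request.Request:
    """Build the POST for transit/encrypt/<key>; Transit expects base64 plaintext."""
    payload = {"plaintext": base64.b64encode(plaintext).decode("ascii")}
    return urllib.request.Request(
        f"{vault_addr.rstrip('/')}/v1/transit/encrypt/{key_name}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
        method="POST",
    )


encrypt_req = build_encrypt_request(
    "https://vault.local:8200", "payments-rows", "hypothetical-token", b"account=1234"
)
encrypt_body = json.loads(encrypt_req.data)
print(encrypt_req.full_url)
print(encrypt_body)
```

The application stores only the ciphertext string Vault returns and sends it back to the matching decrypt endpoint when it needs the value, so the key itself never appears in application code.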
Auth methods. Multiple ways for clients to prove who they are: Kubernetes ServiceAccount tokens (the ESO pattern below), AppRole (role-id + secret-id, useful for non-K8s clients), OIDC (humans), JWT, AWS IAM, GCP IAM, TLS certs. The K8s SA path is the one ESO uses.
The lab uses KV-v2 universally and is moving select tenants to the database engine for the dynamic-credentials pattern. PKI is used for the cluster-internal mTLS in a few namespaces. Transit is reserved for the small number of apps that have application-level encryption needs.
The ESO architecture
Reading the diagram:
- A tenant ServiceAccount in a tenant namespace sits idle until needed.
- The SecretStore CR (namespace-scoped) declares “Vault is at https://vault.local:8200, auth via the K8s method, the SA token to present is X.”
- An ExternalSecret CR says “create a Secret named payments-db in this namespace, pulling fields from Vault path secret/apps/payments/db, refresh every 30 minutes.”
- The ESO operand (running in the external-secrets namespace) sees the ExternalSecret, validates the SecretStore, asks Vault to authenticate the SA token via TokenReview, receives a Vault token, reads the path, and materialises the Kubernetes Secret.
- The pod mounts the Secret as a volume or env var.
- Every Vault read is logged to the audit backend (the lab pipes Vault audit to the SIEM).
The interval-driven refresh is what makes rotation work. Set refreshInterval: 30m; rotate in Vault; within 30 minutes the materialised Secret has the new value. The pod does not necessarily follow: a Secret consumed as an env var keeps the old value until the pod restarts, and even where the kubelet updates a mounted volume the application usually still holds the old value in memory. Restart the consumer yourself or let the Reloader operator do it automatically. Secret rotation is a two-step dance: rotate the source, restart the consumer.
The lab uses per-namespace SecretStore, never ClusterSecretStore. Reason below.
Per-tenant secret stores
A ClusterSecretStore lets any ExternalSecret in any namespace pull from it. That sounds convenient and is a security disaster: tenant A can write an ExternalSecret pulling from secret/apps/tenantB/*, and unless Vault’s policy is bound to namespace metadata at the Vault side, it succeeds. Cluster-scoped stores collapse multi-tenancy.
The lab pattern is one SecretStore per tenant namespace, each authenticating with a tenant-specific Vault role. The Vault role’s policy is scoped to that tenant’s KV path. Tenant A’s SecretStore can only read secret/apps/tenantA/*; tenant B’s only secret/apps/tenantB/*. The blast radius of a compromised SA is one tenant’s secrets, not the fleet’s.
The wiring lives at /docs/openshift-platform/secrets-eso/tenant-secretstore-pattern. Three resources per tenant: a ServiceAccount, a Vault role mapping that SA to a policy, and a SecretStore CR. Templated, parameterised, instantiated by the tenant onboarding flow.
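As a rough illustration of what such a template could render, the sketch below builds the per-tenant pieces as plain Python data. It is not the lab's onboarding flow: the policy naming convention, the token_ttl value, the kubernetes auth mount path, and the KV mount name secret are assumptions made for the example.

```python
import json


def render_tenant(tenant: str, namespace: str,
                  vault_addr: str = "https://vault.local:8200") -> dict:
    """Render the per-tenant resources as plain data structures."""
    sa_name = "eso-reader"               # SA name reused from this module's examples
    policy_name = f"apps-{tenant}-read"  # hypothetical naming convention
    # KV-v2 serves reads under <mount>/data/<path>, so the policy names that API path.
    policy_hcl = (
        f'path "secret/data/apps/{tenant}/*" {{\n'
        '  capabilities = ["read"]\n'
        '}\n'
    )
    return {
        "service_account": {
            "apiVersion": "v1",
            "kind": "ServiceAccount",
            "metadata": {"name": sa_name, "namespace": namespace},
        },
        "policy_name": policy_name,
        "policy_hcl": policy_hcl,
        # Parameters for the Vault Kubernetes-auth role named after the tenant.
        "vault_role": {
            "bound_service_account_names": [sa_name],
            "bound_service_account_namespaces": [namespace],
            "token_policies": [policy_name],
            "token_ttl": "10m",  # assumed value
        },
        "secret_store": {
            "apiVersion": "external-secrets.io/v1",
            "kind": "SecretStore",
            "metadata": {"name": "vault-apps", "namespace": namespace},
            "spec": {"provider": {"vault": {
                "server": vault_addr,
                "path": "secret",
                "version": "v2",
                "auth": {"kubernetes": {
                    "mountPath": "kubernetes",  # assumed auth mount
                    "role": tenant,
                    "serviceAccountRef": {"name": sa_name},
                }},
            }}},
        },
    }


rendered = render_tenant("tenantA", "app-tenant-a")
print(rendered["policy_hcl"])
print(json.dumps(rendered["secret_store"], indent=2))
```

Deriving every name from the tenant parameter is the point of the design: tenant A's role can only ever carry tenant A's policy, because nothing else is rendered for it.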
A minimal ExternalSecret
```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: payments-db
  namespace: app-payments
spec:
  refreshInterval: 30m
  secretStoreRef:
    name: vault-apps
    kind: SecretStore
  target:
    name: payments-db
    creationPolicy: Owner
  data:
    - secretKey: DB_USER
      remoteRef: { key: secret/apps/payments/db, property: username }
    - secretKey: DB_PASSWORD
      remoteRef: { key: secret/apps/payments/db, property: password }
```
Three fields carry the load. refreshInterval is how often ESO re-reads the source. secretStoreRef points at the per-namespace SecretStore. data[].remoteRef maps Vault fields to Kubernetes Secret keys. The pod consumes the materialised payments-db Secret like any other.
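The mapping that data[].remoteRef describes can be mimicked in a few lines. This is a simplified model of the materialisation step, not ESO's code, and the Vault field values are invented for the example.

```python
import base64


def materialise(vault_fields: dict, data_spec: list) -> dict:
    """Map Vault properties to Secret keys, base64-encoding values as the API stores them."""
    secret_data = {}
    for entry in data_spec:
        value = vault_fields[entry["remoteRef"]["property"]]
        secret_data[entry["secretKey"]] = base64.b64encode(value.encode("utf-8")).decode("ascii")
    return secret_data


# Hypothetical contents of secret/apps/payments/db.
vault_fields = {"username": "app", "password": "password"}

# The data section of the ExternalSecret, as Python data.
data_spec = [
    {"secretKey": "DB_USER",
     "remoteRef": {"key": "secret/apps/payments/db", "property": "username"}},
    {"secretKey": "DB_PASSWORD",
     "remoteRef": {"key": "secret/apps/payments/db", "property": "password"}},
]

materialised = materialise(vault_fields, data_spec)
print(materialised)
```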
When Vault contains shape-different data — the OBC-to-LokiStack case from the lab’s bridge pattern — you use dataFrom plus a template. The OBC writes AWS_ACCESS_KEY_ID; Loki expects access_key_id. The template field renames keys at materialisation time so the operand consumer sees the shape it wants. See /docs/openshift-platform/secrets-eso/obc-to-operand-bridge.
Rotation patterns
Rotation is the single most-asked-about secrets question and the most-skipped one in production.
Static secrets, manual rotation. The base case: humans rotate the value in Vault on a schedule (every 90 days, the PCI Req 8.3.7 ceiling for stored secrets). ESO refreshes the materialised K8s Secret on its next interval; consumer pods restart to pick up the new value. The lab’s RHACS Central admin rotation is exactly this pattern — see /docs/openshift-platform/operations/routine-tasks/rotate-rhacs-central-admin. The mistake is forgetting the restart; the cached secret in the running pod stays the old value until the process re-reads it, which for most apps is “next deploy.”
Dynamic credentials. Vault’s database engine generates a unique user per pod with a short TTL. The pod authenticates to Vault at startup, gets v-pod42-a8d3e7 / random password, uses it until the lease expires, then either renews or re-requests. There is no long-lived credential anywhere — every credential is bound to a specific consumer with a specific TTL. The cleanest answer to PCI Req 8.3.7 because the rotation period can be hours, not 90 days.
Workload identity. IRSA (AWS), GCP Workload Identity, Azure Workload Identity. The pod authenticates as its IAM role via a federated token; there is no secret in the cluster at all. The cluster has a token signer, the cloud provider verifies the signed token, the pod gets cloud-IAM credentials directly. Strongest pattern when the workload is cloud-native and consuming cloud-managed services. On-prem equivalent: SPIFFE/SPIRE, which is more work to operate but the same idea.
The decision tree: if the workload consumes a cloud service, use workload identity. If the workload consumes a database, use dynamic credentials. If the workload consumes a static external API (Stripe, Twilio), use static + manual rotation. Most clusters end up with all three patterns coexisting; that’s normal.
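The decision tree is small enough to write down as a function. The category strings below are labels invented for this sketch; a real inventory would classify workloads however the platform team prefers.

```python
from enum import Enum


class Pattern(Enum):
    WORKLOAD_IDENTITY = "workload identity"
    DYNAMIC_CREDENTIALS = "dynamic credentials"
    STATIC_MANUAL = "static + manual rotation"


def choose_pattern(consumes: str) -> Pattern:
    """Pick a rotation pattern from what the workload consumes."""
    if consumes == "cloud-service":
        return Pattern.WORKLOAD_IDENTITY
    if consumes == "database":
        return Pattern.DYNAMIC_CREDENTIALS
    if consumes == "external-api":
        return Pattern.STATIC_MANUAL
    raise ValueError(f"unknown dependency type: {consumes!r}")


for dependency in ("cloud-service", "database", "external-api"):
    print(dependency, "->", choose_pattern(dependency).value)
```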
Auth method — Kubernetes ServiceAccount
The ESO-to-Vault auth handshake is what makes the whole pattern work. The steps:
- ESO holds a tenant SA’s token (mounted into its operand pod, scoped to the tenant via projection).
- ESO POSTs {"role": "<tenant>", "jwt": "<sa-token>"} to Vault’s /v1/auth/kubernetes/login endpoint.
- Vault calls TokenReview on the cluster’s API server, asking “is this SA token valid, and which SA does it represent?”.
- Vault makes that call as the cluster’s TokenReviewer SA (configured Vault-side); the API server validates the token and returns the SA identity.
- Vault matches the SA (system:serviceaccount:app-payments:eso-reader) against the bound roles; if matched, returns a Vault token scoped to that role’s policy.
- ESO uses the Vault token to read the requested paths; token TTL is short (minutes), renewed on next call.
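The login step of the handshake can be sketched as follows, assuming the auth method is mounted at the default kubernetes path. The JWT, the role name, and the sample response values are placeholders; the sketch builds the request and reads a reply without calling a real Vault.

```python
import json
import urllib.request


def build_login_request(vault_addr: str, role: str, sa_jwt: str,
                        mount: str = "kubernetes") -> urllib.request.Request:
    """Build the POST that presents the ServiceAccount token to Vault."""
    payload = {"role": role, "jwt": sa_jwt}
    return urllib.request.Request(
        f"{vault_addr.rstrip('/')}/v1/auth/{mount}/login",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_token(login_response: dict) -> tuple:
    """Return (client_token, ttl_seconds) from a successful login response."""
    auth = login_response["auth"]
    return auth["client_token"], auth["lease_duration"]


# Hypothetical successful response; a real one carries more fields.
sample_login = {"auth": {"client_token": "hvs.example", "lease_duration": 600,
                         "policies": ["apps-tenantA-read"]}}

login_req = build_login_request("https://vault.local:8200", "tenantA", "hypothetical.jwt.value")
vault_token, ttl_seconds = extract_token(sample_login)
print(login_req.full_url)
print(vault_token, ttl_seconds)
```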
The wiring is fiddly to set up the first time and stable thereafter. See /docs/openshift-platform/secrets-eso/vault-clustersecretstore for the lab’s exact configuration. The pieces that catch people: the TokenReviewer SA must be bound to the system:auth-delegator ClusterRole through a ClusterRoleBinding; the Vault role’s bound_service_account_names and bound_service_account_namespaces must match the tenant SA exactly; the Vault server must be reachable from the ESO operand pod’s network (the lab’s NetworkPolicy gap was the cause of the recurring incident captured in /docs/openshift-platform/operations/incidents-and-runbooks/eso-egress-to-vault).
Encryption at rest — etcd
Even with Vault and ESO doing all the right things, the materialised Kubernetes Secret sits in etcd. Without etcd encryption at rest, anyone with etcd backup access can dump every Secret.
OpenShift 4.x supports etcd encryption via the APIServer CR: set spec.encryption.type: aescbc or aesgcm and the cluster rolls a key, encrypts every existing Secret, and encrypts new ones on write. The key is stored in a Kubernetes Secret inside openshift-config, which sounds circular but is reasonable — the key rotates separately and the etcd encryption protects against on-disk theft, not against an attacker who already has cluster-admin.
KMS-backed encryption is stronger. The kube-apiserver supports kms-provider plugins that delegate the encryption key to an external KMS (AWS KMS, Google Cloud KMS, Azure Key Vault, or Vault’s Transit engine). The cluster never sees the master key; it sends an encrypt/decrypt request to the KMS for every write/read. Performance overhead is a few milliseconds per Secret operation; the security upgrade is that even a full etcd dump is useless without KMS access.
The lab has etcd encryption enabled (PCI Req 1.10 — see /docs/openshift-platform/compliance/pci-dss-profile-baseline). KMS-backed encryption is a Tier-2 BFSI readiness item — the in-cluster encryption is acceptable for the lab today but graduates to KMS in a production BFSI deployment.
HSM-backed Vault
For BFSI, PCI-DSS Level 1, and FIPS 140-2 Level 3 environments, Vault’s master key must be HSM-protected. The reason is straightforward: Vault encrypts its data with a master key; if the master key is on disk (sealed by another key on disk), a sufficiently-motivated attacker who gets the disk gets everything. An HSM holds the master key in tamper-evident hardware; the attacker who steals the disk gets ciphertext they cannot decrypt without physical access to the HSM.
Patterns: Vault Enterprise + AWS CloudHSM, Vault Enterprise + Thales Luna, Vault Enterprise + Entrust nShield, or Vault Enterprise + Google Cloud HSM. All are FIPS 140-2 Level 3 or higher. The Enterprise license is required because the HSM-seal feature isn’t in Vault OSS.
The lab runs Vault OSS without HSM. The BFSI readiness review at /docs/openshift-platform/foundations/bfsi-readiness-review flags this as a Tier-1 gap — the kind of thing that has to be closed before a production BFSI deployment. For the lab’s current scope (POC, demo, internal tooling), Vault OSS with file-sealed storage is acceptable; for a real BFSI customer, the upgrade to Vault Enterprise + HSM is one of the first compliance items.
The 5 worst patterns — what NOT to do
Five anti-patterns that show up over and over.
Plain-text secret in a ConfigMap. ConfigMaps have looser RBAC defaults than Secrets — view role can list ConfigMaps by default. “I’ll store the API key in a ConfigMap so it’s easier to read” is the start of a breach.
Secret value in an env var name. A pod with env: [{ name: "STRIPE_KEY_sk_live_abc123", value: "1" }] makes the secret visible to anyone who can kubectl describe pod. The value field can be sourced from a Secret; the name field cannot, and describe always shows it. Sounds dumb, happens routinely.
Static API key shared across teams. “Six services use the same Stripe key.” Rotation requires coordinating six teams, four of which won’t respond in any reasonable timeframe; the key is effectively unrotatable. Either issue per-service keys at the provider or wrap a shared key in a per-service proxy that lets you rotate the upstream without touching consumers.
Vault root token committed to git. This happens at least once per organisation that adopts Vault. The root token is for emergencies; daily operations use scoped tokens. Generate the root token, write it down on paper for the break-glass scenario, revoke it from active use, and never let it touch a CI runner or a developer laptop.
Service account auto-mount + over-broad RBAC. The default pod spec mounts the namespace’s default SA token. The default SA has whatever RBAC the namespace’s RoleBindings give it. If anyone bound the namespace’s default SA to cluster-admin (it happens — for debugging, for a quick fix, for a misread tutorial), every pod in that namespace is cluster-admin. Set automountServiceAccountToken: false on every pod that doesn’t need the token; create a named SA with only the RBAC the workload actually needs.
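A quick way to find offenders is to scan pod specs for the two conditions named above. The sketch works on plain dictionaries with invented pod names; it checks only the pod-level field, so a ServiceAccount that disables auto-mount itself would still be flagged.

```python
def audit_pod_spec(name: str, spec: dict) -> list:
    """Return human-readable findings for one pod spec (a plain dict)."""
    findings = []
    # Anything other than an explicit False at pod level is flagged.
    if spec.get("automountServiceAccountToken") is not False:
        findings.append(f"{name}: SA token auto-mount is not disabled")
    # An absent serviceAccountName means the namespace's default SA is used.
    if spec.get("serviceAccountName", "default") == "default":
        findings.append(f"{name}: runs as the default ServiceAccount")
    return findings


# Hypothetical pod specs.
pods = {
    "legacy-api": {"containers": [{"name": "api"}]},
    "payments": {
        "serviceAccountName": "payments-app",
        "automountServiceAccountToken": False,
        "containers": [{"name": "app"}],
    },
}

for pod_name, pod_spec in pods.items():
    for finding in audit_pod_spec(pod_name, pod_spec):
        print(finding)
```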
Lab posture
The lab’s current secrets stack:
- Vault OSS on 30.30.30.20-23: one seal-helper at .20, three-node Raft cluster at .21-.23. File-sealed (no HSM). Tokens at known custody paths (see the reference_vault_oss_topology memory).
- ESO 1.1.0 (Red Hat operator), external-secrets namespace. Per-tenant SecretStore, never ClusterSecretStore. Kubernetes auth method.
- Egress NetworkPolicy: ESO’s default-deny is overridden by an allow-egress NP targeting 30.30.30.20/30:8200. See /docs/openshift-platform/operations/incidents-and-runbooks/eso-egress-to-vault.
- etcd encryption at rest: enabled (PCI Req 1.10). Not KMS-backed yet.
- HSM: not deployed. Tier-1 BFSI gap.
- Rotation: manual + ESO refresh for static secrets; dynamic Postgres credentials in the works for two tenant namespaces (POC); workload identity not yet (no cloud IAM in scope).
The shape is good for the lab’s current threat model and explicitly insufficient for a regulated BFSI customer. The gap list is documented; closing it is a known work-plan.
Try this
- Install ESO + Vault on a kind cluster. Vault in dev mode, ESO via Helm. Create a SecretStore with Token auth (a dev-mode trick; production uses the K8s auth method).
- Write an ExternalSecret that materialises a db-creds Secret from Vault path secret/myapp/db. Verify the Secret appears with the right keys.
- Configure Vault’s database engine to generate short-lived Postgres credentials. Create a role with TTL=2m; request credentials; observe Vault create a Postgres user; wait for TTL expiry; observe the user vanish from Postgres.
- Audit Vault for any secret paths read by ServiceAccounts that no longer exist in the cluster. The audit log is the truth; reconcile orphaned paths and remove them.
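For the last exercise, one possible starting point is a script that groups audit-log reads by ServiceAccount and compares them with the accounts that still exist. It assumes JSON-lines audit entries that expose auth.metadata.service_account_name and service_account_namespace in clear text alongside request.path; check that against your own audit device before relying on it. The sample lines and the list of live accounts are invented.

```python
import json
from collections import defaultdict


def paths_by_service_account(audit_lines) -> dict:
    """Group requested paths by (namespace, service account name)."""
    seen = defaultdict(set)
    for line in audit_lines:
        entry = json.loads(line)
        if entry.get("type") != "request":
            continue  # skip response entries so each operation is counted once
        meta = (entry.get("auth") or {}).get("metadata") or {}
        name = meta.get("service_account_name")
        namespace = meta.get("service_account_namespace")
        path = (entry.get("request") or {}).get("path")
        if name and namespace and path:
            seen[(namespace, name)].add(path)
    return seen


def orphaned(seen: dict, live_service_accounts: set) -> dict:
    """Keep only the readers that no longer exist in the cluster."""
    return {sa: sorted(paths) for sa, paths in seen.items() if sa not in live_service_accounts}


def sample_line(namespace: str, name: str, path: str) -> str:
    """Build a hypothetical audit entry for the demo."""
    return json.dumps({
        "type": "request",
        "auth": {"metadata": {"service_account_name": name,
                              "service_account_namespace": namespace}},
        "request": {"path": path},
    })


audit_lines = [
    sample_line("app-payments", "eso-reader", "secret/data/apps/payments/db"),
    sample_line("app-legacy", "eso-reader", "secret/data/apps/legacy/db"),
]
live = {("app-payments", "eso-reader")}  # e.g. gathered from the cluster beforehand

orphans = orphaned(paths_by_service_account(audit_lines), live)
print(orphans)
```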
Common failure modes
ExternalSecret stuck in SecretSyncError. The most common cause is that the SecretStore can’t reach Vault. Check the ESO operand’s logs; usually it’s an over-strict NetworkPolicy, a wrong Vault URL, or DNS resolution failing inside the operand pod. The lab pattern is the explicit egress NP cross-linked above.
Vault auth permission denied. The Vault role’s bound_service_account_names or bound_service_account_namespaces doesn’t match the SA presenting the token. Run vault read auth/kubernetes/role/<role> and confirm the bindings; common mistake is default SA presented when the role binds a named SA.
Materialised Secret has wrong key names. The Vault path has keys A, B, C; the consumer wants key_A, key_B, key_C. Add a template block to the ExternalSecret that maps fields at materialisation time. The OBC-to-LokiStack bridge in the lab is the canonical example.
Sealed Secrets key compromised. Every secret encrypted with the compromised key is now suspect — you must rotate every Sealed Secret on the cluster. The work is mechanical (re-seal every CR with the new public key, commit, sync) but tedious; if your Sealed Secrets count is in the hundreds, expect a multi-day rotation. The hardening is: rotate the Sealed Secrets keypair on schedule (not just on compromise) so the muscle memory is there when you need it.
Secret rotated in Vault but app still sees old value. ESO refreshed the K8s Secret; the pod’s mounted volume updated; the application has the old value cached in memory. Two fixes: install the Reloader operator (auto-restarts pods when their consumed Secret changes), or design the application to re-read its secret on a signal (HUP, periodic check). The Reloader path is cheaper and works for most workloads.
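The second fix, re-reading on a signal, might look like the sketch below. The class and function names are invented for illustration, and a temporary directory stands in for the Secret volume mount; the demo calls reload() directly instead of sending a real signal.

```python
import os
import signal
import tempfile


class ReloadableSecret:
    """Holds a secret read from a mounted file and can re-read it on demand."""

    def __init__(self, path: str):
        self.path = path
        self.value = self._read()

    def _read(self) -> str:
        with open(self.path, encoding="utf-8") as fh:
            return fh.read().strip()

    def reload(self, signum=None, frame=None) -> None:
        # The signature also fits signal handlers, which receive (signum, frame).
        self.value = self._read()


def install_sighup_handler(secret: ReloadableSecret) -> None:
    # Must be called from the main thread; SIGHUP does not exist on Windows.
    signal.signal(signal.SIGHUP, secret.reload)


with tempfile.TemporaryDirectory() as mount:
    secret_path = os.path.join(mount, "DB_PASSWORD")
    with open(secret_path, "w", encoding="utf-8") as fh:
        fh.write("old-value")
    secret = ReloadableSecret(secret_path)
    before = secret.value

    # Simulate the mounted file changing after a rotation.
    with open(secret_path, "w", encoding="utf-8") as fh:
        fh.write("new-value")
    secret.reload()
    after = secret.value

print(before, "->", after)
```

A real service would call install_sighup_handler(secret) once at startup and would also need to rebuild whatever depends on the value, such as a database connection pool.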
Vault unsealed but Raft cluster split-brain. Three Vault nodes, quorum of two. One node drifts from quorum; the remaining two still elect a leader and serve reads and writes, but one more failure leaves no quorum and writes fail; after a network blip, a stale node can still believe it is the leader. The fix is documented in Vault’s operations guide: step the failed node down, re-join, re-seal. The harder fix is monitoring: Alertmanager rule on vault_core_active over the three-node cluster, fire if quorum drops below 2 for more than 60 seconds.
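The threshold of 2 in that alert comes from ordinary Raft majority arithmetic, sketched here. This is the general majority rule, not Vault-specific code.

```python
def quorum(cluster_size: int) -> int:
    """Smallest number of voters that forms a majority."""
    return cluster_size // 2 + 1


def can_commit_writes(healthy_voters: int, cluster_size: int) -> bool:
    """A Raft cluster can commit writes only while a majority is healthy."""
    return healthy_voters >= quorum(cluster_size)


for healthy in (3, 2, 1):
    print(f"{healthy}/3 healthy: writes possible = {can_commit_writes(healthy, 3)}")
```

A three-node cluster therefore tolerates exactly one failed node, which is why an alert that waits for two failures fires too late.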
References
- HashiCorp Vault: developer.hashicorp.com/vault
- Vault Kubernetes auth: developer.hashicorp.com/vault/docs/auth/kubernetes
- Vault database secrets engine: developer.hashicorp.com/vault/docs/secrets/databases
- External Secrets Operator: external-secrets.io
- ESO Vault provider: external-secrets.io/latest/provider/hashicorp-vault
- Sealed Secrets: github.com/bitnami-labs/sealed-secrets
- Secrets Store CSI Driver: secrets-store-csi-driver.sigs.k8s.io
- Reloader: github.com/stakater/Reloader
- Kubernetes etcd encryption: kubernetes.io/docs/tasks/administer-cluster/encrypt-data
- OpenShift etcd encryption: docs.openshift.com — encrypting etcd data
- SPIFFE / SPIRE (on-prem workload identity): spiffe.io
Next: Module 09 — Supply-chain signing — cosign, sigstore, attestations, and the policy controllers that enforce them.