ESO operand egress blocked to Vault
Red Hat ESO 1.1.0 ships default-deny NetworkPolicies in the external-secrets namespace; the operand silently hangs reconciling against the VM Vault until an allow-egress policy is added.
This is the silent-failure incident every operator hits the first time they install Red Hat External Secrets Operator (ESO) 1.1.0 against the VM-hosted Vault. The operator pod is healthy, the ClusterSecretStore looks well-formed, and the controller-manager can reach Vault — but the operand reconciler runs in a different namespace, the namespace has default-deny NetworkPolicies, and the Vault login times out.
The fix is a single NetworkPolicy at sync-wave 20.
Symptom
- ESO operator install completes cleanly: Subscription AtLatestKnown, CSV Succeeded, controller-manager pod Running with no restarts.
- ClusterSecretStore reports Ready=False with reason InvalidProviderConfig:

      K=/home/ze/.kube/configs/<cluster>.kubeconfig
      oc --kubeconfig "$K" get clustersecretstore -o yaml \
        | grep -A5 conditions
      # status:
      #   conditions:
      #   - lastTransitionTime: "..."
      #     message: 'unable to create client: ...'
      #     reason: InvalidProviderConfig
      #     status: "False"
      #     type: Ready

- The operand log shows a context deadline on Vault Kubernetes auth:

      oc --kubeconfig "$K" -n external-secrets logs deploy/external-secrets \
        | grep -i vault
      # ERROR ... unable to log in to auth method:
      #   unable to log in with Kubernetes auth:
      #   context deadline exceeded

- A direct probe from inside the operand pod times out:

      oc --kubeconfig "$K" -n external-secrets exec deploy/external-secrets \
        -- curl -skf https://vault.sub.comptech-lab.com:8200/v1/sys/health
      # (hangs until the connection attempt times out; Vault never answers)

- The operator-side controller-manager (in a different namespace) can reach Vault. That is the confusing part: the operator works, the operand does not.
The two namespaces:
| Namespace | What runs there |
|---|---|
| external-secrets-operator | The CSV’s controller-manager pod (operator) |
| external-secrets | The reconciler pods that actually evaluate ClusterSecretStores and ExternalSecrets (operand) |
On OpenShift, a pod that no NetworkPolicy selects has unrestricted egress. ESO 1.1.0 ships restrictive policies in the operand namespace and none in the operator namespace, which is why the operator looks fine.
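The asymmetry between the two namespaces can be made visible by counting the NetworkPolicies in each. This is a minimal sketch, assuming `$K` points at the cluster kubeconfig as in the Symptom commands; the helper names `list_netpols` and `summarize_netpols` are illustrative and not part of ESO or `oc`.

```shell
# List NetworkPolicy resource names in the given namespace, one per line.
# Assumes $K holds the kubeconfig path.
list_netpols() {
  oc --kubeconfig "$K" -n "$1" get networkpolicy -o name
}

# Print one summary line per namespace. The two namespace names are the
# operator and operand namespaces described on this page.
summarize_netpols() {
  for ns in external-secrets-operator external-secrets; do
    # grep -c . counts non-empty lines; "|| true" keeps a zero count
    # from being treated as a failure under "set -e".
    count=$(list_netpols "$ns" | grep -c . || true)
    echo "$ns: $count NetworkPolicies"
  done
}
```

Calling `summarize_netpols` on an affected cluster should, if the description above holds, report the shipped policies in external-secrets and none in external-secrets-operator.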
Root cause
Red Hat ESO 1.1.0 (openshift-external-secrets-operator.v1.1.0) ships five NetworkPolicies in the external-secrets namespace by default: a default-deny baseline plus four narrow allows.
| NetworkPolicy | What it allows |
|---|---|
| deny-all-traffic | Catch-all default-deny (the baseline) |
| allow-api-server-egress-for-main-controller | TCP/6443 to the kube-apiserver only |
| allow-api-server-egress-for-cert-controller | TCP/6443 to the kube-apiserver only |
| allow-api-server-egress-for-webhook | TCP/6443 to the kube-apiserver only |
| allow-to-dns | UDP/53 to in-cluster DNS only |
Egress to anything else is implicitly denied. The VM Vault at vault.sub.comptech-lab.com:8200 is outside the cluster, on the lab /24 network — none of the shipped policies permit it.
The reconciler in the external-secrets namespace runs vault.NewClient(...) and calls auth.Login(ctx, ...). The TCP SYN to the Vault VM is dropped by the namespace’s default-deny policy, so the Go client never gets a response. It waits until the context deadline elapses (default 30s or 60s depending on call site), and the error surfaces as context deadline exceeded.
The operator-side controller in the external-secrets-operator namespace is not subject to these NetworkPolicies, so a direct probe from there succeeds. That sends operators down the wrong diagnostic path of “Vault auth misconfiguration” or “wrong role binding”.
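A dropped connection and an auth misconfiguration can be told apart by the curl exit code of the probe. A timeout points at the network, while any HTTP answer (even an error status) shows that Vault was reachable. The sketch below maps curl's documented exit codes to a likely cause. The interpretations are a heuristic, the function names are illustrative, and `--max-time 10` is an assumed bound added so the probe cannot hang indefinitely.

```shell
# Map a curl exit code to a likely cause for the Vault probe.
# The codes are curl's documented exit codes; the wording is a heuristic.
classify_curl_exit() {
  case "$1" in
    0)     echo "reachable: Vault answered with a success status" ;;
    6)     echo "dns: could not resolve the Vault hostname" ;;
    7)     echo "connect: connection refused or host unreachable" ;;
    22)    echo "reachable: Vault answered with an HTTP error status" ;;
    28)    echo "timeout: packets likely dropped (check NetworkPolicy egress)" ;;
    35|60) echo "tls: handshake or certificate problem" ;;
    *)     echo "other: curl exit $1" ;;
  esac
}

# Run the probe inside the operand pod with a bounded wait, then classify.
# Assumes $K holds the kubeconfig path.
probe_vault_from_operand() {
  rc=0
  oc --kubeconfig "$K" -n external-secrets exec deploy/external-secrets \
    -- curl -skf --max-time 10 -o /dev/null \
    https://vault.sub.comptech-lab.com:8200/v1/sys/health || rc=$?
  classify_curl_exit "$rc"
}
```

Because the probe uses `-f`, a sealed or standby Vault that answers with a non-2xx status yields exit 22, which still rules out the NetworkPolicy as the cause.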
Fix
Add an allow-egress NetworkPolicy in the operand namespace, targeting the Vault VM(s) on TCP/8200, scoped to pods labelled app.kubernetes.io/name=external-secrets. The fix is already in GitOps at:
- clusters/hub-dc-v6/secrets/eso/networkpolicy-vault-egress.yaml
- clusters/spoke-dc-v6/secrets/eso/networkpolicy-vault-egress.yaml

The policy is annotated with sync-wave: "20" so that it lands after the operator install (waves 10-11) and before any ClusterSecretStore reconcile.
The shape:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-egress-to-vault-vm
      namespace: external-secrets
      annotations:
        argocd.argoproj.io/sync-wave: "20"
    spec:
      podSelector:
        matchLabels:
          app.kubernetes.io/name: external-secrets
      policyTypes:
        - Egress
      egress:
        - to:
            - ipBlock:
                # Vault VM range on the lab /24 (the specific /30 is in
                # opp-full-plat/connection-details/vault-app-secrets.md)
                cidr: <vault-vm-cidr>/30
          ports:
            - protocol: TCP
              port: 8200

(The exact cidr value is internal-only; it lives in connection-details/vault-app-secrets.md. This page deliberately does not publish the raw IP range.)
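A /30 ipBlock covers exactly four addresses, so it is worth confirming that every Vault VM address falls inside it before relying on the policy. The following is a minimal sketch of that membership check in plain shell arithmetic. The helper names are illustrative, and the addresses in the usage note come from the RFC 5737 documentation range, not from the lab's real Vault range.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
# The function body runs in a subshell so the IFS change does not leak.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Succeed (exit 0) if address $1 lies inside CIDR $2, e.g. a.b.c.d/30.
ip_in_cidr() (
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")   # part before the slash
  bits=${2#*/}                 # prefix length after the slash
  # Netmask: the top $bits bits set, the remaining bits clear.
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
)
```

For example, `ip_in_cidr 192.0.2.5 192.0.2.4/30` succeeds, while `ip_in_cidr 192.0.2.8 192.0.2.4/30` fails because .8 is the first address of the next /30.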
After the MR merges and Argo applies the NetworkPolicy, restart the operand deployment to refresh the connection, because the ESO controller does not re-trigger reconcile on NetworkPolicy changes:

    oc --kubeconfig "$K" -n external-secrets rollout restart deployment external-secrets

Validation:

    # Operand can now reach Vault:
    oc --kubeconfig "$K" -n external-secrets exec deploy/external-secrets \
      -- curl -skf https://vault.sub.comptech-lab.com:8200/v1/sys/health

    # ClusterSecretStore is Ready:
    oc --kubeconfig "$K" get clustersecretstore \
      -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
    # every entry should show True

    # Per-secret ExternalSecrets reconcile:
    oc --kubeconfig "$K" -n <app-ns> get externalsecret
    # All entries SecretSynced=True

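The "every entry should show True" check can be automated by filtering the jsonpath output for anything that is not Ready. This sketch assumes the tab-separated name and status format produced by the validation command above. The helper names are illustrative, and a store with no Ready condition at all is treated as not ready.

```shell
# Read "name<TAB>status" lines on stdin and print the names whose
# status is anything other than "True" (including an empty status).
not_ready_stores() {
  awk -F '\t' 'NF && $2 != "True" { print $1 }'
}

# Query all ClusterSecretStores and list the ones that are not Ready.
# Assumes $K holds the kubeconfig path.
check_stores() {
  oc --kubeconfig "$K" get clustersecretstore \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
    | not_ready_stores
}
```

An empty result from `check_stores` means every store reported Ready=True; any printed name is a store that still needs attention.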
Prevention
Three layers of prevention have been adopted in the lab:
- The NetworkPolicy lives in GitOps at sync-wave 20 alongside the ESO install. Any future cluster onboarding that follows the same kustomize tree inherits the policy automatically; there is no per-cluster manual step. Issue #229 captures this as a routine cluster-onboarding contract item.
- Periodic check in the every-session warm-up. A one-line probe confirms ESO can still reach Vault:

      oc --kubeconfig "$K" -n external-secrets exec deploy/external-secrets \
        -- curl -skf https://vault.sub.comptech-lab.com:8200/v1/sys/health \
        > /dev/null && echo "ESO -> Vault OK" || echo "ESO -> Vault FAILED"

  Add to the day-1 handoff cluster-health snapshot.
- NetworkPolicy widening on Vault endpoint change. If ESO is later expanded to read from a new Vault endpoint, a different Vault VM, or a different port, the NetworkPolicy’s ipBlock must be widened in the same MR. The pattern is: one MR, one change to ESO config and one change to the egress policy. Never land one without the other.
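The "one MR, both changes" rule could be backed by a small check over the list of files an MR touches. The sketch below is only a reminder-style guard under stated assumptions: the `clustersecretstore*.yaml` file naming is hypothetical, the function name is illustrative, and only the networkpolicy-vault-egress.yaml path comes from this page. It also cannot tell whether a store change actually altered the Vault endpoint, so a failure means "review the egress policy", not "the MR is wrong".

```shell
# Read changed file paths on stdin, one per line. Fail (exit 1) when a
# store definition changed but the Vault egress policy did not.
# The clustersecretstore file pattern is an assumed naming convention.
check_paired_change() {
  awk '
    /secrets\/eso\/.*clustersecretstore.*\.yaml$/   { store = 1 }
    /secrets\/eso\/networkpolicy-vault-egress\.yaml$/ { policy = 1 }
    END { if (store && !policy) exit 1 }
  '
}
```

A possible use in CI would be `git diff --name-only origin/main...HEAD | check_paired_change || echo "ESO store changed: review the Vault egress policy"`, where the branch name is likewise an assumption.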
A note on the alternative, labelling the operand namespace and using a namespace-level egress allow:
- Labelling pattern (egress.policy/vault: allowed) is cleaner architecturally but requires the operator’s CSV to set the label on every namespace it creates. ESO 1.1.0 does not do this.
- An explicit per-namespace NetworkPolicy is the lab’s current shape because the operand namespace set is small (one: external-secrets) and the policy is identical across clusters.
If the lab adds a second operand namespace consuming Vault (e.g., a tenant-scoped External Secrets instance), revisit the design before scattering identical policies.
References
- opp-full-plat/connection-details/vault-app-secrets.md
- clusters/hub-dc-v6/secrets/eso/networkpolicy-vault-egress.yaml
- clusters/spoke-dc-v6/secrets/eso/networkpolicy-vault-egress.yaml
- ADR: 0019-nexus-only-image-supply-chain (image policy; ESO image comes via the same supply chain)
- Red Hat ESO 1.1.0 release notes (ships default-deny NetworkPolicies)