MCO stuck-node recovery (desiredConfig annotation patch)
When a bad MachineConfig has been reverted but MCO's max-unavailable safety guard refuses to roll an already-unavailable node, a one-line desiredConfig annotation patch per stuck node unsticks the drain -> apply -> reboot cycle.
This page is the procedural companion to the IPv6 disable / OVN-K incident (#135). It describes how to unstick a node that is Ready=False after a bad MachineConfig rollout, when the source MachineConfig has already been reverted in GitOps but MCO refuses to roll the affected node forward to the new rendered config. The pattern recurs anywhere an MC rollout leaves a node unavailable before MCO’s max-unavailable count permits the next reconcile.
The fix is a one-line annotation patch per stuck node. Drain -> apply -> reboot then proceeds normally.
Symptom
All of the following are true together:
- A MachineConfig change was rolled out and left one or more nodes Ready=False (or Ready=True with persistent application-layer failure caused by the config change, such as ovnkube-node CrashLoopBackOff).
- The triggering MachineConfig has been reverted in GitOps and the revert has merged.
- The owning MachineConfigPool’s .spec.configuration.name reflects the new (post-revert) rendered config name.
- The affected nodes’ machineconfiguration.openshift.io/desiredConfig annotation still points at the OLD (bad) rendered config name, and no oc get mcp event indicates MCO is trying to roll them.
- The MachineConfigPool is NOT Degraded, and the render controller shows no error in the MCO controller log.
If the MachineConfigPool is Degraded, fix the upstream cause first (render-controller failure, conflicting MachineConfigs, unsatisfiable selector). This page only addresses the safety-guard-refuses-to-roll case.
Diagnostic that pinpoints the cause:
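K=/home/ze/.kube/configs/<cluster>.kubeconfig
# Each column spec is single-quoted so the shell passes the JSONPath filter
# and the backslash-escaped dots through to oc unchanged. Keep the
# continuation lines unindented so the pieces join into one argument.
oc --kubeconfig "$K" get nodes -o custom-columns=\
'NAME:.metadata.name',\
'ROLE:.metadata.labels.node-role\.kubernetes\.io/worker',\
'READY:.status.conditions[?(@.type=="Ready")].status',\
'CURRENT:.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig',\
'DESIRED:.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig'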
A stuck node shows READY=False (or unhealthy app-layer symptoms) with CURRENT == DESIRED pointing at the OLD rendered config name.
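The comparison described above can be sketched as a small filter over the diagnostic table. This is a minimal sketch, not part of the original runbook: find_stuck_nodes is a hypothetical helper name, it assumes stdin holds only the nodes of one pool, and it reads CURRENT and DESIRED as the last two columns so that an empty ROLE cell does not shift the fields.

```shell
# Hypothetical helper: list nodes whose CURRENT and DESIRED agree with each
# other but differ from the pool's target rendered config.
# $1 = the pool's .spec.configuration.name; stdin = the diagnostic table.
find_stuck_nodes() {
  awk -v target="$1" '
    NR == 1 { next }                       # skip the header row
    # CURRENT is the second-to-last column, DESIRED is the last column.
    $(NF-1) == $NF && $NF != target { print $1, "ready=" $(NF-2) }
  '
}
```

One way to use it, assuming the diagnostic command is restricted to a single pool (for example with -l node-role.kubernetes.io/worker), is to pipe that command's output into find_stuck_nodes "$GOOD", where GOOD holds the pool's current rendered-config name. Nodes that are mid-roll (CURRENT differs from DESIRED) are deliberately not reported.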
Root cause
MCO has a max-unavailable safety guard. When a node is already Ready=False, the controller declines to update its desiredConfig because the per-pool unavailable count is already at the guard’s ceiling — rolling the unhealthy node could push it over. The MachineConfigPool .spec.configuration.name moves forward to the new (good) rendered config, but the unhealthy node is left pointing at the old one.
Result: the node sits permanently with the OLD currentConfig == desiredConfig, while every other node in the pool happily rolls to the new config. The pool reports the right desired state; the stuck node never gets the message.
The one-line annotation patch overrides the safety guard for the stuck node. MCD on the node reads the new desiredConfig, detects the disruption type (kargs:true, files:true), and runs the drain -> apply -> reboot cycle. Once the node returns Ready, MCO’s max-unavailable count drops back below the guard and normal reconciliation resumes.
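To see how close a pool is to the guard described above, the pool's limit and its current unavailable count can be read side by side. The following is a rough sketch under stated assumptions: pool_headroom is a hypothetical helper, K is the kubeconfig variable set in the diagnostic, an unset .spec.maxUnavailable is treated as the default of 1, and percentage values are reported without computing headroom.

```shell
# Hypothetical helper: print max-unavailable headroom for one pool.
# Assumes $K points at the cluster kubeconfig.
pool_headroom() (
  pool=$1
  max=$(oc --kubeconfig "$K" get mcp "$pool" \
    -o jsonpath='{.spec.maxUnavailable}')
  unavailable=$(oc --kubeconfig "$K" get mcp "$pool" \
    -o jsonpath='{.status.unavailableMachineCount}')
  max=${max:-1}                  # unset field: assume the default of 1
  unavailable=${unavailable:-0}
  case $max in
    *%) headroom=unknown ;;      # percentage limits are not resolved here
    *)  headroom=$((max - unavailable)) ;;
  esac
  echo "pool=$pool maxUnavailable=$max unavailable=$unavailable headroom=$headroom"
)
```

A headroom of 0 while the stuck node is the only unavailable machine is consistent with the situation this page addresses.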
Fix
Patch one node at a time. Concurrent patches risk pushing the pool over its max-unavailable count and triggering a second class of failure.
Pre-action checklist
- Identify the new (good) rendered-config name for the owning pool:

      oc --kubeconfig "$K" get mcp <pool> \
        -o jsonpath='{.spec.configuration.name}{"\n"}'

  This is the value the stuck node’s desiredConfig annotation needs to be patched to.
- Confirm the pool is NOT Degraded:

      oc --kubeconfig "$K" get mcp <pool> \
        -o jsonpath='{.status.conditions[*].type}{"\n"}{.status.conditions[*].status}{"\n"}'

  Degraded=False is required. If Degraded=True, abort and investigate the render-controller log:

      oc --kubeconfig "$K" -n openshift-machine-config-operator \
        logs deploy/machine-config-controller --tail=200

- Confirm MCD is reachable on the stuck node:

      oc --kubeconfig "$K" -n openshift-machine-config-operator get pods \
        -o wide | grep <stuck-node>

  The machine-config-daemon-<...> pod for the stuck node must be Running. If it is not, this page does not apply; see Console / SSH fallback below.
- Open or update the incident issue with stuck-node names, old vs new rendered-config names, and pool Degraded status. This becomes the audit record.
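The first three checklist items could be bundled into one gate that must pass before any patch is applied. This is only a sketch: preflight is a hypothetical helper name, K is assumed to hold the kubeconfig path from the diagnostic, and the pod lookup reuses the k8s-app=machine-config-daemon label that appears in the action steps.

```shell
# Hypothetical helper: run the pre-action checks for one pool and one node.
# Prints the rendered-config name to patch to, or an ABORT reason.
preflight() (
  pool=$1 node=$2
  ns=openshift-machine-config-operator
  good=$(oc --kubeconfig "$K" get mcp "$pool" \
    -o jsonpath='{.spec.configuration.name}')
  degraded=$(oc --kubeconfig "$K" get mcp "$pool" \
    -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}')
  mcd_phase=$(oc --kubeconfig "$K" -n "$ns" get pods \
    -l k8s-app=machine-config-daemon \
    --field-selector "spec.nodeName=$node" \
    -o jsonpath='{.items[*].status.phase}')
  [ -n "$good" ] || { echo "ABORT: pool $pool reports no rendered config"; return 1; }
  [ "$degraded" = "False" ] || { echo "ABORT: pool $pool Degraded=$degraded"; return 1; }
  [ "$mcd_phase" = "Running" ] || { echo "ABORT: MCD pod on $node is '$mcd_phase'"; return 1; }
  echo "OK: patch $node to $good"
)
```

The sketch checks only the pod phase; a pod that is Running but crash-looping between restarts would still need the manual look described in the checklist.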
Action steps
For each stuck node, in series:
1. Capture starting state for audit:

       # Store the directory name once; a glob in an assignment is not expanded.
       D=/tmp/mco-recovery-$(date -u +%Y%m%dT%H%M%SZ)
       mkdir -p "$D"
       oc --kubeconfig "$K" get node <stuck-node> -o yaml > "$D/before-<stuck-node>.yaml"
       oc --kubeconfig "$K" get mcp <pool> -o yaml > "$D/before-mcp-<pool>.yaml"

2. Apply the annotation patch:

       TS=$(date -u +%Y-%m-%dT%H:%M:%SZ)
       GOOD=$(oc --kubeconfig "$K" get mcp <pool> \
         -o jsonpath='{.spec.configuration.name}')
       echo "$TS actor=$USER node=<stuck-node> desiredConfig=$GOOD" \
         | tee -a "$D/commands.log"
       oc --kubeconfig "$K" annotate node <stuck-node> \
         machineconfiguration.openshift.io/desiredConfig="$GOOD" \
         --overwrite

3. Watch the MCD pod log. It should log node desiredConfig changed and begin the disruption cycle within seconds:

       # Only one -o flag: the jsonpath output is what gets captured.
       MCD=$(oc --kubeconfig "$K" -n openshift-machine-config-operator \
         get pods \
         --field-selector spec.nodeName=<stuck-node> \
         -l k8s-app=machine-config-daemon \
         -o jsonpath='{.items[0].metadata.name}')
       oc --kubeconfig "$K" -n openshift-machine-config-operator \
         logs -f "$MCD"

   Expected sequence:

       node <node> changed: desiredConfig -> <new-rendered-config>
       Disruption type: <kargs|files|both>
       Draining node <node> ...
       Applying config <new-rendered-config>
       Rebooting node <node>

   After the reboot, MCD confirms the new currentConfig matches the desiredConfig and the node returns Ready.
4. Capture post-recovery state:

       oc --kubeconfig "$K" get node <stuck-node> -o yaml > "$D/after-<stuck-node>.yaml"

5. Repeat steps 1-4 for each remaining stuck node. Patch one, wait for it to return Ready, patch the next.
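The "wait for it to return Ready" part of the loop above can be made explicit with a polling helper. The following is a sketch rather than the runbook's own tooling: node_converged and wait_for_node are hypothetical names, K is assumed to be the kubeconfig variable from the diagnostic, and the retry count and interval defaults are arbitrary.

```shell
# Hypothetical helper: succeed when the node is Ready and its currentConfig
# annotation equals the expected rendered-config name.
node_converged() (
  node=$1 good=$2
  ready=$(oc --kubeconfig "$K" get node "$node" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
  current=$(oc --kubeconfig "$K" get node "$node" \
    -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}')
  [ "$ready" = "True" ] && [ "$current" = "$good" ]
)

# Hypothetical helper: poll node_converged.
# $3 = number of checks (default 60), $4 = seconds between checks (default 30).
wait_for_node() (
  node=$1 good=$2 tries=${3:-60} interval=${4:-30}
  i=0
  while [ "$i" -lt "$tries" ]; do
    if node_converged "$node" "$good"; then
      echo "$node converged on $good"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "$node did not converge after $tries checks" >&2
  return 1
)
```

Calling wait_for_node <stuck-node> "$GOOD" between patches keeps the procedure serial: the next node is only touched once the previous one is Ready on the new config. Failed oc calls during the reboot simply count as "not converged yet".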
Validation
Recovery is complete when ALL of these are true:
- oc get nodes shows every previously-stuck node Ready=True.
- currentConfig and desiredConfig on each recovered node both point at the new (good) rendered-config name.
- oc get mcp <pool> shows UPDATED=True, UPDATING=False, DEGRADED=False, readyMachineCount == machineCount.
- Application-layer health that depended on the rollout is restored: for OVN-related rollouts, oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-node shows all containers Ready with no recent restarts.
- Workloads that timed out during the stuck period have recovered (check the relevant pod logs for stale failure modes that need a manual restart).
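The pool-level criterion above can be checked mechanically. This is a minimal sketch under assumptions: mcp_condition and pool_settled are hypothetical helper names, K is the kubeconfig variable from the diagnostic, and only the pool conditions and machine counts are covered; the application-layer checks remain manual.

```shell
# Hypothetical helper: print the status of one MachineConfigPool condition.
# $1 = pool, $2 = condition type (Updated, Updating, Degraded).
mcp_condition() {
  oc --kubeconfig "$K" get mcp "$1" \
    -o jsonpath="{.status.conditions[?(@.type==\"$2\")].status}"
}

# Hypothetical helper: succeed only when the pool matches the validation list.
pool_settled() (
  pool=$1
  counts=$(oc --kubeconfig "$K" get mcp "$pool" \
    -o jsonpath='{.status.machineCount} {.status.readyMachineCount}')
  set -- $counts                 # $1 = machineCount, $2 = readyMachineCount
  if [ $# -ne 2 ] || [ "$1" != "$2" ]; then
    echo "$pool: machine counts '$counts' not settled"; return 1
  fi
  [ "$(mcp_condition "$pool" Updated)" = "True" ] || { echo "$pool: not Updated"; return 1; }
  [ "$(mcp_condition "$pool" Updating)" = "False" ] || { echo "$pool: still Updating"; return 1; }
  [ "$(mcp_condition "$pool" Degraded)" = "False" ] || { echo "$pool: Degraded"; return 1; }
  echo "$pool: settled ($2/$1 machines ready)"
)
```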
Console / SSH fallback
If MCD itself is unreachable on the stuck node (CrashLoopBackOff, unreachable kubelet, network partition), the annotation patch will not be picked up and a console-level recovery is required:
- Console into the affected node (out-of-band management or KVM).
- Drop to a root shell on RHEL CoreOS.
- Edit /boot/loader/entries/ostree-*-rhcos.conf to revert the kargs change manually (for kargs: disruption type) OR restore the previous config file from /etc/machine-config-daemon/orig/ (for files: disruption type).
- Reboot the node manually.
- After the node returns Ready, run the annotation patch from “Action steps” to re-sync MCD’s view.
The console-level fallback is rare. The annotation patch covers >95% of stuck-node-on-revert cases in practice.
Forbidden actions
- Do NOT patch desiredConfig to a rendered-config name that does NOT exist in the cluster’s MachineConfig list. MCD will refuse to apply it and the node remains stuck.
- Do NOT use oc patch on the rendered MachineConfig directly. Rendered MachineConfigs are managed by the MCO controller; direct edits are reverted on the next render and break the audit chain.
- Do NOT skip the before/after capture. The audit record is the evidence the recovery was clean.
- Do NOT patch multiple stuck nodes in parallel. One at a time. Concurrent patches risk a second failure class.
- Do NOT use this procedure when the pool is Degraded. Fix the render-controller cause first; the annotation patch will not help and may mask the real failure.
- Do NOT use this procedure to “force” a node onto a config the pool has not adopted (to skip a problematic MachineConfig still in Git). The annotation must point at a rendered-config name the pool’s .spec.configuration.name references; anything else gets reverted on the next render.
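Two of the rules above (the target must exist, and it must be the config the pool has adopted) lend themselves to a guard that runs before the annotate command. The following is a sketch only: safe_target is a hypothetical helper name and K is assumed to be the kubeconfig variable from the diagnostic.

```shell
# Hypothetical helper: refuse any desiredConfig candidate that the pool has
# not adopted or that does not exist as a MachineConfig object.
safe_target() (
  pool=$1 candidate=$2
  adopted=$(oc --kubeconfig "$K" get mcp "$pool" \
    -o jsonpath='{.spec.configuration.name}')
  if [ "$candidate" != "$adopted" ]; then
    echo "REFUSE: pool $pool has adopted '$adopted', not '$candidate'"
    return 1
  fi
  if ! oc --kubeconfig "$K" get machineconfig "$candidate" -o name >/dev/null 2>&1; then
    echo "REFUSE: MachineConfig $candidate does not exist"
    return 1
  fi
  echo "OK: $candidate"
)
```

Chaining it in front of the patch, for example safe_target <pool> "$GOOD" && oc ... annotate ..., means a mistyped or stale name stops the procedure instead of leaving the node stuck on a config MCD cannot apply.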
Example: 2026-05-10 #135 recovery
Context: MR !2 on platform-gitops added ipv6.disable=1 to a master and worker MachineConfig. After the MR merged, OVN-K could not establish geneve overlay on the affected nodes (see IPv6 disable / OVN-K) and they returned Ready=False.
Recovery sequence on spoke-dc-v6:
1. MR !3 reverted the IPv6 MachineConfigs.
2. The master and worker MachineConfigPools each updated .spec.configuration.name to a new rendered config within a minute of the revert merge.
3. spoke-dc-v6-master-1 and spoke-dc-v6-worker-1 remained Ready=False with desiredConfig still pointing at the OLD rendered config. The MachineConfigPools were NOT Degraded.
4. For each stuck node, applied the annotation patch:

       GOOD=$(oc get mcp master -o jsonpath='{.spec.configuration.name}')
       oc annotate node spoke-dc-v6-master-1 \
         machineconfiguration.openshift.io/desiredConfig=$GOOD --overwrite

5. MCD on spoke-dc-v6-master-1 logged node desiredConfig changed within five seconds, ran the drain -> kargs change -> reboot cycle, and the node returned Ready about eight minutes later.
6. Repeated for spoke-dc-v6-worker-1.
7. Validated ovnkube-node pods Ready on all six nodes; the ovnkube-node CrashLoopBackOffs cleared as soon as the reverted MachineConfig was active.
Prevention
The recovery procedure is itself the prevention contract — but it should run rarely. Two reinforcements:
- MachineConfig changes land via a small canary first. A pool of size 1 (a worker labelled for canary) absorbs the failure mode before the broader rollout. The lab has not yet wired this; tracked under #229 follow-ups.
- The architectural lesson lives next door. The IPv6 / OVN-K incident is the canonical case where MC changes break the network plugin. The IPv6 disable / OVN-K page and ADR 0026 are the standing references for what not to attempt; this page is the recovery if you attempt it anyway.
References
- IPv6 disable / OVN-K: the architectural-lesson companion.
- Break-glass procedure: overall policy when this recovery has to bypass normal GitOps flow.
- opp-full-plat/runbooks/mco-stuck-node-recovery.md: operator-facing source.
- opp-full-plat/connection-details/platform-admin-handoff.md, section “Known Gotchas From This Rebuild”.
- GitHub issue #135 (PCI-1.13): the live incident this page was extracted from.
- Internal GitLab MRs !2, !3, !4, !7 on comptech-platform/openshift-ops/openshift-platform-gitops.