Upgrade and channel management
How operator upgrades happen in the dc-lab fleet: when channels change, when startingCSV bumps, and the multi-step procedure that keeps clusters consistent.
Upgrades are rare events. With installPlanApproval: Manual and min == max pinning, an operator only moves forward when we deliberately choose to move it. This page documents how that choice is implemented end to end: the mirror update, the GitOps changes, the cluster apply, and the validation.
Why an upgrade is a tracked event
Day-to-day operations on the fleet do not upgrade operators. Operators run on the pinned version listed in operator-version-lock.md until there is an explicit decision to bump them. The triggers that turn into upgrades:
- A security CVE in the running operator or its operand that has a fix in a newer version.
- An OpenShift minor upgrade (e.g. 4.20 → 4.21) that requires aligning operator versions with the new OCP release.
- A new feature in a later operator version that has been scoped and approved.
- End-of-life on the current operator version within the Red Hat support window.
Each of these is a tracked GitHub issue per ADR 0016. The issue lays out:
- the from-version and to-version;
- the CVE or feature reference;
- the OCP versions involved;
- the affected clusters;
- the planned downtime / impact;
- the rollback path.
Two upgrade shapes
| Shape | Example | Subscription change | Catalog change |
|---|---|---|---|
| Patch upgrade in same channel | ESO 1.1.0 → 1.1.1 (channel stable-v1) | bump startingCSV | re-mirror min==max=1.1.1 |
| Channel change | ACM 2.16 → 2.17 (release-2.16 → release-2.17) | bump channel AND startingCSV | new channel index path in IDMS / catalog |
Patch upgrades are simpler: the channel and the catalog index path are unchanged, the CatalogSource keeps its name (only its image digest moves, see Step 4), and the Subscription changes only its startingCSV. Channel changes touch more layers.
The six-step procedure
For every upgrade, the procedure runs in order. Skipping a step is the most common cause of partial upgrades and post-upgrade firefights.
Step 1 — Update the canonical table
Edit opp-full-plat/plans/disconnected-rebuild/environments/dc-lab/operator-version-lock.md to reflect the new target version. Open the tracking issue. Cite ADRs (0019 for pinning, 0025 for GitOps-only operations, 0018 for the pull model — whichever apply).
Step 2 — Edit the ImageSetConfiguration
In imageset-config.yaml, change the package’s minVersion / maxVersion to the new target. For a channel change, also change the channel name.
# Before
- name: openshift-external-secrets-operator
  defaultChannel: stable-v1
  channels:
    - name: stable-v1
      minVersion: 1.1.0
      maxVersion: 1.1.0
# After (patch upgrade)
- name: openshift-external-secrets-operator
  defaultChannel: stable-v1
  channels:
    - name: stable-v1
      minVersion: 1.1.1
      maxVersion: 1.1.1
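The min == max rule can be checked mechanically before anything is mirrored. The following is a minimal sketch, not an existing fleet tool: check_pinning is a hypothetical helper name, and it assumes every channel entry lists name, minVersion and maxVersion in that order, as in the snippet above.

```shell
# check_pinning FILE
# Prints every channel entry whose minVersion and maxVersion differ and
# returns non-zero if at least one is found.
# Assumption: each entry lists name, minVersion, maxVersion in that order.
check_pinning() {
  awk '
    # Remember the most recent "- name:" line; for a channel entry this
    # is the channel name.
    /^[ \t]*-[ \t]*name:/ { name = $NF }
    /minVersion:/ { min = $NF }
    /maxVersion:/ {
      if ($NF != min) {
        print name ": minVersion " min " != maxVersion " $NF
        bad = 1
      }
      min = ""
    }
    END { exit bad + 0 }
  ' "$1"
}
```

Run as check_pinning imageset-config.yaml; a non-zero exit status means at least one entry is not pinned to a single version.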
Step 3 — Mirror the new content
cd /home/ze/ocp-mirror-workspaces/dc-lab
# Validate the new ImageSet first
oc mirror --v2 \
  --config imageset-config.yaml \
  --workspace file://full-operators-dryrun-workspace \
  docker://mirror-registry.apps.sub.comptech-lab.com \
  --authfile pull-secret.merged.json \
  --dry-run
# Compare the mappings; the diff should reflect added/removed images
diff <(sort full-operators-dryrun-workspace/working-dir/dry-run/mapping.txt) \
     <(sort previous-mapping.txt)
# If acceptable, run the real mirror
tmux new-session -d -s oc-mirror-upgrade ./tools/run-oc-mirror-fast.sh
For a single-version patch upgrade, the diff is usually a handful of bundle images. For an OCP minor upgrade or a channel change, the diff can be tens of images.
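To judge whether the diff is "a handful" or "tens" without reading it line by line, the two mapping files can be reduced to counts. This is a minimal sketch; mapping_delta is a hypothetical helper, and it treats every line of a mapping file as one image mapping without interpreting its contents.

```shell
# mapping_delta PREVIOUS CURRENT
# Prints "added N removed M", where N counts lines present only in
# CURRENT and M counts lines present only in PREVIOUS.
mapping_delta() {
  prev_sorted=$(mktemp) && cur_sorted=$(mktemp) || return 1
  sort -u "$1" > "$prev_sorted"
  sort -u "$2" > "$cur_sorted"
  # comm -13 keeps lines unique to the second file, comm -23 lines
  # unique to the first; tr strips the padding some wc builds emit.
  added=$(comm -13 "$prev_sorted" "$cur_sorted" | wc -l | tr -d ' ')
  removed=$(comm -23 "$prev_sorted" "$cur_sorted" | wc -l | tr -d ' ')
  rm -f "$prev_sorted" "$cur_sorted"
  echo "added $added removed $removed"
}
```

For the dry run above, the call would be mapping_delta previous-mapping.txt full-operators-dryrun-workspace/working-dir/dry-run/mapping.txt.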
Step 4 — Regenerate cluster resources
oc mirror --v2 regenerates the manifests under cluster-resources/. Compare:
diff -u previous/cluster-resources/idms-oc-mirror.yaml current/cluster-resources/idms-oc-mirror.yaml
diff -u previous/cluster-resources/cs-redhat-operator-index-v4-20.yaml current/cluster-resources/cs-redhat-operator-index-v4-20.yaml
For a patch upgrade:
- IDMS rarely changes (same source registries).
- CatalogSource image digest changes — that’s the central change.
For a channel change:
- IDMS may add new source: entries if the new channel pulls from a new registry path.
- CatalogSource image digest changes.
Commit the updated IDMS and CatalogSource manifests to platform-gitops. This is one MR.
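Before opening that MR, it can help to confirm that the CatalogSource image really moved. The sketch below pulls the image field out of two manifests and compares them. It assumes each manifest contains a single image: key, which is an assumption about the generated files rather than something this page guarantees; catalog_image and catalog_image_changed are hypothetical helper names.

```shell
# catalog_image FILE
# Prints the value of the first "image:" key in a CatalogSource manifest.
catalog_image() {
  awk '$1 == "image:" { print $2; exit }' "$1"
}

# catalog_image_changed PREVIOUS CURRENT
# Succeeds (exit 0) when the two manifests reference different images.
catalog_image_changed() {
  [ "$(catalog_image "$1")" != "$(catalog_image "$2")" ]
}
```

With the file names from the diff above, catalog_image_changed previous/cluster-resources/cs-redhat-operator-index-v4-20.yaml current/cluster-resources/cs-redhat-operator-index-v4-20.yaml is expected to succeed after a re-mirror that picked up new content.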
Step 5 — Update the Subscription(s)
Edit the affected operator’s subscription.yaml:
# Before (patch upgrade)
spec:
  channel: stable-v1
  installPlanApproval: Manual
  startingCSV: openshift-external-secrets-operator.v1.1.0
# After
spec:
  channel: stable-v1
  installPlanApproval: Manual
  startingCSV: openshift-external-secrets-operator.v1.1.1
For a channel change, both channel and startingCSV move. Commit as a separate MR from the catalog update, in case the catalog update needs to roll back.
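Because the catalog and the Subscription travel in separate MRs, the startingCSV can drift from the version pinned in imageset-config.yaml. A minimal cross-check is sketched below. It assumes the CSV name follows the package.vVERSION pattern seen in the examples on this page, which not every operator is guaranteed to follow; csv_version and starting_csv_matches_pin are hypothetical helper names.

```shell
# csv_version CSV_NAME
# Prints the part after the first ".v", e.g. "1.1.1" for
# "openshift-external-secrets-operator.v1.1.1". Fails if there is no ".v".
csv_version() {
  case "$1" in
    *.v*) printf '%s\n' "${1#*.v}" ;;
    *) return 1 ;;
  esac
}

# starting_csv_matches_pin CSV_NAME PINNED_VERSION
# Succeeds when the version embedded in the CSV name equals the pin.
starting_csv_matches_pin() {
  [ "$(csv_version "$1")" = "$2" ]
}
```

For the patch upgrade above, starting_csv_matches_pin openshift-external-secrets-operator.v1.1.1 1.1.1 succeeds, while the old v1.1.0 name fails against the new pin.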
Step 6 — Approve the InstallPlan
Once the new Subscription is Synced/Healthy and OLM resolves the upgrade:
K=/path/to/cluster.kubeconfig
oc --kubeconfig "$K" -n <ns> get installplan
# NAME CSV APPROVAL APPROVED
# install-abc12 openshift-external-secrets-operator.v1.1.1 Manual false
# Review the planned change
oc --kubeconfig "$K" -n <ns> get installplan install-abc12 -o yaml \
  | yq '.spec.clusterServiceVersionNames'
Then approve via GitOps — commit an installplan-1.1.1.yaml file in platform-gitops with spec.approved: true. Argo applies it; OLM continues; new CSV reconciles.
Approving via oc patch is break-glass only. The default path is the GitOps MR because it captures who approved what and when.
Validation after upgrade
For each operator upgrade:
K=/path/to/cluster.kubeconfig
NS=<operator-namespace>
OP=<operator-package>
# CSV reached Succeeded
oc --kubeconfig "$K" -n "$NS" get csv | grep "$OP"
# expect: $OP.vNEW_VERSION Succeeded
# No predecessor CSV left behind in the Replacing phase
oc --kubeconfig "$K" -n "$NS" get csv -o jsonpath='{range .items[?(@.status.phase=="Replacing")]}{.metadata.name}{"\n"}{end}'
# expect: empty after a clean upgrade
# Subscription startingCSV matches target
oc --kubeconfig "$K" -n "$NS" get subscription "$OP" -o jsonpath='{.spec.startingCSV}{"\n"}'
# expect: $OP.vNEW_VERSION
# Operator pod healthy
oc --kubeconfig "$K" -n "$NS" get deploy
# Operand health (operator-specific)
# e.g. for ESO:
oc --kubeconfig "$K" -n "$NS" get pods -l app.kubernetes.io/instance=external-secrets-operator
# ACM policy compliance (if a policy exists for this operator):
HUB=/path/to/hub.kubeconfig
oc --kubeconfig "$HUB" -n open-cluster-management-policies \
  get policy <policy-name> -o jsonpath='{.status.compliant}'
# expect: Compliant
For OperatorPolicy-governed operators, the policy compliance becomes the long-lived audit record.
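The first check above relies on reading grep output by eye. For a scripted gate, the same check can be expressed as a filter over the table printed by oc get csv. This is a minimal sketch: csv_succeeded is a hypothetical helper, and it assumes the CSV name is the first column and the phase is the last column of that output.

```shell
# csv_succeeded EXPECTED_CSV
# Reads "oc get csv --no-headers" output on stdin and succeeds only if
# EXPECTED_CSV is listed with phase Succeeded.
# Only the first and last fields are used, because the display-name
# column in between can contain spaces.
csv_succeeded() {
  awk -v want="$1" '
    $1 == want && $NF == "Succeeded" { ok = 1 }
    END { exit (ok ? 0 : 1) }
  '
}
```

A possible invocation, with the target version filled in by hand: oc --kubeconfig "$K" -n "$NS" get csv --no-headers | csv_succeeded "$OP.v1.1.1".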
Rollback
A clean rollback for OLM operators is not generally supported. CSV upgrades may apply CRD migrations that don’t reverse. Plan accordingly:
| Risk | Mitigation |
|---|---|
| CRD schema change in new CSV | back up operand CRs before upgrade; if rollback needed, restore CRs from backup after re-installing the old CSV |
| Old version no longer in the catalog | keep the previous mirror state on the Nexus VM for at least one upgrade cycle |
| Argo selfHeal interferes with the rollback | temporarily disable auto-sync on the operator’s Application during rollback |
If rollback is required, the procedure is:
- Revert the GitOps Subscription change (re-MR with the old startingCSV).
- Re-mirror the old version into Nexus if it was purged.
- Delete the new CSV and let OLM re-install the old one.
- Verify operand CRs reconcile correctly against the old operator.
In practice this is painful, which is the reason the manual-approval gate exists: it’s cheaper to delay an upgrade than to undo a bad one.
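The "back up operand CRs before upgrade" mitigation is easier to review when the commands are generated rather than typed. The sketch below only prints the oc commands for a list of resource kinds and runs nothing against a cluster. backup_commands is a hypothetical helper, and which kinds to pass is operator-specific and not listed on this page.

```shell
# backup_commands KUBECONFIG OUTDIR KIND...
# Prints one "oc get ... -o yaml" command per resource kind so the list
# can be reviewed (or piped to sh) before the upgrade starts.
backup_commands() {
  kubeconfig=$1
  outdir=$2
  shift 2
  for kind in "$@"; do
    printf 'oc --kubeconfig %s get %s --all-namespaces -o yaml > %s/%s.yaml\n' \
      "$kubeconfig" "$kind" "$outdir" "$kind"
  done
}
```

For ESO, one plausible call is backup_commands /path/to/cluster.kubeconfig backup externalsecrets.external-secrets.io. The exported YAML includes status and cluster-assigned metadata, which usually has to be stripped before a restore.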
OCP minor upgrades and operator alignment
An OCP minor upgrade (e.g. 4.20.18 → 4.21.x) typically requires moving several operators to versions that ship in the 4.21 catalog. The procedure:
- Decide the new OCP target version (tracking issue, ADR if needed).
- Update imageset-config.yaml platform.channels[].minVersion / maxVersion to the new release.
- For each operator that needs to move, update its package entry in imageset-config.yaml.
- Update CatalogSource manifests to point at the new redhat-operator-index:v4.21 (the index image path changes: v4.20 → v4.21).
- Re-mirror. Validate.
- Upgrade OCP via the standard cluster-upgrade procedure (separate from operator upgrades).
- Upgrade operators in waves matching the new versions.
An OCP minor upgrade is a multi-day event with significant testing. It’s outside the scope of “operator upgrade” and gets its own runbook.
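The index tag and the CatalogSource manifest name both follow from the OCP minor version, so they can be derived instead of edited by hand. This is a minimal sketch under the naming pattern visible in Step 4 (cs-redhat-operator-index-v4-20.yaml); index_tag and catalogsource_file are hypothetical helper names.

```shell
# index_tag OCP_VERSION
# Prints the operator index tag for an OCP version: "4.21.3" -> "v4.21".
index_tag() {
  printf '%s\n' "$1" | awk -F. '{ print "v" $1 "." $2 }'
}

# catalogsource_file OCP_VERSION
# Prints the CatalogSource manifest name for that version, following
# the cs-redhat-operator-index-v<major>-<minor>.yaml pattern.
catalogsource_file() {
  printf '%s\n' "$1" | awk -F. '{ print "cs-redhat-operator-index-v" $1 "-" $2 ".yaml" }'
}
```

Only the first two dot-separated fields are read, so a placeholder such as 4.21.x yields the same tag as a concrete patch release.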
Failure modes during upgrade
| Symptom | Root cause | Fix |
|---|---|---|
| oc mirror upgrade run fails partway | upstream tag/digest changed during mirror | re-run; if persistent, freeze the target version explicitly |
| Subscription stuck UpgradePending after CSV bump | new CSV’s replaces chain doesn’t go through the current CSV | verify the upgrade path; may need to install intermediate CSV first |
| InstallPlan exists but RequirementsNotMet | dependent operator hasn’t been upgraded yet | upgrade dependencies first (e.g. MCE before ACM) |
| CSV upgrade succeeds but operand CR breaks | CRD schema migration; old operand CR no longer valid | manual operand reconfiguration; rare on patch upgrades, real risk on major version jumps |
| Cluster nodes start rolling on upgrade (unexpected) | a MachineConfig change went out alongside the operator upgrade | inspect MCO history; a node rollout is normal for storage / network operators and needs investigating for anything else |
| Old CSV stuck in Replacing state | OLM hasn’t garbage-collected; non-fatal | oc delete csv <old> after confirming the new one is healthy |
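The UpgradePending case in the table comes down to a graph question: can the target CSV reach the currently installed CSV by following replaces? The sketch below checks this under two stated assumptions. First, the input file format (one "csv replaces" pair per line) is made up for illustration and has to be produced from the catalog by other means. Second, only replaces is followed, although OLM also honors skips and skipRange. replaces_reaches is a hypothetical helper name.

```shell
# replaces_reaches TARGET_CSV CURRENT_CSV GRAPH_FILE
# GRAPH_FILE holds one "csv replaces" pair per line; the second field
# is empty for the head of a chain. Succeeds if walking the replaces
# links from TARGET_CSV arrives at CURRENT_CSV.
replaces_reaches() {
  awk -v node="$1" -v want="$2" '
    { replaces[$1] = $2 }
    END {
      # Walk backwards from the target. The hop limit (one more than
      # the number of input lines) guards against a cycle in bad input.
      for (hops = 0; hops <= NR && node != ""; hops++) {
        if (node == want) exit 0
        node = replaces[node]
      }
      exit 1
    }
  ' "$3"
}
```

A failing result suggests an intermediate CSV is needed, matching the fix column in the table.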
Channel-jump quick reference
A few common channel jumps and what changes:
| Operator | From | To | What else changes |
|---|---|---|---|
| advanced-cluster-management | release-2.16 | release-2.17 | check multicluster-engine channel compatibility |
| odf-operator | stable-4.20 | stable-4.21 | requires OCP 4.21 first; ODF version aligned with OCP minor |
| openshift-gitops-operator | latest (1.20.x) | latest (1.21.x) | Argo CD CRD changes possible; review ApplicationSets |
| openshift-pipelines-operator-rh | pipelines-1.22 | pipelines-1.23 | Tekton CRDs evolve; check TaskRun / PipelineRun compatibility |
| loki-operator + cluster-logging | stable-6.5 | stable-6.6 | upgrade in pair; cluster-logging depends on Loki schema |
| tempo-product | stable (0.20.0-3) | stable (next) | Tempo CR schema sometimes adds required fields |
References
- 02-version-pinning-strategy — the pinning rules an upgrade temporarily breaks.
- 04-operatorpolicy-via-acm — what governance reports during the upgrade window.
- 02-oc-mirror-workflow — the mirror side of the upgrade.
- 03-idms-itms-and-cluster-pull — IDMS regeneration after a catalog change.
- opp-full-plat/plans/disconnected-rebuild/environments/dc-lab/operator-version-lock.md — canonical version table.
- ADR 0019 — Nexus-only image supply chain.
- ADR 0025 — GitOps-only operations.