Rotation and disaster recovery

Raft snapshots to MinIO, isolated restore drills, lost-node replacement, root token rotation, and the lab's open work on backup/restore validation.

Vault on three Raft voters tolerates the loss of one voter without downtime. Beyond that — loss of two voters, loss of the seal Vault, full lab rebuild — recovery depends on snapshots, custody of unseal shares, and a tested restore drill. This page covers the procedures.

Snapshot cadence and target

Vault’s integrated Raft storage exports a point-in-time snapshot via:

vault operator raft snapshot save <file>

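The snapshot is a single gzip-compressed tar archive holding the entire Raft state at the moment of capture. The snapshot job layers storage in MinIO on top; an additional encryption layer is not in place yet (see the “We’re paranoid” section).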

The snapshot job

A small systemd timer + script on one of the voters runs the snapshot once daily (and on demand):

#!/bin/bash
set -euo pipefail

VAULT_TOKEN="$(cat /etc/vault.d/snapshot-token)"
export VAULT_TOKEN
export VAULT_ADDR=https://127.0.0.1:8200
export VAULT_CACERT=/etc/vault.d/tls/ca.crt

TMP=$(mktemp -d)
trap 'rm -rf "$TMP"' EXIT

OUT="$TMP/raft-$(date -u +%Y%m%d-%H%M%S).snap"
vault operator raft snapshot save "$OUT"

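# Sanity check: the snapshot must be a gzip-compressed archive
file "$OUT" | grep -q 'gzip compressed data'

# Keep a local copy first, so a MinIO outage still leaves a usable snapshot
cp "$OUT" /var/lib/vault-snapshots/local/

# Prune local copies older than 14 days (retention is the operator's call)
find /var/lib/vault-snapshots/local/ -type f -name 'raft-*.snap' -mtime +14 -delete

# Upload to MinIO; under `set -e` a failure here makes the job exit non-zero
mc cp "$OUT" lab/vault-snapshots/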

Two retention layers:

  • MinIO vault-snapshots/ — long-term archive, operator-managed.
  • Local /var/lib/vault-snapshots/local/ — last 14 days, useful for quick rollback without an S3 round-trip.
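The `file | grep` check in the job only looks at the gzip magic bytes, so a truncated file would still pass. A slightly stronger check, as a minimal sketch: read the whole gzip stream and confirm the archive lists at least one member. `verify_snapshot` is a hypothetical helper name, not part of the Vault CLI, and passing it does not prove that the snapshot restores; only the drill does that.

```shell
# verify_snapshot FILE
# Hypothetical helper: returns 0 only if FILE is a complete gzip stream
# that contains a readable, non-empty tar archive.
verify_snapshot() {
  local snap="$1" members
  [ -s "$snap" ] || { echo "empty or missing: $snap" >&2; return 1; }
  # gzip -t reads the whole stream and checks its CRC, which catches truncation
  gzip -t "$snap" 2>/dev/null || { echo "bad gzip stream: $snap" >&2; return 1; }
  # tar -tzf walks the member headers without extracting anything
  members=$(tar -tzf "$snap" 2>/dev/null | wc -l | tr -d ' ')
  [ "$members" -gt 0 ] || { echo "no tar members: $snap" >&2; return 1; }
}
```

In the job this could replace the `file | grep` line with `verify_snapshot "$OUT"`.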

The snapshot token is scoped (snapshot-readonly policy — read on sys/storage/raft/snapshot and nothing else). Never use a root token here.
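A minimal sketch of what that policy could look like. The policy name and API path come from this page; the helper name `write_snapshot_policy` and the file location in the usage comment are illustrative assumptions.

```shell
# write_snapshot_policy FILE
# Writes an HCL policy that grants read on the Raft snapshot endpoint only.
write_snapshot_policy() {
  cat > "$1" <<'EOF'
# snapshot-readonly: may take Raft snapshots, nothing else
path "sys/storage/raft/snapshot" {
  capabilities = ["read"]
}
EOF
}

# Usage on a voter, with a suitably privileged token (not run here):
#   write_snapshot_policy /tmp/snapshot-readonly.hcl
#   vault policy write snapshot-readonly /tmp/snapshot-readonly.hcl
#   vault token create -policy=snapshot-readonly -ttl=8760h -field=token
```

Tokens also receive the default policy unless `-no-default-policy` is passed to `vault token create`, so "nothing else" may need that flag as well.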

Cadence

What | When
--- | ---
Scheduled snapshot | Once per day via systemd timer
Manual snapshot | Before any potentially destructive operation (policy bulk edit, mount move, upgrade)
Pre-upgrade snapshot | First step of any Vault binary upgrade
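The timer itself is not shown on this page. As a rough sketch, assuming the snapshot script is installed at /usr/local/bin/vault-snapshot.sh and the units are named vault-snapshot.service and vault-snapshot.timer (all three names are assumptions), the units could be generated like this:

```shell
# write_snapshot_units DIR SCRIPT
# Writes a oneshot service plus a daily timer into DIR.
# Unit names and the script path are assumptions, not taken from the lab repo.
write_snapshot_units() {
  local dir="$1" script="$2"
  cat > "$dir/vault-snapshot.service" <<EOF
[Unit]
Description=Vault Raft snapshot to MinIO

[Service]
Type=oneshot
ExecStart=$script
EOF
  cat > "$dir/vault-snapshot.timer" <<'EOF'
[Unit]
Description=Daily Vault Raft snapshot

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
EOF
}

# Usage (as root on the voter, not run here):
#   write_snapshot_units /etc/systemd/system /usr/local/bin/vault-snapshot.sh
#   systemctl daemon-reload && systemctl enable --now vault-snapshot.timer
#   systemctl start vault-snapshot.service   # manual / pre-upgrade snapshot
```

`Persistent=true` makes systemd run a missed daily snapshot after the voter comes back from downtime, which seems to fit the once-per-day intent.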

Restore drill

The restore drill validates that a snapshot is actually usable. It is required before Vault is treated as production-trusted, per the readiness gates in vault-oss-vm-plan.md.

Steps

  1. Spin up an isolated Vault VM. A fresh Ubuntu with Vault 1.21.1 installed but uninitialized. Different network segment from production — restore must not interfere with the live cluster.

  2. Initialize it with the same seal strategy as production (vault operator init against a stand-in transit-seal). Save the new init output to scratch custody.

  3. Pull a recent snapshot from MinIO:

    mc cp lab/vault-snapshots/<recent>.snap /tmp/raft.snap

  4. Restore:

    export VAULT_TOKEN=<scratch-root-token>
    vault operator raft snapshot restore -force /tmp/raft.snap

  5. Re-seal cycle. The restored Vault will be sealed: the restored Raft data carries production’s seal configuration and keyring, which the drill node has not been keyed for. Provide production’s unseal material, not the scratch material from step 2. This is what makes the drill realistic: restore depends on having the unseal flow.

  6. Verify:

    vault status
    vault auth list
    vault secrets list
    vault kv get secret/apps/<known-test>/dev/hello

  7. Destroy the drill VM. Don’t leave it running.

The drill cannot skip the seal step. If you can’t perform the unseal flow, you cannot restore from backup — that’s the whole point of the seal strategy.
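The verification in step 6 can be wrapped so that the drill produces a clear pass or fail. This is a sketch only: `drill_verify` is a hypothetical helper, and it assumes VAULT_ADDR and VAULT_TOKEN already point at the drill VM.

```shell
# drill_verify KV_PATH
# Runs the step 6 checks in order and stops at the first failure.
drill_verify() {
  local kv_path="$1" rc=0
  # vault status exits 0 when unsealed, 2 when sealed, 1 on error
  vault status >/dev/null || rc=$?
  if [ "$rc" -ne 0 ]; then
    echo "FAIL: vault status exited $rc (2 means still sealed)" >&2
    return 1
  fi
  vault auth list >/dev/null         || { echo "FAIL: auth list" >&2; return 1; }
  vault secrets list >/dev/null      || { echo "FAIL: secrets list" >&2; return 1; }
  vault kv get "$kv_path" >/dev/null || { echo "FAIL: kv get $kv_path" >&2; return 1; }
  echo "drill verification passed"
}

# Usage: drill_verify "secret/apps/<known-test>/dev/hello"
```

A sealed drill node fails on the first check, which lines up with the rule that the seal step cannot be skipped.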

Current state

The drill has not yet been performed at the v6 generation. It is one of the open gates in vault-oss-vm-plan.md and is tracked under the relevant Vault readiness issue.

Lost-node replacement

If one Raft voter dies (VM gone, disk gone, both):

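  1. Capture the current state. vault operator raft list-peers confirms which two voters are still up. If the dead voter is still listed under the node ID the replacement will reuse, it may need to be removed first with vault operator raft remove-peer <node-id>.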

  2. Bring up a new VM with the same name + IP + MAC as the dead one (vault-1, say).

  3. Restore the Vault binary, config, and TLS material. Same as initial install.

  4. Wipe /var/lib/vault/raft if any state survived (you want a clean voter).

  5. Start vault.service. With an empty Raft directory the voter comes up uninitialized and sealed.

  6. Join the cluster:

    vault operator raft join https://vault-0.sub.comptech-lab.com:8200

  7. Auto-unseal fires once the join has completed, provided the seal Vault is up and the voter can reach it.

  8. Verify:

    vault operator raft list-peers     # should show three voters again
    vault operator raft autopilot state # should report Healthy

The cluster’s Raft autopilot will absorb the new voter automatically once it joins.
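Rather than re-running list-peers by hand, the final check can be polled. The sketch below assumes the default table layout of `vault operator raft list-peers` (columns Node, Address, State, Voter, preceded by a header and a dashed line); `count_voters` and `wait_for_voters` are hypothetical helper names.

```shell
# count_voters
# Counts rows whose Voter column is "true", skipping the two header lines.
count_voters() {
  vault operator raft list-peers |
    awk 'NR > 2 && $4 == "true" { n++ } END { print n + 0 }'
}

# wait_for_voters WANT TRIES DELAY
# Polls until exactly WANT voters are listed or TRIES attempts are used up.
wait_for_voters() {
  local want="$1" tries="$2" delay="$3"
  while [ "$tries" -gt 0 ]; do
    [ "$(count_voters)" -eq "$want" ] && return 0
    tries=$((tries - 1))
    sleep "$delay"
  done
  return 1
}

# Usage: wait_for_voters 3 30 5 || echo "cluster still short of three voters" >&2
```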

Lost seal Vault (vault-seal-0)

The seal Vault is a single-node Vault with a Shamir seal (5 shares, threshold 3). Losing it has two consequences:

  1. Existing voters keep running as long as none of them restart. Auto-unseal is only needed at startup.
  2. A restarted voter cannot unseal until vault-seal-0 is back.

Recovery options:

  • Rebuild from a vault-seal-0 snapshot. The seal Vault also takes Raft snapshots (it’s a single-node Raft cluster). Restore the snapshot to a fresh VM and unseal with the existing Shamir shares.
  • Re-initialize the seal Vault. If the seal Vault’s snapshots are gone too, it has to be re-initialized, which creates a new transit key. The main Vault’s root key was encrypted by the old transit key; without that key, the main Vault’s data is undecryptable, and that includes its snapshots, because a restored snapshot still has to be unsealed by the seal key it was taken under. A freshly initialized seal Vault therefore cannot bring the old data back; the main Vault has to be re-initialized against it and its contents rebuilt.

This is why the seal Vault’s Shamir shares are kept in offline custody, separated from the main Vault VMs. If both the seal Vault and the main Vault data are lost, the offline shares, together with a surviving seal Vault snapshot, are the last line of defense.
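Before restarting any voter it helps to know what state the seal Vault is in. A small sketch that interprets the HTTP status of Vault's /v1/sys/health endpoint with default query parameters; the helper names are hypothetical and the hostname in the usage line is a placeholder, not the lab's real address.

```shell
# describe_health_code CODE
# Maps a /v1/sys/health HTTP status to a short description.
describe_health_code() {
  case "$1" in
    200) echo "initialized, unsealed, active" ;;
    429) echo "unsealed, standby" ;;
    501) echo "not initialized" ;;
    503) echo "sealed" ;;
    000) echo "unreachable" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# seal_vault_state URL
# curl prints 000 as the status code when no HTTP response was received.
# Add --cacert for the lab CA if the system trust store does not include it.
seal_vault_state() {
  local code
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1/v1/sys/health") || true
  describe_health_code "${code:-000}"
}

# Usage: seal_vault_state https://vault-seal-0.example.internal:8200
```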

Total loss recovery

Worst case: all VMs gone, all qcow2 disks gone. Recovery sequence:

  1. Rebuild the hypervisors and lab base infrastructure (DNS, network, MinIO).

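  2. Bring up a fresh vault-seal-0, restore its most recent snapshot, and unseal it with the saved offline Shamir shares. The shares only unseal the data they were generated for; a plain vault operator init would produce new shares and a new transit key.

  3. Bring up vault-0, vault-1, vault-2 with their seal config pointing at the rebuilt seal Vault.

  4. The main Vault is uninitialized and has no data at this point. That’s fine: initialize it to obtain a scratch root token, as in the restore drill.

  5. Log in with the scratch root token. The saved root token from the original init becomes usable again only after the restore, and only if it was never revoked, because tokens live in the restored data.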

  6. Restore from the most recent main-Vault snapshot:

    vault operator raft snapshot restore -force /path/to/raft.snap

  7. The restored data includes the auth mounts, policies, KV-v2 contents. ESO can reconnect.

This sequence is documented but not drilled. The first real drill is overdue (per the production-readiness gate).
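Step 6 depends on picking "the most recent" snapshot correctly under pressure. Because the snapshot job names files raft-YYYYmmdd-HHMMSS.snap with a zero-padded UTC timestamp, a plain lexicographic sort is also a chronological one. A minimal sketch, with `latest_snapshot` as a hypothetical helper name:

```shell
# latest_snapshot
# Reads object names on stdin (one per line) and prints the newest
# raft-YYYYmmdd-HHMMSS.snap name; anything else is ignored.
latest_snapshot() {
  grep -E '^raft-[0-9]{8}-[0-9]{6}\.snap$' | LC_ALL=C sort | tail -n 1
}

# Usage, assuming the object name is the last column of `mc ls` output:
#   mc ls lab/vault-snapshots/ | awk '{ print $NF }' | latest_snapshot
```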

Rotation: root token, unseal shares, snapshot token

Asset | Cadence | Procedure
--- | --- | ---
Vault root token | Per operator-defined schedule (currently ad-hoc) | vault token create -policy=root -ttl=... then revoke the old token
Shamir unseal shares | Only if compromise suspected | vault operator rekey -init followed by a re-share ceremony
Snapshot scoped token | Operator-defined | vault token create -policy=snapshot-readonly -ttl=8760h, drop into /etc/vault.d/snapshot-token, restart timer
Transit seal key | Only if compromise suspected | Re-key the transit/key on the seal Vault; main Vault needs a “rewrap” using vault write sys/sealwrap/rewrap
TLS leaf certs | 12-month manual | Re-issue from lab CA; copy to /etc/vault.d/tls/vault.crt; systemctl reload vault

The lab does not auto-rotate root tokens. Root is rare-use; rotation is operator-initiated when a token’s audit trail looks suspect or when an operator leaves the team.
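The snapshot-token row is the one most likely to be scripted. A sketch under stated assumptions: `rotate_snapshot_token` is a hypothetical helper, and the caller's own token must be allowed to create tokens with the snapshot-readonly policy.

```shell
# rotate_snapshot_token TOKEN_FILE
# Issues a fresh snapshot-readonly token and swaps it into TOKEN_FILE atomically,
# so the snapshot job never reads a half-written file.
rotate_snapshot_token() {
  local token_file="$1" new_token tmp
  new_token=$(vault token create -policy=snapshot-readonly -ttl=8760h -field=token) || return 1
  [ -n "$new_token" ] || return 1
  # mktemp in the same directory keeps the final mv on one filesystem
  tmp=$(mktemp "${token_file}.XXXXXX") || return 1
  chmod 600 "$tmp"
  printf '%s\n' "$new_token" > "$tmp"
  mv "$tmp" "$token_file"
}

# Usage: rotate_snapshot_token /etc/vault.d/snapshot-token
```

The previous token stays valid until it expires or is revoked, so revoking it is a separate step.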

What MinIO needs to be working

  • For scheduled snapshots: MinIO accessible from the voter running the snapshot job.
  • For restore: snapshot file accessible somewhere (MinIO ideal; local copy acceptable as fallback).
  • For DR drills: MinIO and a target VM with sufficient CPU/RAM.

If MinIO is down, snapshots queue locally (on the voter’s /var/lib/vault-snapshots/local/). When MinIO comes back, the queue can be uploaded manually:

ls /var/lib/vault-snapshots/local/
mc cp /var/lib/vault-snapshots/local/raft-*.snap lab/vault-snapshots/
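The glob above re-uploads everything, including snapshots MinIO already has. A slightly more careful variant, sketched here with a hypothetical helper name, skips objects that `mc stat` can already find:

```shell
# upload_backlog LOCAL_DIR REMOTE
# Uploads every local raft-*.snap that is not yet present under REMOTE.
upload_backlog() {
  local dir="$1" remote="$2" f name
  for f in "$dir"/raft-*.snap; do
    [ -e "$f" ] || continue            # glob matched nothing
    name=$(basename "$f")
    if mc stat "$remote/$name" >/dev/null 2>&1; then
      continue                         # already in MinIO
    fi
    mc cp "$f" "$remote/$name" || return 1
  done
}

# Usage: upload_backlog /var/lib/vault-snapshots/local lab/vault-snapshots
```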

“We’re paranoid” extra step: encrypt the snapshot

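vault operator raft snapshot save adds no encryption layer of its own to the snapshot file. The stored secrets stay encrypted under Vault’s barrier keyring, which is why a restore needs the unseal flow, but anyone who obtains both a snapshot and the seal material can read everything. The current lab posture relies on:

  • MinIO IAM scoping (only the vault-snapshots-rw user can write; only an operator with that key can read).
  • Lab network isolation.

A future tightening (track under DR follow-up): encrypt the snapshot with age or openssl enc before pushing to MinIO, and keep the decryption key offline. Then a MinIO compromise doesn’t yield the secret data even if the seal material leaks as well.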
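One way this could look with age, as a sketch rather than the lab's chosen design: only the age public key (recipient) lives on the voter, and the matching identity file stays offline. `encrypt_snapshot` and the `AGE_RECIPIENT` variable in the usage comment are hypothetical names.

```shell
# encrypt_snapshot SNAPSHOT RECIPIENT
# Encrypts SNAPSHOT to an age public key and prints the path of the result.
encrypt_snapshot() {
  local snap="$1" recipient="$2" out="$1.age"
  age -r "$recipient" -o "$out" "$snap" || return 1
  printf '%s\n' "$out"
}

# Usage inside the snapshot job (not run here):
#   ENC=$(encrypt_snapshot "$OUT" "$AGE_RECIPIENT")
#   mc cp "$ENC" lab/vault-snapshots/
# Decryption during a restore, with the offline identity file:
#   age -d -i key.txt -o raft.snap raft.snap.age
```

Using a public-key recipient means the voter never holds anything that can decrypt old snapshots, which appears to match the "keep the decryption key offline" goal.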

Failure modes

Symptom | Root cause | Fix | Prevention
--- | --- | --- | ---
Snapshot job fails with 403 | Snapshot token’s policy doesn’t grant sys/storage/raft/snapshot | Re-issue token with correct policy | Use the named snapshot-readonly policy verbatim
Snapshot job runs but MinIO write fails | MinIO vault-snapshots-rw IAM dropped the policy | Reattach the policy on MinIO | Track IAM as configuration, audit on schedule
Restore drill won’t unseal | Wrong seal Vault transit key (mismatched seal config) | The drill snapshot is from a different seal generation; verify provenance | Tag snapshots with the seal-key generation when re-keying happens
Lost voter rejoins but Autopilot reports unhealthy | New voter’s clock skew vs cluster | NTP sync; restart Vault on the new voter | Always check chrony status on a new VM before joining
Snapshot fills /tmp | Snapshot job didn’t clean up temp file | Add trap 'rm -rf "$TMP"' EXIT (or fix the existing one) | Test the snapshot job under disk-full conditions before relying on it

Last reviewed: 2026-05-11