Cloud-init base image: how new VMs join the fleet
The Ubuntu 24.04 cloud-init base image, the user-data convention, and how a new platform VM goes from libvirt define to SSH as ze in a few minutes.
Every supporting VM in the lab — Vault, Jenkins, SigNoz, monitoring-0, Trivy, DefectDojo, docker-runtime, Kafka, Redis, WSO2, MinIO, Nexus, the DNS host, the HAProxy edge — was created from the same Ubuntu 24.04 cloud-init base image, with the same MAC/IP allocation pattern and the same SSH key set. This page documents that single shared build path so any new VM in this fleet starts from the same place.
What is in scope
This section is about non-OpenShift VMs. OpenShift nodes use RHCOS, the Agent-based Installer, and a different bootstrap path entirely (Ignition, machine-config operator, etc.). That work lives in §6 “OpenShift platform” of the docs site.
The VMs covered here all share four properties:
- They run Ubuntu 24.04 LTS as their guest OS.
- They live on the lab Linux bridge br30 with static addressing inside the lab /16.
- They are bootstrapped from a single qcow2 base image plus a small per-VM cloud-init seed.
- They are administered through a single SSH user ze (and lab admin zahid where applicable per the service-specific ADR).
The base image
| Item | Value |
|---|---|
| Path on each hypervisor | /var/lib/libvirt/images/ubuntu-24.04-base.qcow2 |
| OS | Ubuntu 24.04.4 LTS (noble) |
| Kernel | 6.8.x |
| QEMU agent | Installed |
| Default user | ubuntu (replaced at first boot by cloud-init) |
| Initial size | small (resized at clone time, typically to 100G OS + a data disk) |
| Image lineage | Built from the upstream Ubuntu cloud image |
It is a generic cloud-image qcow2 — no lab-specific provisioning is baked in. Lab-specific bootstrap (package install, user creation, firewall, service install) is done by cloud-init at first boot of each cloned VM. This keeps the base image swappable when Ubuntu issues a new image; the cloud-init layer is where the per-VM truth lives.
VM creation flow (one diagram’s worth of words)
For any new VM (<vm>):
- Allocate an IP and MAC by editing plans/disconnected-rebuild/environments/dc-lab/allocation-table.md. The deterministic MAC convention is 52:54:00:XX:XX:<last-octet-of-IP-in-hex>; the exact middle bytes are part of the internal lab convention recorded in opp-full-plat/connection-details/. This makes the allocation table greppable.
- Clone the base qcow2 onto the chosen hypervisor: cp /var/lib/libvirt/images/ubuntu-24.04-base.qcow2 /var/lib/libvirt/images/<vm>.qcow2, then qemu-img resize it to the per-VM OS disk size (typically 100G).
- Create a data disk if the role needs one (Jenkins home, Docker data dir, Vault raft path, MinIO buckets, ClickHouse storage, Loki/Tempo storage…). Mount it at the service’s expected path during cloud-init.
- Render cloud-init for the VM (user-data, meta-data, network-config) and seed an ISO with cloud-localds, or attach a NoCloud datasource via libvirt.
- Run virsh define and virsh start with the deterministic MAC pinned to the br30 interface.
- Wait for cloud-init to finish (cloud-init status --wait over the serial console or once SSH is up).
- Apply DNS records through the PowerDNS API for the private FQDN (and the *.apps.sub.comptech-lab.com edge name if the VM gets an HAProxy route).
- Update the HAProxy backend block if the VM is publicly addressable.
- Apply the service-specific provisioning (Jenkins LTS apt repo, Docker Engine + Compose, Vault binary checksum-verify, MinIO server, …); this can be folded into cloud-init runcmd or done as a follow-up role.
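The hypervisor-side steps above (clone, resize, seed, define, start) can be sketched as a small shell helper. This is a minimal sketch under stated assumptions, not the lab's actual tooling: the function name plan_vm, the seed ISO name, the directory holding the rendered cloud-init files, and the <vm>.xml domain file are all hypothetical. It prints the commands instead of running them, so the plan can be reviewed before anything touches the hypervisor.

```shell
#!/usr/bin/env bash
# Sketch: print the hypervisor-side commands for one new VM.
# Assumptions (not from the lab repo): the seed ISO name, the rendered
# cloud-init directory, and the domain XML file name are hypothetical.

IMAGES=/var/lib/libvirt/images
BASE="$IMAGES/ubuntu-24.04-base.qcow2"

plan_vm() {
  # $1 = VM name, $2 = OS disk size (default 100G), $3 = dir with rendered seed files
  local vm="$1" os_size="${2:-100G}" seed_dir="${3:-./seed/$1}"
  cat <<EOF
cp $BASE $IMAGES/$vm.qcow2
qemu-img resize $IMAGES/$vm.qcow2 $os_size
cloud-localds --network-config=$seed_dir/network-config $IMAGES/$vm-seed.iso $seed_dir/user-data $seed_dir/meta-data
virsh define $vm.xml
virsh start $vm
EOF
}

# Usage: review the printed plan, then run it on the chosen hypervisor.
#   plan_vm monitoring-0 100G ./seed/monitoring-0
```

Printing the plan first keeps the allocation table edit and the domain XML (with the pinned MAC) as explicit prerequisites rather than side effects of the script.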
The cloud-init shape
The user-data uses the standard cloud-init schema. The minimal set the lab uses for every new VM:
#cloud-config
hostname: <vm>
fqdn: <vm>.sub.comptech-lab.com
manage_etc_hosts: true
users:
  - name: ze
    sudo: ALL=(ALL) NOPASSWD:ALL
    shell: /bin/bash
    ssh_authorized_keys:
      - ssh-ed25519 AAAA... ze@workstation
      - ssh-ed25519 AAAA... ze@mac-laptop
      - ssh-ed25519 AAAA... ocp-bootstrap
ssh_pwauth: false
disable_root: true
write_files:
  - path: /etc/systemd/resolved.conf.d/lab-dns.conf
    content: |
      [Resolve]
      DNS=<lab-dns-recursor-ip>
      Domains=~sub.comptech-lab.com
      DNSSEC=no
packages:
  - qemu-guest-agent
  - htop
  - jq
runcmd:
  - [ systemctl, enable, --now, qemu-guest-agent ]
  - [ systemctl, restart, systemd-resolved ]
Then per role, the runcmd extends:
- Jenkins: add LTS apt repo, install OpenJDK 21, install Jenkins, mount the data disk at /var/lib/jenkins.
- SigNoz: install Docker Engine + Compose, clone the SigNoz repo at the pinned tag, mount the data disk at /var/lib/docker, docker compose up -d.
- monitoring-0: install Grafana, Prometheus, Alertmanager, Loki, Tempo, Pyroscope, Alloy, Node Exporter as native systemd services.
- Vault VMs: download Vault 1.21.1 from HashiCorp’s release archive, checksum-verify, place under /usr/local/bin/vault, create the dedicated vault user, render /etc/vault.d/vault.hcl for the role (Raft voter vs transit-seal), enable vault.service.
- MinIO: install the minio binary, mount the data disk at /srv/minio, render the systemd unit, start.
The exact runcmd lists for each role live in the corresponding plans/disconnected-rebuild/environments/dc-lab/*-vm-plan.md files.
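The per-VM seed files could be rendered along the following lines. This is a hedged sketch, not the lab's renderer: render_seed and the output directory layout are hypothetical, the public key is passed in by the caller, and the write_files, packages, and runcmd sections shown earlier are left out for brevity. The meta-data keys (instance-id, local-hostname) are the ones the NoCloud datasource reads.

```shell
#!/usr/bin/env bash
# Sketch: render NoCloud meta-data and a minimal user-data for one VM.
# Assumptions: render_seed and the output layout are hypothetical helpers;
# write_files/packages/runcmd from the page are omitted to keep this short.

render_seed() {
  # $1 = VM name, $2 = one SSH public key line, $3 = output directory
  local vm="$1" pubkey="$2" out_dir="$3"
  mkdir -p "$out_dir"

  # NoCloud needs an instance-id; reusing the VM name keeps it stable per VM.
  cat > "$out_dir/meta-data" <<EOF
instance-id: $vm
local-hostname: $vm
EOF

  cat > "$out_dir/user-data" <<EOF
#cloud-config
hostname: $vm
fqdn: $vm.sub.comptech-lab.com
manage_etc_hosts: true
users:
  - name: ze
    sudo: ALL=(ALL) NOPASSWD:ALL
    shell: /bin/bash
    ssh_authorized_keys:
      - $pubkey
ssh_pwauth: false
disable_root: true
EOF
}

# Usage (hypothetical paths):
#   render_seed monitoring-0 "$(cat ~/.ssh/id_ed25519.pub)" ./seed/monitoring-0
#   cloud-init schema --config-file ./seed/monitoring-0/user-data   # optional lint
```

Role-specific runcmd entries would be appended to the rendered user-data from the matching *-vm-plan.md file; that step is deliberately not guessed at here.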
SSH and identity
There is exactly one SSH user across the fleet: ze. Two operator keys (workstation and Mac laptop) plus the ocp-bootstrap host key are pre-authorized via cloud-init. Root login is disabled, password auth is disabled, and sudo NOPASSWD is configured so the ze user can run privileged commands without prompts (this is a single-operator lab; sudo-with-password would only slow recovery work).
Service-specific app admin usernames (zahid on Jenkins, zahid on Grafana, zahid on SigNoz, zahid on DefectDojo) are a separate identity — they live inside the application, not in the OS. Their passwords are kept in local-only ignored files under opp-full-plat/secrets/<service>/.
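The OS-level settings described above can be spot-checked from inside a freshly booted VM. The sketch below is illustrative only; the helper names are hypothetical. It relies on sshd -T, which prints the effective sshd settings with lowercase keys, and on sudo -n, which fails instead of prompting when a password would be required.

```shell
#!/usr/bin/env bash
# Sketch: spot-check the identity settings on a new VM.
# Assumption: these helper names are hypothetical, not lab tooling.

password_auth_disabled() {
  # stdin: output of `sshd -T`; succeeds only if password auth is off.
  grep -qix 'passwordauthentication no'
}

passwordless_sudo_ok() {
  # Succeeds only if the current user can sudo without a password prompt.
  sudo -n true 2>/dev/null
}

# Usage, as ze on the VM:
#   sudo sshd -T | password_auth_disabled && echo "password auth: off"
#   passwordless_sudo_ok && echo "sudo NOPASSWD: ok"
```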
What cloud-init never does
| Concern | Why not |
|---|---|
| Write secret values | Cloud-init logs are visible to anyone who can read /var/log/cloud-init-output.log. Any secret needed at first boot is pulled from Vault later, not seeded from user-data. |
| Open broad ingress firewalls | Per-VM firewall rules are role-specific; cloud-init sets only the SSH allowlist. Vault opens 8200/tcp only to approved clients via its own role; Jenkins opens 8080/tcp only to the HAProxy private address; MinIO opens 9000/tcp similarly. |
| Hardcode lab DNS into /etc/hosts | The recursor on the DNS VM is the single source of truth. The only host-level override is for VMs that need to resolve localhost to their own service. |
| Install GUI / desktop packages | Every VM is headless; serial console + SSH only. |
Hypervisor side: br30 and libvirt
The Linux bridge br30 is the lab’s data-plane bridge. Every cloud-init VM gets exactly one NIC on this bridge with a deterministic MAC. The bridge itself is configured outside cloud-init (in the hypervisor host’s netplan); libvirt sees it as a bridge-type network.
A typical libvirt domain XML excerpt:
<interface type='bridge'>
  <mac address='52:54:00:XX:XX:XX'/>
  <source bridge='br30'/>
  <model type='virtio'/>
</interface>
There is no NAT, no DHCP from libvirt — VM addresses are static and assigned by cloud-init’s network-config. The lab /16 gateway address is the upstream router; the lab DNS recursor address is the only resolver every VM points at.
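To make the MAC convention and the static addressing concrete, here is a minimal sketch that derives the pinned MAC from an allocated IP and renders a version 2 network-config keyed on that MAC. The helper names, the lab0 interface label, and every address are placeholders; the middle MAC bytes stay caller-supplied because they belong to the internal lab convention. The routes syntax assumes the netplan renderer used on Ubuntu.

```shell
#!/usr/bin/env bash
# Sketch: derive the pinned MAC from the allocated IP and render a static
# network-config (version 2) that matches on that MAC.
# Assumptions: mac_for_ip and render_network_config are hypothetical helpers;
# middle MAC bytes, address, gateway and resolver come from the allocation table.

mac_for_ip() {
  # $1 = middle two bytes of the MAC (lab convention), $2 = IPv4 address
  local middle="$1" ip="$2"
  # The last octet of the IP becomes the last MAC byte, as two hex digits.
  printf '52:54:00:%s:%02x\n' "$middle" "${ip##*.}"
}

render_network_config() {
  # $1 = MAC, $2 = address in CIDR form, $3 = gateway, $4 = DNS recursor
  cat <<EOF
version: 2
ethernets:
  lab0:
    match:
      macaddress: "$1"
    addresses:
      - $2
    routes:
      - to: default
        via: $3
    nameservers:
      addresses:
        - $4
EOF
}

# Usage (all values are placeholders):
#   mac="$(mac_for_ip XX:XX <ip>)"
#   render_network_config "$mac" <ip>/16 <gateway-ip> <lab-dns-recursor-ip> > network-config
```

Matching on the MAC rather than on an interface name keeps the config independent of how the guest kernel names the virtio NIC.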
Verifying a new VM came up clean
# from the hypervisor, after `virsh start`
virsh console <vm> # watch cloud-init complete
virsh domiflist <vm> # confirm the pinned MAC matches the allocation table
# from any lab VM, after DNS records are in place
dig @<lab-dns-recursor-ip> <vm>.sub.comptech-lab.com A +short
ssh ze@<vm>.sub.comptech-lab.com 'hostname && uname -r && cloud-init status'
Expected: cloud-init status returns status: done. If it returns running for longer than a minute or two on a freshly cloned image, check /var/log/cloud-init-output.log for a stalled runcmd.
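When the check is scripted rather than run by hand, a bounded polling loop avoids waiting forever on a stalled runcmd. The following is a sketch only: wait_for_cloud_init is a hypothetical helper, and the timeout and 5-second interval are arbitrary choices. It takes the status command as arguments, so the same loop works over SSH or locally.

```shell
#!/usr/bin/env bash
# Sketch: poll a status command until it reports "status: done", reports
# "status: error", or the timeout elapses.
# Assumption: wait_for_cloud_init is a hypothetical helper, not lab tooling.

wait_for_cloud_init() {
  # $1 = timeout in seconds; remaining args = command printing cloud-init status
  local timeout_s="$1" interval=5 elapsed=0 out
  shift
  while [ "$elapsed" -lt "$timeout_s" ]; do
    # A failing command (e.g. SSH not up yet) just counts as "not done yet".
    out="$("$@" 2>/dev/null || true)"
    case "$out" in
      *"status: done"*)  echo "done";  return 0 ;;
      *"status: error"*) echo "error"; return 1 ;;
    esac
    sleep "$interval"
    elapsed=$(( elapsed + interval ))
  done
  echo "timeout"
  return 2
}

# Usage:
#   wait_for_cloud_init 180 ssh ze@<vm>.sub.comptech-lab.com cloud-init status
```

A timeout result is the cue to open /var/log/cloud-init-output.log on the VM console.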
Cloud-init history workspace (read-only)
There is a historical /home/ze/cloud-init/ workspace on the operator workstation that holds older ip-allocations.yaml, retired VM specs, and pre-rebuild material. Per the operator workspace boundary memory, that path is read-only scrap — useful for archeology, never for new work. New VM allocations go into opp-full-plat/plans/disconnected-rebuild/environments/dc-lab/allocation-table.md, full stop.
Failure modes
| Symptom | Root cause | Fix | Prevention |
|---|---|---|---|
| virsh start succeeds but no IP on br30 | Wrong bridge name or network-config typo in cloud-init | Re-render network-config, regenerate the seed ISO, virsh destroy+start | Always copy network-config from a known-good VM of the same role |
| Cloud-init stuck at running for 10+ minutes | runcmd step pulling from an upstream that the lab can’t reach without the local mirror | Console in, kill the stuck command, fix the run line to use the local mirror/proxy | Don’t pull upstream apt or container content in cloud-init when local equivalents exist; pin to local mirrors for non-trivial packages |
| Two VMs answer on the same IP | Allocation table not updated before virsh start, or a retired allocation was reused | virsh destroy the duplicate; reconcile the allocation table | Before defining a new VM, ping the planned address and grep the virsh domiflist output for the planned MAC on every hypervisor |
| SSH login refuses key | cloud-init wrote authorized_keys with a different user OR /etc/ssh/sshd_config was tightened after first boot | Console in as ubuntu/ze, fix the keyfile, restart sshd | Standardize on ze user in every cloud-init template; review sshd_config drift |
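The duplicate-IP prevention step in the table can be scripted as a pre-flight check. This is a rough sketch under assumptions: preflight and mac_in_listing are hypothetical names, the hypervisor hostnames are supplied by the caller, SSH access to each hypervisor is assumed, and the ping flags are the Linux iputils ones. Because virsh domiflist takes a single domain, the sketch loops over every defined domain on each host.

```shell
#!/usr/bin/env bash
# Sketch: pre-flight check before defining a new VM.
# Assumptions: helper names are hypothetical; hypervisor hosts are passed in
# by the caller and are reachable over SSH.

mac_in_listing() {
  # stdin: `virsh domiflist` output; succeeds if the MAC appears (any case).
  grep -qiF -- "$1"
}

preflight() {
  # $1 = planned IP, $2 = planned MAC, remaining args = hypervisor hosts
  local ip="$1" mac="$2" hv dom
  shift 2
  # 1. Nothing should answer on the planned address yet.
  if ping -c 2 -W 1 "$ip" >/dev/null 2>&1; then
    echo "address $ip already answers" >&2
    return 1
  fi
  # 2. No defined domain on any hypervisor should already carry the MAC.
  for hv in "$@"; do
    for dom in $(ssh "$hv" virsh list --all --name); do
      if ssh "$hv" virsh domiflist "$dom" | mac_in_listing "$mac"; then
        echo "MAC $mac already used by $dom on $hv" >&2
        return 1
      fi
    done
  done
}

# Usage (hostnames are placeholders):
#   preflight <ip> 52:54:00:XX:XX:XX <hypervisor-1> <hypervisor-2>
```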
References
- opp-full-plat/plans/disconnected-rebuild/environments/dc-lab/allocation-table.md
- opp-full-plat/plans/disconnected-rebuild/environments/dc-lab/environment-profile.md
- opp-full-plat/plans/disconnected-rebuild/environments/dc-lab/*-vm-plan.md (Vault, Jenkins, SigNoz, monitoring, Trivy, DefectDojo, Docker runtime, Kafka, Redis, WSO2)
- ADRs 0009 (Jenkins VM), 0010 (SigNoz VM), 0012 (monitoring VM), 0013 (DefectDojo VM)