Failure modes and recovery
DNS is a single-VM dependency. What breaks, what to check first, and how to recover — covering both the recursor side and the authoritative side.
The lab DNS plane is one VM with two daemons. That has the obvious cost: if the VM dies, lab DNS stops. This page documents the symptoms operators actually see when something goes wrong, what to check first, and how to recover. It is meant to be readable in five minutes during an incident.
“DNS is down”: first triage
Run these on the operator workstation (or on any lab VM with dig available) in order:
# Does the recursor answer at all?
dig @<lab-recursor> google.com A +short +time=2 +tries=1
# Does the authoritative answer at all?
dig @<lab-auth> sub.comptech-lab.com SOA +short +time=2 +tries=1
# Does the VM respond to ICMP?
ping -c 2 -W 1 <pdns-vm-ip>
# Is the VM up at the libvirt level?
virsh list --all | grep pdns
The outcomes map to these likely root causes:
| Recursor | Auth | Ping | libvirt | Likely cause |
|---|---|---|---|---|
| ok | ok | ok | running | DNS is fine; the problem is somewhere else |
| no reply | ok | ok | running | pdns-recursor daemon down |
| ok | no reply | ok | running | pdns authoritative daemon down |
| no reply | no reply | ok | running | Both daemons down (rare) or firewall change |
| no reply | no reply | no reply | running | VM-level networking broken |
| no reply | no reply | no reply | stopped | VM died |
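The table above can be encoded as a small lookup so the triage result is unambiguous under pressure. This is a minimal sketch, not part of the lab tooling: the function name classify_dns_triage and the ok / no-reply argument encoding are assumptions made for illustration.

```shell
#!/bin/sh
# classify_dns_triage RECURSOR AUTH PING LIBVIRT
#   RECURSOR, AUTH, PING: "ok" or "no-reply"
#   LIBVIRT: "running", or any other state string reported by virsh
# Hypothetical helper; it only mirrors the rows of the triage table.
classify_dns_triage() {
    case "$1/$2/$3/$4" in
        ok/ok/ok/running)                   echo "DNS is fine; look elsewhere" ;;
        no-reply/ok/ok/running)             echo "pdns-recursor daemon down" ;;
        ok/no-reply/ok/running)             echo "pdns authoritative daemon down" ;;
        no-reply/no-reply/ok/running)       echo "both daemons down or firewall change" ;;
        no-reply/no-reply/no-reply/running) echo "VM-level networking broken" ;;
        no-reply/no-reply/no-reply/*)       echo "VM died" ;;
        *)                                  echo "combination not in the table; check each probe by hand" ;;
    esac
}
```

Any combination that the table does not list falls through to the last branch instead of guessing a cause.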
Daemon-level recovery
If only the recursor is down:
ssh ze@<pdns-vm>
systemctl status pdns-recursor
journalctl -u pdns-recursor -n 200 --no-pager
# If config issue is obvious, fix the .conf and:
sudo systemctl restart pdns-recursor
# Verify
dig @127.0.0.1 google.com A +short
If only the authoritative is down:
ssh ze@<pdns-vm>
systemctl status pdns
journalctl -u pdns -n 200 --no-pager
# Confirm SQLite is readable
ls -l /var/lib/powerdns/pdns.sqlite3
sudo -u pdns sqlite3 /var/lib/powerdns/pdns.sqlite3 "select count(*) from records;"
# Restart
sudo systemctl restart pdns
# Verify
dig @127.0.0.1 sub.comptech-lab.com SOA +short
Common one-off causes:
- A .conf edit left invalid syntax → systemd reports failed. Fix the file, restart.
- SQLite file permissions changed (e.g., after a manual root edit) → chown pdns:pdns /var/lib/powerdns/pdns.sqlite3, restart.
- Disk full → check df /var; rotate logs; the SQLite needs a few MB of headroom.
SQLite corruption recovery
If pdns starts but answers with errors, or pdnsutil show-zone sub.comptech-lab.com returns “no zone,” check the SQLite:
sudo -u pdns sqlite3 /var/lib/powerdns/pdns.sqlite3 "PRAGMA integrity_check;"
Expected: ok. If the integrity check fails:
- Stop the daemon so it doesn’t write into the corrupt DB.
- Restore from backup. Backups are the entire SQLite file copied somewhere safe. The lab convention is:
ssh ze@<pdns-vm>
sudo systemctl stop pdns
sudo cp /var/lib/powerdns/pdns.sqlite3 /var/lib/powerdns/pdns.sqlite3.corrupt-$(date -u +%Y%m%d-%H%M%S)
sudo cp <backup-location>/pdns.sqlite3.<date> /var/lib/powerdns/pdns.sqlite3
sudo chown pdns:pdns /var/lib/powerdns/pdns.sqlite3
sudo systemctl start pdns
sudo pdnsutil show-zone sub.comptech-lab.com | head
- Replay any record changes made since the backup. The reference is opp-full-plat/plans/disconnected-rebuild/environments/dc-lab/dns-records.md: every record addition is supposed to be journaled there with a serial number, so you can replay from the diff between the backup serial and HEAD.
If there is no backup, the dns-records doc itself is enough to rebuild the zone with pdnsutil add-record against an empty SQLite (re-initialize with pdnsutil create-zone sub.comptech-lab.com).
VM-level recovery
If the VM itself is dead, the recovery is:
- Confirm libvirt state on the host: virsh list --all | grep pdns. If shut off, try virsh start <pdns-vm> first. Watch with virsh console <pdns-vm>.
- If the qcow2 is intact but the VM can’t boot, rebuild a fresh pdns VM from the Ubuntu base image, then restore the SQLite from backup. The cloud-init for the pdns VM is small (Ubuntu + the two PowerDNS packages + the two config files). The SQLite holds the actual zone.
- If the qcow2 is gone, rebuild the VM from scratch using the same hostname / IP / MAC convention. Apply the configs and restore the SQLite. End state is identical.
The recovery time depends on whether you have a recent SQLite backup. With a current backup the operation is ~10 minutes (boot Ubuntu, apt install pdns-server pdns-backend-sqlite3 pdns-recursor, drop in the configs and the SQLite, restart). Without one, you replay from dns-records.md.
What lab clients see during a DNS outage
- Existing TCP connections keep working — they have IPs already.
- New connections that use hostnames fail at the getaddrinfo step. Curl reports “could not resolve host”; SSH reports “Could not resolve hostname”; Kafka clients fail to refresh metadata.
- OpenShift API access through api.<cluster>.sub.comptech-lab.com:6443 fails. Routes inside the cluster keep working only as long as the cluster’s CoreDNS already cached the relevant lookups (e.g., for image-registry); eventually pulls will fail too.
- HAProxy itself keeps routing traffic to backends it has already resolved; new client connections that need DNS to find HAProxy fail.
The blast radius is broad. This is the reason DNS HA is the single most likely future ADR for this lab: a secondary nameserver on a different host, AXFR replication, or PowerDNS Lightning Stream (which replicates the LMDB backend, so it would also mean moving off SQLite). None of those exist yet.
“It works on one VM but not another”
Most common cause: that VM’s resolver is pointing at the wrong address.
Quick check on the affected VM:
resolvectl status | head -20 # look for the DNS server line
cat /etc/systemd/resolved.conf.d/lab-dns.conf
cat /etc/resolv.conf
If DNS= shows the authoritative address instead of the recursor address, that’s the bug. Lab names will work (the authoritative has them) but public names will return REFUSED (the authoritative doesn’t recurse). Fix the drop-in, systemctl restart systemd-resolved, retest.
This is the single most common DNS support call in the lab. The two addresses are similar enough to confuse a hand-edit.
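The drop-in check can be scripted so that the two similar addresses are compared by the machine rather than by eye. The sketch below is illustrative only: dropin_dns and check_resolver are hypothetical helper names, and the recursor and authoritative addresses are passed in as arguments because this page does not state them.

```shell
#!/bin/sh
# dropin_dns FILE
# Print the value of the last uncommented DNS= line in a systemd-resolved drop-in.
dropin_dns() {
    sed -n 's/^[[:space:]]*DNS=[[:space:]]*//p' "$1" | tail -n 1
}

# check_resolver FILE RECURSOR_IP AUTH_IP
# Returns 0 only when DNS= lists the recursor and does not list the authoritative.
check_resolver() {
    dns=$(dropin_dns "$1")
    # DNS= is a space-separated list, so pad with spaces and match whole entries.
    case " $dns " in
        *" $3 "*) echo "BUG: DNS= points at the authoritative ($3)"; return 1 ;;
        *" $2 "*) echo "ok: DNS= points at the recursor ($2)" ;;
        *)        echo "unexpected DNS= value: ${dns:-<none>}"; return 1 ;;
    esac
}
```

A call would look like check_resolver /etc/systemd/resolved.conf.d/lab-dns.conf followed by the two lab addresses; a non-zero exit status means the drop-in needs fixing before systemd-resolved is restarted.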
“I added a record but it doesn’t resolve”
Two common causes:
- Negative cache. The recursor cached NXDOMAIN from your earlier dig before you added the record. Either query the authoritative directly (dig @<lab-auth> <name>.sub.comptech-lab.com A), or flush the negative cache (rec_control wipe-cache <fqdn>$; the trailing $ wipes the name and everything below it). See the recursor page for detail.
- Wrong zone serial / partial commit. If you used the HTTP API and didn’t commit the change correctly, the SOA can be ahead of the record set. pdnsutil list-zone sub.comptech-lab.com | grep <name> will tell you if the record is actually in the DB (list-zone prints the zone’s records, whereas show-zone reports zone-level and DNSSEC settings); if not, re-add via pdnsutil.
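The two causes above can be told apart by asking both servers the same question and comparing the answers. A minimal sketch, assuming the caller captures the dig +short output from each server; diagnose_record, LAB_AUTH and LAB_RECURSOR are hypothetical names, not lab conventions.

```shell
#!/bin/sh
# diagnose_record AUTH_ANSWER RECURSOR_ANSWER
# Each argument is the `dig +short` output from that server (empty = no answer).
diagnose_record() {
    if [ -z "$1" ]; then
        echo "record is not in the authoritative; re-add it"
    elif [ -z "$2" ]; then
        echo "authoritative has it, recursor does not; likely negative cache, wipe it"
    elif [ "$1" = "$2" ]; then
        echo "both agree; the record resolves"
    else
        echo "answers differ; recursor may be serving an older cached value"
    fi
}

# Example wiring (not executed here; LAB_AUTH and LAB_RECURSOR are placeholders):
#   diagnose_record "$(dig @"$LAB_AUTH" "$name" A +short)" \
#                   "$(dig @"$LAB_RECURSOR" "$name" A +short)"
```

The comparison is a plain string match, so a multi-record answer returned in a different order would be reported as a difference; sorting both outputs first would avoid that.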
Backups
Current backup convention (lightweight, single-VM):
- What: /var/lib/powerdns/pdns.sqlite3 (entire file). The SQLite is the entire authoritative state.
- Where: copy to a sibling host or to the operator workstation; the lab does not yet replicate it to MinIO. Future work.
- How often: before any change run, and on a daily timer. A few hundred KB; trivial to keep many copies.
- Verify: sqlite3 <backup> "PRAGMA integrity_check;" returns ok.
- Test restore: never tested in anger in this lab. A drill is overdue; the recovery procedure above is plausible but unproven against a real corrupted-DB scenario.
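The copy-then-verify convention could be wrapped in one function so that a backup is never kept without its integrity check. This is a sketch under assumptions: backup_pdns_sqlite is a hypothetical name, the destination directory is whatever the operator chooses, and the timestamp suffix simply follows the pattern used in the restore procedure.

```shell
#!/bin/sh
# backup_pdns_sqlite SRC DEST_DIR
# Copy SRC to DEST_DIR/pdns.sqlite3.<UTC timestamp>, verify the copy,
# and print the path of the verified backup. Returns non-zero on any failure.
backup_pdns_sqlite() {
    src=$1
    dest="$2/pdns.sqlite3.$(date -u +%Y%m%d-%H%M%S)"
    cp -p "$src" "$dest" || return 1
    # Check the copy rather than the live file, so a bad copy is caught too.
    result=$(sqlite3 "$dest" "PRAGMA integrity_check;") || return 1
    if [ "$result" != "ok" ]; then
        echo "integrity_check failed for $dest: $result" >&2
        return 1
    fi
    echo "$dest"
}
```

A plain cp of a database that is being written at that moment can capture an inconsistent state, which is one more reason to keep the integrity check on the copy; taking the backup before a change run, as the convention says, keeps that risk low.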
The PowerDNS Recursor doesn’t need a backup; its recursor.conf and recursor.d/*.conf are tiny and recoverable from the operator workstation.
“DNS works locally but not from another VM”
If dig @127.0.0.1 ... works on the pdns VM itself but a sibling VM’s dig @<lab-recursor> ... fails, check:
- allow-from: cat /etc/powerdns/recursor.d/10-lab-forwarder.conf should include the lab /16. If it was tightened to just 127.0.0.0/8, outside clients are silently denied.
- NIC binding: local-address must include the lab-side address, not just loopback. If a recent edit removed it, the recursor stopped listening on the lab address.
- Host firewall: ufw status or iptables -L INPUT on the pdns VM. Port 53/udp and 53/tcp must be open from the lab /16.
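The first two checks are both "does this setting still list the lab-side value", which is easy to script against the old-style key=value recursor config. The helpers below are a hypothetical sketch: conf_list and conf_has are made-up names, and the match is an exact string comparison on list entries, not CIDR containment.

```shell
#!/bin/sh
# conf_list KEY FILE
# Print the comma-separated value of KEY from a key=value config, one entry per line.
conf_list() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" "$2" \
        | tr ',' '\n' \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
        | sed '/^$/d'
}

# conf_has KEY VALUE FILE
# Succeeds when VALUE appears verbatim as one entry of KEY.
conf_has() {
    conf_list "$1" "$3" | grep -qxF "$2"
}
```

For example, conf_has allow-from followed by the lab /16 and the path to 10-lab-forwarder.conf exits non-zero if the range was dropped in a recent edit; the same call with local-address and the lab-side address covers the NIC binding check.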
Public-side gotcha: the auth daemon on the public NIC
The authoritative also listens on the public NIC for external queries. It answers for sub.comptech-lab.com and refuses anything else (correct). What it does not do is rate-limit external queries: if an internet scanner pounds on it, every query can cost a SQLite read, and that load is bounded but not zero. Future work: add recursive-cache-size-equivalent throttling at the authoritative side or restrict the public NIC to specific source ranges.
Things to monitor (planned, not all in place)
- pdns and pdns-recursor systemctl state (Prometheus node exporter + textfile or a service-state collector).
- Recursor cache hit ratio (rec_control get cache-hits cache-misses).
- Authoritative QPS and slow-query count.
- SQLite file size (alert if it drops unexpectedly, a sign of corruption).
- Successful response from a synthetic dig @<lab-recursor> <known-lab-record> and dig @<lab-recursor> www.google.com from a probe (Blackbox Exporter from the monitoring VM).
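Two of the planned signals reduce to small calculations that a textfile collector script could emit. The sketch below is an assumption-heavy illustration: the function names and the metric name lab_dns_unit_active are invented here, and nothing like this is deployed in the lab yet.

```shell
#!/bin/sh
# cache_hit_ratio HITS MISSES
# Print hits / (hits + misses) to four decimals; prints 0.0000 before any lookups.
cache_hit_ratio() {
    awk -v h="$1" -v m="$2" 'BEGIN {
        t = h + m
        if (t == 0) print "0.0000"; else printf "%.4f\n", h / t
    }'
}

# unit_metric UNIT STATE
# Emit one Prometheus text-format line: 1 when STATE is "active", otherwise 0.
# STATE is expected to be the output of `systemctl is-active UNIT`.
unit_metric() {
    if [ "$2" = "active" ]; then v=1; else v=0; fi
    printf 'lab_dns_unit_active{unit="%s"} %s\n' "$1" "$v"
}
```

The two counters would come from rec_control get cache-hits cache-misses, and the unit lines would be written to a file in the node exporter's textfile directory; both steps are left out here because the paths are lab-specific.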
References
- opp-full-plat/plans/disconnected-rebuild/environments/dc-lab/dns-records.md
- PowerDNS Recursor docs: doc.powerdns.com/recursor/running.html
- PowerDNS Authoritative docs: doc.powerdns.com/authoritative/running.html