Kafka brokers (KRaft cluster)

Three combined broker/controller VMs running Kafka 4.2.0 in KRaft mode at 30.30.30.24/25/26, with a pinned JMX exporter exposing ~10k kafka_* metrics per broker on :9404.

Kafka is deployed as three private VMs in KRaft mode — no Zookeeper. Each VM is a combined broker + controller voter. PLAINTEXT only on the internal LAN today; TLS/SASL/ACLs are deferred gates. The brokers expose Prometheus-format metrics on :9404 via a pinned jmx_prometheus_javaagent jar; OpenShift-side scraping is tracked separately under the Kafka monitoring follow-up.

What it is

Property | Value
Brokers | kafka-0, kafka-1, kafka-2
IPs | 30.30.30.24, 30.30.30.25, 30.30.30.26
Bootstrap | kafka-bootstrap.sub.comptech-lab.com:9092 (PLAINTEXT)
Mode | KRaft (no Zookeeper); 3-voter quorum, each node is broker + controller
Kafka version | 4.2.0
JVM | OpenJDK 21
Cluster id | fBNALO9WTje8UgaGET7XoQ
VM size | 4 vCPU / 8 GiB RAM / 80 G OS + 200 G data
Data path | /var/lib/kafka/kraft-combined-logs
Listener ports | 9092 (client PLAINTEXT), 9093 (controller, per-broker only)
JMX exporter port | :9404 open to 30.30.0.0/16
Public exposure | None (private lab only)
TLS / SASL / ACLs | No (gate; issue #17)
Backups / DR | No (gate; issue #12)

DNS is served from PowerDNS at 30.30.30.53. The kafka-bootstrap name is a multi-A round-robin across all three broker IPs.
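A minimal sketch for checking the round-robin record, assuming dig is available on the checking host. The function name missing_a_records is hypothetical; it reads dig +short output on stdin and reports any expected broker IP that is absent.

```shell
# Sketch: confirm a multi-A name resolves to every expected IP.
# stdin: output of `dig ... A +short`; arguments: the IPs that must be present.
missing_a_records() {
  local answers ip missing
  answers="$(cat)"
  missing=0
  for ip in "$@"; do
    # -x matches whole lines only, -F treats the IP as a literal string.
    if ! printf '%s\n' "$answers" | grep -qxF -- "$ip"; then
      echo "missing: $ip"
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   dig @30.30.30.53 kafka-bootstrap.sub.comptech-lab.com A +short \
#     | missing_a_records 30.30.30.24 30.30.30.25 30.30.30.26
```

The function returns non-zero when any address is missing, so it can gate a larger validation script. Answer order is ignored because round-robin responses are typically rotated.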

Topology — KRaft, not Zookeeper

Each VM runs a single kafka.service (systemd) that is both broker and controller. KRaft replaces Zookeeper with an internal Raft quorum of the three controllers. Practical consequence: there is no separate Zookeeper ensemble to operate and no zkCli.sh; quorum state is inspected with kafka-metadata-quorum.sh describe --status (3 voters expected; LeaderId must be non-empty).
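The two expectations (3 voters, non-empty LeaderId) can be checked mechanically. The sketch below is an assumption-laden parser, not part of the cluster tooling: the function names are hypothetical, and it assumes the describe --status output has LeaderId: and CurrentVoters: lines, with voters printed either as a plain list like [0,1,2] or as JSON-like objects carrying an "id" field. Confirm against the real output before relying on it.

```shell
# Sketch: parse `kafka-metadata-quorum.sh ... describe --status` output from stdin.
quorum_leader_id() {
  awk '$1 == "LeaderId:" { print $2 }'
}

quorum_voter_count() {
  awk '$1 == "CurrentVoters:" {
    line = $0
    n = gsub(/"id":/, "", line)            # JSON-like form: one "id" per voter
    if (n == 0) {                          # plain form: [0,1,2]
      sub(/^[^[]*\[/, "", line)
      sub(/\].*$/, "", line)
      n = (line == "") ? 0 : split(line, parts, ",")
    }
    print n
  }'
}

# quorum_healthy EXPECTED_VOTERS: succeeds when a leader is reported and the
# voter count equals EXPECTED_VOTERS.
quorum_healthy() {
  local expected status leader voters
  expected="$1"
  status="$(cat)"
  leader="$(printf '%s\n' "$status" | quorum_leader_id)"
  voters="$(printf '%s\n' "$status" | quorum_voter_count)"
  [ -n "$leader" ] && [ "$voters" = "$expected" ]
}

# Usage:
#   sudo /usr/local/bin/kafka-metadata-quorum.sh \
#     --bootstrap-server localhost:9092 describe --status | quorum_healthy 3
```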

The :9093 controller listener is firewalled to just the three broker IPs — no other host on the lab network can reach the controller plane. Client traffic uses :9092.

JMX exporter wiring

Metrics are collected by jmx_prometheus_javaagent loaded into the Kafka JVM as a -javaagent. The agent listens on :9404 and serves /metrics in Prometheus exposition format. Wiring is in three pieces:

1. Pinned jar artifact

Property | Value
Artifact | jmx_prometheus_javaagent-1.0.1.jar
Path on each broker | /opt/jmx-exporter/jmx_prometheus_javaagent.jar
Owner / mode | root:root 0644
SHA-256 | 7d61f737fd661610ccc14aea79764faa1ea94a340cbc8f0029b3d2edea3d80c1
Source | Maven Central: repo.maven.apache.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/1.0.1/
Integrity | Maven Central publishes .sha1 + .md5 siblings only (no .sha256 sibling). Both Maven-published sums were verified at download; the SHA-256 is recorded for our records and re-verified on every broker after copy.

The jar is pinned by version and by SHA-256. Re-verify the hash after any redistribution.
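One way to re-verify the pin is sketched below, assuming GNU coreutils sha256sum is present on the broker. The function name verify_sha256 is hypothetical; the path and hash in the usage comment are the ones recorded above.

```shell
# Sketch: compare a file's SHA-256 against a pinned value.
# Returns 0 on match, 1 on mismatch, 2 if the file cannot be read.
verify_sha256() {
  local file expected actual
  file="$1"
  expected="$2"
  [ -r "$file" ] || { echo "unreadable: $file" >&2; return 2; }
  # sha256sum prints "<hash>  <path>"; keep only the hash.
  actual="$(sha256sum "$file" | awk '{ print $1 }')"
  if [ "$actual" = "$expected" ]; then
    echo "OK $file"
  else
    echo "MISMATCH $file: got $actual" >&2
    return 1
  fi
}

# Usage:
#   verify_sha256 /opt/jmx-exporter/jmx_prometheus_javaagent.jar \
#     7d61f737fd661610ccc14aea79764faa1ea94a340cbc8f0029b3d2edea3d80c1
```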

2. Systemd drop-in

The drop-in lives at /etc/systemd/system/kafka.service.d/10-jmx-exporter.conf (identical on all three brokers):

[Service]
Environment="KAFKA_OPTS=-javaagent:/opt/jmx-exporter/jmx_prometheus_javaagent.jar=9404:/opt/jmx-exporter/kafka-jmx-config.yml"

Crucially the drop-in only adds KAFKA_OPTS. It does not touch server.properties, log4j2.yaml, KAFKA_HEAP_OPTS, LOG_DIR, or KAFKA_LOG4J_OPTS. Confirmed post-install with systemctl show kafka -p Environment.
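The post-install confirmation can be scripted roughly as follows. This is a sketch: the function name javaagent_wired is hypothetical, and it only asserts that the expected -javaagent argument appears somewhere in the Environment output, not that no other variable changed.

```shell
# Sketch: check `systemctl show kafka -p Environment` output (stdin) for the
# pinned agent jar and port.
javaagent_wired() {
  local jar port
  jar="$1"
  port="$2"
  # -F: literal match, so dots and slashes in the path are not regex syntax.
  grep -qF -- "-javaagent:${jar}=${port}:"
}

# Usage:
#   systemctl show kafka -p Environment \
#     | javaagent_wired /opt/jmx-exporter/jmx_prometheus_javaagent.jar 9404 \
#     && echo "agent wired"
```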

3. Host firewall (UFW)

9404/tcp                   ALLOW IN    30.30.0.0/16               # kafka jmx exporter

Added with sudo ufw allow proto tcp from 30.30.0.0/16 to any port 9404 comment 'kafka jmx exporter'. UFW default is deny (incoming) / allow (outgoing). No nftables direct rules in play.
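To catch an accidentally widened rule, the status output can be checked for any source other than the /16. The sketch below assumes the column layout shown above (port, ALLOW or ALLOW IN, source, optional # comment); the function names are hypothetical.

```shell
# Sketch: list the source of every ALLOW rule for PORT/tcp in `ufw status` output.
ufw_sources_for_port() {
  awk -v to="$1/tcp" '
    $1 == to && /ALLOW/ {
      sub(/[ \t]*#.*$/, "")            # drop trailing comment
      sub(/[ \t]+$/, "")               # drop trailing whitespace
      sub(/.*ALLOW( IN)?[ \t]+/, "")   # keep only the source column
      print
    }'
}

# port_only_from PORT CIDR: succeeds when at least one ALLOW rule exists for
# PORT/tcp and every such rule has exactly CIDR as its source.
port_only_from() {
  local sources
  sources="$(ufw_sources_for_port "$1")"
  [ -n "$sources" ] || return 1
  ! printf '%s\n' "$sources" | grep -qvxF -- "$2"
}

# Usage:
#   sudo ufw status verbose | port_only_from 9404 30.30.0.0/16
```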

JMX config — 17 rules, specific → generic

The exporter rule file (/opt/jmx-exporter/kafka-jmx-config.yml, sha256 21168263d8…3d92ab0c) is 13 specific rules + 4 generic catch-all rules. The order matters — JMX exporter evaluates rules top-to-bottom and stops at the first match, so per-topic / per-partition rules go above generic <>Value ones.

# | Rule | What it captures
1 | Per-topic per-partition <>Value gauges | Partition-level offsets, lag, ISR membership
2 | Per-topic broker metrics <>Count | BytesInPerSec, BytesOutPerSec, MessagesInPerSec per topic
3 | Per-broker replica fetcher metrics | Fetcher max lag, follower fetch rate
4 | KRaft raft-metrics *-total / *-rate | Raft message counts and rates
5 | KRaft raft other attrs → gauges | Quorum state, current leader, voter ids
6 | KRaft raft channel metrics | Inter-controller channel send/receive rates
7 | Per-user / per-client quota metrics | Throttle and quota gauges
8 | Network processor connection metrics | Active connections per processor
9 | Request metrics counters | RequestsPerSec, ErrorsPerSec by API
10 | Request metrics percentiles | Local time, total time, queue time p50/p95/p99
11 | Generic <>Count counters | Catch-all counters (rule 11 of 13 specific)
12 | Generic <>Percentile gauges | Catch-all percentiles
13 | Generic <>Value gauges | Includes UnderReplicatedPartitions and ActiveControllerCount
14 | JVM heap / non-heap memory | JVM memory pools
15 | JVM GC count / time | GC collector metrics
16 | JVM thread counters | Thread state counts
17 | Tail catch-all kafka.* → kafka_generic_* UNTYPED | Unknowns are surfaced rather than blackholed

Cardinality note

Rule 17 (the tail catch-all) emits about 4,045 of the ~10,000 kafka_* lines on each broker, roughly 40% of the scrape volume. This is functional and intentional (“don’t blackhole the unknown”), but it means UWM would ingest roughly 1.7x the lines it would without the catch-all (about 10,000 vs about 6,000 per broker), and the gap against a tightly curated config could approach 2x. If scrape latency or cardinality becomes a problem, the follow-up is either:

  • Drop low-value subtrees at scrape time with metricRelabelConfigs on the ServiceMonitor, keeping the broker-side config permissive; or
  • Tighten rule 17 to specific subtrees and accept fewer unknowns.

Scrape latency observed on localhost:9404/metrics (3 runs on kafka-0): 0.624s / 0.631s / 0.564s; response size ~1.34 MB. Line counts per broker: kafka-0 ~10,114, kafka-1 ~10,066, kafka-2 ~10,028.
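The catch-all share can be re-measured on any broker with a small awk filter. This is a sketch; the function name generic_share is hypothetical, and it counts only sample lines that start with kafka_ (not # HELP or # TYPE lines), which may differ slightly from how the line counts above were taken.

```shell
# Sketch: report "<generic> <total> <percent>" for kafka_* sample lines on stdin.
generic_share() {
  awk '
    /^kafka_/         { total++ }
    /^kafka_generic_/ { generic++ }
    END {
      if (total == 0) { print "0 0 0"; exit }
      printf "%d %d %.0f\n", generic, total, 100 * generic / total
    }'
}

# Usage:
#   curl -fsS localhost:9404/metrics | generic_share
```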

Rolling restart — zero-URP procedure

Restart the cluster one broker at a time, and confirm UnderReplicatedPartitions (URP) is 0 before moving on to the next broker. The procedure used during the JMX rollout was:

  1. Pre-flight on every broker.

    ssh ze@kafka-0 'sudo /usr/local/bin/kafka-topics.sh \
      --bootstrap-server kafka-bootstrap.sub.comptech-lab.com:9092 \
      --describe --under-replicated-partitions'

    Expect empty output. If anything comes back, abort — do not restart with pre-existing URP.

  2. Restart one broker.

    ssh ze@kafka-0 'sudo systemctl restart kafka'

  3. Wait for the broker to come back up.

    Poll until :9092 is listening and curl -fsS localhost:9404/metrics succeeds, at a 5-second interval with a 90-second budget. The budget is generous; observed recovery times were 9–14 s per broker during the JMX rollout.

  4. Verify KRaft quorum.

    ssh ze@kafka-0 'sudo /usr/local/bin/kafka-metadata-quorum.sh \
      --bootstrap-server localhost:9092 \
      describe --status'

    Expect 3 voters and a non-empty LeaderId. Leadership transitions during the rolling restart are normal and not a fault condition — they just mean one voter took over while the previous leader was down.

  5. Re-verify URP is zero before moving to the next broker.

    Same kafka-topics.sh --describe --under-replicated-partitions call as step 1. Empty output is required.

  6. Repeat for kafka-1, then kafka-2.

    Order is kafka-0 → kafka-1 → kafka-2. If URP is non-zero after a restart, stop; restarts resume only after URP returns to 0.

The discipline matters because Kafka tolerates one broker out at a time without partition unavailability. Two brokers out simultaneously, with a partition whose two replicas are on those two brokers, leaves that partition offline.
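The procedure above can be sketched as a loop. This is an illustration under stated assumptions, not the script used during the rollout: all function names are hypothetical, the ss-based port check is one of several ways to test for a listener, and reusing the 90-second budget for the post-restart URP check is an assumption. Host names, the ze user, and tool paths are taken from the steps above.

```shell
BOOTSTRAP="kafka-bootstrap.sub.comptech-lab.com:9092"
KAFKA_BIN="/usr/local/bin"

# poll_until BUDGET INTERVAL CMD...: retry CMD until it succeeds or BUDGET
# seconds have been spent sleeping.
poll_until() {
  local budget interval waited
  budget="$1"; interval="$2"; shift 2
  waited=0
  until "$@"; do
    [ "$waited" -lt "$budget" ] || return 1
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}

# urp_clear: succeeds only when stdin carries no non-blank output.
urp_clear() { ! grep -q '[^[:space:]]'; }

# check_urp HOST: a failed ssh or CLI call counts as "not clear".
check_urp() {
  local out
  out="$(ssh "ze@$1" "sudo $KAFKA_BIN/kafka-topics.sh --bootstrap-server $BOOTSTRAP --describe --under-replicated-partitions")" || return 1
  printf '%s' "$out" | urp_clear
}

# broker_up HOST: client port listening and exporter answering.
broker_up() {
  ssh "ze@$1" 'ss -lnt | grep -q ":9092 " && curl -fsS -o /dev/null localhost:9404/metrics'
}

# quorum_has_leader HOST: LeaderId line present with a non-empty value.
quorum_has_leader() {
  ssh "ze@$1" "sudo $KAFKA_BIN/kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status" \
    | grep -Eq '^LeaderId:[[:space:]]+[^[:space:]]'
}

rolling_restart() {
  local host
  for host in "$@"; do
    check_urp "$host" || { echo "URP not clear before $host, aborting" >&2; return 1; }
    ssh "ze@$host" 'sudo systemctl restart kafka' || return 1
    poll_until 90 5 broker_up "$host" || { echo "$host did not come back" >&2; return 1; }
    quorum_has_leader "$host" || { echo "no quorum leader after $host" >&2; return 1; }
    poll_until 90 5 check_urp "$host" || { echo "URP still non-zero after $host" >&2; return 1; }
  done
}

# Usage: rolling_restart kafka-0 kafka-1 kafka-2
```

The loop stops at the first failed gate and never proceeds to the next broker, which mirrors the stop rule in step 6.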

Operational guardrails

  • No public IPs, no public DNS, no HAProxy wildcard for Kafka. Private LAN only until TLS/SASL/ACLs close.
  • Don’t restart all three at once. Use the rolling procedure above. Even for cosmetic changes — the URP=0 gate is the only thing that prevents a half-restart from going wrong.
  • Don’t touch server.properties or log4j2.yaml from the JMX wiring. The drop-in adds KAFKA_OPTS and nothing else.
  • Don’t widen UFW. :9404 is open to 30.30.0.0/16 only, and that already covers UWM scraping from the OpenShift workers, which are all on the same /16.
  • Don’t break the rule order in kafka-jmx-config.yml. Specific rules go above generic ones; moving the tail catch-all up would mask everything below.

Issue #269 — Phase 1 broker side done; OCP scraping blocked

Issue #269 covers Kafka monitoring end-to-end. The broker side of Phase 1 is complete:

  • Pinned JMX exporter is loaded on all three brokers and serving on :9404.
  • 17-rule config landed; ~10k kafka_* lines per broker.
  • Rolling restart completed with URP=0 throughout (3-voter quorum maintained).
  • UFW updated to 30.30.0.0/16 only for :9404.

The OCP side of Phase 1 is blocked. The spoke’s argocd-cm resource.exclusions blocks both core/Endpoints and discovery.k8s.io/EndpointSlice from Argo sync. Without an Endpoints/EndpointSlice flavor that Argo will sync, UWM cannot scrape an external-IP target. Phase 2 (issue #273) either relaxes that exclusion or picks a target representation Argo will sync.

The platform-side Kafka monitoring stack ships a kafka-exporter deployment that scrapes :9092 (broker protocol), not the brokers’ :9404 JMX endpoint. Its four exporter alerts are independent of the broker-side javaagent work documented here.

Validation

# DNS
dig @30.30.30.53 kafka-0.sub.comptech-lab.com A +short
dig @30.30.30.53 kafka-bootstrap.sub.comptech-lab.com A +short

# Listener ports
ssh ze@kafka-0 'ss -lntp | grep -E "(9092|9093|9404)"'

# JMX exporter
ssh ze@kafka-0 'curl -fsS localhost:9404/metrics | wc -l'
ssh ze@kafka-0 'curl -fsS localhost:9404/metrics | grep -E "^kafka_server_replicamanager_underreplicatedpartitions|^kafka_controller_kafkacontroller_activecontrollercount"'

# KRaft quorum
ssh ze@kafka-0 'sudo /usr/local/bin/kafka-metadata-quorum.sh \
  --bootstrap-server localhost:9092 describe --status'

# URP
ssh ze@kafka-0 'sudo /usr/local/bin/kafka-topics.sh \
  --bootstrap-server kafka-bootstrap.sub.comptech-lab.com:9092 \
  --describe --under-replicated-partitions'

A scripted validation lives at opp-full-plat/scripts/rebuild/kafka/validate-kafka-kraft.sh.
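Independently of that script (whose contents are not reproduced here), the two gauges grepped above can be turned into pass/fail checks. The sketch below is hypothetical: metric_sum is an illustrative helper, and it assumes exporter sample lines end with the value (no trailing timestamp).

```shell
# Sketch: sum every sample of metric NAME found on stdin.
# Prints the sum; exits non-zero if the metric is absent.
metric_sum() {
  awk -v name="$1" '
    # Match "name value" or "name{labels} value"; "# HELP" / "# TYPE" lines
    # do not start with the metric name, so they are skipped.
    index($0, name) == 1 {
      next_char = substr($0, length(name) + 1, 1)
      if (next_char == " " || next_char == "{") { sum += $NF; seen = 1 }
    }
    END { if (seen) print sum + 0; else exit 1 }'
}

# Usage:
#   ssh ze@kafka-0 'curl -fsS localhost:9404/metrics' \
#     | metric_sum kafka_server_replicamanager_underreplicatedpartitions   # expect 0
```

Concatenating the output of all three brokers and summing kafka_controller_kafkacontroller_activecontrollercount should give 1, since only one controller is active at a time.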

Failure modes

Symptom: UnderReplicatedPartitions > 0 after a restart

Root cause. The just-restarted broker hasn’t caught up on replication yet; or a replica’s data disk filled; or a NIC dropped.

Fix. Wait. URP normally clears within seconds of the broker coming back. If it stays non-zero past a minute, check the broker logs (journalctl -u kafka -n 200) and replica fetcher metrics. Do not restart the next broker until URP returns to 0.

Prevention. Use the rolling restart procedure above; respect the URP=0 gate.

Symptom: :9404 returns nothing or connection refused

Root cause. The systemd drop-in didn’t get reloaded (systemctl daemon-reload missed), or KAFKA_OPTS was overridden by another env source, or UFW denied the source IP.

Fix. Verify systemctl show kafka -p Environment includes the -javaagent: line. Check ufw status for 9404/tcp ALLOW IN 30.30.0.0/16. If both look correct, look at journalctl -u kafka for jmx-exporter startup errors.

Prevention. systemctl daemon-reload before systemctl restart kafka when changing drop-ins.

Symptom: KRaft quorum reports only 2 voters

Root cause. One broker is down, or the controller listener on :9093 is blocked between voters.

Fix. Check kafka.service status on each broker. Verify that each broker’s firewall allow-lists :9093 for the other two broker IPs (not the wider lab /16). Restart the failed broker; quorum should re-form.

Prevention. Don’t change the :9093 firewall rules without re-verifying the three-IP allow list.

Symptom: a client reports LEADER_NOT_AVAILABLE after a broker restart

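Root cause. Leader election for the affected partitions had not finished when the client fetched metadata, or the client is still working from metadata cached before the restart and its next refresh hasn’t fired yet.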

Fix. Usually self-heals on the next metadata refresh. Verify the cluster itself is healthy (URP=0, 3 voters). If the client doesn’t recover, restart the client.

Prevention. Use a Kafka client library version that refreshes metadata on LEADER_NOT_AVAILABLE.

Symptom: ~40% of :9404 output looks like kafka_generic_* lines

Root cause. Rule 17 (tail catch-all) intentionally emits UNTYPED kafka_generic_* lines for un-mapped beans, so unknowns surface rather than disappear.

Fix. Not a fault. If scrape volume is a real problem, drop these at the ServiceMonitor with metricRelabelConfigs rather than removing rule 17.

Prevention. Don’t blackhole at the broker; filter at scrape time.

References

  • opp-full-plat/connection-details/kafka.md (planned) — runbook.
  • opp-full-plat/plans/disconnected-rebuild/environments/dc-lab/kafka-kraft-plan.md — VM plan.
  • Issue #269 — Kafka monitoring Phase 1.
  • Issue #273 — Kafka monitoring Phase 2 (broker JMX scrape from UWM).
  • Issue #17 — TLS/SASL/ACL hardening gate.
  • Issue #12 — retention, backup, and data durability.
  • HAProxy backend conventions — Kafka SNI passthrough notes if and when Kafka is fronted by HAProxy.

Last reviewed: 2026-05-12