GPU scheduling with Slurm
Install Slurm on the GPU server, expose the two L40S as gres:gpu resources, and teach the cohort the sbatch / srun / squeue muscle memory that ends GPU contention.
By the end of this module you will have:
- Slurm installed on the GPU server as a single-node “cluster” (controller and compute on the same host).
- The two L40S GPUs exposed as gres:gpu:l40s:2, schedulable per-job.
- A working fair-share policy so one student running a 4-hour fine-tune cannot block five others’ 5-minute experiments.
- The four commands every student needs in muscle memory: sbatch, srun, squeue, scancel.
- An example batch script the cookiecutter (from module 03) will reference for the rest of the course.
JupyterHub’s GPU pass-through from module 02 is fine for quick interactive work. The moment a student wants to kick off a multi-hour training run, you want it queued by Slurm, not occupying a notebook container.
Why Slurm — and why not Ray, k8s, or “just use tmux”
For ≤6 students sharing 2 GPUs, the options:
| Option | What it is | Why we picked / didn’t pick |
|---|---|---|
| Slurm | Job scheduler used by ~every HPC site on Earth | Picked. Battle-tested, GRES handles GPUs cleanly, every student should know sbatch anyway |
| Ray | Pythonic distributed compute | Great for serving and elastic clusters; weak as a sole job queue, and adds a stack to babysit |
| Kubernetes + GPU operator | k8s-native scheduling | Right answer at 20+ users or multi-host; overkill at 6 |
| Hand-rolled flock+tmux | “Whoever’s tmux session has the lock owns the GPU” | Works for 2 people, falls apart at 3 |
The thing Slurm gives you that the others don’t, for free: a queue. A student submits a job, gets a job ID, walks away. Slurm dispatches when a GPU frees up. No “is the GPU free right now?” Slack messages.
Step 1 — Install Slurm in single-node mode
Slurm has three daemons: slurmctld (controller), slurmd (compute node), and slurmdbd (accounting). For a single host we run all three on the same box, but logically separated:
sudo apt update
sudo apt install -y slurm-wlm slurm-wlm-doc slurmdbd
The package creates a slurm system user, the /etc/slurm/ config directory, and stubs out the three systemd units. We will replace slurm.conf and gres.conf with the right contents below, then enable the daemons.
A small MariaDB is needed for slurmdbd (the accounting DB):
sudo apt install -y mariadb-server
sudo mysql_secure_installation # set a root password
sudo mysql -u root -p <<'SQL'
CREATE DATABASE slurm_acct_db;
CREATE USER 'slurm'@'localhost' IDENTIFIED BY '__slurm_db_password__';
GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost';
FLUSH PRIVILEGES;
SQL
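slurmdbd also needs its own config file pointing at this database before it will start. A minimal sketch follows, assuming a local MariaDB, the database and user created above, and the same password placeholder. `render_slurmdbd_conf` is a hypothetical helper name, and the log path is an assumption chosen to match the slurmd log location used later in this module.

```shell
#!/usr/bin/env bash
# Sketch: write a minimal slurmdbd.conf to the given path.
# Assumptions: MariaDB on localhost, database slurm_acct_db, DB user "slurm".
render_slurmdbd_conf() {
  local out="$1" db_pass="$2"
  cat > "$out" <<EOF
AuthType=auth/munge
DbdHost=localhost
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=${db_pass}
StorageLoc=slurm_acct_db
LogFile=/var/log/slurm/slurmdbd.log
EOF
  # The file holds the DB password, so keep it readable by its owner only.
  chmod 600 "$out"
}

# Render to a scratch file first so the result can be reviewed before install.
draft="$(mktemp)"
render_slurmdbd_conf "$draft" '__slurm_db_password__'
echo "Draft written to $draft"
```

Once the draft looks right, something like `sudo install -o slurm -g slurm -m 600 "$draft" /etc/slurm/slurmdbd.conf` puts it in place with the ownership slurmdbd expects.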
Step 2 — slurm.conf for this server
/etc/slurm/slurm.conf:
ClusterName=gpu-server
SlurmctldHost=<gpu-server>
# Mostly defaults are fine; the bits that matter:
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
GresTypes=gpu
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
# Accounting → MariaDB via slurmdbd
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageEnforce=associations,limits,qos
JobAcctGatherType=jobacct_gather/cgroup
# Fair-share — half-life over 7 days; the longer you run, the lower your priority
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityWeightFairshare=10000
PriorityWeightAge=1000
PriorityWeightJobSize=500
# Node and partition. No custom QoS is defined yet, so every job runs under
# Slurm's default QoS; split later (interactive vs batch) if needed
NodeName=<gpu-server> CPUs=32 RealMemory=128000 \
Gres=gpu:l40s:2 State=UNKNOWN
PartitionName=gpu Nodes=<gpu-server> Default=YES \
MaxTime=24:00:00 State=UP
A few opinionated choices:
- cons_tres plus CR_Core_Memory is the modern way to do “I want 8 cores, 32 GB, and one GPU.” Older Slurm docs you’ll find online use cons_res; ignore them.
- PriorityWeightFairshare=10000 dominates the score, so a student who ran a 4-hour job an hour ago will sit behind a student who hasn’t run anything today. This is the policy lever you’ll tweak the most.
- MaxTime=24:00:00 caps any single job at 24 hours. Override with --time=... up to that, or raise the cap for a specific QoS.
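To see why the fair-share weight dominates, it helps to run the numbers. The multifactor plugin scores a pending job as a weighted sum of factors, each between 0 and 1. The sketch below uses only the three weights set in the slurm.conf above and ignores the other factors (partition, QoS, nice). `job_priority` is a hypothetical helper, not a Slurm command, and the factor values are made up for illustration.

```shell
#!/usr/bin/env bash
# Weighted sum as used by priority/multifactor, restricted to the three
# weights configured above. Each factor is a number between 0 and 1.
job_priority() {
  local fairshare="$1" age="$2" jobsize="$3"
  awk -v f="$fairshare" -v a="$age" -v s="$jobsize" \
    'BEGIN { printf "%.0f\n", 10000*f + 1000*a + 500*s }'
}

# Hypothetical heavy user: low fair-share factor, job has aged to the maximum.
job_priority 0.2 1.0 0.1   # prints 3050
# Hypothetical light user: high fair-share factor, job just submitted.
job_priority 0.9 0.0 0.1   # prints 9050
```

Even with the age factor maxed out, the heavy user's job scores well below the light user's fresh submission. That is the intended effect of making the fair-share weight ten times the age weight.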
Step 3 — Tell Slurm about the GPUs
/etc/slurm/gres.conf:
Name=gpu Type=l40s File=/dev/nvidia0
Name=gpu Type=l40s File=/dev/nvidia1
Cgroup config — /etc/slurm/cgroup.conf:
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
ConstrainDevices=yes is what makes a --gres=gpu:1 job see exactly one GPU device inside its cgroup. Without it, a job can run nvidia-smi and see both GPUs even though Slurm assigned it only one.
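One way to confirm the device constraint is to count the GPUs a job can actually see. The sketch below parses the output of `nvidia-smi -L`, which prints one line per visible GPU. `count_visible_gpus` is a hypothetical helper, and the sample line (including its UUID) is invented for the demonstration.

```shell
#!/usr/bin/env bash
# Count the GPUs listed by `nvidia-smi -L`, read from stdin.
# Each GPU appears as one line starting with "GPU <index>:".
count_visible_gpus() {
  # grep -c exits non-zero when the count is 0; "|| true" keeps the
  # function's exit status clean while still printing the count.
  grep -c '^GPU [0-9][0-9]*:' || true
}

# Demonstration with an invented sample line:
printf 'GPU 0: NVIDIA L40S (UUID: GPU-aaaa)\n' | count_visible_gpus   # prints 1
```

Once the daemons are running, `srun --gres=gpu:1 nvidia-smi -L | count_visible_gpus` should print 1 when the constraint is working, and would be expected to print 2 when it is not.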
Step 4 — Bring up the daemons
sudo systemctl enable --now slurmdbd
sudo systemctl enable --now slurmctld
sudo systemctl enable --now slurmd
sudo sinfo # one node listed, state IDLE
sudo scontrol show node <gpu-server> # Gres=gpu:l40s:2 listed
If sinfo shows the node as DOWN*, the slurmd log (/var/log/slurm/slurmd.log) almost always names the culprit — usually a mismatch between NodeName in slurm.conf and the host’s actual hostname, or a missing cgroup mount.
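The hostname mismatch is quick to rule out before digging through logs. A minimal sketch, assuming a single-node config where NodeName is a plain hostname rather than a range expression such as node[1-4]; both function names are hypothetical helpers.

```shell
#!/usr/bin/env bash
# Print every NodeName value defined in a slurm.conf-style file.
conf_node_names() {
  local conf="$1"
  # Match lines that begin with NodeName= (leading spaces allowed) and
  # keep only the value up to the first whitespace character.
  sed -n 's/^[[:space:]]*NodeName=\([^[:space:]]*\).*/\1/p' "$conf"
}

# Succeed if the given hostname appears among the NodeName entries.
check_node_name() {
  local conf="$1" host="$2"
  conf_node_names "$conf" | grep -qxF -- "$host"
}

# Only runs where a Slurm config is actually present.
if [ -r /etc/slurm/slurm.conf ]; then
  if check_node_name /etc/slurm/slurm.conf "$(hostname -s)"; then
    echo "NodeName matches $(hostname -s)"
  else
    echo "NodeName does not match $(hostname -s)"
  fi
fi
```

`slurmd -C` is a useful companion: it prints the NodeName, CPUs, and RealMemory values that slurmd detects on the host, which you can compare against the NodeName line in slurm.conf.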
Step 5 — Add the cohort to Slurm accounting
Slurm has its own user/account/association tree, separate from Linux users. Create the account once, then run two commands per student:
sudo sacctmgr -i add account cohort \
Description="Course cohort" Organization="course"
sudo sacctmgr -i add user alice DefaultAccount=cohort
sudo sacctmgr -i modify user alice set Fairshare=100
# … repeat for each student
Verify:
sacctmgr show association format=Account,User,Fairshare
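Every student now appears in the listing with Fairshare=100. Give the instructor a higher share if you want instructor jobs to move to the front of the queue during live sessions; fair-share only reorders pending jobs, so it will not preempt one that is already running.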
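With more than a couple of students, the per-user commands are easier to generate than to type. The sketch below prints the same two sacctmgr commands shown above for each username and executes nothing itself. `enrol_commands` is a hypothetical helper and the roster is made up.

```shell
#!/usr/bin/env bash
# Print the sacctmgr commands needed to enrol each student.
# Nothing is executed here: review the output first.
enrol_commands() {
  local account="$1" share="$2" user
  shift 2
  for user in "$@"; do
    printf 'sacctmgr -i add user %s DefaultAccount=%s\n' "$user" "$account"
    printf 'sacctmgr -i modify user %s set Fairshare=%s\n' "$user" "$share"
  done
}

# Hypothetical roster; replace with the cohort's Linux usernames.
enrol_commands cohort 100 alice bob carol
```

After checking the printed commands, piping them to `sudo bash` applies them in one go.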
Step 6 — The four commands every student needs
srun — interactive: “give me a shell on a GPU node, now, blocking until I exit”:
srun --gres=gpu:1 --cpus-per-task=4 --mem=16G --time=01:00:00 --pty bash
# Inside: nvidia-smi shows exactly one L40S
sbatch — queued: “run this script when a GPU is free, log the output, don’t make me wait”:
sbatch train.sh
# Returns: Submitted batch job 42
squeue — what’s in the queue:
squeue -u $USER # my jobs
squeue # everyone's, sorted by priority
scancel — kill a job:
scancel 42
sacct — what was in the queue (for last-night debugging):
sacct -u $USER --starttime today --format=JobID,JobName,State,Elapsed,MaxRSS
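These commands chain together more smoothly once the job ID is captured in a variable. A small sketch that pulls the ID out of sbatch's confirmation line; `parse_job_id` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Extract the numeric job ID from sbatch's confirmation line,
# e.g. "Submitted batch job 42" -> "42". Prints nothing on other input.
parse_job_id() {
  sed -n 's/^Submitted batch job \([0-9][0-9]*\).*/\1/p' <<<"$1"
}

parse_job_id "Submitted batch job 42"   # prints 42
```

In practice this looks like `jid=$(parse_job_id "$(sbatch train.sh)")`, followed by `squeue -j "$jid"` or `scancel "$jid"`. `sbatch --parsable` is an alternative that prints the job ID without the surrounding text.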
Step 7 — A real train.sh
The cookiecutter from module 03 ships this as scripts/train.sh. It’s the script every project’s “kick off a training run on the cluster” command resolves to:
#!/bin/bash
#SBATCH --job-name=churn-train
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=04:00:00
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err
set -euo pipefail
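# Slurm starts the job in the submission directory; the cd makes that explicit
cd "$SLURM_SUBMIT_DIR"
# logs/ must already exist at submit time: Slurm opens the --output and --error
# files before this script runs, so this mkdir only guards later writes
mkdir -p logs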
# Environment: every project carries its own uv lockfile
uv sync --frozen
# MLflow tracking URI comes from the platform environment; the fallback below
# is used only when the variable is not already set
export MLFLOW_TRACKING_URI="${MLFLOW_TRACKING_URI:-http://<gpu-server>:5000}"
export MLFLOW_EXPERIMENT_NAME="churn-${SLURM_JOB_ID}"
# The real work
uv run python -m src.train \
--data data/processed/train.parquet \
--config configs/baseline.yaml
Two patterns worth noticing:
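- --gres=gpu:1, not gpu:2. The reflex to grab “everything available” should be fought. Most training jobs want one GPU; the second is for module 14’s DDP work or somebody else’s job.
- uv sync --frozen at job start. The lockfile is the contract: if it is missing, the job fails fast instead of resolving the latest random transitives. Note that --frozen installs exactly what the lockfile says without checking it against pyproject.toml; use uv sync --locked if the job should also fail when the lockfile is out of date.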
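Two things can sink a job before it prints a single line: a missing logs/ directory (Slurm opens the --output file before the script body runs) and a missing lockfile. A pre-submit check catches both at the prompt rather than in the queue. This is a minimal sketch; `presubmit_check` is a hypothetical helper, not part of the cookiecutter.

```shell
#!/usr/bin/env bash
# Pre-submit check for a project directory. Returns non-zero, with a reason
# on stderr, if the job would fail before producing useful output.
presubmit_check() {
  local project="$1" status=0
  if [ ! -d "$project/logs" ]; then
    echo "missing $project/logs (create it before running sbatch)" >&2
    status=1
  fi
  if [ ! -f "$project/uv.lock" ]; then
    echo "missing $project/uv.lock (run 'uv lock' and commit the result)" >&2
    status=1
  fi
  return "$status"
}

# Typical use from the project root:
#   presubmit_check . && sbatch scripts/train.sh
```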
Step 8 — A “GPU is free?” check students can run
Drop this into /srv/shared/bin/gpu-status:
#!/usr/bin/env bash
# Quick read of: who's holding GPUs, how loaded they are.
echo "=== Slurm queue ==="
squeue --format='%.7i %.9P %.16j %.10u %.8T %.10M %.4D %.5C %.7m %.6b'
echo
echo "=== nvidia-smi ==="
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total \
--format=csv,noheader
Make it executable and PATH-visible:
sudo chmod 755 /srv/shared/bin/gpu-status
echo 'export PATH=/srv/shared/bin:$PATH' | sudo tee -a /etc/profile.d/platform.sh
Now any student can type gpu-status to see whether the cluster is busy. This one small script removes most “is the GPU free?” interruptions.
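If students want a one-word answer rather than a table, the nvidia-smi half of gpu-status can be reduced to a list of idle GPU indices. The sketch below reads the same CSV query output as the script above and treats a GPU as idle when its used memory is under a threshold. `idle_gpus` is a hypothetical helper, the 500 MiB default is an arbitrary cutoff, and the sample rows are invented.

```shell
#!/usr/bin/env bash
# Read CSV rows of index,name,utilization.gpu,memory.used,memory.total
# (nvidia-smi --format=csv,noheader) on stdin and print the index of every
# GPU whose used memory is below the threshold in MiB (default 500).
idle_gpus() {
  local threshold="${1:-500}"
  # Fields are separated by ", "; adding 0 turns "3 MiB" into the number 3.
  awk -F', *' -v t="$threshold" '($4 + 0) < t { print $1 }'
}

# Demonstration with two invented rows: GPU 0 idle, GPU 1 busy.
printf '0, NVIDIA L40S, 0 %%, 3 MiB, 46068 MiB\n1, NVIDIA L40S, 97 %%, 40210 MiB, 46068 MiB\n' \
  | idle_gpus   # prints 0
```

Low memory use is only a hint. A GPU can be allocated to a job that has not started using it yet, so squeue remains the authority on what is actually free.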
ADR 0004 — Slurm over k8s for compute scheduling
/srv/shared/adr/0004-slurm-scheduler.md:
# ADR 0004 — Compute scheduler: Slurm over Kubernetes
## Status
Accepted, 2026-05-15.
## Context
The platform needs a job queue that mediates ≤6 users' access to 2 GPUs.
The two reasonable options were Slurm and Kubernetes with a GPU operator.
## Decision
Slurm in single-node mode. The two L40S are exposed as `gres:gpu:l40s:2`,
scheduled via `cons_tres` with cgroup device constraint, and accounted with
a fair-share decay over 7 days.
## Consequences
- Pro: every student leaves the course knowing `sbatch` / `srun` — a skill
that transfers to any HPC or academic GPU cluster they'll encounter.
- Pro: queue semantics, fair-share, and per-job resource isolation are
built-in, not bolted on.
- Con: Slurm config is genuinely arcane; the first install will surface
a cgroup or hostname mismatch that costs an hour.
- Con: Scaling to a second host adds munge-key distribution and a real
network topology. Acceptable: course is single-node by design.
## Alternatives considered
- Kubernetes + GPU operator. Right answer at ≥20 users or multi-host.
Adds 3+ moving parts (k3s, GPU operator, KubeRay or Kueue) without
proportional value at 6 users.
- Ray cluster only. Excellent for elastic distributed compute; lacks a
proper queue and accounting. We may add Ray on top of Slurm for module
14's DDP work, not replace it.
Recap and what’s next
You now have:
- A working single-node Slurm scheduler with both L40S as gres resources.
- Fair-share configured so heavy users yield to light ones over a 7-day window.
- A standard train.sh and a gpu-status command that the rest of the track will assume.
- ADR pinning the choice for posterity.
The platform now schedules compute fairly. Next we put real data on it — bronze ingestion to the lake, the start of the data-engineering capstone.