Infrastructure: the GPU server

Provision the shared on-prem GPU server: user accounts, ProxyJump SSH, NVIDIA driver and CUDA, a sane directory layout, and the first ADR.

This is where the track starts being hands-on. By the end of this module you will have:

A Linux server with NVIDIA driver, CUDA, and the container toolkit installed and verified.
Per-student Linux accounts with public-key-only SSH and matching primary groups.
SSH access via ProxyJump from each student’s laptop, with the server reachable only over a private bridge.
A directory layout that separates per-user homes, shared datasets, and shared model artifacts.
A first ADR documenting the access pattern for future-you and any teammate who joins later.

You’ll do this once, as the instructor or platform owner. Every subsequent module assumes this is in place.

The server we’re starting from

The track was developed on a single physical box:

A dual-socket x86 server with 2× NVIDIA L40S GPUs (48 GB VRAM each).
≥128 GB system RAM and ≥2 TB NVMe scratch — the SSDs see a lot of dataset traffic.
Ubuntu Server 24.04 LTS installed via the standard installer.
One NIC on a private bridge — the server has an internal IP only and is not exposed to the public internet. Students reach it through a jump host or a tunnel.

Smaller boxes work for everything except the heaviest training modules (13: LoRA, 14: DDP CV). The minimum that finishes the whole track is one GPU with ≥24 GB VRAM and 64 GB system RAM; below that, plan to swap in smaller models.
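A quick preflight can tell you whether a given box clears that minimum. The sketch below is illustrative only: the function name `meets_minimum` and both thresholds are assumptions, set slightly under the nominal 24 GB and 64 GB because tools report a little less than the advertised capacity.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helper. Thresholds are assumptions, set a little
# below the nominal 24 GB VRAM / 64 GB RAM because reported totals run low.
MIN_VRAM_MIB=23000
MIN_RAM_KIB=$((60 * 1024 * 1024))

# meets_minimum <vram_mib> <ram_kib>: succeeds if both values clear the bar.
meets_minimum() {
  local vram_mib="$1" ram_kib="$2"
  [ "$vram_mib" -ge "$MIN_VRAM_MIB" ] && [ "$ram_kib" -ge "$MIN_RAM_KIB" ]
}

# Usage on the server (needs a working driver for nvidia-smi):
#   vram=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | sort -n | tail -1)
#   ram=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
#   meets_minimum "$vram" "$ram" && echo "OK for the full track" || echo "plan for smaller models"
```

The usage lines take the largest GPU in the box, since the stated minimum is a single GPU.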

If your only option is a cloud GPU instance, that’s fine. A g6e.2xlarge on AWS (one L40S), or an equivalent from a GPU-cloud provider, covers almost everything here; the one caveat is that module 14’s DDP section becomes “single-GPU, here’s what the multi-GPU code would look like.”

Why a dedicated server (and not laptops)

Three reasons, in order of how often they bite:

Reproducibility. Six laptops running six versions of CUDA against six versions of Python is six environments to debug. One server with one toolchain is one environment.
GPU time. Modern laptops don’t have L40S-class GPUs. A 7B LoRA fine-tune on a laptop GPU takes a weekend; on an L40S it takes an afternoon.
Long-running services. vLLM, MLflow, Prefect, and Superset all stay up for the duration of the course. Your students’ laptop lids should not be the kill switch.

The shape we adopt: laptops are editors and dashboards; the server is the runtime. Most students will work entirely through the browser (JupyterHub) and ssh (terminal). Their laptop never touches a GPU.

Step 1 — Install the NVIDIA stack

If the server already has working GPUs, skip to step 2 — but run the verification commands at the end of this step anyway. A “working” driver from six months ago is often the source of mystery training failures three modules later.

On a fresh Ubuntu 24.04:

# Block the open-source nouveau driver — required before installing NVIDIA's
sudo bash -c 'cat > /etc/modprobe.d/blacklist-nouveau.conf <<EOF
blacklist nouveau
options nouveau modeset=0
EOF'
sudo update-initramfs -u
sudo reboot
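
Once the server is back up, it is worth confirming that nouveau really stayed out of the kernel. A small sketch follows; the helper name `module_loaded` is made up for illustration, and it reads `/proc/modules`, the same data `lsmod` prints.

```shell
# module_loaded <name> [modules_file]: succeeds if the kernel module is loaded.
# The optional second argument exists only to make the function testable.
module_loaded() {
  local name="$1" file="${2:-/proc/modules}"
  awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$file"
}

# Usage:
#   module_loaded nouveau && echo "nouveau still loaded" || echo "nouveau is blocked"
```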

After reboot, install the recommended driver and CUDA toolkit. On Ubuntu the ubuntu-drivers helper picks the right one:

sudo apt update
sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers install            # installs the recommended NVIDIA driver
sudo apt install -y nvidia-cuda-toolkit
sudo reboot

For container workloads (Slurm jobs, vLLM, anything running in Docker), install the NVIDIA Container Toolkit:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
# Use NVIDIA's generic stable/deb repository; the per-distribution lists are deprecated
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit docker.io
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
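
The `nvidia-ctk runtime configure` step registers an `nvidia` runtime in Docker's daemon configuration. A rough way to confirm that it did, assuming the default `/etc/docker/daemon.json` location (the helper name is hypothetical, and the check is a plain text match rather than a JSON parse):

```shell
# has_nvidia_runtime [daemon_json]: succeeds if the file mentions an "nvidia"
# key. This is a text heuristic, not a full JSON validation.
has_nvidia_runtime() {
  local file="${1:-/etc/docker/daemon.json}"
  grep -Eq '"nvidia"[[:space:]]*:' "$file" 2>/dev/null
}

# Usage:
#   has_nvidia_runtime && echo "nvidia runtime registered" || echo "re-run nvidia-ctk"
#   docker info --format '{{json .Runtimes}}'   # what the running daemon actually loaded
```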

Verification. All three of these should succeed before you continue:

nvidia-smi                                            # driver sees both L40S
nvcc --version                                        # CUDA toolkit installed
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

The third one is the one that catches the most install bugs — a working host driver does not guarantee a working container runtime. If the Docker variant fails with “could not select device driver,” re-run nvidia-ctk runtime configure --runtime=docker and restart Docker.
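
If you expect to re-run these checks after driver or kernel updates, they can be wrapped in a small script. This is one possible sketch; `run_check` and `verify_gpu_stack` are names invented here, and the image tag is the one used above.

```shell
# run_check <label> <command...>: runs the command quietly, prints PASS or
# FAIL, and returns non-zero on failure.
run_check() {
  local label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $label"
  else
    echo "FAIL  $label"
    return 1
  fi
}

# verify_gpu_stack: the three checks from this step, in order. It keeps going
# after a failure so you see every broken layer at once.
verify_gpu_stack() {
  local failed=0
  run_check "host driver"       nvidia-smi || failed=1
  run_check "CUDA toolkit"      nvcc --version || failed=1
  run_check "container runtime" docker run --rm --gpus all \
    nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi || failed=1
  return "$failed"
}

# Usage:
#   verify_gpu_stack || echo "fix the FAIL lines before continuing"
```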

Step 2 — A sane directory layout

Before you create a single user account, decide where things live. The layout below is what the rest of this track assumes:

/home/<user>            per-student home, default 50 GB quota
/srv/shared             shared assets, read-only for students by default
  ├── datasets/         raw and curated datasets used across modules
  ├── models/           downloaded base models (LLMs, vision backbones)
  └── scratch/          large temporary files (writable by the cohort group)
/var/lib/<service>      stateful service data (mlflow, minio, postgres, …)
/opt/<service>          third-party software installs

Two principles drive this:

Per-user homes stay small. They hold dotfiles, project repos, and small caches. Big things (models, datasets, training outputs) live in /srv/shared or in MinIO.
Shared is read-only by default. Students can request a write path; nobody can rm -rf a dataset by accident.

Create the layout:

sudo mkdir -p /srv/shared/{datasets,models,scratch}
sudo groupadd cohort
sudo chown -R root:cohort /srv/shared
sudo chmod 2755 /srv/shared /srv/shared/datasets /srv/shared/models
sudo chmod 2775 /srv/shared/scratch    # group-writable for scratch only

We will add every student account to the cohort group in the next step. The setgid bit (the leading 2) on each directory makes new files and subdirectories inherit the cohort group.
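
To confirm the modes landed as intended, something like the following can help. It is a sketch that assumes GNU `stat` (the default on Ubuntu); `check_mode` is a hypothetical helper name.

```shell
# check_mode <path> <expected_octal>: compares the path's permission bits
# (as printed by GNU stat) against the expected value.
check_mode() {
  local path="$1" expected="$2" actual
  actual=$(stat -c '%a' "$path") || return 2
  if [ "$actual" = "$expected" ]; then
    echo "ok    $path ($actual)"
  else
    echo "WRONG $path (have $actual, want $expected)"
    return 1
  fi
}

# Usage:
#   check_mode /srv/shared         2755
#   check_mode /srv/shared/scratch 2775
#   stat -c '%U:%G' /srv/shared    # expect root:cohort
```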

Step 3 — Per-student user accounts

The pattern: each student gets a Linux user, password-disabled, SSH-key-only, in the cohort group, and (if they need direct GPU access for non-Slurm experiments) in a gpu-users group.

A small Ansible playbook handles this cleanly, but for ≤6 students a shell script is fine. Save this as ~/provision-student.sh on the server:

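#!/usr/bin/env bash
# Usage: sudo ./provision-student.sh <username> <ssh-public-key-file>
set -euo pipefail

if [ "$#" -ne 2 ]; then
  echo "Usage: sudo $0 <username> <ssh-public-key-file>" >&2
  exit 1
fi

STUDENT="$1"                                         # not USER: that would shadow the shell's own $USER
KEYFILE="$2"

ssh-keygen -l -f "$KEYFILE" >/dev/null               # abort early if this is not a valid public key

# -m creates the home directory; Ubuntu also creates a matching primary group
useradd -m -s /bin/bash -G cohort,docker "$STUDENT"
passwd -l "$STUDENT"                                 # disable password login

install -d -m 700 -o "$STUDENT" -g "$STUDENT" "/home/$STUDENT/.ssh"
install -m 600 -o "$STUDENT" -g "$STUDENT" "$KEYFILE" "/home/$STUDENT/.ssh/authorized_keys"

# Disk quota (soft 40 GB, hard 50 GB) is applied separately with setquota;
# it requires the `quota` package and an entry in /etc/fstab (see notes below).
echo "Provisioned $STUDENT. Test: ssh $STUDENT@$(hostname)"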

Run it once per student:

sudo ./provision-student.sh alice /tmp/alice.pub
sudo ./provision-student.sh bob   /tmp/bob.pub
# … etc
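
If students send their keys as files named after their usernames, the per-student calls can be looped. This is a sketch under that assumption; the directory layout (`<dir>/<username>.pub`) and the helper names are hypothetical, and the username rule is a conservative guess rather than the exact rule `useradd` enforces.

```shell
# username_from_keyfile <path>: "keys/alice.pub" -> "alice"
username_from_keyfile() {
  basename "$1" .pub
}

# valid_username <name>: lowercase letter or underscore first, then lowercase
# letters, digits, underscores, or hyphens; at most 32 characters.
valid_username() {
  local LC_ALL=C   # byte-wise ranges, so [a-z] cannot match uppercase
  [[ "$1" =~ ^[a-z_][a-z0-9_-]{0,31}$ ]]
}

# provision_all <dir>: calls the provisioning script once per *.pub file.
# Assumes ./provision-student.sh from above and one key file per student.
provision_all() {
  local dir="$1" key name
  for key in "$dir"/*.pub; do
    [ -e "$key" ] || continue          # no matches: the glob stays literal
    name=$(username_from_keyfile "$key")
    if ! valid_username "$name"; then
      echo "skipping $key: '$name' is not a usable username" >&2
      continue
    fi
    sudo ./provision-student.sh "$name" "$key"
  done
}

# Usage:
#   provision_all /tmp/keys
```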

Three opinionated choices baked in:

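Password login is locked, not just unset. passwd -l puts a ! in front of the password hash so that no password can match, and a student cannot undo it with passwd because changing a password requires the current one. Only root can unlock the account (passwd -u) or set a new password.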
Students are in the docker group. This makes JupyterHub’s DockerSpawner work without extra sudo plumbing in module 02. The tradeoff is that docker group membership is effectively root — a student who tries can mount the host filesystem. Acceptable risk for a six-person classroom; not acceptable for production.
No sudo. Students never get root. Anything they need root for (installing a system package), they ask the instructor.
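
These three choices can be spot-checked per account. The sketch below parses the status line that `passwd -S` prints on Ubuntu, where the second field is `L` for a locked password; `password_locked` is a name invented for this example.

```shell
# password_locked <status_line>: succeeds if the second field of a
# `passwd -S` line is "L" (locked).
password_locked() {
  local fields
  read -r -a fields <<< "$1"
  [ "${fields[1]:-}" = "L" ]
}

# Usage (as root on the server):
#   password_locked "$(passwd -S alice)" && echo "alice: password locked"
#   id -nG alice        # expect alice, cohort and docker in the list
#   sudo -l -U alice    # expect a message that alice may not run sudo
```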

For disk quotas, install the quota package, add usrquota,grpquota to the /home entry in /etc/fstab (this assumes /home is its own filesystem), remount, and run quotacheck -cugm /home && quotaon -v /home. Then set per-user limits with setquota, which takes block limits in 1 KiB units:

sudo setquota -u alice 41943040 52428800 0 0 /home    # soft 40 GB, hard 50 GB

Quotas are optional but recommended — one accidentally-saved 200 GB checkpoint will otherwise fill /home for everyone.
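
The two large numbers in the setquota call are the limits expressed in 1 KiB blocks (40 × 1024 × 1024 and 50 × 1024 × 1024). A small wrapper avoids doing that arithmetic by hand; both function names are hypothetical, and the wrapper leaves the inode limits at 0 (unlimited) as in the command above.

```shell
# gib_to_blocks <GiB>: converts GiB to the 1 KiB blocks setquota expects.
gib_to_blocks() {
  echo $(( $1 * 1024 * 1024 ))
}

# set_home_quota <user> <soft_gib> <hard_gib>: hypothetical wrapper around
# setquota for the /home filesystem.
set_home_quota() {
  sudo setquota -u "$1" "$(gib_to_blocks "$2")" "$(gib_to_blocks "$3")" 0 0 /home
}

# Usage:
#   for u in alice bob; do set_home_quota "$u" 40 50; done
#   sudo repquota -s /home     # review usage and limits per user
```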

Step 4 — ProxyJump SSH from student laptops

Each student adds a stanza to their ~/.ssh/config (on their laptop) that routes through a jump host. The jump host is whatever public-facing box can reach the GPU server — often a small VM in the same private network with a public IP.

# ~/.ssh/config on the student's laptop
Host gpu-jump
  HostName jump.example.com
  User <student>
  IdentityFile ~/.ssh/id_ed25519

Host gpu-server
  HostName 10.0.0.10                # private IP of the GPU server
  User <student>
  IdentityFile ~/.ssh/id_ed25519
  ProxyJump gpu-jump

Now ssh gpu-server works from anywhere the jump host is reachable. Two practical notes:

ServerAliveInterval 60 at the top of the file keeps long-running SSH sessions (training jobs, tail -f) from being killed by the jump host’s idle timeout.
Forward only what you need. Don’t LocalForward JupyterHub (port 8000) unless you’re debugging it; once you have a proper reverse proxy with TLS in module 02, the browser path is the right one.
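
When a stanza does not behave as expected, `ssh -G <host>` prints the configuration ssh would actually use, without connecting. The helper below is a sketch (its name is made up) that pulls one option out of that output; ssh -G prints option names in lowercase.

```shell
# effective_option <key>: reads `ssh -G <host>` output on stdin and prints
# the value of the first line whose option name equals <key>.
effective_option() {
  awk -v k="$1" '$1 == k { $1 = ""; sub(/^ /, ""); print; exit }'
}

# Usage on the laptop:
#   ssh -G gpu-server | effective_option proxyjump            # expect: gpu-jump
#   ssh -G gpu-server | effective_option serveraliveinterval  # 60 if set as suggested
```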

If your private network has no jump host, replace the ProxyJump with a VPN such as WireGuard: same effect, different mechanism. The principle is the same: the GPU server is never on a public IP.

Step 5 — Verify a student can use a GPU

From the instructor’s laptop, log in as one of the students and confirm everything end-to-end:

ssh gpu-server                              # using the alice@gpu-server stanza
nvidia-smi                                  # student sees both L40S
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

The third command is the real test: it confirms that an unprivileged student can run a containerized GPU workload. If it fails with a permission error, the student isn’t in the docker group. Run sudo gpasswd -a alice docker and have them log out and back in.
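
Group membership is easy to check before reaching for gpasswd. A minimal sketch, with `has_group` as a hypothetical helper that takes the space-separated list printed by `id -nG`:

```shell
# has_group "<space-separated groups>" <group>: succeeds if the group is in
# the list. The unquoted $1 is deliberate: it splits the list into words.
has_group() {
  local g
  for g in $1; do
    [ "$g" = "$2" ] && return 0
  done
  return 1
}

# Usage:
#   has_group "$(id -nG alice)" docker || echo "alice is not in docker"
#   has_group "$(id -nG)" docker       # run as the student: checks this login session
```

The second form matters because an open session keeps its old group list until the student logs in again.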

For a sanity check that PyTorch will see the GPUs (module 14 depends on this), drop a Python check into a temp container:

docker run --rm --gpus all pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime \
  python -c "import torch; print(torch.cuda.device_count(), torch.cuda.get_device_name(0))"

Expect output like 2 NVIDIA L40S.

Step 6 — Write the first ADR

You will accumulate small, load-bearing decisions over the course — what’s installed where, what’s shared, what gets a service vs. a per-user install, when to bump CUDA. An ADR (Architecture Decision Record) is one short markdown file per decision, and it pays for itself the first time you ask “wait, why did we do it this way?” three months in.

Convention: ADRs live in a docs/adr/ folder inside the platform repo, which you will set up in module 02. For now, create /srv/shared/adr/ (sudo mkdir -p /srv/shared/adr) and drop the file there.

/srv/shared/adr/0001-access-pattern.md:

# ADR 0001 — Access pattern for the GPU server

## Status
Accepted, 2026-05-15.

## Context
The course's shared GPU server hosts every student's notebook environment and
every long-running service (JupyterHub, MLflow, Prefect, vLLM, Postgres, MinIO).
The server holds research-grade data and consumes meaningful electricity. It
must not be reachable from the public internet, and individual student access
must be revocable in one place.

## Decision
- The GPU server has a private IP only. It is reachable through a jump host
  (`jump.example.com`) that owns the public-facing SSH endpoint.
- Each student has a Linux user, password-disabled, public-key SSH only,
  in groups `cohort` and `docker`. No `sudo`.
- Students reach the server via `ProxyJump` through the jump host.
- Browser surfaces (JupyterHub, MLflow, Superset, etc.) are exposed via a
  reverse proxy on the jump host with TLS; only the proxy talks to the
  GPU server's private IP.

## Consequences
- Revoking a student's access to the GPU server is one command:
  `userdel -r <student>` on the GPU server. ProxyJump authenticates on the
  jump host first, so remove the student's login or key there as well.
- Outages of the jump host break student access. Have a console-server fallback
  for the GPU server.
- All browser traffic is HTTPS at the edge; HTTP is acceptable on the
  loopback / private network between the proxy and backends.

## Alternatives considered
- Direct public IP on the GPU server. Rejected: too much attack surface, and
  every service we'll add (MLflow UI, Prefect UI) opens another port to defend.
- VPN-only access (WireGuard). Viable, but adds a client-side install to
  every student's laptop. We may revisit if the jump-host approach becomes
  painful.

This file will move into the platform’s git repo in module 02 and become the first entry in docs/adr/.
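
Later ADRs can follow the same NNNN-title.md naming as 0001-access-pattern.md. As a convenience, the next number can be derived from the files already present. This is a sketch, and `next_adr_number` is a hypothetical helper, not part of any ADR tooling.

```shell
# next_adr_number <dir>: prints the next four-digit ADR number, based on the
# highest NNNN- prefix among the markdown files in <dir>.
next_adr_number() {
  local dir="$1" last=0 f n
  for f in "$dir"/[0-9][0-9][0-9][0-9]-*.md; do
    [ -e "$f" ] || continue            # no matches: the glob stays literal
    n=$(basename "$f" | cut -c1-4)
    n=$((10#$n))                       # force base 10 so 0008 is not read as octal
    if [ "$n" -gt "$last" ]; then
      last=$n
    fi
  done
  printf '%04d\n' $((last + 1))
}

# Usage:
#   next_adr_number /srv/shared/adr
```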

Recap and what’s next

You now have:

A GPU server with a verified NVIDIA stack at both the host and container layers.
Per-student Linux accounts with SSH-key-only access, no passwords, no sudo, and a sane group layout.
A documented access pattern (one ADR) that captures why the setup looks the way it does.

The server is reachable but it doesn’t yet host anything a student would call a platform. There’s no JupyterHub, no source control, no environment manager. That’s module 02.

Next: 02 — The toolchain.