Overview
What this track covers, who it's for, and how to use it.
This is a self-paced, opinionated track on doing real data science — from an empty on-prem server with two GPUs through to four shipped capstones: a tracked tabular ML model behind a service endpoint, a modeled data warehouse driving live dashboards, a self-hosted RAG application running on a private LLM, and a multi-GPU computer-vision model trained on data you annotated yourself. By the end you will have stood up the platform a working data team would actually depend on, and used it to ship work in each of the four DS domains.
The track is opinionated. It teaches the choices a small DS team would make today — JupyterHub for multi-user notebooks, Slurm for GPU scheduling, MinIO as the object store, MLflow for tracking and registry, vLLM for self-hosted inference, Prefect for orchestration, dbt for warehouse modeling, Superset for BI, Label Studio for annotation, Prometheus + Grafana + Langfuse + Evidently for observability. Where alternatives exist (Airflow, Kubeflow, KServe, Triton, Qdrant) I name them and say why we picked the other thing.
Who this is for
Engineers and analysts who:
- Write Python comfortably — packages, virtualenvs, basic typing, the difference between a script and a library.
- Read and write SQL — joins, window functions, subqueries.
- Are comfortable on a Linux shell — SSH, systemd,
ps/grep/curl/tmux. - Have stats literacy — distributions, hypothesis tests, regression intuition. No measure theory required.
- Are not trying to become a research ML scientist. This is the working-engineer’s surface-area version of modern DS.
If you’ve never trained a model or never written a SQL GROUP BY, work through a Kaggle “Titanic” tutorial and an intro SQL course first. A weekend; come back.
What you’ll build
A real platform, then four real projects on it. The platform:
- JupyterHub with a DockerSpawner — per-user containers, GPU pass-through where requested.
- Gitea for self-hosted git, PRs, and CI via Gitea Actions.
- MinIO as an S3-compatible object store backing the data lake, MLflow artifacts, and DVC remotes.
- Postgres as the shared metadata store, the warehouse, and the vector DB (via
pgvector). - MLflow tracking server with the Model Registry promoted-by-tag pattern.
- Slurm scheduling jobs across two NVIDIA L40S GPUs (48 GB VRAM each, 96 GB total).
- vLLM serving a 7–14B LLM with an OpenAI-compatible API for the whole cohort.
- Prefect, dbt, Superset for the data engineering and analytics stack.
- Label Studio for image and text annotation.
- Prometheus + Grafana, Langfuse, and Evidently for the three observability surfaces (system, LLM, drift).
The four capstones — each one a real artifact, not a notebook demo:
| # | Capstone | What it is |
|---|---|---|
| 1 | Tabular ML | Calibrated churn/credit-risk model, tracked with Optuna in MLflow, promoted in the registry, served behind FastAPI with drift monitoring. |
| 2 | Data engineering + analytics | Scheduled Prefect ingestion → bronze/silver/gold on MinIO + Postgres → dbt with tests and docs → Superset dashboards answering defined business questions. |
| 3 | NLP / LLM | RAG application over a real corpus on a self-hosted vLLM endpoint, with hybrid retrieval, reranking, Langfuse tracing, and a ≥30-example offline eval set. |
| 4 | Computer vision | Custom-annotated dataset (≥500 images in Label Studio), transfer-learned model trained via DDP across both L40S, exported to ONNX, served with batched FastAPI inference. |
The platform you’ll stand up
Reading the diagram:
- Blue layer — cohort-facing surfaces. Everything a student opens in a browser or pushes git to.
- Grey layer — the four self-service entry points: notebooks, code, dashboards, annotations.
- Amber layer — platform services. Schedulers, registries, pipelines, and the LLM endpoint.
- Green / red layer — state and compute. Object store, metadata DB, dataset versions, and the two GPUs.
- Purple layer — observability, dotted-grey edges into everything it watches.
- Dashed red — GPU consumption paths (Slurm jobs, vLLM inference).
The 17-module map
| # | Module | What you build |
|---|---|---|
| 00 | Overview (this page) | — |
| 01 | Infrastructure: the GPU server | User accounts, ProxyJump SSH, GPU verification, shared-server etiquette, the first ADR |
| 02 | The toolchain | JupyterHub multi-user, uv envs, Gitea, pre-commit, Docker basics |
| 03 | The reproducibility stack | MinIO + DVC + MLflow tracking server + a cookiecutter project template |
| 04 | GPU scheduling with Slurm | gres:gpu, sbatch/srun, fair-share, a working multi-user GPU queue |
| 05 | The data lake: bronze ingestion | Parquet on MinIO, source connectors, partitioning, compaction |
| 06 | The warehouse: Postgres + dbt | Staging → marts, tests and docs, snapshots, incremental models |
| 07 | Orchestration with Prefect | Flows, retries, idempotency, backfills, observability of pipelines |
| 08 | Analytics with Superset (Capstone 2) | Dashboards on the modeled warehouse — the data-engineering capstone |
| 09 | First ML model (Capstone 1) | Tabular pipeline + Optuna + MLflow registry — the business-ML capstone |
| 10 | Serving and drift | FastAPI + Evidently + Prometheus/Grafana wired to the production model |
| 11 | Self-hosted LLM | vLLM on the L40S, model selection, quantization, OpenAI-compatible endpoint |
| 12 | RAG application (Capstone 3) | pgvector + hybrid retrieval + reranker + Langfuse evals — the LLM capstone |
| 13 | LoRA fine-tuning | QLoRA on a 7–8B base with peft and trl, task-specific eval harness |
| 14 | Computer vision (Capstone 4) | PyTorch + Label Studio + DDP across both GPUs, ONNX export, batched serving |
| 15 | CI/CD with Gitea Actions | Lint, test, train-on-tag, deploy-on-merge — for every project on the platform |
| 16 | What’s next | The roadmap — what we deliberately left out and where to find it |
Each module assumes the previous ones. Expect 45–90 minutes per module on a real server; capstone modules (08, 09, 12, 14) run longer because they end in a shipped artifact.
How to use this track
Three patterns work:
- Sequential. Start at 01, walk to 14, then ship one project to production in 15. Best for a first-time cohort.
- Reference. Jump to the module you need from the sidebar. Best if you already have a platform and want one specific layer (e.g., “I have JupyterHub, I just need vLLM”).
- Instructor-led + async student work. Walk through one module per session live; students execute the steps on their own accounts between sessions.
Prerequisites and setup
- A Linux host with 2× NVIDIA L40S (or any modern dual-GPU box with ≥48 GB total VRAM). Smaller setups work for everything except module 13 (LoRA fine-tuning) and module 14 (DDP CV training).
- Root or sudo on that host to provision user accounts and install services in module 01.
- SSH from each student’s laptop to that host. A jump host is fine — module 01 covers the ProxyJump pattern.
- A GitHub account per student. Used initially to ssh in; switches to Gitea on the server in module 02.
- No Python, no CUDA, no Docker installed up front. Module 02 installs the toolchain.
A note on what’s not here
This track is deliberately focused on doing data science on a single shared server. The following are out of scope and live in adjacent tracks:
- Kubernetes-based ML platforms. Kubeflow, KServe, and the multi-cluster scaling story have their own track (Kubeflow).
- Cloud-native MLOps. SageMaker, Vertex AI, Databricks. The patterns transfer, but this track is intentionally on-prem.
- Deep theory. No proofs, minimal derivations. We pick the best practical default and point at the paper for the curious.
- Pure data science (no platform). If you want to read CSVs into a notebook and call it a day, this track is overkill — go through a Kaggle competition first.