~10 min read · updated 2026-05-15

Overview

What this track covers, who it's for, and how to use it.

This is a self-paced, opinionated track on doing real data science — from an empty on-prem server with two GPUs through to four shipped capstones: a tracked tabular ML model behind a service endpoint, a modeled data warehouse driving live dashboards, a self-hosted RAG application running on a private LLM, and a multi-GPU computer-vision model trained on data you annotated yourself. By the end you will have stood up the platform a working data team would actually depend on, and used it to ship work in each of the four DS domains.

The track is opinionated. It teaches the choices a small DS team would make today — JupyterHub for multi-user notebooks, Slurm for GPU scheduling, MinIO as the object store, MLflow for tracking and registry, vLLM for self-hosted inference, Prefect for orchestration, dbt for warehouse modeling, Superset for BI, Label Studio for annotation, Prometheus + Grafana + Langfuse + Evidently for observability. Where alternatives exist (Airflow, Kubeflow, KServe, Triton, Qdrant), we name them and say why we picked something else.

Who this is for

Engineers and analysts who:

  • Write Python comfortably — packages, virtualenvs, basic typing, the difference between a script and a library.
  • Read and write SQL — joins, window functions, subqueries.
  • Are comfortable on a Linux shell — SSH, systemd, ps/grep/curl/tmux.
  • Have stats literacy — distributions, hypothesis tests, regression intuition. No measure theory required.
  • Are not trying to become a research ML scientist. This is the working-engineer’s surface-area version of modern DS.

If you’ve never trained a model or never written a SQL GROUP BY, work through a Kaggle “Titanic” tutorial and an intro SQL course first. A weekend; come back.

What you’ll build

A real platform, then four real projects on it. The platform:

  • JupyterHub with a DockerSpawner — per-user containers, GPU pass-through where requested.
  • Gitea for self-hosted git, PRs, and CI via Gitea Actions.
  • MinIO as an S3-compatible object store backing the data lake, MLflow artifacts, and DVC remotes.
  • Postgres as the shared metadata store, the warehouse, and the vector DB (via pgvector).
  • MLflow tracking server with the Model Registry promoted-by-tag pattern.
  • Slurm scheduling jobs across two NVIDIA L40S GPUs (48 GB VRAM each, 96 GB total).
  • vLLM serving a 7–14B LLM with an OpenAI-compatible API for the whole cohort.
  • Prefect, dbt, Superset for the data engineering and analytics stack.
  • Label Studio for image and text annotation.
  • Prometheus + Grafana, Langfuse, and Evidently for the three observability surfaces (system, LLM, drift).
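
As a rough sanity check on the GPU sizing above, the weights of a dense model take about (parameter count × bytes per parameter) of VRAM. The sketch below is back-of-the-envelope arithmetic only. It assumes 16-bit weights (2 bytes each) and ignores KV cache, activations, and runtime overhead, all of which the serving process also needs room for. The function name is illustrative, not part of any library.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Approximate VRAM needed for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bytes_per_param / 1e9


GPU_VRAM_GB = 48  # one L40S, per the platform list above

for size in (7, 14):
    weights = weight_memory_gb(size)  # assumption: 16-bit weights, no quantization
    headroom = GPU_VRAM_GB - weights
    print(f"{size}B @ 16-bit: ~{weights:.0f} GB weights, ~{headroom:.0f} GB left on one GPU")
```

Under these assumptions a 14B model's weights take about 28 GB, leaving roughly 20 GB of a single 48 GB card for everything else. Quantized weights would shrink the first number.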

The four capstones — each one a real artifact, not a notebook demo:

# | Capstone | What it is
1 | Tabular ML | Calibrated churn/credit-risk model, tuned with Optuna and tracked in MLflow, promoted in the registry, served behind FastAPI with drift monitoring.
2 | Data engineering + analytics | Scheduled Prefect ingestion → bronze/silver/gold on MinIO + Postgres → dbt with tests and docs → Superset dashboards answering defined business questions.
3 | NLP / LLM | RAG application over a real corpus on a self-hosted vLLM endpoint, with hybrid retrieval, reranking, Langfuse tracing, and a ≥30-example offline eval set.
4 | Computer vision | Custom-annotated dataset (≥500 images in Label Studio), transfer-learned model trained via DDP across both L40S, exported to ONNX, served with batched FastAPI inference.
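
Capstone 3 mentions hybrid retrieval, which means merging a keyword ranking and a vector ranking into one list. This page does not say which fusion method the track uses. The sketch below shows reciprocal rank fusion (RRF), one common choice, purely as an illustration. The document IDs and the two input rankings are made up, and `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1 for the top hit of each list.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a keyword search and a vector search.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that appear near the top of both lists (here `doc_b` and `doc_a`) rise above documents found by only one retriever, which is the behavior a reranker then refines.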

The platform you’ll stand up

Reading the diagram:

  • Blue layer — cohort-facing surfaces. Everything a student opens in a browser or pushes git to.
  • Grey layer — the four self-service entry points: notebooks, code, dashboards, annotations.
  • Amber layer — platform services. Schedulers, registries, pipelines, and the LLM endpoint.
  • Green / red layer — state and compute. Object store, metadata DB, dataset versions, and the two GPUs.
  • Purple layer — observability, dotted-grey edges into everything it watches.
  • Dashed red — GPU consumption paths (Slurm jobs, vLLM inference).

The 17-module map

# | Module | What you build
00 | Overview (this page) |
01 | Infrastructure: the GPU server | User accounts, ProxyJump SSH, GPU verification, shared-server etiquette, the first ADR
02 | The toolchain | JupyterHub multi-user, uv envs, Gitea, pre-commit, Docker basics
03 | The reproducibility stack | MinIO + DVC + MLflow tracking server + a cookiecutter project template
04 | GPU scheduling with Slurm | gres:gpu, sbatch/srun, fair-share, a working multi-user GPU queue
05 | The data lake: bronze ingestion | Parquet on MinIO, source connectors, partitioning, compaction
06 | The warehouse: Postgres + dbt | Staging → marts, tests and docs, snapshots, incremental models
07 | Orchestration with Prefect | Flows, retries, idempotency, backfills, observability of pipelines
08 | Analytics with Superset (Capstone 2) | Dashboards on the modeled warehouse — the data-engineering capstone
09 | First ML model (Capstone 1) | Tabular pipeline + Optuna + MLflow registry — the business-ML capstone
10 | Serving and drift | FastAPI + Evidently + Prometheus/Grafana wired to the production model
11 | Self-hosted LLM | vLLM on the L40S, model selection, quantization, OpenAI-compatible endpoint
12 | RAG application (Capstone 3) | pgvector + hybrid retrieval + reranker + Langfuse evals — the LLM capstone
13 | LoRA fine-tuning | QLoRA on a 7–8B base with peft and trl, task-specific eval harness
14 | Computer vision (Capstone 4) | PyTorch + Label Studio + DDP across both GPUs, ONNX export, batched serving
15 | CI/CD with Gitea Actions | Lint, test, train-on-tag, deploy-on-merge — for every project on the platform
16 | What’s next | The roadmap — what we deliberately left out and where to find it

Each module assumes the previous ones. Expect 45–90 minutes per module on a real server; capstone modules (08, 09, 12, 14) run longer because they end in a shipped artifact.
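
Module 04 in the map above covers `gres:gpu` and `sbatch`/`srun`. As a preview of what a GPU job request looks like, here is a minimal sketch that renders a batch script as text. The job name, the `train.py` command, and the helper function are hypothetical. The `#SBATCH` directives are standard Slurm options, but the track's actual cluster configuration may differ.

```python
def render_sbatch(job_name: str, command: str, gpus: int = 1,
                  time_limit: str = "01:00:00") -> str:
    """Render a Slurm batch script that requests `gpus` GPUs via --gres."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --gres=gpu:{gpus}",     # generic resource request: N GPUs
        f"#SBATCH --time={time_limit}",   # wall-clock limit, HH:MM:SS
        "#SBATCH --output=%x-%j.out",     # %x = job name, %j = job ID
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: both GPUs for a training script.
script = render_sbatch("train-demo", "srun python train.py", gpus=2)
print(script)
```

Saving the rendered text to a file and passing it to `sbatch` would queue the job. Slurm then holds it until the requested GPUs are free, which is what makes a shared two-GPU server workable for a cohort.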

How to use this track

Three patterns work:

  • Sequential. Start at 01, walk to 14, then ship one project to production in 15. Best for a first-time cohort.
  • Reference. Jump to the module you need from the sidebar. Best if you already have a platform and want one specific layer (e.g., “I have JupyterHub, I just need vLLM”).
  • Instructor-led + async student work. Walk through one module per session live; students execute the steps on their own accounts between sessions.

Prerequisites and setup

  • A Linux host with 2× NVIDIA L40S (or any modern dual-GPU box with ≥48 GB total VRAM). Smaller setups work for everything except module 13 (LoRA fine-tuning) and module 14 (DDP CV training).
  • Root or sudo on that host to provision user accounts and install services in module 01.
  • SSH from each student’s laptop to that host. A jump host is fine — module 01 covers the ProxyJump pattern.
  • A GitHub account per student. It is used initially to ssh in; the workflow switches to Gitea on the server in module 02.
  • No Python, no CUDA, no Docker installed up front. Module 02 installs the toolchain.
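
The jump-host access mentioned above relies on OpenSSH's ProxyJump feature, exposed on the command line as `-J`. The sketch below only builds and prints the command. All host and user names are placeholders, and the helper function is illustrative rather than something the track ships.

```python
import shlex


def ssh_via_jump(user: str, target: str, jump: str) -> list[str]:
    """Build an ssh command that reaches `target` through `jump` using -J (ProxyJump)."""
    return ["ssh", "-J", jump, f"{user}@{target}"]


# Placeholder names; substitute your own jump host, GPU server, and account.
cmd = ssh_via_jump(
    user="student01",
    target="gpu-server.internal",
    jump="student01@jump.example.org",
)
print(shlex.join(cmd))
# ssh -J student01@jump.example.org student01@gpu-server.internal
```

The same hop can be made permanent with a `ProxyJump` line in `~/.ssh/config`, so that a plain `ssh` to the server alias goes through the jump host automatically.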

A note on what’s not here

This track is deliberately focused on doing data science on a single shared server. The following are out of scope and live in adjacent tracks:

  • Kubernetes-based ML platforms. Kubeflow, KServe, and the multi-cluster scaling story have their own track (Kubeflow).
  • Cloud-native MLOps. SageMaker, Vertex AI, Databricks. The patterns transfer, but this track is intentionally on-prem.
  • Deep theory. No proofs, minimal derivations. We pick the best practical default and point at the paper for the curious.
  • Pure data science (no platform). If you want to read CSVs into a notebook and call it a day, this track is overkill — go through a Kaggle competition first.

Next: 01 — Infrastructure: the GPU server.