~90 min read · updated 2026-05-15

Self-hosted LLM with vLLM

Run a 7–14B-parameter LLM on one L40S with vLLM, behind an OpenAI-compatible API the whole cohort uses. Model selection, quantization, batching, and the load test that decides which model wins.

By the end of this module you will have:

  • vLLM serving a 7–14B-parameter LLM on one L40S (leaving the other for training jobs).
  • An OpenAI-compatible endpoint that any LangChain, LlamaIndex, or raw openai SDK call points at, with no code changes from “use GPT-4 in dev” to “use the platform model in prod.”
  • A clear understanding of AWQ vs. GPTQ vs. FP16 tradeoffs on this hardware, with measured numbers from your own machine.
  • A load test showing the server’s tokens-per-second and concurrency ceiling for the chosen model.
  • A shared embedding endpoint for the RAG capstone in module 12.

Why vLLM, not Ollama or text-generation-inference

Three serious options:

Option | Strengths | Weaknesses
vLLM | Industry-best throughput via PagedAttention; OpenAI-compatible API; strong quantization support; production-grade | Slightly less friendly for “I want to try 10 models in an afternoon”
Ollama | Trivial install, friendly model library, great laptop story | Lower throughput; weaker OpenAI compatibility; single-user assumptions
HF TGI | Polished, similar throughput goals | Less momentum than vLLM right now; commercial license friction

For a multi-user course where six people hit the endpoint concurrently and their requests need to be batched together on the GPU, vLLM wins. For “try a new model on my laptop for an hour,” use Ollama and don’t apologize.

Step 1 — Pick the model

The choice you make here is the model the whole cohort uses for the next two modules. Optimize for:

  • Fits comfortably on one L40S. 48 GB VRAM is enough for 14B models at FP16, 30B at 4-bit. We leave headroom for the KV cache.
  • Strong instruction following. This is RAG and tool-using territory; raw next-token quality matters less than “does it follow the schema.”
  • Permissive license. Don’t pick something the cohort can’t redistribute their results from.
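
The "fits comfortably" criterion comes down to simple arithmetic: parameters times bytes per parameter. The sketch below is a rough estimate only; it ignores the KV cache, activations, and the small overhead quantized formats add for scales and zero points. The helper name `weight_gb` is made up for illustration.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Ignores KV cache, activations, and quantization metadata.

GPU_VRAM_GB = 48  # one L40S


def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight footprint in GB (10^9 bytes)."""
    return params_billion * bits_per_param / 8


candidates = [
    ("14B at FP16", 14, 16),
    ("30B at 4-bit", 30, 4),
    ("30B at FP16", 30, 16),
]

for label, params_b, bits in candidates:
    gb = weight_gb(params_b, bits)
    verdict = "fits" if gb < GPU_VRAM_GB else "does not fit"
    print(f"{label}: ~{gb:.0f} GB of weights, {verdict} in {GPU_VRAM_GB} GB")
```

Whatever is left after the weights is what the KV cache and activations have to share, which is why a model that merely "fits" is not yet a model that serves long contexts.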

Three reasonable picks as of mid-2026:

Model | Params | Best fit | License
Qwen2.5-14B-Instruct | 14B | FP16 fits, strong tool use | Apache 2.0
Llama-3.1-8B-Instruct | 8B | FP16, plenty of headroom for big context | Llama community license
Mistral-Small-3-24B-Instruct | 24B | AWQ 4-bit fits with room | Apache 2.0

We’ll use Qwen2.5-14B-Instruct as the default in the rest of the track because it’s a sweet spot on this hardware. Substitute freely.

Step 2 — Cache the weights

Download once into /srv/shared/models/. That way every restart is fast and the cohort isn’t pulling 30 GB from HuggingFace six times:

sudo -u <hf_user> bash -lc "
  pip install --upgrade huggingface_hub
  huggingface-cli login                                  # paste a read token
  huggingface-cli download Qwen/Qwen2.5-14B-Instruct \
    --local-dir /srv/shared/models/Qwen2.5-14B-Instruct
"

/srv/shared/models/ is the lake’s _base-models/ exception from ADR 0003 — not in DVC, not in git, but stable enough to live on disk.

Step 3 — Run vLLM

The simplest possible config. /opt/vllm/docker-compose.yml:

services:
  vllm:
    image: vllm/vllm-openai:v0.6.5
    container_name: vllm
    restart: unless-stopped
    runtime: nvidia
    environment:
      - HUGGING_FACE_HUB_TOKEN=__hf_token__
      - VLLM_API_KEY=__platform_internal_key__
    volumes:
      - /srv/shared/models:/models
    ipc: host
    ports:
      - "8000:8000"
    command: >
      --model /models/Qwen2.5-14B-Instruct
      --served-model-name qwen2.5-14b
      --gpu-memory-utilization 0.85
      --max-model-len 32768
      --enable-prefix-caching
      --disable-log-requests
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']                          # pin to GPU 0
              capabilities: [gpu]

Five choices worth understanding:

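  • --gpu-memory-utilization 0.85 lets vLLM claim about 41 GB of the 48 GB for weights, activations, and a preallocated KV cache, and leaves ~7 GB for the CUDA context and anything else on the GPU. Higher values buy more KV cache but risk an OOM if anything else touches the card.
  • --max-model-len 32768 caps context. vLLM has to fit at least one sequence of this length in the KV cache, so if you set 128k the KV cache will eat your VRAM (or the server will refuse to start). Tune to your real use.
  • --enable-prefix-caching is free latency for RAG. If 100 requests share a 5K-token system prompt, vLLM computes its KV cache once and reuses it.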
  • device_ids: ['0'] pins vLLM to GPU 0. GPU 1 stays free for Slurm training jobs.
  • VLLM_API_KEY turns on the OpenAI-style Authorization: Bearer ... requirement. Use it; never run an open endpoint on a shared server.
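
To see why the context cap matters, it helps to price the KV cache per token. The sketch below assumes the Qwen2.5-14B architecture has 48 layers, 8 KV heads, and a head dimension of 128; those numbers are my assumption, so confirm them in the model's config.json before relying on the result.

```python
# KV cache per token = 2 (K and V) x layers x KV heads x head dim x bytes per value.
# Architecture values are ASSUMED for Qwen2.5-14B; verify against config.json.

NUM_LAYERS = 48
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # FP16

VRAM_GB = 48.0
WEIGHTS_GB = 28.0  # 14B parameters x 2 bytes


def kv_bytes_per_token() -> int:
    """Bytes of KV cache one token occupies across all layers."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def kv_gb(num_tokens: int) -> float:
    """KV cache size in GB (10^9 bytes) for a given number of cached tokens."""
    return kv_bytes_per_token() * num_tokens / 1e9


left_after_weights = VRAM_GB - WEIGHTS_GB  # upper bound, before any other overhead

for context_len in (32_768, 131_072):
    need = kv_gb(context_len)
    verdict = "fits" if need < left_after_weights else "does not fit"
    print(f"one {context_len}-token sequence: ~{need:.1f} GB of KV cache, "
          f"{verdict} in the ~{left_after_weights:.0f} GB left after weights")
```

Under these assumptions a single 32k sequence costs about 6.4 GB, while a single 128k sequence needs more than the card has left once the FP16 weights are loaded.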

Bring it up:

sudo docker compose -f /opt/vllm/docker-compose.yml up -d
sudo docker logs -f vllm

First start takes ~60 seconds to load the FP16 weights. The log line Application startup complete. means it’s ready.
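
If you would rather script the wait than watch the logs, a small polling loop works. This is a minimal sketch: it assumes the server answers HTTP 200 on /health once it is up, and the function names `http_ok` and `wait_until_ready` are made up for illustration.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response: not ready yet.
        return False


def wait_until_ready(
    check: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll check() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Example (assumed URL):
# wait_until_ready(lambda: http_ok("http://localhost:8000/health"))
```

Passing the check as a callable keeps the loop testable without a running server and lets you swap in a stricter probe later, such as a one-token completion.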

Step 4 — Use it

From any notebook or shell:

from openai import OpenAI

client = OpenAI(
    base_url="http://<gpu-server>:8000/v1",
    api_key="__platform_internal_key__",
)

resp = client.chat.completions.create(
    model="qwen2.5-14b",
    messages=[
        {"role": "system", "content": "You are a concise SQL analyst."},
        {"role": "user", "content": "Write a SQL query that returns the top 5 boroughs by average tip rate from a table called fct_trips with columns total_amount, fare_amount, tip_amount, pickup_borough."}
    ],
    temperature=0.2,
)
print(resp.choices[0].message.content)

The whole cohort uses this endpoint by setting OPENAI_BASE_URL and OPENAI_API_KEY env vars. Any code that already speaks the OpenAI API (LangChain, LlamaIndex, raw SDK calls) works untouched.
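
For code that does not use an SDK at all, the same two environment variables are enough to build the request by hand. The sketch below uses only the standard library and shows what goes over the wire: a POST to /chat/completions under the base URL, with the key as a Bearer token. The helper names `build_chat_request` and `send` are hypothetical.

```python
import json
import os
import urllib.request


def build_chat_request(prompt: str, model: str = "qwen2.5-14b") -> urllib.request.Request:
    """Build an OpenAI-style chat completion request from the two env vars."""
    base_url = os.environ["OPENAI_BASE_URL"].rstrip("/")  # e.g. http://<gpu-server>:8000/v1
    api_key = os.environ["OPENAI_API_KEY"]
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(req: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(req, timeout=60) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]


# Example: print(send(build_chat_request("Say hello in one word.")))
```

Switching between a hosted model in dev and the platform model in prod then means changing the two variables and the model name, nothing else.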

Step 5 — AWQ / GPTQ when you need more VRAM headroom

The FP16 weights of a 14B model take ~28 GB (14B parameters × 2 bytes). On an L40S with 48 GB you have room for a healthy KV cache. The moment you want a 30B+ model, quantize.

Two reasonable quantization formats:

  • AWQ — Activation-aware Weight Quantization, 4-bit, very good preserved quality.
  • GPTQ — older, similar idea, slightly broader model availability.

Run a pre-quantized AWQ version of a bigger model:

command: >
  --model Qwen/Qwen2.5-32B-Instruct-AWQ
  --served-model-name qwen2.5-32b-awq
  --quantization awq
  --gpu-memory-utilization 0.85
  --max-model-len 16384

Expect ~70% of FP16 throughput at ~25% of the weight memory. For most RAG and chat use cases the quality drop is imperceptible.

Step 6 — Load test

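The right number for “how many concurrent users can we serve” comes from actually testing. The command below runs the benchmark from the same vllm/vllm-openai image that serves the model; the benchmark entrypoint and flag names have changed between vLLM releases, so check them against the version you pinned: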

docker run --rm --net host -e OPENAI_API_KEY=__platform_internal_key__ \
  vllm/vllm-openai:v0.6.5 \
  python -m vllm.entrypoints.openai.api_server.bench \
    --base-url http://localhost:8000/v1 \
    --model qwen2.5-14b \
    --concurrency 1,4,8,16,32 \
    --num-prompts 200 \
    --input-len 256 --output-len 256

Expected on one L40S with Qwen2.5-14B FP16:

Concurrency | Throughput (tok/s) | p50 latency | p99 latency
1 | ~70 | 3.6 s | 4.0 s
8 | ~520 | 4.1 s | 5.5 s
32 | ~1100 | 7.5 s | 11.0 s

Aggregate throughput scales nicely (vLLM’s continuous batching is doing real work); per-request latency creeps up. The number to remember is your per-request latency at your expected concurrency — quote it in the runbook.
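
The same three numbers (aggregate tok/s, p50, p99) can also be measured with a few lines of Python, which is handy when you want to replay your own prompts. This is a sketch, not a replacement for a real benchmark tool: `run_level` and `percentile` are hypothetical helpers, and `send_one` is whatever function you supply that performs one request and returns the number of generated tokens.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def run_level(send_one: Callable[[], int], concurrency: int, num_requests: int) -> dict:
    """Fire num_requests calls with at most `concurrency` in flight at once."""

    def timed(_: int) -> tuple:
        start = time.perf_counter()
        tokens = send_one()
        return time.perf_counter() - start, tokens

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed, range(num_requests)))
    wall = time.perf_counter() - wall_start

    latencies = [latency for latency, _ in results]
    total_tokens = sum(tokens for _, tokens in results)
    return {
        "concurrency": concurrency,
        "tok_per_s": total_tokens / wall,   # aggregate output throughput
        "p50_s": percentile(latencies, 50),
        "p99_s": percentile(latencies, 99),
    }
```

In practice `send_one` would wrap one chat completion call and return the completion token count the server reports in the response's usage field. Run `run_level` once per concurrency level and record the row for the level closest to your expected load.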

Step 7 — Embedding endpoint

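RAG needs an embedding model. Run a second small vLLM (or a dedicated text-embeddings-inference container) for bge-large-en-v1.5 or Qwen3-Embedding-0.6B. The service below takes the text-embeddings-inference route and serves bge-large-en-v1.5 from the shared model cache: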

  embedding:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.5
    container_name: embedding
    restart: unless-stopped
    environment:
      - HF_TOKEN=__hf_token__
    ports:
      - "8001:80"
    volumes:
      - /srv/shared/models:/data
    command: >
      --model-id /data/bge-large-en-v1.5
      --port 80

Embedding models are small (under 2 GB at FP32) — running on CPU is fine and frees both GPUs.

Test:

curl -s -X POST http://localhost:8001/embed \
  -H 'Content-Type: application/json' \
  -d '{"inputs": ["taxi tip prediction", "borough revenue"]}' | jq '.[] | length'

Two 1024-dim vectors come back. The RAG capstone module turns this into something real.
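
As a first taste of what those vectors are for, the sketch below fetches embeddings from the same /embed endpoint and ranks documents by cosine similarity to a query. The helper names (`embed`, `cosine`, `rank`) are made up for illustration, and the default URL assumes the port mapping shown above.

```python
import json
import math
import urllib.request
from typing import List, Sequence


def embed(texts: List[str], url: str = "http://localhost:8001/embed") -> List[List[float]]:
    """POST texts to the embedding endpoint and return one vector per text."""
    req = urllib.request.Request(
        url,
        data=json.dumps({"inputs": texts}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def rank(query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]]) -> List[int]:
    """Indices of doc_vecs ordered from most to least similar to query_vec."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Example:
# q, *docs = embed(["tip rate by borough", "taxi tip prediction", "borough revenue"])
# print(rank(q, docs))
```

A vector database does the same ranking at scale with an index instead of a full scan.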

Step 8 — Observability for vLLM

vLLM exposes a Prometheus-compatible /metrics endpoint. Add it to your scrape config from module 10:

  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:8000']
        labels:
          service: 'vllm-qwen'

The metrics worth charting in Grafana:

Metric | Why
vllm:gpu_cache_usage_perc | KV cache pressure. The gauge is a fraction from 0 to 1; sustained values above 0.9 mean new requests are about to queue or running ones get preempted
vllm:num_requests_running | Concurrent active requests
vllm:num_requests_waiting | Queued requests; a non-zero standing queue means the server is oversubscribed
vllm:prompt_tokens_total | Input tokens consumed (cost proxy if you’re billing internally)
vllm:generation_tokens_total | Output tokens generated
vllm:e2e_request_latency_seconds | End-to-end request latency histogram
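
Before the Grafana dashboard exists, the /metrics output can be checked with a short script. The sketch below parses the Prometheus text format in the simplest possible way (no timestamps, labels discarded) and flags KV cache pressure. The sample text is invented for illustration, and the 0-to-1 scale of `vllm:gpu_cache_usage_perc` is my understanding of the gauge rather than something this module states.

```python
def parse_metrics(text: str) -> dict:
    """Map metric name (labels stripped) to the last sample value seen.

    Assumes plain `name{labels} value` lines without trailing timestamps.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        name_and_labels, _, raw_value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        try:
            values[name] = float(raw_value)
        except ValueError:
            continue  # not a sample line we understand
    return values


def kv_cache_under_pressure(values: dict, threshold: float = 0.9) -> bool:
    """True if KV cache usage (assumed to be a 0-1 fraction) exceeds threshold."""
    return values.get("vllm:gpu_cache_usage_perc", 0.0) > threshold


# Invented sample, shaped like a /metrics response.
SAMPLE = """\
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="qwen2.5-14b"} 0.93
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model_name="qwen2.5-14b"} 51200.0
"""

metrics = parse_metrics(SAMPLE)
print(metrics["vllm:gpu_cache_usage_perc"], kv_cache_under_pressure(metrics))
```

For anything beyond a spot check, let Prometheus do the scraping and put the threshold in an alert rule.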

ADR 0011 — vLLM as the platform’s LLM gateway

/srv/shared/adr/0011-vllm-platform.md:

# ADR 0011 — vLLM as the platform LLM endpoint

## Status
Accepted, 2026-05-15.

## Context
The platform needs a self-hosted LLM endpoint usable by every project, with
strong throughput, an OpenAI-compatible API, and operational visibility.

## Decision
A single vLLM service pinned to GPU 0 of the GPU server, serving
Qwen2.5-14B-Instruct FP16, with prefix caching and a 32k context window.
Behind a Bearer-key auth. Exposed as `http://<gpu-server>:8000/v1`.

A second container (`text-embeddings-inference`) on CPU serves
`bge-large-en-v1.5` embeddings at `:8001`.

## Consequences
- Pro: one endpoint, one auth, one set of metrics. Every project's
  `OPENAI_BASE_URL` points here.
- Pro: GPU 1 remains available for Slurm training jobs.
- Con: only one model served at a time. Swapping models is a restart.
  For multi-model hosting at this scale, see Triton or LMDeploy.

## Alternatives considered
- Ollama. Easier install; weaker concurrency and lower throughput. Lost on
  the load-test result at 8+ concurrency.
- A separate endpoint per project. Ruled out as a memory waste — every
  project would reload the same weights.

Recap and what’s next

You now have:

  • A vLLM endpoint serving a real instruction-tuned model.
  • A measured ceiling on concurrency and latency.
  • An embedding endpoint for RAG.
  • Both wired into Prometheus.

Time to use it. The RAG capstone — the LLM application module — comes next.


Next: 12 — RAG application (Capstone 3).