Self-hosted inference that scales on two levels — and proves it when you press the buttons.
vLLM on Kubernetes, on your own GPUs. Try a prompt, then run three real tests: caching under load, replicas scaling within one GPU, and a whole new GPU machine spinning up and self-destructing back to $0. Everything below is live and measured.
// PLAYGROUND
interactive
Try your own prompt. It's proxied over HTTPS to the live cache-aware router on the GPU fleet and runs to a full completion — no per-token cost, so nothing is cut short.
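As a rough illustration of what "cache-aware" routing can mean, the sketch below sends prompts that share a leading prefix to the same replica, so a prefix cached there can be reused. This is not the router running on the fleet: the function name `pick_replica`, the 256-character prefix window, and the hash-modulo scheme are all assumptions made for the example.

```python
import hashlib


def pick_replica(prompt: str, replicas: list[str], prefix_chars: int = 256) -> str:
    """Pick a replica from the prompt's leading characters.

    Hypothetical sketch: prompts with the same first `prefix_chars`
    characters always map to the same replica, which is where their
    shared prefix is most likely to be cached already.
    """
    if not replicas:
        raise ValueError("no replicas available")
    prefix = prompt[:prefix_chars]
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer key.
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]
```

A plain modulo remaps most prefixes whenever the replica count changes, so a real router would more likely use consistent hashing or track cache contents directly.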
// RUN A TEST
interactive
Three tests, one per level of the architecture. Each fires a real load on the live fleet and reads real metrics. One runs at a time (single physical GPU) — press one and watch running → PASS.
TEST 01
Inference & Caching
Proves the prefix cache pays off: sends a long, fresh prefix cold, then the same prefix again warm. Time-to-first-token drops several-fold on the warm request.
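The cold-versus-warm comparison can be sketched as follows. This is a minimal illustration, not the test harness behind the button: `send` is a hypothetical callable that issues a streaming request and yields text chunks as they arrive, and the clock is injectable so the logic can be exercised without a live endpoint.

```python
import time
from typing import Callable, Iterable


def time_to_first_token(
    send: Callable[[str], Iterable[str]],
    prompt: str,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Seconds from issuing the request until the first non-empty chunk."""
    start = clock()
    for chunk in send(prompt):
        if chunk:  # skip empty keep-alive or role-only chunks
            return clock() - start
    raise RuntimeError("stream ended before any token arrived")


def cold_warm_speedup(
    send: Callable[[str], Iterable[str]],
    prompt: str,
    clock: Callable[[], float] = time.perf_counter,
) -> tuple[float, float, float]:
    """Send the same prompt twice; return (cold, warm, cold / warm)."""
    cold = time_to_first_token(send, prompt, clock)  # prefix not cached yet
    warm = time_to_first_token(send, prompt, clock)  # prefix should now be cached
    speedup = cold / warm if warm > 0 else float("inf")
    return cold, warm, speedup
```

For the cold measurement to be meaningful, the prefix has to be one the server has not seen before, which is why the test uses a fresh prefix on every run.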
TEST 02
Scale WITHIN a GPU
Proves more load → more replicas on the same L40. Resets to 1 replica, then a burst of requests drives KEDA to scale 1 → 2 across the card's time-slices. When load stops it settles back to 1 in ~90s; watch it breathe in the live metrics below.
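The 1 → 2 step follows from the usual autoscaler arithmetic. KEDA hands its metric to the Kubernetes Horizontal Pod Autoscaler, which for an average-value target computes roughly ceil(metric / target) and clamps the result to the configured bounds. The sketch below reproduces that calculation; the target of 5 waiting requests per replica and the ceiling of 4 replicas (one per slice) are assumptions for illustration, not the fleet's actual settings.

```python
import math


def desired_replicas(
    queue_depth: float,
    target_per_replica: float,
    min_replicas: int = 1,
    max_replicas: int = 4,  # assumed ceiling: one replica per GPU slice
) -> int:
    """HPA-style replica count for an average-value metric target."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    raw = math.ceil(queue_depth / target_per_replica)
    # Clamp to the configured floor and ceiling.
    return max(min_replicas, min(max_replicas, raw))
```

With the assumed target of 5, a queue of 8 waiting requests asks for 2 replicas, and an empty queue falls back to the floor of 1.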
TEST 03
Scale GPUs themselves
Proves one GPU full → a whole new GPU machine spins up, serves, then self-destructs → $0.
1 → 2 → 1 · GPU machines · PROVEN
VM 886267
real L40, provisioned → terminated
~$0.16
total spend · $0 idle
served ✓
real pod ran on the new node
~14s
snapshot wake from $0
/opt/nodescale · scheduler-driven cluster autoscaler. A real Hyperstack L40 VM provisioned → joined k3s (1→2) → served a real completion → drained → terminated (2→1). Press run live to do it again.
Real GPU VM — ~10 min end to end. Watch Nodes and slices go 1→2 then 2→1 in the live metrics below.
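The "one GPU full → new machine" decision comes down to slice arithmetic: once replicas request more slices than the existing nodes expose, another node is needed, and when demand drops the extra node can go. The sketch below illustrates that arithmetic only. It is not the logic inside /opt/nodescale, and the function names and the floor of one node are assumptions.

```python
import math


def gpu_nodes_needed(
    requested_slices: int,
    slices_per_node: int = 4,  # 1 physical L40 = 4 time-slices
    min_nodes: int = 1,        # assumed floor: the base node stays up
) -> int:
    """Smallest node count whose slices cover the requested slices."""
    if requested_slices < 0 or slices_per_node <= 0:
        raise ValueError("invalid slice counts")
    return max(min_nodes, math.ceil(requested_slices / slices_per_node))


def scale_decision(requested_slices: int, current_nodes: int,
                   slices_per_node: int = 4) -> int:
    """Positive: nodes to add. Negative: nodes to remove. Zero: hold."""
    return gpu_nodes_needed(requested_slices, slices_per_node) - current_nodes
```

Four requested slices fit on the single L40, so the decision is 0; a fifth slice pushes it to +1, and once demand falls back the second node shows up as -1.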
// LIVE METRICS
live · read-only
Prometheus-backed Grafana, embedded over HTTPS. These refresh every few seconds — run a test above and watch them respond.
GPU: 1× NVIDIA L40 (4 time-slices) · Nodes: 1. One physical card; GPU time-slicing exposes 4 schedulable slices to Kubernetes. (Capacity below counts slices, not physical cards.)
Replicas = model copies running (live usage); each replica uses 1 of the 4 GPU slices (the card's fixed capacity). A model at 0 replicas is scaled-to-zero — costing $0 until called.
GPU slices (capacity) · 1 physical L40 = 4 slices
GPU nodes (physical) · 1 node
Model replicas · each copy = 1 of 4 slices
Prefix-cache hit rate per model
Request queue depth (num_requests_waiting)
GPU utilization % (real — DCGM)
Token throughput (prompt + gen tok/s)
Time-to-first-token — latency quantiles (s)
GPU VRAM used / total (real — DCGM)
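Panels like these are typically backed by instant queries against the Prometheus HTTP API. As a hedged sketch, the code below builds such a query URL and parses the JSON reply into per-model values. The base URL is hypothetical, and the exact series name (`vllm:num_requests_waiting`) and the `model_name` label are assumptions about how vLLM's metrics appear in this deployment.

```python
import json
from urllib.parse import urlencode

# Assumed PromQL for the queue-depth panel, grouped per model.
QUEUE_DEPTH_QUERY = "sum by (model_name) (vllm:num_requests_waiting)"


def build_query_url(base_url: str, promql: str) -> str:
    """URL for a Prometheus instant query (GET /api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body: str) -> dict[str, float]:
    """Map each series' model_name label to its current sample value."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    values: dict[str, float] = {}
    for series in payload["data"]["result"]:
        name = series["metric"].get("model_name", "unknown")
        # Each sample is [unix_timestamp, "value-as-string"].
        values[name] = float(series["value"][1])
    return values
```

Fetching the URL with any HTTP client and passing the response text to `parse_instant_vector` gives a dictionary that can be compared against the queue-depth panel.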
Truth on hardware: 1 physical NVIDIA L40, exposed as 4 GPU time-slices, on 1 node. Models: qwen-coder-3b / qwen-coder-1.5b + deepseek-coder-1.3b. Every number is measured live on Hyperstack — nothing faked.