Self-hosted inference that scales on two levels — and proves it when you press the buttons.

vLLM on Kubernetes, on your own GPUs. Try a prompt, then run three real tests: caching under load, replicas scaling within one GPU, and a whole new GPU machine spinning up and self-destructing back to $0. Everything below is live and measured.

// PLAYGROUND

interactive

Try your own prompt. It's proxied over HTTPS to the live cache-aware router on the GPU fleet and runs to a full completion — no per-token cost, so nothing is cut short.

prompt

model

metrics

// RUN A TEST

interactive

Three tests, one per level of the architecture. Each fires a real load on the live fleet and reads real metrics. One runs at a time (single physical GPU) — press one and watch running → PASS.

TEST 01

Inference & Caching

Proves the prefix cache pays off: sends a long fresh prefix cold, then warm — first-token time collapses several×.

TEST 02

Scale WITHIN a GPU

Proves more load → more replicas on the same L40. Resets to 1, then a burst drives KEDA 1 → 2 on its time-slices. When load stops it settles back to 1 in ~90s — watch it breathe in the live metrics below.

TEST 03

Scale GPUs themselves

Proves one GPU full → a whole new GPU machine spins up, serves, then self-destructs → $0.

1 → 2 → 1 GPU machines PROVEN

VM 886267

real L40, provisioned → terminated

~$0.16

total spend · $0 idle

served ✓

real pod ran on the new node

~14s

snapshot wake from $0

/opt/nodescale · scheduler-driven cluster autoscaler. A real Hyperstack L40 VM provisioned → joined k3s (1→2) → served a real completion → drained → terminated (2→1). Press run live to do it again.

00:00 step 1/7

starting…

Real GPU VM — ~10 min end to end. Watch Nodes and slices go 1→2 then 2→1 in the live metrics below.

// LIVE METRICS

live read-only

Prometheus-backed Grafana, embedded over HTTPS. These refresh every few seconds — run a test above and watch them respond.

Replicas = model copies running (live usage); each replica uses 1 of the 4 GPU slices (the card's fixed capacity). A model at 0 replicas is scaled-to-zero — costing $0 until called.

GPU slices (capacity) · 1 physical L40 = 4 slices

GPU nodes (physical) · 1 node

Model replicas · each copy = 1 of 4 slices

Prefix-cache hit rate per model

Request queue depth (num_requests_waiting)

GPU utilization % (real — DCGM)

Token throughput (prompt + gen tok/s)

Time-to-first-token — latency quantiles (s)

GPU VRAM used / total (real — DCGM)

Truth on hardware: 1 physical NVIDIA L40, exposed as 4 GPU time-slices, on 1 node. Models: qwen-coder-3b / qwen-coder-1.5b + deepseek-coder-1.3b. Every number is measured live on Hyperstack — nothing faked.