POC for Command Code · built by Waqas Haider

Self-hosted inference that scales on two levels — and proves it when you press the buttons.

vLLM on Kubernetes, on your own GPUs. Try a prompt, then run three real tests: caching under load, replicas scaling within one GPU, and a whole new GPU machine spinning up and self-destructing back to $0. Everything below is live and measured.

// PLAYGROUND
interactive

Try your own prompt. It's proxied over HTTPS to the live cache-aware router on the GPU fleet and runs to a full completion — no per-token cost, so nothing is cut short.


        
// RUN A TEST
interactive

Three tests, one per level of the architecture. Each fires real load at the live fleet and reads real metrics. Only one runs at a time because there is a single physical GPU: press one and watch its status go running → PASS.

TEST 01

Inference & Caching

Proves the prefix cache pays off: sends a long, previously unseen prefix cold, then again warm. Time-to-first-token drops several-fold on the warm request.
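A minimal sketch of how a cold-versus-warm comparison like this can be timed, assuming the completion arrives as a stream of text chunks. `fake_stream` is a hypothetical stand-in for the real streaming response and its delays are invented for illustration; they are not measurements from this fleet.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_to_first_token(stream: Iterable[str]) -> Tuple[float, str]:
    """Return (seconds until the first chunk arrives, that chunk)."""
    start = time.perf_counter()
    first = next(iter(stream))  # blocks until the first chunk is produced
    return time.perf_counter() - start, first


def fake_stream(prefill_seconds: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming completion.

    The sleep models prefill time: long when the prefix is cold,
    short when its KV blocks are already cached.
    """
    time.sleep(prefill_seconds)
    yield "def"
    yield " main"


# Illustrative delays only (assumptions, not measured values).
cold_ttft, _ = time_to_first_token(fake_stream(0.20))
warm_ttft, _ = time_to_first_token(fake_stream(0.02))
speedup = cold_ttft / warm_ttft
print(f"cold={cold_ttft:.3f}s warm={warm_ttft:.3f}s speedup={speedup:.1f}x")
```

Against a real endpoint, the same `time_to_first_token` helper would wrap the iterator returned by a streaming client, sending the identical long prefix twice.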

TEST 02

Scale WITHIN a GPU

Proves more load → more replicas on the same L40. Resets to 1 replica, then a burst of requests drives KEDA to scale 1 → 2 across the card's time-slices. When the load stops, it settles back to 1 in ~90s; watch it breathe in the live metrics below.
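The scaling decision can be sketched with the replica formula that the Kubernetes Horizontal Pod Autoscaler (which KEDA drives) applies to an average-value target: the total metric divided by the per-replica target, rounded up, then clamped. Which metric and threshold this fleet actually triggers on is not stated here, so queue depth as the trigger, the target of 5, and the bounds of 1 and 4 are all assumptions. The HPA tolerance band and the scale-down cooldown are left out.

```python
import math


def desired_replicas(
    total_waiting: float,
    target_per_replica: float,
    min_replicas: int = 1,   # assumed floor
    max_replicas: int = 4,   # assumed ceiling: one replica per time-slice
) -> int:
    """HPA-style replica count for an average-value target, clamped to bounds."""
    raw = math.ceil(total_waiting / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical target: 5 waiting requests per replica.
print(desired_replicas(0, 5))    # idle: stays at the floor -> 1
print(desired_replicas(8, 5))    # burst: ceil(8 / 5) -> 2
print(desired_replicas(100, 5))  # heavy burst: clamped to the ceiling -> 4
```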

TEST 03

Scale GPUs themselves

Proves that when one GPU is full, a whole new GPU machine spins up, serves, then self-destructs back to $0.
1 → 2 → 1 GPU machines · PROVEN
VM 886267 · real L40, provisioned → terminated
~$0.16 total spend · $0 idle
served ✓ · real pod ran on the new node
~14s snapshot wake from $0
/opt/nodescale · scheduler-driven cluster autoscaler. A real Hyperstack L40 VM provisioned → joined k3s (1→2) → served a real completion → drained → terminated (2→1). Press run live to do it again.
Real GPU VM — ~10 min end to end. Watch Nodes and slices go 1→2 then 2→1 in the live metrics below.
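The node lifecycle described above (provisioned → joined → served → drained → terminated) can be modeled as a small ordered state machine. This is only an illustrative sketch: the phase names are taken from the description, while the `Phase` enum, `next_phase`, and `node_count` are hypothetical and do not reflect how /opt/nodescale is actually implemented.

```python
from enum import Enum


class Phase(Enum):
    PROVISIONED = "provisioned"  # VM exists but has not joined the cluster
    JOINED = "joined"            # node registered with k3s (1 -> 2)
    SERVED = "served"            # a pod on the new node answered a request
    DRAINED = "drained"          # pods evicted, node still registered
    TERMINATED = "terminated"    # VM destroyed (2 -> 1)


ORDER = list(Phase)  # Enum members iterate in definition order


def next_phase(phase: Phase) -> Phase:
    """Advance one step; the lifecycle is strictly linear."""
    i = ORDER.index(phase)
    if i == len(ORDER) - 1:
        raise ValueError("TERMINATED is the final phase")
    return ORDER[i + 1]


def node_count(phase: Phase, base_nodes: int = 1) -> int:
    """Cluster size as seen by Kubernetes during each phase."""
    in_cluster = {Phase.JOINED, Phase.SERVED, Phase.DRAINED}
    return base_nodes + (1 if phase in in_cluster else 0)


print([node_count(p) for p in Phase])  # [1, 2, 2, 2, 1]
```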
// LIVE METRICS
live read-only

Prometheus-backed Grafana, embedded over HTTPS. These refresh every few seconds — run a test above and watch them respond.

GPU: 1× NVIDIA L40 (4 time-slices) · Nodes: 1  —  one physical card; GPU time-slicing exposes 4 schedulable slices to Kubernetes. (Capacity below counts slices, not physical cards.)

Replicas = model copies running (live usage); each replica uses 1 of the 4 GPU slices (the card's fixed capacity). A model at 0 replicas is scaled-to-zero — costing $0 until called.
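The relationship between replicas and slices is simple bookkeeping, sketched below. The 4 slices per card and the model names come from this page; the replica counts in the example are hypothetical, and `free_slices` is an illustrative helper rather than part of the deployment.

```python
from typing import Dict

SLICES_PER_CARD = 4  # one physical L40 exposed as 4 time-slices


def free_slices(replicas: Dict[str, int], cards: int = 1) -> int:
    """Slices left over after every running replica takes one slice."""
    capacity = cards * SLICES_PER_CARD
    used = sum(replicas.values())
    if used > capacity:
        raise ValueError(f"{used} replicas exceed {capacity} slices")
    return capacity - used


# Hypothetical snapshot: one model is scaled to zero and uses no slice.
snapshot = {
    "qwen-coder-3b": 1,
    "qwen-coder-1.5b": 0,
    "deepseek-coder-1.3b": 1,
}
print(free_slices(snapshot))  # 2 of 4 slices still schedulable
```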

GPU slices (capacity) · 1 physical L40 = 4 slices
GPU nodes (physical) · 1 node
Model replicas · each copy = 1 of 4 slices
Prefix-cache hit rate per model
Request queue depth (num_requests_waiting)
GPU utilization % (real — DCGM)
Token throughput (prompt + gen tok/s)
Time-to-first-token — latency quantiles (s)
GPU VRAM used / total (real — DCGM)

Truth on hardware: 1 physical NVIDIA L40, exposed as 4 GPU time-slices, on 1 node. Models: qwen-coder-3b / qwen-coder-1.5b + deepseek-coder-1.3b. Every number is measured live on Hyperstack — nothing faked.