
Test-Time Distilling on Spark — Same Compute Envelope, Wider Semantic Reach

ESamp adds a tiny test-time-trained probe to vLLM that converts decoding from lexical resampling into semantic exploration. The runtime is vLLM-native — and that is a Spark catalog-gap story before it is a benchmark.

Series: Frontier Scout
Terms in this piece (4)
  • Test-time scaling: Spending more inference compute per query — n parallel samples, beam search, sampler interventions, or self-verification loops — to lift output quality without retraining the model. The trade-off shape is "more tok/s budget per answer in exchange for higher Pass@k." ESamp belongs to the sampler-intervention family: same n, same wall clock, different distribution over the candidate space.
  • Pass@k: The standard reasoning-benchmark metric. Of n independently sampled solutions to a problem, what fraction of problems have at least one correct solution among any k-sized subset? The unbiased estimator from the Codex paper (Chen et al., 2021; sketched in code just after this list) corrects for variance when n > k. The metric rewards semantic spread across the n samples — if all n rephrase one wrong attempt, Pass@k is no better than Pass@1.
  • vLLM: Open-source LLM inference engine introduced with the PagedAttention paper (Kwon et al., 2023). Manages KV-cache memory in 16-token blocks instead of pre-reserving the worst case per request, raising effective concurrency 2–4× over pre-paged stacks. The v1 rewrite (0.6+, stabilized through 0.10–0.20) is what most test-time-scaling literature targets because the request lifecycle and sampler hot path are extension-friendly.
  • Distiller (online probe): A small two-layer MLP that ESamp trains during inference to predict the host LLM's deep-layer hidden state from its shallow-layer hidden state. Training data is the running sequence of decode rows the model is producing right now — no ground-truth labels, just self-consistency between layers. Training error on a candidate continuation is the novelty signal the sampler intervention reweights against. ~1 GB of parameters; fits trivially next to the 7B host on Spark's unified memory.
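
For reference, a minimal sketch of the unbiased Pass@k estimator named above (n samples per problem, c of them correct):

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k (Chen et al., 2021): 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0          # fewer than k wrong samples: every k-subset contains a hit
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. pass_at_k(8, 1, 4) == 0.5: one correct answer among eight samples is found
# by exactly half of the 4-sized subsets.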

KV-cache arithmetic at inference asked one question of a fixed compute budget: how much fits? Weights × dtype, KV cache × context × batch, all the way down to the last GiB. The answer determined what could run on the Spark at all.

ESamp (Large Language Models Explore by Latent Distilling) asks a different question of the same envelope: how widely can the model search inside it? When you sample n=8 candidates from the same prompt, do you get eight lexically different rephrasings of one bad attempt, or eight semantically distinct paths through the answer space? At fixed compute, the second is worth more. Pass@k benchmarks on AIME, MATH, and HumanEval reward whichever dimension the samples actually spread along; the paper reports that a ~1 GB online-trained probe pushes that spread without measurably moving the wall clock — the optimized 7B min_p path lands at 0.9878× the baseline tokens-per-second on a reference RTX 4090 run, with a Pass@k lift on the reasoning benchmarks the paper highlights.

The plot twist for a Spark power-user is upstream of the numbers. ESamp ships as a runtime extension to vLLM v1, packaged as the tLLM repository — a Producer/Consumer hook layer over vLLM. The Spark’s blessed inference path is NIM (TensorRT-LLM) and Triton, not vLLM. So the article’s first half is about a stack mismatch, not a benchmark: when a paper’s runtime is in a different lane than the box’s verified runtime, what does it actually take to make the experiment runnable here?

The paper, in one breath

Thesis. Standard stochastic decoding produces lexical variation but rarely semantic exploration — temperature and top-p resample near-duplicate ideas. ESamp adds a lightweight Distiller trained online at test time to predict the LLM’s deep-layer hidden state from its shallow-layer hidden state. When the Distiller’s prediction error spikes on a candidate continuation, that’s a novelty signal — the prefix is moving into territory the LLM hasn’t been recently calibrated on — and ESamp reweights token candidates toward those less-explored semantic patterns.
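
A minimal sketch of that probe and its novelty signal (sizes and layer choices here are placeholders, not the paper's settings):

import torch
import torch.nn as nn

# Sketch of the Distiller described above, not the reference implementation.
# hidden=3584 matches Qwen2.5-7B; which shallow/deep layers it bridges is a knob.
class Distiller(nn.Module):
    def __init__(self, hidden: int = 3584, width: int = 4096):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden, width), nn.GELU(),
                                 nn.Linear(width, hidden))

    def forward(self, shallow_h: torch.Tensor) -> torch.Tensor:
        return self.net(shallow_h)              # predicted deep-layer hidden state

def novelty(probe: Distiller, shallow_h: torch.Tensor, deep_h: torch.Tensor):
    # Per-row prediction error on the rows just decoded: high error means the
    # prefix is moving into territory the probe has not been calibrated on yet.
    with torch.no_grad():
        return (probe(shallow_h) - deep_h).pow(2).mean(dim=-1)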

Why this technique matters for a personal AI builder. Reasoning workloads spend their compute budget on n parallel samples; if all n collapse onto rephrasings of one bad attempt, the budget is wasted. ESamp converts the same n into n semantically distinct paths through the answer space — which is the dimension Pass@k on AIME, MATH, and HumanEval actually rewards. At a fixed compute envelope, that is the difference between a brittle reasoner and one that explores.

Promise vs achieved. Paper claims 0.9878× baseline tokens-per-second on a reference RTX 4090 with CUDA graphs (vLLM 0.10.x), with a Pass@k lift on the reasoning benchmarks above. This article does not measure the ratio — it lands the runtime substrate and surfaces the first two upstream API drifts that block tLLM’s hooks on vLLM 0.20.0. The ratio measurement on Spark lands in the follow-up, which closes at 0.974× on patched Qwen 2.5 7B — within 1.4 percentage points of the paper, with CUDA graphs deliberately disabled and six (not two) patches in place.

Why this matters for a personal AI builder

Reasoning models are the workload class where the Spark’s 128 GiB unified pool earns its line on the spec sheet — n=16 parallel completions of a 7B reasoning model with a few thousand tokens each is comfortable, not tight. A frontier-API equivalent would burn through credits and rate limits long before you finished iterating on the sampler. The Spark makes test-time-scaling techniques — Pass@k sweeps, beam-search ablations, sampler-guidance tuning — iterable. ESamp is one such technique that needs the iteration: the Distiller is online-trained, its hyperparameters interact with the model and the prompt distribution, and getting the --distiller-beta knob right takes a sweep.

But none of that is reachable until vLLM runs on the box. The Spark catalog ships eight-or-so curated -dgx-spark NIM images and zero vLLM containers. vLLM-on-Blackwell exists in the broader ecosystem, but the aarch64 wheel and CUDA-13 ABI matrix is its own afternoon. This article documents that afternoon as a first-class part of the work — the catalog gap is the experiment.

Where this sits in the stack

The paper’s algorithm and the runtime that hosts it are deliberately decoupled. ESamp’s idea — predict the model’s deep-layer hidden state from its shallow-layer hidden state with a lightweight probe, treat prediction error as a novelty signal, reweight token candidates in proportion — could in principle live in any inference engine. In practice it lives in vLLM v1 because that is where the published reference implementation runs, and re-porting it to TensorRT-LLM is a separate research-engineering project the paper does not attempt.

[Stack diagram: the vLLM v1 engine (load_model · prepare_inputs · execute_model · sampler) hosts the tLLM runtime (producer · localization · bundles; ports: residual_stream, logits; sampler-bridge: post_filter_exact; async-train window running in the background), which feeds the ESampConsumer, a ~1 GB online-trained distiller(shallow → deep) whose novelty signal drives the sampler intervention (1+β)·llm − β·distiller, applied as a post-filter reweight after top-k / top-p / min-p. Both paths run under the same SamplingParams (n=16 · temp 0.8 · top_p 0.95): hidden rows captured in, n samples returned, same envelope either way.]
The intervention is gated through the same SamplingParams either way — the difference is in how the n samples spread through the answer space, not in their wall-clock cost.
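
The intervention box in that diagram is one line of arithmetic. A minimal sketch, assuming both terms are per-candidate scores over whatever survives the truncation filters (the exact score semantics are the paper's, not reproduced here):

import torch

def esamp_reweight(llm_scores: torch.Tensor,
                   distiller_scores: torch.Tensor,
                   beta: float) -> torch.Tensor:
    """Post-filter reweight from the diagram: (1 + beta) * llm - beta * distiller.

    llm_scores:       engine scores for the candidates that survive top-k/top-p/min-p
    distiller_scores: the probe's scores for the same candidates (high where the probe
                      already predicts the model well, i.e. already-explored patterns)
    beta:             the --distiller-beta knob; beta = 0 recovers plain vLLM sampling
    """
    return (1.0 + beta) * llm_scores - beta * distiller_scores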

The runtime split is what makes the paper’s overhead claim tractable: the consumer’s training pipeline runs in a background window — its backward pass overlaps the next vLLM forward. On the validated 7B min_p path, the optimized intervention sits at 98.78% of baseline tok/s (5,304.855 vs 5,370.616 on the reference 4090). On Spark, that overhead figure is the second number we want to read. The first one is whether the runtime starts at all.
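
A minimal sketch of that overlap, using the generic PyTorch side-stream idiom rather than tLLM's actual scheduler (the probe and optimizer are assumed to be the ones sketched earlier):

import torch

train_stream = torch.cuda.Stream()

def background_train_step(probe, opt, shallow_h, deep_h):
    # Enqueue the probe's forward/backward on a side stream so it can overlap the
    # engine's next decode step on the default stream. Real code also has to
    # synchronize the captured hidden rows between streams before reusing them.
    with torch.cuda.stream(train_stream):
        loss = (probe(shallow_h) - deep_h).pow(2).mean()
        opt.zero_grad(set_to_none=True)
        loss.backward()
        opt.step()
    return loss.detach()        # inspect later; calling .item() here would force a sync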

The journey — landing a vLLM-native paper on a NIM-curated box

The repo’s install instructions are honest and short. From tLLM/doc/getting-started/installation.md:

python -m venv .venv && source .venv/bin/activate
pip install vllm
pip install -e .
python starter.py --max-new-tokens 32

Three lines, one validated environment (vllm==0.10.x, the doc says), and a healthy run prints loss_count > 0 and a list of generations. On a normal x86 + CUDA 12 host with pre-built wheels, this works in minutes. On a Spark — aarch64 GB10 (SM 12.1), nvcr.io/nvidia/pytorch:25.11-py3 shipping torch 2.10.0a0+nv25.11 against CUDA 13.0 — the pip install vllm line is the work. There is no published wheel that matches all of: aarch64, SM 12.1, CUDA 13, vLLM ≥ 0.10. The install resolves, then starts compiling source dependencies (fastsafetensors is the first to appear in /tmp/pip-build-env-*/), and the wall clock starts ticking.

docker run -d --name tllm-build --gpus all --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /tmp/tllm-spark:/work \
  -v /home/nvidia/.../evidence/repo-snapshot:/tllm:ro \
  nvcr.io/nvidia/pytorch:25.11-py3 sleep infinity

docker exec tllm-build sh -c '
  python3 -c "import torch; print(torch.__version__, torch.version.cuda,
                                   torch.cuda.get_device_capability(0))"
'
# 2.10.0a0+b558c986e8.nv25.11  13.0  (12, 1)

The four numbers in the second command’s output are the entire integration story compressed: torch is the container’s nightly, CUDA is the new major (13.0, not 12.x), and the device reports SM 12.1 — the GB10 Blackwell compute capability. vLLM 0.10.x wheels were built for CUDA 12, so the matrix forces a build. The build is the article’s first measurable: a sustained pip install vllm that holds the box at ~3 GB of resident pip-build-env, walking through every source dep that doesn’t have a matching pre-built wheel for this triple.

While that runs, the rest of the experimental shape is fixed enough to write the harness around it. ESamp’s published functional check is a repro_esamp_loss runner that takes the model name, a list of debug prompts, and the two layer paths the Distiller bridges. Its published throughput benchmark is per_request_esamp_benchmark, comparing single_off (plain vLLM) against model_bank_on (ESamp registered as a consumer) on identical SamplingParams. The ratio is the headline. From doc/reference/esamp-usage.md:

ratio = model_bank_on / single_off   # baseline=1.0; paper's optimized 7B = 0.9878

The Bench shape that absorbed AutoResearchBench’s per-question schema in article #1 generalizes to ESamp’s per-prompt-batch schema cleanly:

# scripts/run_esamp_bench.py — registers fieldkit.eval.Bench around the
# tLLM throughput-benchmark workflow, so the same harness rolls up Pass@k
# tasks once the verifier loops are added in the next article.
from fieldkit.eval import Bench, summarize_metric
from fieldkit.capabilities import Capabilities, practical_inference_envelope

caps = Capabilities.load()
print(practical_inference_envelope("7B params bf16"))   # sanity: Qwen-7B fits
# ~14 GB weights; leaves >100 GB of unified pool for KV / activations / Distiller
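# BASELINE_ARGS / ESAMP_ARGS (the single_off vs model_bank_on argument sets),
# PROMPTS, and tllm_run (a thin wrapper around the per_request_esamp_benchmark
# entry point) are assumed to be defined earlier in the script; not shown here.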

for label, args in [("baseline", BASELINE_ARGS), ("esamp", ESAMP_ARGS)]:
    with Bench(name=f"esamp/{label}",
               metrics=["tokens_per_s", "loss_avg", "answers"]) as bench:
        for prompt in PROMPTS:
            bench.record(callable=lambda: tllm_run(prompt, args))
    bench.report()

The install resolves cleanly — pip install vllm lands vllm-0.20.0 after a ~14-minute build that walks fastsafetensors and a long tail of CUDA-13 user-space packages. Post-install: torch 2.11.0+cu130, CUDA available, GB10 reported as SM (12, 1), vllm 0.20.0, from vllm import LLM, SamplingParams round-trips. So far so good.
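
The checks behind that sentence fit in a few lines; a sketch of what was run, with the expected values as comments rather than a captured transcript:

import torch
import vllm
from vllm import LLM, SamplingParams            # the import round-trip the text refers to

print(torch.__version__, torch.version.cuda)    # 2.11.0+cu130  13.0
print(torch.cuda.is_available(),
      torch.cuda.get_device_capability(0))      # True  (12, 1)
print(vllm.__version__)                         # 0.20.0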

The validated tLLM environment was vllm==0.10.x. The starter, run on the smallest model the doc suggests as the OOM-safe default, gets through model load and into vLLM v1 engine init — and dies inside a tLLM patch:

File "/tmp/tllm-rw/tllm/runtime/vllm_patch/sampler_patch.py", line 181,
    in wrapped_sampler_sample
  logits_for_sampling = sampler.apply_temperature(
      logits, sampling_metadata.temperature
  )
TypeError: Sampler.apply_temperature() missing 1 required positional argument: 'all_random'

That’s the upstream version drift reduced to a single signature change. tLLM’s runtime patches a v1-engine sampler whose entry point gained a third required argument across the 0.10 → 0.20 churn. The fix is local — all_random is already computed three lines above the failing call, in the same function:

- logits_for_sampling = sampler.apply_temperature(logits, sampling_metadata.temperature)
+ logits_for_sampling = sampler.apply_temperature(logits, sampling_metadata.temperature,
+                                                 sampling_metadata.all_random)

The patch goes through. The starter — patched, otherwise unmodified — gets the rest of the way through engine init: vLLM v1 allocates a KV cache of 3,894,336 tokens (max-concurrency 15,212× at 256 tokens per request — Qwen 2.5 0.5B is tiny next to the 128 GiB pool), the FlashInfer autotuner runs in 60 ms, and CUDA-graph capture finishes in 3 seconds across 51 piecewise-prefill-decode shapes and 35 decode-full shapes. Engine init reports init engine (profile, create kv cache, warmup model) took 82.18 s (compilation: 5.15 s). Four prompts render. The first execute_model hits a second drift:

File "/tmp/tllm-rw/tllm/runtime/vllm_patch/port_runtime_hooks.py", line 511,
    in wrapped_prepare_inputs
TypeError: _wrapped_prepare_inputs() takes 2 positional arguments but 3 were given

vLLM 0.20.0’s GPUModelRunner._prepare_inputs(self, scheduler_output, num_scheduled_tokens) added a required second positional argument; tLLM’s wrapped_prepare_inputs(*, core, runner, scheduler_output) was written to keyword-route a single scheduler_output. That is a deeper change than the sampler one — the wrapper’s keyword-only signature, the adapter that downstream consumers call to unpack prepare_inputs output, and the consumer’s bundle-assembly path all need to thread the new num_scheduled_tokens argument. It’s still tractable — five-to-ten lines and a careful re-read of the runtime — but it’s the second uncaptured API drift in tLLM’s vllm_patch layer in as many patches, and that is itself the article’s central evidence: the runtime is the frontier. The test-time-distilling literature moves fast, and the production-grade inference engine it targets moves fast too; the result is that one-line drifts compound into deeper ones. The catalog gap on the Spark side meets the version drift on the upstream side; both are tractable, neither is documented in the paper, and a power user landing this stack here ends a session with two upstream patches in their notes and a Pass@k matrix queued for the next session.
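
For scale, a hypothetical sketch of the wrapper-side portion of that fix, based only on the two signatures quoted above; the real patch also has to touch the adapter and the consumer's bundle-assembly path:

# Hypothetical only: thread vLLM 0.20.0's extra argument through tLLM's
# keyword-only wrapper while staying compatible with the 0.10.x call shape.
def wrapped_prepare_inputs(*, core, runner, scheduler_output,
                           num_scheduled_tokens=None):
    if num_scheduled_tokens is None:                     # vLLM 0.10.x callers
        return core(runner, scheduler_output)
    return core(runner, scheduler_output, num_scheduled_tokens)  # vLLM 0.20.0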

Verification — what success looks like on Spark

The “did the integration work” question splits into three integration layers plus two deferred measurements, and we got two and a half of the layers by running the patched starter against Qwen/Qwen2.5-0.5B-Instruct (the doc’s OOM-safe default) inside the PyTorch container:

Layer | What it asks | This session
Install | Does pip install vllm resolve and import on the Spark’s torch + CUDA + SM triple? | ✅ vllm 0.20.0 in ~14 min; torch 2.11.0+cu130; import vllm round-trips
Engine | Does vLLM v1 init, allocate KV, capture CUDA graphs on GB10? | ✅ KV cache 3.8M tokens; 86 CUDA graphs captured; init engine took 82.18 s
tLLM hooks | Do tLLM’s runtime patches bind into the vLLM v1 hot-path on this version? | ⚠ two API drifts: apply_temperature (one-line fix), _prepare_inputs (multi-line fix)
ESamp loss | Does the consumer fire and loss_count > 0 after a real prompt? | deferred — gated on the second patch landing
Throughput ratio | Is model_bank_on / single_off in the same neighborhood as the paper’s 0.9878? | deferred — gated on the consumer firing

The interesting line is the Engine row. vLLM 0.20.0 — torch-2.11.0+cu130, no Spark-specific tuning, default --gpu-memory-utilization=0.4 — initialized cleanly on GB10 (SM 12.1) inside an nvcr.io/nvidia/pytorch:25.11-py3 container with nothing more exotic than pip install vllm. CUDA-graph capture, FlashInfer autotuning, and KV-cache profiling all worked. That is the positive result of the session and it is worth naming: vLLM-on-Blackwell is no longer the multi-day port it was at the start of the GB10 cycle. The fact that it “just works” with one pip install is the substrate the next experiment lives on top of.

The unified-memory check the Spark uniquely cares about is the easy one to read off this run. KV cache reserved 3.89 M token slots inside the 0.4 × 121 GiB envelope — meaning a 0.5B model’s KV is a rounding error against the unified pool. Scaling to Qwen-7B at bf16 (~14 GB weights), n=16 decode at max_tokens=512 (a few GB of KV in the same arithmetic), plus the ESamp Distiller (~1 GB) lands the whole loadout under 25 GB on a 121 GiB envelope. That is the inversion of the KV-cache arithmetic story: where the foundation article asked what fits, the test-time-distilling article asks what does fitting waste. The Spark is comfortable; the bottleneck — once the runtime patches land — will be throughput overhead and consumer-side GPU contention, not capacity.
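
That loadout, spelled out as rough arithmetic (the KV term assumes Qwen2.5-7B's published config of 28 layers, 4 KV heads, and 128-dim heads in bf16; all numbers are approximate):

GIB = 2**30

weights_gib   = 7.6e9 * 2 / GIB                   # Qwen2.5-7B, bf16: ~14 GiB
kv_per_token  = 2 * 28 * 4 * 128 * 2              # K+V bytes per token across all layers (GQA)
kv_gib        = 16 * 4608 * kv_per_token / GIB    # n=16 sequences, ~4k prompt + 512 decode each
distiller_gib = 1.0                               # the online-trained probe

print(round(weights_gib + kv_gib + distiller_gib, 1))   # ~19 GiB against a 121 GiB envelope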

Tradeoffs and surprises

vLLM-on-Blackwell is solved; vLLM-on-Blackwell-with-tLLM is the open afternoon. The two halves of that sentence point in opposite directions. The base install — pip install vllm inside nvcr.io/nvidia/pytorch:25.11-py3 against torch 2.11.0+cu130 and SM 12.1 — resolved cleanly to vllm 0.20.0 in ~14 minutes, walked through one source build (fastsafetensors), and produced an engine that allocated KV cache, captured CUDA graphs, and warmed up in 82 seconds. None of that needed a Spark-specific patch. The runtime extension — tLLM, validated against vllm==0.10.x — needed two patches in two different files in the same session to clear two different signature drifts on the v1-engine API surface. Pinning to vllm==0.10.2 would close those drifts but open a different set: torch 2.11.0+cu130 is what the container’s stack converges on, and rolling torch back to whatever 0.10.x was built against re-opens the build matrix. The right answer depends on whether the tLLM authors land 0.20.x support upstream first or whether the user files patches and runs from a fork; this article is the data point that prompts that decision.

The runtime/algorithm split is the paper’s gift. ESamp the algorithm — predict deep hidden state from shallow, treat error as novelty, reweight tokens by (1+β)·llm − β·distiller — is reasonably small and well-isolated in the consumer. It is the runtime — the Producer/Consumer/Port/ConsumerFlow plumbing that lets a consumer read packed-tensor row-localized hidden states and write back through the sampler bridge without forking vLLM — that is the engineering load. The split means the algorithm is portable in principle: a TRT-LLM consumer with the same intervention math could in theory live on the verified Spark inference path. Nobody has written that consumer; the gap is engineering, not research.

Pass@k verifier loops are deferred, intentionally. AIME and HumanEval Pass@k requires per-task verifier loops (math correctness, sandbox code execution). Those are well-trodden ground in the eval ecosystem and not novel to this article — but they are non-trivial to wire up correctly, and the article’s claim is sharper if the runtime number is honest before the task number is asserted. The follow-up article in this series will land Pass@k on AIME and HumanEval against a Spark-side ESamp run; today’s article lands the runtime and characterizes its overhead.

fieldkit.eval already absorbs the harness; fieldkit.inference is the next surface to lift. The throughput-comparison loop, the per-prompt-batch metrics dict, the model_bank_on / single_off ratio computation — all of that fits inside the existing fieldkit.eval.Bench shape. What does not fit is a vLLM-flavored client wrapper analogous to the existing fieldkit.nim.NIMClient. A fieldkit.inference.VLLMClient would absorb the SamplingParams construction, the make_llm call, and the throughput-measurement boilerplate that every vLLM-side experiment in the series will otherwise repeat. This is the one new module the test-time-distilling work motivates; tracked for fieldkit v0.2. A second candidate worth filing now is fieldkit.eval.PassAtK — a verifier-loop primitive that takes a per-task grader and an n-sample iterator and returns pass@1, pass@k. That candidate also lands in v0.2, alongside the AIME/HumanEval follow-up article.

What this unlocks

A repeatable shape for “the paper’s runtime is in a different lane than my box’s runtime.” This article is not the first time the Spark has met a paper whose reference implementation runs on a stack the box does not curate; it will not be the last. The shape — pull a PyTorch container, install the third-party runtime from source, document the version-pin choices, then write the harness around it — is reusable. The next paper in the Frontier Scout queue (scientific-foundation-models-as-tools) will need a similar move for a different runtime. Documenting the surface area of the move now means the next one is shorter.

A test-time-scaling experimental substrate on the desk. vLLM-on-Blackwell now installs in one pip line and 14 minutes inside the standard NGC PyTorch container. That is the substrate every test-time-scaling technique lives on — speculative decoding, classifier-free guidance, contrastive decoding, beam-search ablations, the rest of the literature. All of them are sampler interventions; all of them are tunable on a workload the user owns; all of them benefit from the Spark’s ability to run n=16 (or n=64) parallel completions of a 7B reasoning model without rate limits or per-token billing. ESamp is the first-class citizen; the runtime install path is what makes the rest reachable.

A clean motivating case for fieldkit.inference. The same week’s AutoResearchBench article closed by surfacing a candidate fieldkit.eval.AgentRun for the per-question, per-turn agent schema. This week’s article surfaces two new candidates — fieldkit.inference.VLLMClient and fieldkit.eval.PassAtK — that the package’s v0.2 release will likely absorb. The pattern is healthy: each Frontier Scout article validates one or two existing modules and proposes one or two new ones, with the package’s CHANGELOG growing from real authoring rather than speculative scope.

Closing — exploration as the dual of capacity

The Spark’s distinguishing feature is not that it runs models you couldn’t run elsewhere; it is that it lets one person own the entire test-time-scaling loop end-to-end — including the part where the loop is two upstream patches away from completing. KV-cache arithmetic answered what fits; ESamp asks what does fitting waste. Same compute envelope, different question. The paper’s claim — that an online-trained ~1 GB probe converts decoding from lexical resampling into semantic exploration at ≤1.2% overhead — is exactly the kind of claim that wants a second machine to verify it, and a Spark is the second machine that does not need a second wallet. The verification is queued, not finished; the runtime substrate is now in place to finish it.

Next in the Frontier Scout series: the AIME and HumanEval Pass@k follow-up that this article scaffolds the harness for, with the two vllm_patch drifts either landed upstream or filed locally. After that, the same fieldkit.eval.Bench shape rolls into clawgym, claw-eval-live, and scientific-foundation-models-as-tools — three more papers, three more catalog-gap shapes, all of them landing on the desk and not the cloud.