Observability

Knowing what is actually happening on your GPUs. Latency, memory, throughput — instrumentation for a rig you cannot stare at all day.

Article №31 inference Foundation ~3 hours of measurement · ~one line of patch
Frontier Scout

Pass@k After the Seventh Patch — Three Shapes ESamp Takes on Spark

The patch count stood at six. Then the Pass@k harness surfaced a seventh: a one-line slice in the residual tap that only fires once batches shrink mid-run. Once cleared, ESamp takes three shapes: flat on saturated cells, lifting both pass rates where the instruct model has headroom, and +6.67 pp pass@8 on the unsaturated reasoning cell.

uses fieldkit.eval · fieldkit.capabilities

Article №30 inference Foundation ~2 hours of patching · ~30 minutes of measuring
Frontier Scout

Two Patches Were Six — ESamp Lands at 97.4% on a Patched Spark

Article #2 closed at two patches. Applying them surfaced six — including the silent return-shape adapter that broke the consumer's port. Once cleared, ESamp lands at 97.4% of baseline on patched Qwen 2.5 7B, within 1.4 pp of the paper's reference.

uses fieldkit.eval · fieldkit.capabilities

Article №29 inference Foundation ~2 hours — most of it watching vLLM 0.20 build inside an NGC PyTorch container; the runtime+drift diagnosis that follows is the short, sharp half
Frontier Scout

Test-Time Distilling on Spark — Same Compute Envelope, Wider Semantic Reach

ESamp adds a tiny test-time-trained probe to vLLM that converts decoding from lexical resampling into semantic exploration. The runtime is vLLM-native — and that is a Spark catalog-gap story before it is a benchmark.

uses fieldkit.eval · fieldkit.capabilities

Article №28 observability NIM ~3 hours — 30 min plumbing, ~20 min for the runs themselves, the rest is reading what they show
Frontier Scout

AutoResearchBench on Spark — Two NIMs, One Bench, Two Failure Modes

Two Spark-tuned NIMs run AutoResearchBench's three Deep-Research example questions. Llama-3.1-8B crashes by turn 5-6 on its 8K context; Nemotron-Nano-9B-v2 finishes cleanly at 128K. Both score 0% Accuracy@1 — for completely different reasons.

uses fieldkit.nim · fieldkit.eval · fieldkit.capabilities

Article №26 observability NIM Llama 3.1 8B ~2 hours wall-clock — analysis runs in seconds, the rest is reading + writing
Machine that Builds Machines

Was the Agent Researching, or Flailing? An Observability Pass on the Trajectory

A8 said the LoRA mode-collapsed because the trajectory was thin. This puts numbers on it: 6 of 13 knobs ever touched, 72% of proposals repeated a prior pair, and the proposer's k=5 history window is the structural cause.
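Both headline numbers fall out of a few lines over the trajectory log. A minimal sketch, under the assumption (hypothetical, not the article's actual log format) that each proposal is a (knob, value) pair:

```python
# Hypothetical trajectory: each entry is the (knob, value) pair the proposer emitted.
trajectory = [
    ("lr", 1e-4), ("rank", 16), ("lr", 1e-4), ("rank", 16), ("lr", 2e-4),
]

# "Knobs ever touched": distinct knob names across the whole run.
knobs_touched = len({knob for knob, _ in trajectory})

# "Repeated a prior pair": proposals whose exact (knob, value) pair appeared before.
seen, repeats = set(), 0
for pair in trajectory:
    if pair in seen:
        repeats += 1
    seen.add(pair)
repeat_rate = repeats / len(trajectory)

print(knobs_touched, round(repeat_rate, 2))  # → 2 0.4
```

On the article's real trajectory these two counters come out to 6 of 13 knobs and a 72% repeat rate.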

Article №15 observability NeMo Evaluator ~60 minutes end-to-end — 40 s to ingest the blog into pgvector, 2 min for retrieval, 4 min for generation across three 8B variants, 90 s for the LoRA variant, 9 min for grading
Second Brain

Ragas, Reranked — What 44 Held-Out Questions Say About the Second Brain Stack

A Ragas-style harness written in 200 lines of stdlib Python, run locally on the DGX Spark against four variants of the Second Brain RAG chain. Naive RAG scores 3.30 / 5. Rerank RAG scores 4.27. LoRA+RAG is the surprise: it does not beat naive. Retrieval is where the points come from.

uses fieldkit.eval
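The 200-line harness itself lives in the article; as a taste of the shape such a scorer takes in stdlib Python, here is a crude lexical stand-in for the grader (the function names, the sample record, and the overlap-based metric are all assumptions — the real harness grades with an LLM judge):

```python
import re
from statistics import mean

def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def faithfulness(answer, contexts):
    # Crude lexical proxy: fraction of answer tokens found in any retrieved chunk.
    a = tokenize(answer)
    c = set().union(*(tokenize(ctx) for ctx in contexts)) if contexts else set()
    return len(a & c) / len(a) if a else 0.0

def score_variant(records):
    # Map the [0, 1] overlap onto the article's 1-5 scale.
    return 1 + 4 * mean(faithfulness(r["answer"], r["contexts"]) for r in records)

# One hypothetical held-out record (the article's set has 44).
held_out = [{
    "question": "Which port does the pgvector container expose?",
    "answer": "The pgvector container listens on port 5432.",
    "contexts": ["The pgvector container listens on port 5432."],
}]
print(round(score_variant(held_out), 2))  # → 5.0
```

Swapping this lexical proxy for an LLM-judge call is what the extra ~180 lines buy; the per-variant aggregation stays the same shape.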

Upcoming observability NemoClaw ~30 min read
Machine that Builds Machines

Claw-Eval-Live on Spark — Reproduction Notes

Stand up the Claw-Eval-Live sandboxed-workflow protocol on Spark via NemoClaw + OpenShell, mock the business-service backends, run Llama 8B vs Nemotron 49B with deterministic-trace + LLM-judge grading, and chart where local agents land against the paper's 66.7% ceiling.

Upcoming observability NVIDIA DCGM + Prometheus + Grafana planned ~3 hours, mostly dashboard tuning

Watching the GPU — DCGM, Prometheus, and a Local Grafana for the Spark

A planned setup of DCGM Exporter → Prometheus → Grafana entirely on the Spark itself. The goal is a single dashboard that tells the truth about GPU memory, SM occupancy, and per-container utilization for a rig that's running NIMs, pgvector, and an occasional training job at the same time.
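The pipeline is standard enough that the Prometheus side can be sketched ahead of the build. A minimal scrape config, assuming dcgm-exporter runs in its default configuration (it serves metrics on port 9400 at /metrics):

```yaml
# prometheus.yml: scrape the DCGM exporter running on the Spark itself
scrape_configs:
  - job_name: "dcgm"
    scrape_interval: 5s              # GPU metrics move fast; 5s keeps spikes visible
    static_configs:
      - targets: ["localhost:9400"]  # dcgm-exporter's default port
```

Grafana then adds Prometheus as a data source, and the dashboard panels query gauges such as `DCGM_FI_DEV_GPU_UTIL` and `DCGM_FI_DEV_FB_USED`.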