Observability
Knowing what is actually happening on your GPUs. Latency, memory, throughput — instrumentation for a rig you cannot stare at all day.
Pass@k After the Seventh Patch — Three Shapes ESamp Takes on Spark
The patch count stood at six. Then the Pass@k harness surfaced a seventh — a one-line slice in the residual tap that only fires when batches shrink mid-run. Once cleared, ESamp takes three shapes: flat on saturated cells, lifting both rates on instruct headroom, and +6.67pp pass@8 on the unsaturated reasoning cell.
uses fieldkit.eval, fieldkit.capabilities
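The pass@k figures in these cards are presumably computed with the standard unbiased combinatorial estimator (the form popularized by the HumanEval paper) — a minimal sketch, not the harness's actual code:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n attempts is correct, given that
    c of the n attempts were correct."""
    if n - c < k:
        # Fewer than k incorrect attempts exist, so any k-subset
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 16 samples per task, 4 correct, estimating pass@8:
print(round(pass_at_k(16, 4, 8), 4))  # → 0.9615
```

Note the estimator reduces to the raw success rate at k=1 (`pass_at_k(n, c, 1) == c / n`), which is a quick sanity check on any harness implementing it.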
Two Patches Were Six — ESamp Lands at 97.4% on a Patched Spark
Article #2 closed at two patches. Applying them surfaced six — including the silent return-shape adapter that broke the consumer's port. Once cleared, ESamp lands at 97.4% of baseline on patched Qwen 2.5 7B, within 1.4 pp of the paper's reference.
uses fieldkit.eval, fieldkit.capabilities
Test-Time Distilling on Spark — Same Compute Envelope, Wider Semantic Reach
ESamp adds a tiny test-time-trained probe to vLLM that converts decoding from lexical resampling into semantic exploration. The runtime is vLLM-native — and that is a Spark catalog-gap story before it is a benchmark.
uses fieldkit.eval, fieldkit.capabilities
AutoResearchBench on Spark — Two NIMs, One Bench, Two Failure Modes
Two Spark-tuned NIMs run AutoResearchBench's three Deep-Research example questions. Llama-3.1-8B crashes by turn 5-6 on its 8K context; Nemotron-Nano-9B-v2 finishes cleanly at 128K. Both score 0% Accuracy@1 — for completely different reasons.
uses fieldkit.nim, fieldkit.eval, fieldkit.capabilities
Was the Agent Researching, or Flailing? An Observability Pass on the Trajectory
A8 said the LoRA mode-collapsed because the trajectory was thin. This puts numbers on it: 6 of 13 knobs ever touched, 72% of proposals repeated a prior pair, and the proposer's k=5 history window is the structural cause.
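The two headline numbers — knobs ever touched and proposals repeating a prior pair — fall out of two set passes over the trajectory. A minimal sketch, with a hypothetical trajectory (the real one has 13 knobs and the article's own counts):

```python
# Hypothetical trajectory: each proposal is a (knob, value) pair.
trajectory = [
    ("lr", 1e-4), ("rank", 16), ("lr", 1e-4),
    ("dropout", 0.1), ("rank", 16), ("lr", 1e-4),
]

def coverage_and_repeats(proposals, total_knobs):
    """Fraction of available knobs ever touched, and fraction of
    proposals that exactly repeat an earlier (knob, value) pair."""
    touched = {knob for knob, _ in proposals}
    seen, repeats = set(), 0
    for pair in proposals:
        if pair in seen:
            repeats += 1
        seen.add(pair)
    return len(touched) / total_knobs, repeats / len(proposals)

cov, rep = coverage_and_repeats(trajectory, total_knobs=13)
print(f"coverage={cov:.2f} repeat_rate={rep:.2f}")  # → coverage=0.23 repeat_rate=0.50
```

With a k=5 history window, the proposer cannot see repeats older than five steps, which is exactly how a high repeat rate coexists with a proposer that believes it is exploring.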
Ragas, Reranked — What 44 Held-Out Questions Say About the Second Brain Stack
A Ragas-style harness written in 200 lines of stdlib Python, run locally on the DGX Spark, against four variants of the Second Brain RAG chain. Naive RAG scores 3.30 / 5. Rerank RAG scores 4.27. LoRA+RAG is a surprise — it does not beat naive. Retrieval is where the points come from.
uses fieldkit.eval
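The 3.30-vs-4.27 comparison is, at bottom, a mean over per-question 1–5 judge scores per variant. A stdlib-only sketch of that aggregation step — the variant names and scores below are illustrative, not the article's data:

```python
from statistics import mean

# Hypothetical per-variant judge scores (1-5 scale); the real harness
# aggregates over 44 held-out questions.
scores = {
    "naive_rag":  [3, 4, 3, 3],
    "rerank_rag": [4, 5, 4, 4],
}

def variant_means(scores):
    """Collapse per-question judge scores into one mean per variant."""
    return {name: round(mean(vals), 2) for name, vals in scores.items()}

print(variant_means(scores))  # → {'naive_rag': 3.25, 'rerank_rag': 4.25}
```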
Claw-Eval-Live on Spark — Reproduction Notes
Stand up the Claw-Eval-Live sandboxed-workflow protocol on Spark via NemoClaw + OpenShell, mock the business-service backends, run Llama 8B vs Nemotron 49B with deterministic-trace + LLM-judge grading, and chart where local agents land against the paper's 66.7% ceiling.
Watching the GPU — DCGM, Prometheus, and a Local Grafana for the Spark
A planned setup of DCGM Exporter → Prometheus → Grafana entirely on the Spark itself. The goal is a single dashboard that tells the truth about GPU memory, SM occupancy, and per-container utilization for a rig that's running NIMs, pgvector, and an occasional training job at the same time.
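Downstream of that pipeline, the dashboard's raw material is Prometheus exposition text scraped from DCGM Exporter. A minimal sketch of pulling one gauge out of that text — `DCGM_FI_DEV_GPU_UTIL` is the exporter's GPU-utilization metric, but the sample payload here is illustrative, not captured from a Spark:

```python
# Illustrative /metrics payload in Prometheus exposition format.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-abc"} 87
"""

def gauge(metrics_text, name):
    """Return the first sample value for a metric, or None if absent."""
    for line in metrics_text.splitlines():
        # Sample lines start with the metric name; comment lines start with '#'.
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[1])
    return None

print(gauge(SAMPLE, "DCGM_FI_DEV_GPU_UTIL"))  # → 87.0
```

In the planned setup Prometheus does this parsing itself, of course; the sketch is only meant to show what the exporter's wire format looks like before Grafana ever sees it.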