Tag

#benchmark

Articles tagged "benchmark" — 2 entries.

Article №28 observability NIM ~3 hours — 30 min plumbing, ~20 min for the runs themselves, the rest is reading what they show
Frontier Scout

AutoResearchBench on Spark — Two NIMs, One Bench, Two Failure Modes

Two Spark-tuned NIMs run AutoResearchBench's three Deep-Research example questions. Llama-3.1-8B crashes by turn 5-6 on its 8K context; Nemotron-Nano-9B-v2 finishes cleanly at 128K. Both score 0% Accuracy@1 — for completely different reasons.

uses fieldkit.nim, fieldkit.eval, fieldkit.capabilities

Upcoming observability NemoClaw ~30 min read
Machine that Builds Machines

Claw-Eval-Live on Spark — Spark reproduction notes

Stand up the Claw-Eval-Live sandboxed-workflow protocol on Spark via NemoClaw + OpenShell, mock the business-service backends, run Llama 8B vs Nemotron 49B with deterministic-trace + LLM-judge grading, and chart where local agents land against the paper's 66.7 percent ceiling.