Tag: #evaluation

Articles tagged "evaluation" — 3 entries.

Article №28 · observability · NIM · ~3 hours (30 min plumbing, ~20 min for the runs themselves, the rest reading what they show)
Frontier Scout

AutoResearchBench on Spark — Two NIMs, One Bench, Two Failure Modes

Two Spark-tuned NIMs run AutoResearchBench's three Deep-Research example questions. Llama-3.1-8B crashes by turn 5-6 on its 8K context; Nemotron-Nano-9B-v2 finishes cleanly at 128K. Both score 0% Accuracy@1 — for completely different reasons.

uses fieldkit.nim, fieldkit.eval, fieldkit.capabilities

Article №26 · observability · NIM · Llama 3.1 8B · ~2 hours wall (analysis runs in seconds, the rest is reading and writing)
Machine that Builds Machines

Was the Agent Researching, or Flailing? An Observability Pass on the Trajectory

A8 said the LoRA mode-collapsed because the trajectory was thin. This puts numbers on it: 6 of 13 knobs ever touched, 72% of proposals repeated a prior pair, and the proposer's k=5 history window is the structural cause.
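The repetition numbers quoted in this entry come down to a single pass over the proposal log. A minimal sketch of that pass, assuming the trajectory is available as a list of (knob, value) pairs; the names and data shape here are illustrative, not the article's actual schema:

```python
def trajectory_stats(proposals, all_knobs):
    """Summarize a tuning trajectory.

    proposals: list of (knob, value) pairs in the order they were proposed.
    all_knobs: the full set of knobs the proposer could have touched.
    """
    touched = {knob for knob, _ in proposals}
    seen = set()
    repeats = 0  # proposals that exactly repeat a prior (knob, value) pair
    for pair in proposals:
        if pair in seen:
            repeats += 1
        seen.add(pair)
    return {
        "knobs_touched": f"{len(touched)} of {len(all_knobs)}",
        "repeat_rate": repeats / len(proposals) if proposals else 0.0,
    }
```

A counter like this is enough to surface both headline numbers: how much of the search space was ever visited, and what fraction of proposals were literal re-runs.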

Article №15 · observability · NeMo Evaluator · ~60 minutes end-to-end (40 s to ingest the blog into pgvector, 2 min for retrieval, 4 min for generation across three 8B variants, 90 s for the LoRA variant, 9 min for grading)
Second Brain

Ragas, Reranked — What 44 Held-Out Questions Say About the Second Brain Stack

A Ragas-style harness written in 200 lines of stdlib Python, run locally on the DGX Spark, against four variants of the Second Brain RAG chain. Naive RAG scores 3.30 / 5. Rerank RAG scores 4.27. LoRA+RAG is a surprise — it does not beat naive. Retrieval is where the points come from.

uses fieldkit.eval
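The headline metrics across these three entries reduce to two small aggregations: Accuracy@1 (fraction of questions whose top-ranked answer matches the reference exactly) and a mean grade on a 1–5 rubric. A minimal stdlib sketch under those assumptions; function names are illustrative, not the fieldkit.eval API:

```python
def accuracy_at_1(predictions, references):
    """Fraction of questions whose first-ranked answer matches the reference
    after trivial normalization (whitespace, case)."""
    hits = sum(
        1
        for pred, ref in zip(predictions, references)
        if pred.strip().lower() == ref.strip().lower()
    )
    return hits / len(references)

def mean_grade(grades):
    """Average of per-question rubric grades (e.g. 1..5 from an LLM judge)."""
    return sum(grades) / len(grades)
```

Exact-match Accuracy@1 is deliberately unforgiving, which is how two very different failure modes (a context-length crash and clean-but-wrong answers) can both land at 0%.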