Tag

#llama-cpp

Articles tagged "llama-cpp" — 6 entries.

Article №50 agentic Foundation 28 May 2026 ~3 hours including bakeoff + harness publish

The Hermes Vertical Router on a DGX Spark — One Brain Always Warm, Five Specialists Summoned on Demand

Five published Orionfold verticals plus the pinned MoE brain become a router on one Spark. The router does not run inference in parallel (the unified-memory envelope forbids that); a deterministic keyword classifier dispatches each prompt and serves the right specialist, one at a time.

uses fieldkit.harness

Article №49 agentic NIM 28 May 2026 ~6 hours across three serving lanes, N=5 attempts per prompt

Harnesses

Picking the Hermes Brain on a DGX Spark — When Throughput Stops Being the Answer

The Hermes serving-lane bakeoff couldn't pick a winner: all five lanes cleared the tool-call format bar. A graded brain-quality rubric breaks the tie — and shows the fastest serving lane is also the better agent, by a margin throughput could never have measured.

uses fieldkit.eval, fieldkit.harness

Article №46 deployment NIM 26 May 2026 ~3 hours, most of it model pulls and four cold-starts

Harnesses

The Hermes Serving Lane on a DGX Spark — MoE vs Dense, and the Number That Actually Picks the Lane

Five Hermes serving lanes on one DGX Spark: Qwen3-30B-A3B MoE vs Qwen3-32B dense across vLLM, llama.cpp, and NIM. The MoE runs ~8.5× faster for the same memory, but the lane is picked by tool-call reliability, and it took two config fights to get the tool-call failure rate to 0% on every lane.

uses fieldkit.capabilities, fieldkit.harness, fieldkit.nim

Article №43 fine-tuning Foundation 19 May 2026 ~1 hour (one container, six gates, two GGUFs)

Machine that Builds Machines

Unsloth on the Spark — When the Train-Time Peak Equals the Base-Load Peak

Six gates clear in one container against the v1 reset: pip install --no-deps preserves the s40 stack, FastLanguageModel loads at 16.94 GB peak, a 100-step LoRA train holds the same envelope, and save_pretrained_gguf() emits both quants in 207 seconds end-to-end.

Article №41 fine-tuning Foundation 17 May 2026 ~10 hours (mostly automated overnight sweeps)

Three-Mode Bracket: Baselining a Reasoning Model Before Fine-Tuning, On One Spark

Before you fine-tune a small reasoning model on a domain bench, you need to know where it stands. Three context modes (closed, retrieval, oracle) triangulate the model's ceiling on one Spark, with no Judge backend or cluster required.

Upcoming agentic Foundation planned ~14 min read

Machine that Builds Machines

Governed Routing With Receipts — When the Local Lane Consults the Frontier, and What It Costs

The Advisor's router is deterministic and observables-only: it escalates on detectable failure signals — a citation outside the retrieved set, a rank-sanity anomaly — never on vibes. Route bakeoffs at $0 and $0.0033, a no-egress gate for private state, and a receipt a script re-verifies.