Tag

#fieldkit

Articles tagged "fieldkit" — 10 entries.

Article №55 agentic Foundation 02 Jun 2026 ~15 min read — no setup; a synthesis of work already shipped on this box

The Meta-Program on a DGX Spark — When the Tool You Build With Is an Instance of the Thing You Build

The opener for the Machine-that-Builds-Machines arc. The book describes a meta-program on a SaaS platform; this is the same pattern on one personal box — a pane → hands → engine loop where the spec is the application and the skills are configuration over code.

Article №54 observability Foundation 03 Jun 2026 ~4 hours end-to-end — bring up the cockpit, drive a reindex + two RAG-evals through the control plane, score 44 questions, and ship the artifact

Second Brain

The Machine Manages Its Own Memory — and the Bug the Mocks Slept Through

Driving the Arena recall layer end-to-end on its own corpus: reindex → score → gate, dispatched through the control plane, recall@5 measured against 44 held-out questions. The first real drain caught a bug eight mock-injected unit tests had slept through — the case for operating the thing you built.

uses fieldkit.memoryfieldkit.arenafieldkit.harnessfieldkit.eval

Article №53 fine-tuning Foundation 03 Jun 2026 ~16 min read — a synthesis of a proven run plus the engine it became

Machine that Builds Machines

The Machine Improves Itself — Closed-Loop RLVR on a DGX Spark, Where the Eval Harness Is the Reward

Closed-loop RLVR on one box: an eval→reward→fine-tune loop where the Spark's own verifiers ARE the reward — no learned reward model. The hero finding is defensive: pick the checkpoint on a frozen held-out split, never the training pool, or the loop reports success while it regresses.

uses fieldkit.rlfieldkit.rewardfieldkit.evalfieldkit.lineage

Article №48 agentic Foundation 26 May 2026 ~3 hours, including the live tool-call gate against a local NIM

Harnesses

Hermes Drives the Spark via fieldkit-as-MCP — The Agent That Operates Its Own Machine

The keystone of the Harnesses series: expose a curated slice of fieldkit as MCP tools and the local Hermes agent can measure, quantize, publish, and retrieve on the box itself. The gate is a real llama-bench run the agent drove end-to-end — 0% tool-call format error, no API key.

uses fieldkit.harnessfieldkit.capabilitiesfieldkit.quantfieldkit.publishfieldkit.rag

Article №40 deployment llama.cpp 16 May 2026 ~5 hours end-to-end on a DGX Spark

Machine that Builds Machines

Orionfold/II-Medical-8B-GGUF on Spark — five medical-reasoning variants, MedMCQA mini-eval, ChatML reasoning format

Five GGUF variants of Intelligent-Internet/II-Medical-8B (Qwen3-8B + DAPO reasoning recipe) measured on a DGX Spark. Q5_K_M lands at 36.4 tok/s, 5.45 GB, and 52% on a MedMCQA n=50 mini-eval — above F16. First reasoning recipe in the series.

uses fieldkit.quantfieldkit.publishfieldkit.evalfieldkit.lineage

Article №39 deployment llama.cpp 15 May 2026 ~5 hours end-to-end on a DGX Spark

Machine that Builds Machines

Orionfold/SecurityLLM-GGUF on Spark — five cyber variants, CyberMetric mini-eval, MCQ letter scoring

Five GGUF variants of ZySec-AI/SecurityLLM measured on a DGX Spark — Q4_K_M scores 40% on CyberMetric MCQ at 47.7 tok/s and 4.1 GB; the smaller variants matched or beat F16's 34%. Third vertical card; zero fieldkit source changes.

uses fieldkit.quantfieldkit.publishfieldkit.evalfieldkit.lineage

Article №38 deployment llama.cpp 14 May 2026 ~5 hours end-to-end on a DGX Spark

Machine that Builds Machines

Orionfold/Saul-7B-Instruct-v1-GGUF on Spark — five legal variants, LegalBench mini-eval, four-axis measurement card

Five GGUF variants of Equall/Saul-7B-Instruct-v1 measured on a DGX Spark — Q5_K_M scores 72% on LegalBench (n=50, contains) at 20 tok/s and 4.8 GB. Each card carries perplexity, sustained tok/s, thermal envelope, and a 5-task LegalBench subset score.

uses fieldkit.quantfieldkit.publishfieldkit.evalfieldkit.lineage

Article №37 deployment llama.cpp 14 May 2026 ~6 hours end-to-end on a DGX Spark

Machine that Builds Machines

Orionfold/finance-chat-GGUF on Spark — five variants, FinanceBench mini-eval, four-axis measurement card

Five GGUF variants of AdaptLLM/finance-chat measured on a DGX Spark — Q8_0 perplexity-matches F16 losslessly, Q4_K_M ships at 31 tok/s. Each card carries perplexity, sustained tok/s, thermal envelope, and FinanceBench accuracy.

uses fieldkit.quantfieldkit.publishfieldkit.evalfieldkit.lineage

Article №36 fine-tuning NeMo 11 May 2026 ~30 min read

Machine that Builds Machines

Adaptive Turn Clipping on a Single Spark — A²TGPO, Studied from Source

A²TGPO redesigns how Information Gain feeds GRPO: turn-group normalization, variance-rescaled accumulation, and adaptive turn-level clipping. The paper's release is the code; the Spark's contribution is the lineage primitive that records what each trial learned.

uses fieldkit.capabilitiesfieldkit.trainingfieldkit.lineage

Article №35 agentic NeMo 10 May 2026 ~28 min read

Machine that Builds Machines

Reading the Lineage Primitive — cxcscmu Auto-Research, Studied from release_artifacts

cxcscmu's own lineage_on vs lineage_off ablation closes the case: same agent, same trial budget, same prompt template — only the rendered lineage block differs, and the run with lineage produces 5.3× more keeps and 3.2× less wall-time waste. This piece extracts that primitive into fieldkit.lineage.

uses fieldkit.capabilitiesfieldkit.trainingfieldkit.lineage