Series
Machine that Builds Machines
Field evidence for the book's Part-4 thesis (Ch10–11). Self-improvement loops on agent trajectories, synthetic-data pipelines, codegen agents, self-fine-tuning, alignment-engineering primitives — Karpathy's autoresearch loop is one installment of the broader arc. Each article grounds a chapter claim with a Spark-scale reproduction.
NeMo Framework on the Spark — What It Earns Over a Hand-Rolled train.py
Same 354M GPT, same 100 steps, same random tokens — once in a hand-rolled train.py against vanilla PyTorch, once via Megatron-Core inside the NeMo Framework container. Same hardware (GB10, 128 GB unified). The framework earns +5.8% throughput and 30% less GPU memory.
The GB10 Pretrain Envelope — Sweeping Batch, Sequence, and Precision on One Spark
Same 354M GPT, same training loop, swept across micro-batch (2,4,8,16), sequence length (1024,2048), and precision (bf16,fp8). 16 configurations, 30 steps each. Peak: 14,266 tokens/sec at batch=16, seq=1024, fp8 — 18% above the hand-rolled PyTorch baseline.
The Data-Path Envelope — When Real Tokens Beat Random Tokens at Pretrain Throughput
Curator-cleaned wikitext-103 (109M tokens, 417 MiB packed) feeding the same 354M GPT pretrain loop from A2. Eight configs swept; data-path overhead is 0.01–0.04% across all of them. New peak: 14,980 tok/s — slightly above A2's random-token ceiling.
Guardrails Before the Agent Edits — Code-Edit Policy as a Programmatic Funnel
Five programmatic rails between the Autoresearch agent's proposal and any mutation of train.py — schema, menu, range, cross-constraint, diff lint. 27 adversarial test cases: block recall 1.0, clean pass 1.0, every rail attribution correct. Zero LLM-as-judge calls.
The Autoresearch Loop — 50 Iterations of an LLM Editing Its Own Trainer Overnight
NIM Llama 3.1 8B drives a structured-perturbation agent loop against a 354M GPT pretrain. 50 iterations, 73.4 min wall, 0.07 kWh of electricity. 8 keeps, 42 reverts, 0 rail blocks, 0 crashes. Best result: val_bpb 10.8534, +0.93% over baseline at d_model=768.
Distilling the Architect — A 3B LoRA Trained on the Agent's Own Trajectory
A4's 50-iter trajectory becomes training data for a Qwen2.5-3B LoRA proposer. Holding out 8 iters, the 3B mode-collapses onto d_model=768 (the trajectory's most-frequent keep) and matches 0 / 8 exact; the 8B at T=0.5 matches 4 / 8 of its own past picks.
Was the Agent Researching, or Flailing? An Observability Pass on the Trajectory
A8 said the LoRA mode-collapsed because the trajectory was thin. This puts numbers on it: 6 of 13 knobs ever touched, 72% of proposals repeated a prior pair, and the proposer's k=5 history window is the structural cause.
Reading the Lineage Primitive — cxcscmu Auto-Research, Studied from release_artifacts
cxcscmu's own lineage_on vs lineage_off ablation closes the case: same agent, same trial budget, same prompt template — only the rendered lineage block differs, and the run with lineage produces 5.3× more keeps and 3.2× less wall-time waste. This piece extracts that primitive into fieldkit.lineage.
uses fieldkit.capabilitiesfieldkit.trainingfieldkit.lineage
Adaptive Turn Clipping on a Single Spark — A²TGPO, Studied from Source
A²TGPO redesigns how Information Gain feeds GRPO: turn-group normalization, variance-rescaled accumulation, and adaptive turn-level clipping. The paper's release is the code; the Spark's contribution is the lineage primitive that records what each trial learned.
uses fieldkit.capabilitiesfieldkit.trainingfieldkit.lineage
Orionfold/finance-chat-GGUF on Spark — five variants, FinanceBench mini-eval, four-axis measurement card
Five GGUF variants of AdaptLLM/finance-chat measured on a DGX Spark — Q8_0 perplexity-matches F16 losslessly, Q4_K_M ships at 31 tok/s. Each card carries perplexity, sustained tok/s, thermal envelope, and FinanceBench accuracy.
uses fieldkit.quantfieldkit.publishfieldkit.evalfieldkit.lineage
Orionfold/Saul-7B-Instruct-v1-GGUF on Spark — five legal variants, LegalBench mini-eval, four-axis measurement card
Five GGUF variants of Equall/Saul-7B-Instruct-v1 measured on a DGX Spark — Q5_K_M scores 72% on LegalBench (n=50, contains) at 20 tok/s and 4.8 GB. Each card carries perplexity, sustained tok/s, thermal envelope, and a 5-task LegalBench subset score.
uses fieldkit.quantfieldkit.publishfieldkit.evalfieldkit.lineage
Orionfold/SecurityLLM-GGUF on Spark — five cyber variants, CyberMetric mini-eval, MCQ letter scoring
Five GGUF variants of ZySec-AI/SecurityLLM measured on a DGX Spark — Q4_K_M scores 40% on CyberMetric MCQ at 47.7 tok/s and 4.1 GB; the smaller variants matched or beat F16's 34%. Third vertical card; zero fieldkit source changes.
uses fieldkit.quantfieldkit.publishfieldkit.evalfieldkit.lineage
Orionfold/II-Medical-8B-GGUF on Spark — five medical-reasoning variants, MedMCQA mini-eval, ChatML reasoning format
Five GGUF variants of Intelligent-Internet/II-Medical-8B (Qwen3-8B + DAPO reasoning recipe) measured on a DGX Spark. Q5_K_M lands at 36.4 tok/s, 5.45 GB, and 52% on a MedMCQA n=50 mini-eval — above F16. First reasoning recipe in the series.
uses fieldkit.quantfieldkit.publishfieldkit.evalfieldkit.lineage
The Trainer Was Fine, the Corpus Wasn't: Three Misdiagnoses on a Patent-Specialist Fine-Tune
Five thousand rows of synthetic patent reasoning, two clean 131-minute LoRA trains, three rounds of confident diagnosis — and none of them found the bug. The bug was the corpus all along. A field report on the cheapest mistake to make on the Spark.
Unsloth on the Spark — When the Train-Time Peak Equals the Base-Load Peak
Six gates clear in one container against the v1 reset: pip install --no-deps preserves the s40 stack, FastLanguageModel loads at 16.94 GB peak, a 100-step LoRA train holds the same envelope, save_pretrained_gguf() emits both quants in 207 seconds end-to-end.
The Gate Before the GPU — Deciding SFT vs RL vs RLVR Before You Spend the Run
Building Kepler — a numeric astrodynamics reasoner — from scratch on one Spark. The method choice (SFT vs RL vs RLVR) is decided by cheap gates before any GPU run: a base preflight, an SFT gate, and a Goldilocks headroom gate. A flawless RLVR run that changed nothing is the proof.
uses fieldkit.rlfieldkit.rewardfieldkit.eval
The Machine Improves Itself — Closed-Loop RLVR on a DGX Spark, Where the Eval Harness Is the Reward
Closed-loop RLVR on one box: an eval→reward→fine-tune loop where the Spark's own verifiers ARE the reward — no learned reward model. The hero finding is defensive: pick the checkpoint on a frozen held-out split, never the training pool, or the loop reports success while it regresses.
uses fieldkit.rlfieldkit.rewardfieldkit.evalfieldkit.lineage
The Meta-Program on a DGX Spark — When the Tool You Build With Is an Instance of the Thing You Build
The opener for the Machine-that-Builds-Machines arc. The book describes a meta-program on a SaaS platform; this is the same pattern on one personal box — a pane → hands → engine loop where the spec is the application and the skills are configuration over code.
The Refusal Floor Is Trainable — What a Frozen Curveball Proved About Prompts vs Weights
A 30B model with a hand-tuned prompt contract refused 3 of 9 adversarial pretexts and fabricated private-looking state 3 times. A 4B trained for 21 minutes refused 9 of 9. The bench that saw the difference was frozen before training — and that discipline is the whole method.
uses fieldkit.arenafieldkit.eval
Claw-Eval-Live on Spark — Spark reproduction notes
Stand up Claw-Eval-Live sandboxed-workflow protocol on Spark via NemoClaw + OpenShell, mock the business-service backends, run Llama 8B vs Nemotron 49B with deterministic-trace + LLM-judge grading, and chart where local agents land vs the paper 66.7 percent ceiling.
Heterogeneous Scientific Foundation Model Collaboration — Spark reproduction notes
Wrap a domain foundation model (Pangu-Weather) as a Triton tool, drive it from a NIM-served Llama 3.1 8B planner via NemoClaw, and show when specialist routing beats language-only reasoning — all inside the Spark 128 GB envelope.
Continued Pre-training on a DGX Spark — NeMo Framework Without a Cluster
When does it make sense to continue pre-training on a single GB10 box, and when is it a category error? A planned run that pushes NeMo Framework, Megatron-LM parallelism, and BF16 mixed precision against the 128 GB unified-memory wall with a small domain corpus.
SkillOS: Learning Skill Curation for Self-Evolving Agents — Spark reproduction notes
Reproducing the SkillOS curator/executor split on a DGX Spark — both Qwen3-8B (frozen executor + LoRA-trained curator) over a markdown SkillRepo with BM25 retrieval, then extracting the pattern into `fieldkit.skills`.
Gates Before the Advisor — Recall Floors, Raw-Base Preflights, and the Bench That Ate Its Own Spec
Before the Advisor trained: a 182-source corpus pack with recall gates on two retrieval lanes (BM25 and live pgvector + NIM embedder), raw-base preflights that failed two NVIDIA bases honestly, and the rebuild that caught the bench's own spec contaminating its retrieval context.
Synthetic Corpus Frameworks on the Spark — From a Bespoke Pipeline to an Orchestration Layer
A bespoke synth pipeline got 200 rows into a 5000-row reasoning corpus before a fourth meta-state surface form forced a retreat. The diagnosis: a regex-floor approach cannot catch novel surface forms by construction. The fix is the open-source orchestration layer.
Governed Routing With Receipts — When the Local Lane Consults the Frontier, and What It Costs
The Advisor's router is deterministic and observables-only: it escalates on detectable failure signals — a citation outside the retrieved set, a rank-sanity anomaly — never on vibes. Route bakeoffs at $0 and $0.0033, a no-egress gate for private state, and a receipt a script re-verifies.