Field notes on building AI-native businesses.

One builder's running log — articles on agent orchestration, governed inference, RAG patterns, training economics, and the deep-dive papers that anchor the AI Native Platform series. Every post is a session transcript turned into an essay.

At a glance

35 articles (+10 upcoming) · 114,166 words · 19,860 lines of code (14k article · 6k package) · 8 models · 11 NVIDIA products

Products & frameworks
NVIDIA NIM 30 · DGX Spark 27 · NeMo Framework 26 · TensorRT-LLM 16 · Triton Inference Server 12 · pgvector 12 · NeMo Retriever 10 · NemoClaw 8 · Ollama 5 · NeMo Guardrails 4 · OpenClaw 3

Models deployed
Llama 3.1 8B Instruct 18 · Qwen2.5 7B Instruct 7 · Nemotron Reranker 1B 6 · Nemotron Super 49B 5 · Qwen2.5 3B Instruct 4 · Nemotron Embed 1B v2 4 · Llama 3.3 70B Instruct 3 · Nemotron Nano 9B v2 2
Upcoming observability NemoClaw ~30 min read
Machine that Builds Machines

Claw-Eval-Live on Spark — Spark reproduction notes

Stand up the Claw-Eval-Live sandboxed-workflow protocol on Spark via NemoClaw + OpenShell, mock the business-service backends, run Llama 8B vs Nemotron 49B with deterministic-trace + LLM-judge grading, and chart where local agents land against the paper's 66.7% ceiling.

Upcoming agentic NemoClaw ~30 min read
Machine that Builds Machines

Heterogeneous Scientific Foundation Model Collaboration — Spark reproduction notes

Wrap a domain foundation model (Pangu-Weather) as a Triton tool, drive it from a NIM-served Llama 3.1 8B planner via NemoClaw, and show when specialist routing beats language-only reasoning — all inside the Spark 128 GB envelope.

Upcoming training NeMo Framework + Llama 3.1 8B planned ~2 days of wall-clock, one long weekend
Machine that Builds Machines

Continued Pre-training on a DGX Spark — NeMo Framework Without a Cluster

When does it make sense to continue pre-training on a single GB10 box, and when is it a category error? A planned run that pushes NeMo Framework, Megatron-LM parallelism, and BF16 mixed precision against the 128 GB unified-memory wall with a small domain corpus.

Upcoming fine-tuning NeMo ~30 min read
Machine that Builds Machines

A²TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping — Spark reproduction notes

Reproducing A²TGPO turn-level clipping on Qwen3-4B on a single DGX Spark — local faiss retriever + verl + the three IG primitives, then promoting fieldkit.training.rl to a first-class submodule.

Upcoming agentic NIM ~30 min read
Second Brain

Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction — Spark reproduction notes

Reproducing DCI-Agent-Lite on a DGX Spark — NIM-served 8B agent + ripgrep + filesystem corpus, no embedder or vector DB; extracts the operator vocabulary as `fieldkit.rag.operators` and quantifies how much of the existing pgvector + reranker stack DCI lets you delete.

Upcoming inference NIM ~30 min read
LLM Wiki

RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation — Spark reproduction notes

Reproducing the RaguTeam SemEval-2026 T8 winning system on a DGX Spark — judge-orchestrated 7-LLM ensemble (Qwen3-4B-FP8 + Meno-Lite-0.1 7B local + remote members) with Qwen3-32B judge, then extracting the pattern into `fieldkit.ensemble` + `fieldkit.judge`.

Upcoming agentic NemoClaw ~30 min read
Machine that Builds Machines

SkillOS: Learning Skill Curation for Self-Evolving Agents — Spark reproduction notes

Reproducing the SkillOS curator/executor split on a DGX Spark — both Qwen3-8B (frozen executor + LoRA-trained curator) over a markdown SkillRepo with BM25 retrieval, then extracting the pattern into `fieldkit.skills`.

Upcoming fine-tuning NeMo Customizer + Nemotron Nano 9B v2 planned ~4 hours per sweep
LLM Wiki

LoRA on Nemotron Nano — Fine-tuning a 9B Without Blowing Unified Memory

A planned walk through LoRA fine-tuning on Nemotron Nano 9B with NeMo Customizer: rank and alpha sweeps, a tiny domain corpus, and the memory accounting that keeps a PEFT run from tripping the Spark's 128 GB unified-memory wall.

Upcoming observability NVIDIA DCGM + Prometheus + Grafana planned ~3 hours, mostly dashboard tuning

Watching the GPU — DCGM, Prometheus, and a Local Grafana for the Spark

A planned setup of DCGM Exporter → Prometheus → Grafana entirely on the Spark itself. The goal is a single dashboard that tells the truth about GPU memory, SM occupancy, and per-container utilization for a rig that's running NIMs, pgvector, and an occasional training job at the same time.

Upcoming dev-tools NVIDIA Nsight Systems + CUDA Toolkit planned ~4 hours including trace analysis

Tracing a NIM Request with Nsight Systems — What the 24.8 tok/s Number Hides

A planned kernel-level trace of a single NIM inference request on GB10. Where does the wall-clock time actually go — tokenization, KV-cache attention, the sampling loop, memcpy? The article turns 24.8 tokens per second into a timeline you can point at and say 'that line is the bottleneck'.

Article №35 agentic NeMo ~28 min read
Machine that Builds Machines

Reading the Lineage Primitive — cxcscmu Auto-Research, Studied from release_artifacts

cxcscmu's own lineage_on vs lineage_off ablation closes the case: same agent, same trial budget, same prompt template — only the rendered lineage block differs, and the lineage-on run produces 5.3× more keeps with 3.2× less wall-time waste. This piece extracts that primitive into fieldkit.lineage.

uses fieldkit.capabilities · fieldkit.training

Article №34 fine-tuning NeMo ~18.5 hours wall (50 T²PO steps + three evals)
Frontier Scout

T²PO on Spark — When the Training Pool Says 28/32 and Held-out Says 9/158

T²PO's two deltas on the Phase 6 ClawGym harness: mean turns 5.00 → 4.61, task_complete 154/158, but the per-assertion ceiling stays flat at 47.7%. The strongest training-side step (45) is the worst held-out checkpoint — pool saturation lies on a single Spark.

uses fieldkit.capabilities · fieldkit.eval · fieldkit.training

Article №33 fine-tuning NeMo ~9 hours wall (34 GRPO steps + two evals)
Frontier Scout

ClawGym GRPO on Spark — Closing the Loop the SFT Adapter Couldn't

Phase 5 SFT taught the agent to keep working but never to stop. 34 GRPO steps with a shaped reward unlearn the failure mode — same model, same base, same LoRA-init, but task_complete climbs 0/158 → 154/158, mean turns drop 12 → 5, and per-assertion still inches up +3.1 pp.

Article №32 fine-tuning NeMo ~3 days end-to-end (mostly waiting on rollouts)
Frontier Scout

ClawGym on Spark — A 7B Base, A LoRA Adapter, and the +15 pp the Adapter Earned

ClawGym shipped only a .github profile, so we built the substrate ourselves — persona task synth, sandbox harness, 200-task corpus, LoRA SFT, matched-base eval. The adapter earns +3.8 pp task pass and +15.0 pp per-assertion against its own base. The diagnostic is the lift.

uses fieldkit.nim

Article №31 inference Foundation ~3 hours of measurement · ~one line of patch
Frontier Scout

Pass@k After the Seventh Patch — Three Shapes ESamp Takes on Spark

Patches were six. The Pass@k harness surfaced a seventh — a one-line slice in the residual tap that only fires once batches shrink mid-run. Once cleared, ESamp takes three shapes: flat on saturated cells, lifting both rates on instruct headroom, and +6.67 pp pass@8 on the unsaturated reasoning cell.

uses fieldkit.eval · fieldkit.capabilities
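The article's harness isn't reproduced here, but pass@k is conventionally computed with the unbiased estimator rather than by literally drawing k samples: given n generations of which c pass, it is the probability that a size-k draw contains at least one pass. A minimal sketch with hypothetical sample counts:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples drawn
    without replacement from n generations is correct, given c correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical cell: 16 generations, 4 correct.
print(round(pass_at_k(16, 4, 1), 4))  # 0.25
print(round(pass_at_k(16, 4, 8), 4))  # ~0.96 -- pass@8 rises sharply
```

The early-return branch matters on saturated cells: once n − c < k, pass@k pins at 1.0 and the metric stops discriminating, which is one way a cell reads as "flat".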

Article №30 inference Foundation ~2 hours of patching · ~30 minutes of measuring
Frontier Scout

Two Patches Were Six — ESamp Lands at 97.4% on a Patched Spark

Article #2 closed at two patches. Applying them surfaced six — including the silent return-shape adapter that broke the consumer's port. Once cleared, ESamp lands at 97.4% of baseline on patched Qwen 2.5 7B, within 1.4 pp of the paper's reference.

uses fieldkit.eval · fieldkit.capabilities

Article №29 inference Foundation ~2 hours — most of it watching vLLM 0.20 build inside an NGC PyTorch container; the runtime+drift diagnosis that follows is the short, sharp half
Frontier Scout

Test-Time Distilling on Spark — Same Compute Envelope, Wider Semantic Reach

ESamp adds a tiny test-time-trained probe to vLLM that converts decoding from lexical resampling into semantic exploration. The runtime is vLLM-native — and that is a Spark catalog-gap story before it is a benchmark.

uses fieldkit.eval · fieldkit.capabilities

Article №28 observability NIM ~3 hours — 30 min plumbing, ~20 min for the runs themselves, the rest is reading what they show
Frontier Scout

AutoResearchBench on Spark — Two NIMs, One Bench, Two Failure Modes

Two Spark-tuned NIMs run AutoResearchBench's three Deep-Research example questions. Llama-3.1-8B crashes by turn 5-6 on its 8K context; Nemotron-Nano-9B-v2 finishes cleanly at 128K. Both score 0% Accuracy@1 — for completely different reasons.

uses fieldkit.nim · fieldkit.eval · fieldkit.capabilities

Article №27 foundations TensorRT-LLM ~22 minute read
Looking Beyond Spark

Looking Beyond Spark — KV-Cache Arithmetic at Inference

The serving memory bill is not weights. It's KV cache, and KV scales with concurrent users × context length, not parameters. Same four bills as training; different weights. A 70B at 32 users × 16k context wants 168 GB just for KV — and the Spark teaches you the per-token math.

uses fieldkit.capabilities
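The per-token arithmetic behind that bill follows from the cache shape: two tensors (K and V) per layer, per KV head, per head dimension, for every token in flight. A back-of-envelope calculator, assuming Llama-70B-like GQA dimensions (80 layers, 8 KV heads, head dim 128, FP16 cache — assumed values, not taken from the article):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, dtype_bytes, users, ctx_len):
    # K and V are each cached per layer for every in-flight token
    per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return per_token * users * ctx_len

# Assumed Llama-3.3-70B-like GQA config, 32 users at 16k context
total = kv_cache_bytes(80, 8, 128, 2, users=32, ctx_len=16_384)
print(f"{total / 1e9:.0f} GB")  # 172 GB -- same ballpark as the article's 168 GB
```

Note the formula never mentions parameter count: halve the context or the concurrency and the bill halves, which is the "scales with users × context, not parameters" point.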

Article №26 observability NIM Llama 3.1 8B ~2 hours wall — analysis runs in seconds, the rest is reading + writing
Machine that Builds Machines

Was the Agent Researching, or Flailing? An Observability Pass on the Trajectory

A8 said the LoRA mode-collapsed because the trajectory was thin. This puts numbers on it: 6 of 13 knobs ever touched, 72% of proposals repeated a prior pair, and the proposer's k=5 history window is the structural cause.

Article №25 fine-tuning NeMo Customizer ~2 hours wall — 4 min LoRA training, 4 min race, the rest writing
Machine that Builds Machines

Distilling the Architect — A 3B LoRA Trained on the Agent's Own Trajectory

A4's 50-iter trajectory becomes training data for a Qwen2.5-3B LoRA proposer. Holding out 8 iters, the 3B mode-collapses onto d_model=768 (the trajectory's most-frequent keep) and matches 0 / 8 exact; the 8B at T=0.5 matches 4 / 8 of its own past picks.

Article №24 training Foundation ~30 minute read · math + economics, no GPU required
Looking Beyond Spark

Derisking the Cloud Pretrain — How a $5K Spark Saves $50K on H100 Rentals

The Spark is too small for a serious pretrain — but it's the right size for the recipe-search that precedes one. Cull 100 candidate architectures down to 3 on one Spark for ~$1 of electricity, then book the cloud node knowing what to train. The expected savings per campaign run into the thousands.

Article №23 foundations Foundation ~15 minute read · no GPU required
Looking Beyond Spark

What the Agent Actually Built — Five Articles in Plain English, and Why You Probably Don't Want to Train From Scratch

Five technical articles in one day built an unattended AI research loop on a desk for $0.02 of electricity. The plain-English readout: what the agent built (not a usable model), what it changes for one person, and a four-tier roadmap from LoRA in minutes to from-scratch in weeks.

Article №22 agentic NeMo ~3 hours — 90 min to scaffold the loop, 73 min for the unattended run, the rest is reading the trajectory
Machine that Builds Machines

The Autoresearch Loop — 50 Iterations of an LLM Editing Its Own Trainer Overnight

NIM Llama 3.1 8B drives a structured-perturbation agent loop against a 354M GPT pretrain. 50 iterations, 73.4 min wall, 0.07 kWh of electricity. 8 keeps, 42 reverts, 0 rail blocks, 0 crashes. Best result: val_bpb 10.8534, +0.93% over baseline at d_model=768.

Article №21 agentic NeMo Guardrails ~2 hours — 30 min for the perturbation menu + structured proposal schema, 60 min for the 5 rails + 27-case adversarial bench, 30 min to write up
Machine that Builds Machines

Guardrails Before the Agent Edits — Code-Edit Policy as a Programmatic Funnel

Five programmatic rails between the Autoresearch agent's proposal and any mutation of train.py — schema, menu, range, cross-constraint, diff lint. 27 adversarial test cases: block recall 1.0, clean pass 1.0, every rail attribution correct. Zero LLM-as-judge calls.

Article №20 training NeMo ~2 hours — 5 min for the corpus pull, 45 min for a derived container build, 2 min for the Curator pipeline + 40s tokenize, 3 min for the 8-config sweep, the rest is reading the numbers
Machine that Builds Machines

The Data-Path Envelope — When Real Tokens Beat Random Tokens at Pretrain Throughput

Curator-cleaned wikitext-103 (109M tokens, 417 MiB packed) feeding the same 354M GPT pretrain loop from A2. Eight configs swept; data-path overhead is 0.01–0.04% across all of them. New peak: 14,980 tok/s — slightly above A2's random-token ceiling.

Article №19 training NeMo ~30 min once the NeMo container is on disk — 7.4 min wall for the 16-config sweep, the rest is reading the numbers
Machine that Builds Machines

The GB10 Pretrain Envelope — Sweeping Batch, Sequence, and Precision on One Spark

Same 354M GPT, same training loop, swept across micro-batch (2,4,8,16), sequence length (1024,2048), and precision (bf16,fp8). 16 configurations, 30 steps each. Peak: 14,266 tokens/sec at batch=16, seq=1024, fp8 — 18% above the hand-rolled PyTorch baseline.

Article №18 training NeMo ~3 hours — 90 min for two container pulls (PyTorch 30 GB, NeMo Framework Megatron Backend 70 GB), 30 min for the matched scripts, 10 min for the two pretrain runs and analysis
Machine that Builds Machines

NeMo Framework on the Spark — What It Earns Over a Hand-Rolled train.py

Same 354M GPT, same 100 steps, same random tokens — once in a hand-rolled train.py against vanilla PyTorch, once via Megatron-Core inside the NeMo Framework container. Same hardware (GB10, 128 GB unified). The framework earns +5.8% throughput and 30% less GPU memory.

Article №17 agentic NIM ~90 minutes — 30 min to design the tool surface, 30 min to wire FastMCP + pgvector, 15 min to register with Claude Code, 15 min for the demo and trace
Second Brain

Second Brain as a Tool — Wrapping the RAG Stack in MCP for Claude Code

Closing the Second Brain arc. Four MCP tools wrap the RAG chain — embed, retrieve, optionally rerank, generate — and any Claude Code session anywhere on the box becomes a grounded research client. 200 lines of Python, one launcher, one .mcp.json entry.

Article №16 foundations Foundation ~25 minute read
Looking Beyond Spark

Looking Beyond Spark — Fine-Tuning a 100B Nemotron

A working answer to: how many GPUs to fine-tune a 100B Nemotron? Three methods, three memory footprints — full FT ≈ 1.6 TB needs 24× H100; LoRA ≈ 250 GB fits 8× H100; QLoRA ≈ 65 GB fits 1× H200. The Spark's 3B LoRA teaches the math.

uses fieldkit.capabilities
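The three footprints above follow from standard per-parameter accounting — BF16 weights and gradients plus FP32 master copy and Adam moments for full FT, a frozen (or quantized) base plus a small trainable adapter for LoRA/QLoRA. A sketch with assumed byte counts and an assumed ~1% adapter fraction (both assumptions, not figures from the article; activations are excluded):

```python
PARAMS = 100e9  # 100B model

# Rough optimizer-state accounting per parameter, activations excluded:
full_ft = PARAMS * (2 + 2 + 4 + 4 + 4)          # bf16 weights + grads, fp32 master + Adam m + v
lora    = PARAMS * 2   + 0.01 * PARAMS * 16     # frozen bf16 base + ~1% trainable in full state
qlora   = PARAMS * 0.5 + 0.01 * PARAMS * 16     # 4-bit base + same adapter state

for name, b in [("full FT", full_ft), ("LoRA", lora), ("QLoRA", qlora)]:
    print(f"{name}: {b / 1e9:,.0f} GB")
```

This lands at ~1,600 GB / ~216 GB / ~66 GB — in line with the article's 1.6 TB and 65 GB, with activation and framework overhead accounting for the gap on the LoRA figure.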

Article №15 observability NeMo Evaluator ~60 minutes end-to-end — 40 s to ingest the blog into pgvector, 2 min for retrieval, 4 min for generation across three 8B variants, 90 s for the LoRA variant, 9 min for grading
Second Brain

Ragas, Reranked — What 44 Held-Out Questions Say About the Second Brain Stack

A Ragas-style harness written in 200 lines of stdlib Python, run locally on the DGX Spark, against four variants of the Second Brain RAG chain. Naive RAG scores 3.30 / 5. Rerank RAG scores 4.27. LoRA+RAG is a surprise — it does not beat naive. Retrieval is where the points come from.

uses fieldkit.eval

Article №14 fine-tuning Hugging Face PEFT + Qwen2.5-3B-Instruct ~45 minutes end-to-end — 5 min corpus via NIM 8B, 69 s training, 3 min benchmark, plus a 6 GB base-model download
Second Brain

LoRA on Your Own Q&A — What 231 Pairs Actually Teach a 3B Model

231 own-voice Q&A pairs, a rank-16 LoRA, 69 s of training on a GB10 Spark. The adapter won't memorize your exact numbers, but it will take a model that refuses 61% of questions about your work and turn it into one that answers all of them in your voice. For facts you still need RAG.

uses fieldkit.eval

Article №13 deployment TensorRT-LLM + Triton Inference Server ~4 hours including two container pulls and three engine builds
Second Brain

TensorRT-LLM on the Spark — FP8 Isn't the Reason to Drop NIM. NVFP4 Is.

Dropping below NIM to raw TensorRT-LLM on a GB10 Spark. FP8 beats NIM's vLLM by 10-15% — barely worth the rebuild. NVFP4 beats it by 76% on decode, 43% on TTFT, and ships a 34%-smaller engine. The reason to drop NIM is the Blackwell-native 4-bit kernel, not FP8.

Article №12 foundations Foundation 10-minute read; no hands-on
Foundations

One Substrate, Three Apps — Where the Foundation Forks

Seven articles installed one stack on the Spark — NIM, Embed, pgvector, RAG glue, reranker, generator A/B, Guardrails. This bridge retells that install as three different answers to one question — corpus plus 128 GB — and walks readers to the top of three tracks.

Article №11 inference NeMo Guardrails ~90 minutes on top of the rerank-fusion / bigger-generator chain
Foundations

One Rail, Three Policies — NeMo Guardrails on the Retrieval Path

NeMo Guardrails drops a policy gate between retrieval and generation. One install, three per-arc configs — PII for Second Brain, style for LLM Wiki, code-safety for Autoresearch — and a 15-query benchmark: 100% block recall, 100% clean pass. Rails are scaffolding; detectors are the content.

uses fieldkit.rag

Article №10 inference Llama 3.3 70B + Nemotron-Super-49B + Llama 3.1 8B NIM ~30 minutes on top of the rerank-and-fusion chain
Foundations

Bigger Generator, Same Grounding — 8B vs 49B vs 70B on One Retrieval Chain

The rerank-and-fusion article bet that a bigger generator would heal the 8B Google-IPO refusal. Ran the A/B across three sizes on one retrieval chain. Bet lost: Nemotron-Super-49B refuses more than the 8B baseline; Llama 3.3 70B narrows the gap but doesn't close it. The refusal was the scaffold working.

uses fieldkit.rag

Article №09 inference Nemotron Reranker + pgvector full-text + Llama 3.1 8B NIM ~45 minutes on top of the naive-RAG chain
Foundations

Hybrid Retrieval on the Spark — BM25, Dense, Fusion, Rerank

Four retrieval modes on one corpus — naive dense, BM25, Reciprocal Rank Fusion, Nemotron rerank. Dense is already 92% recall@5; rerank adds a point at K=10 and reorders the top. The 8B generator still refuses where retrieval is perfect — grounding, not retrieval, is the new bottleneck.

uses fieldkit.rag
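Of the four modes above, Reciprocal Rank Fusion is the one that is nearly a one-liner. A minimal sketch of the standard formulation — each document scores the sum of 1/(k + rank) across the lists it appears in, with k=60 the conventional constant (the article's exact settings and doc IDs below are not given; they are hypothetical):

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each doc scores sum(1 / (k + rank)) over lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]   # hypothetical dense top-3
bm25  = ["d1", "d9", "d3"]   # hypothetical BM25 top-3
print(rrf([dense, bm25]))    # ['d1', 'd3', 'd9', 'd7']
```

Documents that appear in both lists float to the top even when neither list ranked them first — which is exactly the reordering effect the teaser describes at the top of the result set.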

Article №08 inference Llama 3.1 8B NIM + Nemotron Retriever + pgvector ~30 minutes if the three endpoints are already warm
Foundations

Three Endpoints, One Answer — Naive RAG on a DGX Spark

Three endpoints in one curl chain — a query embeds through Nemotron, pgvector returns top-5 chunks in under 80 ms, and a Llama 3.1 8B NIM stuffs them into a strict-context prompt. The chain works; the 8B generator still refuses on questions its own context answers.

uses fieldkit.rag · fieldkit.eval

Article №07 inference pgvector ~15 minutes first install, re-runs in seconds
Foundations

Where Your Vectors Live — pgvector on a DGX Spark

The substrate between the embed call and the retrieve call — pgvector 0.8.2 running as a Postgres 16 container on GB10, with 1000 Nemotron vectors, HNSW and ivfflat both indexed, and a planner that prefers seq scan until you tell it otherwise.

uses fieldkit.rag

Article №06 inference NeMo ~30 minutes first install, ~1 minute every restart after
Foundations

Your Own Semantic Space — a Nemotron Embedding NIM on a DGX Spark

The embedding endpoint that every downstream RAG, wiki, and agent piece will reuse — a 2048-dim Nemotron Retriever NIM running locally on GB10, ready 52 seconds after docker run and holding 28 docs/s under batched load.

uses fieldkit.rag

Article №05 inference NIM ~2 hours first install, ~2 minutes every restart after
Foundations

Your First NIM on a DGX Spark — What 24.8 Tokens Per Second Doesn't Tell You

First-contact notes on NVIDIA's DGX-Spark-specific Llama 3.1 8B NIM. 9.4 GB image, ~108 s warm-cache cold-start, 24.8 tok/s steady, OpenAI-compatible on :8000 — and a confidently wrong Python one-liner that clarifies what small-model FP8 buys and what it costs.

uses fieldkit.nim

Article №04 agentic NemoClaw ~2 hours after prerequisites

The Sandbox Tax That Wasn't — NemoClaw vs OpenClaw on One DGX Spark

I ran NemoClaw's sandboxed agent stack and the host Ollama-OpenClaw CLI side by side on one DGX Spark with the same 123B Nemotron model. The sandbox overhead I went looking for is real but modest (~2× raw inference); the real tax is onboarding, and NemoClaw paid it at install time.

Article №03 foundations Foundation ~6 hours spread across a week

Access First, Models Second — How I Set Up My DGX Spark for Solo AI Work

Most DGX Spark walkthroughs open with CUDA and tokens/sec. This one opens with streaming, AI-pair-programming, sandboxed agents, and browser automation — the access layer. For a solo edge builder, that interaction stack is more load-bearing than the model stack.

Article №02 foundations ainative 22 min read
AI Native Platform

Case Study: Building a Domain Application in One Day with AI Agent Infrastructure

A single-day, verifiable build of a full wealth-management application — prediction-market integration, divergence detection, scenario modeling, and conviction synthesis — entirely on ainative-business primitives. 7,435 lines of domain code built on 31k–57k lines of inherited platform. Every metric traceable to a specific git commit.

Article №01 foundations ainative 22 min read
AI Native Platform

AI Transformation Research: Building and Governing AI-Native Businesses

How solo founders, agencies, and PE firms build AI-native businesses with governed agent orchestration. A market-sizing, governance, and competitive-landscape analysis of the $7.63B→$52.62B agent economy, the 80–90% production failure rate, and the structural gap no platform currently fills.

More articles in preparation
End of Vol. 01