Tag
#dgx-spark
Articles tagged "dgx-spark" — 45 entries.
The Refusal Floor Is Trainable — What a Frozen Curveball Proved About Prompts vs Weights
A 30B model with a hand-tuned prompt contract refused 3 of 9 adversarial pretexts and fabricated private-looking state 3 times. A 4B trained for 21 minutes refused 9 of 9. The bench that saw the difference was frozen before training — and that discipline is the whole method.
uses fieldkit.arena, fieldkit.eval
The Meta-Program on a DGX Spark — When the Tool You Build With Is an Instance of the Thing You Build
The opener for the Machine-that-Builds-Machines arc. The book describes a meta-program on a SaaS platform; this is the same pattern on one personal box — a pane → hands → engine loop where the spec is the application and the skills are configuration over code.
The Machine Manages Its Own Memory — and the Bug the Mocks Slept Through
Driving the Arena recall layer end-to-end on its own corpus: reindex → score → gate, dispatched through the control plane, recall@5 measured against 44 held-out questions. The first real drain caught a bug eight mock-injected unit tests had slept through — the case for operating the thing you built.
uses fieldkit.memory, fieldkit.arena, fieldkit.harness, fieldkit.eval
The Machine Improves Itself — Closed-Loop RLVR on a DGX Spark, Where the Eval Harness Is the Reward
Closed-loop RLVR on one box: an eval→reward→fine-tune loop where the Spark's own verifiers ARE the reward — no learned reward model. The hero finding is defensive: pick the checkpoint on a frozen held-out split, never the training pool, or the loop reports success while it regresses.
uses fieldkit.rl, fieldkit.reward, fieldkit.eval, fieldkit.lineage
The Gate Before the GPU — Deciding SFT vs RL vs RLVR Before You Spend the Run
Building Kepler — a numeric astrodynamics reasoner — from scratch on one Spark. The method choice (SFT vs RL vs RLVR) is decided by cheap gates before any GPU run: a base preflight, an SFT gate, and a Goldilocks headroom gate. A flawless RLVR run that changed nothing is the proof.
uses fieldkit.rl, fieldkit.reward, fieldkit.eval
Cost-Routing the Hermes Harness — When Local Stops Being Enough on a DGX Spark
The local 30B-MoE on a Spark is at $0 marginal cost — until it isn't. H6 measures the failure-mode curve: where does local stop being enough, and what does the dollar curve look like when you escalate to OpenRouter only when you have to?
uses fieldkit.harness, fieldkit.eval
The Hermes Vertical Router on a DGX Spark — One Brain Always Warm, Five Specialists Summoned on Demand
Five published Orionfold verticals plus the pinned MoE brain become a router on one Spark — not by parallel inference (the unified-memory envelope forbids that), but by a deterministic keyword classifier that dispatches the prompt and serves the right specialist one-at-a-time.
uses fieldkit.harness
Picking the Hermes Brain on a DGX Spark — When Throughput Stops Being the Answer
The Hermes serving-lane bakeoff couldn't pick a winner: all five lanes cleared the tool-call format bar. A graded brain-quality rubric breaks the tie — and shows the fastest serving lane is also the better agent, by a margin throughput could never have measured.
uses fieldkit.eval, fieldkit.harness
Hermes Drives the Spark via fieldkit-as-MCP — The Agent That Operates Its Own Machine
The keystone of the Harnesses series: expose a curated slice of fieldkit as MCP tools and the local Hermes agent can measure, quantize, publish, and retrieve on the box itself. The gate is a real llama-bench run the agent drove end-to-end — 0% tool-call format error, no API key.
uses fieldkit.harness, fieldkit.capabilities, fieldkit.quant, fieldkit.publish, fieldkit.rag
Hardening the Hermes Harness on a DGX Spark — The Box Contains It, You Don't Trust the Model
Before you leave a tool-wielding agent running on your desk, harden it. One pure function turns Hermes' permissive defaults into a desk-grade posture, then a scripted hostile-tool-call test proves it: egress denied at the sandbox, secrets in .env only, the config surviving a restart.
uses fieldkit.harness
The Hermes Serving Lane on a DGX Spark — MoE vs Dense, and the Number That Actually Picks the Lane
Five Hermes serving lanes on one DGX Spark: Qwen3-30B-A3B MoE vs Qwen3-32B dense across vLLM, llama.cpp, and NIM. The MoE runs ~8.5× faster for the same memory — but the lane is picked by tool-call reliability, which took two config fights to get to 0% everywhere.
uses fieldkit.capabilities, fieldkit.harness, fieldkit.nim
The Hermes Harness on a DGX Spark — A Local Cockpit That Holds Tools, With No API Key
Installing the Hermes agent harness on a DGX Spark and running the first local agent turn against the cached Nemotron-Nano-9B-v2 NIM — reliable tool calls, no API key, no cloud hop. The defensible angle is NIM-first; everyone else's Spark Hermes write-up leads with Ollama.
uses fieldkit.nim, fieldkit.capabilities, fieldkit.harness
Two Trainers, One LoRA: NeMo Framework Beats Unsloth by 26% on a Patent-Strategist Fine-Tune
Same recipe, same R1-distilled base, same 5000-row patent corpus — once via Unsloth, once via NeMo Framework + Megatron-Bridge. NeMo finishes 26% faster and produces 44% longer patent-strategic chains. The cost is one YARN-defaults landmine and a stdout that lied for four hours.
Unsloth on the Spark — When the Train-Time Peak Equals the Base-Load Peak
Six gates clear in one container against the v1 reset: pip install --no-deps preserves the s40 stack, FastLanguageModel loads at 16.94 GB peak, a 100-step LoRA train holds the same envelope, and save_pretrained_gguf() emits both quants in 207 seconds end-to-end.
The Trainer Was Fine, the Corpus Wasn't: Three Misdiagnoses on a Patent-Specialist Fine-Tune
Five thousand rows of synthetic patent reasoning, two clean 131-minute LoRA trains, three rounds of confident diagnosis — and none of them found the bug. The bug was the corpus all along. A field report on the cheapest mistake to make on the Spark.
Looking Beyond Spark — KV-Cache Arithmetic at Inference
The serving memory bill is not weights. It's KV cache, and KV scales with concurrent users × context length, not parameters. Same four bills as training; different weights. A 70B at 32 users × 16k context wants 168 GB just for KV — and the Spark teaches you the per-token math.
uses fieldkit.capabilities
Distilling the Architect — A 3B LoRA Trained on the Agent's Own Trajectory
A4's 50-iter trajectory becomes training data for a Qwen2.5-3B LoRA proposer. Holding out 8 iters, the 3B mode-collapses onto d_model=768 (the trajectory's most-frequent keep) and matches 0 / 8 exact; the 8B at T=0.5 matches 4 / 8 of its own past picks.
Derisking the Cloud Pretrain — How a $5K Spark Saves $50K on H100 Rentals
The Spark is too small for a serious pretrain — but it's the right size for the recipe-search that precedes one. Cull 100 candidate architectures down to 3 on one Spark for ~$1 of electricity, then book the cloud node knowing what to train. The expected savings per campaign run into the thousands.
What the Agent Actually Built — Five Articles in Plain English, and Why You Probably Don't Want to Train From Scratch
Five technical articles in one day built an unattended AI research loop on a desk for $0.02 of electricity. The plain-English readout: what the agent built (not a usable model), what it changes for one person, and a four-tier roadmap from LoRA in minutes to from-scratch in weeks.
The Autoresearch Loop — 50 Iterations of an LLM Editing Its Own Trainer Overnight
NIM Llama 3.1 8B drives a structured-perturbation agent loop against a 354M GPT pretrain. 50 iterations, 73.4 min wall, 0.07 kWh of electricity. 8 keeps, 42 reverts, 0 rail blocks, 0 crashes. Best result: val_bpb 10.8534, +0.93% over baseline at d_model=768.
Guardrails Before the Agent Edits — Code-Edit Policy as a Programmatic Funnel
Five programmatic rails between the Autoresearch agent's proposal and any mutation of train.py — schema, menu, range, cross-constraint, diff lint. 27 adversarial test cases: block recall 1.0, clean pass 1.0, every rail attribution correct. Zero LLM-as-judge calls.
The Data-Path Envelope — When Real Tokens Beat Random Tokens at Pretrain Throughput
Curator-cleaned wikitext-103 (109M tokens, 417 MiB packed) feeding the same 354M GPT pretrain loop from A2. Eight configs swept; data-path overhead is 0.01–0.04% across all of them. New peak: 14,980 tok/s — slightly above A2's random-token ceiling.
The GB10 Pretrain Envelope — Sweeping Batch, Sequence, and Precision on One Spark
Same 354M GPT, same training loop, swept across micro-batch (2,4,8,16), sequence length (1024,2048), and precision (bf16,fp8). 16 configurations, 30 steps each. Peak: 14,266 tokens/sec at batch=16, seq=1024, fp8 — 18% above the hand-rolled PyTorch baseline.
NeMo Framework on the Spark — What It Earns Over a Hand-Rolled train.py
Same 354M GPT, same 100 steps, same random tokens — once in a hand-rolled train.py against vanilla PyTorch, once via Megatron-Core inside the NeMo Framework container. Same hardware (GB10, 128 GB unified). The framework earns +5.8% throughput and 30% less GPU memory.
Second Brain as a Tool — Wrapping the RAG Stack in MCP for Claude Code
Closing the Second Brain arc. Four MCP tools wrap the RAG chain — embed, retrieve, optionally rerank, generate — and any Claude Code session anywhere on the box becomes a grounded research client. 200 lines of Python, one launcher, one .mcp.json entry.
Looking Beyond Spark — Fine-Tuning a 100B Nemotron
A working answer to: how many GPUs to fine-tune a 100B Nemotron? Three methods, three memory footprints — full FT ≈ 1.6 TB needs 24× H100; LoRA ≈ 250 GB fits 8× H100; QLoRA ≈ 65 GB fits 1× H200. The Spark's 3B LoRA teaches the math.
uses fieldkit.capabilities
Ragas, Reranked — What 44 Held-Out Questions Say About the Second Brain Stack
A Ragas-style harness written in 200 lines of stdlib Python, run locally on the DGX Spark, against four variants of the Second Brain RAG chain. Naive RAG scores 3.30 / 5. Rerank RAG scores 4.27. LoRA+RAG is a surprise — it does not beat naive. Retrieval is where the points come from.
uses fieldkit.eval
LoRA on Your Own Q&A — What 231 Pairs Actually Teach a 3B Model
231 own-voice Q&A pairs, a rank-16 LoRA, 69 s of training on a GB10 Spark. The adapter won't memorize your exact numbers, but it will take a model that refuses 61% of questions about your work and turn it into one that answers all of them in your voice. For facts you still need RAG.
uses fieldkit.eval
TensorRT-LLM on the Spark — FP8 Isn't the Reason to Drop NIM. NVFP4 Is.
Dropping below NIM to raw TensorRT-LLM on a GB10 Spark. FP8 beats NIM's vLLM by 10-15% — barely worth the rebuild. NVFP4 beats it by 76% on decode, 43% on TTFT, and ships a 34%-smaller engine. The reason to drop NIM is the Blackwell-native 4-bit kernel, not FP8.
One Substrate, Three Apps — Where the Foundation Forks
Seven articles installed one stack on the Spark — NIM, Embed, pgvector, RAG glue, reranker, generator A/B, Guardrails. This bridge retells that install as three different answers to one question — corpus plus 128 GB — and walks readers to the top of three tracks.
One Rail, Three Policies — NeMo Guardrails on the Retrieval Path
NeMo Guardrails drops a policy gate between retrieval and generation. One install, three per-arc configs — PII for Second Brain, style for LLM Wiki, code-safety for Autoresearch — and a 15-query benchmark: 100% block recall, 100% clean pass. Rails are scaffolding; detectors are the content.
uses fieldkit.rag
Bigger Generator, Same Grounding — 8B vs 49B vs 70B on One Retrieval Chain
The rerank-and-fusion article bet that a bigger generator would heal the 8B Google-IPO refusal. Ran the A/B across three sizes on one retrieval chain. Bet lost: Nemotron-Super-49B refuses more than the 8B baseline; Llama 3.3 70B narrows the gap but does not close it. The refusal was the scaffold working.
uses fieldkit.rag
Hybrid Retrieval on the Spark — BM25, Dense, Fusion, Rerank
Four retrieval modes on one corpus — naive dense, BM25, Reciprocal Rank Fusion, Nemotron rerank. Dense is already 92% recall@5; rerank adds a point at K=10 and reorders the top. The 8B generator still refuses where retrieval is perfect — grounding, not retrieval, is the new bottleneck.
uses fieldkit.rag
Three Endpoints, One Answer — Naive RAG on a DGX Spark
Three endpoints in one curl chain — a query embeds through Nemotron, pgvector returns top-5 chunks in under 80 ms, and a Llama 3.1 8B NIM stuffs them into a strict-context prompt. The chain works; the 8B generator still refuses on questions its own context answers.
uses fieldkit.rag, fieldkit.eval
Where Your Vectors Live — pgvector on a DGX Spark
The substrate between the embed call and the retrieve call — pgvector 0.8.2 running as a Postgres 16 container on GB10, with 1000 Nemotron vectors, HNSW and ivfflat both indexed, and a planner that prefers seq scan until you tell it otherwise.
uses fieldkit.rag
Your Own Semantic Space — a Nemotron Embedding NIM on a DGX Spark
The embedding endpoint that every downstream RAG, wiki, and agent piece will reuse — a 2048-dim Nemotron Retriever NIM running locally on GB10, ready 52 seconds after docker run and holding 28 docs/s under batched load.
uses fieldkit.rag
Your First NIM on a DGX Spark — What 24.8 Tokens Per Second Doesn't Tell You
First-contact notes on NVIDIA's DGX-Spark-specific Llama 3.1 8B NIM. 9.4 GB image, ~108 s warm-cache cold-start, 24.8 tok/s steady, OpenAI-compatible on :8000 — and a confidently wrong Python one-liner that clarifies what small-model FP8 buys and what it costs.
uses fieldkit.nim
Field-Fixing the Hermes Harness on a DGX Spark — When the NIM Won't Stream Tool Calls, and Other Rough Edges
Fifth in the Harnesses series: the field fixes that take a fresh Hermes agent on a local NIM from 'mostly works' to 'just works.' Leads with the one that bit hardest — the Spark NIM ships a non-streaming tool parser, fixed by bind-mounting NVIDIA's own streaming parser.
uses fieldkit.harness
Gates Before the Advisor — Recall Floors, Raw-Base Preflights, and the Bench That Ate Its Own Spec
Before the Advisor trained: a 182-source corpus pack with recall gates on two retrieval lanes (BM25 and live pgvector + NIM embedder), raw-base preflights that failed two NVIDIA bases honestly, and the rebuild that caught the bench's own spec contaminating its retrieval context.
Governed Routing With Receipts — When the Local Lane Consults the Frontier, and What It Costs
The Advisor's router is deterministic and observables-only: it escalates on detectable failure signals — a citation outside the retrieved set, a rank-sanity anomaly — never on vibes. Route bakeoffs at $0 and $0.0033, a no-egress gate for private state, and a receipt a script re-verifies.
LoRA on Nemotron Nano — Fine-tuning a 9B Without Blowing Unified Memory
A planned walk through LoRA fine-tuning on Nemotron Nano 9B with NeMo Customizer: rank and alpha sweeps, a tiny domain corpus, and the memory accounting that keeps a PEFT run from tripping the Spark's 128 GB unified-memory wall.
Continued Pre-training on a DGX Spark — NeMo Framework Without a Cluster
When does it make sense to continue pre-training on a single GB10 box, and when is it a category error? A planned run that pushes NeMo Framework, Megatron-LM parallelism, and BF16 mixed precision against the 128 GB unified-memory wall with a small domain corpus.
Tracing a NIM Request with Nsight Systems — What the 24.8 tok/s Number Hides
A planned kernel-level trace of a single NIM inference request on GB10. Where does the wall-clock time actually go — tokenization, KV-cache attention, the sampling loop, memcpy? The article turns 24.8 tokens per second into a timeline you can point at and say 'that line is the bottleneck'.
Watching the GPU — DCGM, Prometheus, and a Local Grafana for the Spark
A planned setup of DCGM Exporter → Prometheus → Grafana entirely on the Spark itself. The goal is a single dashboard that tells the truth about GPU memory, SM occupancy, and per-container utilization for a rig that's running NIMs, pgvector, and an occasional training job at the same time.
Synthetic Corpus Frameworks on the Spark — From a Bespoke Pipeline to an Orchestration Layer
A bespoke synth pipeline got 200 rows into a 5000-row reasoning corpus before a fourth meta-state surface form forced a retreat. The diagnosis: a regex-floor approach cannot catch novel surface forms by construction. The fix is the open-source orchestration layer.