Tag

#rag

Articles tagged "rag" — 12 entries.

Article №54 observability Foundation 03 Jun 2026 ~4 hours end-to-end — bring up the cockpit, drive a reindex + two RAG-evals through the control plane, score 44 questions, and ship the artifact

Second Brain

The Machine Manages Its Own Memory — and the Bug the Mocks Slept Through

Driving the Arena recall layer end-to-end on its own corpus: reindex → score → gate, dispatched through the control plane, recall@5 measured against 44 held-out questions. The first real drain caught a bug eight mock-injected unit tests had slept through — the case for operating the thing you built.

uses fieldkit.memory, fieldkit.arena, fieldkit.harness, fieldkit.eval

Article №41 fine-tuning Foundation 17 May 2026 ~10 hours (mostly automated overnight sweeps)

Three-Mode Bracket: Baselining a Reasoning Model Before Fine-Tuning, On One Spark

Before you fine-tune a small reasoning model on a domain bench you need to know where it stands. Three context modes — closed, retrieval, oracle — triangulate the model's ceiling on one Spark, no Judge backend or cluster required.

Article №17 agentic NIM 24 Apr 2026 ~90 minutes — 30 min to design the tool surface, 30 min to wire FastMCP + pgvector, 15 min to register with Claude Code, 15 min for the demo and trace

Second Brain

Second Brain as a Tool — Wrapping the RAG Stack in MCP for Claude Code

Closing the Second Brain arc. Four MCP tools wrap the RAG chain — embed, retrieve, optionally rerank, generate — and any Claude Code session anywhere on the box becomes a grounded research client. 200 lines of Python, one launcher, one .mcp.json entry.

Article №15 observability NeMo Evaluator 23 Apr 2026 ~60 minutes end-to-end — 40 s to ingest the blog into pgvector, 2 min for retrieval, 4 min for generation across three 8B variants, 90 s for the LoRA variant, 9 min for grading

Second Brain

Ragas, Reranked — What 44 Held-Out Questions Say About the Second Brain Stack

A Ragas-style harness written in 200 lines of stdlib Python, run locally on the DGX Spark, against four variants of the Second Brain RAG chain. Naive RAG scores 3.30 / 5. Rerank RAG scores 4.27. LoRA+RAG is a surprise — it does not beat naive. Retrieval is where the points come from.

uses fieldkit.eval

Article №11 inference NeMo Guardrails 22 Apr 2026 ~90 minutes on top of the rerank-fusion / bigger-generator chain

Foundations

One Rail, Three Policies — NeMo Guardrails on the Retrieval Path

NeMo Guardrails drops a policy gate between retrieval and generation. One install, three per-arc configs — PII for Second Brain, style for LLM Wiki, code-safety for Autoresearch — and a 15-query benchmark: 100% block recall, 100% clean pass. Rails are scaffolding; detectors are the content.

uses fieldkit.rag

Article №10 inference Llama 3.3 70B + Nemotron-Super-49B + Llama 3.1 8B NIM 22 Apr 2026 ~30 minutes on top of the rerank-and-fusion chain

Foundations

Bigger Generator, Same Grounding — 8B vs 49B vs 70B on One Retrieval Chain

The rerank-and-fusion article bet that a bigger generator would heal the 8B Google-IPO refusal. Ran the A/B across three sizes on one retrieval chain. Bet lost: Nemotron-Super-49B refuses more often than the 8B baseline; Llama 3.3 70B narrows the gap but does not close it. The refusal was the scaffold working.

uses fieldkit.rag

Article №09 inference Nemotron Reranker + pgvector full-text + Llama 3.1 8B NIM 22 Apr 2026 ~45 minutes on top of the naive-RAG chain

Foundations

Hybrid Retrieval on the Spark — BM25, Dense, Fusion, Rerank

Four retrieval modes on one corpus — naive dense, BM25, Reciprocal Rank Fusion, Nemotron rerank. Dense is already 92% recall@5; rerank adds a point at K=10 and reorders the top. The 8B generator still refuses where retrieval is perfect — grounding, not retrieval, is the new bottleneck.

uses fieldkit.rag

Article №08 inference Llama 3.1 8B NIM + Nemotron Retriever + pgvector 22 Apr 2026 ~30 minutes if the three endpoints are already warm

Foundations

Three Endpoints, One Answer — Naive RAG on a DGX Spark

Three endpoints in one curl chain — a query embeds through Nemotron, pgvector returns top-5 chunks in under 80 ms, and a Llama 3.1 8B NIM stuffs them into a strict-context prompt. The chain works; the 8B generator still refuses on questions its own context answers.

uses fieldkit.rag, fieldkit.eval

Upcoming agentic NIM ~30 min read

Second Brain

Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction — Spark reproduction notes

Reproducing DCI-Agent-Lite on a DGX Spark — NIM-served 8B agent + ripgrep + filesystem corpus, no embedder or vector DB; extracts the operator vocabulary as `fieldkit.rag.operators` and quantifies how much of the existing pgvector + reranker stack DCI lets you delete.

Upcoming inference NIM planned ~14 min read

Machine that Builds Machines

Gates Before the Advisor — Recall Floors, Raw-Base Preflights, and the Bench That Ate Its Own Spec

Before the Advisor trained: a 182-source corpus pack with recall gates on two retrieval lanes (BM25 and live pgvector + NIM embedder), raw-base preflights that failed two NVIDIA bases honestly, and the rebuild that caught the bench's own spec contaminating its retrieval context.

Upcoming inference NIM ~30 min read

LLM Wiki

RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation — Spark reproduction notes

Reproducing the RaguTeam SemEval-2026 T8 winning system on a DGX Spark — judge-orchestrated 7-LLM ensemble (Qwen3-4B-FP8 + Meno-Lite-0.1 7B local + remote members) with Qwen3-32B judge, then extracting the pattern into `fieldkit.ensemble` + `fieldkit.judge`.

Upcoming fine-tuning Foundation planned ~45 min read

Machine that Builds Machines

Synthetic Corpus Frameworks on the Spark — From a Bespoke Pipeline to an Orchestration Layer

A bespoke synth pipeline got 200 rows into a 5000-row reasoning corpus before a fourth meta-state surface form forced a retreat. The diagnosis: a regex-floor approach cannot catch novel surface forms by construction. The fix is the open-source orchestration layer.