Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction — Spark reproduction notes
Reproducing DCI-Agent-Lite on a DGX Spark — NIM-served 8B agent + ripgrep + filesystem corpus, no embedder or vector DB; extracts the operator vocabulary as `fieldkit.rag.operators` and quantifies how much of the existing pgvector + reranker stack DCI lets you delete.
Series: Second Brain

The paper, in one breath (ARTICLE OPENING — required at publish)

tech-writer: this becomes a `## The paper, in one breath` section in the published article, placed immediately after the lede and before any “Why this matters for a personal AI builder” substrate framing. Pull thesis material from the eval’s `## Hypothesis`; fill in the achieved beat after the experiment runs.
Thesis. <paraphrase the eval’s Hypothesis section in 2–3 sentences, plain language, one concrete mechanism — distinguish from the obvious baseline the technique replaces>
Why this technique matters for a personal AI builder. <2 sentences on what this unlocks for the reader on a single Spark — distinct from the substrate framing in the next section>
Promise vs achieved. Paper:
Source paper
- arXiv: 2605.05242 — Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction
- Repo: (none — see eval Blockers section)
- Popularity: 15 · 6 HF upvotes · 0 citations
Frontier Scout verdict
spark-feasible — the published config trivially runs against a remote API and the Spark-local config (NIM-served 8B–70B agent + ripgrep + filesystem corpus) is strictly cheaper in resident memory than the existing pgvector + reranker RAG stack already documented in the blog; the only adaptation is swapping the agent’s LLM endpoint.
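The “swap the agent’s LLM endpoint” adaptation can be sketched as a tiny kwargs helper for an OpenAI-format client. This is a hypothetical sketch: the localhost URL, port, and placeholder key are my assumptions for illustration, not values from the repo; the only grounded fact is that NIM containers expose an OpenAI-compatible API.

```python
def agent_endpoint(spark_local: bool) -> dict:
    """Build OpenAI-client kwargs for the published vs Spark-local path.

    Hypothetical helper: the NIM URL/port below are assumptions. A NIM
    container exposes an OpenAI-compatible /v1 route, and Pi already
    speaks OpenAI format, so only these kwargs change between paths.
    """
    if spark_local:
        # Local NIM: the key is unused by the server but must be non-empty.
        return {"base_url": "http://localhost:8000/v1", "api_key": "not-used"}
    # Published path: the client reads OPENAI_API_KEY from the environment.
    return {}
```

With the real client this would be `OpenAI(**agent_endpoint(True))`; nothing else in the harness changes.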
Hypothesis (from eval)
Modern retrieval — lexical or semantic — exposes a corpus through a fixed top-k similarity interface that compresses access into a single retrieval step before reasoning. For agentic tasks (multi-hop QA, deep research, evidence chasing) this is a bottleneck: exact lexical constraints, sparse-clue conjunctions, local-context checks, and plan revision after partial evidence are all hard to express through `retrieve(query, k=10)`. Direct Corpus Interaction (DCI) removes the retriever entirely — the agent searches the raw corpus with general-purpose terminal tools (`rg`, `find`, `sed`, file reads, lightweight scripts), composes its own search primitives, and revises plans mid-search. No embedding model, no vector index, no retrieval API. With GPT-5.4-nano as the agent, DCI-Agent-Lite hits 62.9 % on BrowseComp-Plus, beating top baselines powered by GPT-5.2, Claude-Sonnet-4.6, Qwen3.5-122B, and GLM-4.7.
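The contrast with `retrieve(query, k=10)` can be made concrete with a couple of command builders of the kind a DCI agent composes. The helper names are hypothetical; the `rg` and `xargs` flags are standard CLI behavior.

```python
import shlex

def rg_search(pattern: str, root: str, ctx: int = 2) -> str:
    """Exact lexical search with surrounding context lines, a constraint
    a fixed top-k similarity interface cannot guarantee."""
    return f"rg --line-number -A{ctx} -B{ctx} {shlex.quote(pattern)} {root}"

def clue_conjunction(clues: list[str], root: str) -> str:
    """Sparse-clue conjunction: keep only files matching ALL clues by
    piping rg -l (list matching files) through xargs, clue by clue."""
    first, *rest = clues
    cmd = f"rg -l {shlex.quote(first)} {root}"
    for clue in rest:
        cmd += f" | xargs rg -l {shlex.quote(clue)}"
    return cmd
```

A single embedding query returns the 10 nearest chunks to a blended query; the second helper instead narrows the candidate set one hard constraint at a time, which is the plan-revision loop the paper describes.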
Proposed Spark recipe
The repo is at github.com/DCI-Agent/DCI-Agent-Lite and is uv-managed with a one-command `bash setup.sh`. It builds on Pi (badlogic/pi-mono coding-agent) with bash tools.

- Clone and set up: `git clone --depth 1 https://github.com/DCI-Agent/DCI-Agent-Lite && cd DCI-Agent-Lite && bash setup.sh`
- Configure `.env` with at least one of `OPENAI_API_KEY` / `ANTHROPIC_API_KEY` for the published path. Spark-local path: point the harness at a NIM endpoint via the OpenAI-compatible API the NIM container exposes — Pi already speaks OpenAI format, so this is a base-URL swap, no code change.
- Download the corpus + bench: `uv run python scripts/download_corpus.py` and `uv run python scripts/download_dci_bench.py`. Both come from HF: `DCI-Agent/corpus` and `DCI-Agent/dci-bench`.
- Install ripgrep (`apt install ripgrep`) — the capability map’s `stack` block already presumes a Linux userspace, so this is a one-line dependency.
- Run a benchmark: the repo ships scripts for BRIGHT, BEIR, BrowseComp-Plus, and multi-hop QA — 13 benchmarks in total. The full suite takes hours, not days, on a single Spark with a local NIM.
- The extractable abstraction is the operator vocabulary — `rg` (regex with `-A`/`-B` context), `find` (filename / mtime predicates), `sed` (slice ranges), `cat` (whole-file read), shell pipes for composition. The agent learns to compose these instead of calling `retriever.search(q)`. This is what becomes `fieldkit.rag.operators`.
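One possible shape for that `fieldkit.rag.operators` module, sketched below. The module name, dataclass, and template structure are my assumptions; only the underlying CLI flags (`rg -A/-B`, `find -name -mtime`, `sed -n '…p'`, `cat`) come from the recipe above.

```python
# Hypothetical sketch of fieldkit.rag.operators: the operator vocabulary
# as named shell templates the agent fills in, instead of a retrieve() call.
from dataclasses import dataclass

@dataclass(frozen=True)
class Operator:
    name: str
    template: str   # shell template with {placeholders}
    purpose: str

OPERATORS = {
    "grep":  Operator("grep",  "rg -A{ctx} -B{ctx} {pattern} {root}",
                      "regex search with surrounding context"),
    "find":  Operator("find",  "find {root} -name {glob} -mtime -{days}",
                      "filename / modification-time predicates"),
    "slice": Operator("slice", "sed -n '{start},{end}p' {path}",
                      "read a line range from one file"),
    "read":  Operator("read",  "cat {path}",
                      "whole-file read"),
}

def render(op: str, **kwargs) -> str:
    """Fill an operator template into a runnable shell command."""
    return OPERATORS[op].template.format(**kwargs)
```

Composition then happens at the shell level (pipes between rendered commands), which is exactly what the experiment should quantify against the pgvector + reranker stack.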
Open questions for the experiment
- (none — recipe should run as-is)
- Long-horizon agentic loops accumulate context, but Pi ships explicit context management for this; the README mentions the `codex/context-management-ablation` branch is the supported path.
- The published headline uses GPT-5.4-nano, which is a remote API — quoting Spark-local numbers with an 8B-class NIM is a follow-up study, not a blocker.
Suggested article shape
- Would write? yes
- Suggested slug: dci-corpus-operators-on-spark
- Suggested stage: agentic
- Suggested series: Second Brain
- Suggested tags: rag, retrieval, agentic, search, ripgrep, terminal-tools, no-vector
- Suggested summary: Reproducing DCI-Agent-Lite on a DGX Spark — NIM-served 8B agent + ripgrep + filesystem corpus, no embedder or vector DB; extracts the operator vocabulary as `fieldkit.rag.operators` and quantifies how much of the existing pgvector + reranker stack DCI lets you delete.
- Suggested `fieldkit_modules`: [nim, rag]
(No alignment lens — series is Second Brain, not MTBM.)