fieldkit
Verified-on-Spark Python patterns, lifted from the AI Native Field Notes into one importable package. Every module is the tested distillate of the articles it appears under.
Install the package
$ pip install fieldkit

Ship AI features faster, cheaper, with less glue.
Every AI build pays the glue tax — days lost to retry logic, context-window math, pgvector schemas, and eval rubrics; token bills inflated by overflow 400s and missing backoff; brittle copy-paste lifted from a half-dozen articles per project. The patterns are right; assembling them by hand is slow and expensive.
Each pattern was first verified inside an article — KV-cache arithmetic, the OpenAI-compatible NIM client with its 8192-token preflight, the strict-context RAG pipeline from naive-rag-on-spark, the eval harness behind every evidence file. fieldkit is where those patterns live after they're tested.
NIM endpoints return 400 on quiet overflow — fieldkit catches it on the client
Exponential backoff (0.5 s → 8 s) and cold-start polling are baked in, not bolted on
pgvector tables, indexes, and dimensions stay in sync — one ensure_schema() call
Bench, Judge, refusal detection, and trajectory analysis ship as one harness
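The retry behavior named above is easy to reason about; here is a minimal sketch of the idea, assuming the client retries on connection errors with the stated 0.5 s → 8 s schedule (the function name and signature are illustrative, not fieldkit's API):

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_backoff(call: Callable[[], T], retries: int = 6, base: float = 0.5,
                 cap: float = 8.0, sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `call` with exponential backoff: 0.5 s, 1 s, 2 s, 4 s, 8 s (capped)."""
    for attempt in range(retries):
        try:
            return call()
        except ConnectionError:
            if attempt == retries - 1:
                raise  # budget exhausted: surface the original error
            sleep(min(cap, base * 2 ** attempt))
    raise AssertionError("unreachable")
```

The injectable `sleep` keeps the policy testable; the same loop covers cold-start polling if the first requests fail while the NIM warms up.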
fieldkit is the tested distillate.
fieldkit in five imports.
Each module is the public surface of a working article. Read the API reference, drop the import in, ship.
Memory and feasibility math
Typed Python facade over the project's Spark capabilities map. Canonical KV-cache and weight arithmetic.
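The KV-cache arithmetic reduces to one product: two tensors (K and V) × layers × context × batch × KV hidden size × bytes per element. A sketch of that math, consistent with the quick-start example (the installed function's signature may differ):

```python
DTYPE_BYTES = {"fp8": 1, "fp16": 2, "fp32": 4}

def kv_cache_bytes(hidden: int, n_layers: int, ctx: int,
                   batch: int, dtype: str = "fp16") -> int:
    """KV-cache bytes for a dense transformer.

    `hidden` is the KV projection width; with GQA that is
    n_kv_heads * head_dim, not the full model hidden size.
    """
    return 2 * n_layers * ctx * batch * hidden * DTYPE_BYTES[dtype]

# 70B Llama 3.1 (80 layers, GQA: 8 KV heads x 128 head_dim), 32 x 16K ctx, FP16:
kv_cache_bytes(hidden=8 * 128, n_layers=80, ctx=16384, batch=32, dtype="fp16")
# → 171_798_691_840 (≈ 171.8 GB)
```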
OpenAI-compatible inference client
OpenAI-compatible NIM client with retries, context-overflow preflight, and a chunker that respects the 8192-token ceiling.
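The preflight and the chunker share one idea: count tokens before the server does. A sketch under two assumptions — the ~4-characters-per-token heuristic and the exception name are illustrative, not fieldkit's actual implementation (the real client would use the model tokenizer):

```python
from typing import Callable, Iterator

def count_tokens(s: str) -> int:
    """Crude ~4-chars-per-token estimate; stand-in for a real tokenizer."""
    return max(1, len(s) // 4)

class ContextOverflow(ValueError):
    """Raised client-side so overflow never reaches the server as an opaque 400."""

def preflight(prompt: str, ceiling: int = 8192,
              count: Callable[[str], int] = count_tokens) -> None:
    n = count(prompt)
    if n > ceiling:
        raise ContextOverflow(f"prompt is ~{n} tokens; ceiling is {ceiling}")

def chunk_text(text: str, ceiling: int = 8192,
               count: Callable[[str], int] = count_tokens) -> Iterator[str]:
    """Greedy word-boundary chunker: every yielded chunk fits under the ceiling."""
    chunk: list[str] = []
    for word in text.split():
        if chunk and count(" ".join(chunk + [word])) > ceiling:
            yield " ".join(chunk)
            chunk = [word]
        else:
            chunk.append(word)
    if chunk:
        yield " ".join(chunk)
```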
Ingest → retrieve → rerank → fuse
Composable ingest → retrieve → rerank → fuse RAG pipeline backed by pgvector + a NIM embedder + the strict-context grounded prompt from `naive-rag-on-spark`.
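What the schema bootstrap has to guarantee can be sketched as DDL. The table name, columns, index type, and the 1024-d dimension below are assumptions for illustration, not fieldkit's actual schema — the point is that dimension and index stay defined in one place:

```python
EMBED_DIM = 1024  # assumption: must match the NIM embedder's output dimension

ENSURE_SCHEMA_DDL = f"""
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS documents (
    id        BIGINT PRIMARY KEY,
    label     TEXT,
    text      TEXT NOT NULL,
    embedding VECTOR({EMBED_DIM})
);
CREATE INDEX IF NOT EXISTS documents_embedding_idx
    ON documents USING hnsw (embedding vector_cosine_ops);
"""
```

Because every statement is `IF NOT EXISTS`, the bootstrap is idempotent: safe to run on every startup.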
Bench, judge, refusal, trajectory
Bench, Judge, Trajectory, the project's refusal detector — plus the v0.2 verifier-loop additions (AssertionGrader, PassAtK, AgentRun, MatchedBaseComparison) for agent + RL benchmarks.
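PassAtK presumably wraps the standard unbiased estimator: given n samples of which c passed, pass@k = 1 − C(n−c, k)/C(n, k). A sketch under that assumption:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n attempts (c of them correct) passes."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample with misses
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The naive per-sample rate 1 − (1 − c/n)^k overestimates when n is small; the combinatorial form is exact for sampling without replacement.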
fieldkit.training
Fine-tuning primitives for any RL or SFT loop on the Spark — a CPU-resident LoRA reference snapshot that sidesteps peft 0.19's offloader bug, and a pre/post weight-delta tracker for sanity-checking that gradients actually moved.
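The delta tracker's job is a snapshot-then-compare. A pure-Python sketch of the idea (the real tracker presumably operates on torch state dicts; the names here are illustrative):

```python
from typing import Dict, List

Weights = Dict[str, List[float]]

def snapshot(weights: Weights) -> Weights:
    """Deep-copy a state-dict-like mapping before training starts."""
    return {name: list(vals) for name, vals in weights.items()}

def max_abs_delta(before: Weights, after: Weights) -> Dict[str, float]:
    """Per-tensor max |after - before|; all-zero means no gradient moved."""
    return {name: max(abs(a - b) for a, b in zip(after[name], before[name]))
            for name in before}
```

A sanity check after one optimizer step: every trainable LoRA tensor should show a nonzero delta, and every frozen base tensor a zero one.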
Smoke checks without writing Python
A thin Typer wrapper over the modules. Quick checks and smoke benchmarks without writing Python.
from fieldkit.capabilities import kv_cache_bytes, weight_bytes
from fieldkit.nim import NIMClient
from fieldkit.rag import Document, Pipeline
from fieldkit.eval import Bench, Judge, is_refusal
# 70B Llama 3.1 KV cache (GQA: 8 KV heads × 128 head_dim) at 32-user × 16K ctx, FP16:
kv_cache_bytes(hidden=8 * 128, n_layers=80, ctx=16384, batch=32, dtype="fp16")
# → 171_798_691_840 (≈ 171.8 GB)
# Naive RAG end-to-end:
with NIMClient(base_url="http://localhost:8000/v1",
               model="meta/llama-3.1-8b-instruct") as gen, \
     Pipeline(embed_url="http://localhost:8001/v1",
              pgvector_dsn="postgresql://spark:spark@localhost:5432/vectors",
              generator=gen) as pipe:
    pipe.ensure_schema()
    pipe.ingest([Document(id=1, text="...", label="spark")])
    print(pipe.ask("How much memory does the Spark have?")["answer"])

Four imports. One pipeline.
These four imports replace ~250 lines of glue from across the field notes — embed setup, retry policy, preflight checks, schema bootstrap, and strict-context prompting. Drop them into a fresh Python file and you have a working RAG.
- Retries baked in
NIMClient handles cold-starts, exponential backoff (0.5 s → 8 s), and connect timeouts so your pipeline doesn't fail under co-resident memory pressure.
- Preflight context check
8192-token preflight runs before every request — context overflow surfaces as a Python exception, not a NIM 400.
- Schema you can trust
Pipeline.ensure_schema() creates pgvector tables, indexes, and the right embedding dimension. Run it once and forget it.
- Strict-context prompting
The RAG prompt is verbatim from naive-rag-on-spark. Refusals are detected; trajectories are inspectable.
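Refusal detection is typically a phrase-level heuristic over the model's answer. The marker list below is illustrative, not fieldkit's actual `is_refusal`; a strict-context prompt makes this reliable by forcing an explicit refusal phrase when the context lacks the answer:

```python
REFUSAL_MARKERS = (
    "i cannot", "i can't", "i'm unable", "i am unable",
    "not enough information", "does not contain",
)

def is_refusal(answer: str) -> bool:
    """True if the answer matches a known refusal phrasing (case-insensitive)."""
    text = answer.strip().lower()
    return any(marker in text for marker in REFUSAL_MARKERS)
```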
Every module ships from a working article.
These are the field notes that exercise one or more fieldkit modules end-to-end on the Spark. The article runs the math, ships the evidence, and the tested abstraction lives on as an importable class.
- Reading the Lineage Primitive — cxcscmu Auto-Research, Studied from release_artifacts fieldkit.capabilities fieldkit.training
- T²PO on Spark — When the Training Pool Says 28/32 and Held-out Says 9/158 fieldkit.capabilities fieldkit.eval fieldkit.training
- ClawGym on Spark — A 7B Base, A LoRA Adapter, and the +15 pp the Adapter Earned fieldkit.nim
- Pass@k After the Seventh Patch — Three Shapes ESamp Takes on Spark fieldkit.eval fieldkit.capabilities
- Two Patches Were Six — ESamp Lands at 97.4% on a Patched Spark fieldkit.eval fieldkit.capabilities
- Test-Time Distilling on Spark — Same Compute Envelope, Wider Semantic Reach fieldkit.eval fieldkit.capabilities
- AutoResearchBench on Spark — Two NIMs, One Bench, Two Failure Modes fieldkit.nim fieldkit.eval fieldkit.capabilities
- Looking Beyond Spark — KV-Cache Arithmetic at Inference fieldkit.capabilities
Quick wins from the shell.
Sanity-check inference math, smoke-test a pipeline, or run a benchmark
without writing a line of Python. fieldkit is on $PATH after install.
$ fieldkit version
0.1.0.dev0
$ fieldkit envelope "70B params fp8"
~70 GB weights; leaves ~50 GB for KV + activations + system; tight but possible
$ fieldkit feasibility llama-3.1-70b --ctx 4096 --batch 32 --dtype fp8
weights (fp8):  70.0 GB
KV cache (fp8): 21.5 GB (ctx=4096, batch=32)
weights + KV:   91.5 GB
$ fieldkit bench rag --table fieldkit_cli_bench_rag --out /tmp/bench.json
fieldkit · v0.2.0.post1
Build with verified patterns.
Read the API reference, browse the field notes, or grab the source. Apache-2.0; Python 3.11+.
Install the package
$ pip install fieldkit