fieldkit
Verified-on-Spark Python patterns, lifted from the AI Native Field Notes into one importable package. Every module is the tested distillate of the articles it appears under.
Install the package
$ pip install fieldkit
Ship AI features faster, cheaper, with less glue.
Every AI build pays the glue tax — days lost to retry logic, context-window math, pgvector schemas, and eval rubrics; token bills inflated by overflow 400s and missing backoff; brittle copy-paste lifted from a half-dozen articles per project. The patterns are right; assembling them by hand is slow and expensive.
Each pattern was first verified inside an article — KV-cache arithmetic, the OpenAI-compatible NIM client with its 8192-token preflight, the strict-context RAG pipeline from naive-rag-on-spark, the eval harness behind every evidence file. fieldkit is where those patterns live after they're tested.
- NIM endpoints return 400 on quiet overflow; fieldkit catches it on the client
- Exponential backoff (0.5 s → 8 s) and cold-start polling are baked in, not bolted on
- pgvector tables, indexes, and dimensions stay in sync with one ensure_schema() call
- Bench, Judge, refusal detection, and trajectory analysis ship as one harness
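The backoff schedule quoted above (0.5 s growing to 8 s) can be pictured with a minimal sketch. This is not fieldkit's implementation: the doubling factor, the retry count, the choice of `ConnectionError` as the retryable failure, and the names `backoff_delays` and `call_with_retries` are all assumptions made for illustration.

```python
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(first: float = 0.5, cap: float = 8.0,
                   factor: float = 2.0) -> Iterator[float]:
    """Yield delays that grow geometrically from `first` and level off at `cap`."""
    delay = first
    while True:
        yield min(delay, cap)
        delay *= factor


def call_with_retries(fn: Callable[[], T], attempts: int = 6,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fn`; on ConnectionError wait and retry, re-raising after the last attempt."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    delays = backoff_delays()
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            sleep(next(delays))
    raise AssertionError("unreachable")
```

With the defaults, six attempts sleep for 0.5, 1, 2, 4, and 8 seconds between failures. Passing `sleep` as a parameter keeps the loop testable without real waiting.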
fieldkit is the tested distillate.
fieldkit in 18 imports.
Each module is the public surface of a working article. Read the API reference, drop the import in, ship.
Memory and feasibility math
Typed Python facade over the project's Spark capabilities map — canonical KV-cache and weight arithmetic for sizing what fits in 128 GB.
OpenAI-compatible inference client
OpenAI-compatible NIM client with retries, context-overflow preflight, and a chunker that respects the 8192-token ceiling.
Ingest → retrieve → rerank → fuse
Composable ingest → retrieve → rerank → fuse RAG pipeline backed by pgvector and a NIM embedder, with a strict grounded-answer prompt that refuses to answer outside its context.
Bench, judge, assertion, pass@k
Bench, Judge, and Trajectory primitives plus a refusal detector, assertion grader, and pass@k — the verifier loop behind every agent and RL benchmark on this site.
SFT/RL recipes, drivers + probes
Fine-tuning primitives for LoRA SFT and RL on the Spark — declarative recipes, an HF→Megatron converter, symmetric NeMo/Unsloth train-and-export drivers, reasoning-preservation probes, and weight-delta sanity checks.
Append-only trial log + prompt rendering
Append-only trial log plus deterministic prompt rendering — every experiment lands as one typed TSV row, and a Markdown lineage block briefs the next session on entry.
GGUF quantize + four-axis measure
GGUF quantize-and-measure pipeline over llama.cpp — convert, quantize, perplexity, speed, and thermal probes that emit the QuantReport every published artifact card is built from.
HuggingFace card + manifest + push
The HuggingFace push surface — model-card renderer, artifact manifest, and a dry-run-by-default hub adapter, so every Orionfold card ships with the same Spark-tested measurement quad.
Smoke checks without writing Python
A thin Typer wrapper over the modules. Quick checks and smoke benchmarks without writing Python.
Branded charts + hero tables from a manifest
Branded chart and hero-table builders — turns an artifact manifest into marketing-grade matplotlib figures and great_tables displays, styled by the bundled Orionfold theme.
Dual-path Spark/Colab runtime + scaffold
Dual-path notebook runtime and scaffolding — detects Spark vs Colab/Kaggle, opens a published GGUF behind one .chat() surface, and lays the skeleton the notebook-author fills with prose.
Install, serve & harden an agent harness
Deterministic spine for running an agent harness on the Spark — install, configure, serve, harden, and profile Hermes across the NIM, llama-server, vLLM, and Ollama lanes, with a vertical router over the Orionfold experts.
Operator cockpit + leaderboard sidecar
Operator cockpit for the Spark — a local FastAPI sidecar streaming telemetry, chat, and side-by-side compares over a SQLite store, with a leak-proof mirror that publishes the leaderboard to this site.
Per-run cost ledger + $/quality axis
Per-run cost ledger for the cockpit — persists what each cloud call cost and ranks lanes by $/task and $/quality-point alongside speed and quality. Ledger, not governor.
Multi-source, provenance-aware recall layer
Provenance-aware memory index over the Second Brain — multi-source ingest into knowledge cards, a coverage report against the article index, and an eval-gated re-index.
Allow / escalate / defer spend brake
The spend governor the autonomous job drain consults before each dispatch — allow, escalate, or defer against an explicit budget envelope. Governor, not meter.
The eval verifier is the reward model
The verifier-to-reward adapter for RLVR — turns any fieldkit.eval scorer into a reward signal with failure classes. The eval harness is the reward model; nothing is learned.
Closed-loop RLVR with a held-out-only gate
Closed-loop RLVR driver — a GRPO-style REINFORCE-with-KL loop gated by held-out-only checkpoint selection and a minimum-corpus floor. Orchestration ships; GPU seams inject.
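Of the primitives listed above, pass@k has a widely used closed form: draw n samples per task, count the c that pass, and estimate the chance that at least one of k samples passes as 1 - C(n - c, k) / C(n, k). The sketch below implements that standard estimator. Whether fieldkit.eval computes it exactly this way is not stated here, and the function name `pass_at_k` is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples drawn per task, c: samples that passed, k: budget being scored.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, so the estimate is 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For a single task with 10 samples and 3 passes, `pass_at_k(10, 3, 1)` is the plain pass rate of 0.3. A benchmark score would average the estimate over tasks.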
from fieldkit.capabilities import kv_cache_bytes, weight_bytes
from fieldkit.nim import NIMClient
from fieldkit.rag import Document, Pipeline
from fieldkit.eval import Bench, Judge, is_refusal
# 70B Llama 3.1 KV cache at 32-user × 16K ctx, FP16:
kv_cache_bytes(hidden=8 * 128, n_layers=80, ctx=16384, batch=32, dtype="fp16")
# → 171_798_691_840 (≈ 171.8 GB)
# Naive RAG end-to-end:
with NIMClient(base_url="http://localhost:8000/v1",
               model="meta/llama-3.1-8b-instruct") as gen, \
     Pipeline(embed_url="http://localhost:8001/v1",
              pgvector_dsn="postgresql://spark:spark@localhost:5432/vectors",
              generator=gen) as pipe:
    pipe.ensure_schema()
    pipe.ingest([Document(id=1, text="...", label="spark")])
    print(pipe.ask("How much memory does the Spark have?")["answer"])
Four imports. One pipeline.
These four imports replace ~250 lines of glue from across the field notes — embed setup, retry policy, preflight checks, schema bootstrap, and strict-context prompting. Drop them into a fresh Python file and you have a working RAG.
- Retries baked in
NIMClient handles cold-starts, exponential backoff (0.5 s → 8 s), and connect timeouts so your pipeline doesn't fail under co-resident memory pressure.
- Preflight context check
8192-token preflight runs before every request — context overflow surfaces as a Python exception, not a NIM 400.
- Schema you can trust
Pipeline.ensure_schema() creates pgvector tables, indexes, and the right embedding dimension. Run it once and forget it.
- Strict-context prompting
The RAG prompt is verbatim from naive-rag-on-spark. Refusals are detected; trajectories are inspectable.
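The preflight idea in the list above, turning a would-be 400 into a local exception, can be sketched in a few lines. Everything named here is hypothetical and not fieldkit's API: `ContextOverflowError`, `preflight`, and the whitespace token counter are stand-ins, and counting the completion budget against the 8192-token ceiling is an assumption. A real check would count tokens with the served model's tokenizer.

```python
from typing import Callable

CONTEXT_LIMIT = 8192  # the token ceiling quoted above


class ContextOverflowError(ValueError):
    """Raised locally when a request would exceed the context window."""


def rough_token_count(text: str) -> int:
    # Assumption: whitespace splitting as a crude stand-in for a real tokenizer.
    return len(text.split())


def preflight(prompt: str, max_new_tokens: int,
              count_tokens: Callable[[str], int] = rough_token_count,
              limit: int = CONTEXT_LIMIT) -> int:
    """Return the prompt token count, or raise before any request is sent."""
    prompt_tokens = count_tokens(prompt)
    total = prompt_tokens + max_new_tokens
    if total > limit:
        raise ContextOverflowError(
            f"{prompt_tokens} prompt + {max_new_tokens} completion tokens "
            f"= {total} > {limit}"
        )
    return prompt_tokens
```

Taking the counter as a parameter lets the same check run with an exact tokenizer in production and a cheap approximation in tests.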
Every module ships from a working article.
These are the field notes that exercise one or more fieldkit modules end-to-end on the Spark. The article runs the math, ships the evidence, and the tested abstraction lives on as an importable class.
- The Refusal Floor Is Trainable — What a Frozen Curveball Proved About Prompts vs Weights (fieldkit.arena, fieldkit.eval)
- The Machine Manages Its Own Memory — and the Bug the Mocks Slept Through (fieldkit.memory, fieldkit.arena, fieldkit.harness, fieldkit.eval)
- The Machine Improves Itself — Closed-Loop RLVR on a DGX Spark, Where the Eval Harness Is the Reward (fieldkit.rl, fieldkit.reward, fieldkit.eval, fieldkit.lineage)
- The Gate Before the GPU — Deciding SFT vs RL vs RLVR Before You Spend the Run (fieldkit.rl, fieldkit.reward, fieldkit.eval)
- Cost-Routing the Hermes Harness — When Local Stops Being Enough on a DGX Spark (fieldkit.harness, fieldkit.eval)
- The Hermes Vertical Router on a DGX Spark — One Brain Always Warm, Five Specialists Summoned on Demand (fieldkit.harness)
- Picking the Hermes Brain on a DGX Spark — When Throughput Stops Being the Answer (fieldkit.eval, fieldkit.harness)
- Hermes Drives the Spark via fieldkit-as-MCP — The Agent That Operates Its Own Machine (fieldkit.harness, fieldkit.capabilities, fieldkit.quant, fieldkit.publish, fieldkit.rag)
Quick wins from the shell.
Sanity-check inference math, smoke-test a pipeline, or run a benchmark
without writing a line of Python. fieldkit is on $PATH after install.
$ fieldkit version
0.34.1
$ fieldkit envelope "70B params fp8"
~70 GB weights; leaves ~50 GB for KV + activations + system; tight but possible
$ fieldkit feasibility llama-3.1-70b --ctx 4096 --batch 32 --dtype fp8
weights (fp8): 70.0 GB
KV cache (fp8): 21.5 GB (ctx=4096, batch=32)
weights + KV: 91.5 GB
$ fieldkit bench rag --table fieldkit_cli_bench_rag --out /tmp/bench.json
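The feasibility figures above follow from standard transformer sizing arithmetic, and a short sketch reproduces them. The assumptions: 80 layers and 8 KV heads of dimension 128 (matching the `hidden=8 * 128, n_layers=80` arguments in the Python example), one byte per fp8 value, a round 70e9 parameter count, and decimal gigabytes. The `sketch_*` names are illustrative; this is not the fieldkit.capabilities source.

```python
BYTES_PER_VALUE = {"fp16": 2, "fp8": 1}


def sketch_weight_bytes(n_params: float, dtype: str) -> float:
    """Weights: one stored value per parameter."""
    return n_params * BYTES_PER_VALUE[dtype]


def sketch_kv_bytes(kv_hidden: int, n_layers: int, ctx: int,
                    batch: int, dtype: str) -> int:
    """KV cache: one K and one V tensor per layer, per token, per sequence."""
    return 2 * n_layers * kv_hidden * ctx * batch * BYTES_PER_VALUE[dtype]


weights = sketch_weight_bytes(70e9, "fp8")
kv = sketch_kv_bytes(8 * 128, 80, 4096, 32, "fp8")
print(f"weights (fp8):  {weights / 1e9:.1f} GB")
print(f"KV cache (fp8): {kv / 1e9:.1f} GB")
print(f"weights + KV:   {(weights + kv) / 1e9:.1f} GB")
```

The same formula with `ctx=16384` and `dtype="fp16"` gives 171,798,691,840 bytes, the figure quoted in the Python example.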
fieldkit · v0.34.1
Build with verified patterns.
Read the API reference, browse the field notes, or grab the source. Apache-2.0; Python 3.11+.
Install the package
$ pip install fieldkit