fieldkit
Verified-on-Spark Python patterns, lifted from the AI Native Field Notes into one importable package. Every module is the tested distillate of the articles it appears under.
Install the package
$ pip install fieldkit
Ship AI features faster, cheaper, with less glue.
Every AI build pays the glue tax — days lost to retry logic, context-window math, pgvector schemas, and eval rubrics; token bills inflated by overflow 400s and missing backoff; brittle copy-paste lifted from a half-dozen articles per project. The patterns are right; assembling them by hand is slow and expensive.
Each pattern was first verified inside an article — KV-cache arithmetic, the OpenAI-compatible NIM client with its 8192-token preflight, the strict-context RAG pipeline from naive-rag-on-spark, the eval harness behind every evidence file. fieldkit is where those patterns live after they're tested.
- NIM endpoints return a 400 on context overflow; fieldkit catches the overflow on the client before the request is sent
- Exponential backoff (0.5 s → 8 s) and cold-start polling are baked in, not bolted on
- pgvector tables, indexes, and dimensions stay in sync with one ensure_schema() call
- Bench, Judge, refusal detection, and trajectory analysis ship as one harness
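The 0.5 s → 8 s backoff mentioned above can be pictured as a doubling schedule with a cap. What follows is a minimal sketch, assuming plain doubling with no jitter; `backoff_delays` and `call_with_retries` are hypothetical names used for illustration, not fieldkit's API, and the real client's schedule may differ.

```python
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(base: float = 0.5, cap: float = 8.0) -> Iterator[float]:
    """Yield delays that double from `base` until they reach `cap` (inclusive)."""
    delay = base
    while delay <= cap:
        yield delay
        delay *= 2


def call_with_retries(
    fn: Callable[[], T],
    retry_on: tuple[type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, sleeping through the backoff schedule on transient errors."""
    for delay in backoff_delays():
        try:
            return fn()
        except retry_on:
            sleep(delay)  # wait, then try again with the next (longer) delay
    return fn()  # final attempt; any error now propagates to the caller
```

Passing `sleep` as a parameter keeps the helper testable: a test can record the delays instead of actually waiting.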
fieldkit is the tested distillate.
fieldkit in nine imports.
Each module is the public surface of a working article. Read the API reference, drop the import in, ship.
Memory and feasibility math
Typed Python facade over the project's Spark capabilities map. Canonical KV-cache and weight arithmetic.
OpenAI-compatible inference client
OpenAI-compatible NIM client with retries, context-overflow preflight, and a chunker that respects the 8192-token ceiling.
Ingest → retrieve → rerank → fuse
Composable ingest → retrieve → rerank → fuse RAG pipeline backed by pgvector + a NIM embedder + the strict-context grounded prompt from `naive-rag-on-spark`.
Bench, judge, assertion, pass@k
Bench, Judge, Trajectory, the project's refusal detector — plus the v0.2 verifier-loop additions (AssertionGrader, PassAtK, AgentRun, MatchedBaseComparison) for agent + RL benchmarks.
LoRA reference + weight-delta tracker
Fine-tuning primitives for any RL or SFT loop on the Spark — a CPU-resident LoRA reference snapshot that sidesteps peft 0.19's offloader bug, and a pre/post weight-delta tracker for sanity-checking that gradients actually moved.
Append-only trial log + prompt rendering
Append-only trial log + deterministic prompt rendering — the portable part of cxcscmu's Auto-Research-Recipes harness. A 17-column TSV per trial, a 10-class status enum, and the Markdown lineage block the next specialist reads at session entry.
GGUF quantize + four-axis measure
GGUF quantize + measure pipeline — wraps llama.cpp's `convert_hf_to_gguf.py` + `llama-quantize` + `llama-perplexity` + `llama-bench`, plus a pure-stdlib `nvidia-smi` thermal probe. Emits the `QuantReport` shape `fieldkit.publish.publish_quant` consumes. Non-GGUF formats (AWQ / GPTQ / EXL3 / MLX / NVFP4) are named stubs reserving the v0.5 API surface.
HuggingFace card + manifest + push
HuggingFace push surface — `ModelCard` (frontmatter + body renderer), `ArtifactManifest` (Phase-2 sync record), `HFHubAdapter` (lazy huggingface_hub wrapper, dry-run by default), `publish_quant` orchestrator. Every Orionfold artifact card carries the same Spark-tested measurement quad (perplexity, tok/s, thermal envelope, optional vertical-eval) — this module is what makes that shape deterministic.
Smoke checks without writing Python
A thin Typer wrapper over the modules. Quick checks and smoke benchmarks without writing Python.
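For readers unfamiliar with the pass@k metric named in the eval module above, here is a sketch of the standard unbiased estimator, 1 - C(n-c, k) / C(n, k). It assumes this is the quantity `PassAtK` reports; the function name `pass_at_k` is illustrative and is not taken from fieldkit.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n attempts of which c were correct, is a passing attempt."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k attempts failed, so the result is 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 4 attempts, 1 correct, and k = 2, the estimate is 1 - 3/6 = 0.5.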
from fieldkit.capabilities import kv_cache_bytes, weight_bytes
from fieldkit.nim import NIMClient
from fieldkit.rag import Document, Pipeline
from fieldkit.eval import Bench, Judge, is_refusal
# 70B Llama 3.1 KV cache at 32-user × 16K ctx, FP16:
kv_cache_bytes(hidden=8 * 128, n_layers=80, ctx=16384, batch=32, dtype="fp16")
# → 171_798_691_840 (≈ 171.8 GB)
# Naive RAG end-to-end:
with NIMClient(base_url="http://localhost:8000/v1",
               model="meta/llama-3.1-8b-instruct") as gen, \
     Pipeline(embed_url="http://localhost:8001/v1",
              pgvector_dsn="postgresql://spark:spark@localhost:5432/vectors",
              generator=gen) as pipe:
    pipe.ensure_schema()
    pipe.ingest([Document(id=1, text="...", label="spark")])
    print(pipe.ask("How much memory does the Spark have?")["answer"])
Four imports. One pipeline.
These four imports replace ~250 lines of glue from across the field notes — embed setup, retry policy, preflight checks, schema bootstrap, and strict-context prompting. Drop them into a fresh Python file and you have a working RAG.
- Retries baked in
NIMClient handles cold-starts, exponential backoff (0.5 s → 8 s), and connect timeouts so your pipeline doesn't fail under co-resident memory pressure.
- Preflight context check
8192-token preflight runs before every request — context overflow surfaces as a Python exception, not a NIM 400.
- Schema you can trust
Pipeline.ensure_schema() creates pgvector tables, indexes, and the right embedding dimension. Run it once and forget it.
- Strict-context prompting
The RAG prompt is verbatim from naive-rag-on-spark. Refusals are detected; trajectories are inspectable.
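The preflight idea in the list above can be sketched as a local budget check that raises before any HTTP call is made. This is an illustration under stated assumptions: the names (`ContextOverflowError`, `estimate_tokens`, `preflight`), the rough 4-characters-per-token estimate, and the reserved output budget are all hypothetical, and fieldkit's own check may count tokens differently.

```python
CONTEXT_LIMIT = 8192  # token ceiling quoted above


class ContextOverflowError(ValueError):
    """Raised locally when a request would not fit the context window."""


def estimate_tokens(text: str) -> int:
    # Crude heuristic (assumption): about 4 characters per token, rounded up.
    return -(-len(text) // 4)


def preflight(prompt: str, max_output_tokens: int = 512,
              limit: int = CONTEXT_LIMIT) -> int:
    """Return the estimated prompt size, or raise if prompt + output exceed `limit`."""
    prompt_tokens = estimate_tokens(prompt)
    needed = prompt_tokens + max_output_tokens
    if needed > limit:
        raise ContextOverflowError(
            f"request needs ~{needed} tokens, limit is {limit}")
    return prompt_tokens
```

Failing fast on the client turns an opaque server-side 400 into an exception the caller can catch, for example by chunking the prompt and retrying.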
Every module ships from a working article.
These are the field notes that exercise one or more fieldkit modules end-to-end on the Spark. The article runs the math, ships the evidence, and the tested abstraction lives on as an importable class.
- Orionfold/II-Medical-8B-GGUF on Spark — five medical-reasoning variants, MedMCQA mini-eval, ChatML reasoning format
fieldkit.quant · fieldkit.publish · fieldkit.eval · fieldkit.lineage
- Orionfold/SecurityLLM-GGUF on Spark — five cyber variants, CyberMetric mini-eval, MCQ letter scoring
fieldkit.quant · fieldkit.publish · fieldkit.eval · fieldkit.lineage
- Orionfold/Saul-7B-Instruct-v1-GGUF on Spark — five legal variants, LegalBench mini-eval, four-axis measurement card
fieldkit.quant · fieldkit.publish · fieldkit.eval · fieldkit.lineage
- Orionfold/finance-chat-GGUF on Spark — five variants, FinanceBench mini-eval, four-axis measurement card
fieldkit.quant · fieldkit.publish · fieldkit.eval · fieldkit.lineage
- Adaptive Turn Clipping on a Single Spark — A²TGPO, Studied from Source
fieldkit.capabilities · fieldkit.training · fieldkit.lineage
- Reading the Lineage Primitive — cxcscmu Auto-Research, Studied from release_artifacts
fieldkit.capabilities · fieldkit.training · fieldkit.lineage
- T²PO on Spark — When the Training Pool Says 28/32 and Held-out Says 9/158
fieldkit.capabilities · fieldkit.eval · fieldkit.training
- ClawGym on Spark — A 7B Base, A LoRA Adapter, and the +15 pp the Adapter Earned
fieldkit.nim
Quick wins from the shell.
Sanity-check inference math, smoke-test a pipeline, or run a benchmark without writing a line of Python. fieldkit is on $PATH after install.
$ fieldkit version
0.4.2
$ fieldkit envelope "70B params fp8"
~70 GB weights; leaves ~50 GB for KV + activations + system; tight but possible
$ fieldkit feasibility llama-3.1-70b --ctx 4096 --batch 32 --dtype fp8
weights (fp8): 70.0 GB
KV cache (fp8): 21.5 GB (ctx=4096, batch=32)
weights + KV: 91.5 GB
$ fieldkit bench rag --table fieldkit_cli_bench_rag --out /tmp/bench.json
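The feasibility figures in the transcript can be cross-checked by hand. The sketch below assumes the usual KV-cache formula (2 tensors × layers × KV width × context × batch × bytes per element) and reuses the 80 layers and 8 × 128 KV width from the Python example earlier on this page. `kv_bytes` and `weight_bytes_for` are illustrative stand-ins, not fieldkit's implementations, and GB here means 10^9 bytes.

```python
BYTES_PER_ELEMENT = {"fp16": 2, "fp8": 1}


def kv_bytes(kv_width: int, n_layers: int, ctx: int, batch: int, dtype: str) -> int:
    # Keys and values (factor 2) are cached per layer, per token, per sequence.
    return 2 * n_layers * kv_width * ctx * batch * BYTES_PER_ELEMENT[dtype]


def weight_bytes_for(n_params: int, dtype: str) -> int:
    # One stored element per parameter.
    return n_params * BYTES_PER_ELEMENT[dtype]


weights = weight_bytes_for(70_000_000_000, "fp8")
kv = kv_bytes(kv_width=8 * 128, n_layers=80, ctx=4096, batch=32, dtype="fp8")

print(f"weights (fp8):  {weights / 1e9:.1f} GB")         # 70.0 GB
print(f"KV cache (fp8): {kv / 1e9:.1f} GB")              # 21.5 GB
print(f"weights + KV:   {(weights + kv) / 1e9:.1f} GB")  # 91.5 GB
```

Calling `kv_bytes` with ctx=16384 and dtype="fp16" gives 171,798,691,840 bytes, which matches the FP16 figure in the Python example.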
fieldkit · v0.4.2
Build with verified patterns.
Read the API reference, browse the field notes, or grab the source. Apache-2.0; Python 3.11+.