fieldkit

Verified-on-Spark Python patterns, lifted from the AI Native Field Notes into one importable package. Every module is the tested distillate of the articles it appears under.

v0.34.1 · Apache-2.0 · Python 3.11+

Install the package

Terminal
$ pip install fieldkit
KV-cache math · NIM client · Naive-RAG · Eval harness
The Problem

Ship AI features faster, cheaper, with less glue.

Every AI build pays the glue tax — days lost to retry logic, context-window math, pgvector schemas, and eval rubrics; token bills inflated by overflow 400s and missing backoff; brittle copy-paste lifted from a half-dozen articles per project. The patterns are right; assembling them by hand is slow and expensive.

35 articles already distill into fieldkit (/field-notes/)
18 modules, one import each: fieldkit.{capabilities, nim, rag, eval, training, lineage, quant, publish, cli, viz, notebook, harness, arena, cost, memory, budget, reward, rl}
8192-token preflight catches NIM 400s before the network call

Each pattern was first verified inside an article — KV-cache arithmetic, the OpenAI-compatible NIM client with its 8192-token preflight, the strict-context RAG pipeline from naive-rag-on-spark, the eval harness behind every evidence file. fieldkit is where those patterns live after they're tested.

Context overflow

NIM endpoints reject an over-long context with an HTTP 400; fieldkit catches the overflow on the client, before the request goes out

Retry gaps

Exponential backoff (0.5 s → 8 s) and cold-start polling are baked in, not bolted on

Schema drift

pgvector tables, indexes, and dimensions stay in sync — one ensure_schema() call

Eval blindness

Bench, Judge, refusal detection, and trajectory analysis ship as one harness

fieldkit is the tested distillate.
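
The retry behavior named under "Retry gaps" can be pictured with a small sketch. It assumes the delay doubles from 0.5 s and is capped at 8 s, and that only connection errors are retried; the doubling factor, the attempt count, and the helper names `backoff_delays` and `call_with_retries` are illustrative and are not taken from fieldkit.

```python
import time
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(base: float = 0.5, cap: float = 8.0, attempts: int = 6) -> Iterator[float]:
    """Yield sleep intervals that double from `base` and stop growing at `cap`."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= 2


def call_with_retries(fn: Callable[[], T], sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fn`; on a connection error, sleep and try again, then re-raise the last error."""
    last_error: Optional[Exception] = None
    for delay in backoff_delays():
        try:
            return fn()
        except ConnectionError as exc:  # assumption: only transport errors are retried
            last_error = exc
            sleep(delay)
    assert last_error is not None
    raise last_error
```

Under these assumptions the schedule is 0.5, 1, 2, 4, 8, 8 seconds. Passing `sleep` as a parameter keeps the loop testable without real waiting.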

The Solution

fieldkit in 18 imports.

Each module is the public surface of a working article. Read the API reference, drop the import in, ship.

fieldkit.capabilities

Memory and feasibility math

Typed Python facade over the project's Spark capabilities map — canonical KV-cache and weight arithmetic for sizing what fits in 128 GB.

Read the API
fieldkit.nim

OpenAI-compatible inference client

OpenAI-compatible NIM client with retries, context-overflow preflight, and a chunker that respects the 8192-token ceiling.

Read the API
fieldkit.rag (most-used)

Ingest → retrieve → rerank → fuse

Composable ingest → retrieve → rerank → fuse RAG pipeline backed by pgvector and a NIM embedder, with a strict grounded-answer prompt that refuses to answer outside its context.

Read the API
fieldkit.eval

Bench, judge, assertion, pass@k

Bench, Judge, and Trajectory primitives plus a refusal detector, assertion grader, and pass@k — the verifier loop behind every agent and RL benchmark on this site.

Read the API
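
For readers unfamiliar with pass@k, here is a minimal sketch of the commonly used unbiased estimator: draw n samples per task, count the c that pass, and estimate the chance that at least one of k samples passes as 1 − C(n−c, k) / C(n, k). Whether fieldkit.eval uses this exact estimator is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed (assumed estimator, see lead-in)."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples must include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 10 samples and 3 passes, `pass_at_k(10, 3, 1)` is 0.3 up to floating-point rounding, and `pass_at_k(10, 3, 5)` is 1 − 21/252, roughly 0.917.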
fieldkit.training

SFT/RL recipes, drivers + probes

Fine-tuning primitives for LoRA SFT and RL on the Spark — declarative recipes, an HF→Megatron converter, symmetric NeMo/Unsloth train-and-export drivers, reasoning-preservation probes, and weight-delta sanity checks.

Read the API
fieldkit.lineage

Append-only trial log + prompt rendering

Append-only trial log plus deterministic prompt rendering — every experiment lands as one typed TSV row, and a Markdown lineage block briefs the next session on entry.

Read the API
fieldkit.quant

GGUF quantize + four-axis measure

GGUF quantize-and-measure pipeline over llama.cpp — convert, quantize, perplexity, speed, and thermal probes that emit the QuantReport every published artifact card is built from.

Read the API
fieldkit.publish

HuggingFace card + manifest + push

The HuggingFace push surface — model-card renderer, artifact manifest, and a dry-run-by-default hub adapter, so every Orionfold card ships with the same Spark-tested measurement quad.

Read the API
fieldkit.cli

Smoke checks without writing Python

A thin Typer wrapper over the modules. Quick checks and smoke benchmarks without writing Python.

Read the API
fieldkit.viz

Branded charts + hero tables from a manifest

Branded chart and hero-table builders — turns an artifact manifest into marketing-grade matplotlib figures and great_tables displays, styled by the bundled Orionfold theme.

Read the API
fieldkit.notebook

Dual-path Spark/Colab runtime + scaffold

Dual-path notebook runtime and scaffolding — detects Spark vs Colab/Kaggle, opens a published GGUF behind one .chat() surface, and lays the skeleton the notebook-author fills with prose.

Read the API
fieldkit.harness

Install, serve & harden an agent harness

Deterministic spine for running an agent harness on the Spark — install, configure, serve, harden, and profile Hermes across the NIM, llama-server, vLLM, and Ollama lanes, with a vertical router over the Orionfold experts.

Read the API
fieldkit.arena

Operator cockpit + leaderboard sidecar

Operator cockpit for the Spark — a local FastAPI sidecar streaming telemetry, chat, and side-by-side compares over a SQLite store, with a leak-proof mirror that publishes the leaderboard to this site.

Read the API
fieldkit.cost

Per-run cost ledger + $/quality axis

Per-run cost ledger for the cockpit — persists what each cloud call cost and ranks lanes by $/task and $/quality-point alongside speed and quality. Ledger, not governor.

Read the API
fieldkit.memory

Multi-source, provenance-aware recall layer

Provenance-aware memory index over the Second Brain — multi-source ingest into knowledge cards, a coverage report against the article index, and an eval-gated re-index.

Read the API
fieldkit.budget

Allow / escalate / defer spend brake

The spend governor the autonomous job drain consults before each dispatch — allow, escalate, or defer against an explicit budget envelope. Governor, not meter.

Read the API
fieldkit.reward

The eval verifier is the reward model

The verifier-to-reward adapter for RLVR — turns any fieldkit.eval scorer into a reward signal with failure classes. The eval harness is the reward model; nothing is learned.

Read the API
fieldkit.rl

Closed-loop RLVR with a held-out-only gate

Closed-loop RLVR driver — a GRPO-style REINFORCE-with-KL loop gated by held-out-only checkpoint selection and a minimum-corpus floor. Orchestration ships; GPU seams inject.

Read the API
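
The "GRPO-style" part of the loop can be illustrated with the group-relative advantage step: several completions are sampled for one prompt, each is scored by the verifier, and rewards are normalised within that group. The sketch below shows only this step. The KL term, the policy update, and the held-out gate are omitted, and the function name and the `eps` value are assumptions rather than fieldkit's API.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalise verifier rewards within one prompt's group of sampled completions."""
    if not rewards:
        raise ValueError("need at least one reward")
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std over the group
    # eps keeps the division finite when every completion got the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

A group where every completion earns the same reward yields all-zero advantages, so it contributes no gradient signal.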
verified-on-Spark · tested distillate · Apache-2.0 · Python 3.11+ · pgvector + NIM
quickstart.py
from fieldkit.capabilities import kv_cache_bytes, weight_bytes
from fieldkit.nim import NIMClient
from fieldkit.rag import Document, Pipeline
from fieldkit.eval import Bench, Judge, is_refusal

# 70B Llama 3.1 KV cache at 32 users × 16K ctx, FP16.
# hidden is the KV width under grouped-query attention: 8 KV heads × 128 head dim.
kv_cache_bytes(hidden=8 * 128, n_layers=80, ctx=16384, batch=32, dtype="fp16")
# → 171_798_691_840  (≈ 171.8 GB)

# Naive RAG end-to-end:
with NIMClient(base_url="http://localhost:8000/v1",
               model="meta/llama-3.1-8b-instruct") as gen, \
     Pipeline(embed_url="http://localhost:8001/v1",
              pgvector_dsn="postgresql://spark:spark@localhost:5432/vectors",
              generator=gen) as pipe:
    pipe.ensure_schema()
    pipe.ingest([Document(id=1, text="...", label="spark")])
    print(pipe.ask("How much memory does the Spark have?")["answer"])
Quickstart

Four imports. One pipeline.

These four imports replace ~250 lines of glue from across the field notes: embed setup, retry policy, preflight checks, schema bootstrap, and strict-context prompting. Drop them into a fresh Python file, point them at running NIM and pgvector endpoints, and you have a working RAG pipeline.

  • Retries baked in

    NIMClient handles cold-starts, exponential backoff (0.5 s → 8 s), and connect timeouts so your pipeline doesn't fail under co-resident memory pressure.

  • Preflight context check

    8192-token preflight runs before every request — context overflow surfaces as a Python exception, not a NIM 400.

  • Schema you can trust

    Pipeline.ensure_schema() creates pgvector tables, indexes, and the right embedding dimension. Run it once and forget it.

  • Strict-context prompting

    The RAG prompt is verbatim from naive-rag-on-spark. Refusals are detected; trajectories are inspectable.
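
The preflight idea in the list above can be sketched as a client-side guard that counts tokens and raises before any request is sent. This sketch assumes the 8192 ceiling covers prompt plus requested completion tokens, and it uses a whitespace split as a stand-in tokenizer; a real tokenizer will count differently. `ContextOverflowError`, `approx_token_count`, and `preflight` are hypothetical names, not fieldkit's API.

```python
from typing import Callable

CONTEXT_LIMIT = 8192  # ceiling quoted in the text


class ContextOverflowError(ValueError):
    """Raised locally instead of letting the server answer with a 400."""


def approx_token_count(text: str) -> int:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


def preflight(
    prompt: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = approx_token_count,
    limit: int = CONTEXT_LIMIT,
) -> int:
    """Return the token budget the request would use, or raise if it exceeds `limit`."""
    used = count_tokens(prompt) + max_tokens
    if used > limit:
        raise ContextOverflowError(f"request needs {used} tokens, limit is {limit}")
    return used
```

Calling `preflight(prompt, max_tokens=512)` before the HTTP call turns an overflow into an ordinary Python exception that the caller can handle, for example by chunking the prompt.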

Without Python

Quick wins from the shell.

Sanity-check inference math, smoke-test a pipeline, or run a benchmark without writing a line of Python. fieldkit is on $PATH after install.

Terminal zsh
$ fieldkit version
0.34.1

$ fieldkit envelope "70B params fp8"
~70 GB weights; leaves ~50 GB for KV + activations + system; tight but possible

$ fieldkit feasibility llama-3.1-70b --ctx 4096 --batch 32 --dtype fp8
weights (fp8):       70.0 GB
KV cache (fp8):      21.5 GB  (ctx=4096, batch=32)
weights + KV:        91.5 GB

$ fieldkit bench rag --table fieldkit_cli_bench_rag --out /tmp/bench.json
Full CLI reference
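
The feasibility figures in the terminal session above can be reproduced by hand. The sketch below assumes the usual KV-cache formula (one key and one value tensor per layer), exactly 70e9 parameters, and decimal gigabytes. The `_sketch` functions are illustrative stand-ins, not the fieldkit.capabilities implementations.

```python
BYTES_PER_ELEMENT = {"fp16": 2, "fp8": 1}


def kv_cache_bytes_sketch(kv_hidden: int, n_layers: int, ctx: int, batch: int, dtype: str) -> int:
    """KV cache size: 2 tensors (K and V) per layer, each kv_hidden × ctx × batch elements."""
    return 2 * n_layers * kv_hidden * ctx * batch * BYTES_PER_ELEMENT[dtype]


def weight_bytes_sketch(n_params: int, dtype: str) -> int:
    """Weight footprint: one element per parameter."""
    return n_params * BYTES_PER_ELEMENT[dtype]


# Llama 3.1 70B shape: 80 layers, 8 KV heads × 128 head dim.
kv_fp8 = kv_cache_bytes_sketch(8 * 128, 80, ctx=4096, batch=32, dtype="fp8")
weights_fp8 = weight_bytes_sketch(70_000_000_000, "fp8")

print(f"weights (fp8): {weights_fp8 / 1e9:.1f} GB")           # 70.0 GB
print(f"KV cache (fp8): {kv_fp8 / 1e9:.1f} GB")               # 21.5 GB
print(f"weights + KV: {(weights_fp8 + kv_fp8) / 1e9:.1f} GB")  # 91.5 GB
```

The same formula with `ctx=16384` and `dtype="fp16"` gives 171_798_691_840 bytes, matching the quickstart comment.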

fieldkit · v0.34.1

Build with verified patterns.

Read the API reference, browse the field notes, or grab the source. Apache-2.0; Python 3.11+.
