The Machine Improves Itself — Closed-Loop RLVR on a DGX Spark, Where the Eval Harness Is the Reward
Closed-loop RLVR on one box: an eval→reward→fine-tune loop where the Spark's own verifiers ARE the reward — no learned reward model. The hero finding is defensive: pick the checkpoint on a frozen held-out split, never the training pool, or the loop reports success while it regresses.
Series: Machine that Builds Machines
Terms in this piece (3)
- RLVR: Reinforcement Learning from Verifiable Rewards. Instead of a learned reward model scoring outputs (RLHF), a deterministic verifier (a checker that returns pass/fail or a graded score) supplies the reward directly. It works when correctness is checkable: math answers, structured-output conformance, code that compiles, a claim that validates. The 2026 reasoning-model wave (R1-class models) is largely RLVR at scale.
- GRPO: Group Relative Policy Optimization. For each prompt, sample a group of K rollouts, score each with the reward, and compute each rollout’s advantage as its score minus the group mean (optionally divided by the group’s spread). The group is the baseline a value network would otherwise estimate, so GRPO drops the learned critic entirely. Single-GPU-friendly; the algorithm behind most 2026 open reasoning models.
- Held-out split: A subset of the corpus carved off before training starts and never used to compute a gradient; it is used only to measure whether the policy is generalizing or memorizing the rollout pool. "Frozen" means the split is fixed before step 0 (here, heldout_frac=0.2 of a ≥100-row corpus) so it can’t drift into the training signal. Checkpoint selection reads this split and nothing else.
The meta-program opener ended on an admission: of the three beats a self-improving loop needs — engine, hands, pane — the pane was the least built, and the loop wasn’t yet closed-loop in the strong sense. “A fully autonomous eval → reward → fine-tune → re-eval cycle, where a verifier’s score directly drives the next training run with no human in the middle, isn’t wired. The pieces exist; the wiring is the work ahead.” This article is that wiring. The pane shipped (the Arena control plane), the hands shipped (the budget-governed overnight drain), and now the engine — fieldkit.rl plus fieldkit.reward — closes the loop. The box can improve a model from its own measured signal.
The claim that makes this more than another fine-tuning post is where the reward comes from. Reinforcement learning on language models is usually gated by the most expensive component in the stack: a reward model — a separately trained network, fed by a human-annotation pipeline, that scores outputs. The disruptive move in 2026-era post-training is to delete it. If you already have a deterministic verifier that can decide whether an answer is correct — a regex that checks IRAC structure, a judge that scores patent-claim validity against seven dimensions, a Spearman correlation against a prior-art ranking — then that verifier is the reward function. You don’t learn a reward; you already wrote one. On a DGX Spark, where the seven fieldkit.eval verifiers were built across a dozen articles before any of this, the reward model was sitting on disk the whole time.
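To make "the verifier is the reward" concrete, here is a minimal sketch of a regex-based structure checker that returns a graded score. The section markers and the function name `irac_score` are assumptions for illustration; the real `fieldkit.eval.irac_structure` scorer may look for different cues.

```python
import re

# Hypothetical section markers; the real irac_structure scorer may differ.
IRAC_PATTERNS = {
    "issue": re.compile(r"^\s*issue\s*:", re.IGNORECASE | re.MULTILINE),
    "rule": re.compile(r"^\s*rule\s*:", re.IGNORECASE | re.MULTILINE),
    "application": re.compile(r"^\s*(application|analysis)\s*:", re.IGNORECASE | re.MULTILINE),
    "conclusion": re.compile(r"^\s*conclusion\s*:", re.IGNORECASE | re.MULTILINE),
}


def irac_score(text: str) -> float:
    """Return the fraction of the four IRAC sections found in `text`."""
    found = sum(1 for pattern in IRAC_PATTERNS.values() if pattern.search(text))
    return found / len(IRAC_PATTERNS)


answer = "Issue: is the claim novel?\nRule: 35 U.S.C. 102.\nConclusion: likely not."
score = irac_score(answer)   # three of the four sections are present
passed = score >= 0.75       # same threshold idea as pass_threshold=0.75
```

The checker is deterministic and readable end to end, which is the property the argument leans on: the same function that grades an eval can grade a rollout.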
Why this matters for a personal AI builder
On a cloud platform, the reward model is the moat and the meter. You rent it, or you pay an annotation vendor to feed it, and either way the thing that decides “is this output good” lives behind someone else’s API. That single dependency is what keeps reinforcement fine-tuning out of reach for an individual — not the GPU, the judgment. The Spark inverts it. When the verifier is a function you wrote, audited, and can cat, you own the reward function outright. There is no annotation pipeline to fund, no reward-model API to call mid-rollout, no network hop between “the model produced an answer” and “here is its score.” The loop closes on one box, under one user, against models you quantized yourself.
That ownership is the edge-builder’s version of the whole arc. The corpus is yours, the GPU is yours, the agent loop is yours — and now the part that on every other platform belongs to a vendor, the signal that drives learning, is also yours. The independence isn’t “no cloud bill.” It’s that the thing teaching the model has no owner but you, and you can read every line of it before you trust it with a gradient.
Where this sits in the stack — the loop, and its one load-bearing defense
GRPO is the algorithm under the hood, and its relevance to a single box is specific: it drops the value network. Classic policy-gradient RL needs a learned critic to estimate a baseline; GRPO replaces that critic with a group — sample K answers to the same prompt, score them all with the verifier, and let the group’s mean be the baseline. No critic to train, no reward model to host. That is what makes reinforcement fine-tuning fit in 128 GB: the only model resident is the one you’re training, plus one inference lane to sample from it.
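The group-as-baseline idea fits in a few lines. This is a sketch of the arithmetic only, not the `fieldkit.reward.group_advantage` implementation; the function name, the `normalize` flag, and the `eps` guard are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(scores, normalize=True, eps=1e-6):
    """Advantage of each rollout = its score minus the group mean.

    With normalize=True the centered scores are also divided by the
    group's population standard deviation (plus eps to avoid 0/0).
    """
    baseline = mean(scores)
    centered = [s - baseline for s in scores]
    if not normalize:
        return centered
    spread = pstdev(scores)
    return [c / (spread + eps) for c in centered]


group_scores = [1.0, 0.5, 0.25, 0.25]   # K=4 rollouts for one prompt
advantages = group_relative_advantages(group_scores, normalize=False)
# mean is 0.5, so the advantages are [0.5, 0.0, -0.25, -0.25]
```

One consequence worth noticing: a group whose rollouts all score the same yields all-zero advantages, so that prompt contributes no gradient for the step.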
The architecture is a loop of five beats, and the diagram below is the anatomy of one step. But the beat that earns the accent isn’t the clever one — it’s the defensive one. The whole loop hinges on a single rule that is easy to get wrong and catastrophic when you do: the checkpoint you ship is selected on a frozen held-out split, never on the training pool. Skip that, and the loop will reliably tell you it succeeded while the model regressed. I’ll spend the journey on why, because it’s the most expensive lesson the Spark taught this arc.
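The frozen split that rule depends on can be sketched as a seeded, one-time carve. The function name `carve_heldout` and the seed handling are assumptions; only `heldout_frac=0.2` and the 100-row minimum come from the article.

```python
import random


def carve_heldout(rows, heldout_frac=0.2, corpus_min=100, seed=0):
    """Split `rows` once, before step 0, into (pool, heldout) tuples.

    A fixed seed makes the split reproducible, and returning tuples
    discourages later code from mutating either side.
    """
    if len(rows) < corpus_min:
        raise ValueError(f"corpus has {len(rows)} rows; need at least {corpus_min}")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_heldout = max(1, round(len(rows) * heldout_frac))
    heldout_idx = frozenset(indices[:n_heldout])
    pool = tuple(r for i, r in enumerate(rows) if i not in heldout_idx)
    heldout = tuple(r for i, r in enumerate(rows) if i in heldout_idx)
    return pool, heldout


corpus = [f"task-{i}" for i in range(120)]
pool, heldout = carve_heldout(corpus)   # 96 pool rows, 24 held-out rows
```

Rollouts are only ever sampled from `pool`; `heldout` is read by the gate and by checkpoint selection, and by nothing that produces a gradient.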
The loop is dispatched, not clicked. An 8.5-hour run can’t be a synchronous button — so rl_run is an Arena job kind drained overnight by the budget-governed scheduler, single-lane, after the recall layer is asked “has this been tried?” and the cost ledger prices RL-vs-pay. That is the entire reason the pane and hands were built before the engine: an autonomous training loop on a no-auto-push box needs somewhere to safely land its output. The engine is the payload; the control plane is the truck.
The journey — a proven run, and the engine it became
This is not a fresh result. The feasibility was proven months ago in clawgym on Spark with GRPO: a single GB10, a 42-task pool drawn 8-per-step at K=4 (a 32-rollout bundle), 34 GRPO steps in 8.5 hours, with a binary task-grader as the reward and no learned reward model. The agent’s task-completion went from 0 of 158 to 154 of 158 — a 97.5-point lift — with mean turns down 58% and wall-clock down 62%. The textbook RLVR claim (“under 100 examples, a single GPU, the verifier scores directly”) held on a desktop. What fieldkit.rl does is productize that run — the one that actually ran — with the three corrections the proven version taught, baked in so the next vertical doesn’t relearn them.
The first correction is the one the abstract roadmap got wrong. The plan named Unsloth-GRPO and NeMo-RL — the library names you’d reach for. Neither drove the working run. A hand-rolled REINFORCE-with-KL loop of roughly 280 lines did, with a kill-and-restart of vLLM between steps to load the updated adapter. So fieldkit.rl wraps that loop, and treats the named libraries as a documented fallback lane, not the default. The cautionary precedent is real: a pinned-vLLM RL recipe has burned this arc before on aarch64 + CUDA-13 wheel gaps.
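For readers who have not seen the objective written out, one common form of a REINFORCE-with-KL loss looks like the sketch below. It works on plain floats to show the arithmetic; a real trainer would compute the same expression on tensors that carry gradients. The function name and the `kl_coef` value are assumptions, and this is not the roughly 280-line loop from the working run.

```python
def reinforce_kl_loss(logp_policy, logp_reference, advantages, kl_coef=0.1):
    """Mean over rollouts of -(advantage * logp) + kl_coef * (logp - logp_ref).

    logp_policy / logp_reference are sequence log-probabilities of each
    sampled rollout under the current policy and the frozen reference.
    (logp - logp_ref) is a single-sample estimate of KL(policy || reference)
    for rollouts drawn from the policy.
    """
    if not advantages:
        raise ValueError("need at least one rollout")
    if not (len(logp_policy) == len(logp_reference) == len(advantages)):
        raise ValueError("inputs must have one entry per rollout")
    total = 0.0
    for logp, logp_ref, adv in zip(logp_policy, logp_reference, advantages):
        policy_term = -adv * logp                 # push up good rollouts
        kl_term = kl_coef * (logp - logp_ref)     # stay near the reference
        total += policy_term + kl_term
    return total / len(advantages)


loss = reinforce_kl_loss(
    logp_policy=[-2.0, -4.0],
    logp_reference=[-2.0, -3.0],
    advantages=[0.5, -0.5],
)
```

Minimizing this raises the log-probability of rollouts with positive advantage and lowers it for negative ones, while the KL term keeps the adapter from drifting far from the base model.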
The reward is a thin adapter, exactly as the thesis promises. Any fieldkit.eval scorer becomes a reward callable — and crucially, the reward is not a bare bit:
from fieldkit.reward import RewardAdapter, group_advantage
from fieldkit.eval import irac_structure
# the verifier IS the reward — no learned reward model is trained or hosted
reward = RewardAdapter(irac_structure, pass_threshold=0.75)
rewards = reward.score_group(rollouts) # one Reward(success, failure_class, auxiliary) each
rewards[0].success # True if ≥3 of 4 IRAC components present
rewards[0].scalar # 0.0 / 0.25 / 0.5 / 0.75 / 1.0 — dense partial credit
rewards[0].failure_class # FailureLabel.KEEP, .DISCARD, or .CRASH on a raising verifier
adv = group_advantage(rewards) # the group is the baseline — GRPO drops the critic
That failure_class field is the second correction, and it reuses something already built. A binary keep/revert reward mode-collapses: in the trajectory-distillation work, a 42-row corpus produced 5-of-5 training keeps on a single knob and 0-of-8 held-out generalization. The fix is a categorical signal — (success, failure_class, auxiliary) — and the categories were already shipped as fieldkit.lineage.FailureLabel, the same 10-class enum the autoresearch loop uses to label what a trial was worth. The reward and the loop’s lineage record share one vocabulary because they’re the same enum; nothing was invented to densify the gradient.
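The shape of that categorical signal can be sketched with a small dataclass. `FailureLabelSketch` is a three-member stand-in built from the labels named above, not the real 10-class `fieldkit.lineage.FailureLabel`; `RewardSketch` and `score_rollout` are likewise hypothetical names.

```python
from dataclasses import dataclass, field
from enum import Enum


class FailureLabelSketch(Enum):
    # Stand-in: only the three labels the article names.
    KEEP = "keep"
    DISCARD = "discard"
    CRASH = "crash"


@dataclass(frozen=True)
class RewardSketch:
    success: bool
    failure_class: FailureLabelSketch
    auxiliary: dict = field(default_factory=dict)

    @property
    def scalar(self) -> float:
        """Dense partial credit, read from the auxiliary payload."""
        return float(self.auxiliary.get("score", 0.0))


def score_rollout(verifier, rollout, pass_threshold=0.75):
    """Run a verifier on one rollout; a raising verifier becomes CRASH."""
    try:
        score = float(verifier(rollout))
    except Exception as exc:
        return RewardSketch(False, FailureLabelSketch.CRASH,
                            {"score": 0.0, "error": repr(exc)})
    success = score >= pass_threshold
    label = FailureLabelSketch.KEEP if success else FailureLabelSketch.DISCARD
    return RewardSketch(success, label, {"score": score})


partial = score_rollout(lambda text: 0.5, "two of four sections")
```

Here `partial.success` is false but `partial.scalar` is 0.5, so a near miss still differs from a total miss when the group advantage is computed.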
The loop itself takes the GPU as injected seams — and this is where the article has to be honest about what shipped:
from fieldkit.rl import GRPOConfig, RLLoop, gpu_seams
cfg = GRPOConfig(base="patent-strategist-base", vllm_pin="0.10.2",
                 group_k=4, tasks_per_step=8, heldout_every=10, corpus_min=100)
# the three GPU seams: a vLLM sampler, the REINFORCE+KL trainer, the held-out eval.
# gpu_seams() RAISES until a pinned aarch64+CUDA-13 vLLM is vendored into the
# fieldkit[rl] extra. A test injects fakes; the Arena run_rl_loop tool calls this.
sampler, trainer, heldout_eval = gpu_seams(cfg)
loop = RLLoop(cfg, reward=reward, bench=bench,  # bench = the patent gold JSONL
              sampler=sampler, trainer=trainer, heldout_eval=heldout_eval)
snapshot = loop.run() # LineageSnapshot — the rl_run card
snapshot.summary()["selected_on"] # "heldout" — never the pool
The shipped v0.20.0 engine is the orchestration — the split, the group math, the gate scheduling, the held-out-only checkpoint pick, the lineage record — with torch and vLLM behind seams that never import at module load. The real GPU backend is a documented fast-follow: vendor a pinned vLLM with an aarch64 + CUDA-13 wheel and the proven REINFORCE loop into the fieldkit[rl] extra, and gpu_seams resolves. Until then, callers inject their own. I’d rather show you the seam than pretend the loop has run end-to-end through this code — it hasn’t. The 97.5-point number is the predecessor run’s; the engine is the predecessor’s lessons made reusable.
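What "callers inject their own" means in practice can be shown with a toy orchestration loop driven entirely by fakes. Everything here (`run_steps`, the fake callables, their signatures) is an assumption for illustration and does not mirror the real `RLLoop` interface; it only shows that the scheduling logic is testable with no GPU present.

```python
def run_steps(prompts, sampler, trainer, heldout_eval, score,
              steps, group_k=4, heldout_every=10):
    """Orchestrate sample -> score -> train, gating on held-out every N steps."""
    heldout_curve = {}
    for step in range(1, steps + 1):
        batch = []
        for prompt in prompts:
            rollouts = sampler(prompt, group_k)
            scores = [score(r) for r in rollouts]
            baseline = sum(scores) / len(scores)      # group mean as baseline
            batch.extend((r, s - baseline) for r, s in zip(rollouts, scores))
        trainer(batch)
        if step % heldout_every == 0:
            heldout_curve[step] = heldout_eval()
    return heldout_curve


# Fakes standing in for the vLLM sampler, the trainer, and the held-out eval.
def fake_sampler(prompt, k):
    return [f"{prompt}#{i}" for i in range(k)]


trained_batches = []


def fake_trainer(batch):
    trained_batches.append(batch)


_heldout_values = iter([0.2, 0.4])


def fake_heldout_eval():
    return next(_heldout_values)


def fake_score(rollout):
    return 1.0 if rollout.endswith("#0") else 0.0


curve = run_steps(["p1", "p2"], fake_sampler, fake_trainer,
                  fake_heldout_eval, fake_score, steps=20)
```

With 20 steps and a gate every 10, the held-out eval fires exactly twice and the trainer sees 20 batches of 8 scored rollouts each.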
Verification — what success looks like, and why the obvious metric lies
Here is the third correction, and the reason the diagram’s accent is where it is. On the training pool, the loop converges beautifully. In the T²PO run, at step 45 the training-pool task-pass hit 87.5% — 28 of 32. The same checkpoint, scored on the 158-task held-out set, passed 9. That’s 5.7%. An 81.8-percentage-point inversion: the strongest training-side checkpoint was the worst held-out checkpoint. If you select the model you ship by watching the pool number climb, you will ship the regression, and the loop will report a triumph the whole way down.
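The arithmetic behind that inversion is short enough to check directly, using only the counts quoted above:

```python
pool_rate = 28 / 32        # training-pool pass rate at step 45
heldout_rate = 9 / 158     # same checkpoint on the 158-task held-out set

pool_pct = pool_rate * 100
heldout_pct = heldout_rate * 100
gap_points = pool_pct - heldout_pct   # percentage points, not a ratio

print(f"pool {pool_pct:.1f}%  held-out {heldout_pct:.1f}%  gap {gap_points:.1f} points")
```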
So the engine encodes the defense structurally, not as advice. RLLoop carves a frozen held-out split before step 0, runs the held-out gate every heldout_every (≤10) steps, and selects the published checkpoint with argmax over held-out scores only — summary()["selected_on"] is the string "heldout", and a unit test proves the selector picks the held-out-best step while the pool climbs monotonically past it. The held-out eval is itself dispatched as an Arena eval_rerun job, so the gate is a control-plane artifact you can audit in the leaderboard, not a manual step someone can skip under deadline. Success on this machine isn’t “the loss went down.” It’s “the held-out curve peaked at step N, we shipped step N, and the lineage card shows exactly that.”
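The selection rule itself is a small function. This sketch uses made-up curves (not measurements from any run) to show the case that matters: the pool number keeps climbing while the held-out number peaks early. `select_checkpoint` is a hypothetical name, and breaking ties toward the earlier step is an assumption.

```python
def select_checkpoint(heldout_scores):
    """Pick the step with the best held-out score; the pool is never consulted.

    `heldout_scores` maps step -> held-out score. Ties go to the earlier step.
    """
    if not heldout_scores:
        raise ValueError("no held-out gate has run yet")
    best_step = max(heldout_scores, key=lambda step: (heldout_scores[step], -step))
    return {
        "step": best_step,
        "score": heldout_scores[best_step],
        "selected_on": "heldout",
    }


# Illustrative curves only: the pool climbs monotonically, held-out peaks at 20.
pool_curve = {10: 0.40, 20: 0.62, 30: 0.80, 40: 0.875}
heldout_curve = {10: 0.30, 20: 0.55, 30: 0.41, 40: 0.06}

choice = select_checkpoint(heldout_curve)
pool_best = max(pool_curve, key=pool_curve.get)   # what a pool-based pick would ship
```

Here `choice["step"]` is 20 while `pool_best` is 40: selecting on the pool would ship the checkpoint with the worst held-out score.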
The run’s record is a LineageSnapshot — the same fieldkit.lineage card the rest of the arc uses, one Trial logged per step with its FailureLabel, plus a held-out-gate trial per eval. No new store, no new schema. That snapshot is what a future “living model” product renders as a public delta chart: not a marketing curve, but the actual held-out trajectory with the selected step marked.
Tradeoffs, gotchas, and the honest gaps
The sharpest gotcha is the one above: the metric you’d naturally trust is the one that lies. Everything else is downstream of taking that seriously.
The second is that RLVR is not a corpus-quality lever, and it’s easy to mistake it for one. The same T²PO run plateaued at roughly 47.7% per-assertion accuracy against an estimated ceiling near 80% (the level that noise in the synthetic data allows), and spending more wall-clock past the held-out peak bought nothing: the uncertainty-guided variant ran 18.5 hours to GRPO’s 8.5 and landed worse. When the held-out curve flattens well below the ceiling, that’s a signal about your data, not your steps. The move is to improve the corpus (better synthesis, curation, a cleaner gold set), not to crank the step count.
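One way to act on a flattening curve is a patience-based stop on the held-out gate. The article does not say the engine does this; `should_stop`, `patience`, and `min_delta` are hypothetical, and the scores below are illustrative.

```python
def should_stop(heldout_scores, patience=2, min_delta=0.01):
    """True once the last `patience` gates fail to beat the earlier best.

    `heldout_scores` is the list of held-out scores in gate order.
    """
    if len(heldout_scores) <= patience:
        return False                      # not enough history to judge
    best_before = max(heldout_scores[:-patience])
    recent_best = max(heldout_scores[-patience:])
    return recent_best < best_before + min_delta


still_improving = should_stop([0.30, 0.44, 0.477])
plateaued = should_stop([0.30, 0.44, 0.477, 0.476, 0.470])
```

When `plateaued` is true and the best score is still far below the ceiling, extra steps are the wrong spend; the remaining headroom is in the corpus.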
The third is the runtime tax, and it’s the one optimization worth naming. Of a ~15-minute step, the rollouts take ~13 minutes, the trainer step takes ~22 seconds, and the vLLM kill-and-restart to load the new adapter takes ~3.5 minutes. That restart is ~25% of wall-clock and the only eliminable quarter — the top fast-follow is hot-LoRA-swap in vLLM (/v1/load_lora_adapter) so the lane never restarts. Note what this means: the trainer is not the bottleneck. Speeding up the 22-second step is a rounding error; killing the 3.5-minute restart is the win.
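The budget can be checked with the three figures as quoted. Taken at face value they sum to a little under 17 minutes, slightly more than the rounded 15-minute step, so the restart share lands between about 21% (of the sum) and 23% (of 15 minutes), in the neighborhood of the quarter cited above. The dictionary keys are labels chosen for this sketch.

```python
step_seconds = {
    "rollouts": 13 * 60,        # ~13 minutes of sampling
    "trainer": 22,              # ~22 seconds of gradient work
    "vllm_restart": 3.5 * 60,   # ~3.5 minutes to reload the adapter
}

total = sum(step_seconds.values())
shares = {name: secs / total for name, secs in step_seconds.items()}

restart_share_of_sum = shares["vllm_restart"]
restart_share_of_15min = step_seconds["vllm_restart"] / (15 * 60)
trainer_share = shares["trainer"]

for name, share in shares.items():
    print(f"{name:>13}: {share:6.1%}")
```

The trainer comes out near 2% of the step either way, which is the point: removing the restart saves roughly ten times what removing the entire trainer step would.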
And the fourth is the envelope, which the whole design respects. The Spark holds one serving lane in 128 GB; the training run is ~50 GiB of base weights plus ~28 GiB trainer plus ~20 GiB vLLM — about 98 of 128, a ~30 GiB margin — which means trainer resident, one vLLM lane, no second model. You don’t stack a critic and a policy and a judge; you take turns. That constraint isn’t a limitation to apologize for — it’s why GRPO (no critic) and a verifier-reward (no reward model) were the right choices and not just convenient ones. The algorithm and the hardware agree.
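The envelope is simple addition, which makes it easy to test a what-if. The sketch uses the article’s approximate figures and treats the 128 as the same unit as the components; the 50 GiB "second model" is a hypothetical, sized like the base weights purely for illustration.

```python
BUDGET_GIB = 128

resident = {
    "base_weights": 50,
    "trainer": 28,
    "vllm_lane": 20,
}

used = sum(resident.values())    # 98
margin = BUDGET_GIB - used       # 30


def fits(extra_gib):
    """Would another resident component of `extra_gib` still fit?"""
    return used + extra_gib <= BUDGET_GIB


second_model_fits = fits(50)     # e.g. a critic or judge sized like the base
```

`second_model_fits` is false, which is the "take turns" constraint in numeric form: anything the size of a second model has to replace something resident, not join it.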
What this unlocks
Three things become possible the week the GPU backend lands, none of which need anything past one Spark and a corpus you trust.
First: take a domain you have a verifier for — structured extraction, a compliance checklist, a graded rubric — and reinforcement-fine-tune a 7–8B LoRA toward it overnight, with your verifier as the reward and zero annotation budget. Wrap the scorer in a RewardAdapter, set corpus_min honestly (≥100 rows; 42 mode-collapsed), point the gate at a frozen split, and let the budget-governed drain run it while you sleep. The output is a model measurably better on your metric, selected on held-out, with a lineage card that proves it.
Second: the living model. A model re-RLVR’d on a cadence against a bench that keeps freshening, sold not on a static benchmark but on the public delta chart from its LineageSnapshot — a model whose whole pitch is that it keeps getting measurably better, with the receipts. That’s the first §5 product launch this arc is staking now and shipping when the loop runs end-to-end.
Third: the recursion the book has been pointing at. When an rl_run lifts a bench past a threshold, the same machinery that drafts these articles can auto-scaffold the write-up — the engine’s output becoming the next iteration’s input not just for training, but for publishing. That’s the loop the meta-program opener drew with a dashed arc; this engine is what finally makes the arc a wire.
Closing
The Machine-that-Builds-Machines arc opened by naming three beats and admitting the engine was the one with nowhere to land. It has a home now. The pane watches, the hands dispatch, and the engine improves a model from a reward function you wrote, audited, and own — selected on a held-out split so the improvement is real and not just a number that climbed. The Spark is the first machine where one person holds the entire loop, including the reward, and can read every line of the thing that teaches the model before trusting it with a gradient. That’s the edge-builder’s version of closed-loop RL: not renting the judgment, owning it.
What’s left is honest and small: vendor the one pinned-vLLM backend that turns the injected seams into a run, and let the held-out curve speak. The feasibility is proven, the corrections are baked in, the control plane is waiting. The machine that builds machines can finally improve the machine it built — and on a desk, you can watch every loop of it close.