What it is
The reward half of the closed-loop RLVR engine
(_SPECS/rlvr-loop-v1.md §5, Phase 3 / Bet 1 of the “machine that builds
machines” roadmap). It realizes the disruptive move the whole roadmap points at:
the eval harness is the reward model. fieldkit.eval already ships seven
deterministic verifiers — patent_claim_validity (7-dim judge),
office_action_argument (4-dim judge), prior_art_relevance (Spearman ρ),
irac_structure (4-checklist regex), mcq_letter, numeric_match,
is_refusal — plus exact_match / contains. Those are exactly the
well-formed verifiers GRPO/RLVR scores rollouts against, with no learned
reward model. fieldkit.reward is a thin adapter over them; it adds no
scoring logic of its own (RV-2).
It is the scorer that fieldkit.rl.RLLoop consumes step by step. Where
fieldkit.eval measures and fieldkit.lineage records, this module
rewards — the one new translation layer between an offline verifier and an
online RL gradient.
The reward is a tuple, not a scalar (RV-3)
Binary keep/revert mode-collapses: the project saw 5/5 train keeps on one
knob and 0/8 held-out generalization from a 42-row corpus. So the reward is a
(success, failure_class, auxiliary) tuple. The categorical failure_class
densifies the signal — and it is already modeled as
fieldkit.lineage.FailureLabel (a 10-class string enum with an
is_informational predicate). The reward reuses FailureLabel; it does not
invent an enum. That means the loop’s per-step Trial label is the reward
signal — one vocabulary, two consumers (RV-7).
The surfaces
Reward— the(success, failure_class, auxiliary)tuple.successis the pass bit;failure_classis aFailureLabel(KEEPon a pass,CRASHwhen the verifier raised,DISCARDfor a scored miss);auxiliarycarries the densifying signal — the raw verifierscore(partial credit for the 0.25-granularityirac_structure, the ρ forprior_art_relevance), arefusalflag, thescorername..scalaris the valuegroup_advantagestandardizes (the partial-credit score when present, else1.0/0.0);.is_informationalmirrors the label;.to_dict()is the serializable view.RewardAdapter— wraps onefieldkit.evalscorer (Callable[[predicted, expected, **kw], float]) so a rollout becomes aReward. Construct with theverifier, apass_threshold(the score at/above which the rollout is a keep —1.0for binary scorers, lower for graded ones), andscorer_kwargs(forwarded to the scorer, filtered to the kwargs it accepts — a judge forpatent_claim_validity,rel_tolerancefornumeric_match)..score(rollout)returns the reward;.score_group(rollouts)scores a whole rollout bundle. A verifier that raises is caught and returns aCRASH-labelled reward, so one bad rollout never sinks the group.group_advantage(rewards, *, normalize_std=True, eps=1e-8)— GRPO’s value-network-free baseline: standardize the group’s scalar rewards by subtracting the group mean (the baseline a critic would estimate) and, whennormalize_std, dividing by the group spread (guarded byeps). A degenerate group (every rollout scoring the same) yields an all-zero advantage vector — no spurious gradient. Withnormalize_std=Falsethe centered advantages are returned without the spread division.RewardError— raised on a caller mistake (a non-callable verifier, a rollout with noprediction). A verifier that raises mid-score is not this error — it becomes aCRASH-labelledReward.
What it is not
It does not ship a new scorer (it wraps the built fieldkit.eval ones, RV-2),
train anything (that’s fieldkit.rl), or call an LLM of its own — a judge-backed
verifier brings its own fieldkit.eval.Judge; this module just calls it.
Deterministic Python, pure stdlib + fieldkit.eval / fieldkit.lineage — no
torch, no numpy (invariant #4).