
T²PO on Spark — When the Training Pool Says 28/32 and Held-out Says 9/158

T²PO's two deltas on the Phase 6 ClawGym harness: mean turns 5.00 → 4.61, task_complete 154/158, but the per-assertion ceiling stays flat at 47.7%. The strongest training-side step (45) is the worst held-out checkpoint — pool saturation lies on a single Spark.

Series: Frontier Scout
Terms in this piece
  • T²PO: The Token-and-Turn Policy Optimization paper (arXiv 2605.02178, ICML 2026 spotlight) layers two uncertainty-guided controls on top of GRPO. Token-level: cap each assistant turn at num_think_tokens to bound the chain-of-thought budget. Turn-level: Test-time Distillation Sampling (TDS) — measure per-token entropy of the candidate turn, resample if entropy disagrees with the prior turn by an eta_threshold margin, up to max_try retries. The thesis is that uncertainty-aware exploration finds a better policy per gradient step than vanilla GRPO does at the same wall budget.
  • Test-time Distillation Sampling: TDS is T²PO's turn-level mechanism for resampling under controlled uncertainty. After vLLM generates a candidate turn, the driver computes mean per-token entropy from the top-20 logprobs and compares it to the prior turn's entropy. Turns where the entropy delta is small but non-zero — |ΔH| ∈ (0, eta_threshold) — are regenerated, on the theory that those are the turns where the policy is least sure between two strategies and resampling produces useful exploration. Turns with zero or large entropy deltas are accepted as-is.
  • GiGPO step advantages: Group-in-Group Policy Optimization extends GRPO's single trajectory-level advantage with a second per-turn advantage. For K rollouts of the same task, GiGPO groups at the same turn-index across the K and computes a turn-N advantage from per-turn signals (here: did the bash command succeed). Each assistant token's gradient weight becomes α·A_traj + β·A_step[turn_id]. ClawGym's continuous shell observations don't admit upstream's anchor-state matching, so this run uses the simpler same-turn-index grouping.

The Phase 6 GRPO article ended with a clean number — 34 steps, +97.5 pp on task_complete, mean turns collapsed 12 → 5. The pool converged at step 35 because every K=4 group on the 8-task batch saturated at SUCCESS, the gradient went to zero, and the loop exited the way it was supposed to. The next question was whether two algorithmic additions on top of GRPO — a token-level chain-of-thought cap and a turn-level uncertainty-resample — could push the per-step rollout count down further by not generating turns the policy had nothing left to learn from. The T²PO paper (ICML 2026) names the additions and reports the gains on cluster-scale runs.

This piece reproduces those two deltas on the same Phase 6 ClawGym harness — same model (Qwen 2.5 7B + LoRA), same SFT init, same 158-task held-out eval — and the headline does not read the way I expected. Mean turns drops 5.00 → 4.61. task_complete lands at 154/158, exact parity with Phase 6 GRPO. Per-assertion stays put at 47.7%, essentially where it sat at the step-25 eval (47.6%) and identical to where Phase 6 GRPO landed at step 34. The lift T²PO is reported to deliver did not materialize on a single Spark; what showed up instead is a set of findings about Spark-scale RL itself.

The most useful one — and the load-bearing claim of this article — is that the training-side pool-pass metric does not predict held-out generalization at this scale. Step 45 had the run’s strongest training-side pool task_pass (28 of 32, 87.5%) and the run’s weakest held-out task_pass (9 of 158, 5.7%). The strongest step on the training pool was the worst step on held-out. A training pool of 8 tasks per step at K=4 samples a distribution different enough from the held-out 158 that pool saturation tells you almost nothing about the adapter you’d ship. That’s a Spark-scale RL finding, not a T²PO finding, and it’s the part of this run worth a deep dive.

Why this matters for a personal AI builder

There’s a version of “RL on a personal box” where the training-side metric and the held-out metric move together, the loop terminates when the training metric saturates, and you ship the last adapter the loop saved. That version is what a cluster does: hundreds of parallel rollout workers, thousands of tasks per gradient step, training-side variance close enough to the eval distribution that the loss curve and the eval curve look like the same shape on different axes. On that machine, the loop’s natural endpoint is the right adapter to keep.

On a Spark, with 8 tasks per step and K=4 rollouts each, the training pool is a 32-rollout sample of the policy’s current on-distribution behavior — and that behavior is shaped by the same gradient updates the metric is supposed to be measuring. Pool saturation can mean “the policy solves this task family”; it can also mean “the policy has memorized the 8 tasks this step happened to sample.” When the pool is small relative to the held-out set the article actually scores against, the second story dominates. The right adapter to ship is the one that wins on held-out, not the one the loop’s pool-converge terminator stops on. This article is what it costs to learn that with one machine, a five-day-old paper, and a willingness to let the box run overnight.

Architectural context — what T²PO adds to GRPO, in one turn

The Phase 6 GRPO loop is a kill-and-restart cycle: sample 8 tasks, run K=4 rollouts each at temperature 0.8, compute group-relative advantages, REINFORCE-with-KL on the bundle, restart vLLM with the new adapter. T²PO leaves that outer loop intact and changes what happens inside a rollout’s individual turn. Two pieces, both running between when vLLM emits a candidate assistant turn and when the rollout commits it.
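
For orientation, here is a minimal sketch of the group-relative advantage step the outer loop runs each cycle. The function name and dict keys are assumptions of mine, not the repo's trainer API; the point is only the baselining that makes a saturated K=4 group go silent.

```python
# Illustrative sketch only: names and data shapes are assumptions,
# not the repo's actual trainer code.
from collections import defaultdict
import statistics

def group_relative_advantages(rollouts):
    """GRPO-style group-relative advantage: baseline each rollout's terminal
    reward against the mean of its K-rollout task group and scale by the
    group's std. A saturated group (all rewards identical) has zero std and
    contributes zero advantage, the mute condition discussed later."""
    by_task = defaultdict(list)
    for r in rollouts:                      # r = {"task_id": ..., "reward": ...}
        by_task[r["task_id"]].append(r["reward"])
    advantages = []
    for r in rollouts:
        group = by_task[r["task_id"]]
        mean = statistics.mean(group)
        std = statistics.pstdev(group)
        advantages.append(0.0 if std == 0 else (r["reward"] - mean) / std)
    return advantages
```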

[Diagram: the per-turn T²PO pipeline. Turn-N input (prompt + history) → vLLM generate (max_tokens=450, logprobs, top_logprobs=20) → TDS check |H_t − H_{t−1}| ∈ (0, 0.3) → accept or resample (max_try=2) → accepted turn. Mean ~6.4 regens/rollout, ~33 s/turn, +33% wall.]
The two T²PO additions sit between vLLM emitting a candidate turn and the rollout committing it. Token-level: cap the candidate at 450 think tokens. Turn-level: if entropy disagrees with the prior turn by under 0.3, regenerate — up to twice. The +33% wall per rollout is the cost; the question is whether the policy reaches a better minimum because of it.

The token-level cap is one config knob. num_think_tokens=450 flows through to vLLM as the max_tokens on every generate call, and that’s it — the cap is applied on every call, whether or not the turn would have run that long, with no adaptive logic behind it. The turn-level addition is more interesting. Each generated turn carries token-by-token logprobs back from vLLM (logprobs=True, top_logprobs=20), the rollout driver computes mean per-token entropy from the top-20 distribution, and TDS compares it to the prior turn’s entropy. If |H_t − H_{t-1}| lands in (0, 0.3) — small but non-zero, the regime where the policy is “between” two strategies — the turn is regenerated, up to max_try=2 times. The implementation is roughly 120 LOC of glue around vLLM’s existing OpenAI-shaped completions endpoint.
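
A minimal sketch of that turn-level check, assuming the top-20 logprobs come back per token from vLLM. The helper names are mine, not the driver's; the commented caller stands in for the actual regenerate loop.

```python
# Sketch of the TDS accept/resample decision; helper names are assumptions,
# not the rollout driver's actual functions.
import math

def mean_turn_entropy(per_token_top_logprobs):
    """Mean per-token entropy for a candidate turn, computed from vLLM's
    returned top-k logprobs (logprobs=True, top_logprobs=20). The top-20
    mass is renormalized, so this approximates the full-vocab entropy."""
    per_token = []
    for top in per_token_top_logprobs:       # one list of ~20 logprobs per token
        probs = [math.exp(lp) for lp in top]
        z = sum(probs)
        probs = [p / z for p in probs]
        per_token.append(-sum(p * math.log(p) for p in probs))
    return sum(per_token) / len(per_token)

def tds_should_resample(h_curr, h_prev, eta_threshold=0.3):
    """Resample only when the entropy delta is small but non-zero:
    |ΔH| ∈ (0, eta_threshold), the 'between two strategies' regime."""
    delta = abs(h_curr - h_prev)
    return 0.0 < delta < eta_threshold

# Caller side, sketched (generate_candidate stands in for the vLLM call):
#   for attempt in range(max_try + 1):        # max_try = 2
#       turn = generate_candidate(messages, max_tokens=450)
#       h_curr = mean_turn_entropy(turn["top_logprobs"])
#       if not tds_should_resample(h_curr, h_prev):
#           break
```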

The third piece — and the one that requires the trainer to know about T²PO — is GiGPO step-level credit assignment. GRPO computes one advantage per rollout from the trajectory’s terminal reward; GiGPO additionally assigns a per-turn advantage based on whether each turn’s bash command succeeded (exit_code=0 ∧ ¬parse_error). The per-token policy loss weights each assistant token by α·A_traj + β·A_step[turn_id], where the trainer flag --gigpo-step-w 1.0 enables β = 1.0 (β = 0 reverts to vanilla GRPO).
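
A sketch of both signals under the same-turn-index grouping described above. Function names and data shapes are assumptions, not grpo_train.py's actual code.

```python
# Sketch of GiGPO's step-level credit assignment with same-turn-index grouping.

def step_advantages(per_turn_success):
    """per_turn_success: for each of the K rollouts of one task, a list of
    per-turn outcomes (1 if that turn's bash command succeeded, i.e.
    exit_code == 0 and no parse error, else 0). Each turn is baselined
    against the mean at the same turn index across the K rollouts."""
    K = len(per_turn_success)
    max_turns = max(len(turns) for turns in per_turn_success)
    advs = [[0.0] * len(turns) for turns in per_turn_success]
    for t in range(max_turns):
        vals = [turns[t] for turns in per_turn_success if t < len(turns)]
        mean = sum(vals) / len(vals)
        for k in range(K):
            if t < len(per_turn_success[k]):
                advs[k][t] = per_turn_success[k][t] - mean
    return advs

def token_weight(a_traj, a_step, turn_id, alpha=1.0, beta=1.0):
    """Gradient weight for one assistant token: alpha * A_traj +
    beta * A_step[turn_id]. beta = 0 reverts to vanilla GRPO;
    --gigpo-step-w 1.0 corresponds to beta = 1.0."""
    return alpha * a_traj + beta * a_step[turn_id]
```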

The journey — 50 steps, three evals, and a flat ceiling

The kickoff came in 9 seconds per rollout faster than Phase 6’s (the smoke test validated end-to-end on 2 tasks × K=4 in 266 s wall). The full run took 18.5 hours over 50 gradient steps with two evals at step 25 and step 50, plus a third post-hoc eval against step 45’s adapter when the per-step CSV showed step 45 had the run’s strongest training-side metrics. Mean TDS regenerations per rollout: 6.39 — TDS fired aggressively, as the smoke had warned. Total trainer wall: 51.2 minutes; the rest of the 18.5 hours was rollouts (50 × ~17 minutes) plus three evals (~36 minutes each). KL stayed small the whole run (max 0.0034) and the weight-delta L2 held remarkably constant at ~0.0625, which says the loop was making consistent-magnitude updates without cumulative drift.

The training-side trajectory is the one I want to show first because it’s the part that did improve cleanly:

| step | groups used | task_pass on pool | TC on pool | mean turns |
|------|-------------|-------------------|------------|------------|
| 1    | 7/8         | 20/32             | 23/32      | 7.31       |
| 11   | 7/8         | 8/32              | 30/32      | 5.03       |
| 25   | 4/8         | 12/32             | 32/32      | 3.78       |
| 45   | 1/8         | 28/32             | 32/32      | 3.66       |
| 50   | 3/8         | 4/32              | 29/32      | 4.53       |

Mean turns dropped from 7.3 to under 4 by step 23 and stayed there. task_complete first hit 100% at step 25 and held 32/32 thirteen times across the next 24 steps. By step 28, only one of eight sampled groups was producing usable advantage variance — the rest had K=4 rollouts all returning identical rewards, GRPO’s natural mute condition. Step 45 was the run’s standout step on every metric: 28 of 32 rollouts passed their tasks, every rollout stopped via task_complete, and mean turns sat at 3.66. The pool-converge terminator didn’t fire because the loop’s threshold is all groups producing zero advantage — usually one stayed productive, and the loop ran the full 50.
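
For concreteness, the terminator's condition as a minimal sketch, assuming the loop can see the K rewards for each sampled group. The function name is illustrative, not the loop script's actual check.

```python
def pool_converged(group_rewards):
    """Sketch of the pool-converge condition: terminate only when every
    sampled group's K rollouts returned identical rewards, i.e. zero
    advantage variance in all 8 groups. One productive group keeps the
    loop alive, which is why this run went the full 50 steps."""
    return all(len(set(rewards)) == 1 for rewards in group_rewards)
```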

Then I ran the held-out eval at step 25 and step 50, plus the post-hoc step 45:

| step | task_pass | per-assertion   | mean turns | TC      | Δ task_pass vs P6 GRPO@34 |
|------|-----------|-----------------|------------|---------|---------------------------|
| @25  | 12/158    | 47.6% (371/780) | 5.37       | 148/158 | −0.6 pp                   |
| @45  | 9/158     | 47.8% (373/780) | 4.87       | 150/158 | −2.5 pp                   |
| @50  | 11/158    | 47.7% (372/780) | 4.61       | 154/158 | −1.3 pp                   |
[Chart: task_pass (% of sampled tasks) across the 50-step, 18.5 h T²PO run, pool (n=32) vs. held-out (n=158). Pool peaks at 87.5% at step 45; held-out sits at ~5–8% across all three checkpoints, bottoming at 5.7% at the same step, an 81.8 pp gap.]
Pool peak (87.5% at step 45) coincides with held-out trough (5.7% at step 45) — the same adapter looks like the run's best on a 32-rollout pool and its worst on the 158-task held-out set.

The rightmost column is the one to pay attention to: T²PO trails Phase 6 GRPO@34 on task pass at every checkpoint — by 0.6, 2.5, and 1.3 percentage points. The per-assertion column is the load-bearing one for the negative result: it sits at 47.6 / 47.8 / 47.7%, three flat numbers spanning 25 gradient steps. Whatever T²PO is buying at the per-token weight or the entropy resample, it is not lifting the per-assertion ceiling. Mean turns is the only metric that improves monotonically across the three evals (5.37 → 4.87 → 4.61), and the gap to Phase 6 GRPO closes from +0.37 turns at the step-25 eval to −0.39 turns at the step-50 eval. The model is getting genuinely faster as training progresses; it is not getting more correct.

Verification — what success looks like on a Spark RL run

The thing the loop was supposed to accomplish, it accomplished. Compared to Phase 5 SFT — the article’s actual baseline, since it’s what the SFT-init adapter started from — the held-out 158 numbers move the way RL on top of SFT should make them move: task pass 10 → 11 (+0.6 pp), per-assertion 46.8 → 47.7% (+0.9 pp), mean turns 12.0 → 4.61 (−61%), task_complete 0/158 → 154/158 (+97.5 pp). Every metric is in the right direction. The shape of “RL unlearned the never-stop failure mode SFT taught” reproduces exactly. The Phase 6 number is the ceiling, not the floor.

The loop’s mechanical success looks like a clean exit log (=== loop complete in 66692s ===), a per-step CSV that fills out monotonically, a weight-delta L2 that holds steady step over step, and three eval-step directories whose comparison.json files share the same shape and units. It looks like vLLM coming back up in 190–220 seconds at every step boundary and never failing the 360-second cold-start timeout. It looks like memory falling to 116 GiB free between trainer and rollout phases, climbing to ~28 GiB used during trainer steps with vLLM down, and never tripping the OOM landmine the Spark’s unified memory has caught me on before. None of those numbers move the held-out per-assertion percentage, but they’re what makes the experiment a real measurement instead of a crash.

Tradeoffs, gotchas, surprises

The biggest surprise is the one named already: pool task_pass and held-out task_pass disagreed by 81.8 percentage points at step 45. I went into the run thinking the natural endpoint was wherever the loop’s pool-converge terminator decided; I came out thinking the loop should periodically eval against held-out, and that trajectory is what you steer on. The cost of running an eval against held-out is real (~36 minutes per eval, three evals burn ~1.8 hours) but trivial against the run’s 18.5-hour total. Phase 7 of this arc would set --eval-every 10 instead of 25 and treat the held-out eval curve as the schedule’s ground truth.

The second surprise is the per-assertion ceiling. I expected T²PO’s entropy-aware resample to find higher-quality candidate turns at marginal-uncertainty boundaries — turns that the model would have committed to with vanilla GRPO but where a regenerate-and-recheck would land on a more-correct command. The mean TDS regen rate of 6.39/rollout says it did fire aggressively. The flat per-assertion numbers say the regenerated turns are not, in aggregate, more correct than the original ones — they’re roughly the same quality, just averaged over more samples. That can mean the eta_threshold of 0.3 is too generous (most turns fall in (0, 0.3), so most turns are getting resampled and the resample is closer to a temperature-perturbation than a directed retry), or it can mean the underlying policy’s per-turn entropy is not actually correlated with per-turn correctness on this benchmark. Both are testable in a Phase 8.
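
One cheap probe of the first hypothesis, sketched under the assumption that the per-turn |ΔH| values can be recovered from the run's rollout logs; the helper is hypothetical, not something the run exported.

```python
def tds_fire_rate(entropy_deltas, eta_threshold):
    """Fraction of turns whose |ΔH| falls in (0, eta_threshold) and would
    therefore trigger a TDS resample. entropy_deltas is assumed to be a flat
    list of |ΔH| values pulled from the run's rollout logs."""
    fired = sum(1 for d in entropy_deltas if 0.0 < d < eta_threshold)
    return fired / len(entropy_deltas)

# Comparing tds_fire_rate(deltas, 0.3) against tds_fire_rate(deltas, 0.1)
# shows how much of the 6.39 regens/rollout a tighter threshold would shed
# before committing to a Phase 8 run.
```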

The third surprise is the wall-time accounting. Phase 6 GRPO ran 34 steps in 8.5 hours; T²PO ran 50 steps in 18.5 hours. Per-step wall went from 15 minutes to 22 minutes — a +47% step cost. The arithmetic works out: +33% per rollout from TDS regen overhead, times roughly 1.5× more steps, lands at ~2× total wall, which matches. What surprised me is that the held-out per-assertion numbers don’t cash that wall in for accuracy. I paid 10 hours for trajectories that don’t move the metric I care about.

What this unlocks

The negative result is itself a thing you can build on. First: a held-out-driven schedule for any RL-on-Spark loop. Replace the loop’s pool-converge terminator with a held-out eval every 10 steps and a “best held-out so far” adapter pointer. The third eval (step 45) cost 36 minutes and would have changed which adapter I shipped if it had run inside the loop instead of after it. Two new lines in t2po_loop.sh’s eval cadence buy a different stopping rule.
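
A minimal sketch of what that pointer could look like. The helper and file names are hypothetical, not anything in t2po_loop.sh today.

```python
import json
import shutil
from pathlib import Path

def update_best_adapter(step, heldout_task_pass, adapter_dir, run_dir):
    """Hypothetical 'best held-out so far' pointer. After each held-out eval,
    record the winning step in best_heldout.json and snapshot its adapter to
    run_dir/best_adapter, so the loop ships the held-out winner instead of
    whatever adapter the terminator stopped on."""
    run_dir = Path(run_dir)
    state_file = run_dir / "best_heldout.json"
    best = json.loads(state_file.read_text()) if state_file.exists() else {"task_pass": -1.0}
    if heldout_task_pass > best["task_pass"]:
        best = {"step": step, "task_pass": heldout_task_pass}
        state_file.write_text(json.dumps(best))
        dest = run_dir / "best_adapter"
        if dest.exists():
            shutil.rmtree(dest)
        shutil.copytree(adapter_dir, dest)
    return best
```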

Second: an extracted post-hoc-eval driver. The eval_step.sh script that ran the step-45 eval is now in the repo at articles/t2po-uncertainty-guided-rl-on-spark/scripts/, parametric over step number and pool path, reusable for any T²PO or GRPO run. If a future loop does converge on the held-out trajectory, the same script confirms the choice. If it doesn’t, the same script finds the actual peak.

Third: a fieldkit primitive that’s now ready to graduate. T²PO’s TDS regenerate path needed exactly the per-turn message reconstruction that Phase 6 GRPO does inline at grpo_train.py:reconstruct_messages(). Two consuming use cases are what fieldkit.agents.replay_messages_from_trajectory was waiting on; the next fieldkit cut promotes it from [Unreleased] to v0.3.

Closing

The Phase 6 GRPO article ended on a clean +97.5 pp claim. This one ends on a flat 47.7%. That’s not a worse result; it’s a different finding. Phase 6 was about the algorithm doing what the algorithm promises. This piece is about the loop’s metric not being the metric you should be optimizing on a single-Spark RL run. The held-out eval is what generalizes; the pool task_pass is what’s most recently been trained on. The 32-rollout sample at K=4 with 8 tasks per step is too small a window into the held-out 158 to trust as a stopping rule, and the gap is large enough — 81.8 percentage points at the run’s peak — to flip which adapter you ship.

What this one machine lets one person do is run three of these experiments a week and learn what to measure. Not what’s the best algorithm — that’s what cluster runs are for. What’s the right metric to terminate the loop on, what’s the right eval cadence, what’s the right pool size to make pool saturation actually mean something. Those questions don’t have published answers because cluster-scale runs don’t have to ask them. The Spark does. Next up: the held-out-driven schedule, with --eval-every 10 and a “best so far” adapter pointer, and the question of whether a smaller eta_threshold (say, 0.1) would convert the TDS regen overhead into actual per-assertion lift.