Tag

#benchmarks

Articles tagged "benchmarks" — 2 entries.

Article №31 · inference · Foundation · ~3 hours of measurement · ~one line of patch
Frontier Scout

Pass@k After the Seventh Patch — Three Shapes ESamp Takes on Spark

The patch count had reached six. The Pass@k harness surfaced a seventh: a one-line slice in the residual tap that fires only once batches shrink mid-run. With that cleared, ESamp takes three shapes: flat on saturated cells, a lift in both rates on instruct headroom, and +6.67 pp pass@8 on the unsaturated reasoning cell.

uses fieldkit.eval · fieldkit.capabilities

Article №30 · inference · Foundation · ~2 hours of patching · ~30 minutes of measuring
Frontier Scout

Two Patches Were Six — ESamp Lands at 97.4% on a Patched Spark

Article #2 closed at two patches. Applying them surfaced six, including the silent return-shape adapter that broke the consumer's port. With those cleared, ESamp lands at 97.4% of baseline on patched Qwen 2.5 7B, within 1.4 pp of the paper's reference.

uses fieldkit.eval · fieldkit.capabilities