Tag
#data-prep
Articles tagged "data-prep" — 2 entries.
Machine that Builds Machines
The Trainer Was Fine, the Corpus Wasn't: Three Misdiagnoses on a Patent-Specialist Fine-Tune
Five thousand rows of synthetic patent reasoning, two clean 131-minute LoRA trains, three rounds of confident diagnosis — and none of them found the bug. The bug was the corpus all along. A field report on the cheapest mistake to make on the Spark.
Machine that Builds Machines
The Data-Path Envelope — When Real Tokens Beat Random Tokens at Pretrain Throughput
Curator-cleaned wikitext-103 (109M tokens, 417 MiB packed) feeds the same 354M GPT pretrain loop from A2. Across the eight configs swept, data-path overhead is 0.01–0.04%. New peak: 14,980 tok/s, slightly above A2's random-token ceiling.