Tag

#pretrain

Articles tagged "pretrain" — 4 entries.

Article №24 training Foundation ~30 minute read · math + economics, no GPU required
Looking Beyond Spark

Derisking the Cloud Pretrain — How a $5K Spark Saves $50K on H100 Rentals

The Spark is too small for a serious pretrain — but it's the right size for the recipe-search that precedes one. Cull 100 candidate architectures down to 3 on one Spark for ~$1 of electricity, then book the cloud node knowing what to train. The expected savings per campaign run into the thousands.
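A rough sketch of the arithmetic behind that claim. The H100 rental rate and per-candidate screening time below are illustrative assumptions, not figures from the article; the candidate count and ~$1 of electricity come from the blurb.

```python
# Back-of-envelope arithmetic for the recipe-search economics above.
# Rental rate and screening time per candidate are assumptions for illustration.

H100_NODE_PER_HOUR = 25.0   # assumed cloud rental rate, USD/hour
SCREEN_HOURS_EACH = 2.0     # assumed short screening run per candidate
CANDIDATES = 100            # candidate architectures to cull (from the blurb)
SPARK_ELECTRICITY = 1.0     # ~USD of electricity to screen them locally (from the blurb)

cloud_screening = CANDIDATES * SCREEN_HOURS_EACH * H100_NODE_PER_HOUR
local_screening = SPARK_ELECTRICITY
savings_per_campaign = cloud_screening - local_screening

print(f"screening {CANDIDATES} candidates in the cloud: ${cloud_screening:,.0f}")
print(f"screening them on the Spark:             ${local_screening:,.0f}")
print(f"expected savings per campaign:           ${savings_per_campaign:,.0f}")
```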

Article №23 foundations Foundation ~15 minute read · no GPU required
Looking Beyond Spark

What the Agent Actually Built — Five Articles in Plain English, and Why You Probably Don't Want to Train From Scratch

Over one day and five technical articles, an unattended AI research loop took shape on a desk for $0.02 of electricity. The plain-English readout: what the agent built (not a usable model), what it changes for one person, and a four-tier roadmap from LoRA in minutes to from-scratch in weeks.

Article №20 training NeMo ~2 hours — 5 min for the corpus pull, 45 min for a derived container build, 2 min for the Curator pipeline + 40s tokenize, 3 min for the 8-config sweep, the rest is reading the numbers
Machine that Builds Machines

The Data-Path Envelope — When Real Tokens Beat Random Tokens at Pretrain Throughput

Curator-cleaned wikitext-103 (109M tokens, 417 MiB packed) feeding the same 354M GPT pretrain loop from A2. Eight configs swept; data-path overhead is 0.01–0.04% across all of them. New peak: 14,980 tok/s — slightly above A2's random-token ceiling.
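A minimal sketch of how an overhead figure in that range is read: compare mean step time on real, pre-tokenized batches against the same loop fed random tokens. The step times, batch size, and sequence length below are placeholders, not the article's measurements.

```python
# Sketch of a data-path overhead calculation: real-data step time vs. a
# random-token baseline. All numbers are placeholders, not the article's logs.

from statistics import mean

step_times_random_s = [1.0937, 1.0939, 1.0936, 1.0941]  # assumed random-token baseline
step_times_real_s   = [1.0940, 1.0942, 1.0939, 1.0944]  # assumed real-data run

baseline = mean(step_times_random_s)
with_data = mean(step_times_real_s)
overhead_pct = (with_data - baseline) / baseline * 100

tokens_per_step = 8 * 2048  # assumed micro-batch size x sequence length
throughput = tokens_per_step / with_data

print(f"data-path overhead: {overhead_pct:.3f}%")  # lands in the 0.01-0.04% band
print(f"throughput: {throughput:,.0f} tok/s")
```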

Article №18 training NeMo ~3 hours — 90 min for two container pulls (PyTorch 30 GB, NeMo Framework Megatron Backend 70 GB), 30 min for the matched scripts, 10 min for the two pretrain runs and analysis
Machine that Builds Machines

NeMo Framework on the Spark — What It Earns Over a Hand-Rolled train.py

Same 354M GPT, same 100 steps, same random tokens — once in a hand-rolled train.py against vanilla PyTorch, once via Megatron-Core inside the NeMo Framework container. Same hardware (GB10, 128 GB unified). The framework earns +5.8% throughput and 30% less GPU memory.
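The headline deltas are just percentage changes between two matched runs; a minimal sketch, with placeholder raw numbers chosen only to land near +5.8% and -30% rather than taken from the article's logs:

```python
# Percentage-change readout for two matched pretrain runs.
# Raw numbers are placeholders, not values from the article.

handrolled = {"tok_per_s": 9_500.0, "peak_mem_gib": 40.0}   # assumed hand-rolled train.py run
nemo       = {"tok_per_s": 10_050.0, "peak_mem_gib": 28.0}  # assumed NeMo / Megatron-Core run

throughput_gain = (nemo["tok_per_s"] / handrolled["tok_per_s"] - 1) * 100
memory_saving = (1 - nemo["peak_mem_gib"] / handrolled["peak_mem_gib"]) * 100

print(f"throughput: {throughput_gain:+.1f}%")     # ~ +5.8%
print(f"peak GPU memory: -{memory_saving:.0f}%")  # ~ -30%
```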