Tag

#training

Articles tagged "training" — 7 entries.

Article №24 training Foundation ~30 minute read · math + economics, no GPU required
Looking Beyond Spark

Derisking the Cloud Pretrain — How a $5K Spark Saves $50K on H100 Rentals

The Spark is too small for a serious pretrain — but it's the right size for the recipe-search that precedes one. Cull 100 candidate architectures down to 3 on one Spark for ~$1 of electricity, then book the cloud node knowing what to train. The expected savings per campaign run into the thousands.
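A back-of-envelope version of that economics, sketched below with hypothetical rates; the cloud price, probe length, and electricity cost are assumptions for illustration, not the article's figures.

```python
# Back-of-envelope economics of local recipe search. All rates are
# hypothetical placeholders -- the article's actual numbers may differ.
candidates      = 100    # architectures / recipes to screen
survivors       = 3      # configs worth a full cloud run
cloud_rate_hr   = 30.0   # assumed $/hr for a rented H100 node
hours_per_probe = 2.0    # assumed short cloud probe per candidate
spark_kwh_probe = 0.01   # assumed electricity per local probe, kWh
kwh_price       = 0.30   # assumed $/kWh

cloud_screening = candidates * hours_per_probe * cloud_rate_hr
local_screening = candidates * spark_kwh_probe * kwh_price
cloud_final     = survivors * hours_per_probe * cloud_rate_hr

print(f"screen every candidate in the cloud: ${cloud_screening:,.0f}")
print(f"screen locally, rent only for {survivors} survivors: "
      f"${local_screening + cloud_final:,.2f}")
print(f"expected saving: ${cloud_screening - cloud_final - local_screening:,.0f}")
```

Even with generous assumptions about how short a cloud probe can be, the saving is dominated by the rentals that never get booked.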

Article №23 foundations Foundation ~15 minute read · no GPU required
Looking Beyond Spark

What the Agent Actually Built — Five Articles in Plain English, and Why You Probably Don't Want to Train From Scratch

Five technical articles in one day documented building an unattended AI research loop on a desk for $0.02 of electricity. The plain-English readout: what the agent built (not a usable model), what it changes for one person, and a four-tier roadmap from LoRA in minutes to from-scratch in weeks.
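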

Article №22 agentic NeMo ~3 hours — 90 min to scaffold the loop, 73 min for the unattended run, the rest is reading the trajectory
Machine that Builds Machines

The Autoresearch Loop — 50 Iterations of an LLM Editing Its Own Trainer Overnight

NIM Llama 3.1 8B drives a structured-perturbation agent loop against a 354M GPT pretrain. 50 iterations, 73.4 min wall, 0.07 kWh of electricity. 8 keeps, 42 reverts, 0 rail blocks, 0 crashes. Best result: val_bpb 10.8534, +0.93% over baseline at d_model=768.
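A minimal sketch of the keep/revert structure such a loop takes. propose_edit() and run_pretrain() below are hypothetical stand-ins for the LLM-driven perturbation and the short pretrain probe, not the article's code.

```python
import copy
import random

# Sketch of a keep/revert autoresearch loop. The objective below is a fake
# stand-in for the real 354M-GPT validation bits-per-byte measurement.
def propose_edit(cfg):
    cfg = copy.deepcopy(cfg)
    key = random.choice(list(cfg))
    cfg[key] = cfg[key] * random.choice([0.5, 2.0])  # structured perturbation
    return cfg

def run_pretrain(cfg):
    # placeholder objective, NOT a real training run
    return 1.0 / cfg["d_model"] + abs(cfg["lr"] - 3e-4)

best_cfg = {"d_model": 768, "lr": 6e-4}   # illustrative starting point
best_bpb = run_pretrain(best_cfg)

for step in range(50):                     # 50 unattended iterations
    trial = propose_edit(best_cfg)
    bpb = run_pretrain(trial)
    if bpb < best_bpb:                     # keep only improvements,
        best_cfg, best_bpb = trial, bpb
    # otherwise revert: best_cfg is left untouched

print(best_cfg, best_bpb)
```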

Article №20 training NeMo ~2 hours — 5 min for the corpus pull, 45 min for a derived container build, 2 min for the Curator pipeline + 40s tokenize, 3 min for the 8-config sweep, the rest is reading the numbers
Machine that Builds Machines

The Data-Path Envelope — When Real Tokens Beat Random Tokens at Pretrain Throughput

Curator-cleaned wikitext-103 (109M tokens, 417 MiB packed) feeding the same 354M GPT pretrain loop from A2. Eight configs swept; data-path overhead is 0.01–0.04% across all of them. New peak: 14,980 tok/s — slightly above A2's random-token ceiling.
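One way such an overhead number can be derived, sketched below with placeholder hooks: time the identical step loop once fed by the real dataloader and once by pre-generated random tokens, then compare tokens/sec. step_fn and the batch iterables stand in for the article's actual harness.

```python
import time

import torch

# Sketch: throughput measurement for one data path. Call it twice, once with
# real batches and once with random-token batches, on the same step function.
def tokens_per_sec(step_fn, batches, micro_batch, seq_len):
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    n_batches = 0
    for batch in batches:
        step_fn(batch)
        n_batches += 1
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return n_batches * micro_batch * seq_len / (time.perf_counter() - t0)

def data_path_overhead_pct(random_tok_s, real_tok_s):
    # how much throughput the real data path gives up versus synthetic tokens
    return (random_tok_s - real_tok_s) / random_tok_s * 100.0
```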

Article №19 training NeMo ~30 min once the NeMo container is on disk — 7.4 min wall for the 16-config sweep, the rest is reading the numbers
Machine that Builds Machines

The GB10 Pretrain Envelope — Sweeping Batch, Sequence, and Precision on One Spark

Same 354M GPT, same training loop, swept across micro-batch (2,4,8,16), sequence length (1024,2048), and precision (bf16,fp8). 16 configurations, 30 steps each. Peak: 14,266 tokens/sec at batch=16, seq=1024, fp8 — 18% above the hand-rolled PyTorch baseline.
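The sweep grid itself is small enough to write down. In the sketch below, run_config() is a hypothetical stand-in for launching one 30-step run and returning tokens/sec; only the grid dimensions come from the article.

```python
from itertools import product

# The article's sweep grid: 4 micro-batch sizes x 2 sequence lengths
# x 2 precisions = 16 configurations, 30 steps each.
micro_batches = [2, 4, 8, 16]
seq_lens = [1024, 2048]
precisions = ["bf16", "fp8"]

def run_config(mb, seq, prec):
    raise NotImplementedError("launch the 30-step pretrain here")

results = {}
for mb, seq, prec in product(micro_batches, seq_lens, precisions):
    try:
        results[(mb, seq, prec)] = run_config(mb, seq, prec)
    except NotImplementedError:
        results[(mb, seq, prec)] = None  # placeholder until wired up

print(f"{len(results)} configurations enumerated")  # -> 16
```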

Article №18 training NeMo ~3 hours — 90 min for two container pulls (PyTorch 30 GB, NeMo Framework Megatron Backend 70 GB), 30 min for the matched scripts, 10 min for the two pretrain runs and analysis
Machine that Builds Machines

NeMo Framework on the Spark — What It Earns Over a Hand-Rolled train.py

Same 354M GPT, same 100 steps, same random tokens — once in a hand-rolled train.py against vanilla PyTorch, once via Megatron-Core inside the NeMo Framework container. Same hardware (GB10, 128 GB unified). The framework earns +5.8% throughput and uses 30% less GPU memory.
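A sketch of the measurement side of such a matched comparison, assuming plain PyTorch on the hand-rolled baseline; train_step and the batch list are placeholders, and the framework side would report the same two numbers from its own run so the comparison stays like-for-like.

```python
import time

import torch

# Sketch: the two numbers the comparison hinges on -- tokens/sec and peak
# GPU memory -- measured on the hand-rolled PyTorch side. Requires a CUDA
# device; train_step and batches are placeholders for the real loop.
def measure(train_step, batches, micro_batch, seq_len):
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for batch in batches:
        train_step(batch)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0
    tok_s = len(batches) * micro_batch * seq_len / elapsed
    peak_gib = torch.cuda.max_memory_allocated() / 2**30
    return tok_s, peak_gib
```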

Upcoming training NeMo Framework + Llama 3.1 8B planned ~2 days of wall-clock, one long weekend
Machine that Builds Machines

Continued Pre-training on a DGX Spark — NeMo Framework Without a Cluster

When does it make sense to continue pre-training on a single GB10 box, and when is it a category error? A planned run that pushes NeMo Framework, Megatron-LM parallelism, and BF16 mixed precision against the 128 GB unified-memory wall with a small domain corpus.
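Standard per-parameter accounting already shows why 128 GB is the wall for an 8B model. The sketch below uses textbook BF16 + Adam numbers only; the planned run's actual configuration, offloading, or precision tricks may change them.

```python
# Back-of-envelope memory budget for continued pre-training of an 8B model
# under BF16 mixed precision with Adam. A sketch with standard accounting,
# not the planned run's exact configuration.
params = 8.0e9

bytes_per_param = (
    2        # bf16 weights
    + 2      # bf16 gradients
    + 4      # fp32 master weights
    + 4 + 4  # fp32 Adam moments (exp_avg, exp_avg_sq)
)            # = 16 bytes/param

states_gib = params * bytes_per_param / 2**30
print(f"weights + grads + optimizer states: ~{states_gib:.0f} GiB")  # ~119 GiB
# Activations, workspace, and the OS share what is left of 128 GB, which is
# why the run sits right against the unified-memory wall.
```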