Upcoming — proposed abstract, not a published piece.

Heterogeneous Scientific Foundation Model Collaboration — Spark reproduction notes

Wrap a domain foundation model (Pangu-Weather) as a Triton tool, drive it from a NIM-served Llama 3.1 8B planner via NemoClaw, and show when specialist routing beats language-only reasoning — all inside the Spark 128 GB envelope.

Series: Machine that Builds Machines

Source paper

  • arXiv: 2604.27351 — Heterogeneous Scientific Foundation Model Collaboration
  • Repo: (none at promotion time — paper introduces “Eywa” framework but no code released)
  • Popularity: 41/100 · 181 HF upvotes · 0 citations

Frontier Scout verdict

spark-feasible — every component fits comfortably in the 128 GB envelope (8B planner + ≤3 specialists + Guardrails ≤ 50 GB), and the agent shell maps directly onto NemoClaw’s existing tool-routing protocol; the only friction is reconstructing the framework from the paper’s prose since code isn’t out yet.

Proposed Spark recipe

  1. Pick a concrete domain for the first walkthrough — weather forecasting via Pangu-Weather (~3B) is the cleanest demo because input/output is tensor-on-disk rather than tokens.
  2. Serve the planner LLM via NIM: the llama-3.1-8b-instruct container exposes the OpenAI-compatible chat endpoint already documented in nim-first-inference-dgx-spark. Set NIM_GPU_MEM_FRACTION=0.4 to reserve room for the specialist (client sketch below).
  3. Serve the specialist FM via Triton with a Python backend wrapping the model's native inference (trtllm-and-triton-on-spark shows the pattern). Expose it as predict_weather(state_tensor) -> forecast_tensor (wrapper skeleton below).
  4. Build the EywaAgent loop inside NemoClaw as a custom skill: the planner receives a question, decides whether to call the specialist, marshals the structured inputs, gets back a tensor, and renders a natural-language summary (routing-loop sketch below).
  5. Add NeMo Guardrails at the planner boundary so off-domain prompts route to a refusal rather than calling the specialist with garbage inputs (wiring sketch below).
  6. Measure two things: planner-only vs. Eywa accuracy on a held-out domain question set (the paper claims improvement on structured-data tasks), and the end-to-end latency cost of the extra specialist hop (harness sketch below).
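
A minimal client sketch for step 2, assuming the NIM container is already up on localhost:8000 as in nim-first-inference-dgx-spark; the base URL and the meta/llama-3.1-8b-instruct model id are deployment details, not anything the paper specifies.

```python
# Step 2 sketch: talk to the NIM-served planner over its OpenAI-compatible
# endpoint. Assumes the llama-3.1-8b-instruct container is up on localhost:8000
# with NIM_GPU_MEM_FRACTION=0.4 set on the container.
from openai import OpenAI

planner = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

resp = planner.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[
        {"role": "system",
         "content": "You may request predict_weather(state_tensor) "
                    "when a question needs a forecast."},
        {"role": "user",
         "content": "Will the 500 hPa trough over the North Atlantic "
                    "deepen in the next 24 hours?"},
    ],
    temperature=0.2,
)
print(resp.choices[0].message.content)
```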
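
For step 3, a skeleton of the Triton Python-backend wrapper. The TritonPythonModel class is Triton's actual contract for Python backends; the pangu import, checkpoint path, and the STATE_TENSOR / FORECAST_TENSOR names are placeholders until the model's native inference code is reconstructed.

```python
# model_repository/pangu_weather/1/model.py
# Step 3 sketch: Triton Python backend wrapping Pangu-Weather's native
# inference. Everything marked "placeholder" is an assumption.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Placeholder loader: swap in the model's real checkpoint-loading code.
        from pangu import load_model  # placeholder import
        self.model = load_model("pangu_weather_24h.ckpt")

    def execute(self, requests):
        responses = []
        for request in requests:
            state = pb_utils.get_input_tensor_by_name(
                request, "STATE_TENSOR").as_numpy()
            forecast = self.model.infer(state)  # placeholder inference call
            out = pb_utils.Tensor("FORECAST_TENSOR", forecast)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```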
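
Step 4 is where the reimplementation effort concentrates. Since neither Eywa's code nor NemoClaw's skill interface is published, the sketch below is framework-agnostic: every name in it (EywaAgent, ROUTE_PROMPT, the JSON decision schema, the Triton port) is an assumption to be replaced by the real skill API.

```python
# Step 4 sketch: planner decides, specialist executes, planner summarizes.
import json

import numpy as np
import tritonclient.http as triton_http

ROUTE_PROMPT = (
    "Decide whether this question needs the weather specialist. Reply with "
    'JSON only: {"use_specialist": true|false, "state_path": "<path>"|null}.'
)


class EywaAgent:
    def __init__(self, planner, triton_url="localhost:8100"):
        # Triton's HTTP port remapped to 8100 here so NIM can keep 8000.
        self.planner = planner  # OpenAI-compatible client from the step 2 sketch
        self.triton = triton_http.InferenceServerClient(url=triton_url)

    def call_specialist(self, state: np.ndarray) -> np.ndarray:
        inp = triton_http.InferInput("STATE_TENSOR", list(state.shape), "FP32")
        inp.set_data_from_numpy(state)
        result = self.triton.infer("pangu_weather", inputs=[inp])
        return result.as_numpy("FORECAST_TENSOR")

    def answer(self, question: str) -> str:
        # A production loop would validate and retry the JSON here.
        decision = json.loads(self._ask(f"{ROUTE_PROMPT}\n\n{question}"))
        if not decision["use_specialist"]:
            return self._ask(question)  # language-only path
        forecast = self.call_specialist(
            np.load(decision["state_path"]).astype(np.float32))
        return self._ask(
            f"{question}\n\nSpecialist forecast summary: "
            f"mean={forecast.mean():.2f}, max={forecast.max():.2f}")

    def _ask(self, prompt: str) -> str:
        resp = self.planner.chat.completions.create(
            model="meta/llama-3.1-8b-instruct",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
```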
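
For step 5, the Guardrails wiring at the planner boundary. RailsConfig / LLMRails is the library's real entry point; the ./guardrails_config directory (config.yml plus Colang flows scoping the weather domain, with the rails' LLM pointed at the same NIM endpoint) and the refusal heuristic are assumptions.

```python
# Step 5 sketch: screen prompts before they reach the routing loop, so
# off-domain questions get a refusal instead of a garbage specialist call.
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./guardrails_config")  # assumed config dir
rails = LLMRails(config)


def guarded_answer(agent, question: str) -> str:
    checked = rails.generate(messages=[{"role": "user", "content": question}])
    # Crude refusal check; a real setup would let the rails themselves
    # decide and only forward in-domain prompts to the agent.
    if "refuse" in checked["content"].lower():
        return checked["content"]
    return agent.answer(question)
```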
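
Finally, for step 6, a tiny harness that A/Bs the planner-only path against the full loop and logs wall-clock latency. The questions.jsonl schema and the substring grading are stand-ins until the held-out set exists.

```python
# Step 6 sketch: accuracy and p50 latency for planner-only vs. the Eywa loop.
import json
import time


def evaluate(agent, path="questions.jsonl"):
    with open(path) as f:
        rows = [json.loads(line) for line in f]
    for mode in ("planner_only", "eywa"):
        correct, latencies = 0, []
        for row in rows:
            t0 = time.perf_counter()
            if mode == "planner_only":
                answer = agent._ask(row["question"])    # bypass the specialist
            else:
                answer = agent.answer(row["question"])  # full routing loop
            latencies.append(time.perf_counter() - t0)
            correct += row["expected"].lower() in answer.lower()  # crude grading
        p50 = sorted(latencies)[len(latencies) // 2]
        print(f"{mode}: acc={correct / len(rows):.2f}  p50={p50 * 1000:.0f} ms")
```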

Full recipe with stack-map references in evidence/spark-recipe.md.

Open questions for the experiment

  • No public repo — Eywa framework must be reimplemented from the paper’s prose. ~200–400 lines of Python expected.
  • Specific scientific FMs the paper evaluates aren’t enumerated — pick Pangu-Weather (or ESM-2 for proteins) as the canonical specialist and call out the deviation.
  • No accuracy numbers in the abstract to anchor the “is orchestration helping” claim until the full paper or a code release arrives.

Suggested article shape

  • Stage: agentic
  • Series: Autoresearch
  • Tags: agentic, tool-use, foundation-models, multimodal, nemoclaw, nim, triton, guardrails
  • Voice: essay on when language-only reasoning hits a structural wall and what specialist routing buys you. Tie back to the Autoresearch arc — Eywa is one pattern for an autoresearch agent that reaches beyond text.