Claw-Eval-Live on Spark — Spark reproduction notes
Stand up the Claw-Eval-Live sandboxed-workflow protocol on Spark via NemoClaw + OpenShell, mock the business-service backends, run Llama 8B vs Nemotron 49B with deterministic-trace + LLM-judge grading, and chart where local agents land against the paper's 66.7 percent ceiling.
Series: Machine that Builds Machines
Source paper
- arXiv: 2604.28139 — Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
- Project page: claw-eval-live.github.io (no GitHub repo discoverable at promotion time)
- Popularity: 25/100 · 22 HF upvotes · 0 citations
Frontier Scout verdict
spark-feasible — 8B agent + lightweight service mocks + sequential sandbox runs sit comfortably below 50 GB, and NemoClaw + OpenShell are exactly the verified primitives this benchmark needs (nemoclaw-vs-openclaw-dgx-spark, autoresearch-agent-loop); the active blocker is the unreleased dataset, not the hardware envelope.
Proposed Spark recipe
- Wait or proxy — if the 105-task release isn’t out, hand-author 5 representative tasks per family (HR, multi-system business, local workspace repair) using the paper’s task structure as a template.
- Stand up the sandbox via NemoClaw — each task gets a fresh OpenShell container with the workspace pre-populated from a fixture tarball. Use the `cat | openshell sandbox exec` workaround (since `openshell sandbox upload` is broken on v0.0.26).
- Mock the business services as Flask/FastAPI processes inside the same network namespace — HR API, ticketing API, file-workspace state. Audit-log every request to a JSONL; a minimal mock sketch follows this list.
- Serve the agent under test via NIM. Run two side-by-side: `llama-3.1-8b-instruct` and `nemotron-super-49b` (or 70B fp8 if the box has been freshly booted). Tool-call against mocked services + sandbox shell.
- Build the grader: deterministic checks come from the audit log + workspace diff (file-state checksums, service-state asserts); semantic checks via Llama 8B as judge. A grader sketch follows below.
- Score and compare: compute per-task-family pass rates, mirror the paper’s “leaderboard rank vs overall completion” finding at this smaller scale, and call out whether local-first models exhibit the same HR / multi-system bottleneck pattern.
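To keep the mock-service step from ballooning, here is a minimal sketch of what one mocked backend could look like, assuming Flask and single-table SQLite state as suggested above; the service name, endpoints, table, port, and file paths are illustrative placeholders, not anything specified by the paper. The one non-negotiable piece is the append-only JSONL audit log the grader later replays.

```python
# hr_mock.py -- sketch of one mocked business service (endpoints, table name,
# and paths are assumptions for illustration, not from Claw-Eval-Live).
import json
import sqlite3
import time

from flask import Flask, jsonify, request

app = Flask(__name__)
DB_PATH = "hr_mock.db"        # seeded from the task's fixture tarball
AUDIT_LOG = "hr_audit.jsonl"  # every request lands here for the grader


def db() -> sqlite3.Connection:
    conn = sqlite3.connect(DB_PATH)
    conn.row_factory = sqlite3.Row
    return conn


@app.before_request
def audit() -> None:
    # Append one JSON line per request; the deterministic checks replay this file.
    entry = {
        "ts": time.time(),
        "method": request.method,
        "path": request.path,
        "body": request.get_json(silent=True),
    }
    with open(AUDIT_LOG, "a") as fh:
        fh.write(json.dumps(entry) + "\n")


@app.get("/employees/<int:emp_id>")
def get_employee(emp_id: int):
    row = db().execute("SELECT * FROM employees WHERE id = ?", (emp_id,)).fetchone()
    return (jsonify(dict(row)), 200) if row else (jsonify({"error": "not found"}), 404)


@app.post("/employees/<int:emp_id>/leave")
def file_leave(emp_id: int):
    days = int(request.json.get("days", 0))
    conn = db()
    conn.execute("UPDATE employees SET leave_days = leave_days - ? WHERE id = ?", (days, emp_id))
    conn.commit()
    return jsonify({"ok": True, "employee": emp_id, "days": days})


if __name__ == "__main__":
    app.run(port=8081)  # one mock per port inside the sandbox network namespace
```

The same pattern would repeat for the ticketing and file-workspace mocks; capping each at a handful of endpoints keeps the "how I built mocks" risk contained.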
Full recipe with stack-map references in evidence/spark-recipe.md.
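The deterministic half of the grader can then be a pure function of the audit log and the workspace. A sketch, assuming each hand-authored proxy task carries expected file checksums and a set of required service calls; the task spec below is a placeholder in the spirit of the paper's task structure, not a released task.

```python
# grade_task.py -- deterministic grading sketch: workspace checksums + audit-log asserts.
import hashlib
import json
from pathlib import Path

# Hypothetical hand-authored proxy task (placeholder values throughout).
TASK = {
    "family": "hr",
    "expected_files": {
        # path inside the sandbox workspace -> sha256 after a correct run
        "reports/leave_summary.csv": "3c9e1f...",  # placeholder digest
    },
    "required_calls": [
        # (method, path) pairs that must appear in the service audit log
        ("POST", "/employees/7/leave"),
    ],
}


def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def check_workspace(workspace: Path) -> bool:
    return all(
        (workspace / rel).exists() and sha256(workspace / rel) == digest
        for rel, digest in TASK["expected_files"].items()
    )


def check_audit_log(log_path: Path) -> bool:
    seen = {
        (entry["method"], entry["path"])
        for entry in (json.loads(line) for line in log_path.read_text().splitlines() if line)
    }
    return all(call in seen for call in TASK["required_calls"])


def deterministic_pass(workspace: Path, log_path: Path) -> bool:
    # Pass only if both the file state and the recorded service calls check out.
    return check_workspace(workspace) and check_audit_log(log_path)


if __name__ == "__main__":
    print(deterministic_pass(Path("workspace"), Path("hr_audit.jsonl")))
```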
Open questions for the experiment
- No repo, no dataset URL as of eval time. The article either waits for the release or proxies with a hand-authored subset (honest if framed as protocol replication, not benchmark reproduction).
- Service mocking is real engineering — risk of becoming “how I built mocks” rather than “what the agent did.” Use the simplest possible mocks (3 endpoints each, single-table SQLite state).
- The 13 models the paper evaluates aren’t named in the abstract; local leaderboard will be Spark-stack subset rather than directly comparable.
- Judge contamination risk: an agent and judge from the same family bias the score. Use different families (Llama agent / Nemotron judge) or rely on deterministic-only checks for the headline number; a sketch of that split follows this list.
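One way to implement that split, assuming the local NIM endpoint speaks the usual OpenAI-compatible chat-completions route; the URL, model name, and prompt wording are placeholders. The headline stays the deterministic_pass result from the grader sketch, and the judge verdict is reported alongside it rather than folded in.

```python
# judge.py -- semantic check via a different model family, kept out of the headline score.
import requests

JUDGE_URL = "http://localhost:8000/v1/chat/completions"  # hypothetical Nemotron NIM endpoint
JUDGE_MODEL = "nemotron-super-49b"                        # different family from the Llama agent


def judge_trace(task_prompt: str, agent_transcript: str) -> bool:
    prompt = (
        "You are grading an agent run.\n"
        f"Task:\n{task_prompt}\n\nTranscript:\n{agent_transcript}\n\n"
        "Answer with a single word: PASS or FAIL."
    )
    resp = requests.post(
        JUDGE_URL,
        json={
            "model": JUDGE_MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.0,
        },
        timeout=120,
    )
    verdict = resp.json()["choices"][0]["message"]["content"].strip().upper()
    return verdict.startswith("PASS")


# Per task: headline = deterministic_pass(...); secondary, reported separately = judge_trace(...).
```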
Suggested article shape
- Stage: observability
- Series: Autoresearch
- Tags: agentic, benchmark, sandboxing, evals, llm-as-judge, audit-log, nemoclaw, openclaw
- Voice: essay on why “did the agent do the thing” is harder to grade than “did the agent say the right thing” — and what verifiable execution traces buy you when the leaderboard model still tops out at 66.7%.