Orionfold/II-Medical-8B-GGUF on Spark — five medical-reasoning variants, MedMCQA mini-eval, ChatML reasoning format
Five GGUF variants of Intelligent-Internet/II-Medical-8B (Qwen3-8B + DAPO reasoning recipe) measured on a DGX Spark. Q5_K_M lands at 36.4 tok/s, 5.45 GB, and 52% on a MedMCQA n=50 mini-eval — above F16. First reasoning recipe in the series.
Series: Machine that Builds Machines

Terms in this piece (3):
- ChatML: The chat-formatting convention introduced with OpenAI's ChatML spec and adopted by the Qwen family. Each turn is wrapped in <|im_start|>role / <|im_end|> markers — <|im_start|>user, <|im_start|>assistant, <|im_start|>system. Distinct from Llama-2's [INST]…[/INST], Mistral's <s>[INST], and Zephyr's <|user|>. The GGUF carries the template in its metadata so most loaders auto-detect; the trap is that older preflight harnesses key on file-name suffixes and miss it.
- DAPO: Direct Advantage Policy Optimization — a preference-tuning variant in the DPO family. Like DPO it learns from pairs of preferred and rejected responses without needing an explicit reward model, but reformulates the loss to track an advantage estimate. The II-Medical-8B authors report DAPO + supervised fine-tuning lifted HealthBench from baseline Qwen3-8B to a score comparable to OpenAI's o1 reasoning model on medical-specific items.
- MedMCQA: A multi-choice medical-Q&A benchmark of ~194K questions sourced from Indian medical entrance exams (AIIMS, NEET-PG). Each row has a question, four options, and a single correct-option pointer (cop) across 21 medical subjects and 2,400 healthcare topics. Long-tail subject coverage makes it a stricter test of medical breadth than USMLE-derived benches.
Today on the Spark: Orionfold/II-Medical-8B-GGUF ships — five GGUF variants of Intelligent-Internet/II-Medical-8B, a Qwen3-8B base with an SFT + DAPO reasoning recipe tuned for clinical Q&A. Same four-axis card shape as the finance, legal, and cyber releases before it; same publishing surface; same lineage trail. What changes this week is that the model under the card is the first one in the series with a <think> block — and that single shift exposed a generation-budget assumption the prior three cards never had to face.
The narrative thread: after finance numeric reasoning, legal binary classification, and cyber MCQ, medical is the first vertical to ship a model that thinks before it answers. The generation budget — the default 256 token n_predict every prior preflight ran with — became a load-bearing parameter overnight. At 256 the F16 preflight scored 2/5 not because the model didn’t know the medicine, but because the <think> block burned the budget before any letter token landed. At 1024 the same model swept clean and the full quantize-plus-measure cycle ran on numbers that actually reflected capability.
This article is the publishing receipt for the medical-vertical release: the Spark-measured numbers, the new ChatML preflight branch, the variant picker for downstream use, and the honest gotchas the card inherits from being a reasoning recipe rather than a plain SFT.
Spark-tested numbers
The card under each variant on HuggingFace carries these numbers verbatim. They were produced by fieldkit.quant.measure_perplexity_gguf, llama-bench, a thermal-probe wrapper, and fieldkit.eval.VerticalBench with the cyber-vintage mcq_letter scorer over a 50-question MedMCQA subset (sampled deterministically from openlifescienceai/medmcqa’s validation split — the test split ships with masked labels and would have produced a uniformly-zero scoreboard).
| Variant | Size | Perplexity (wikitext-2) | tg tok/s | pp tok/s | MedMCQA (n=50, mcq_letter) |
|---|---|---|---|---|---|
| F16 | 15.3 GB | 16.27 | 15.94 | 2262.2 | 48% (24/50) |
| Q8_0 | 8.11 GB | 16.30 | 28.42 | 2523.3 | 48% (24/50) |
| Q6_K | 6.26 GB | 16.01 | 32.80 | 2332.2 | 46% (23/50) |
| Q5_K_M ⭐ | 5.45 GB | 16.24 | 36.36 | 2579.5 | 52% (26/50) |
| Q4_K_M | 4.68 GB | 16.55 | 43.57 | 2773.2 | 42% (21/50) |
Three observations worth narrating:
- Q5_K_M beats F16 on both perplexity and the medical bench. Its perplexity (16.24) sits a hair under F16’s 16.27 — within wikitext-2 sampling noise, but the direction is unusual; you expect lossy quantization to push perplexity up, not down, and a sub-F16 number usually means the F16 reference was on the unlucky tail of the sample. Its MedMCQA score is 52% vs F16’s 48% — 4 percentage points, two questions out of fifty, comfortably inside the n=50 binomial noise floor (one standard error alone is ~7pp at this n; a quick noise-floor sketch follows this list). Either number alone would be unsurprising; together they read as a genuine sweet spot rather than a fluke.
- The Q8_0 anomaly didn’t show up this time. Finance and legal both saw Q8_0 slower than F16 (8.9 vs 11.5 and 7.3 vs 10.9 tg tok/s — suspected at the time to be a thermal-scheduling artefact of running it last in the sweep). On cyber, Q8_0 was 30.3 vs F16’s 17.5 — 1.7× faster. On medical, the same pattern: 28.4 vs F16’s 15.9, 1.78× faster. The split now divides four verticals two-and-two — finance and legal slow, cyber and medical fast — and the cleanest hypothesis is that the slow ones were continued-pretrain-flavored models (finance-chat is a continued-pretrain SFT, Saul is a heavy domain-pretrained SFT) while the fast ones are chat-tune-only shapes (SecurityLLM is Zephyr-DPO, II-Medical is SFT+DAPO on top of base Qwen3). The thermal-scheduling explanation never fully fit; the model-shape correlation does.
- The MedMCQA spread is tight across variants. F16 = 48, Q8 = 48, Q6 = 46, Q5 = 52, Q4 = 42. Ten percentage points top-to-bottom on a 50-question bench is well inside what binomial sampling allows for a four-option MCQ at this scale. The take-home: lossy quantization did not measurably damage medical reasoning capability for this model — and the small disagreements between variants are noise the card surfaces honestly rather than smoothing away.
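For scale on that noise floor, a minimal sketch of the binomial arithmetic (plain standard-library Python, nothing project-specific assumed):

```python
# How noisy is a 50-question, four-option MCQ score? Normal approximation
# to the binomial, evaluated at three of the scores from the table above.
import math

n = 50
for variant, correct in [("F16", 24), ("Q5_K_M", 26), ("Q4_K_M", 21)]:
    p = correct / n
    se = math.sqrt(p * (1 - p) / n)   # one standard error of the proportion
    print(f"{variant}: {p:.0%}  +/-{se:.1%} (1 SE), +/-{1.96 * se:.1%} (95% CI)")

# Around p = 0.5 one SE is ~7pp, so the 42-52% spread across variants sits
# well inside what resampling 50 questions would produce by chance.
```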
Variant picker
| Variant | When to reach for it |
|---|---|
| Q5_K_M | Default pick — the sweet-spot variant. 5.45 GB, 36.4 tok/s, 52% on MedMCQA (highest of the five), perplexity essentially equal to F16. The one to download first. |
| Q4_K_M | Throughput pick. 4.68 GB, 43.6 tok/s, 42% on MedMCQA. When you’re scanning a corpus and human-reviewing top hits — the 10-point bench delta vs Q5_K_M is recoverable downstream if your loop has a reviewer. |
| Q6_K | Lowest-perplexity pick. 6.26 GB, 32.8 tok/s, 46% on MedMCQA — perplexity 16.01 is the lowest of the five against the wikitext-2 reference. Reach for it when you want the lowest absolute perplexity on general-language work and don’t mind the throughput cost vs Q5_K_M. |
| Q8_0 | Lossless-feeling pick. 8.11 GB, 28.4 tok/s, 48% on MedMCQA — matches F16’s bench score, perplexity within 0.03. Use it when you want F16 quality at 53% the size and 1.78× the speed. |
| F16 | Reference only. 15.3 GB, 15.9 tok/s, 48% on MedMCQA. No quantization — use for measurement / baseline / debugging quant-induced regressions, not for production. |
Using this release
The card on HuggingFace ships the same three snippets every Orionfold quant card ships, derived from model_license=apache-2.0, chat_format=chatml, and recommended_variant=Q5_K_M. Reproduced here for read-through.
Pull a variant (Q5_K_M is the default pick on this card):
```bash
huggingface-cli download Orionfold/II-Medical-8B-GGUF model-Q5_K_M.gguf \
  --local-dir ./models/ii-medical-8b
```
Serve it via llama-server (OpenAI-compatible HTTP API at http://127.0.0.1:8080/v1). The reasoning recipe means the model produces a <think> block before its answer — give it room or it will get cut off mid-thought:
```bash
llama-server -m ./models/ii-medical-8b/model-Q5_K_M.gguf \
  -c 4096 -ngl 99 -t 8 \
  -n 1024 \
  --host 0.0.0.0 --port 8080
```
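Once the server is up, any OpenAI-compatible client can talk to it. A minimal usage sketch with the openai Python package; the api_key value is a placeholder (llama-server does not check it unless started with --api-key), and the model name is informational rather than enforced:

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-checked")

resp = client.chat.completions.create(
    model="ii-medical-8b",      # llama-server serves whatever model it loaded
    max_tokens=1024,            # room for the <think> block plus the answer
    temperature=0.0,
    messages=[{"role": "user",
               "content": "In one sentence: what does a widened mediastinum "
                          "on chest X-ray suggest in a patient with tearing "
                          "chest pain?"}],
)
print(resp.choices[0].message.content)
```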
In-process via llama-cpp-python (note chat_format="chatml" — II-Medical-8B uses Qwen3’s ChatML template, <|im_start|> / <|im_end|>, not Llama-2’s [INST] or Zephyr’s <|user|>):
```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/ii-medical-8b/model-Q5_K_M.gguf",
    n_ctx=4096, n_gpu_layers=99, chat_format="chatml",
)
out = llm.create_chat_completion(
    messages=[
        {"role": "user",
         "content": "A 56-year-old man presents with sudden onset of severe "
                    "tearing chest pain radiating to the back. BP 180/100, "
                    "wider pulse pressure on the right than left arm.\n\n"
                    "Which is the most likely diagnosis?\n"
                    "A) Acute pericarditis\n"
                    "B) Aortic dissection\n"
                    "C) Pulmonary embolism\n"
                    "D) Myocardial infarction\n\n"
                    "Reply with only the single letter A, B, C, or D."}
    ],
    max_tokens=1024,
    temperature=0.0,
)
print(out["choices"][0]["message"]["content"])
```
LM Studio loads the GGUF directly and reads the ChatML template from the GGUF metadata. Ollama needs a Modelfile pointing at the GGUF plus a TEMPLATE block matching the ChatML shape; recent Ollama versions read the embedded template automatically, but verify before relying on it.
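If you go the Modelfile route, here is a hedged sketch of what it could look like. The TEMPLATE block is a generic ChatML rendering in Ollama's Go-template syntax written for this article, not something copied from the card, so compare it against the template embedded in the GGUF before relying on it:

```
FROM ./models/ii-medical-8b/model-Q5_K_M.gguf

TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
{{ .Response }}"""

PARAMETER stop "<|im_end|>"
PARAMETER num_predict 1024
```

ollama create ii-medical-8b -f Modelfile registers it; ollama run ii-medical-8b then serves it with the generation budget already raised via num_predict.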
What changes between verticals — and what doesn’t
Three things changed for medical; one didn’t.
Did not change: fieldkit. This is the headline of the fourth release. fieldkit v0.4.2 shipped two weeks ago to land the publishing-surface polish (neutral default prompts, manifest recommended_variant); the medical card consumed those changes without needing any new ones. fieldkit.publish.publish_quant already accepts vertical_eval= (variant → score dict), vertical_eval_name= (the column header), chat_format= (template hint for snippet rendering), and recommended_variant= (the manifest’s Q5_K_M sticker). Swapping cybermetric for medmcqa needed zero new symbols and zero behavior changes. The PyPI package version on this release’s commit is the same 0.4.2 the cyber release shipped on.
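As a concrete anchor, a hedged sketch of what the medical publish call looks like with those keywords. fieldkit is this series' internal package, so anything beyond the four keywords listed above (notably the positional repo identifier) is an assumption, not documented API:

```python
from fieldkit.publish import publish_quant

# Scores copied from the Spark-tested table above; keys are variant names.
medmcqa_scores = {"F16": 0.48, "Q8_0": 0.48, "Q6_K": 0.46,
                  "Q5_K_M": 0.52, "Q4_K_M": 0.42}

publish_quant(
    "Orionfold/II-Medical-8B-GGUF",                   # assumed repo identifier
    vertical_eval=medmcqa_scores,                     # variant -> score dict
    vertical_eval_name="MedMCQA (n=50, mcq_letter)",  # card column header
    chat_format="chatml",                             # snippet-rendering hint
    recommended_variant="Q5_K_M",                     # manifest sticker
)
```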
Changed: the merge script. MedMCQA ships as three splits on HuggingFace — train (182K), validation (4.2K), and test (6.1K, labels masked). The new scripts/medmcqa_merge.py samples 50 rows from validation deterministically (seed 42), formats each as a 4-option MCQ prompt with the same {id, text, answer, task} JSONL shape VerticalBench.from_jsonl(..., format="legalbench") already consumes. The script logs the letter distribution (A=17 / B=15 / C=13 / D=5 for this seed — slightly D-light, which is the population shape MedMCQA ships with at validation) and a per-subject histogram for sanity. Same downstream consumer; no fieldkit code touched. Picking validation over test matters — the test split’s cop (correct-option-pointer) is -1 on every row, which would have produced a uniformly-zero scoreboard masquerading as a benchmark failure.
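The merge itself is small enough to sketch. This is a minimal reconstruction from the description above, not the shipped scripts/medmcqa_merge.py; the field names (question, opa through opd, cop, with cop 0-based on this HF mirror) come from the upstream dataset, and the exact prompt wording is an assumption:

```python
import json
import random

from datasets import load_dataset

LETTERS = "ABCD"
rows = load_dataset("openlifescienceai/medmcqa", split="validation")
picks = random.Random(42).sample(range(len(rows)), 50)   # seed 42 -> reproducible

with open("medmcqa_merged.jsonl", "w") as fh:
    counts = {letter: 0 for letter in LETTERS}
    for i in picks:
        r = rows[i]
        options = [r["opa"], r["opb"], r["opc"], r["opd"]]
        prompt = (r["question"].strip() + "\n\n"
                  + "\n".join(f"{LETTERS[j]}) {o}" for j, o in enumerate(options))
                  + "\n\nReply with only the single letter A, B, C, or D.")
        answer = LETTERS[r["cop"]]   # cop is 0-3 on validation, -1 on the test split
        counts[answer] += 1
        fh.write(json.dumps({"id": r["id"], "text": prompt,
                             "answer": answer, "task": "medmcqa"}) + "\n")
    print("letter distribution:", counts)
```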
Changed: the preflight prompt-format detector. The existing _detect_prompt_format in scripts/g3_preflight_bench.py recognized three families — Llama-2-chat from README phrases, Mistral-Instruct from tokenizer_config.json’s [INST] markers, and Zephyr from the <|user|> shape. ChatML’s <|im_start|> was falling through to the unwrapped-prompt fallback — silently — and the model was being preflight-scored on bare raw questions, which any reasoning recipe would mishandle. The fix is a new chatml branch in both _detect_prompt_format and _format_prompt, plus a <|im_start|> token-search added to the format-detection precedence. Five lines of detection, twelve of wrapping. The prior three cards never tripped it because none of them used ChatML; this one did, and the lessons-on-the-way pattern from the cyber card (Zephyr branch added when Zephyr arrived) repeated cleanly.
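In spirit the new branch is one more token-search ahead of the existing ones. A sketch of the precedence only, not the shipped function (the real detector reads README phrases and more of tokenizer_config.json than shown here, and the returned labels are stand-ins):

```python
def _detect_prompt_format(readme: str, tokenizer_config: str) -> str:
    # ChatML first: <|im_start|> in either source wins (the branch this release adds)
    if "<|im_start|>" in tokenizer_config or "<|im_start|>" in readme:
        return "chatml"
    if "[INST]" in tokenizer_config:       # Mistral-Instruct markers
        return "mistral"
    if "<|user|>" in tokenizer_config:     # Zephyr shape
        return "zephyr"
    if "llama-2" in readme.lower():        # README-phrase heuristic
        return "llama2-chat"
    return "raw"                           # unwrapped-prompt fallback
```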
Changed: the chat-template wrapper. Same one-function add as cyber. The measurement script gained a _wrap_chatml function alongside _wrap_inst and _wrap_zephyr, and the per-vertical dispatch table got an entry: {"medmcqa": _wrap_chatml}. The card’s HF README snippet renders chat_format="chatml" because the manifest carries it, and llama-cpp-python recognizes the literal string verbatim. Three lines of wiring; no fieldkit changes.
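And the wrapper side, under the same caveat: the system string below stands in for the neutral default prompt the publishing surface uses, and the pre-existing wrappers are omitted so the block stays self-contained:

```python
def _wrap_chatml(question: str,
                 system: str = "Answer the question accurately.") -> str:
    # ChatML shape Qwen3 / II-Medical-8B expects: system, user, then an open
    # assistant turn for the model to complete (the <think> block lands here).
    return (f"<|im_start|>system\n{system}<|im_end|>\n"
            f"<|im_start|>user\n{question}<|im_end|>\n"
            f"<|im_start|>assistant\n")

# Per-vertical dispatch entry described above (existing wrappers not shown).
PROMPT_WRAPPERS = {"medmcqa": _wrap_chatml}
```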
The cleanest signal that the surface continues to generalize as designed: the medical card and the cyber card render with the same four-axis table shape, the same three run snippets, the same Methods link convention. Only the column header, the numbers, the chat_format value, and the recommended-variant pin differ.
On the reasoning-recipe generation budget
CyberMetric’s gold answer was a single letter and the model’s job was to emit it directly. MedMCQA’s gold answer is also a single letter — but the model’s path to it now includes a deliberative <think>…</think> block. The cyber generation budget could comfortably sit at 256 tokens because the entire response shape was “Answer: X” plus maybe a justification sentence. The medical generation budget can’t.
Two practical consequences for downstream use. First, the inference cost shape changes: a single MedMCQA query at Q5_K_M produces ~600 tokens at 36.4 tok/s, so wall-clock per question is ~16 seconds, not the ~2 seconds a non-reasoning 8B at the same throughput would take on a 70-token direct answer. Second, the KV-cache budget changes: a 4096-context server holds roughly 6-7 reasoning turns before eviction kicks in, not the ~58 short-answer turns the same context would hold. If you’re building a multi-turn medical assistant on this model, planning for that roughly 8× turn-density delta is the difference between a clean session and a context-overrun cliff. The base Qwen3-8B’s native context is 40,960 tokens, so push -c higher if your workload needs longer histories — but you’ll burn unified memory proportionally, and on Spark that’s the gating constraint.
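The turn-budget arithmetic, spelled out (prompt tokens ignored, so these are ceilings rather than plans):

```python
CTX = 4096              # the -c value in the serve command above
REASONING_TOKENS = 600  # typical <think> block plus answer for this recipe
DIRECT_TOKENS = 70      # typical non-reasoning letter-plus-gloss answer

print("reasoning turns per context:", CTX // REASONING_TOKENS)   # ~6-7
print("direct-answer turns per context:", CTX // DIRECT_TOKENS)  # ~58
```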
A note on the MedMCQA subset
MedMCQA ships ~194K total questions across train / validation / test. Evaluating all of validation per variant would take ~14 hours per variant; for a 5-variant card on a single Spark, that’s the wrong cost shape. The 50-question subset trades fidelity for tractability while staying defensible:
- Sampled from the validation split — labels intact, not the test split where cop=-1 masks every answer.
- Seed 42, so reruns reproduce. Letter distribution (A=17 / B=15 / C=13 / D=5) is slightly D-light, consistent with the validation-population shape.
- Subjects span (roughly) anatomy, pharmacology, pathology, microbiology, biochemistry, medicine, surgery, OB/GYN, pediatrics, psychiatry, and public health. The histogram is in the merge-script log; the merged JSONL doesn’t carry per-row subject tags because the bench loader doesn’t need them.
A more authoritative score would extend the subset to the full 4.2K-question validation split and run it once per release, not per variant. The 50-question card is the publishable score — comparable across releases, runnable per-variant in under 25 minutes — not the authoritative one.
Thermal envelope notes
Sustained-load minutes (probed via nvidia-smi at 10-second intervals during the bench sweep) ranged from 18.1 min (Q4_K_M) to 48.9 min (F16). The pattern is the inverse of throughput — smaller variants generate faster, get hotter faster, back off the GPU sooner — and matches the same shape observed on every prior vertical card. Q5_K_M’s 20.6-minute sustained envelope falls just short of the ~22-minute wall-clock a typical 50-question MedMCQA sweep takes, but the generation-heavy portion fits inside it, and no throttling event interrupted the single-bench run as recorded.
The Q8_0 anomaly is the part worth re-narrating. After three verticals the slow-Q8 pattern looked load-bearing; after four it looks model-specific. Finance and legal Q8 were slower than F16 (0.77× and 0.67× respectively); cyber and medical Q8 were dramatically faster (1.73× and 1.78×). The pattern that fits all four data points is what kind of fine-tune the upstream applied: heavy continued-pretrain SFTs (finance-chat, Saul) seem to produce Q8 weight distributions the GB10’s tensor-core path handles less efficiently than smaller-quant variants; chat-tune-only shapes (Zephyr-DPO, SFT+DAPO) don’t. One more vertical with each shape would confirm or fold the hypothesis; for now the medical card carries the measurement as-recorded.
Methods + reproducibility
The full release pipeline lives in scripts/g3_build_first_quant.sh. For II-Medical-8B, the invocation is:
```bash
HF_VENV=/tmp/fk \
MODEL_ID=Intelligent-Internet/II-Medical-8B \
LLAMA_CLI_NPREDICT=1024 \
./scripts/g3_build_first_quant.sh all
```
The case statement at the top of the script auto-resolves MODEL_LICENSE=apache-2.0, CHAT_FORMAT=chatml, VERTICAL_BENCH=medmcqa, and ARTICLE_SLUG=becoming-a-medical-curator-on-spark from the model ID, so no per-vertical env vars need passing manually. The two non-default env vars matter: HF_VENV=/tmp/fk overrides the skill’s canonical /tmp/fk-test path (which was stale on this Spark; the override pattern is the resilient one), and LLAMA_CLI_NPREDICT=1024 is the reasoning-budget bump the prior verticals didn’t need. The pipeline runs: preflight → download → preflight-bench (5-question MedMCQA gate against FP source weights — scored 2/5 with the n_predict=256 default, then a clean 5/5 once n_predict was raised) → probe → quantize (5 variants) → measure (4 axes per variant) → publish-dryrun → publish.
The lineage rows for this release live at evidence/lineage-II-Medical-8B/results.tsv (one row per variant, hypotheses + measurements + bench source). The merged MedMCQA JSONL the measure step consumed lives at /home/nvidia/data/eval-benches/medmcqa/medmcqa_merged.jsonl — produced by scripts/medmcqa_merge.py from the upstream openlifescienceai/medmcqa dataset.
End-to-end wall time on the Spark was approximately 5 hours, decomposed: ~32 minutes for the source download (16 GB of safetensors over unauthenticated HF), ~18 seconds for the F16 GGUF convert (fast), ~30 seconds for the preflight bench, ~10 minutes for the 5-variant quantize, and ~2h 30min for the four-axis measurement sweep (5 variants × ~30 min per variant — perplexity + tok/s probe + thermal-overlapped 50-question MedMCQA sweep). The HF upload then ran detached via the v0.4.0 resilient pusher (hf_push_resilient.py, upload_large_folder API with num_workers=1 — the slow-upstream profile lessons from the Saul release carry forward); upload wall-clock was 2h 32min 33s for 40 GB across 5 GGUF files plus README + .gitattributes.
What this unlocks
Three concrete uses for the artifact downloaded:
- A local clinical-Q&A console behind your own retrieval layer. Wire llama-server on Q5_K_M behind a thin web UI, point it at a PubMed mirror or your own clinical-notes corpus, and you have a private medical-reasoning chat that never sends a query off the box. The 5.45 GB footprint leaves headroom on a 128 GB Spark to run a Retriever NIM and a pgvector store alongside, so the full RAG + reasoning loop fits without ever stepping off-device.
- A reasoning-trace exporter for second-opinion workflows. The <think> block is itself a deliverable — for a learner, for a peer reviewer, for a charting audit. Capture it alongside the answer letter, and the model becomes a documented-reasoning generator, not just a classifier. Q5_K_M’s 36.4 tok/s makes a 600-token trace land in ~17 seconds — slow enough that you’d batch it, fast enough that interactive use is comfortable.
- A bench-locality probe for your own corpus. The same g3_measure_variants.py shape the card uses generalizes — point it at your own MCQ-shaped JSONL of in-house cases (the row shape is sketched below), run it across all five variants, and the curve tells you whether your domain agrees with MedMCQA on which quant to pick. The variant that wins MedMCQA may not be the one that wins your bench; the harness is now small enough that running both takes an afternoon, not a week.
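For reference, the MCQ-shaped row that harness (and the merge script earlier) consumes; the values below are illustrative, not rows from any real corpus:

```python
import json

row = {
    "id": "inhouse-0001",                  # any stable identifier
    "text": ("A 62-year-old woman presents with ...\n\n"
             "A) ...\nB) ...\nC) ...\nD) ...\n\n"
             "Reply with only the single letter A, B, C, or D."),
    "answer": "B",                         # gold letter the mcq_letter scorer expects
    "task": "medmcqa",                     # bench tag the loader keys on
}
print(json.dumps(row))                     # one such object per JSONL line
```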
Closing
Four verticals down, one machine. The publishing surface has now absorbed four different chat templates (Llama-2, Mistral, Zephyr, ChatML), three distinct scorers across four configurations (numeric_match, contains, mcq_letter × 2), three license tiers across four releases (Llama-2 community, MIT, Apache-2.0 × 2), and a reasoning recipe on top — without fieldkit itself shipping a single new symbol since v0.4.2. The configuration-shape thesis from the cyber release held: a fourth vertical-curator cycle was a half-day of script polish, not a refactor.
What this means for a personal AI power user on one Spark: a fourth domain-specialized 8B with calibrated four-axis numbers, downloadable as a 5.45 GB file, runnable behind llama-server in two commands, and audit-trailed end-to-end through the same lineage table the other three releases live in. The medical card is up — watch the Orionfold org page for what’s next.
Catalog page: /artifacts/quants/ii-medical-8b-gguf/ — the same four-axis card rendered on this site, with the sweet-spot variant highlighted on a heatmap row.