Quant · GGUF · 5 variants

ii-medical-8b-gguf

Quantization of Intelligent-Internet/II-Medical-8B.

HF: Orionfold/II-Medical-8B-GGUF · License: apache-2.0 (free) · Published

What this model does

Intelligent-Internet's II-Medical-8B is a Qwen3-8B base carrying an SFT + DAPO reasoning recipe: it walks through a differential inside a think block before it answers. That is what a clinical-Q&A console wants, and what a 15.3 GB checkpoint can't deliver on a small consumer GPU. This release ships five GGUF variants, from Q4_K_M (4.68 GB, 43.6 tok/s) up to F16, so the reasoning loop runs offline. Each variant comes with a four-axis card measured on a DGX Spark: wikitext-2 perplexity, sustained tok/s, thermal-envelope minutes, and a MedMCQA score. Orionfold's contribution is the distribution and measurement layer; Intelligent-Internet did the reasoning fine-tune.

Use cases

  • A local clinical-Q&A console behind your own retrieval layer, fully offline
  • Medical-reasoning experiments where the visible think-chain is the point
  • Picking a quant variant by workload shape, not just RAM budget

Audience: local-LLM power users and clinical-informatics builders who want an offline medical-reasoning model on a consumer GPU with the reasoning trace visible. This is not a hosted API and not a medical device.

Spec matrix

Ranks within each column drive the heatmap. Lower perplexity, higher throughput, and a higher vertical-eval score are better; the sweet-spot row balances all three.

Vertical bench: MedMCQA (n=50, mcq_letter)

Variant   Perplexity   Spark tok/s   Vertical eval   Note
Q4_K_M    16.5500      43.57         0.42
Q5_K_M    16.2418      36.36         0.52            Sweet spot
Q6_K      16.0139      32.80         0.46
Q8_0      16.2957      28.42         0.48
F16       16.2676      15.94         0.48
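
One way to make "ranks within each column" concrete is a rank sum. The sketch below is an illustration only: the page does not say how the sweet spot is chosen, so the equal-weight rank sum, the tie handling, and all names here are assumptions. The numbers are copied from the spec matrix.

```python
# Values copied from the spec matrix on this page.
VARIANTS = {
    "Q4_K_M": {"perplexity": 16.5500, "tok_s": 43.57, "medmcqa": 0.42},
    "Q5_K_M": {"perplexity": 16.2418, "tok_s": 36.36, "medmcqa": 0.52},
    "Q6_K":   {"perplexity": 16.0139, "tok_s": 32.80, "medmcqa": 0.46},
    "Q8_0":   {"perplexity": 16.2957, "tok_s": 28.42, "medmcqa": 0.48},
    "F16":    {"perplexity": 16.2676, "tok_s": 15.94, "medmcqa": 0.48},
}

# True means "higher is better" for that column.
HIGHER_IS_BETTER = {"perplexity": False, "tok_s": True, "medmcqa": True}


def column_ranks(column):
    """Rank 1 is best; tied values share the same rank (assumed tie rule)."""
    higher = HIGHER_IS_BETTER[column]
    values = {name: row[column] for name, row in VARIANTS.items()}
    ranks = {}
    for name, value in values.items():
        # Count how many variants are strictly better in this column.
        better = sum(
            1
            for other in values.values()
            if (other > value if higher else other < value)
        )
        ranks[name] = better + 1
    return ranks


def rank_sums():
    """Sum the per-column ranks with equal weights (an assumption)."""
    totals = {name: 0 for name in VARIANTS}
    for column in HIGHER_IS_BETTER:
        for name, rank in column_ranks(column).items():
            totals[name] += rank
    return totals


if __name__ == "__main__":
    for name, total in sorted(rank_sums().items(), key=lambda kv: kv[1]):
        print(f"{name:8s} rank sum = {total}")
```

Under these assumptions the lowest rank sum belongs to Q5_K_M, which matches the row marked as the sweet spot. A workload that cares mostly about throughput or mostly about perplexity would weight the columns differently and could land on another variant.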

Methods

Field note: Orionfold/II-Medical-8B-GGUF on Spark: five medical-reasoning variants, MedMCQA mini-eval, ChatML reasoning format

Five GGUF variants of Intelligent-Internet/II-Medical-8B (Qwen3-8B + DAPO reasoning recipe) measured on a DGX Spark. Q5_K_M lands at 36.4 tok/s, 5.45 GB, and 52% on a MedMCQA n=50 mini-eval, above F16. First reasoning recipe in the series.

Known drift

Disclosed limitations with explicit bounds — the scope is named, not implied.

Reasoning models need a generous n_predict (≥1024)
A clinical-MCQ reasoning trace runs 400–800 tokens before the closing think tag, and the answer is a single token after it. At n_predict=256 the budget runs out mid-differential and the answer never lands, so set n_predict to 1024 or more. This is a measurement gotcha, not a model defect.
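
A minimal sketch of guarding against this in an eval harness follows. It assumes the closing think tag is the literal `</think>` (the Qwen3 convention; the page does not spell it out), and the helper names are hypothetical. The `prompt` and `n_predict` fields follow the request body of the llama.cpp server's completion endpoint; other runtimes name the limit differently.

```python
THINK_CLOSE = "</think>"  # assumption: Qwen3-style closing think tag
MIN_N_PREDICT = 1024      # lower bound recommended on this page


def build_request(prompt, n_predict=MIN_N_PREDICT):
    """Build a completion request body with a generous token budget."""
    if n_predict < MIN_N_PREDICT:
        raise ValueError(
            f"n_predict={n_predict} is likely to cut the reasoning trace short; "
            f"use at least {MIN_N_PREDICT}"
        )
    return {"prompt": prompt, "n_predict": n_predict}


def extract_answer(completion):
    """Return the text after the closing think tag.

    Returns None when the tag is missing, which means the token budget
    ran out before the model finished reasoning.
    """
    _, tag, tail = completion.partition(THINK_CLOSE)
    if not tag:
        return None
    return tail.strip() or None


if __name__ == "__main__":
    finished = "<think>Microcytic anemia, low ferritin ...</think>\nB"
    truncated = "<think>Microcytic anemia, low ferr"
    print(extract_answer(finished))   # B
    print(extract_answer(truncated))  # None
```

Scoring a `None` as "unanswered" instead of "wrong" keeps a too-small budget from being misread as a drop in model accuracy.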
MedMCQA accuracy ceiling (8B, n=50 mini-eval)
MedMCQA (n=50, mcq_letter) lands at 42–52% across the five variants, peaking at Q5_K_M (26/50). This reflects an 8B reasoning ceiling on a 50-question mini-eval, not a quantization failure. The numbers are indicative, not a clinical validation.
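
To see why a 50-question score is only indicative, the sketch below converts the reported scores back to correct-answer counts and estimates the sampling noise. The normal-approximation 95% interval is an assumption chosen for simplicity, not a method described on this page.

```python
import math

N_QUESTIONS = 50

# Vertical-eval scores copied from the spec matrix on this page.
SCORES = {"Q4_K_M": 0.42, "Q5_K_M": 0.52, "Q6_K": 0.46, "Q8_0": 0.48, "F16": 0.48}


def correct_count(score, n=N_QUESTIONS):
    """Number of correct answers implied by an accuracy score."""
    return round(score * n)


def half_width_95(score, n=N_QUESTIONS):
    """Half-width of a normal-approximation 95% interval (assumed method)."""
    return 1.96 * math.sqrt(score * (1.0 - score) / n)


if __name__ == "__main__":
    for name, score in SCORES.items():
        print(
            f"{name:8s} {correct_count(score):2d}/{N_QUESTIONS} "
            f"= {score:.2f} +/- {half_width_95(score):.2f}"
        )
```

Under this approximation the interval around Q5_K_M's 26/50 is roughly plus or minus 14 percentage points, which is wider than the 10-point spread between the best and worst variants. The ordering of variants on this eval should therefore be read as suggestive only.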
Not medical advice
This is an 8B reasoning model inherited from the upstream base. It is meant for study, retrieval-grounded drafting, and triage UX, not for diagnosis or treatment decisions. No clinical-grade validation is claimed.