Quantizing Gemma 4's MTP drafters

A while after Gemma 4 came out, Google released a set of Multi-Token-Prediction (MTP) drafters for it. Pairing a large model (the verifier) with a tiny MTP drafter (the assistant) enables speculative decoding: the small model guesses the next few tokens, and the big model verifies them in one parallel pass. When it works, the extra throughput is close to free.
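The draft-then-verify loop can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how ik_llama.cpp implements it: `toy_verifier` and `toy_drafter` are made-up stand-ins for the two models, decoding is greedy, and the verifier is called position by position where a real runtime would score all drafted positions in one batched pass.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_decode(verifier: NextToken, drafter: NextToken,
                       prompt: List[Token], n_new: int,
                       draft_max: int = 3) -> Tuple[List[Token], int]:
    """Greedy speculative decoding. Returns (tokens, verifier_passes)."""
    out = list(prompt)
    target_len = len(prompt) + n_new
    passes = 0
    while len(out) < target_len:
        # 1. The drafter guesses draft_max tokens, one after another.
        draft: List[Token] = []
        for _ in range(draft_max):
            draft.append(drafter(out + draft))
        # 2. One verifier pass checks every guess (simulated per position here).
        passes += 1
        accepted: List[Token] = []
        for i in range(draft_max + 1):
            expected = verifier(out + accepted)
            accepted.append(expected)  # the verifier's own token is always kept
            if i == draft_max or draft[i] != expected:
                break  # all guesses used up, or first wrong guess
        out.extend(accepted)
    return out[:target_len], passes


def greedy_decode(verifier: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Verifier-only baseline: one pass per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(verifier(out))
    return out


# Hypothetical toy models: the drafter agrees with the verifier except
# whenever the context length is a multiple of 3.
def toy_verifier(ctx: List[Token]) -> Token:
    return (ctx[-1] * 3 + 1) % 11


def toy_drafter(ctx: List[Token]) -> Token:
    guess = toy_verifier(ctx)
    return (guess + 1) % 11 if len(ctx) % 3 == 0 else guess
```

With greedy verification the output is identical to what the verifier would produce alone; only the number of verifier passes changes, which is where the speedup comes from.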

In short, MTP drafters are one of the rare free lunches we get with LLMs. On paper, at least; as with most free lunches, that is the assumption.

These are notes on getting that working locally. The previous post was about why this should have been easier than it was.


If you try to convert the Gemma 4 MTP drafter weights using mainline ggml-org/llama.cpp, the converter rejects them:

Model Gemma4AssistantForCausalLM is not supported

Mainline supports the base Gemma 4 models, but it does not know what a gemma4_assistant is. The converter lacks the class, and the runtime lacks the MTP API. The runtime PR #18886 has been in draft for months. A companion converter PR #22738 was opened and closed unmerged with one of four tasks done.

The working path is a fork: ikawrakow/ik_llama.cpp, specifically the feat/gemma-4-mtp branch from PR #1744. The resulting GGUFs only load in that fork. Downstream wrappers — Ollama, LM Studio, jan.ai, llama-cpp-python — will all reject them until they backport the changes.

This is the cost of running new architectures the week they ship: you swap inference backends to match the model, not the other way around.


Drafters are tiny. The E2B drafter’s MTP head is 78M parameters; the 31B’s drafter is roughly 470M; the 26B-A4B’s is about 408M.

Their job is to predict tokens accurately enough that the verifier accepts them, so acceptance rate dominates everything else. Aggressive quantization makes the verifier reject more drafts, and past some point the pipeline becomes slower than running the verifier alone.
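A rough way to see why acceptance dominates is a back-of-the-envelope estimate. The sketch below assumes each drafted token is accepted independently with the same probability and that a verifier pass costs the same whether it checks one token or several; both are simplifications, and `draft_cost` (one drafter step relative to one verifier pass) is a hypothetical parameter you would have to measure.

```python
def expected_tokens_per_pass(accept: float, draft_max: int) -> float:
    """Expected tokens emitted per verifier pass: 1 + a + a^2 + ... + a^k.

    Assumes every drafted token is accepted independently with
    probability `accept` (a simplification).
    """
    if accept >= 1.0:
        return float(draft_max + 1)
    return (1.0 - accept ** (draft_max + 1)) / (1.0 - accept)


def estimated_speedup(accept: float, draft_max: int, draft_cost: float) -> float:
    """Estimated speedup over verifier-only decoding.

    draft_cost: time of one drafter step relative to one verifier pass
    (hypothetical input, not a measured value).
    """
    cycle_cost = draft_max * draft_cost + 1.0  # k drafter steps + 1 verifier pass
    return expected_tokens_per_pass(accept, draft_max) / cycle_cost


if __name__ == "__main__":
    # Made-up inputs, purely to show the shape of the trade-off.
    for accept in (0.9, 0.6, 0.3):
        print(accept, round(estimated_speedup(accept, draft_max=3, draft_cost=0.3), 2))
```

Under these assumptions the estimate falls below 1.0 once acceptance is low enough relative to the drafter's cost, which is the "slower than the verifier alone" case.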

The quantization policy for drafters is the inverse of what you’d do for a large base model:

imatrix calibration is not available for these drafters at the moment. PR #1744 builds the gemma4_mtp drafter graph with a hardcoded GGML_ASSERT(has_target_ctx), so standalone llama-imatrix runs abort in llama_decode before producing anything. The IK fork’s signature quants (IQ4_KS, IQ4_KSS, IQ5_KS) derive their precision-per-bit advantage from imatrix-guided scale selection, so without an imatrix they collapse to roughly K-quant quality at the same bit budget. Shipping them under those labels would be misleading.

Quantizing directly from bf16/f16 to Q4_K_M/Q8_0 is the realistic option until the tooling catches up. At ≥4 bits the precision gap from missing the imatrix is small — single-digit percent on most benchmarks. At Q3 and below it would matter, which is the other reason Q3 isn’t a sensible target here.

One empirical surprise: on a 4060 Laptop GPU, pairing unsloth’s Gemma-4-E4B-it-Q4_K_M.gguf verifier against both a Q8_0 and a Q4_K_M drafter at --draft-max 3, the Q4_K_M drafter beat the Q8_0 by ~13% in total throughput. The smaller drafter’s faster draft step outweighed the acceptance-rate cost. The “always pick the highest-bit drafter” rule assumes you are memory-bandwidth-rich and compute-poor; on a laptop GPU you are neither.

Benchmark your own silicon.


llama-quantize silently embeds the absolute on-disk paths of your imatrix file and calibration corpus into the GGUF metadata. There is no flag to disable this at quantize time. The *-f16.gguf intermediate is clean; only quantized files have the leak.

Inspect any quantized GGUF and the two offending keys are sitting there:

nix-shell -p python3Packages.gguf --run \
  'gguf-dump --no-tensors model.gguf'

| Key | What it leaks |
| --- | --- |
| quantize.imatrix.file | Absolute path to the imatrix file on the build host |
| quantize.imatrix.dataset | Absolute path (and filename) of the calibration corpus |

Uploading the raw output of a quantization pipeline publishes your infrastructure layout — usernames, ZFS pool names, directory structure — alongside the weights.

The fix is gguf_new_metadata from the Python gguf package, which rewrites the file with selected keys removed without touching the tensors:

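nix-shell -p python3Packages.gguf --run '
  for f in /srv/models/<MODEL>-GGUF/*.gguf; do
    # Skip the clean f16 intermediate and any already-sanitized outputs.
    # case does glob matching; [ "$f" = "*-f16.gguf" ] would compare literally.
    case "$f" in
      *-f16.gguf|*-clean.gguf) continue ;;
    esac
    python3 -m gguf.scripts.gguf_new_metadata \
      --remove-metadata quantize.imatrix.file \
      --remove-metadata quantize.imatrix.dataset \
      --force \
      "$f" "${f%.gguf}-clean.gguf"
  done
'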

Each rewrite takes roughly 5–15 s per GiB and shrinks the file only by the size of the two removed entries (≈190 bytes).

Verify before uploading:

gguf-dump --no-tensors model-clean.gguf | grep -v INFO

kv_count should be exactly 2 less than the original. The quantize.imatrix.*_count integer fields stay (they carry no path information). A smoke test confirms the rewritten file still loads:

llama-completion -m model-clean.gguf --jinja \
  -p 'The capital of France is' -n 20 --temp 0

Coherent output means the rewrite didn’t corrupt anything.
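The key-count check can also be scripted instead of eyeballed. A minimal sketch, assuming the Python gguf package is importable and that `GGUFReader(path).fields` exposes the metadata keys by name; the helper names `check_sanitized` and `metadata_keys` are made up for this example.

```python
LEAKY_KEYS = {"quantize.imatrix.file", "quantize.imatrix.dataset"}


def check_sanitized(original_keys, cleaned_keys):
    """Return a list of problems; an empty list means only the leaky keys went away."""
    original, cleaned = set(original_keys), set(cleaned_keys)
    problems = []
    for key in sorted(LEAKY_KEYS & cleaned):
        problems.append(f"still present: {key}")
    for key in sorted((original - cleaned) - LEAKY_KEYS):
        problems.append(f"unexpectedly removed: {key}")
    for key in sorted(cleaned - original):
        problems.append(f"unexpectedly added: {key}")
    return problems


def metadata_keys(path):
    """Read the metadata key names of a GGUF file."""
    # Imported lazily so check_sanitized stays usable without the package.
    from gguf import GGUFReader
    return set(GGUFReader(path).fields.keys())


if __name__ == "__main__":
    import sys
    # Usage: python check_clean.py model.gguf model-clean.gguf
    issues = check_sanitized(metadata_keys(sys.argv[1]), metadata_keys(sys.argv[2]))
    print("\n".join(issues) if issues else "ok: only the two path keys were removed")
```

Comparing key sets rather than only the count also catches the case where a rewrite drops some other key by accident.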


Sanitized quants in hand, the next step is pairing. A drafter is not a standalone language model: it has to be paired with a verifier that shares its exact vocabulary. For Gemma 4 that vocabulary is 262,144 tokens.

That shared vocabulary is also the reason the MTP speedup applies to structured output. The drafter sees the same <|tool_call|>, <|tool_response|>, and <|channel|> tokens as the verifier, so tool-calling agents and structured-generation workflows benefit on the full response, not just the natural-language parts. The vocabulary itself isn't quantized; it lives in a separate GGUF section.

A working ik_llama.cpp invocation:

llama-server \
    --model gemma-4-31B-it-Q8_0.gguf \
    --model-draft gemma-4-31B-it-assistant-Q8_0.gguf \
    --spec-type mtp \
    --draft-max 3 \
    --draft-p-min 0.0 \
    -ngld 99 \
    --n-gpu-layers 99 \
    --ctx-size 32768 \
    -ctk q8_0 -ctv q8_0 \
    -b 1024 -ub 1024 \
    --jinja

The relevant levers:

PR #1744’s reference numbers, 31B verifier + 31B drafter at Q8_0/Q8_0 on a data-center GPU:

| Run | Throughput | Acceptance |
| --- | --- | --- |
| Baseline (no MTP) | ~21 t/s | n/a |
| MTP --draft-max 1 | ~35 t/s | ~89% |
| MTP --draft-max 2 | ~44 t/s | ~83% |
| MTP --draft-max 3 | ~49 t/s | ~74% |
| MTP --draft-max 4 | ~49 t/s | ~64% |

A bit more than 2× throughput, after all of the above.
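For reference, the speedup implied by each row can be worked out directly from the reference numbers above. The inputs are the approximate figures quoted from the PR, so the ratios are approximate too.

```python
BASELINE_TPS = 21.0  # "Baseline (no MTP)" row, approximate
MTP_TPS = {1: 35.0, 2: 44.0, 3: 49.0, 4: 49.0}  # --draft-max -> t/s, approximate


def speedups(baseline, runs):
    """Throughput of each run relative to the baseline, rounded to 2 decimals."""
    return {draft_max: round(tps / baseline, 2) for draft_max, tps in runs.items()}


if __name__ == "__main__":
    for draft_max, ratio in speedups(BASELINE_TPS, MTP_TPS).items():
        print(f"--draft-max {draft_max}: {ratio:.2f}x")
```

The ratio stops improving between --draft-max 3 and 4: the extra drafted token is accepted too rarely to pay for itself.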


The throughput gains are real, and on the right hardware they are substantial. The work it takes to surface them is also real: undocumented CLI divergences, an obscure fork, manual metadata sanitization, quantization rules that invert the usual advice. None of this is in the model card or the upstream README. The path to a working setup is reconstructed from PR threads and stack traces.

The next post in this series takes the result and runs it on a 2016 Xeon with no GPU, which is where the command line stops being a paragraph.