Open weights are not open source


I’ve been quantizing Gemma 4 and its MTP drafters recently. The next two posts are about how that went. This one is about why it was so much work.

When a lab releases a model under an open license, what they almost always release is the weights. Sometimes a tokenizer. Occasionally a short technical report. Rarely the training data, data-processing code, training code, evaluation harness, or the recipes used to produce the published quantizations. People still call this open source, and I think that’s misleading.

If you can’t rebuild it, you can’t fork it. If you can’t fork it, the word doesn’t mean what it used to.

Calling it open is like calling an executable binary open because you can execute it. Unless you consider assembly code open source, I think it’s obvious why this misses the mark. It is excellent marketing spin, though.


The people releasing models don’t seem to feel responsible for whether anyone can run them or get good results. It’s more like box ticking: “Here it is.”

There is no shared vocabulary for how a model should be packaged, or which inference engine it expects, or what quantization recipe was used to make the file on whatever hub they pick. The settings that actually decide whether a model is usable on a given machine (mlock, KV-cache dtype, flash attention, MoE routing, draft model pairing) live in PR descriptions and Discord screenshots. The model card lists the benchmark scores and links a Hugging Face Space.
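To make that concrete, here is a minimal sketch of what a shared vocabulary could look like: a small record of the settings a model was actually tested with, plus a way to diff it against what you are running. Everything in it is my own assumption for illustration. The `RunManifest` class, its field names, and the file names are hypothetical, not something any lab or runtime publishes.

```python
import json
from dataclasses import asdict, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical record of the settings a model was tested with."""
    model_file: str
    runtime: str                 # name of the inference engine
    runtime_version: str         # commit or release tag it was tested against
    quantization: str
    context_size: int
    batch_size: int
    mlock: bool
    kv_cache_dtype: str
    flash_attention: bool
    draft_model_file: Optional[str] = None  # paired drafter, if any


def to_json(manifest: RunManifest) -> str:
    """Serialize with sorted keys so two manifests diff cleanly as text."""
    return json.dumps(asdict(manifest), indent=2, sort_keys=True)


def diff(a: RunManifest, b: RunManifest) -> dict:
    """Return {field: (a_value, b_value)} for every setting that differs."""
    da, db = asdict(a), asdict(b)
    return {key: (da[key], db[key]) for key in da if da[key] != db[key]}


# Placeholder values only; none of these describe a real release.
tested = RunManifest(
    model_file="model.gguf",
    runtime="llama.cpp",
    runtime_version="unknown",
    quantization="Q4_0",
    context_size=8192,
    batch_size=512,
    mlock=True,
    kv_cache_dtype="f16",
    flash_attention=True,
    draft_model_file="drafter.gguf",
)

# What I happen to be running locally, differing in two settings.
mine = replace(tested, kv_cache_dtype="q8_0", flash_attention=False)

print(to_json(tested))
print(diff(tested, mine))
```

The point is not this particular schema. If something like it shipped next to the weights, "is it the model or my setup?" would turn into a diff instead of a guess.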

So you download the weights, get a result that’s slow or incoherent or both, and you don’t know whether the model is bad, your quantization is bad, your runtime is bad, or you’ve just configured something wrong. Most people, reasonably, conclude it’s just their hardware, or that open models don’t actually work for normal users.


The tools that have grown up to paper over this make the problem worse, not better. Ollama is the obvious example. It picks a quantization, runtime, context and batch sizes for you, and doesn’t tell you what it picked or why. When the output is bad, there’s no thread to pull. You can’t tell if you’re holding it wrong because the thing was designed so you can’t hold it at all. It’s just a black box!

This wouldn’t matter much if the defaults were good. They aren’t. A default Q4_0 of a small model with no imatrix calibration and a mismatched chat template will be much worse than the same model run carefully, and the user has no way of knowing the difference exists.
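A toy sketch of why calibration matters. This is not llama.cpp’s Q4_0 or its imatrix code, only a simplified stand-in I made up to show the idea: one block of weights, symmetric 4-bit rounding, and a scale chosen either from the largest value alone or by searching for the scale that minimizes error on the weights that matter most. The numbers and the `importance` values are invented.

```python
from typing import List, Sequence


def quantize_block(values: Sequence[float], scale: float) -> List[float]:
    """Round each value to a 4-bit integer in [-8, 7], then scale it back."""
    if scale == 0.0:
        return [0.0 for _ in values]
    restored = []
    for v in values:
        q = max(-8, min(7, round(v / scale)))
        restored.append(q * scale)
    return restored


def weighted_error(values, restored, importance) -> float:
    """Sum of squared errors, each weighted by how much that weight matters."""
    return sum(w * (v - r) ** 2 for v, r, w in zip(values, restored, importance))


def naive_scale(values: Sequence[float]) -> float:
    """Pick the scale from the largest magnitude only, ignoring importance."""
    return max(abs(v) for v in values) / 7.0


def calibrated_scale(values, importance) -> float:
    """Try scales around the naive one and keep the lowest weighted error.

    The naive scale (factor 1.0) is always a candidate, so the result is
    never worse than the naive choice under this error measure.
    """
    base = naive_scale(values)
    factors = [1.0] + [0.5 + 0.02 * i for i in range(51)]
    return min(
        (base * f for f in factors),
        key=lambda s: weighted_error(values, quantize_block(values, s), importance),
    )


# Invented block: one large outlier that barely matters, and several small
# weights that the (made-up) calibration data says matter a lot.
weights = [0.9, 0.05, -0.04, 0.03, -0.06, 0.02, 0.01, -0.03]
importance = [0.01, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

s_naive = naive_scale(weights)
s_cal = calibrated_scale(weights, importance)
err_naive = weighted_error(weights, quantize_block(weights, s_naive), importance)
err_cal = weighted_error(weights, quantize_block(weights, s_cal), importance)

print(f"naive scale {s_naive:.4f} -> weighted error {err_naive:.5f}")
print(f"calibrated scale {s_cal:.4f} -> weighted error {err_cal:.5f}")
```

With the naive scale the outlier survives and every small weight rounds to zero. The calibrated scale gives up some precision on the outlier to keep the weights that matter. Real quantizers are far more involved, but the trade is the same kind of thing, and a default download never tells you which side of it you got.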

AND NO ONE REALLY TALKS ABOUT HOW MUCH THIS MATTERS!


Concretely: a while after Gemma 4 came out, Google released a set of Multi-Token-Prediction drafters for it, designed to make the base models faster via speculative decoding. The obvious thing to do is download the GGUF and run it under llama.cpp.

That doesn’t work, and the reasons it doesn’t work are a fair sample of the genre:

None of this is in the model card. None of it is in the upstream README. You find it by trying things and reading stack traces.

No one does this. Seriously. Yes, part of it is that most people just use whatever the few who did bother have published, which is fair. But there are about four quants of this on Hugging Face at the time of writing, and one of them is mine. FOUR! For state-of-the-art MTP/MoE models!


The hardware to run a 30B-class MoE at reading speed isn’t exotic. A five-year-old workstation with enough RAM will do it. The bottleneck isn’t the silicon, it’s everything between the weights file and a working configuration: which fork, which quant, which flags, which template, which version of which template. Oh, and actually getting it all running? Good luck, unless you’re literally an infrastructure engineer. Most people give up somewhere in the middle and go back to the API, and I don’t blame them.

I do think the labs releasing these models could make this much less bad than they have. Publishing the imatrix corpus would help. Publishing the quantization recipe would help. Picking one reference runtime and keeping the model working on it would help. Treating the deployment story as part of the release, rather than something the community is expected to reconstruct, would help.

Until that changes, “open source” for language models mostly means “someone else has to figure out how to run it, and none of the freedoms that used to mean open source are actually present”, and the people best positioned to figure it out are charging by the token.


I’ll write up the actual quantization steps — which flags, which traps, what the metadata leak looks like, how to scrub it — and what it took to run the result on a 2016 Xeon with no GPU, in two follow-up posts this week. Thanks for reading :)