PAATRA
What if small models are wasting most of their parameters?
The Problem
Take a 270-million-parameter language model — small enough to run on a phone. Now look at where those parameters actually go. In Gemma 3 270M, 170 million of the 270 million parameters are a vocabulary embedding table: a lookup structure that maps 256,000 tokens to vectors. That's 63% of the model dedicated to knowing how to represent language. The remaining 100 million parameters do everything else — reasoning, knowledge, task performance. The model is spending more capacity on its dictionary than on its brain.
This isn't a quirk of one model. Llama 3.2 1B dedicates roughly 21% of its parameters to embeddings. Gemma 2 2B sits around 29%. The pattern is consistent: the smaller the model, the worse the ratio gets. And the reason is always the same. These small models inherit their vocabulary from much larger siblings — models 10x or 100x their size — that were designed for broad multilingual coverage across every domain. The vocabulary was built for a different job.
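To make the arithmetic concrete, here is a small Python sketch that computes the embedding share from a model's vocabulary size and embedding dimension. The Gemma 3 270M figures used below are approximate public config values, so treat the exact counts as illustrative rather than official accounting.

```python
# Back-of-the-envelope check of the embedding share. The config numbers are
# approximate public figures (vocab size, embedding dim, total parameters).
def embedding_share(vocab_size: int, d_model: int, total_params: float):
    """Parameters in the token embedding table, and their share of the total."""
    embed_params = vocab_size * d_model
    return embed_params, embed_params / total_params

# Gemma 3 270M: ~262k-token vocabulary, 640-dim embeddings, ~268M params total.
embed, share = embedding_share(262_144, 640, 268e6)
print(f"embedding: {embed / 1e6:.0f}M params, {share:.0%} of the model")  # ~168M, ~63%
```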
Google noticed this. The Gemma 2 technical report explicitly acknowledges that the large embedding counts come from inheriting the Gemini vocabulary. They noticed. They didn't fix it. And the scaling laws the field relies on — Chinchilla and its successors — don't address it either. Chinchilla tells you the optimal total number of parameters given your compute budget. It says nothing about how to split those parameters between vocabulary and reasoning. It treats all parameters as fungible. They aren't. If you're deploying a model to tutor grade 5 students in English and math, you don't need 256,000 tokens. You need maybe 10,000. The rest is a guest room nobody uses, in an apartment that's already too small.
What We're Exploring
We think vocabulary size and reasoning capacity inside a small model are competing for the same fixed budget — and the field has been getting the allocation wrong.
The argument is orthogonal to existing scaling laws. Chinchilla optimizes total parameter count given compute. We're investigating the split within a fixed parameter budget: how many parameters should go to vocabulary, and how many to the transformer layers that actually reason? For a bounded-complexity task — say, grade 5 math or reading comprehension — there should be a point where vocabulary saturates. Adding more tokens doesn't help. Every parameter above that saturation point is capacity that could have gone to reasoning instead.
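A rough way to see what's at stake in that split: at a fixed embedding dimension, every token dropped from the vocabulary frees d_model parameters, and a standard transformer block costs on the order of 12 * d_model^2 parameters (attention projections plus a 4x-wide feed-forward layer, ignoring norms and biases). The sketch below uses Gemma-3-270M-like dimensions and the 10,000-token figure from above; both the per-block estimate and the target vocabulary are illustrative assumptions, not measured numbers.

```python
# Illustrative reallocation: shrink the vocabulary, spend the freed parameters
# on extra transformer blocks. Uses the rough ~12 * d_model^2 per-block estimate
# (attention projections + 4x FFN), ignoring norms, biases, and tied heads.
def layers_bought(old_vocab: int, new_vocab: int, d_model: int):
    freed = (old_vocab - new_vocab) * d_model   # parameters released from the embedding table
    per_layer = 12 * d_model ** 2               # approximate parameters per transformer block
    return freed, freed // per_layer

freed, extra = layers_bought(262_144, 10_000, 640)
print(f"freed {freed / 1e6:.0f}M params -> ~{extra} extra blocks at d_model=640")  # ~161M, ~32
```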
The picture we're working toward: map the vocabulary-capacity curve across domains and model sizes. Find the inflection points. Show that a model with a right-sized vocabulary and the freed parameters reinvested in reasoning depth can match a much larger model on tasks within its domain.
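One cheap proxy for where a domain's vocabulary saturates is tokenization efficiency: train BPE tokenizers of increasing size on an in-domain corpus and watch tokens-per-word stop improving. The sketch below uses the Hugging Face `tokenizers` library and is not the project's actual methodology; the corpus files are hypothetical, and the curve we ultimately care about is downstream task performance, for which this is only a proxy.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Sweep vocabulary sizes and measure average tokens per whitespace word on
# held-out in-domain text. A flattening curve is one (proxy) signal that the
# vocabulary has saturated for the domain. The file names are placeholders.
def tokens_per_word(vocab_size: int, train_file: str, heldout_text: str) -> float:
    tok = Tokenizer(BPE(unk_token="[UNK]"))
    tok.pre_tokenizer = Whitespace()
    tok.train(files=[train_file],
              trainer=BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"]))
    n_tokens = len(tok.encode(heldout_text).tokens)
    n_words = len(heldout_text.split())
    return n_tokens / n_words

heldout = open("grade5_heldout.txt").read()
for v in (2_000, 5_000, 10_000, 20_000, 50_000):
    print(v, round(tokens_per_word(v, "grade5_corpus.txt", heldout), 3))
```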
Getting there raises questions we find genuinely open.
We're testing with a sub-100M student model distilled from a 3B teacher, evaluated against grade 5 benchmarks in English, math, and science. The goal is parity with the teacher on bounded-complexity tasks — not because the small model is smarter, but because the large model's vocabulary is wasted capacity at that level.
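The writeup doesn't pin down the distillation objective, and the vocabulary mismatch between a 256k-token teacher and a right-sized student rules out naive logit matching. A minimal sketch of one workable route is sequence-level distillation: the teacher generates in-domain text, the student re-tokenizes it with its compact vocabulary and trains with ordinary next-token cross-entropy. Everything here (the model wrappers, the tokenizer, the generation helper) is a placeholder in a Hugging Face-style interface, not the project's code.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher_generate, student, student_tokenizer, prompts, optimizer):
    """One sequence-level distillation step. `teacher_generate` is an assumed
    callable that maps a list of prompts to a list of generated strings."""
    # 1. Teacher produces in-domain text (e.g. grade 5 math explanations).
    with torch.no_grad():
        teacher_text = teacher_generate(prompts)

    # 2. Re-tokenize with the student's compact, domain-sized vocabulary.
    batch = student_tokenizer(teacher_text, return_tensors="pt",
                              padding=True, truncation=True)
    input_ids = batch["input_ids"]

    # 3. Ordinary next-token cross-entropy on the teacher's outputs.
    logits = student(input_ids=input_ids).logits          # (batch, seq, student_vocab)
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
        ignore_index=student_tokenizer.pad_token_id,
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```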