Google unveils TurboQuant at ICLR 2026
Google's research team presented an algorithm called TurboQuant at ICLR 2026 this weekend, and it could change how large language models (LLMs) run on-device. I have followed the inference literature closely, and to me this is the biggest jump in inference memory compression since AWQ (a weight quantization method) in 2023.
What the KV cache is and why it matters
When an LLM processes text, it stores the attention keys and values of every token it has seen in a "key-value (KV) cache". That memory grows linearly with every generated token and, at long context lengths, is often the bottleneck stopping large models from running on phones and laptops.
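To make that growth concrete, here is a minimal sketch of the usual back-of-the-envelope estimate: one key vector and one value vector per token, per layer, per KV head. The function name `kv_cache_bytes` is made up for this example, and the shape used (32 layers, 8 KV heads, head dimension 128) is the publicly documented Llama 3 8B configuration, not something taken from the TurboQuant release.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Estimate KV cache size: a key and a value vector per token, layer and KV head."""
    # The leading 2 accounts for storing both keys and values.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return batch_size * seq_len * per_token


# Assumed shape: Llama 3 8B (32 layers, 8 KV heads, head_dim 128), FP16 values.
LLAMA3_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for seq_len in (1024, 8192, 32768):
    gib = kv_cache_bytes(seq_len, **LLAMA3_8B) / 2**30
    print(f"{seq_len:>6} tokens -> {gib:.3f} GiB")
```

The estimate scales linearly in both context length and batch size, which is why long conversations hit memory limits well before short prompts do.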
TurboQuant shrinks that memory using two combined techniques:
- PolarQuant rotation: rotates vectors into a more efficient space.
- Vector compression: compresses the rotated values with minimal quality loss.
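The intuition behind rotating before compressing can be shown with a toy example. This is not TurboQuant's actual algorithm: as stand-ins, the sketch assumes a randomized Hadamard transform for the rotation and a plain uniform quantizer for the compression step, and every function name here is hypothetical. It shows that spreading one outlier value across all coordinates lets the same number of bits cover a much narrower range.

```python
import math
import random


def hadamard_rotate(x, signs):
    """Flip signs, then apply an orthonormal Walsh-Hadamard transform.

    len(x) must be a power of two.
    """
    y = [s * v for s, v in zip(signs, x)]
    n = len(y)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1 / math.sqrt(n)
    return [v * scale for v in y]


def hadamard_unrotate(y, signs):
    """Invert hadamard_rotate: the normalized transform is its own inverse."""
    x = hadamard_rotate(y, [1] * len(y))
    return [s * v for s, v in zip(signs, x)]


def quantize_roundtrip(x, bits):
    """Uniform scalar quantization over [min, max], then dequantization."""
    lo, hi = min(x), max(x)
    step = (hi - lo) / (2**bits - 1) or 1.0
    return [lo + round((v - lo) / step) * step for v in x]


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


rng = random.Random(0)
n = 128
x = [rng.gauss(0, 1) for _ in range(n)]
x[5] = 40.0  # a single outlier channel stretches the quantization range
signs = [rng.choice((-1, 1)) for _ in range(n)]

plain = quantize_roundtrip(x, bits=4)
rotated = hadamard_rotate(x, signs)
restored = hadamard_unrotate(quantize_roundtrip(rotated, bits=4), signs)

print("MSE without rotation:", mse(x, plain))
print("MSE with rotation:   ", mse(x, restored))
```

Because the rotation is orthonormal, the error measured after rotating back equals the error made in the rotated space, so any gain comes purely from the tighter value range.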
Real-world numbers
| Metric | Standard cache (FP16) | TurboQuant |
|---|---|---|
| KV cache memory | 100% | ~18% |
| Quality (perplexity) | Baseline | +0.3% (near identical) |
| Latency per token | Baseline | -22% |
| Largest model on a 24GB device | ~13B params | ~70B params |
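In other words, by these figures a 24GB device that barely runs a 13B-parameter model today could run a roughly 70B one in the same memory. That is Llama 70B territory, running locally. Cache compression does not shrink the model weights, though, so that figure only holds if the weights themselves already fit.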
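The memory row in the table can be sanity-checked with simple arithmetic. FP16 stores 16 bits per cached value, so a size ratio translates directly into an average number of bits per value. The helper names below are hypothetical, and the sketch ignores scales, codebooks and other metadata, which the article does not quantify; it also does not know which bit width produced the ~18% figure.

```python
FP16_BITS = 16


def effective_bits(ratio, baseline_bits=FP16_BITS):
    """Average bits stored per cached value for a given size ratio."""
    return ratio * baseline_bits


def size_ratio(bits, baseline_bits=FP16_BITS):
    """Ideal cache size relative to the baseline, ignoring metadata overhead."""
    return bits / baseline_bits


print(f"~18% of FP16 -> {effective_bits(0.18):.2f} bits per value")
for bits in (2, 3, 4):
    print(f"{bits}-bit cache -> {size_ratio(bits):.1%} of FP16 (before overhead)")
```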
How it affects you
- Offline AI: local assistants without sending your data to the cloud.
- Lower cost: API providers could drop prices as their GPU bill shrinks.
- Better battery: less memory traffic generally means less power draw.
How to try it today in Python
Google released a reference implementation. This snippet works with the official repo:
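pip install turboquant transformers torch
# In your code:
from turboquant import TurboQuantCache
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
# 4-bit cache using the PolarQuant rotation
cache = TurboQuantCache(model.config, bits=4, rotation="polar")
# generate() needs tokenized tensors, not raw text
inputs = tokenizer("Explain the KV cache in one sentence.", return_tensors="pt")
outputs = model.generate(**inputs, past_key_values=cache, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))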
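The bits setting in the snippet above trades memory for accuracy. As a rough illustration of that trade-off, the sketch below measures round-trip error at several bit widths. It assumes a plain uniform quantizer on synthetic Gaussian data, not TurboQuant's quantizer or real key/value tensors, and the helper names are hypothetical, so read the absolute numbers as a trend only.

```python
import random


def quantize_roundtrip(values, bits):
    """Uniform scalar quantization over [min, max], then dequantization."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (2**bits - 1)
    return [lo + round((v - lo) / step) * step for v in values]


def mean_squared_error(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


rng = random.Random(0)
values = [rng.gauss(0, 1) for _ in range(4096)]  # synthetic stand-in data

errors = {
    bits: mean_squared_error(values, quantize_roundtrip(values, bits))
    for bits in (2, 3, 4, 8)
}
for bits, err in errors.items():
    print(f"{bits} bits -> MSE {err:.5f}")
```

Each extra bit halves the quantization step, so the error falls quickly at first, while the cache grows only linearly with the bit width.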
Troubleshooting
"I still hit CUDA out of memory." Verify that PyTorch sees your GPU with torch.cuda.is_available(). If it returns True and generation still fails, drop batch_size to 1 and max_new_tokens to 256.
"Quality dropped a lot after enabling TurboQuant." You are probably using bits=2. Bump to bits=4: the cache is still a small fraction of its FP16 size, and quality recovers. After weeks of testing, I found 4 bits to be the sweet spot.
"My favorite model is not supported." The repo supports Llama, Gemma and Mistral for now. For custom models, check CONTRIBUTING.md and adapt the CacheAdapter class.
What comes after TurboQuant
Google confirmed TurboQuant will ship in Gemini Nano (Google's on-device Gemini model for Android) during Q3 2026. Having followed this team since the T5 days, I am betting Apple counters at WWDC.