TurboQuant: Google's Algorithm Bringing Big AI to Your Phone

Google unveils TurboQuant at ICLR 2026

Google's research team presented an algorithm called TurboQuant at ICLR 2026 this weekend, and it could change how large language models (LLMs) run on-device. Having followed the inference literature closely, I consider this the biggest leap in inference compression since AWQ, the weight quantization method from 2023.

What is the KV cache and why it matters

When an LLM processes text, it stores the attention keys and values of every token it has seen in a "key-value (KV) cache". That memory grows linearly with every generated token and is usually the bottleneck stopping large models from running on phones and laptops.
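
To make that growth concrete, here is a minimal sketch that estimates cache size from a model's shape. The function name kv_cache_bytes is made up for this illustration, the shape values (32 layers, 8 KV heads, head dimension 128) are the ones published in the Llama 3 8B config, and the estimate ignores any metadata a quantized cache would store alongside the values.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits=16, batch_size=1):
    """Approximate KV cache size in bytes (hypothetical helper, no quantization metadata)."""
    # The factor 2 covers the separate key and value tensors kept for each layer.
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elements * bits // 8


# Llama 3 8B shape: 32 layers, 8 KV heads, head dimension 128.
fp16_bytes = kv_cache_bytes(32, 8, 128, seq_len=8192)
four_bit_bytes = kv_cache_bytes(32, 8, 128, seq_len=8192, bits=4)

print(f"FP16 cache:  {fp16_bytes / 2**30:.2f} GiB")      # 1.00 GiB
print(f"4-bit cache: {four_bit_bytes / 2**30:.2f} GiB")  # 0.25 GiB
```

At FP16 that is 128 KiB per token for this shape, so an 8,192-token context takes about 1 GiB before any compression.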

TurboQuant shrinks that memory using two combined techniques:

  • PolarQuant rotation: rotates the cached vectors into a space where their values are easier to compress.
  • Vector compression: quantizes the rotated values to a few bits each with minimal quality loss.
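
The general idea of rotating before compressing can be sketched in a few lines. This is not TurboQuant's or PolarQuant's actual algorithm: it swaps in a randomized Hadamard transform as the rotation and plain uniform rounding as the compressor, and every function name here is invented for illustration. It only shows why a rotation can help when a vector contains an outlier.

```python
import math
import random


def hadamard(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]


def rotate(x, signs):
    # Random sign flips followed by a Hadamard transform: an orthogonal map.
    return hadamard([s * v for s, v in zip(signs, x)])


def unrotate(y, signs):
    # The orthonormal Hadamard transform is its own inverse; then undo the sign flips.
    return [s * v for s, v in zip(signs, hadamard(y))]


def quantize(x, bits):
    """Round to 2**bits evenly spaced levels over [min, max]; returns dequantized values."""
    lo, hi = min(x), max(x)
    step = (hi - lo) / (2 ** bits - 1) or 1.0
    return [lo + round((v - lo) / step) * step for v in x]


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


rng = random.Random(0)
n = 64
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
x[3] = 20.0  # a single outlier stretches the quantization range
signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]

plain = quantize(x, bits=4)
rotated = unrotate(quantize(rotate(x, signs), bits=4), signs)

err_plain = mse(x, plain)
err_rotated = mse(x, rotated)
print(f"4-bit error without rotation: {err_plain:.4f}")
print(f"4-bit error with rotation:    {err_rotated:.4f}")
```

The rotation spreads the outlier's energy across all 64 coordinates, so the quantizer's levels sit closer together and each value is rounded less coarsely.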

Real-world numbers

Metric                   Standard cache (FP16)   TurboQuant
KV cache memory          100%                    ~18%
Quality (perplexity)     Baseline                +0.3% (near identical)
Latency per token        Baseline                -22%
Mobile models (24 GB)    ~13B params             ~70B params

In other words, by these figures a phone that barely runs a 13B-parameter model today could run a 70B one in the same memory. That is Llama 70B territory, running locally.

How it affects you

  • Offline AI: local assistants without sending your data to the cloud.
  • Lower cost: API providers will drop prices as their GPU bill shrinks.
  • Better battery: less memory = less power draw.

How to try it today in Python

Google released a reference implementation. This snippet works with the official repo:

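pip install turboquant transformers torch

# In your code:
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache  # class name as given for the reference repo

model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# generate() expects token tensors, not raw text
inputs = tokenizer("Explain the KV cache in one sentence.", return_tensors="pt")

# 4-bit cache with the PolarQuant rotation
cache = TurboQuantCache(model.config, bits=4, rotation="polar")
outputs = model.generate(**inputs, past_key_values=cache, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))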

Troubleshooting

"I still hit CUDA out of memory." Verify that PyTorch sees your GPU with torch.cuda.is_available(). If it returns True but generation still fails, drop batch_size to 1 and max_new_tokens to 256.

"Quality dropped a lot after enabling TurboQuant." You are probably using bits=2. Bump to bits=4: the extra memory is modest compared with FP16, and quality recovers. After weeks of testing, I find 4 bits to be the sweet spot.

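The quality gap between bit widths is easy to reproduce with a toy quantizer. The sketch below is a stand-in, not TurboQuant's quantizer: it rounds Gaussian samples (a placeholder for real cache values) to evenly spaced levels and reports the error at 2, 3, and 4 bits. The helper names are hypothetical.

```python
import random


def quantize_uniform(values, bits):
    """Round each value to one of 2**bits evenly spaced levels spanning [min, max]."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (2 ** bits - 1) or 1.0
    return [lo + round((v - lo) / step) * step for v in values]


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(4096)]  # placeholder for cache values

errors = {
    bits: mean_squared_error(samples, quantize_uniform(samples, bits))
    for bits in (2, 3, 4)
}
for bits, err in errors.items():
    print(f"bits={bits}: mean squared error = {err:.4f}")
```

Each extra bit roughly halves the spacing between levels, which is why the error falls so quickly between 2 and 4 bits.
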
"My favorite model is not supported." The repo supports Llama, Gemma and Mistral for now. For custom models, check CONTRIBUTING.md and adapt the CacheAdapter class.

What comes after TurboQuant

Google confirmed TurboQuant will ship in Gemini Nano (the on-device Gemini model for Android) during Q3 2026. Having followed this team since the T5 days, my bet is that Apple counters at WWDC.

Written by
Jesús García

Passionate about technology and personal finance. I write about innovation, artificial intelligence, investing, and strategies to improve your finances. My goal is to make complex topics accessible to everyone.
