Google announced the release of Multi-Token Prediction drafters for the Gemma 4 family of open AI models, introducing speculative decoding capabilities designed to significantly improve inference speed and responsiveness for developers.
The company said the new MTP drafters can deliver up to a 3x increase in inference speed without degrading output quality or reasoning performance.
Google explained that traditional large language model inference remains constrained by memory bandwidth: processors spend substantial time transferring model parameters rather than generating outputs. Multi-Token Prediction addresses this bottleneck through speculative decoding, in which a lightweight draft model predicts several future tokens at once while the larger target model verifies those drafted tokens in parallel.
The company said the architecture improves efficiency by enabling applications to generate multiple verified tokens in approximately the same time previously required for generating a single token.
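The draft-then-verify loop described above can be sketched in a few lines. The toy below uses simple rule-based stand-ins for the draft and target models (real systems compare token distributions from actual neural networks, and Google has not published the drafters' internals), but it shows the core accept/reject scheme: accept the longest prefix of drafted tokens the target agrees with, then take the target's own token at the first mismatch, so every verification pass yields at least one verified token and the final output matches what the target model alone would have produced.

```python
# Toy sketch of speculative decoding. Both "models" here are hypothetical
# stand-ins (simple rules over integer token ids), not Gemma weights.

def draft_model(context, k):
    """Cheap drafter: greedily propose k tokens (here: count up from the last id)."""
    last = context[-1]
    return [last + i + 1 for i in range(k)]

def target_model(context):
    """Expensive target: the single token it would emit after `context`.
    It agrees with the drafter except at every 4th token id."""
    nxt = context[-1] + 1
    return nxt if nxt % 4 != 0 else nxt + 10

def speculative_decode(prompt, n_tokens, k=4):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        proposals = draft_model(out, k)
        # Verify the k proposals against the target (in a real system this
        # is one batched forward pass, not k sequential calls): accept the
        # agreeing prefix, then append the target's token at the mismatch.
        accepted, ctx = [], list(out)
        for tok in proposals:
            expected = target_model(ctx)
            if tok == expected:
                accepted.append(tok)
                ctx.append(tok)
            else:
                accepted.append(expected)
                break
        out.extend(accepted)
    return out[:len(prompt) + n_tokens]
```

Because rejected drafts are replaced by the target's own choice, the output is identical to plain target-only decoding; the speedup comes entirely from verifying several drafted tokens per target pass instead of generating one token per pass.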
Google highlighted multiple use cases for the faster inference capabilities, including coding assistants, autonomous AI agents, voice applications, edge-device AI workloads, and offline AI systems running on consumer hardware.
According to Google, pairing Gemma 4 models with their corresponding MTP drafters enables lower latency, faster local development workflows, improved on-device performance, and reduced battery consumption while preserving output accuracy.
The company also noted that the draft models share activations and KV cache resources with the target models to reduce redundant computation and improve hardware efficiency. Additional optimizations were implemented for edge models and Apple Silicon environments.
Google said the Gemma 4 MTP drafters are available immediately under the same Apache 2.0 open-source license as Gemma 4 and can be used across platforms including Hugging Face, Kaggle, MLX, transformers, vLLM, SGLang, Ollama, and Google AI Edge Gallery.
Google previously announced Gemma 4 as its most capable open AI model family, which the company said surpassed 60 million downloads within the first few weeks of release.
KEY QUOTES:
“By using Multi-Token Prediction (MTP) drafters, Gemma 4 models reduce latency bottlenecks and achieve improved responsiveness for developers.”
Olivier Lacombe, Google