Google Introduces DiffusionGemma, Delivering Up To 4x Faster Text Generation

Google unveiled DiffusionGemma, an experimental open-source model designed to dramatically accelerate text generation through diffusion-based techniques, offering up to four times faster inference than traditional autoregressive large language models.

Released under the Apache 2.0 license, DiffusionGemma is a 26-billion-parameter Mixture of Experts model that activates only 3.8 billion parameters during inference. Built on Google’s Gemma 4 architecture and Gemini Diffusion research, the model generates entire blocks of text simultaneously rather than producing tokens sequentially.

According to Google, DiffusionGemma can generate more than 1,000 tokens per second on a single NVIDIA H100 GPU and more than 700 tokens per second on NVIDIA GeForce RTX 5090 hardware.

Unlike conventional language models that create text one token at a time, DiffusionGemma generates blocks of 256 tokens in parallel and uses bidirectional attention, so every token in a block can attend to all the others rather than only to the tokens that precede it. This approach provides advantages for applications such as code infilling, in-line editing, mathematical graphs and amino acid sequence generation.
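To make the attention difference concrete, here is a minimal sketch in plain Python comparing a causal mask (used by autoregressive models) with a fully bidirectional one. The function names are hypothetical and chosen for illustration; nothing here is taken from DiffusionGemma's code, and a real model would apply such masks inside its attention layers rather than as nested lists.

```python
def causal_mask(n):
    """Row i is True at column j when position i may attend to position j.
    Causal: a position sees itself and earlier positions only."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Bidirectional: every position sees every other position in the block."""
    return [[True] * n for _ in range(n)]


def visible_positions(mask, i):
    """List the positions that position i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


if __name__ == "__main__":
    n = 4  # tiny block for readability; the article cites 256-token blocks
    print("causal, position 1:       ", visible_positions(causal_mask(n), 1))
    print("bidirectional, position 1:", visible_positions(bidirectional_mask(n), 1))
```

Under the causal mask, position 1 cannot see anything to its right, which is why infilling text in the middle of a document is awkward for autoregressive models. With the bidirectional mask the same position sees both its left and right context.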

The model also incorporates iterative self-correction capabilities, allowing it to evaluate and refine entire blocks of text during generation.
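Google has not described the sampler in this announcement, so the following is only a toy illustration of the general idea of iterative refinement: every position in a block is re-predicted on each step, the most confident predictions are kept, and the rest are masked again. The `toy_denoiser` function is a hypothetical stand-in for a model (it simply consults a known target string and guesses wrong when it has no neighboring context), and the confidence values are invented for the example.

```python
MASK = "_"


def toy_denoiser(block, target):
    """Hypothetical stand-in for a model. Returns one (token, confidence)
    proposal per position. Position 0 is always predicted correctly; other
    positions guess '?' unless at least one neighbor is already filled."""
    n = len(block)
    proposals = []
    for i in range(n):
        neighbors = [block[j] for j in (i - 1, i + 1) if 0 <= j < n]
        filled = sum(tok != MASK for tok in neighbors)
        if i == 0:
            proposals.append((target[i], 0.9))
        elif filled > 0:
            proposals.append((target[i], 0.3 + 0.3 * filled))
        else:
            proposals.append(("?", 0.3))  # low-confidence wrong guess
    return proposals


def refine_block(target, steps, per_step):
    """Re-predict the whole block each step, keeping a growing number of the
    most confident proposals and re-masking everything else."""
    n = len(target)
    block = [MASK] * n
    history = []
    for step in range(steps):
        proposals = toy_denoiser(block, target)
        budget = min(n, (step + 1) * per_step)
        ranked = sorted(range(n), key=lambda i: -proposals[i][1])
        keep = set(ranked[:budget])
        block = [proposals[i][0] if i in keep else MASK for i in range(n)]
        history.append("".join(block))
    return history


if __name__ == "__main__":
    for state in refine_block("abcd", steps=3, per_step=2):
        print(state)
```

In this toy run the first step commits a wrong guess at position 1, and the second step overwrites it once more context is available. That revision of an already-filled position is the behavior the self-correction claim refers to; an autoregressive decoder cannot go back and change a token it has already emitted.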

Google emphasized that DiffusionGemma is intended primarily for researchers and developers exploring speed-sensitive interactive workflows rather than production environments. The company noted that standard Gemma 4 models continue to provide higher output quality for applications where accuracy is paramount.

DiffusionGemma shifts the decoding bottleneck from memory bandwidth to compute, maximizing GPU utilization for local inference. Google said the performance benefits are strongest for low- and medium-batch workloads on a single accelerator, while large-scale cloud deployments may continue to favor autoregressive architectures.
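A rough way to see the bottleneck shift is to count sequential forward passes, since each pass has to stream the active model weights from GPU memory once. The sketch below is back-of-the-envelope arithmetic, not a benchmark: the 256-token block size comes from the announcement, while `steps_per_block=16` is an assumed, illustrative number of refinement steps that Google has not published here.

```python
import math


def autoregressive_passes(num_tokens):
    """One forward pass (one sweep over the weights) per generated token."""
    return num_tokens


def block_diffusion_passes(num_tokens, block_size=256, steps_per_block=16):
    """Each block is produced by a fixed number of refinement passes.
    steps_per_block is an illustrative assumption, not a published figure."""
    return math.ceil(num_tokens / block_size) * steps_per_block


if __name__ == "__main__":
    n = 1024
    ar = autoregressive_passes(n)
    bd = block_diffusion_passes(n)
    print(f"autoregressive:  {ar} passes, {n / ar:.0f} token(s) per weight sweep")
    print(f"block diffusion: {bd} passes, {n / bd:.0f} token(s) per weight sweep")
```

Under these assumptions, each sweep over the weights yields many more tokens, so memory bandwidth matters less. The trade-off is that every pass processes a full block of positions and therefore does far more arithmetic, which is why the workload becomes compute-bound and why the advantage shrinks when large batches already keep the GPU's compute units busy.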

The company is releasing model weights through Hugging Face and providing support across tools including MLX, Hugging Face Transformers and vLLM. Additional support is being developed for llama.cpp, while fine-tuning capabilities are available through Hackable Diffusion, Unsloth and NVIDIA NeMo.

Google also collaborated with NVIDIA to optimize DiffusionGemma for enterprise and consumer hardware, including GeForce RTX 5090 and RTX 4090 GPUs, Hopper and Blackwell systems, and NVIDIA’s DGX platforms.

The model represents Google’s latest effort to advance open AI research and explore alternative architectures capable of enabling real-time, interactive AI experiences.

KEY QUOTES:

“DiffusionGemma is designed for researchers and developers exploring speed-critical, interactive local workflows such as in-line editing, rapid iteration, and generating non-linear text structures.”

Brendan O’Donoghue, Research Scientist, Google

“DiffusionGemma delivers up to 4x faster text generation on GPUs by moving beyond the sequential token-by-token processing of typical autoregressive large language models and generating entire blocks of text simultaneously.”

Sebastian Flennerhag, Research Scientist, Google

“While autoregressive Gemma 4 models remain the standard for high-quality production outputs, DiffusionGemma opens the door to exploring new workflows that prioritize speed, parallel layout generation and interactive local inference.”

Google Research Team
