Google: Gemma 4 QAT Models Reduce Memory Requirements For Mobile And Laptop AI

Google DeepMind announced new Quantization-Aware Training (QAT) versions of its Gemma 4 family, enabling developers to run powerful AI models more efficiently on mobile devices, laptops, and consumer GPUs while significantly reducing memory requirements.

The announcement comes two months after the launch of Gemma 4 and follows a series of updates that expanded the model family’s capabilities, including the introduction of Multi-Token Prediction (MTP) for faster inference and the release of a 12B model designed to bridge the gap between the E4B and 26B Mixture-of-Experts models.

According to Google DeepMind, the new QAT checkpoints are designed to make Gemma 4 more practical for local deployment on edge devices. By simulating quantization during training, QAT minimizes the quality degradation that can occur when models are compressed after training.

The release includes optimized checkpoints for the widely used Q4_0 quantization format as well as a new mobile-focused quantization format. Using this specialized mobile format, Google DeepMind reduced the memory footprint of the Gemma 4 E2B model to approximately 1 GB, enabling the deployment of advanced AI capabilities on a broader range of devices.
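As a rough illustration of why 4-bit formats shrink models, the sketch below estimates storage for a Q4_0-style layout, which in llama.cpp packs weights into blocks of 32 four-bit values with one 16-bit scale per block, or about 4.5 bits per weight. The parameter count is a hypothetical round number chosen for the arithmetic, not a published Gemma 4 figure, and the function names are illustrative.

```python
def q4_0_bytes(num_weights: int, block_size: int = 32) -> int:
    """Bytes needed to store num_weights in a Q4_0-style block layout."""
    # Each block holds block_size 4-bit values plus one 16-bit (2-byte) scale.
    num_blocks = -(-num_weights // block_size)  # ceiling division
    bytes_per_block = block_size // 2 + 2
    return num_blocks * bytes_per_block


def fp16_bytes(num_weights: int) -> int:
    """Bytes needed to store the same weights at 16-bit precision."""
    return num_weights * 2


# Hypothetical parameter count, for illustration only.
HYPOTHETICAL_PARAMS = 2_000_000_000

if __name__ == "__main__":
    q4 = q4_0_bytes(HYPOTHETICAL_PARAMS)
    fp16 = fp16_bytes(HYPOTHETICAL_PARAMS)
    print(f"16-bit weights: {fp16 / 1e9:.3f} GB")
    print(f"Q4_0-style:     {q4 / 1e9:.3f} GB")
    print(f"bits per weight: {q4 * 8 / HYPOTHETICAL_PARAMS:.2f}")
```

Under these assumptions the 4-bit layout needs roughly 3.6 times less storage than 16-bit weights. The mobile format described above is a separate scheme, so this arithmetic is not meant to reproduce the 1 GB figure.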

Quantization is a widely used technique that lowers memory requirements and can improve inference speed by reducing the precision of model weights. Traditional Post-Training Quantization (PTQ) applies compression after model training is complete, which can cost some model quality. In contrast, QAT incorporates quantization directly into the training process, allowing the model to adapt to lower-precision representations and preserve quality.
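The rounding involved can be sketched with a "fake quantization" round trip, in which weights are quantized to 4-bit integers and immediately converted back to floats. PTQ applies this once to finished weights; QAT implementations commonly run it inside the training forward pass and let gradients pass through the rounding unchanged (a straight-through estimator). The code below is a simplified symmetric variant with one scale per block, not the exact Q4_0 encoding, and nothing here is confirmed as Google DeepMind's training recipe.

```python
def fake_quantize_block(block, bits=4):
    """Quantize one block to signed integers, then dequantize it."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4 bits
    qmin = -(2 ** (bits - 1))    # -8 for 4 bits
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return list(block)
    scale = max_abs / qmax       # one shared scale per block
    out = []
    for x in block:
        q = max(qmin, min(qmax, round(x / scale)))
        out.append(q * scale)
    return out


def fake_quantize(weights, block_size=32, bits=4):
    """Apply the round trip block by block over a flat list of weights."""
    out = []
    for start in range(0, len(weights), block_size):
        out.extend(fake_quantize_block(weights[start:start + block_size], bits))
    return out


def mean_abs_error(original, restored):
    """Average absolute difference introduced by the round trip."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


if __name__ == "__main__":
    weights = [0.7, -0.33, 0.12, 0.0, 0.05, -0.61, 0.29, 0.44]
    restored = fake_quantize(weights, block_size=4)
    print("rounding error:", mean_abs_error(weights, restored))
```

A model trained with this round trip in the loop sees the rounding error during training and can adjust its weights to compensate, which is the intuition behind QAT's quality advantage.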

Google DeepMind applied its QAT approach to the Q4_0 format across the Gemma 4 lineup. For the E2B and E4B edge models, the company developed a mobile-optimized quantization architecture to improve efficiency on smartphones and other edge devices.

Several key optimizations were introduced to improve mobile performance. These include static activation scaling, in which scaling factors are precomputed during training rather than at inference time, reducing runtime overhead on mobile processors. The company also implemented channel-wise quantization aligned with mobile accelerator architectures, enabling more efficient execution.
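Both ideas can be shown in a few lines. The sketch below compares a single scale for a whole weight matrix with one scale per output channel, then compares a scale computed at runtime with a fixed, precalibrated one. It uses 8-bit integers, made-up values, and a hypothetical calibrated scale purely for illustration; the bit widths and calibration method in the actual mobile format are not described in the announcement.

```python
def quantize(values, scale, qmax=127):
    """Map floats to integers in [-qmax, qmax] using the given scale."""
    if scale == 0.0:
        return [0 for _ in values]
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def max_abs_scale(values, qmax=127):
    """Scale that maps the largest magnitude in values onto qmax."""
    return max(abs(v) for v in values) / qmax


# Two output channels with very different magnitudes (made-up values).
weights = [[0.02, -0.005], [4.0, -10.0]]

# Per-tensor: one scale for everything, so the small channel rounds to zero.
tensor_scale = max_abs_scale([w for row in weights for w in row])
per_tensor = [quantize(row, tensor_scale) for row in weights]

# Channel-wise: each row gets its own scale and keeps its resolution.
per_channel = [quantize(row, max_abs_scale(row)) for row in weights]

# Dynamic activations: the scale needs an extra pass over the data at runtime.
activations = [0.5, 2.0]
dynamic = quantize(activations, max_abs_scale(activations))

# Static activations: the scale is fixed ahead of time (hypothetical value),
# so there is no runtime scan, but out-of-range values are clipped.
CALIBRATED_SCALE = 0.01
static = quantize(activations, CALIBRATED_SCALE)

if __name__ == "__main__":
    print("per-tensor: ", per_tensor)
    print("per-channel:", per_channel)
    print("dynamic:    ", dynamic)
    print("static:     ", static)
```

The trade-off with a static scale is that it must be chosen well in advance, which is one reason to learn or calibrate it during training instead of guessing it afterwards.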

In addition, Google DeepMind introduced targeted 2-bit quantization for token generation components while maintaining higher precision in core reasoning layers. The company also optimized embeddings and key-value caches, reducing active memory consumption and enabling longer conversations without exhausting available memory.
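To see why the key-value cache matters for long conversations, the following sketch applies the standard size estimate for a transformer KV cache: two tensors (keys and values) per layer, each holding one vector per KV head per token. All configuration numbers are hypothetical and do not describe any Gemma 4 model, and the announcement does not say which cache optimizations were used; lower-precision storage is shown only as one common option.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value):
    """Estimate KV cache size for one sequence.

    Factor of 2 covers the separate key and value tensors.
    """
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return values * bits_per_value // 8


# Hypothetical model configuration, for illustration only.
CONFIG = dict(num_layers=30, num_kv_heads=4, head_dim=128)

if __name__ == "__main__":
    for seq_len in (2048, 8192):
        for bits in (16, 8, 4):
            size = kv_cache_bytes(seq_len=seq_len, bits_per_value=bits, **CONFIG)
            print(f"{seq_len:>5} tokens at {bits:>2}-bit: {size / 2**20:.0f} MiB")
```

Because the cache grows linearly with the number of tokens, any reduction in bytes per cached value translates directly into a longer conversation within the same memory budget.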

For users deploying text-only applications, memory usage can be reduced even further by omitting unnecessary audio and vision encoders. Google DeepMind noted that the text-only Gemma 4 E2B model, without Per-Layer Embeddings, can operate using less than 1 GB of memory.

To support adoption across the AI ecosystem, Google DeepMind partnered with several popular development platforms and tools. The company released model weights through Hugging Face, including GGUF formats compatible with llama.cpp and compressed tensor formats for vLLM deployments. Unquantized checkpoints are also available for developers who want to convert models into other supported formats.

The company said the QAT models are supported across a broad range of deployment environments, including llama.cpp, Ollama, LM Studio, LiteRT-LM, Transformers.js, SGLang, vLLM, MLX, Hugging Face Transformers, and Unsloth. Developers can also use MTP-enabled QAT checkpoints to maintain the inference speed benefits of Multi-Token Prediction while leveraging quantized models.

