Microsoft Launches Three New MAI Models For Speech, Voice, And Image Generation

Microsoft announced three new in-house AI models: MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2. Together they expand the company's capabilities across speech recognition, voice generation, and image creation. The models are now available through Microsoft Foundry and the MAI Playground, with the company emphasizing improved performance, speed, and cost efficiency.

MAI-Transcribe-1 is designed for speech-to-text transcription and, according to Microsoft, achieves state-of-the-art results on the FLEURS benchmark across the 25 most-used languages. The model is optimized for real-world environments and offers batch transcription speeds 2.5 times faster than Microsoft’s previous Azure Fast offering. Microsoft also highlighted its strong price-to-performance positioning compared to other large cloud providers.

MAI-Voice-1 focuses on generating natural, expressive speech and preserves speaker identity across long-form content. Developers can create custom voices from only a few seconds of audio. The model can generate up to 60 seconds of audio for every second of processing and is built for high GPU efficiency, supporting scalable deployment for enterprise use cases. It is already being integrated into Copilot experiences, including audio-based features and podcasts.

MAI-Image-2 enhances Microsoft’s image generation capabilities, delivering faster performance while maintaining high-quality outputs suitable for professional creative workflows. The model has ranked among the top three on the Arena.ai leaderboard and is being rolled out across Microsoft products, including Bing and PowerPoint. Early enterprise adoption includes WPP, which is using the model for large-scale creative production.

Microsoft is positioning these models as “better, faster, and cheaper” than competing offerings, with aggressive pricing aimed at developers and enterprise customers. Pricing starts at $0.36 per hour for MAI-Transcribe-1 and $22 per million characters for MAI-Voice-1. MAI-Image-2 is priced at $5 per million tokens for text input and $33 per million tokens for image output.
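To put the starting prices in context, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function and constant names are hypothetical, the "per hour" rate is assumed to mean per hour of audio transcribed, and actual billing may include tiers, minimums, or other terms not covered in the announcement.

```python
# Starting prices quoted in the announcement (USD).
# Names below are hypothetical, chosen for this sketch only.
TRANSCRIBE_USD_PER_HOUR = 0.36          # assumed: per hour of audio transcribed
VOICE_USD_PER_MILLION_CHARS = 22.0
IMAGE_USD_PER_MILLION_INPUT_TOKENS = 5.0
IMAGE_USD_PER_MILLION_OUTPUT_TOKENS = 33.0


def transcribe_cost(audio_hours: float) -> float:
    """Estimated MAI-Transcribe-1 cost for a given number of audio hours."""
    return audio_hours * TRANSCRIBE_USD_PER_HOUR


def voice_cost(characters: int) -> float:
    """Estimated MAI-Voice-1 cost for a given number of input characters."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost(text_input_tokens: int, image_output_tokens: int) -> float:
    """Estimated MAI-Image-2 cost from text input and image output tokens."""
    input_cost = text_input_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_INPUT_TOKENS
    output_cost = image_output_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_OUTPUT_TOKENS
    return input_cost + output_cost


if __name__ == "__main__":
    # Example workloads; the volumes are made up for illustration.
    print(f"100 hours of transcription: ${transcribe_cost(100):.2f}")
    print(f"500,000 characters of speech: ${voice_cost(500_000):.2f}")
    print(f"2,000 input + 4,000 output image tokens: ${image_cost(2_000, 4_000):.4f}")
```

The estimates scale linearly with volume, so they are best read as a lower bound on list-price cost rather than a quote.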

The company also emphasized its broader vision of “humanist AI,” focusing on building models that are aligned with human needs and designed for safe, responsible deployment. The models were developed with built-in guardrails and enterprise-grade governance through Microsoft Foundry.

The launch reflects Microsoft’s broader strategy to build its own AI infrastructure and models while making them widely accessible to developers and enterprises. By combining performance gains with cost efficiency and ecosystem integration, Microsoft aims to accelerate adoption and strengthen its position in the competitive AI market.

KEY QUOTES:

“Introducing MAI-Transcribe-1, alongside MAI-Voice-1 and MAI-Image-2. World-class quality at lightning speeds, now available at the most competitive prices.”

Mustafa Suleyman, CEO of Microsoft AI

“MAI-Image-2 is a genuine game-changer. It’s a platform that not only responds to the intricate nuance of creative direction, but deeply respects the sheer craft involved in generating real-world, campaign-ready images.”

Rob Reilly, Global Chief Creative Officer at WPP
