Microsoft Unveils Maia 200, A New Inference Accelerator To Cut AI Token Costs

By Amit Chowdhry ● Today at 4:38 PM

Microsoft has introduced Maia 200, a first-party AI accelerator designed specifically for inference workloads, positioning it as a step change in the economics of AI token generation and as a key piece of the company's broader heterogeneous AI infrastructure strategy.

The company said Maia 200 will be deployed to serve multiple models across Azure, including the latest GPT-5.2 models from OpenAI, and to deliver improved performance per dollar for products such as Microsoft Foundry and Microsoft 365 Copilot.

Microsoft described Maia 200 as an inference-focused system built on TSMC’s 3-nanometer process with native FP8 and FP4 tensor cores, and said the chip’s design emphasizes not only raw compute but also sustained utilization at scale. According to Microsoft, each Maia 200 chip contains more than 140 billion transistors, delivers more than 10 petaFLOPS of FP4 and more than 5 petaFLOPS of FP8 performance, and operates within a 750W SoC thermal design power envelope.
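For context on what native low-precision support buys: FP8 stores each value in one byte and FP4 in half a byte, trading numerical precision for arithmetic throughput and memory savings. The sketch below is a minimal, hardware-agnostic illustration of the rounding this introduces, using PyTorch's float8_e4m3fn dtype as a stand-in (FP4 has no standard PyTorch dtype, and none of this is Maia-specific).

import torch

# Rough illustration of low-precision rounding (not Maia-specific):
# cast a few FP32 values to FP8 (e4m3) and back to see the error.
x = torch.tensor([0.1234, 1.2345, 12.345, 123.45], dtype=torch.float32)
x_fp8 = x.to(torch.float8_e4m3fn)        # 1 byte per value
roundtrip = x_fp8.to(torch.float32)
print(roundtrip)                         # values land on the nearest representable FP8 step
print((roundtrip - x).abs() / x.abs())   # relative error grows as formats get coarser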

The company also highlighted a redesigned memory subsystem featuring 216 GB of HBM3e delivering 7 TB per second of bandwidth, paired with 272 MB of on-chip SRAM and specialized data movement engines intended to keep large models fed and increase token throughput.
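Those memory figures matter because decode-phase token generation is usually bandwidth-bound: each generated token streams the model weights (and KV cache) out of HBM, so HBM bandwidth caps per-chip token throughput. Below is a back-of-envelope sketch under illustrative assumptions; the 70-billion-parameter FP8 model is hypothetical, not something Microsoft cited.

# Back-of-envelope: memory-bandwidth ceiling on decode throughput.
# All model figures below are illustrative assumptions, not Maia 200 benchmarks.
HBM_BANDWIDTH_B_PER_S = 7e12           # 7 TB/s of HBM3e bandwidth (per the article)
PARAMS = 70e9                          # hypothetical 70B-parameter model
BYTES_PER_PARAM = 1                    # FP8 weights: 1 byte per parameter

weight_bytes = PARAMS * BYTES_PER_PARAM
# At batch size 1, each decoded token touches roughly every weight once,
# so bandwidth / weight_bytes bounds single-stream tokens per second.
tokens_per_s_ceiling = HBM_BANDWIDTH_B_PER_S / weight_bytes
print(f"~{tokens_per_s_ceiling:.0f} tokens/s ceiling per chip at batch 1 (ignoring KV cache)")
# Batching reuses the same weight traffic across requests, which is why
# sustained utilization, not just peak FLOPS, drives cost per token.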

At the system level, Microsoft said Maia 200 introduces a two-tier scale-up network built on standard Ethernet, supported by a custom transport layer and a tightly integrated NIC to deliver performance and reliability without proprietary fabrics. Each accelerator exposes 2.8 TB per second of dedicated bidirectional scale-up bandwidth and supports predictable collective operations across clusters of up to 6,144 accelerators. Within each tray, four accelerators are connected via direct, non-switched links, and the Maia AI transport protocol extends the same communication semantics across intra- and inter-rack networking, enabling scaling with fewer hops and less stranded capacity.
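To put those scale-up numbers in perspective, the sketch below applies generic ring all-reduce math (a standard collective-communication bound, not a Maia-specific claim) to the figures in the article; the per-direction bandwidth split and the 10 GB payload are assumptions.

# Generic collective-communication math applied to the numbers in the article.
# The ring all-reduce bound is standard; Maia's actual collectives may differ.
ACCELERATORS = 6144                    # maximum scale-up cluster size (per the article)
PER_TRAY = 4                           # accelerators per tray, directly linked (per the article)
BW_B_PER_S = 2.8e12 / 2                # 2.8 TB/s bidirectional -> ~1.4 TB/s each way (assumption)

trays = ACCELERATORS // PER_TRAY
print(f"{trays} trays of {PER_TRAY} accelerators")   # 1536 trays at full scale

payload_bytes = 10e9                   # hypothetical 10 GB tensor to all-reduce
n = ACCELERATORS
# Bandwidth term of a ring all-reduce: each rank sends/receives 2*(n-1)/n of the payload.
transfer_time_s = 2 * (n - 1) / n * payload_bytes / BW_B_PER_S
print(f"~{transfer_time_s * 1e3:.1f} ms bandwidth-bound lower bound (ignores latency and hop count)")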

Microsoft said Maia 200 is already deployed in its US Central datacenter region near Des Moines, Iowa, with the US West 3 region near Phoenix, Arizona, planned next, followed by additional regions. The company also tied the system to internal model development: the Microsoft Superintelligence team will use Maia 200 for synthetic data generation and reinforcement learning to improve next-generation in-house models, with the accelerator's narrow-precision-focused design positioned to speed the generation and filtering of domain-specific synthetic data.

To support developer adoption, Microsoft said it is previewing a Maia software development kit that includes PyTorch integration, a Triton compiler, an optimized kernel library, and access to a low-level programming language, alongside a Maia simulator and a cost calculator intended to help teams optimize workloads earlier in the development cycle.
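Microsoft has not published the SDK's interfaces beyond that description, but since the kit includes a Triton compiler alongside PyTorch integration, kernels would presumably be authored in standard Triton syntax. The sketch below is a minimal, hardware-agnostic Triton kernel of that kind; any Maia-specific compiler flags or SDK entry points are omitted because they are not documented in the article.

import torch
import triton
import triton.language as tl

# Minimal, hardware-agnostic Triton kernel: elementwise vector add.
# Standard open-source Triton; requires a GPU supported by the stock Triton runtime.
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                # guard the final partial block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)             # one program instance per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out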
