Cloudflare announced that key members of Ensemble AI are joining the company to help accelerate its work in AI infrastructure and make it easier for developers to run powerful AI models efficiently at scale.
Ensemble AI was founded in 2023 in San Francisco and has focused on making large models faster, smaller, and more cost-effective to serve without sacrificing quality. The team has developed approaches to model compression and efficient inference designed to reduce the memory, compute, and deployment overhead of large language models and multimodal architectures.
Cloudflare said the addition of Ensemble AI’s talent will strengthen its ability to support developers as AI becomes a core part of application development. The company noted that the economics of inference are becoming increasingly important as models grow larger, workloads become more dynamic, and customers expect AI applications to run globally with speed, reliability, and affordability.
Ensemble AI’s work has focused on preserving structure inside modern AI models while reducing the cost of running them. This includes NdLinear, a drop-in replacement for standard linear layers in transformer models that operates directly on multidimensional activations instead of flattening them into a single vector first. The approach is designed to help models preserve meaningful axes such as heads, channels, and spatial dimensions while reducing parameter count and compute.
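To make the idea of operating on multidimensional activations concrete, the following is a minimal sketch and not Ensemble AI's implementation. It assumes one small weight matrix is applied along each non-batch axis in turn, which is one common way to avoid flattening. The function names `nd_linear_sketch` and `param_counts`, the tensor shapes, and the use of NumPy are all illustrative assumptions, and biases are ignored.

```python
import numpy as np

def nd_linear_sketch(x, weights):
    """Apply one small weight matrix per non-batch axis of x (illustrative only).

    x: array of shape (batch, d1, ..., dn)
    weights: list of n matrices; weights[i] has shape (d_i, h_i)
    Returns an array of shape (batch, h1, ..., hn).
    """
    out = x
    for axis, w in enumerate(weights, start=1):
        # Contract the current axis with w; tensordot appends the new axis
        # at the end, so move it back to the position it came from.
        out = np.moveaxis(np.tensordot(out, w, axes=([axis], [0])), -1, axis)
    return out

def param_counts(in_dims, out_dims):
    """Weight counts for a flattened dense layer vs. per-axis factors."""
    dense = int(np.prod(in_dims)) * int(np.prod(out_dims))
    factored = sum(d * h for d, h in zip(in_dims, out_dims))
    return dense, factored

# Hypothetical activation shape: batch of 4, axes of size 8, 16, and 32.
rng = np.random.default_rng(0)
in_dims, out_dims = (8, 16, 32), (8, 16, 32)
x = rng.standard_normal((4, *in_dims))
weights = [rng.standard_normal((d, h)) for d, h in zip(in_dims, out_dims)]

y = nd_linear_sketch(x, weights)
dense, factored = param_counts(in_dims, out_dims)
print(y.shape)          # (4, 8, 16, 32)
print(dense, factored)  # 16777216 1344
```

In this sketch a dense layer over the flattened 4,096-element activation would need about 16.8 million weights, while the three per-axis matrices need 1,344. The per-axis form is less expressive than an unconstrained dense layer, so the savings come with a structural assumption about the data.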
The team also developed NdLinear-LoRA, an efficient adaptation method designed to reduce the number of trainable parameters required for fine-tuning large models. Cloudflare said these techniques complement other model efficiency efforts, including quantization and vector quantization.
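The announcement does not describe how NdLinear-LoRA is constructed, but the general reason low-rank adaptation cuts trainable parameters can be shown with standard LoRA arithmetic. The sketch below assumes plain LoRA, where a frozen d_out x d_in weight matrix receives a trainable update B @ A of a chosen rank. The layer size, the rank, and the function names are hypothetical values picked for illustration.

```python
def full_finetune_params(d_out, d_in):
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_out * d_in

def lora_trainable_params(d_out, d_in, rank):
    """Trainable weights for a standard LoRA update B @ A.

    B has shape (d_out, rank) and A has shape (rank, d_in);
    the original weight matrix stays frozen.
    """
    return rank * (d_out + d_in)

# Hypothetical layer size and adapter rank.
d_out, d_in, rank = 4096, 4096, 8

full = full_finetune_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, rank)
print(full, lora, full // lora)  # 16777216 65536 256
```

For this hypothetical layer, standard LoRA at rank 8 trains 256 times fewer weights than full fine-tuning. Any further reduction from NdLinear-LoRA would depend on details the announcement does not give.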
Cloudflare plans to apply the team’s expertise to Workers AI, its serverless GPU-powered inference platform running on Cloudflare’s global network. The company said inference cost remains one of the biggest barriers to scaling AI applications, and improvements in model size, memory footprint, throughput, and GPU utilization can make AI more accessible and economical for developers and customers.
The Ensemble AI team will focus on improving the economics of serving large language models and other advanced AI architectures, with an emphasis on model efficiency, GPU utilization, and scalable deployment.
Cloudflare said the move builds on its existing AI infrastructure work, including its inference engine Infire, tensor compression techniques such as Unweight, and its platform for running extra-large language models.