Unstructured: LLM Data Preprocessing Solutions Provider Raises $40 Million

By Amit Chowdhry ● Mar 14, 2024

Unstructured – a leader in ingestion and preprocessing for large language models (LLMs) – announced a $40 million Series B funding round. The round was led by Menlo Ventures with participation from Databricks Ventures, IBM Ventures, Sacramento Kings Chairman Vivek Ranadivé, Datastax CEO Chet Kapoor, Allison Pickens of the New Normal Fund, and NVentures (NVIDIA’s venture capital arm), along with existing investors Madrona, Bain Capital Ventures (BCV), and Mango Capital.

Tim Tully of Menlo Ventures joined the board of directors in connection with the funding round. Including this round, the company has raised $65 million. Unstructured plans to use the new funding to grow its team and accelerate the development of data preprocessing tooling for LLMs.

Globally, over half of organizations increased investments in generative AI programs in the past year. However, the rise of this transformational technology presents a major challenge.

While the emergence of the modern data stack in the past decade unlocked structured data for advanced analytics, there has historically not been an equivalent set of tooling for the 80+% of enterprise data that is unstructured. This includes emails, documents, images, videos, and other data that organizations hold. To address this gap, Unstructured is the first and only company that can ingest and preprocess all unstructured data into formats ready for foundation model use.

Since its founding in 2022, Unstructured has been at the forefront of the productization of enterprise LLMs. It enables organizations to automate the transformation of messy and unstructured data into formats necessary for retrieval-augmented generation (RAG) and LLM fine-tuning.

Unstructured’s technology has emerged as a major piece of infrastructure not only for delivering LLM-ready data to vector databases but also for driving major performance improvements across LLM applications without any customization. Unstructured’s open-source library has been downloaded over 6 million times and is used by over 12,000 code bases and 45,000+ organizations.

Earlier this year, the company released its commercial SaaS API and now has over 1,000 paying customers. In February, Unstructured announced its enterprise platform – which is the first solution to continuously extract raw unstructured data from existing databases, transform over 30 file types into LLM-ready formats, and automatically load this data into a vector database for RAG.

Developers and data scientists spend over 75% of their time preparing data. Unstructured’s solution removes this critical barrier to moving LLM pilots into production. The real-time, continuous data access that Unstructured provides means LLMs are kept up to date and have access to organization-specific knowledge.

KEY QUOTES:

“Over the last decade the emergence of the modern data stack has enabled analytics products to take advantage of the cloud and structured data to deliver incredible value to organizations, but the development of LLMs nested in a RAG architecture has enabled a similar shift for the world of unstructured data. For the first time, developers are able to interact with all of their data through large foundation models. This new data stack rests on four key components: LLMs, orchestration frameworks, new cloud storage solutions, and ingestion and preprocessing tooling. A critical bottleneck to realizing the emerging value of LLMs is the ability to ingest and preprocess any human-generated data into an LLM-ready format. 2024 will be the year of moving LLM prototypes into production and organizations of all types and sizes are hungry to build out these architectures efficiently and at scale. Automating the process of structuring data and seamlessly delivering it into storage is critical for enterprises that want to build solutions on this new tech stack and go to market quickly.”

  • Brian Raymond, CEO and Founder of Unstructured

“Unstructured has built an exceptional cloud AI platform to help developers build data pipelines for RAG, AI applications, chatbots, and more. It has become the preferred way developers build AI applications and assemble data pipelines. People in the industry know that RAG quickly became the industry standard. Soon they will understand that Unstructured is the tip of the RAG spear.”

  • Tim Tully, Partner at Menlo Ventures

“Generative AI is key to gathering useful, intelligent insights from the massive amounts of data that enterprises create every day. Unstructured is an emerging leader in data ingestion and preprocessing, working to make AI more accessible, useful, and powerful for all.”

  • Mohamed “Sid” Siddeek, corporate vice president and Head of NVentures at NVIDIA

“Unstructured is turning the data challenge into opportunity — helping businesses optimize for AI. We are proud to invest in a company that shares our mission of driving AI for business and empowering enterprises to unlock greater insights from their data.”

  • Thomas Whiteaker, Investment Partner at IBM Ventures

“We are thrilled to invest and partner with the Unstructured team. Unstructured is rapidly becoming a critical technology for delivering RAG-ready data to the Databricks platform and more than 120 customers are already using its best-in-class data preprocessing tool. We look forward to growing our partnership and accelerating enterprise adoption of generative AI.”

  • Andrew Ferguson, VP of Corporate Development and Ventures at Databricks

 
