Cleanlab: Automated Data Curation Solution Company Secures $25 Million

By Annie Baker • Oct 16, 2023

Cleanlab – the company behind the automated data curation solution used to increase the dollar value of every data point in enterprise AI, LLM, and analytics solutions – announced it had raised $25 million in Series A funding. This funding round was co-led by Menlo Ventures and TQ Ventures; Menlo Ventures’ Matt Murphy and TQ’s Schuster Tanger will join the board. Existing investor Bain Capital Ventures (BCV) and new investor Databricks Ventures also joined in this funding round, which brings Cleanlab’s total funding to $30 million.

Cleanlab helps drive profitability: for today’s businesses, revenue is directly linked to data-driven analytics decisions and generative AI solutions. Bad data costs the U.S. alone over $3 trillion, and 80% of enterprises’ time is spent manually improving data quality.

Cleanlab is the first enterprise solution that reliably adds intelligent metadata automatically and removes most of the manual work. It also turns messy, real-world data into valuable inputs for various models, which increases the reliability and profit margin of enterprise analytics, LLM, and AI decisions. In addition, Cleanlab automatically identifies the portion of a dataset that contains no issues, which is typically the majority, so enterprise pipelines can improve their profit margins by avoiding expensive data quality and annotation work on that data.

Cleanlab’s novel AI algorithms were developed in-house by the founders, all of whom hold PhDs in Computer Science from MIT and are published researchers. The team’s proprietary approach to automated data curation builds on the field of confident learning, which the Cleanlab team created, enabling them to pioneer an enterprise-ready product.

More than 10% of Fortune 500 companies (including AWS, JPMorgan Chase, Google, Oracle, and Walmart) and a variety of innovative startups (like ByteDance, HuggingFace, and Databricks) use Cleanlab to find and fix problems in sizable structured and unstructured visual, text, and tabular datasets. Whether building an LLM for enterprise, tagging intents in chatbot text data, or objects in visual navigation data, Cleanlab increases the dollar value of every data point in your dataset by automatically analyzing and correcting outliers, ambiguous data, and mislabeled data.

The company also announced that its flagship automated data curation platform, Cleanlab Studio, has launched several new features that address unreliable LLM outputs. Cleanlab’s Trustworthy Language Model (TLM) produces high-quality outputs like those of ChatGPT, Falcon, and similar LLMs, and it adds a trustworthiness score to every LLM output. Cleanlab Studio identifies and fixes issues in all dataset types, including text, image, and tabular data. TLM extends Cleanlab Studio’s capabilities for adding intelligent metadata to help automate reliability and quality assurance for systems that rely on LLM outputs, synthetic data, and generated content. Cleanlab’s Trustworthy Language Model is available in beta today with Cleanlab Studio at cleanlab.ai.

KEY QUOTES:

“After working with companies like Microsoft and Tesla to get their AI-driven products to function better and helping MIT and Harvard detect cheating, it became clear that mislabeled and poorly curated data was the core issue behind these challenges. It’s the culmination of over a decade of work to introduce Cleanlab Studio, which reimagines what AI and analytics can do for people and enterprises now that we can automate data curation and reliability.”

– Cleanlab Co-Founder and CEO Curtis Northcutt

“While most of the investment in generative AI is chasing the biggest, baddest, and best model, the reality is that there is a massive complementary opportunity that can shave billions off those efforts and lead to a better outcome. That is Cleanlab. Cleanlab’s amazing team of ML researchers and practitioners has built a data curation platform that fundamentally improves models via better, cleaner data.”

– Matt Murphy, Partner at Menlo Ventures

“We are thrilled to partner with Curtis, Jonas and Anish, the eminent authorities on data-centric AI. They have developed a solution to a large and pressing problem for enterprises across almost all industries: namely, ambiguous and wrongly labeled data. In addition to an exceptional team and superior technology, Cleanlab also has real world results from customers that point to Cleanlab’s effectiveness around percent accuracy improvement, percent reduction in labeled transactions required to train models, and dollar reduction in enterprise costs.”

– Schuster Tanger, Co-Managing Partner of TQ Ventures

“Cleanlab is well-designed, scalable, and theoretically grounded: It accurately finds data errors, even on well-known and established datasets. After using it for a successful project at Google, Cleanlab is now one of my go-to libraries for dataset cleanup.”

– Patrick Violette, Senior Software Engineer at Google