Cleanlab, an automated solution for boosting the accuracy of enterprise artificial intelligence (AI), LLM, and analytics solutions, recently announced a $5 million seed investment round led by Bain Capital Ventures. Its flagship product, Cleanlab Studio, is the only enterprise solution for evaluating and correcting errors in large structured data (e.g., tabular data and spreadsheets) and extensive unstructured data (e.g., visual data, LLM-generated data, conversational data, etc.).
Most companies are adopting AI models and business intelligence (BI) solutions but aren’t using the full range of their data to train those models. Data and label quality issues such as outliers, label errors, and data shifts often make the data too unreliable to serve as input for business intelligence, training of ML models, or fine-tuning of LLMs.
Inaccurate data costs U.S. businesses $3.1 trillion per year, a figure that is growing, according to research from IBM. Using Cleanlab, organizations like Amazon, Google, Walmart, Deloitte, Wells Fargo, and many others have dramatically cut the cost and time spent on data quality by automating the correction of errors in their datasets. Cleanlab is also designed to work with most kinds of datasets, including text, images, and tabular/CSV/JSON data.
Cleanlab solves this problem for enterprises by analyzing unreliable, real-world datasets to find and fix errors and generate an improved dataset. It then uses that improved dataset, together with new AI-generated labels, to train more reliable models, freeing up precious engineering resources to focus on problem-solving rather than data curation and model training.
Cleanlab has already created the most popular open-source library for data-centric AI, used by thousands of data scientists to automatically diagnose issues in real-world data through algorithms that run on top of any existing ML model. But diagnosis alone does not help companies that lack the models or interfaces to fix the issues they’ve identified. To serve this broader market, the company introduced Cleanlab Studio, an enterprise application that seamlessly handles both the correction of data issues and reliable model deployment.
Curtis Northcutt, Jonas Mueller, and Anish Athalye, all of whom hold PhDs from MIT, founded Cleanlab after working on a new area of AI known as confident learning, invented by Northcutt during his PhD at MIT while working with Isaac Chuang (a pioneer of quantum computing).
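For readers who want a feel for the idea behind confident learning, the following is a minimal, simplified sketch and not Cleanlab's actual implementation. It assumes you already have out-of-sample predicted probabilities from some classifier. The function names `class_thresholds` and `find_label_issues` and the toy data are hypothetical, chosen only for illustration. Each class gets a threshold equal to the average predicted probability of that class among examples labeled with it, and an example is flagged when the class it confidently belongs to differs from its given label.

```python
from typing import List, Sequence


def class_thresholds(labels: Sequence[int],
                     pred_probs: Sequence[Sequence[float]]) -> List[float]:
    """Per-class threshold: mean predicted probability of class k
    over the examples whose given label is k."""
    n_classes = len(pred_probs[0])
    sums = [0.0] * n_classes
    counts = [0] * n_classes
    for y, p in zip(labels, pred_probs):
        sums[y] += p[y]
        counts[y] += 1
    # Assumption: a class with no labeled examples gets threshold 1.0,
    # so it is almost never treated as a confident class.
    return [sums[k] / counts[k] if counts[k] else 1.0
            for k in range(n_classes)]


def find_label_issues(labels: Sequence[int],
                      pred_probs: Sequence[Sequence[float]]) -> List[int]:
    """Return indices of examples whose confidently predicted class
    disagrees with the given label (candidate label errors)."""
    thresholds = class_thresholds(labels, pred_probs)
    issues = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes whose probability clears that class's own threshold.
        confident = [k for k, pk in enumerate(p) if pk >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class
        best = max(confident, key=lambda k: p[k])
        if best != y:
            issues.append(i)
    return issues


if __name__ == "__main__":
    # Toy data (hypothetical): example 2 is labeled 0, but the model
    # assigns it probability 0.9 for class 1.
    labels = [0, 0, 0, 1, 1, 1]
    pred_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
                  [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
    print(find_label_issues(labels, pred_probs))
```

In this toy run only the example at index 2 is flagged. Using per-class thresholds rather than a single global cutoff is what lets the approach tolerate a model that is systematically more confident about some classes than others.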
Using Cleanlab Studio, both individual data scientists and enterprise teams get more value out of their data by automating the process of finding and fixing outliers, label issues, and other data issues in image, text, and tabular datasets, enabling them to train more reliable models and derive more accurate analytics and insights. Unlike other solutions in this space, Cleanlab Studio handles model training for you with state-of-the-art auto-ML. It requires no hyper-parameter tuning, model selection, code, or machine learning expertise to deliver an improved dataset, ML model, and business insights in significantly less time.
Before Cleanlab, Co-Founder and Chief Scientist Jonas Mueller built Amazon’s auto-ML solution, which all AWS auto-ML jobs use today. Co-Founder and CTO Anish Athalye has more than 5,000 citations for several groundbreaking works demonstrating where AI solutions are broken and how to improve them. By combining Curtis’s work on automatically fixing issues in most datasets, Jonas’s work on automatically training ML models on any dataset, and Anish’s work in secure systems, the team was able to create Cleanlab Studio in pursuit of its mission to make AI more accessible and more effective for humanity. Cleanlab Studio integrates with most common data and ML workflows and uploads large datasets at enterprise scale, with upload times limited only by internet bandwidth.
On June 1, 2023, Databricks announced its partnership with Cleanlab to bring automatic data correction to both structured and unstructured datasets on the Databricks platform through the Cleanlab Studio integration. In 2021, Cleanlab was nominated for the best paper award at NeurIPS. In 2022, Cleanlab published five peer-reviewed papers at NeurIPS and ICML conferences and workshops, and in 2023, Cleanlab’s executive team taught MIT’s course on Data-Centric AI.
Cleanlab is actively working with organizations training large models or developing business intelligence and analytics solutions on image, text, tabular, and other data types.
KEY QUOTES:
“We often forget that like humans, artificially intelligent solutions embody imperfection. The next evolution of AI is being able to characterize this imperfection: understanding, finding, and fixing errors in the data it’s trained on. Everyone can relate to Cleanlab because it works like how you do: if you are taught wrong things, you perform worse on the exam. Cleanlab automates data curation and correction to produce more accurate models in less time. We don’t guarantee perfection. We guarantee improvement. Cleanlab breaks AI’s glass ceiling by providing accessibility and reliability for AI solutions.”
— Cleanlab AI co-founder & CEO Curtis Northcutt
“A major risk with LLMs is ‘garbage-in, garbage-out’ in that if they’re trained on messy data that contains bias, inaccuracy, or nonsensical information, their outputs will often contain similar issues. There’s also great opportunity in better data curation, since LLM performance is still largely data-bound, as DeepMind’s Chinchilla paper (and others) have shown. Cleanlab is the easiest way to curate data for training and fine-tuning, and an integral part of the emerging infra stack that supports modern AI.”
— Aaref Hilaly, partner at Bain Capital Ventures
“Cleanlab helped us improve accuracy by 28%, while reducing the number of labeled transactions required to train the model by more than 98%.”
— David Muelas Recuenco, Expert Data Scientist at BBVA (Banco Bilbao Vizcaya Argentaria), one of the largest financial institutions in the world, discussing how Cleanlab reduced the bank’s costs for dataset curation and model training by over 98%.
“Using Cleanlab AI, we’ve increased model accuracy by 15 percent, and reduced training iterations by one-third. Our team has been extremely impressed with the accuracy, speed and ease-of-use that Cleanlab provides.”
— Steven Gawthorpe, Senior Managing Consultant Data Scientist at Berkeley Research Group