Databricks Acquires Lilac To Shape AI Products With Better Data

By Amit Chowdhry • Mar 21, 2024

Databricks announced that it has acquired Lilac, a scalable and user-friendly tool that lets data scientists search, cluster, and analyze any text dataset, with a focus on generative AI. Lilac supports a range of use cases, from evaluating the output of large language models (LLMs) to understanding and preparing unstructured datasets for model training. Integrating Lilac’s tooling into Databricks will help customers accelerate the development of production-quality generative AI applications using their enterprise data.

Data is at the heart of any LLM-based system, whether that means preparing datasets for training models, evaluating model outputs, or filtering Retrieval-Augmented Generation (RAG) data. Understanding these datasets is essential for building quality GenAI apps. However, analyzing unstructured text data can be cumbersome and difficult in the age of GenAI. Historically, this process has relied on manual, labor-intensive methods that do not scale. These traditional methods are time-consuming and daunting enough to deter practitioners from attempting the analysis at all.

Lilac makes exploring unstructured data easier: it gives data scientists and AI researchers a manageable way to explore, understand, and modify text datasets.

Lilac offers a scalable solution that encourages and facilitates interaction with data. With an intuitive user interface and AI-augmented features, it lets data scientists and researchers explore data clusters, derive new data categories using human feedback and classifiers, and tailor datasets based on these insights. The team behind Lilac built the product specifically to support analyzing model outputs for bias or toxicity, preparing data for RAG, and fine-tuning or pre-training LLMs.


“Lilac’s core mission aligns with Databricks’ commitment to provide customers with end-to-end GenAI capabilities. Their open source project has already captivated a wide audience within the data science and AI research communities — including our own Mosaic AI team, which has been leveraging Lilac to curate data over the past year. Lilac’s founders, Daniel Smilkov and Nikhil Thorat, each spent a decade at Google honing their expertise in developing enterprise-scale data quality solutions. We are thrilled to bring their experience, team, and technology to Databricks.”

“With Databricks Mosaic AI, our goal is to provide customers with end-to-end tooling to develop high-quality GenAI apps using their own data. Lilac’s technology will make it easier to evaluate and monitor the outputs of their LLMs in a unified platform, as well as prepare datasets for RAG, fine-tuning, and pre-training. We look forward to sharing more as we integrate Lilac’s technology into Databricks. Stay tuned!”

– Statement from Databricks