CelerData: How This Company Reduces Data Migration And Direct Analysis Costs In Data Lakes

By Amit Chowdhry • Dec 21, 2023

CelerData is a company that helps enterprises grow their business with a fast analytical engine that it says delivers 3X the performance per dollar of other solutions on the market. Pulse 2.0 interviewed CelerData co-founder and chief operating officer (COO) Andy Ye to learn more about the company.

Andy Ye’s Background

After obtaining his master’s degree in computer science in 2008, Ye worked at several prominent search engine and e-commerce companies. Ye said:

“This gave me a great professional foundation, and eventually provided the opportunity to join a startup in a key technical role just as mobile internet was really taking off. The product I was working on was similar to a mobile version of Google Analytics and later acquired by Alibaba.”

“From there, my work experience has consistently revolved around the field of data analysis, which has helped me to better understand the challenges enterprises face in data analytics and how to address them. With that knowledge I embarked on my most recent entrepreneurial ventures: StarRocks and CelerData.”

Formation Of CelerData

How did the idea for CelerData come together? Ye shared:

“After each spending more than a decade in big data, my partner, James Li, and I saw the need for a powerful analytics system that could satisfy the constantly changing requirements of business users. Four years ago we started planning StarRocks, an open source project to address those core challenges we’d encountered in the field.”

“Existing analytics systems could only deliver query response times in the tens of seconds at best, with very expensive hardware costs. Major enterprise customers with large data systems, Airbnb for example, needed sub-second responses to their analytic queries at a much lower cost.”

“Under the earlier systems, and most of today’s, data freshness was measured in hours, but enterprises needed second-level data freshness to respond faster to changes in business conditions. Various analytics systems were being used in different scenarios, which made it difficult for enterprises to simplify their architectures while analyzing offline data and real-time data simultaneously.”

“The primary question was how could we make enterprise analytics simpler and more efficient to shorten the time from raw data to business value?”

“That sparked a journey that led to the creation of the StarRocks project, our open source version. In the past four years, StarRocks developed rapidly and passed several significant milestones. It’s been adopted in production environments by hundreds of large enterprises worldwide and became a successful Linux Foundation project involving tens of thousands of engineers.”

“Last year, to better support the development of the StarRocks open-source community and provide enterprises with a cost-effective and feature-rich data lake analytics cloud product, we founded CelerData.”

Favorite Memory

What has been Ye’s favorite memory working for the company so far? Ye shared:

“Our team has ambitious goals and loves a good challenge. One of our core values when we started was, ‘think big, achieve the impossible,’ and we continue to practice this. At the beginning of last year, our cloud product was just starting its development, and we had no paying customers, but the whole team came together to turn this product into a success.”

“R&D worked tirelessly to design and iterate the product. Sales and customer success stepped up to find our first set of seed customers. For me, it was amazing to watch the product grow from idea to adoption in such a short timeframe. Since launching earlier this year, our cloud product has already benefited numerous customers.”

Core Products

What are the company’s core products and features? Ye explained:

“Our core product is CelerData Cloud, a platform for high-performance data analytics. CelerData Cloud is built on StarRocks, a popular online analytical processing (OLAP) database for multi-dimensional sub-second analysis. Powered by its SIMD-optimized execution engine, cost-based optimizer for query planning, and in-memory massively parallel processing (MPP) compute architecture, it’s able to handle complex multi-table OLAP queries more efficiently than any other solution on the market.”

“CelerData Cloud is known for its highly-scalable sub-second query performance, enabling real-time analytics through the real-time ingestion of fresh data. With native support for data upserts and the ability to efficiently perform complex aggregated queries at scale, CelerData Cloud users can eliminate the tedious streaming preprocessing data pipeline, which is one of the biggest cost factors of real-time analytics, allowing them to effectively go ‘pipeline-free.’ CelerData Cloud also streamlines data processes with open data lakes, serving as a query engine and providing the performance of a data warehouse without the need for proprietary systems or duplicating data.”
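To make the "upsert plus direct aggregation" idea above concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in, not StarRocks or CelerData Cloud, and the `orders` table, its columns, and the helper names are hypothetical. It only illustrates the general pattern: late-arriving corrections overwrite rows by primary key, and aggregate queries run directly on the current table without a separate preprocessing step.

```python
import sqlite3


def build_orders_table() -> sqlite3.Connection:
    """Create an in-memory table keyed by order_id (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    return conn


def upsert_order(conn: sqlite3.Connection, order_id: int, region: str, amount: float) -> None:
    """Insert a new order, or overwrite the existing row with the same key."""
    conn.execute(
        "INSERT INTO orders (order_id, region, amount) VALUES (?, ?, ?) "
        "ON CONFLICT(order_id) DO UPDATE SET "
        "region = excluded.region, amount = excluded.amount",
        (order_id, region, amount),
    )


def revenue_by_region(conn: sqlite3.Connection) -> dict:
    """Aggregate directly over the latest state of the table."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    )
    return dict(rows.fetchall())


conn = build_orders_table()
upsert_order(conn, 1, "EU", 10.0)
upsert_order(conn, 2, "US", 20.0)
upsert_order(conn, 1, "EU", 15.0)  # a correction to order 1 replaces the old row
totals = revenue_by_region(conn)
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(totals, row_count)
```

Because the correction replaces the earlier row rather than appending a duplicate, the aggregate reflects the latest value without any deduplication job running beforehand.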

Challenges Faced

Asked about bottlenecks that might have come up while building his company, Ye noted:

“At this stage, gaining the necessary market visibility is crucial for us. We have a deep understanding of user needs, and an amazing product. What we’re focused on now is growing the user base and collecting feedback so that we can continue to optimize and improve our offerings.”

Evolution Of CelerData’s Technology

How has CelerData’s technology evolved since launching? Ye pointed out:

“Our customers want to analyze real-time data with low latency and we’ve emphasized that in designing and developing the product. Low latency is just one important factor, though. Customers also wanted to minimize the resource costs of their data analysis, which led to the development of CelerData Cloud.”

“CelerData Cloud has become an important tool for customers to reduce the cost of data migration and perform direct analysis in data lakes since its launch, and we’re investing significant R&D efforts into data lake analytics. Data lake analytics will play a major role in the future of big data analysis. When we can perform every kind of data analysis on a data lake, we’ll no longer need multiple complex data analysis technology stacks, which is our goal for every customer.”

Significant Milestones

What have been some of CelerData’s most significant milestones? Ye cited:

“We donated StarRocks to the Linux Foundation in early 2023, which helped the project grow significantly. Hundreds of large enterprises worldwide are now using StarRocks in production environments, with nearly 10,000 people contributing to the community.”

“The launch of our CelerData Cloud product was also a major milestone. CelerData Cloud allows our customers to save significant human resources on system operations while enjoying CelerData’s professional technical services and support. We had a really successful launch and already have many customers testing and using the product.”

“In May, we officially released StarRocks 3.0. Version 3.0 represented an important step for StarRocks towards data lake analytics. In this version, StarRocks not only introduced its new storage-compute separation architecture, but also gained the ability for unified catalog management. Users can use StarRocks to analyze data in mainstream data lake formats, including Apache Hive, Apache Hudi, Apache Iceberg and Delta Lake. Users looking for high-speed data analysis no longer need to migrate data, and can simply utilize StarRocks for low-latency data lake analysis.”

Customer Success Stories

Asked about customer success stories, Ye highlighted:

“By using StarRocks, Airbnb was able to eliminate 80% of their denormalization pipeline and streamline data ingestion with their Minerva metrics management platform. This platform offers over 30,000 metrics across 7,000 dimensions and stores over 6 petabytes of data, standardizing business metrics and serving more than 100 teams at Airbnb.”

“Previously, Minerva used Apache Druid and Presto as its query layer. The multi-table JOIN performance of these systems was unsatisfactory, forcing engineers to denormalize data in a separate pipeline before ingesting it into Minerva for serving. This was resource-intensive, expensive, and wasteful.”

“By eliminating the vast majority of denormalization jobs with StarRocks, the platform’s data processing became more streamlined and efficient. This not only reduced the complexity of data ingestion but also made schema changes simpler and more agile. For Airbnb, the adoption of StarRocks has significantly improved the efficiency, scalability, and responsiveness of the Minerva platform.”
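The trade-off described above, denormalizing ahead of time versus joining at query time, can be sketched in a few lines. This example uses Python's built-in sqlite3 module as a stand-in engine, and the `bookings`, `listings`, and `bookings_wide` tables are invented for illustration; they are not Minerva's schema or Airbnb's pipeline. It shows why a pre-joined copy goes stale when a dimension changes, while a query-time JOIN picks the change up immediately.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE listings (listing_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE bookings (booking_id INTEGER PRIMARY KEY,
                           listing_id INTEGER, nights INTEGER);
    INSERT INTO listings VALUES (1, 'Paris'), (2, 'Tokyo');
    INSERT INTO bookings VALUES (10, 1, 3), (11, 2, 2), (12, 1, 4);
    """
)

# Approach A: a separate "pipeline" step pre-joins the tables into a wide copy.
conn.execute(
    "CREATE TABLE bookings_wide AS "
    "SELECT b.booking_id, b.nights, l.city "
    "FROM bookings b JOIN listings l ON b.listing_id = l.listing_id"
)
WIDE_QUERY = "SELECT city, SUM(nights) FROM bookings_wide GROUP BY city"

# Approach B: keep the tables normalized and join when the query runs.
JOIN_QUERY = (
    "SELECT l.city, SUM(b.nights) "
    "FROM bookings b JOIN listings l ON b.listing_id = l.listing_id "
    "GROUP BY l.city"
)

nights_wide = dict(conn.execute(WIDE_QUERY).fetchall())
nights_joined = dict(conn.execute(JOIN_QUERY).fetchall())

# A dimension value changes. The wide copy is stale until the pipeline
# is rerun, while the query-time join reflects the change right away.
conn.execute("UPDATE listings SET city = 'Paris-FR' WHERE listing_id = 1")
nights_wide_after = dict(conn.execute(WIDE_QUERY).fetchall())
nights_joined_after = dict(conn.execute(JOIN_QUERY).fetchall())

print(nights_wide_after)
print(nights_joined_after)
```

Both approaches return the same totals at first. The difference only appears after the dimension table changes, which is the kind of maintenance burden a fast query-time JOIN is meant to remove.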

Differentiation From The Competition

What differentiates CelerData from its competition? Ye affirmed:

“Being developed on top of StarRocks has allowed CelerData to offer industry-leading performance, especially with JOIN operations. CelerData not only decreases hardware costs, but also simplifies the way users do real-time and batch analytics. We’re the only solution that enables enterprises to take their real-time analytics ‘pipeline free.’”

“The StarRocks project has an extremely active open source community made up of thousands of users and contributors from major enterprises like Airbnb and Pinterest. This makes getting started with the project and troubleshooting setup much easier than with other solutions on the market.”

Future Goals

What are some of the company’s future goals? Ye concluded:

“Our long-term goal is to continue helping enterprises perform their data analysis faster and more efficiently. To achieve this goal, we will maintain our commitment towards making StarRocks a successful open-source project and CelerData Cloud a leader in the cloud analytics space. We warmly welcome any data engineers to join the StarRocks community on Slack, use our products, and provide feedback. We love connecting with the larger community about how we can continue to improve what we do for them.”