Protege: Interview With Co-Founder & CEO Bobby Samuels About The AI Data Platform

By Amit Chowdhry • Yesterday at 10:25 AM

Protege is an AI data platform that securely connects private data holders in industries like healthcare and media with AI developers, enabling the compliant licensing and exchange of real-world data. Bobby Samuels co-founded the company and serves as its CEO, leading its overall strategy, execution, and capital formation. Pulse 2.0 interviewed Samuels to learn more.

Bobby Samuels’ Background

Could you tell me more about your background? Samuels said:

“Before founding Protege, I spent my career working at the intersection of data, privacy, and infrastructure. I previously led teams at companies like Datavant and LiveRamp, where I focused on building neutral data ecosystems in highly regulated environments. That experience shaped how I think about trust, governance, and what it actually takes to operationalize sensitive data at scale.”

“Those lessons became increasingly relevant as AI adoption accelerated. While models and compute advanced rapidly, it became clear that data, especially real-world, proprietary data, was becoming the limiting factor. Protege is a direct extension of that background and the belief that AI progress depends on getting data right.”

Formation Of The Company

How did the idea for the company come together? Samuels shared:

“Protege emerged from a shared realization among the founding team that AI’s next leap wouldn’t come from bigger models alone, but from access to better data. The issue isn’t necessarily that there isn’t enough data, as AI training data will not become scarce anytime soon.”

“While public datasets may become exhausted, vast amounts of proprietary data can and will become available at an increasing pace, if we can figure out how to responsibly unlock and use it at scale. Our founding team had all worked in privacy-first data environments before, and we saw an opportunity to apply those principles to the current AI development context. As a result, we decided to build a platform that enabled licensed access, compensated data holders, and made real-world data of everyday human activities usable for modern AI development. We started in healthcare and quickly moved to media and other forms of content that are reflective of the real world we live in.”

Favorite Memory

What has been your favorite memory working for the company so far? Samuels reflected:

“In my time at Protege, what I’m most proud of is the team we’ve built. So there’s not one specific memory, but rather the experience of getting to work with a group of incredibly dedicated, kind, focused people.”

Protege Platform

Can you tell me more about the Protege platform? Samuels explained:

“Protege is an AI data platform that enables access to trusted, real-world datasets at scale. We aggregate private and proprietary data in partnership with hundreds of data providers across domains like healthcare, media, audio, and motion capture, and curate it into AI-ready datasets for training, evaluation, and benchmarking. Beyond access, we provide the technical and governance layer (curation, de-identification, and structured licensing) that allows AI builders to use sensitive data confidently while ensuring data providers are protected and compensated.”

Challenges Faced

Have you faced any challenges in your sector of work recently? Samuels acknowledged:

“One of the biggest challenges has been changing how people think about data sourcing for AI. For a long time, the assumption was that public web data was sufficient. To put that into perspective, a single Common Crawl release (a full snapshot of the open web) is roughly 419 terabytes, while the total volume of data generated globally is projected to exceed 175 zettabytes, which is roughly 175,000,000,000 terabytes! As a result, the open internet represents only a small fraction of the world’s data. As reliance on web-scale scraping has run into quality, legal, and ethical limits, there’s been a growing learning curve around the need for licensed, real-world data.”

“We addressed that by focusing on education and execution, showing both AI builders and data holders that there’s a scalable, responsible alternative that works in practice.”
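
The scale comparison Samuels cites can be checked with a few lines of arithmetic. The sketch below is only an order-of-magnitude illustration: it uses the two figures quoted above, assumes decimal (SI) units in which one zettabyte equals one billion terabytes, and the constant and function names are hypothetical, chosen for this example.

```python
# Figures as quoted in the interview; decimal (SI) units are an assumption.
COMMON_CRAWL_TB = 419   # approximate size of one Common Crawl release, in TB
GLOBAL_DATA_ZB = 175    # projected global data volume, in ZB

# 1 ZB = 10**21 bytes and 1 TB = 10**12 bytes, so 1 ZB = 10**9 TB.
TB_PER_ZB = 10**9


def zettabytes_to_terabytes(zettabytes):
    """Convert zettabytes to terabytes using decimal prefixes."""
    return zettabytes * TB_PER_ZB


def share_of_global(crawl_tb, global_zb):
    """Return the crawl size as a fraction of the global data volume."""
    return crawl_tb / zettabytes_to_terabytes(global_zb)


global_tb = zettabytes_to_terabytes(GLOBAL_DATA_ZB)
share = share_of_global(COMMON_CRAWL_TB, GLOBAL_DATA_ZB)

print(f"Global data volume: {global_tb:,} TB")
print(f"Common Crawl share: {share:.2e}")
```

Under these assumptions, a single crawl release comes to a few billionths of the projected global total, which is consistent with the "small fraction" described in the quote.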

Evolution Of The Company’s Technology

How has the company’s technology evolved since launching? Samuels noted:

“Early on, the focus was on enabling secure access to real-world data, but over time, as customer needs became more defined, the platform evolved toward curated, fit-for-purpose datasets rather than bulk volume. In media, for example, broad requests like ‘massive volumes of video content’ quickly gave way to more precise needs, such as footage with specific camera movements or lighting conditions. Watching that shift unfold has been extremely useful for us, serving as a clear north star for how we shape our product roadmap.”

“Today, we support multiple stages of the AI lifecycle and a growing range of data modalities, reflecting how AI development itself has matured. That evolution has been driven directly by what teams need to deploy models reliably in real-world environments.”

Significant Milestones

What have been some of the company’s most significant milestones? Samuels cited:

“One of the earliest milestones was launching Protege in 2024 alongside our seed round, which validated that access to real-world training data was a real and urgent bottleneck for AI builders. From the start, the platform was designed to enable secure, governed access to proprietary data—something that had historically taken months or years of negotiation—and early adoption confirmed strong market demand for a better approach.”

“In 2025, we closed a $25 million Series A that allowed us to significantly expand the platform, deepen partnerships across healthcare and media, and grow into new data verticals. That same year, we acquired Calliope Networks, marking our expansion beyond healthcare into premium media and video data, and bringing deep expertise in content licensing and access to hundreds of thousands of hours of high-quality content. More recently, we announced a $30 million Series A-1 to continue scaling the platform and data network as demand for licensed, real-world data accelerates.”

Customer Success Stories

Can you share any specific customer success stories? Samuels highlighted:

“One example is our partnership with Gradient Health, a leading provider of large-scale medical imaging data. Gradient had deep expertise in ingesting and de-identifying imaging studies, but AI developers increasingly needed those images combined with other healthcare data (like clinical notes and patient histories) at a much greater scale and speed. By working with us, Gradient was able to license HIPAA-compliant, de-identified imaging data alongside additional modalities, giving AI builders access to the multimodal datasets they need to train and evaluate advanced healthcare models.”

“Together, we helped shorten data delivery timelines from months to weeks, unlock new enterprise AI deals, and generate seven-figure net new licensing revenue for Gradient Health in under a year. The partnership demonstrates how aggregating data across sources, applying targeted curation, and maintaining strict privacy standards can create meaningful value for both data holders and AI teams without compromising trust or compliance.”

Funding/Revenue

Are you able to discuss funding and/or revenue metrics? Samuels revealed:

“We can share that we recently closed a $30 million Series A extension led by Andreessen Horowitz, bringing total funding to $65 million since founding. The round included pro rata participation from existing investors such as Footwork, CRV, Bloomberg Beta, Flex Capital, and Shaper Capital. Beyond the capital itself, what’s been most encouraging is the continued growth in demand we’re seeing, particularly from healthcare and media organizations looking for high-quality, curated data that can be used responsibly for AI development. The additional funding gives us greater flexibility to invest in the platform, broaden our data coverage, and scale the team and infrastructure needed to support customers as they move from experimentation into real-world deployment.”

Differentiation From The Competition

What differentiates the company from its competition? Samuels revealed:

“Protege doesn’t rely on scraped or synthetic data, and we design for privacy, governance, and rights preservation upfront, not as an afterthought. We did a privacy review before we wrote a single line of code. In building our platform around licensed, real-world data from the start, we are able to focus our efforts on curation and use-case specificity. As AI builders move from experimentation to deployment, the need for trusted, representative data becomes non-negotiable. With our team’s background and expertise, we are uniquely positioned to ethically source and curate high-quality datasets that AI teams can rely on in real-world applications.”

Future Company Goals

What are some of the company’s future goals? Samuels emphasized:

“Near term, our focus is on expanding into new data domains and formats, deepening partnerships with leading institutions, and continuing to evolve the platform to support the full AI development lifecycle.”

“More broadly, we aim to become the central platform for licensed, real-world data in AI and a leading voice on how data should be sourced and used responsibly.”

Additional Thoughts

Any other topics you would like to discuss? Samuels concluded:

“One topic worth highlighting is the broader shift underway in how AI systems source data. Ongoing copyright lawsuits and fair-use challenges, now moving through U.S. courts and involving creators, publishers, and AI developers, are forcing the industry to confront how training data is acquired and used. As legal scrutiny increases and expectations around quality and accountability rise, the model is moving away from extraction and toward collaboration.”

“We believe this shift toward licensed, compensated, real-world data is not just inevitable, but foundational to building AI systems that people and institutions can actually trust.”