Poindexter Labs: Interview With Founder & CEO Jocelyn D’Arcy About The Expert Reasoning Data Company

By Amit Chowdhry Jul 3, 2026

Poindexter Labs is an expert reasoning data company building high-fidelity training data for frontier AI models using an academic peer-review model rather than factory workflows. Pulse 2.0 interviewed Poindexter Labs founder and CEO Jocelyn D’Arcy to learn more.

Jocelyn D’Arcy’s Background

Could you tell me more about your background? D’Arcy said:

I’m an accidental founder. I spent 20 years as a maths teacher and senior leader at highly academic schools, so my whole career was academic: high-trust environments where the entire point of review is to make the work better.

Then a couple of years ago, I got a message on LinkedIn asking if I wanted to do maths Olympiad problems in my spare time for $50 an hour. I thought, “Are you kidding? I would do that for free!” That’s how I fell into AI data: first as a senior reviewer at Scale, then as a technical project manager at Labelbox.

Formation Of The Company

How did the idea for the company come together? D’Arcy shared:

I was working on projects where 90% of the data we produced was thrown away. Not because it was bad, but because of the workflow.

Karen Hao writes widely about what she calls the “exploitative” conditions facing data annotators: the unpredictability of work and the draconian systems that strip contractors of agency. Personally, I’d wake up three or four times a night to check my phone and see if I was still on a project.

The consequence is that, within this framework, the only rational action is never to send work forward. I’d open a task, and my first thought was always: what’s the fastest reason I can find to send this back? If I couldn’t find one, sometimes I’d just skip it entirely, because sending anything forward was too risky. Get one task wrong and they kick you off the project and strip your credentials.

I realized this isn’t a quality problem; it’s a structural incentive problem. The whole review process was adversarial by design, and it was killing throughput.

The crazy part is that quality assurance for expert knowledge generation doesn’t need to be invented from scratch. Academics have been doing it for centuries. Peer review exists to improve work, not to bin it. I just had to build a platform that worked that way.

Favorite Memory

What has been your favorite memory working for the company so far? D’Arcy reflected:

It was the weekend before my birthday. I was on the chat with some of my subject leads – they’re contractors, not employees – and I mentioned I might look a little older next time we spoke because it was my birthday over the weekend. That evening I went to a tech event with one of them. And he had a gift for me. A blanket with the Poindexter glasses superimposed onto my LinkedIn profile pic. Because I often work from bed, I always have a blanket.

That was the moment I thought – okay, we’ve actually built something here. These are people who chose to work with us, who could be anywhere, and they did that. A lot of them have also invested their own savings in the company. They’re not contractors in the traditional sense. They’re part of it.

Core Products

What are the company’s core products and features? D’Arcy explained:

The platform is called Syncronus, and it’s built around one idea: collaborative peer review. A contributor claims a task, writes their prompt, and when they think it’s ready, they mark it for peer review. The same group of people who write the tasks then come in and make edits. Crucially, their name is on the work, and it is visible to everyone. If a reviewer spots something too big to fix themselves, they message the author and work together to fix it.

Because the workflow is designed to rescue tasks, not reject them, we deliver 98+% of tasks we create, with an accuracy of 99+%.
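As a rough illustration of the workflow D’Arcy describes, the sketch below models a task that can be drafted, marked for peer review, edited by a named reviewer and delivered. It is a minimal sketch, not Syncronus code: the class, field and method names are all hypothetical, and the only points taken from the interview are that reviewers edit rather than reject and that every change carries a visible name.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    DRAFT = "draft"
    IN_REVIEW = "in_review"
    DELIVERED = "delivered"
    # Deliberately no REJECTED state: in this sketch review can only improve a task.


@dataclass
class Task:
    author: str
    prompt: str = ""
    status: Status = Status.DRAFT
    # Named, visible log of (person, action); a hypothetical field for illustration.
    history: list[tuple[str, str]] = field(default_factory=list)

    def submit_for_review(self) -> None:
        """The author marks the task as ready for peer review."""
        if self.status is not Status.DRAFT:
            raise ValueError("only a draft can be marked for peer review")
        self.status = Status.IN_REVIEW
        self.history.append((self.author, "marked for peer review"))

    def review_edit(self, reviewer: str, new_prompt: str, note: str) -> None:
        """A reviewer fixes the task in place instead of sending it back."""
        if self.status is not Status.IN_REVIEW:
            raise ValueError("task is not in peer review")
        self.prompt = new_prompt
        self.history.append((reviewer, note))

    def approve(self, reviewer: str) -> None:
        """A reviewer signs off and the task is delivered."""
        if self.status is not Status.IN_REVIEW:
            raise ValueError("task is not in peer review")
        self.status = Status.DELIVERED
        self.history.append((reviewer, "approved"))


task = Task(author="alice", prompt="Find all n with n^2 = 16.")
task.submit_for_review()
task.review_edit("bob", "Find all integers n with n^2 = 16.", "specified the domain")
task.approve("bob")
print(task.status.value, task.history)
```

The design choice worth noting is the missing rejection path: a reviewer who finds a problem has to edit the task, or contact the author, rather than discard it.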

And we do something nobody else does at the front door: every contributor sits a live 20-minute subject-knowledge interview before they’re accepted. Not a Fiverr-style marketplace quiz that everyone shares the answers to, but a genuine, AI-proof, Oxbridge-style interview.

Challenges Faced

Have you faced any challenges in your sector recently, and how did you overcome them? D’Arcy acknowledged:

The hardest thing has been getting a direct lab contract. We’d been producing data that ended up inside the majority of major frontier models – but under other companies’ names, through intermediary relationships. I couldn’t go to a lab and say, you know that delivery you just got from X? We did that. I’m under NDA. So I couldn’t use our own track record as a reference.

And it’s entirely a trust industry. Even when labs receive data, they can’t verify all of it. In the end it came down to relationships – getting in front of the right researcher, demonstrating credibility directly, and being patient. We signed our first direct lab contract in May 2026. That’s the milestone everything else is built on.

Evolution Of The Company’s Technology

How has the company’s technology evolved since launching? D’Arcy noted:

The peer-review workflow was always the core. What’s evolved is everything around it. We came off Google Sheets and codified our workflow through our platform, Syncronus, which is now available on GitHub for others to use. We’ve also built the industry’s most sophisticated tooling with API calls for auto-review, plagiarism detection, misconception reporting and similarity checking.

The next phase is environments and the data pipeline, which is part of what the seed round is funding. The insight is that the most valuable thing we can offer a lab isn’t just tasks, it’s the ability to find exactly where a model is going wrong and hand them the targeted data to fix it. Right now, a lab can spend millions on a dataset that’s circling a problem they haven’t even precisely named yet. We can name it.

Significant Milestones

What have been some of the company’s most significant milestones? D’Arcy cited:

$1.6 million in revenue in the first six months, entirely bootstrapped – before we took a single pound of outside investment. That told me the model worked.

A paper accepted at ACL – the Association for Computational Linguistics. That’s a peer-reviewed venue. We looked at the performance of frontier models across 800 original Olympiad-style maths problems, stratified by topic and prompt length. What we found was that across all models, the single greatest predictor of failure was prompt length – even at ranges nowhere near approaching the context window limit. That finding has direct implications for how labs should be thinking about training data design, and it came directly out of the work we do with contributors every day.

And then the first direct lab contract in May 2026. That’s the one that matters most. Everything we’d done up to that point was building toward that.

Customer Success Stories

Can you share any specific customer success stories? D’Arcy highlighted:

I can’t name clients, but I can tell you what the researcher at the lab we work with said to me: “You’re our favourite data company to work with. Any other company – when I ask a question, they have to go through four people before they can get me an answer. With you, I ask one thing and you answer immediately. You understand every part of the process.”

That’s the thing the industry consistently gets wrong. The people running projects at the large platforms can’t actually evaluate the data they’re shipping. They don’t have the domain expertise. So there’s this enormous distance between the person writing the task and the person talking to the lab. We collapsed that entirely. I started as a domain expert. I understand the work from the inside.

Funding/Revenue

Are you able to discuss funding and/or revenue metrics? D’Arcy revealed:

Yes. We did $1.6 million in revenue in our first six months, bootstrapped. We raised a £2 million seed round in 2026, which was oversubscribed. Our investors include Episode One and Octopus Ventures’ First Cheque Fund, alongside a number of individual angels – including, notably, our own contributors who invested their personal savings.

I’d rather talk about delivery rates than revenue, honestly. We deliver more than 99% of contracted data. The industry average is somewhere between 5 and 60%. That number is the business case. If you’re a lab spending ten million on a data contract, half of that spend is covering the cost of the discarded tasks.
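The arithmetic behind that claim can be sketched as follows. This is an illustration under a stated assumption, not the company’s costing model: it assumes spend is spread evenly across every task created, whether or not it is delivered. The 50% rate is a hypothetical value chosen to match the “half” in the quote, and the function names are invented for the example.

```python
def discarded_spend(contract_value: float, delivery_rate: float) -> float:
    """Spend attributable to tasks that were created but never delivered.

    Assumes cost is spread evenly across all created tasks (an assumption).
    """
    if not 0 < delivery_rate <= 1:
        raise ValueError("delivery_rate must be in (0, 1]")
    return contract_value * (1 - delivery_rate)


def cost_per_delivered_task(cost_per_created_task: float, delivery_rate: float) -> float:
    """Effective cost of each task that actually reaches the lab."""
    if not 0 < delivery_rate <= 1:
        raise ValueError("delivery_rate must be in (0, 1]")
    return cost_per_created_task / delivery_rate


contract = 10_000_000  # "ten million", currency unspecified in the interview

print(round(discarded_spend(contract, 0.50)))  # 5000000
print(round(discarded_spend(contract, 0.99)))  # 100000
print(cost_per_delivered_task(100.0, 0.50))    # 200.0
```

Under the same assumption, a delivery rate at the low end of the quoted range would leave most of the contract value paying for discarded work.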

Total Addressable Market

What total addressable market is the company pursuing? D’Arcy assessed:

The honest answer is I’m less interested in TAM as a framing than in who the actual customers are. The customers that matter are the ML researchers at frontier labs – OpenAI, DeepMind, Anthropic, Meta, Cohere, Mistral, xAI, AWS. They’re the people who know whether data is good, and they’re the ones who recommend vendors upward to procurement. Win the researchers, and the commercial relationships follow.

The broader market – enterprise SaaS, financial services, legal – that’s a future opportunity as the platform matures. But right now we’re focused on the people who are training the most capable models in the world, and doing it properly.

Differentiation From The Competition

What differentiates the company from its competition? D’Arcy affirmed:

The workflow. Everything else follows from that.

The platforms I came from were built for drawing boxes around stop signs. That was the original use case. Cheap labor, simple tasks, factory model. And then AI got more capable, the problems got harder, and the industry just kept applying the same workflow. Except now they needed Olympiad medalists and PhDs to do the work. And they were managing them like gig workers.

What happens is that the incentives collapse. Reviewers never want to approve a task because approving something and having it rejected downstream gets you penalized. So they reject everything. Or skip everything. And the people writing the tasks stop caring about quality because a good task is as likely to come back as a bad one. The industry delivers 5 to 60% of contracted data and treats that as normal.

We deliver more than 99%. Because we built the workflow around what the work actually requires. Peer review. Collaboration. Transparency. Named feedback. No one loses a task because a reviewer needs to protect their job security.

Future Company Goals

What are some of the company’s future goals? D’Arcy emphasized:

Growing direct lab relationships – that’s the immediate priority. Each one is its own milestone. We want to be one of the top three players in this space by revenue within the next few years.

The platform build-out matters as much as the commercial side. The vision there is being able to give a lab not just tasks, but insight – here’s a specific model misconception we identified, here are the targeted tasks to address it, here’s how to use synthetic generation to scale from that seed data. That’s a fundamentally different product to what anyone else is offering.

Additional Thoughts

Any other topics you’d like to discuss? D’Arcy concluded:

The ACL paper, because I don’t think the finding is getting the attention it deserves. Five frontier models, 800 original maths problems, and the single biggest predictor of failure, across all of them, was prompt length. Not at some enormous context length, either. At 20 to 300 words. That should be trivially short for a modern model.

Here’s why it matters. A maths problem isn’t a story. You can’t just read it left to right and follow the thread. Every new piece of information changes the meaning of everything that came before it. The more words in the problem, the more relationships the model has to hold in its head at once. And it turns out that even at lengths that should be nothing, that’s where things start to break.

What that tells me is that the performance ceiling the labs are hitting isn’t only an architecture problem. It’s a data-quality problem. And that’s a far more fixable problem than people think.

The other thing I could talk about for hours is benchmarks, because I’m a bit obsessed with how broken they are, and it’s the exact problem I spent 20 years watching in education. You default to measuring what’s easy to measure rather than what you actually care about measuring.

Here’s the example I always give. Say the answer to a maths problem is three. A model that says “three” gets full marks. A model that says “three if you exclude degenerate triangles, or infinite if you include them”, which is the more sophisticated answer, gets marked wrong. And a model that just answers “potato” gets marked wrong too. Those last two score exactly the same. That can’t be right.
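The scoring problem in that example can be reproduced with a few lines of code. The sketch below assumes a plain exact-match grader, which is one common way benchmarks are marked but is not stated in the interview to be the method of any particular benchmark; the function name and answer labels are hypothetical.

```python
def exact_match_score(answer: str, reference: str) -> int:
    """Return 1 if the answer equals the reference after trimming and lowercasing, else 0."""
    return int(answer.strip().lower() == reference.strip().lower())


reference = "three"
answers = {
    "plain": "three",
    "nuanced": "three if you exclude degenerate triangles, or infinite if you include them",
    "nonsense": "potato",
}

scores = {label: exact_match_score(text, reference) for label, text in answers.items()}
print(scores)  # {'plain': 1, 'nuanced': 0, 'nonsense': 0}
```

The nuanced answer and the nonsense answer both score zero, so a grader of this kind cannot tell them apart.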

So the thing I really want to build is SovBench. The idea is to bring in the chartered institutes, the engineers, the accountants, the professional bodies that actually hold that expertise, and codify the intuition and the years of experience they keep telling us a model can’t replace. The hard part is that the more complex a question is, the more expensive it is to mark, so we’d fine-tune our own model to act as the judge. A closed model, so it doesn’t hallucinate. My instinct is to make it free for UK companies and treat it as sovereign infrastructure.

And what bothers me most is that right now the people producing benchmarks are mostly the data companies, and I’m saying that as a data company. A benchmark built without the bodies who genuinely own that expertise doesn’t mean very much.

Both of these come back to the same thing: the industry keeps measuring the wrong things, or measuring them badly. Fixing that, both the data and the way we judge it, is the whole reason Poindexter exists.