
The Preference Data Paradox: Why You Can't Hire Your Way to Good Training Data

*The dream of clean triplet data, and why it's a fantasy*

Context Team

Contributor · Oct 20, 2018



There's a seductive vision in AI training: hire expert annotators to create preference data. Show them two outputs, ask which is better, collect thousands of (prompt, chosen, rejected) triplets, run RLHF, and watch your model improve.
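
For concreteness, a single record in that dataset is nothing more than a prompt paired with a chosen and a rejected completion. A minimal sketch, with illustrative field names rather than any particular vendor's schema:

```python
from dataclasses import dataclass

@dataclass
class PreferenceTriplet:
    """One annotated comparison. Field names are illustrative, not a vendor schema."""
    prompt: str    # the input shown to the model
    chosen: str    # the response the annotator preferred
    rejected: str  # the response the annotator passed over

example = PreferenceTriplet(
    prompt="Draft a reply to this customer complaint about a delayed shipment.",
    chosen="Hi Sam, thanks for flagging this. Here's what happened and how we'll fix it...",
    rejected="We apologize for any inconvenience this may have caused.",
)
```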

This vision has created a multi-billion dollar annotation industry. Scale AI, Surge, Appen, and hundreds of smaller shops employ armies of annotators to provide the human judgment that supposedly trains AI systems.

But there's a problem nobody wants to talk about: annotation is a fundamentally lossy approximation of real decisions.

And the most valuable decisions—the ones that would actually make AI useful in enterprises—are precisely the ones that can't be annotated.


The Triplet Dream

The idealized preference learning pipeline looks like this:

  1. Generate two responses to a prompt
  2. Show both to an expert annotator
  3. Annotator selects the better response
  4. Repeat thousands of times
  5. Train a reward model on these preferences (a code sketch of this step follows the list)
  6. Use the reward model to fine-tune the base model

This works well enough for general capabilities. "Which explanation of photosynthesis is clearer?" "Which code snippet is more efficient?" "Which email is more professional?" These are judgments that transfer across contexts.

But enterprise AI doesn't need to learn general preferences. It needs to learn your preferences. Your organization's preferences. The specific, contextual, often contradictory preferences that emerge from institutional culture, regulatory constraints, competitive dynamics, and individual relationships.

And these can't be captured through annotation.


Why Annotation Fails

Problem 1: Annotators Don't Have Your Context

When a skilled underwriter evaluates a loan application, they're drawing on:

  • Five years of experience at this specific institution
  • Knowledge of which exceptions leadership will approve
  • Understanding of the current risk appetite given recent market conditions
  • Relationships with specific brokers and their track records
  • Institutional memory of similar deals that went wrong

An annotator, no matter how smart, doesn't have any of this. They can evaluate "is this underwriting decision reasonable?" They cannot evaluate "is this the decision we would make?"

The preference signal from annotation is necessarily generic. It captures what a reasonable person would choose, not what your organization would choose given your context.

Problem 2: Real Decisions Are Politically Entangled

In the real world, "which option is better" is rarely a clean technical judgment. It's wrapped in organizational politics:

  • The VP who sponsored Option A will be offended if we choose Option B
  • Option B is technically better but requires cooperation from a team with whom we have a difficult relationship
  • Option A was already mentioned to the client, so changing course requires awkward communication
  • The person who proposed Option B is up for promotion, and this decision will be used as evidence

An annotator sees two options. They don't see the web of relationships, incentives, and consequences that make the choice meaningful.

Problem 3: The Ground Truth Keeps Changing

Your organization's preferences aren't static. They shift based on:

  • Leadership changes
  • Market conditions
  • Regulatory updates
  • Competitive dynamics
  • Recent successes and failures

The annotated preference data from six months ago might be actively misleading today. The "right" answer to "how should we handle this customer complaint" depends on whether you're in growth mode or efficiency mode, whether the customer is strategic or transactional, whether you've had similar complaints recently or this is an outlier.

Annotators capture a snapshot. Real preferences are a living stream.

Problem 4: Human Error Isn't Noise—It's Signal

Traditional annotation pipelines treat disagreement between annotators as noise to be averaged out. If three annotators choose A and two choose B, A wins.

But in real enterprise contexts, the disagreement IS the signal. The fact that reasonable people would differ on this decision tells you something important about the decision itself. Maybe it genuinely depends on factors not visible in the prompt. Maybe it's a case where organizational values are in tension. Maybe it's a situation that requires escalation rather than automated resolution.

Compressing this nuance into binary preference labels destroys exactly the information you need.
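
For contrast, here is a hedged sketch of what gets lost: plain majority voting versus an aggregation that keeps the split visible. The field names and the escalation threshold are illustrative assumptions, not a prescribed method:

```python
from collections import Counter

def majority_vote(votes: list[str]) -> str:
    """The standard aggregation: disagreement is averaged away."""
    return Counter(votes).most_common(1)[0][0]

def aggregate_with_disagreement(votes: list[str]) -> dict:
    """Keep the split as signal instead of discarding it (illustrative)."""
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    agreement = top / len(votes)
    return {
        "label": label,
        "agreement": agreement,               # 0.6 for a 3-2 split
        "needs_escalation": agreement < 0.8,  # threshold is an assumption
    }

votes = ["A", "A", "A", "B", "B"]
print(majority_vote(votes))                # "A" -- the two dissenting votes vanish
print(aggregate_with_disagreement(votes))  # the 3-2 split stays visible
```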


You Have to Sit in the Seat

There's a concept in enterprise training called "shadowing"—watching an expert do their job before doing it yourself. But shadowing isn't enough. At some point, you have to sit in the seat and make the actual decision.

The same is true for AI systems. Observing decisions is not the same as making them.

When an AI system actually executes a task—sends the email, generates the analysis, makes the recommendation—and a human validates or corrects it, that signal is categorically different from an annotator's hypothetical preference.

The difference:

Annotation: "If faced with this choice, I would probably pick A."

In-seat validation: "The system chose A. Given the actual consequences of this decision, for this actual customer, with this actual context, A was wrong. Here's why, and here's what it should have done."

The second signal is grounded in reality. It includes the full context. It captures the actual consequence of the decision. And crucially, it comes from someone who has to live with the outcome.
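
One way to see the gap is to compare what each signal would actually record. The schemas below are hypothetical illustrations, not Context's data model:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AnnotationLabel:
    """Hypothetical out-of-context preference: 'I would probably pick A.'"""
    prompt: str
    chosen: str
    rejected: str

@dataclass
class InSeatValidation:
    """Hypothetical in-context correction on a task the system actually executed."""
    task_id: str               # the real task, tied to a real customer and workflow
    system_output: str         # what the AI actually did
    verdict: str               # e.g. "approved", "corrected", "escalated"
    correction: Optional[str]  # what it should have done instead, if anything
    rationale: str             # why, given this specific situation
    reviewer: str              # the expert who has to live with the outcome
```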


The Shift: From Annotation to Expert Feedback

This is why we built Context Evals around expert feedback loops rather than annotation pipelines.

The model (a rough code sketch follows the list):

  1. AI executes the actual task in Context Workspace—not a simulated version, the real thing
  2. Real experts validate the output—people who have institutional context, understand the stakes, and bear responsibility for outcomes
  3. Feedback is captured in full context—not just "A or B" but "here's what was wrong and why, given this specific situation"
  4. Learning is continuous—every validated task becomes training signal for similar future tasks
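
In code, the loop amounts to something like the sketch below. The function and field names are illustrative assumptions, not Context Workspace's actual API:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Tuple

@dataclass
class ValidatedTask:
    task: str
    output: str
    approved: bool
    correction: Optional[str]
    rationale: str

@dataclass
class FeedbackLoop:
    training_pool: list = field(default_factory=list)

    def run(
        self,
        task: str,
        execute: Callable[[str], str],
        validate: Callable[[str, str], Tuple[bool, Optional[str], str]],
    ) -> ValidatedTask:
        output = execute(task)                                     # 1. AI executes the actual task
        approved, correction, rationale = validate(task, output)   # 2-3. expert review, in context
        record = ValidatedTask(task, output, approved, correction, rationale)
        self.training_pool.append(record)                          # 4. validated task joins the training signal
        return record
```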

This is more expensive per data point than annotation. You can't hire offshore teams to do it. It requires the actual experts who do the actual work.

But the signal quality is incomparable. One piece of genuine in-context expert feedback is worth a hundred annotated triplets.


The Trillion-Dollar Market for Real Decisions

Here's the uncomfortable truth: every AI system learns from humans in some form—demonstrations, supervised fine-tuning, preference data, reinforcement learning signals.

Calling this work "annotation" or "data labeling" radically undersells what's happening. These aren't mechanical tasks. They're expressions of human judgment, expertise, and decision-making in structured form.

The question is: whose judgment?

Generic annotators give you generic preferences that produce generic AI behavior. Your actual experts, making actual decisions in actual context, give you institutional intelligence that produces AI that works like your organization works.

The first approach scales cheaply. The second approach creates lasting competitive advantage.


Why Current Systems Are Stuck in Demos

This explains why so many enterprise AI deployments fail to move beyond demos.

The demo works because it's evaluated against generic criteria. "Is this response helpful?" Yes. "Is this analysis reasonable?" Sure. Demo complete.

Production fails because real work requires institutional judgment. The response needs to be helpful in the way we help customers. The analysis needs to be reasonable given what we know about this specific situation.

You can't annotate your way to this capability. You can't hire contractors to generate it. You can only capture it by embedding AI in real workflows, having real experts validate real decisions, and building learning loops that compound institutional knowledge over time.

This is the last frontier for data in AI. Not more scale. Not better annotators. Real decisions, made in context, by the people who have to live with the consequences.


The path forward isn't more annotation. It's building systems where expert feedback emerges naturally from real work—which is exactly what we're building at context.inc.
