Reinforcement Learning in Real-World Systems: When It Works, When It Doesn't
A practitioner's guide to RL deployment beyond the lab
By Sasank Aduri
Reinforcement learning has produced some of the most impressive AI results: AlphaGo beating world champions, robots learning to walk, language models learning to reason through RLHF.
The temptation is to apply RL everywhere. "If it can master Go, surely it can optimize our sales process."
The reality is more nuanced. RL has specific requirements that many real-world domains don't meet. Understanding when RL works—and when it fails—is crucial for practical AI deployment.
What RL Needs
Reinforcement learning, at its core, is learning through trial and error with reward signals. The agent takes actions, receives rewards (or penalties), and adjusts its policy to maximize expected reward.
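A minimal sketch of that loop, using a toy two-action bandit and an epsilon-greedy value update in place of whatever algorithm you actually deploy (all names and numbers here are illustrative):

```python
import random

# Toy environment: two actions, action 1 pays off more often.
# Stand-in for whatever system the agent actually interacts with.
REWARD_PROBS = {0: 0.3, 1: 0.7}

def step(action: int) -> float:
    """Return a reward of 1.0 with the action's payoff probability."""
    return 1.0 if random.random() < REWARD_PROBS[action] else 0.0

# Simple value estimates, one per action, learned by trial and error.
values = {0: 0.0, 1: 0.0}
counts = {0: 0, 1: 0}
epsilon = 0.1  # exploration rate

for trial in range(1000):
    # Explore occasionally, otherwise exploit the best-known action.
    if random.random() < epsilon:
        action = random.choice([0, 1])
    else:
        action = max(values, key=values.get)

    reward = step(action)

    # Incremental average: move the estimate toward the observed reward.
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]

print(values)  # estimates should approach the true payoff rates
```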
This requires several preconditions:
1. Clear Reward Signal
RL needs to know what "good" looks like. This seems obvious but is surprisingly difficult in practice.
Where it works: Games have scores. Robotics has physical objectives (reach this position, maintain balance). Code generation has tests that pass or fail.
Where it struggles: "Write a good email" has no clear reward. "Make a good business decision" depends on downstream effects that may not manifest for months. "Satisfy the customer" involves subjective judgment that varies by customer.
Many enterprise tasks are reward deserts—unclear success signals, noisy feedback, delayed consequences.
2. Ability to Explore
RL learns by trying things and seeing what happens. The agent needs to explore different actions to discover what works.
Where it works: Simulated environments where exploration is free. Digital systems where you can try variations. A/B testing where you can run experiments.
Where it struggles: High-stakes domains where failures are costly. Regulated industries where exploration may violate compliance. Customer-facing systems where bad experiences damage relationships.
You can't "explore" M&A strategies by trying random approaches with real deals.
3. Sufficient Iteration
RL typically requires many trials to learn effective policies. AlphaGo trained on millions of games. RLHF for language models uses millions of preference comparisons.
Where it works: Tasks with high volume and fast feedback. Recommendation systems with millions of daily interactions. Ad serving with immediate click signals.
Where it struggles: Low-volume domains with slow feedback. Enterprise decisions made monthly or quarterly. Strategic choices with consequences that unfold over years.
4. Stable Environment
RL learns a policy optimized for a specific environment. If the environment changes, the learned policy may no longer apply.
Where it works: Games with fixed rules. Physical systems with consistent dynamics. Domains with stable underlying processes.
Where it struggles: Markets that shift. Competitors that adapt. Regulations that change. Organizations that evolve.
The Enterprise RL Gap
Many enterprise tasks fail multiple RL preconditions simultaneously:
- Unclear rewards: "Good analysis" is subjective and context-dependent
- Exploration is costly: Bad decisions harm the business
- Low iteration: Decisions happen at human timescales, not milliseconds
- Unstable environment: Markets, competitors, and organizations constantly change
This doesn't mean RL is useless in enterprise settings. It means naive application of research RL methods won't work. You need different approaches.
What Actually Works
1. Expert Feedback as Reward
Instead of automated reward signals, use human expert feedback as the reward source.
The pattern:
- Agent performs task
- Expert evaluates output
- Feedback becomes reward signal
- Agent adjusts based on feedback
This is essentially RLHF adapted to enterprise tasks. The key insight: experts can recognize good outputs even when automated metrics can't.
An expert can't write a function that returns "1" for good M&A analyses and "0" for bad ones. But they can look at an analysis and provide structured feedback on what's good and what needs improvement.
This feedback, accumulated over many tasks, becomes the training signal.
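Here is one way that conversion can look in code, a minimal sketch with an illustrative feedback structure and hand-picked weights, not a prescribed format:

```python
from dataclasses import dataclass, field

@dataclass
class ExpertFeedback:
    """Structured feedback an expert attaches to one agent output.
    Fields are illustrative; use whatever structure your domain needs."""
    task_id: str
    strengths: list[str] = field(default_factory=list)
    issues: list[str] = field(default_factory=list)
    overall: int = 3          # 1 (poor) to 5 (excellent)
    would_use_as_is: bool = False

def feedback_to_reward(fb: ExpertFeedback) -> float:
    """Collapse structured feedback into a scalar reward in [0, 1].
    The weights below are assumptions to be tuned with your experts."""
    base = (fb.overall - 1) / 4                  # normalize 1-5 to 0-1
    penalty = 0.05 * len(fb.issues)              # each flagged issue costs a little
    bonus = 0.1 if fb.would_use_as_is else 0.0   # usable-as-is is the real target
    return max(0.0, min(1.0, base - penalty + bonus))

fb = ExpertFeedback(
    task_id="analysis-042",
    strengths=["correct comparables set"],
    issues=["missing sensitivity analysis", "stale revenue figures"],
    overall=3,
)
print(feedback_to_reward(fb))  # 0.5 base - 0.1 penalty = 0.4
```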
Requirements:
- Experts willing to provide feedback (addressed through workflow integration)
- Structured feedback format (not just "this is bad" but "here's what's wrong and why")
- Sufficient volume over time (months, not days)
2. Constrained Action Spaces
Instead of learning open-ended policies, constrain what the agent can do.
Wide action space: "Do whatever you think is best."
Constrained action space: "Choose from these five approaches, then execute within these parameters."
Constrained spaces are:
- Easier to explore safely
- Faster to learn (fewer possible actions)
- More interpretable (clear mapping from actions to outcomes)
- More controllable (bounds on possible behavior)
The tradeoff is reduced flexibility. But in enterprise settings, bounded, reliable behavior often beats theoretically optimal but unpredictable behavior.
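In code, a constrained action space can be as simple as an explicit whitelist of approaches plus hard bounds on parameters, with anything outside the whitelist rejected before execution. A sketch with hypothetical approach names:

```python
from dataclasses import dataclass
from enum import Enum

class Approach(Enum):
    """The only approaches the agent is allowed to choose from."""
    TEMPLATE_SUMMARY = "template_summary"
    COMPARABLE_ANALYSIS = "comparable_analysis"
    TREND_REVIEW = "trend_review"
    ESCALATE_TO_HUMAN = "escalate_to_human"
    NO_ACTION = "no_action"

@dataclass
class BoundedParams:
    """Execution parameters with hard bounds enforced before running."""
    max_sources: int = 5
    lookback_days: int = 90

    def validated(self) -> "BoundedParams":
        if not (1 <= self.max_sources <= 10):
            raise ValueError("max_sources must be in [1, 10]")
        if not (7 <= self.lookback_days <= 365):
            raise ValueError("lookback_days must be in [7, 365]")
        return self

def execute(approach: Approach, params: BoundedParams) -> str:
    """Run only whitelisted approaches with validated parameters."""
    if not isinstance(approach, Approach):
        raise ValueError("approach must be a whitelisted Approach member")
    params = params.validated()
    return f"running {approach.value} over the last {params.lookback_days} days"

print(execute(Approach.TREND_REVIEW, BoundedParams(max_sources=3, lookback_days=30)))
```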
3. Offline RL / Learning from Demonstrations
Instead of learning through online exploration, learn from historical data.
The enterprise has records of past decisions:
- What options were considered
- What was chosen
- How it turned out
This historical data can train RL policies without risky exploration. The agent learns from what worked in the past.
Limitations:
- Only learns from past decision distribution (may miss better options never tried)
- Historical data quality matters enormously
- Environment shift invalidates old data
Offline learning works best when combined with limited online refinement once a reasonable baseline policy is in place.
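The simplest form of learning from demonstrations is to treat logged situation-decision-outcome records as supervised data and pick, per situation, the option with the best historical outcome. A toy sketch with a made-up decision log:

```python
from collections import defaultdict

# Hypothetical decision log: (situation features, option chosen, outcome score).
# In practice these come from historical records, not a literal list.
history = [
    ({"deal_size": "small", "sector": "saas"}, "fast_track_review", 0.9),
    ({"deal_size": "small", "sector": "saas"}, "fast_track_review", 0.8),
    ({"deal_size": "large", "sector": "saas"}, "full_diligence", 0.7),
    ({"deal_size": "large", "sector": "retail"}, "full_diligence", 0.6),
    ({"deal_size": "large", "sector": "retail"}, "fast_track_review", 0.1),
]

def situation_key(features: dict) -> tuple:
    """Discretize a situation into a hashable key (toy featurization)."""
    return tuple(sorted(features.items()))

# Learn a simple policy: for each situation, the action with the best
# average historical outcome. Only actions actually tried can be learned.
totals: dict = defaultdict(lambda: defaultdict(list))
for features, action, outcome in history:
    totals[situation_key(features)][action].append(outcome)

policy = {
    key: max(actions, key=lambda a: sum(actions[a]) / len(actions[a]))
    for key, actions in totals.items()
}

print(policy[situation_key({"deal_size": "large", "sector": "retail"})])
# -> "full_diligence": the option with the better average past outcome
```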
4. Hierarchical RL
Break complex decisions into levels:
- High-level: Strategic direction (infrequent, high-stakes, human-in-loop)
- Mid-level: Tactical choices (periodic, moderate stakes, agent with oversight)
- Low-level: Operational execution (frequent, low-stakes, autonomous agent)
RL can work well at lower levels where:
- Decisions are frequent enough for learning
- Stakes are low enough for exploration
- Feedback is fast enough for iteration
Higher levels use different approaches—expert judgment, structured decision processes, human-in-the-loop systems.
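A sketch of how those levels might be wired together, with the learned policy confined to low-stakes operational work and humans kept in the loop above it (routing rules and thresholds here are illustrative):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Decision:
    level: str        # "strategic", "tactical", or "operational"
    description: str
    stakes: float     # rough 0-1 estimate of downside risk

def route(decision: Decision,
          rl_policy: Callable[[Decision], str],
          human_review: Callable[[Decision], str]) -> str:
    """Send each decision to the mechanism suited to its level."""
    if decision.level == "strategic":
        # Infrequent, high-stakes: humans decide, agents only prepare material.
        return human_review(decision)
    if decision.level == "tactical" or decision.stakes > 0.3:
        # Agent proposes, human approves before execution.
        proposal = rl_policy(decision)
        return human_review(Decision("tactical", f"approve: {proposal}", decision.stakes))
    # Frequent, low-stakes operational work: the learned policy acts autonomously.
    return rl_policy(decision)

# Stand-in implementations for the sketch.
rl_policy = lambda d: f"policy action for '{d.description}'"
human_review = lambda d: f"human decision on '{d.description}'"

print(route(Decision("operational", "refresh weekly metrics", 0.05), rl_policy, human_review))
print(route(Decision("strategic", "enter new market", 0.9), rl_policy, human_review))
```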
Context's Approach
At Context, we're building RL capabilities specifically designed for enterprise realities.
Expert Feedback Loops via Context Evals
Context Evals captures structured expert feedback on agent outputs:
- What was the task?
- What did the agent produce?
- What aspects were good?
- What aspects needed improvement?
- What would have been better?
This feedback is the reward signal for enterprise RL. It's not as clean as a game score, but it's realistic for the domain.
Golden Workflows as Demonstrations
Successful task executions in Context Workspace become training demonstrations:
- Complete trace of the execution
- Decision points and choices made
- Final output and validation
These golden workflows seed offline RL training, providing a baseline policy before any exploration.
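A sketch of what one captured demonstration might contain (these structures are illustrative, not the actual Context Workspace format), and how a trace flattens into state-action pairs for offline training:

```python
from dataclasses import dataclass, field

@dataclass
class DecisionPoint:
    """One choice made during a successful execution."""
    context: str            # what the agent knew at this step
    options: list[str]      # what it could have done
    chosen: str             # what it actually did

@dataclass
class GoldenWorkflow:
    """A validated, end-to-end successful task execution."""
    task: str
    steps: list[DecisionPoint] = field(default_factory=list)
    final_output: str = ""
    validated_by: str = ""

    def as_demonstrations(self) -> list[tuple[str, str]]:
        """Flatten the trace into (state, action) pairs for offline training."""
        return [(step.context, step.chosen) for step in self.steps]

wf = GoldenWorkflow(
    task="quarterly competitor summary",
    steps=[
        DecisionPoint("no prior summary exists",
                      ["start_from_template", "copy_last_quarter"],
                      "start_from_template"),
        DecisionPoint("three sources conflict on revenue",
                      ["flag_for_review", "pick_latest"],
                      "flag_for_review"),
    ],
    final_output="summary_v3.md",
    validated_by="analyst@example.com",
)
print(wf.as_demonstrations())
```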
Rubric-Based Evaluation
Instead of single reward numbers, we use multi-dimensional rubrics:
- Accuracy (did the output match requirements?)
- Completeness (were all aspects addressed?)
- Quality (was the execution high-quality?)
- Efficiency (was the approach appropriately scoped?)
Multi-dimensional feedback enables more nuanced policy learning than single-score optimization.
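One way to use such a rubric as a training signal is to keep the per-dimension scores for diagnostics and only collapse them to a scalar where the optimizer needs one. A sketch with illustrative dimensions and weights:

```python
# Illustrative rubric: each dimension scored 1-5 by a reviewer.
rubric_scores = {
    "accuracy": 4,
    "completeness": 3,
    "quality": 4,
    "efficiency": 5,
}

# Weights are an assumption; in practice they are calibrated with experts.
weights = {
    "accuracy": 0.4,
    "completeness": 0.25,
    "quality": 0.25,
    "efficiency": 0.1,
}

def scalarize(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted average, normalized from the 1-5 scale to [0, 1]."""
    total = sum(weights[d] * (scores[d] - 1) / 4 for d in weights)
    return total / sum(weights.values())

# Keep the full vector for analysis; use the scalar only where needed.
print(rubric_scores, "->", scalarize(rubric_scores, weights))
```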
Safe Exploration via Validation
Before agents act autonomously, outputs go through validation. This serves two purposes:
- Safety: Humans catch problematic outputs before they cause harm
- Signal: Validation/rejection provides additional training signal
Over time, as agent capability improves, validation gates can become less frequent for proven task types.
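A validation gate can be as simple as routing every output of a task type to human review until its measured approval rate clears a threshold; the thresholds and names below are assumptions, not Context's actual policy:

```python
from collections import defaultdict

class ValidationGate:
    """Route agent outputs to human review until a task type is proven.
    The approval history doubles as additional training signal."""

    def __init__(self, approval_threshold: float = 0.95, min_samples: int = 50):
        self.approval_threshold = approval_threshold
        self.min_samples = min_samples
        self.history: dict[str, list[bool]] = defaultdict(list)

    def needs_review(self, task_type: str) -> bool:
        """Require review until there is enough evidence of reliability."""
        outcomes = self.history[task_type]
        if len(outcomes) < self.min_samples:
            return True
        approval_rate = sum(outcomes) / len(outcomes)
        return approval_rate < self.approval_threshold

    def record(self, task_type: str, approved: bool) -> None:
        """Log whether a human approved the output; this is also reward data."""
        self.history[task_type].append(approved)

gate = ValidationGate()
print(gate.needs_review("weekly_metrics_summary"))  # True: no track record yet
for _ in range(60):
    gate.record("weekly_metrics_summary", approved=True)
print(gate.needs_review("weekly_metrics_summary"))  # False: proven task type
```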
When to Use RL (and When Not To)
Use RL when:
- You can define meaningful evaluation criteria (even if subjective)
- You have experts who can provide feedback
- You have sufficient task volume for learning
- The domain is stable enough that learned patterns persist
Don't use RL when:
- Success is completely undefined or purely political
- No experts are available for feedback
- Task volume is too low for statistical learning
- The domain changes faster than learning can happen
Consider hybrid approaches when:
- High-level decisions are rare but low-level execution is frequent
- Some aspects are measurable while others require judgment
- Exploration can be done in simulation before real deployment
The Realistic Timeline
RL in enterprise settings isn't a deployment decision. It's an investment decision.
Months 1-3: Build feedback capture infrastructure. Accumulate labeled examples.
Months 4-6: Train initial models on accumulated data. Establish baseline metrics.
Months 7-12: Begin cautious online refinement. Measure improvement.
Year 2+: Realize compounding benefits. Knowledge accumulates. Policies improve.
The companies that start building RL feedback infrastructure today will have capabilities that can't be replicated by waiting for better models. The models will get better, but they'll still need your feedback data to learn your domain.
This is part of a series on applied ML at Context. Learn more at context.inc