Reinforcement Learning in Real-World Systems: When It Works, When It Doesn't
A practitioner's guide to RL deployment beyond the lab
By Sasank Aduri
Reinforcement learning has produced some of the most impressive AI results: AlphaGo beating world champions, robots learning to walk, language models learning to reason through RLHF.
The temptation is to apply RL everywhere. "If it can master Go, surely it can optimize our sales process."
The reality is more nuanced. RL has specific requirements that many real-world domains don't meet. Understanding when RL works—and when it fails—is crucial for practical AI deployment.
What RL Needs
Reinforcement learning, at its core, is learning through trial and error with reward signals. The agent takes actions, receives rewards (or penalties), and adjusts its policy to maximize expected reward.
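A minimal sketch of that loop, using a toy two-action bandit and an epsilon-greedy value update in place of whatever algorithm you actually deploy (all names and numbers here are illustrative):

```python
import random

# Toy environment: two actions, action 1 pays off more often.
# Stand-in for whatever system the agent actually interacts with.
REWARD_PROBS = {0: 0.3, 1: 0.7}

def step(action: int) -> float:
    """Return a reward of 1.0 with the action's payoff probability."""
    return 1.0 if random.random() < REWARD_PROBS[action] else 0.0

# Simple value estimates, one per action, learned by trial and error.
values = {0: 0.0, 1: 0.0}
counts = {0: 0, 1: 0}
epsilon = 0.1  # exploration rate

for trial in range(1000):
    # Explore occasionally, otherwise exploit the best-known action.
    if random.random() < epsilon:
        action = random.choice([0, 1])
    else:
        action = max(values, key=values.get)

    reward = step(action)

    # Incremental average: move the estimate toward the observed reward.
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]

print(values)  # estimates should approach the true payoff rates
```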
This requires several preconditions:
1. Clear Reward Signal
RL needs to know what "good" looks like. This seems obvious but is surprisingly difficult in practice.
Where it works: Games have scores. Robotics has physical objectives (reach this position, maintain balance). Code generation has tests that pass or fail.
Where it struggles: "Write a good email" has no clear reward. "Make a good business decision" depends on downstream effects that may not manifest for months. "Satisfy the customer" involves subjective judgment that varies by customer.
Many enterprise tasks are reward deserts—unclear success signals, noisy feedback, delayed consequences.
2. Ability to Explore
RL learns by trying things and seeing what happens. The agent needs to explore different actions to discover what works.
Where it works: Simulated environments where exploration is free. Digital systems where you can try variations. A/B testing where you can run experiments.
Where it struggles: High-stakes domains where failures are costly. Regulated industries where exploration may violate compliance. Customer-facing systems where bad experiences damage relationships.
You can't "explore" M&A strategies by trying random approaches with real deals.
3. Sufficient Iteration
RL typically requires many trials to learn effective policies. AlphaGo trained on millions of games. RLHF for language models uses millions of preference comparisons.
Where it works: Tasks with high volume and fast feedback. Recommendation systems with millions of daily interactions. Ad serving with immediate click signals.
Where it struggles: Low-volume domains with slow feedback. Enterprise decisions made monthly or quarterly. Strategic choices with consequences that unfold over years.
4. Stable Environment
RL learns a policy optimized for a specific environment. If the environment changes, the learned policy may no longer apply.
Where it works: Games with fixed rules. Physical systems with consistent dynamics. Domains with stable underlying processes.
Where it struggles: Markets that shift. Competitors that adapt. Regulations that change. Organizations that evolve.
The Enterprise RL Gap
Many enterprise tasks fail multiple RL preconditions simultaneously:
- Unclear rewards: "Good analysis" is subjective and context-dependent
- Exploration is costly: Bad decisions harm the business
- Low iteration: Decisions happen at human timescales, not milliseconds
- Unstable environment: Markets, competitors, and organizations constantly change
This doesn't mean RL is useless in enterprise settings. It means naive application of research RL methods won't work. You need different approaches.
What Actually Works
1. Expert Feedback as Reward
Instead of automated reward signals, use human expert feedback as the reward source.
The pattern:
- Agent performs task
- Expert evaluates output
- Feedback becomes reward signal
- Agent adjusts based on feedback
This is essentially RLHF adapted to enterprise tasks. The key insight: experts can recognize good outputs even when automated metrics can't.
An expert can't write a function that returns "1" for good M&A analyses and "0" for bad ones. But they can look at an analysis and provide structured feedback on what's good and what needs improvement.
This feedback, accumulated over many tasks, becomes the training signal.
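Here is one way that conversion can look in code, a minimal sketch with an illustrative feedback structure and hand-picked weights, not a prescribed format:

```python
from dataclasses import dataclass, field

@dataclass
class ExpertFeedback:
    """Structured feedback an expert attaches to one agent output.
    Fields are illustrative; use whatever structure your domain needs."""
    task_id: str
    strengths: list[str] = field(default_factory=list)
    issues: list[str] = field(default_factory=list)
    overall: int = 3          # 1 (poor) to 5 (excellent)
    would_use_as_is: bool = False

def feedback_to_reward(fb: ExpertFeedback) -> float:
    """Collapse structured feedback into a scalar reward in [0, 1].
    The weights below are assumptions to be tuned with your experts."""
    base = (fb.overall - 1) / 4                  # normalize 1-5 to 0-1
    penalty = 0.05 * len(fb.issues)              # each flagged issue costs a little
    bonus = 0.1 if fb.would_use_as_is else 0.0   # usable-as-is is the real target
    return max(0.0, min(1.0, base - penalty + bonus))

fb = ExpertFeedback(
    task_id="analysis-042",
    strengths=["correct comparables set"],
    issues=["missing sensitivity analysis", "stale revenue figures"],
    overall=3,
)
print(feedback_to_reward(fb))  # 0.5 base - 0.1 penalty = 0.4
```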
Requirements:
- Experts willing to provide feedback (addressed through workflow integration)
- Structured feedback format (not just "this is bad" but "here's what's wrong and why")
- Sufficient volume over time (months, not days)
2. Constrained Action Spaces
Instead of learning open-ended policies, constrain what the agent can do.
Wide action space: "Do whatever you think is best."
Constrained action space: "Choose from these five approaches, then execute within these parameters."
Constrained spaces are:
- Easier to explore safely
- Faster to learn (fewer possible actions)
- More interpretable (clear mapping from actions to outcomes)
- More controllable (bounds on possible behavior)
The tradeoff is reduced flexibility. But in enterprise settings, bounded, reliable behavior often beats theoretically optimal but unpredictable behavior.
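In code, a constrained action space can be as simple as an explicit whitelist of approaches plus hard bounds on parameters, with anything outside the whitelist rejected before execution. A sketch with hypothetical approach names:

```python
from dataclasses import dataclass
from enum import Enum

class Approach(Enum):
    """The only approaches the agent is allowed to choose from."""
    TEMPLATE_SUMMARY = "template_summary"
    COMPARABLE_ANALYSIS = "comparable_analysis"
    TREND_REVIEW = "trend_review"
    ESCALATE_TO_HUMAN = "escalate_to_human"
    NO_ACTION = "no_action"

@dataclass
class BoundedParams:
    """Execution parameters with hard bounds enforced before running."""
    max_sources: int = 5
    lookback_days: int = 90

    def validated(self) -> "BoundedParams":
        if not (1 <= self.max_sources <= 10):
            raise ValueError("max_sources must be in [1, 10]")
        if not (7 <= self.lookback_days <= 365):
            raise ValueError("lookback_days must be in [7, 365]")
        return self

def execute(approach: Approach, params: BoundedParams) -> str:
    """Run only whitelisted approaches with validated parameters."""
    if not isinstance(approach, Approach):
        raise ValueError("approach must be a whitelisted Approach member")
    params = params.validated()
    return f"running {approach.value} over the last {params.lookback_days} days"

print(execute(Approach.TREND_REVIEW, BoundedParams(max_sources=3, lookback_days=30)))
```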
3. Offline RL / Learning from Demonstrations
Instead of learning through online exploration, learn from historical data.
The enterprise has records of past decisions:
- What options were considered
- What was chosen
- How it turned out
This historical data can train RL policies without risky exploration. The agent learns from what worked in the past.
Limitations:
- Only learns from past decision distribution (may miss better options never tried)
- Historical data quality matters enormously
- Environment shift invalidates old data
Offline learning works best when combined with limited online refinement once a reasonable baseline policy is in place.
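The simplest form of learning from demonstrations is to treat logged situation-decision-outcome records as supervised data and pick, per situation, the option with the best historical outcome. A toy sketch with a made-up decision log:

```python
from collections import defaultdict

# Hypothetical decision log: (situation features, option chosen, outcome score).
# In practice these come from historical records, not a literal list.
history = [
    ({"deal_size": "small", "sector": "saas"}, "fast_track_review", 0.9),
    ({"deal_size": "small", "sector": "saas"}, "fast_track_review", 0.8),
    ({"deal_size": "large", "sector": "saas"}, "full_diligence", 0.7),
    ({"deal_size": "large", "sector": "retail"}, "full_diligence", 0.6),
    ({"deal_size": "large", "sector": "retail"}, "fast_track_review", 0.1),
]

def situation_key(features: dict) -> tuple:
    """Discretize a situation into a hashable key (toy featurization)."""
    return tuple(sorted(features.items()))

# Learn a simple policy: for each situation, the action with the best
# average historical outcome. Only actions actually tried can be learned.
totals: dict = defaultdict(lambda: defaultdict(list))
for features, action, outcome in history:
    totals[situation_key(features)][action].append(outcome)

policy = {
    key: max(actions, key=lambda a: sum(actions[a]) / len(actions[a]))
    for key, actions in totals.items()
}

print(policy[situation_key({"deal_size": "large", "sector": "retail"})])
# -> "full_diligence": the option with the better average past outcome
```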
4. Hierarchical RL
Break complex decisions into levels:
- High-level: Strategic direction (infrequent, high-stakes, human-in-loop)
- Mid-level: Tactical choices (periodic, moderate stakes, agent with oversight)
- Low-level: Operational execution (frequent, low-stakes, autonomous agent)
RL can work well at lower levels where:
- Decisions are frequent enough for learning
- Stakes are low enough for exploration
- Feedback is fast enough for iteration
Higher levels use different approaches—expert judgment, structured decision processes, human-in-the-loop systems.
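A sketch of how those levels might be wired together, with the learned policy confined to low-stakes operational work and humans kept in the loop above it (routing rules and thresholds here are illustrative):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Decision:
    level: str        # "strategic", "tactical", or "operational"
    description: str
    stakes: float     # rough 0-1 estimate of downside risk

def route(decision: Decision,
          rl_policy: Callable[[Decision], str],
          human_review: Callable[[Decision], str]) -> str:
    """Send each decision to the mechanism suited to its level."""
    if decision.level == "strategic":
        # Infrequent, high-stakes: humans decide, agents only prepare material.
        return human_review(decision)
    if decision.level == "tactical" or decision.stakes > 0.3:
        # Agent proposes, human approves before execution.
        proposal = rl_policy(decision)
        return human_review(Decision("tactical", f"approve: {proposal}", decision.stakes))
    # Frequent, low-stakes operational work: the learned policy acts autonomously.
    return rl_policy(decision)

# Stand-in implementations for the sketch.
rl_policy = lambda d: f"policy action for '{d.description}'"
human_review = lambda d: f"human decision on '{d.description}'"

print(route(Decision("operational", "refresh weekly metrics", 0.05), rl_policy, human_review))
print(route(Decision("strategic", "enter new market", 0.9), rl_policy, human_review))
```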
Context's Approach
At Context, we're building RL capabilities specifically designed for enterprise realities.
Expert Feedback Loops via Context Evals
Context Evals captures structured expert feedback on agent outputs:
- What was the task?
- What did the agent produce?
- What aspects were good?
- What aspects needed improvement?
- What would have been better?
This feedback is the reward signal for enterprise RL. It's not as clean as a game score, but it's realistic for the domain.
Golden Workflows as Demonstrations
Successful task executions in Context Workspace become training demonstrations:
- Complete trace of the execution
- Decision points and choices made
- Final output and validation
These golden workflows seed offline RL training, providing a baseline policy before any exploration.
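A sketch of what one captured demonstration might contain (these structures are illustrative, not the actual Context Workspace format), and how a trace flattens into state-action pairs for offline training:

```python
from dataclasses import dataclass, field

@dataclass
class DecisionPoint:
    """One choice made during a successful execution."""
    context: str            # what the agent knew at this step
    options: list[str]      # what it could have done
    chosen: str             # what it actually did

@dataclass
class GoldenWorkflow:
    """A validated, end-to-end successful task execution."""
    task: str
    steps: list[DecisionPoint] = field(default_factory=list)
    final_output: str = ""
    validated_by: str = ""

    def as_demonstrations(self) -> list[tuple[str, str]]:
        """Flatten the trace into (state, action) pairs for offline training."""
        return [(step.context, step.chosen) for step in self.steps]

wf = GoldenWorkflow(
    task="quarterly competitor summary",
    steps=[
        DecisionPoint("no prior summary exists",
                      ["start_from_template", "copy_last_quarter"],
                      "start_from_template"),
        DecisionPoint("three sources conflict on revenue",
                      ["flag_for_review", "pick_latest"],
                      "flag_for_review"),
    ],
    final_output="summary_v3.md",
    validated_by="analyst@example.com",
)
print(wf.as_demonstrations())
```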
Rubric-Based Evaluation
Instead of single reward numbers, we use multi-dimensional rubrics:
- Accuracy (did the output match requirements?)
- Completeness (were all aspects addressed?)
- Quality (was the execution high-quality?)
- Efficiency (was the approach appropriately scoped?)
Multi-dimensional feedback enables more nuanced policy learning than single-score optimization.
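One way to use such a rubric as a training signal is to keep the per-dimension scores for diagnostics and only collapse them to a scalar where the optimizer needs one. A sketch with illustrative dimensions and weights:

```python
# Illustrative rubric: each dimension scored 1-5 by a reviewer.
rubric_scores = {
    "accuracy": 4,
    "completeness": 3,
    "quality": 4,
    "efficiency": 5,
}

# Weights are an assumption; in practice they are calibrated with experts.
weights = {
    "accuracy": 0.4,
    "completeness": 0.25,
    "quality": 0.25,
    "efficiency": 0.1,
}

def scalarize(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted average, normalized from the 1-5 scale to [0, 1]."""
    total = sum(weights[d] * (scores[d] - 1) / 4 for d in weights)
    return total / sum(weights.values())

# Keep the full vector for analysis; use the scalar only where needed.
print(rubric_scores, "->", scalarize(rubric_scores, weights))
```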
Safe Exploration via Validation
Before agents act autonomously, outputs go through validation. This serves two purposes:
- Safety: Humans catch problematic outputs before they cause harm
- Signal: Validation/rejection provides additional training signal
Over time, as agent capability improves, validation gates can become less frequent for proven task types.
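A validation gate can be as simple as routing every output of a task type to human review until its measured approval rate clears a threshold; the thresholds and names below are assumptions, not Context's actual policy:

```python
from collections import defaultdict

class ValidationGate:
    """Route agent outputs to human review until a task type is proven.
    The approval history doubles as additional training signal."""

    def __init__(self, approval_threshold: float = 0.95, min_samples: int = 50):
        self.approval_threshold = approval_threshold
        self.min_samples = min_samples
        self.history: dict[str, list[bool]] = defaultdict(list)

    def needs_review(self, task_type: str) -> bool:
        """Require review until there is enough evidence of reliability."""
        outcomes = self.history[task_type]
        if len(outcomes) < self.min_samples:
            return True
        approval_rate = sum(outcomes) / len(outcomes)
        return approval_rate < self.approval_threshold

    def record(self, task_type: str, approved: bool) -> None:
        """Log whether a human approved the output; this is also reward data."""
        self.history[task_type].append(approved)

gate = ValidationGate()
print(gate.needs_review("weekly_metrics_summary"))  # True: no track record yet
for _ in range(60):
    gate.record("weekly_metrics_summary", approved=True)
print(gate.needs_review("weekly_metrics_summary"))  # False: proven task type
```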
When to Use RL (and When Not To)
Use RL when:
- You can define meaningful evaluation criteria (even if subjective)
- You have experts who can provide feedback
- You have sufficient task volume for learning
- The domain is stable enough that learned patterns persist
Don't use RL when:
- Success is completely undefined or purely political
- No experts are available for feedback
- Task volume is too low for statistical learning
- The domain changes faster than learning can happen
Consider hybrid approaches when:
- High-level decisions are rare but low-level execution is frequent
- Some aspects are measurable while others require judgment
- Exploration can be done in simulation before real deployment
The Realistic Timeline
RL in enterprise settings isn't a deployment decision. It's an investment decision.
Months 1-3: Build feedback capture infrastructure. Accumulate labeled examples.
Months 4-6: Train initial models on accumulated data. Establish baseline metrics.
Months 7-12: Begin cautious online refinement. Measure improvement.
Year 2+: Realize compounding benefits. Knowledge accumulates. Policies improve.
The companies that start building RL feedback infrastructure today will have capabilities that can't be replicated by waiting for better models. The models will get better, but they'll still need your feedback data to learn your domain.
This is part of a series on applied ML at Context. Learn more at context.inc