
From 25% to 94%: What Changed Wasn't the Model

We deployed the same foundation models that were failing at 25% accuracy. Four months later, accuracy was 94% and cost per case dropped 59%. Here is what we changed and why it worked.

Context Team

Engineering · Mar 3, 2026



In early 2025, a leading semiconductor company was running AI across engineering workflows: failure analysis, design verification, yield optimization, test coverage. The foundation models they were using (the same models available to everyone) were producing outputs that their engineers accepted less than 25% of the time.

By the end of Q3, the same workflows, running on the same class of foundation models, hit 94% accuracy on the same engineering tasks. Cost per case dropped from $2.48 to $0.70. The system was processing 1,600 to 2,000 cases per day across 73 teams in 12 countries.

The models didn't get smarter. The infrastructure around them did.

This post covers what we built, what broke, and what we learned.


The Starting Point

The deployment started the way most enterprise AI deployments start: take a capable model, connect it to internal data, and point it at a real workflow.

The model could reason about semiconductor engineering. It understood the domain vocabulary. It could produce coherent failure analyses. On paper, the capability was there.

In practice, the engineers rejected three out of four outputs. The reasons were consistent:

Wrong signal prioritization. The model would analyze a failure report and highlight the wrong root cause indicators. Not because it didn't understand failure analysis in general, but because this organization's engineering teams had specific heuristics, developed over years, about which signals mattered for which failure modes. A yield drop correlated with a specific test pattern might be critical for one product line and noise for another. The model couldn't distinguish without the institutional context.

Incorrect process adherence. The organization had documented engineering processes, but the real processes (the ones engineers actually followed) had diverged from documentation over years of iteration. The model followed the documented process. The engineers followed the real one. The outputs didn't match what engineers expected.

Missing cross-system context. A failure analysis for a specific design block required pulling data from the test management system, the defect tracker, the design revision history, and sometimes the customer requirements database. The initial deployment could access some of these, but not with the relational awareness needed to connect a test failure to a design change to a customer specification.

None of these problems were model capability problems. A smarter model with the same context would produce the same errors.


What We Changed

1. Retrieval Over Institutional Knowledge

The first intervention was connecting the system to the organization's actual knowledge, not just its documented knowledge.

This meant ingesting not just the official process documents, but:

  • Historical case resolutions (the past 18 months of engineering decisions)
  • Team-specific runbooks that existed as informal wikis
  • The annotation patterns in the defect tracker (how senior engineers categorized and prioritized issues)
  • Cross-references between test results, design revisions, and customer specifications

The retrieval system was built with permission boundaries that matched the organization's access model. An engineer on the RF team saw different context than an engineer on the digital logic team. This wasn't just a security requirement; it was a relevance filter. Context that's irrelevant to your role is noise.
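The permission-as-relevance idea can be shown in a minimal sketch. This is an illustrative simplification, not the deployment's actual retrieval stack; the `Document` fields, team names, and keyword-overlap matching are all assumptions standing in for a real access model and ranker.

```python
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    team: str     # owning team, e.g. "rf" or "digital_logic" (illustrative names)
    shared: bool  # visible across team boundaries?

def retrieve(query_terms: set[str], docs: list[Document], team: str) -> list[Document]:
    """Return documents matching the query, scoped to the caller's team.

    The permission check doubles as a relevance filter: context owned by
    another team is excluded before matching, so it can never surface as noise.
    """
    visible = [d for d in docs if d.shared or d.team == team]
    return [d for d in visible if query_terms & set(d.text.lower().split())]

corpus = [
    Document("RF front-end yield drop on test pattern TP-7", team="rf", shared=False),
    Document("Digital logic timing failure on revision B2", team="digital_logic", shared=False),
    Document("Shared process document: failure triage steps", team="process", shared=True),
]

hits = retrieve({"yield", "drop"}, corpus, team="rf")
# An RF engineer sees the RF yield note; the digital-logic doc is filtered out
# before matching ever happens.
```

In a production system the visibility check would delegate to the organization's access-control service rather than a boolean flag, but the ordering is the point: filter by permission first, rank second.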

Result after retrieval integration: Accuracy improved from roughly 25% to 58%. A meaningful jump, but the system was still wrong 42% of the time.

2. Expert-Defined Evaluation Rubrics

The second intervention was defining what "good" actually meant for each workflow.

We worked with senior engineers to build structured rubrics for each task type. Not generic quality measures (grammar, completeness), but domain-specific criteria:

  • For failure analysis: Did the output correctly identify the failure mode category? Did it prioritize the root cause indicators in the right order? Did it reference the correct design revision?
  • For yield optimization: Did the output use the right statistical methodology for the sample size? Did it account for the known process variation in this fabrication line?
  • For test coverage: Did the output map to the correct requirement specification? Did it identify gaps against the current coverage matrix?

These rubrics became the evaluation function for the system. Every output could be scored automatically against the rubric, and when senior engineers overrode the score, that override was captured as a training signal.
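A rubric used this way reduces to a weighted checklist that yields a scalar score. The sketch below assumes pass/fail dimensions and made-up weights; the deployment's actual criteria and scoring granularity are not specified in this post.

```python
# Illustrative failure-analysis rubric: dimension names and weights are
# assumptions, not the deployment's actual criteria.
RUBRIC = {
    "failure_mode_correct":   0.4,  # right failure mode category?
    "root_causes_ordered":    0.4,  # root cause indicators in priority order?
    "design_revision_cited":  0.2,  # references the correct design revision?
}

def score(checks: dict[str, bool]) -> float:
    """Weighted rubric score in [0, 1]; each dimension is a pass/fail check."""
    return sum(w for dim, w in RUBRIC.items() if checks.get(dim, False))

s = score({"failure_mode_correct": True,
           "root_causes_ordered": True,
           "design_revision_cited": False})
# s == 0.8: the output got the failure mode and ordering right but cited
# the wrong design revision.
```

When a senior engineer overrides the automatic score, both the automatic and overridden values can be logged against the same dimension keys, which is what makes the override usable as a training signal later.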

Result after rubric integration: Accuracy improved from 58% to 76%. The system now understood what "good" meant for each specific workflow.

3. Feedback Capture and Integration

The third intervention was closing the loop: capturing every expert correction and making it available to the system for future cases.

When a senior engineer rewrote a failure analysis, the system captured:

  • The original output
  • The corrected version
  • The specific rubric dimensions where the correction occurred
  • The engineer's annotation (if provided)

These corrections were indexed and made available via retrieval for future cases. When the system encountered a similar failure pattern, it could retrieve not just the technical documentation, but the corrected examples from senior engineers who had seen this pattern before.

This created a compounding effect. Each correction made the system better at similar cases. After processing roughly 500 expert-reviewed cases, the corrections covered the most common patterns across the organization's engineering workflows.

Result after feedback integration:

Month                   Accuracy   Cost/case   Cases/day
Month 0 (baseline)      ~25%       $2.48       ~400
Month 1 (+retrieval)    58%        $2.31       ~600
Month 2 (+rubrics)      76%        $1.84       ~900
Month 3 (+feedback)     88%        $1.12       ~1,400
Month 4                 94%        $0.70       ~1,800

The cost reduction came from two sources: higher accuracy meant less human rework time, and the evaluation rubrics enabled confidence-based routing. High-confidence outputs (rubric score above threshold) could be routed to smaller, cheaper models without quality degradation. Low-confidence outputs stayed on frontier models with human review.

By month 4, approximately 60% of cases were routed to smaller models, accounting for most of the cost reduction.
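Confidence-based routing reduces to a threshold on the rubric score. The threshold value and tier names below are illustrative assumptions; the post does not state the actual cutoff.

```python
# Illustrative routing sketch: the 0.9 threshold and tier names are assumptions.
THRESHOLD = 0.9

def route(rubric_score: float) -> str:
    """High-confidence outputs go to a smaller, cheaper model; everything
    else stays on a frontier model with human review."""
    if rubric_score >= THRESHOLD:
        return "small_model"
    return "frontier_model_with_review"

routes = [route(s) for s in (0.95, 0.72, 0.91, 0.88)]
```

The economics follow directly: as the rubric gets better calibrated, more cases clear the threshold, and average cost per case falls without a quality drop on the routed cases.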


What Broke Along the Way

The Cold Start Problem

When we first deployed the feedback system, there were no corrections to retrieve. The system had to build its correction corpus from zero. During the first two weeks, accuracy improvement was minimal because the feedback corpus was too sparse to be useful.

We addressed this by having senior engineers review a batch of historical cases (roughly 100) to seed the corpus. This gave the system enough initial signal to start producing meaningfully better outputs, which in turn generated more corrections, which improved the system further.

Rubric Drift

The rubrics we defined in month 1 needed revision by month 3. As the system improved, the failure modes changed. Early on, the most common error was wrong root cause prioritization. By month 3, the system rarely made that mistake, and the dominant errors were more subtle: incorrect confidence levels in the analysis, missing edge cases in the test coverage assessment, or formatting that didn't match the team's latest template.

We built a rubric review cadence (monthly) where senior engineers could adjust the evaluation criteria. This kept the evaluation function aligned with the team's evolving standards.

Cross-Team Variation

73 teams is a lot of teams. Engineering standards varied significantly between teams, and between the different product lines they supported. A rubric that worked for the analog design team didn't apply to the digital verification team.

We ended up building team-level rubric variants: a shared base rubric with team-specific overrides. This added complexity but was necessary for accuracy at the team level.
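The base-plus-overrides pattern is a straightforward dictionary merge. A sketch under the same illustrative dimension names as before; the actual teams' criteria and weights are assumptions.

```python
# Shared base rubric with per-team overrides merged on top (overrides win).
# All dimension names and weights here are illustrative.
BASE_RUBRIC = {
    "failure_mode_correct":  0.4,
    "root_causes_ordered":   0.4,
    "design_revision_cited": 0.2,
}

TEAM_OVERRIDES = {
    "analog_design":        {"process_variation_modeled": 0.2,
                             "root_causes_ordered": 0.2},
    "digital_verification": {"coverage_matrix_mapped": 0.3,
                             "failure_mode_correct": 0.1},
}

def rubric_for(team: str) -> dict[str, float]:
    """Merge the shared base with a team's overrides; unknown teams
    fall back to the base rubric unchanged."""
    return {**BASE_RUBRIC, **TEAM_OVERRIDES.get(team, {})}
```

The monthly rubric review then only has to touch the base or one team's override block, rather than 73 independent rubrics.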


What Generalized

Three observations from this deployment that apply beyond semiconductors:

The model is not the bottleneck. The same foundation models that produced 25% accuracy at the start produced 94% accuracy at the end. The model didn't change. The context, evaluation, and feedback infrastructure changed. Every percentage point of improvement came from the system around the model, not the model itself.

Expert corrections are the highest-value training signal. A single correction from a senior engineer who understands the institutional context contains more learning signal for that organization than thousands of generic training examples. The feedback corpus we built (roughly 2,000 corrections over 4 months) captured more organizational knowledge than any amount of pretraining data could provide.

Evaluation rubrics are the missing primitive. Without structured rubrics, the system couldn't distinguish between "technically correct but organizationally wrong" and "actually good." The rubrics were the evaluation function that made the improvement loop possible. Building them required significant effort (senior engineers defining what "good" means for each workflow), but that effort was the highest-leverage activity in the entire deployment.


Current State

The system currently processes 1,600 to 2,000 cases per day. 99.3% of workflows were authored by the organization's own engineers (not by us), using natural language runbooks in Context Workspace. The total contract value over four years: $10.4M.

The accuracy number (94%) continues to improve incrementally as the feedback corpus grows. The cost per case ($0.70) continues to decrease as more cases qualify for routing to smaller models.

The most interesting outcome is one we didn't anticipate: the rubric definitions and correction corpus that accumulated during the deployment have become institutional assets. They encode engineering judgment that previously existed only in the heads of senior engineers. New team members can now access this codified expertise through the system, which has started to change how the organization onboards engineers.

The model is still the same commodity input it was on day one. Everything that matters is the infrastructure around it.
