6 Jun 2026

Same models, different infrastructure

The Context team

Every team weighing AI for real work asks the same question first: is the model good enough yet? It is the wrong question. On a fixed set of production tasks, the same model can finish 23 percent of the work or 94 percent of it. Nothing about the model changes between those two numbers. Everything around it does.

Picture an analyst asked to draft an investment memo. A capable model writes clean, confident prose and still gets it wrong. It reports GAAP net income when the firm standardizes on adjusted EBITDA. It buries the risk the committee reads first. It misses that an exception the team made last quarter applies here too. The model is not short on capability. It is short on context: the procedures the team follows, the standard the team holds, and the corrections the team makes without ever writing them down. We will follow that analyst down the page.

We can put numbers on the gap.

The test

At Qualcomm we benchmarked 100 representative tasks drawn from 8 business areas. We held the underlying models fixed and changed only the infrastructure around them, in four steps. The first step is a raw agent: tools, a capable model, and nothing about how the company works. The last step is the same agent with the team's procedures written down and the team's definition of a good answer captured from real accepted work. Each step in between adds one kind of context.

Figure: Task completion on 100 production workflows at Qualcomm, across 8 business areas. The underlying models are held fixed in every configuration.

The agents that finished 23 percent of the work with a raw scaffold finished 94 percent once they had the procedures and the standard. Same models, top to bottom. The 71 points are infrastructure.

Reading the ladder

Each rung adds one kind of context and removes one kind of failure.

Raw scaffold: 23 percent

A strong agent with tools and no institutional context retrieves, reasons, and then fails the moment a task depends on something the company never wrote down. The failures are not reasoning errors. They are context errors. The agent does not know this account's history, this team's convention, or the exception that everyone senior already carries in their head. Our analyst gets a fluent draft that reports GAAP net income and orders the sections however the model saw fit. Nothing told it otherwise, so it guessed, and it guessed like a stranger.

Domain documents: 41 percent

Add retrieval over the company's documents, the familiar search-and-retrieve setup, and completion climbs. The ceiling stays low. The agent can now find the policy and still cannot run the procedure. It surfaces the right page and stops short of the work. Our analyst's agent can now quote the firm's accounting policy and pull up the memo template, and it still structures the argument like a generic memo, because a document that describes the standard is not the same as the standard applied. Documents capture what the company knows. They do not capture how the company acts on it.

Procedural runbooks: 68 percent

Write the procedure down as a runbook, in plain language, and the agent stops guessing at the steps. This is the largest single jump on the ladder, and it is the one most teams never reach, because the procedure usually lives only in the person who does the work. The analyst's agent now follows the house format in order: thesis, then risks, then the model, then the recommendation. It still flags a soft risk as a hard one, because the runbook says what to do and not where this committee draws its lines. Runbooks tell the agent how to act. They do not yet tell it what a good result looks like.

Rubrics and golden sets: 94 percent

Capture accepted work as golden sets and the team's standard as rubrics, and the agent finally has the thing it was missing: a definition of good, and a way to recover when it produces something short of it. The analyst's agent has now seen a shelf of accepted memos and the dimensions the desk scores them on: risks first, adjusted EBITDA, exceptions noted with their precedent. It writes the memo the desk would have written. Corrections become reference. A near miss this week becomes a pass next week. The agent stops repeating the mistakes a person already caught once.
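One way to picture a rubric is as a short list of named pass/fail checks run against a draft. The sketch below is illustrative only: the `Dimension` class, the memo fields, and the sample draft are hypothetical, and reading "risks first" as "risks ahead of the model section" is an assumption, not the desk's actual scoring rule.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Dimension:
    """One scored dimension of a rubric: a name and a pass/fail check."""
    name: str
    passes: Callable[[dict], bool]


def risks_before_model(memo: dict) -> bool:
    # Assumption: "risks first" means the risks section precedes the model section.
    sections = memo["sections"]
    return (
        "risks" in sections
        and "model" in sections
        and sections.index("risks") < sections.index("model")
    )


# Hypothetical rubric built from the three dimensions named in the text.
RUBRIC = [
    Dimension("risks first", risks_before_model),
    Dimension("adjusted EBITDA", lambda m: m["earnings_metric"] == "adjusted EBITDA"),
    Dimension(
        "exceptions cite precedent",
        lambda m: all(e.get("precedent") for e in m["exceptions"]),
    ),
]


def failed_dimensions(memo: dict, rubric=RUBRIC) -> list[str]:
    """Return the names of the rubric dimensions this memo does not meet."""
    return [d.name for d in rubric if not d.passes(memo)]


# Made-up draft: right section order, wrong metric, exception without precedent.
draft = {
    "sections": ["thesis", "risks", "model", "recommendation"],
    "earnings_metric": "GAAP net income",
    "exceptions": [{"note": "example exception", "precedent": None}],
}
print(failed_dimensions(draft))
```

The list of failed dimensions is what would give an agent something to recover against: it names what fell short, not only that something did.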

Where the two things come from

Two things moved the numbers: the procedure, and the standard. Neither is a document you sit down and author once. Both are captured from work the team already does.

The procedure becomes a runbook: the steps in plain language that anyone can read, run, and correct. Plain language is the point. The person who owns the procedure, not only an engineer, can write and fix it, which is what lets the runbook keep up with how the work actually changes. The standard becomes a rubric and a set of golden examples: the dimensions the team scores on, and the accepted work that shows what a passing answer looks like. The important part is where they come from. Every time someone accepts an output, edits it, or rejects it, that is signal about how the work is really done and what counts as good. Capture it at the moment of the work, and the runbook and the rubric sharpen on their own, off the corrections the team was already making.
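As a rough illustration of capturing that signal at the moment of the work, the sketch below records each accept, edit, or reject and files it as either a golden example or a correction. Every name here (`Verdict`, `Review`, `ContextStore`) is hypothetical, and the storage is an in-memory stand-in, not a description of any particular product.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Verdict(Enum):
    ACCEPTED = "accepted"
    EDITED = "edited"
    REJECTED = "rejected"


@dataclass
class Review:
    """One human decision about one agent output (hypothetical schema)."""
    task_id: str
    output: str
    verdict: Verdict
    corrected: Optional[str] = None  # the human's version, when edited
    reason: Optional[str] = None     # why it was rejected, when given


@dataclass
class ContextStore:
    """In-memory stand-in for wherever golden sets and corrections are kept."""
    golden: dict = field(default_factory=dict)       # task_id -> accepted text
    corrections: list = field(default_factory=list)  # what was wrong, and the fix

    def record(self, review: Review) -> None:
        if review.verdict is Verdict.ACCEPTED:
            # Accepted as-is: the output itself becomes a golden example.
            self.golden[review.task_id] = review.output
        elif review.verdict is Verdict.EDITED:
            if review.corrected is None:
                raise ValueError("an edited review needs the corrected text")
            # The human's version is the golden example; the diff is reference.
            self.golden[review.task_id] = review.corrected
            self.corrections.append(
                {"task_id": review.task_id, "was": review.output, "now": review.corrected}
            )
        else:
            # Rejected: nothing golden, but the miss is still worth keeping.
            self.corrections.append(
                {"task_id": review.task_id, "was": review.output, "reason": review.reason}
            )


store = ContextStore()
store.record(Review("memo-1", "draft A", Verdict.ACCEPTED))
store.record(Review("memo-2", "draft B", Verdict.EDITED, corrected="draft B, risks first"))
store.record(Review("memo-3", "draft C", Verdict.REJECTED, reason="wrong earnings metric"))
print(len(store.golden), len(store.corrections))
```

The design choice worth noting is that nobody authors the golden set separately: it accumulates as a side effect of reviews the team was already doing.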

That is the whole distance between the bottom of the ladder and the top. Not a larger model. A system that records how the work is done and what good looks like, from the people doing it.

Why a bigger model does not close this

It is tempting to read 23 percent and conclude the model is not ready. Hold the four rungs next to each other instead. The models are identical in all of them. The distance between the bottom rung and the top is not model capability. It is institutional context: how this team does the work, and what this team accepts as a good answer. A base model ships with neither, and it cannot, because that context is created while the work is being done and has never left the people doing it.
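Laid side by side, the four reported figures reduce to simple arithmetic. The snippet below only restates the completion rates given in this post and computes the gain at each rung; the variable and function names are ours, chosen for illustration.

```python
# Completion rates reported in this post, in ladder order (percent of 100 tasks).
LADDER = [
    ("Raw scaffold", 23),
    ("Domain documents", 41),
    ("Procedural runbooks", 68),
    ("Rubrics and golden sets", 94),
]


def rung_gains(ladder):
    """Return (rung name, points gained over the previous rung) for each step up."""
    return [
        (name, pct - prev_pct)
        for (_, prev_pct), (name, pct) in zip(ladder, ladder[1:])
    ]


gains = rung_gains(LADDER)
total_gain = LADDER[-1][1] - LADDER[0][1]
largest = max(gains, key=lambda g: g[1])

for name, points in gains:
    print(f"{name}: +{points} points")
print(f"Total: +{total_gain} points; largest step: {largest[0]}")
```

The steps come to 18, 27, and 26 points, which sum to the 71 points quoted above, with the runbook step the largest of the three.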

This is why the pattern across enterprise AI is so consistent. The pilot demos well, the model is obviously capable, and the deployment stalls anyway. The gap was never the model. It was everything the model could not see.

It also means the patient strategy, waiting for the next model, does not pay off here. A stronger base model lifts the whole ladder by some amount. It still starts without your procedures and your standard. The distance between 23 and 94 is a distance only your own work can close.

What this changes about buying

So the question to ask of an AI system is not how smart its model is. It is whether the system can capture the two things that move the numbers: the procedures your team follows, and the standard your team holds. A system that cannot record how the work is done and what good looks like is stuck near the bottom of the ladder, whichever model it routes to. The model choice is real, but it is the smaller lever, and it is the one you control least.

The ladder is also one you climb, not one you buy whole. You do not need all four rungs on day one. Each rung pays for itself on its own, so a team can move up it on the work already in front of them, instead of waiting for a single large rollout to land.

The gains compound past the completion number. At Qualcomm, the same shift that moved completion to 94 percent pulled a call-drop analysis from seven hours to five minutes, and cut the time to stand up a new team from four weeks to under one week. None of it required a base-model upgrade.

Models supply capability. Your procedures and your standard supply the other 71 points. The work your team has already done is the asset. The platform's job is to capture it.