The models are good. They pass professional exams, write production code, and reason across expert domains. The demos land, the budgets get approved, and a year later most enterprise AI pilots have produced little. The technology kept advancing and the deployments did not. That gap is the defining problem of enterprise AI, and the usual suspect, model quality, is not the cause.
To see why, follow one piece of work all the way down: an analyst's agent drafting an investment memo for a live deal.
It is not the model
In the demo, the agent is excellent. It pulls the filings, summarizes the business, lays out the financial model. In production, on a real deal, it misses. It reports GAAP net income when this desk standardizes on adjusted EBITDA. It opens with the thesis when the investment committee reads the risks first. It uses a comparable set the partner quietly rejected last week. None of these are reasoning errors. The agent reasoned fine. It did not know how this desk works.
Watch any vendor bake-off and the pattern repeats: the model is obviously capable, and the deployment stalls anyway. The thing it is missing is not intelligence. It is everything about how the work is actually done here, and that turns out to be a different kind of information than the one every tool is built to retrieve.
What documents hold, and what they do not
Split the knowledge a task needs into two columns. The first is the what: the facts, the policies, the records, the relationships between them. This is what a company writes down, and it is what search and retrieval are built to surface. The second is the how: the procedure as it is really run, the exceptions and when they apply, who approves what, and the reasoning behind a call. The first column is necessary. It is also where almost every enterprise AI tool stops.
The what:
- Facts, records, policies
- Relationships between them
- What the company knows
The how:
- The procedure as actually run
- The exceptions, and when they apply
- Who approves what, and the reasoning
- The corrections a reviewer makes
For the memo, the what is the filing, the comps database, and the template. The agent can retrieve all of it and still write the wrong memo, because the things that made it wrong (adjusted EBITDA, risks first, the rejected comps) live in the how column, and that column is not in any document. A system that captures only the what builds a better search box. Doing the work needs both.
The how has three levels
The how is not one thing, and an agent that misses any level of it produces work that is almost right. The how sits at three levels, and the memo runs through all three.
At the team level, it is how this desk interprets a shared standard: risks first, adjusted EBITDA, the house format. A new analyst learns it by watching, not by reading a policy that says the same thing for every desk. At the individual level, it is the texture of one person's work: this partner wants the downside case before the base case and is cautious on terminal growth. At the task level, it is the context that exists only in the moment, the side conversation last Tuesday that changed how this specific deal should be framed, true today and irrelevant by next quarter.
Documents capture the team level poorly, the individual level rarely, and the task level not at all. The task level is the richest and the most fragile: it is created and consumed in the same breath, and unless something records it as the work happens, it is gone the instant the memo ships.
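One way to picture the three levels is as records an agent consults before drafting. The sketch below is a minimal illustration, not a description of any particular product: `HowNote`, `Level`, `applicable`, and the identifiers `desk-a`, `partner-x`, and `deal-123` are all hypothetical names, and the expiry date is an invented example of task-level context going stale.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import List, Optional


class Level(Enum):
    TEAM = "team"
    INDIVIDUAL = "individual"
    TASK = "task"


@dataclass(frozen=True)
class HowNote:
    level: Level
    scope: str                      # desk, person, or deal the note applies to (hypothetical ids)
    rule: str                       # the correction itself
    reason: str                     # the why behind the correction
    expires: Optional[date] = None  # task-level notes can go stale


def applicable(notes: List[HowNote], desk: str, reviewer: str,
               deal: str, today: date) -> List[HowNote]:
    """Return the notes in force for one memo, broadest level first."""
    wanted = {Level.TEAM: desk, Level.INDIVIDUAL: reviewer, Level.TASK: deal}
    order = [Level.TEAM, Level.INDIVIDUAL, Level.TASK]
    live = [
        n for n in notes
        if n.scope == wanted[n.level]
        and (n.expires is None or n.expires >= today)
    ]
    return sorted(live, key=lambda n: order.index(n.level))


# Illustrative notes drawn from the memo example; dates and ids are made up.
NOTES = [
    HowNote(Level.TASK, "deal-123", "Do not use the rejected comparable set",
            "The partner rejected it last week", expires=date(2025, 3, 31)),
    HowNote(Level.TEAM, "desk-a", "Lead with risks and report adjusted EBITDA",
            "The investment committee reads the risks first"),
    HowNote(Level.INDIVIDUAL, "partner-x", "Show the downside case before the base case",
            "This partner is cautious on terminal growth"),
    HowNote(Level.TEAM, "desk-b", "Lead with the thesis",
            "A different desk reads the same standard differently"),
]

for note in applicable(NOTES, "desk-a", "partner-x", "deal-123", date(2025, 1, 15)):
    print(f"[{note.level.value}] {note.rule} (why: {note.reason})")
```

Each note keeps the reason next to the rule, since the why is the part a recording does not show. The optional expiry is one possible way to model task-level context that is true today and irrelevant by next quarter.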
Why you cannot retrieve it later
This is why a better retrieval system does not close the gap. Retrieval returns what was deposited somewhere, and the how was never deposited. It is the same reason you cannot onboard a new hire by handing them a binder, or learn a job by watching a recording of someone's screen. A recording shows you what they clicked, not why. The valuable part is the why, and the why is spoken in the moment a reviewer says "not like that, here is the rule," and then it is gone.
A capable base model is, in this light, a brilliant new graduate: sharp, fast, and useless at your company on day one, for the same reason any brilliant new graduate is. The fix for the graduate is not a smarter graduate. It is onboarding, done by working alongside the people who know, asking why, and absorbing the corrections. The fix for the agent is the same, with one difference: a person's onboarding walks out the door when they leave, and an agent's can be captured and kept, if you are present at the work to capture it.
It is not a budget problem
It is tempting to think the gap is about money or will. It is not. After two years of pilots, the budgets are allocated and the mandate from the top is real. What is missing sits one level below the model and one level above the data: infrastructure that captures how the work is actually done, the moment it is done, and turns a capable model into work the company will accept.
Without that layer, every pilot repeats the same arc. The model impresses, the work moves from the demo into the real systems and the real exceptions, the how that would have carried it was never captured, and the project quietly stalls. A stronger base model next year starts in exactly the same place, knowing everything except how you work.
The question that moves
So the question that has stalled most AI programs, which model and which tool, is the wrong one. The model crossed the threshold a while ago. The question that moves is narrower and harder: what records how our work actually happens, so an agent can do it the way we would? Capture the how at the moment of work and an agent can be onboarded like a person and improve like an institution. Capture only the what, and you have built a search box and called it a deployment.