Field Notes · Engineering

You're probably doing AI wrong.

Several months. Five model families. Eighteen real-world problems. Fifty attempts each. One stubborn conclusion: most enterprise AI failures aren't model failures — they're orchestration failures.

Roderick Bertoncini · May 19, 2026 · 9 min read

The Experiment

Eighteen problems. Fifty attempts each.

Most people approach AI from one of two extremes — AI is a magic wand, or AI can't do serious work. The reality is much more nuanced.

Over the last several months I tested agents from five model families against eighteen real-world programming problems inside a simulated multi-tenant medical-practice management application. Each agent attempted every task fifty times inside a resettable sandbox. Between runs, the environment was reset so the model started from identical conditions.
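The protocol above can be sketched as a small harness loop. This is a minimal illustration, not the harness behind these results: `Sandbox`, `run_agent`, and the task identifiers are hypothetical stand-ins, and the placeholder agent just flips a seeded coin so the loop is runnable.

```python
import random
from dataclasses import dataclass, field

ATTEMPTS_PER_TASK = 50  # from the article: fifty attempts per task


@dataclass
class Sandbox:
    """Hypothetical stand-in for a resettable environment."""
    files: dict = field(default_factory=dict)
    resets: int = 0

    def reset(self) -> None:
        # Restore identical starting conditions before every attempt.
        self.files = {"README.md": "baseline"}
        self.resets += 1


def run_agent(task_id: str, sandbox: Sandbox, rng: random.Random) -> bool:
    """Placeholder agent: mutates the sandbox and reports pass/fail."""
    sandbox.files[f"{task_id}.patch"] = "attempted change"
    # Arbitrary coin flip; a real harness would run the task's tests here.
    return rng.random() < 0.5


def evaluate(task_ids, attempts=ATTEMPTS_PER_TASK, seed=0):
    """Run every task `attempts` times, resetting the sandbox each time."""
    rng = random.Random(seed)
    sandbox = Sandbox()
    rates = {}
    for task_id in task_ids:
        passed = 0
        for _ in range(attempts):
            sandbox.reset()
            passed += run_agent(task_id, sandbox, rng)
        rates[task_id] = passed / attempts
    return rates, sandbox.resets


rates, resets = evaluate(["task-01", "task-02"])
```

The point of the reset inside the inner loop is that no attempt can benefit from, or be damaged by, files left behind by an earlier one.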

Every agent had the same baseline toolkit:

Shell access
File reading and writing
The ability to run tests and inspect git state

That is enough for a reasonably capable agent to navigate a real codebase. In the with-context condition, I added something more important.

The Headline Result

Same model. Better surroundings.

The agents gained access to a knowledge graph containing architectural documentation, historical issues, conventions, and blast-radius analysis, alongside MCP (Model Context Protocol) retrieval and LSP (Language Server Protocol) semantic navigation. In practice, this is the kind of contextual memory a senior engineer develops over years.

What surprised me wasn't that performance improved. It was how dramatically.

67% (baseline) → 100% (with context)

Task completion in selected scenarios. The reasoning model itself did not change. The environment around the model did.

On Orientation

Context matters more than most people think.

A large portion of software engineering is not raw intelligence. It is orientation.

Senior engineers are often effective not because they are "smarter" in the abstract, but because they know:

Where things live
What not to touch
Which patterns are dangerous
Which historical decisions still matter
Where hidden dependencies exist
Which documentation is trustworthy

Modern AI systems exhibit the same patterns. With the right contextual scaffolding, even smaller models improved substantially. The orchestration layer increasingly determines the quality ceiling.
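Some of that orientation can be computed rather than remembered. "Where hidden dependencies exist", for instance, reduces to a reverse-dependency lookup. The sketch below is a minimal illustration under stated assumptions: the module names and the `DEPENDS_ON` graph are invented, and the article does not describe how its own blast-radius analysis was built.

```python
from collections import deque

# Hypothetical "imports" edges: module -> modules it depends on.
DEPENDS_ON = {
    "billing.invoices": ["core.tenancy", "core.money"],
    "scheduling.calendar": ["core.tenancy"],
    "reports.revenue": ["billing.invoices"],
    "core.money": [],
    "core.tenancy": [],
}


def blast_radius(changed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Every module that directly or transitively depends on `changed`."""
    # Invert the graph so we can walk from a module to its dependents.
    dependents: dict[str, set[str]] = {}
    for module, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(module)

    # Breadth-first walk over dependents, visiting each module once.
    affected: set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


print(sorted(blast_radius("core.tenancy", DEPENDS_ON)))
# prints ['billing.invoices', 'reports.revenue', 'scheduling.calendar']
```

Handing an agent this list before it edits `core.tenancy` is a cheap substitute for knowing, from experience, what not to touch.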

Most enterprise AI failures are orchestration failures, not raw model failures.

— the conclusion that kept reappearing

The Failure Modes

Agents fail like humans fail.

Agents forget things. Not metaphorically. Operationally.

Observed attention failures

Given long briefs or large specifications, agents would sometimes:

Skip explicit requirements buried mid-spec
Overlook details in long briefs
Fail to complete the final third of multi-part tasks
Fix the wrong layer of a problem (the UI instead of the backend, or vice versa)
Follow outdated documentation without verification

In adversarial-doc tests, less-disciplined agents confidently modified working systems to match wrong documentation. The failure modes aligned closely with human cognitive overload.

The more information I dumped into a prompt, the worse reliability often became. That contradicts a common intuition: "if the model failed, just provide a better spec." In practice, longer briefs frequently increased failure rates.
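One way to act on this is to cap what goes into the prompt and rank candidate snippets by relevance to the task, instead of pasting in everything. The sketch below is illustrative only: the keyword-overlap score and the whitespace word count are crude assumptions standing in for a real retriever and tokenizer, and `select_context` is a hypothetical helper.

```python
def score(task: str, snippet: str) -> int:
    """Crude relevance: number of distinct task words found in the snippet."""
    task_words = set(task.lower().split())
    return len(task_words & set(snippet.lower().split()))


def select_context(task: str, snippets: list[str], budget_words: int) -> list[str]:
    """Greedily keep the most relevant snippets that fit within the budget."""
    ranked = sorted(snippets, key=lambda s: score(task, s), reverse=True)
    chosen, used = [], 0
    for snippet in ranked:
        if score(task, snippet) == 0:
            break  # everything after this point shares no words with the task
        cost = len(snippet.split())
        if used + cost <= budget_words:
            chosen.append(snippet)
            used += cost
    return chosen


snippets = [
    "appointment reminders are sent by the scheduler service",
    "invoice totals must include tenant tax rules",
    "the logo colour palette was refreshed last year",
]
picked = select_context("fix invoice tax totals", snippets, budget_words=10)
print(picked)
# prints ['invoice totals must include tenant tax rules']
```

The design choice worth noting is that irrelevant material is dropped even when the budget has room for it; spare capacity is not a reason to add noise.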

What Worked

Small changes, outsized shifts.

Several orchestration patterns consistently improved outcomes.

Managed context injection

Bring in only what's needed, when it's needed.

MCP knowledge tooling

Agents retrieve — they don't have to recall.

Constrained workflows

Structure reduces variance.

Blast-radius visibility

Know what touching X will affect.

Testable acceptance criteria

Define done. Make it verifiable.

Feedback loops

Let agents verify and improve their own work.

Retrieval before execution

Look first. Act second.

Structured prefetching

Surface relevant context proactively.

These map directly onto deployment surfaces available today in Claude Code, Cursor, and the OpenAI Agents SDK. Any enterprise engineering organization could implement many of them immediately.
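Two of these patterns, testable acceptance criteria and feedback loops, combine naturally: define "done" as checks a machine can run, then let the agent retry against them a bounded number of times. A minimal sketch, assuming a hypothetical `propose` callable in place of a real agent call; none of the names below come from the tools mentioned above.

```python
from typing import Callable

Check = Callable[[str], bool]


def failed_checks(output: str, criteria: dict[str, Check]) -> list[str]:
    """Return the names of acceptance criteria the output does not meet."""
    return [name for name, check in criteria.items() if not check(output)]


def run_with_feedback(propose: Callable[[list[str]], str],
                      criteria: dict[str, Check],
                      max_rounds: int = 3) -> tuple[str, list[str], int]:
    """Ask for a result, verify it, feed failures back; stop when done or out of rounds."""
    feedback: list[str] = []
    output = ""
    for round_no in range(1, max_rounds + 1):
        output = propose(feedback)
        feedback = failed_checks(output, criteria)
        if not feedback:
            return output, [], round_no
    return output, feedback, max_rounds


# Hypothetical acceptance criteria for a generated query.
criteria = {
    "has_tenant_filter": lambda sql: "tenant_id" in sql,
    "no_select_star": lambda sql: "SELECT *" not in sql,
}


def fake_agent(feedback: list[str]) -> str:
    # Stand-in for a model call: it only fixes what the feedback names.
    sql = "SELECT * FROM appointments"
    if "no_select_star" in feedback:
        sql = sql.replace("*", "id, starts_at")
    if "has_tenant_filter" in feedback:
        sql += " WHERE tenant_id = :tenant"
    return sql


result, remaining, rounds = run_with_feedback(fake_agent, criteria)
print(rounds, result)
# prints 2 SELECT id, starts_at FROM appointments WHERE tenant_id = :tenant
```

The `max_rounds` cap matters as much as the checks: it keeps execution bounded when an agent cannot satisfy a criterion, and the unmet criteria come back as an explicit list instead of a silent failure.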

The Frontier

The future of enterprise AI.

The future of enterprise AI will not be won purely through larger models.

Larger models help. But bigger models alone do not solve:

Context fragmentation
Institutional memory
Dependency visibility
Workflow discipline
Retrieval quality
Verification reliability
Attention allocation
Bounded execution

Those are systems engineering problems. The organizations that win in enterprise AI will be the ones that build the best control surfaces around models.

Coda

The reality is that these systems are already capable of meaningful work — but they are fragile in surprisingly human ways.

That fragility does not necessarily indicate a weak reasoning model. Often it indicates weak orchestration.

I'll likely publish additional findings on retrieval systems, tool orchestration, agent reliability, attention management, workflow constraints, and enterprise deployment patterns. A more formal paper may follow, because orchestration increasingly feels like the real frontier.

— RB

Roderick Bertoncini

Founder, Mente360

Building practice-management software, and occasionally writing about how it gets built.
