Field Notes · Engineering

You're probably doing AI wrong.

Several months. Five model families. Eighteen real-world problems. Fifty attempts each. One stubborn conclusion: most enterprise AI failures aren't model failures — they're orchestration failures.

Roderick Bertoncini · May 19, 2026 · 9 min read
5 model families · 18 programming tasks · 50 attempts each
I · The Experiment

Eighteen problems.
Fifty attempts each.

Most people approach AI from one of two extremes — AI is a magic wand, or AI can't do serious work. The reality is much more nuanced.

Over the last several months I tested agents against eighteen real-world programming problems inside a simulated multi-tenant medical-practice management application. Each agent attempted every task fifty times inside a resettable sandbox. Between runs, the environment was reset so the model started from identical conditions.
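
The reset-and-repeat protocol described above can be sketched roughly as follows. This is a minimal illustration, not the harness used in the experiment: `Sandbox`, `run_trials`, and the `attempt` callable are hypothetical names, and the "sandbox" here is just a dictionary of files.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sandbox:
    """Toy stand-in for a resettable environment (hypothetical)."""
    files: Dict[str, str]

    def reset(self, snapshot: Dict[str, str]) -> None:
        # Restore a fresh copy so no state leaks between attempts.
        self.files = dict(snapshot)


def run_trials(
    tasks: List[str],
    attempt: Callable[[str, Sandbox], bool],
    snapshot: Dict[str, str],
    attempts_per_task: int = 50,
) -> Dict[str, float]:
    """Return the completion rate per task across repeated, isolated attempts."""
    sandbox = Sandbox(files={})
    rates: Dict[str, float] = {}
    for task in tasks:
        passed = 0
        for _ in range(attempts_per_task):
            sandbox.reset(snapshot)       # identical starting conditions
            if attempt(task, sandbox):    # True means the task's checks passed
                passed += 1
        rates[task] = passed / attempts_per_task
    return rates
```

The point of resetting inside the inner loop is that an attempt which leaves the workspace dirty cannot help or hurt the next one.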

Every agent had the same baseline toolkit:

  • Shell access
  • File reading and writing
  • The ability to run tests and inspect git state

That is enough for a reasonably capable agent to navigate a real codebase. In a second, with-context condition, I added something more important.

The Headline Result

Same model.
Better surroundings.

In the with-context condition, the agents gained access to a knowledge graph containing architectural documentation, historical issues, conventions, and blast-radius analysis, along with MCP retrieval and LSP semantic navigation. In practice, this is the kind of contextual memory a senior engineer develops over years.

What surprised me wasn't that performance improved. It was how dramatically.

Baseline: 67% · With context: 100%

Task completion in selected scenarios. The reasoning model itself did not change. The environment around the model did.
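
When comparing completion rates from a fixed number of attempts, it can help to attach an interval to each rate. The sketch below computes a Wilson score interval. It is an illustration only: the function name is an assumption, and the article reports percentages rather than raw counts, so any counts passed in are hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, attempts: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%)."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    p = successes / attempts
    denom = 1 + z * z / attempts
    centre = (p + z * z / (2 * attempts)) / denom
    half = z * math.sqrt(p * (1 - p) / attempts + z * z / (4 * attempts * attempts)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)
```

For a hypothetical 50 passes out of 50 attempts, the lower bound comes out near 0.93, so even a perfect run leaves some uncertainty about the underlying rate.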

II · On Orientation

Context matters more than most people think.

A large portion of software engineering is not raw intelligence. It is orientation.

Senior engineers are often effective not because they are "smarter" in the abstract, but because they know:

  • Where things live
  • What not to touch
  • Which patterns are dangerous
  • Which historical decisions still matter
  • Where hidden dependencies exist
  • Which documentation is trustworthy

Modern AI systems show the same dependence on orientation. With the right contextual scaffolding, even smaller models improved substantially. The orchestration layer increasingly determines the quality ceiling.


Most enterprise AI failures are orchestration failures, not raw model failures.

— the conclusion that kept reappearing
III · The Failure Modes

Agents fail like humans fail.

Agents forget things. Not metaphorically. Operationally.

Observed attention failures

Given long briefs or large specifications, agents would sometimes:
  • Skip explicit requirements buried mid-spec
  • Overlook details in long briefs
  • Fail to complete the final third of multi-part tasks
  • Fix the wrong layer of a problem (the UI for a backend bug, or vice versa)
  • Follow outdated documentation without verification

In adversarial-doc tests, where the documentation supplied to the agent was wrong, less-disciplined agents confidently modified working systems to match it. These failure modes closely resembled human cognitive overload.
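
One way to guard against trusting wrong documentation is to check a document's claims against the code before acting on them. The sketch below covers only the narrowest case: it uses Python's standard `ast` module to confirm that function names mentioned in a doc are actually defined in a source file. The helper names are hypothetical, and a real check would need to cover far more than function names.

```python
import ast
from typing import Iterable, List


def defined_functions(source: str) -> set:
    """Collect every function name defined anywhere in a Python source string."""
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def stale_doc_claims(documented: Iterable[str], source: str) -> List[str]:
    """Return documented function names that the code does not actually define."""
    present = defined_functions(source)
    return [name for name in documented if name not in present]
```

A non-empty result is a signal to stop and investigate rather than to edit the code until it matches the document.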

The more information I dumped into a prompt, the worse reliability often became. That contradicts a common intuition: "if the model failed, just provide a better spec." In practice, longer briefs frequently increased failure rates.

IV · What Worked

Small changes, outsized shifts.

Several orchestration patterns consistently improved outcomes.

i · Managed context injection

Bring in only what's needed, when it's needed.
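
As a rough illustration of "only what's needed", the sketch below ranks a set of notes by word overlap with the task and packs the best ones into a fixed word budget. Everything here is an assumption made for the example: the `select_context` name, the overlap scoring, and the use of word counts as a stand-in for tokens. A production system would more likely use embeddings or a graph query.

```python
import re
from typing import Dict, List


def _words(text: str) -> set:
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def select_context(task: str, notes: Dict[str, str], budget_words: int) -> List[str]:
    """Pick the notes that share the most words with the task, within a word budget."""
    task_words = _words(task)
    scored = []
    for name, body in notes.items():
        overlap = len(task_words & _words(body))
        if overlap > 0:                      # drop notes with nothing in common
            scored.append((overlap, name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best overlap first, ties by name

    chosen, used = [], 0
    for _, name in scored:
        cost = len(notes[name].split())
        if used + cost <= budget_words:      # skip anything that would exceed the budget
            chosen.append(name)
            used += cost
    return chosen
```

The budget is the important part: irrelevant notes are dropped entirely, and relevant ones are still left out once the prompt is full.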

ii · MCP knowledge tooling

Agents retrieve — they don't have to recall.

iii · Constrained workflows

Structure reduces variance.
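
One simple way to constrain a workflow is a small state machine that rejects out-of-order steps. The phases and transitions below are hypothetical, chosen only to illustrate the idea. They happen to enforce that retrieval comes before any edit and that every edit is followed by a test run.

```python
from typing import Dict, List, Set

# Hypothetical phases; each phase lists the phases allowed to follow it.
ALLOWED: Dict[str, Set[str]] = {
    "start": {"retrieve"},
    "retrieve": {"retrieve", "plan"},
    "plan": {"edit"},
    "edit": {"test"},
    "test": {"edit", "done"},
    "done": set(),
}


class Workflow:
    """Reject any step that is not a permitted successor of the current phase."""

    def __init__(self) -> None:
        self.phase = "start"
        self.history: List[str] = []

    def advance(self, next_phase: str) -> None:
        if next_phase not in ALLOWED[self.phase]:
            raise ValueError(f"cannot go from {self.phase!r} to {next_phase!r}")
        self.history.append(next_phase)
        self.phase = next_phase
```

An agent wrapper would call `advance` before executing each tool call and refuse the call if it raises.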

iv · Blast-radius visibility

Know what touching X will affect.
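
A minimal sketch of blast-radius analysis, assuming the dependency graph is available as a mapping from each module to the modules it depends on. The function name and graph shape are illustrative, not taken from the tooling used in the experiment. The idea is to invert the edges and walk outward from the changed module.

```python
from collections import deque
from typing import Dict, List, Set


def blast_radius(depends_on: Dict[str, List[str]], changed: str) -> Set[str]:
    """Return every module that directly or transitively depends on `changed`."""
    # Invert the edges: for each module, which modules depend on it?
    dependents: Dict[str, Set[str]] = {}
    for module, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(module)

    affected: Set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    affected.discard(changed)  # a cycle should not list the changed module itself
    return affected
```

Handing the resulting set to the agent before it edits gives it a concrete list of what to re-test.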

v · Testable acceptance criteria

Define done. Make it verifiable.
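
"Verifiable" can be as simple as expressing each criterion as a named check that returns true or false. The sketch below assumes the workspace is a mapping of file names to contents. The `Criterion` type and both helper names are hypothetical, and in practice the checks would usually be test commands rather than string predicates.

```python
from typing import Callable, Dict, List, Tuple

# A criterion is a human-readable name plus a check over the workspace files.
Criterion = Tuple[str, Callable[[Dict[str, str]], bool]]


def evaluate(criteria: List[Criterion], files: Dict[str, str]) -> Dict[str, bool]:
    """Run every named check against the workspace and report pass or fail."""
    return {name: bool(check(files)) for name, check in criteria}


def is_done(results: Dict[str, bool]) -> bool:
    """Done means every criterion passed, and at least one was defined."""
    return bool(results) and all(results.values())
```

Treating an empty criteria list as "not done" is a deliberate choice here: a task with no checks has not defined done at all.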

vi · Feedback loops

Let agents verify and improve their own work.
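
A bounded verify-and-retry loop might look like the sketch below. The `propose` and `verify` callables are placeholders for an agent call and a check runner, and the names and the three-round default are assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def refine(
    propose: Callable[[List[str]], str],
    verify: Callable[[str], List[str]],
    max_rounds: int = 3,
) -> Tuple[str, List[str], int]:
    """Propose, verify, and retry with the failures fed back, up to max_rounds."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback: List[str] = []
    candidate = ""
    for round_number in range(1, max_rounds + 1):
        candidate = propose(feedback)   # the agent sees the previous failures
        feedback = verify(candidate)    # an empty list means every check passed
        if not feedback:
            return candidate, [], round_number
    # Out of rounds: return the last candidate with its unresolved failures.
    return candidate, feedback, max_rounds
```

The cap matters as much as the loop: without it, an agent that cannot satisfy a check will keep retrying indefinitely.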

vii · Retrieval before execution

Look first. Act second.

viii · Structured prefetching

Surface relevant context proactively.

These map directly onto deployment surfaces available today in Claude Code, Cursor, and the OpenAI Agents SDK. Any enterprise engineering organization could implement many of them immediately.

V · The Frontier

The future of enterprise AI.

The future of enterprise AI will not be won purely through larger models.

Larger models help. But bigger models alone do not solve:

  • Context fragmentation
  • Institutional memory
  • Dependency visibility
  • Workflow discipline
  • Retrieval quality
  • Verification reliability
  • Attention allocation
  • Bounded execution

Those are systems engineering problems. The organizations that win in enterprise AI will be the ones that build the best control surfaces around models.

VI · Coda

The reality is that these systems are already capable of meaningful work — but they are fragile in surprisingly human ways.

That fragility does not necessarily indicate a weak reasoning model. Often it indicates weak orchestration.

I'll likely publish additional findings around retrieval systems, tool orchestration, agent reliability, attention management, workflow constraints, and enterprise deployment patterns. A more formal paper may follow, because orchestration increasingly feels like the real frontier.

— RB
Roderick Bertoncini · May 2026

Roderick Bertoncini

Founder, Mente360

Building practice-management software, and occasionally writing about how it gets built.
