AI engineering for production: the discipline we keep avoiding
The new AI app ships to applause. Then the questions start rolling in: why has the API bill tripled, and why is compliance on the phone? Turns out shipping was the easy part.
What started out as an impressive AI demo is finally in production. After a few weeks, the questions start rolling in: The CFO wants to know why the AI API bill is so high. The Head of Compliance is wondering exactly how the decision was made that is now the subject of an EU AI Act complaint and investigation. The only thing a panicked engineering team has to say is “We’ll get back to you on that”.
What started out as a victory lap after the successful demos is now turning into a nightmare of cost and compliance questions, plus plain chaotic, unpredictable fires that need to be put out daily.
It is the support agent that worked so impressively in demos, but is now doing things subtly wrong on a daily basis.
It is the legal document generator that silently broke when the vendor updated their model, and that only got flagged when a customer had questions about clauses that shouldn’t have been there.
Or the behavioral change that was introduced when someone updated a prompt via a dashboard that had no change history, no review, and no approval.
This is the default state of AI engineering in 2026, not a string of one-off failures.
Chaos is the rule, not the exception
It’s easy to think “we’re behind, everyone else has this figured out”, but the reality is that AI engineering for production systems is a discipline that is not yet fully formed. The tools and models work, and the products can be built, but the operational practices of running AI systems in production with the engineering rigor we (hopefully) apply to other systems have not caught up.
And this is a problem: when a non-deterministic, unsupervised, semi-intelligent agent can make autonomous decisions with real-world consequences, both the blast radius and the space of possible outcomes grow exponentially if not carefully controlled.
The impressive MVP that was created in record time is only the beginning. An oft-cited number is that maintenance accounts for 60-80% of software’s lifetime cost. When you add AI as a core component, I’d argue that the costs shift even further toward running and maintaining the system, even excluding the cost of inference.
Why? Because we need to learn to control the chaos introduced by systems that no longer behave deterministically, that no longer have a handful of paths they can go down, but rather have a near-infinite number of paths they can take, many of which we cannot predict.
Software engineering is shifting from a discipline of producing systems to one of ensuring they stay strictly on purpose once deployed, while being understandable, financially viable, and legally defensible.
The post-MVP engineering gap for AI systems
Cost
The demo cost a few cents per run, and nobody was watching because no one needed to. Once it goes to production and real traffic arrives, token economics that looked trivial at ten requests a day stop being trivial at ten thousand. The most spectacular version is a company waking up to a six-figure inference bill. The less spectacular but more insidious one is a team quietly adding a few thousand a month to the cloud spend across a handful of features, with no one noticing until enough features and teams are doing the same. It’s a slow creep in the cost base over quarters.
It creeps up, because the things that keep costs under control are not on by default. Caching, routing cheap requests to cheap models, attributing spend per feature so you actually know what is costing how much: all of it has to be built, and most teams don’t build it until someone with a budget asks why the costs tripled. The anti-pattern that fills the gap in the meantime is treating the provider’s billing dashboard as a cost strategy. Glance at it once in a while, note what went up, decide to look into it later, rinse and repeat.
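As a rough illustration of two of those controls, here is a minimal Python sketch of per-feature spend attribution and a crude cheap-model router. Everything in it is an assumption made for illustration: the model names, the prices, the `Usage` record, and the length-based routing rule are all hypothetical, and caching is left out entirely.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per 1,000 tokens; real prices differ by provider.
PRICE_PER_1K_TOKENS = {"small-model": 0.0005, "large-model": 0.01}


@dataclass(frozen=True)
class Usage:
    """One model call, tagged with the product feature that triggered it."""
    feature: str
    model: str
    input_tokens: int
    output_tokens: int


def choose_model(prompt: str, max_cheap_chars: int = 500) -> str:
    """Send short prompts to the cheap model. A stand-in for real routing logic."""
    return "small-model" if len(prompt) <= max_cheap_chars else "large-model"


def cost_usd(usage: Usage) -> float:
    """Cost of a single call under the assumed flat per-token price."""
    tokens = usage.input_tokens + usage.output_tokens
    return tokens / 1000 * PRICE_PER_1K_TOKENS[usage.model]


def spend_by_feature(records: list[Usage]) -> dict[str, float]:
    """Sum the cost of all calls per feature, so spend has an owner."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.feature] += cost_usd(record)
    return dict(totals)
```

The point of the `feature` tag is that it has to be attached at call time. A billing dashboard can only report totals per API key or per model, so it cannot reconstruct which feature caused the spend after the fact.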
Evals
In a demo, evaluation is a feeling. You run the thing a few times, it looks right, everyone nods. That works fine when the audience is three people in a room who already want it to succeed. It stops working the moment the model behind it gets upgraded, the prompt drifts through a dozen small edits, or the input distribution shifts, because real users do things no test case imagined. The failures start hiding in the long tail of traffic nobody is looking at, and only the tip of the iceberg is ever discovered, through the small number of users who complain instead of switching off.
What a production system needs is an evaluation suite that catches those changes before a customer does. What most teams have instead is a handful of happy-path examples: vibes dressed up as test cases, green because they were written to be green. The actually hard version of this is regression evals built from real production traces, so that every model upgrade and prompt change gets scored against what actually happened in the wild. This is something neither open source nor commercial products have yet solved satisfactorily, which is exactly why most teams don’t have an evaluation suite like this. It is also why most teams realize too late that whatever evals they do have were theater all along.
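To make the idea of a trace-based regression eval concrete, here is a minimal Python sketch. It assumes you already store production traces with an output that someone reviewed and accepted. The `Trace` and `EvalReport` types, the `run_regression` function, and the exact-match scorer are all hypothetical; a real suite would need a far more forgiving scorer than string equality.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Trace:
    """A recorded production interaction (hypothetical schema)."""
    trace_id: str
    user_input: str
    accepted_output: str  # output that was reviewed and accepted


@dataclass(frozen=True)
class EvalReport:
    pass_rate: float
    failed_trace_ids: tuple[str, ...]


def exact_match(expected: str, actual: str) -> float:
    """Toy scorer: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0


def run_regression(
    traces: Sequence[Trace],
    candidate: Callable[[str], str],
    scorer: Callable[[str, str], float] = exact_match,
    min_score: float = 1.0,
) -> EvalReport:
    """Replay recorded inputs against a candidate model/prompt combination."""
    if not traces:
        raise ValueError("regression suite needs at least one trace")
    failed = tuple(
        t.trace_id
        for t in traces
        if scorer(t.accepted_output, candidate(t.user_input)) < min_score
    )
    return EvalReport(
        pass_rate=1 - len(failed) / len(traces),
        failed_trace_ids=failed,
    )
```

Here `candidate` would wrap the new model or prompt version under test. A deployment gate could then refuse to ship whenever `pass_rate` drops below an agreed threshold, and `failed_trace_ids` tells you which real interactions to look at first.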
Observability
You can debug a demo by reading the prompt, testing an input and using your mental model: here’s the instruction, here’s what came back. Production does not work the same way. By the time something is going subtly wrong, the question is no longer “what does the prompt say”, it is “which prompt version, which model, what tools got called and what was in the context window, for which user”. None of it is recoverable unless you were capturing it in the first place.
The observability stack you already run was not built to answer those questions. Datadog or Splunk will happily tell you the request took 800ms and returned a 200 (if it was one of the requests that were sampled). They won’t tell you the model changed its answer because one of the retrieved documents was different this time. Getting traces you can replay, semantic diffs between versions, and a comparison across model upgrades is real engineering work. Until someone does it, every incident review starts from “we think it began around three weeks ago”, which is probably not where you want to start.
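A minimal sketch of what capturing those questions could look like in Python follows. The `LLMTrace` schema and its field names are assumptions for illustration, not an established standard. The idea is to record enough per request that two runs can be compared field by field.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LLMTrace:
    """What was true at the moment of one model call (hypothetical schema)."""
    request_id: str
    user_id: str
    model: str
    prompt_version: str
    retrieved_doc_ids: tuple[str, ...]
    tool_calls: tuple[str, ...]
    context_sha256: str  # fingerprint of the full context window
    output: str


def fingerprint(context: str) -> str:
    """Hash the context window so changes are detectable without storing it inline."""
    return hashlib.sha256(context.encode("utf-8")).hexdigest()


def to_json_line(trace: LLMTrace) -> str:
    """Serialize one trace as a single JSON line for an append-only log."""
    return json.dumps(asdict(trace), sort_keys=True)


def changed_fields(
    a: LLMTrace, b: LLMTrace, ignore: tuple[str, ...] = ("request_id",)
) -> list[str]:
    """Name the fields that differ between two traces, in declaration order."""
    da, db = asdict(a), asdict(b)
    return [key for key in da if key not in ignore and da[key] != db[key]]
```

Comparing two traces for the same question then shows directly whether the model, the prompt version, or the retrieved documents changed between a good answer and a bad one. A true semantic diff of the outputs is a harder problem that this sketch does not attempt.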
Governance & audit
Your MVP answers to nobody. It runs, impresses the room, and leaves nothing anyone will ever ask about again. A production system answers to compliance, to auditors, to regulators, and to customers, and every one of them eventually asks the same thing: who authorized the decision that affected me, and on what basis. The regulatory clock is running fast, in more than one jurisdiction, and the honest answer is usually a shrug.
Answering it properly means an audit trail that ties every decision back to a user, the policy in force at the time, and the exact model and prompt version behind it. That is largely the same evals and observability work from earlier, aimed at accountability rather than debugging, plus a system that keeps the record instead of forgetting. None of it is there by default, and risk and compliance will want their own policies and guardrails layered on top.
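One way such a record could be kept is sketched below in Python. The field names are hypothetical, and the hash chain is my own assumption about how to make the log tamper-evident; it is one option among several, not a regulatory requirement stated here.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


@dataclass(frozen=True)
class AuditEntry:
    """One decision, tied to who, which policy, and which model and prompt."""
    decision_id: str
    user_id: str
    policy_version: str
    model: str
    prompt_version: str
    outcome: str
    prev_hash: str  # digest of the previous entry, forming a chain

    def digest(self) -> str:
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


class AuditLog:
    """Append-only, hash-chained log held in memory for illustration."""

    def __init__(self) -> None:
        self._entries: list[AuditEntry] = []

    def append(self, **fields: str) -> AuditEntry:
        prev = self._entries[-1].digest() if self._entries else GENESIS
        entry = AuditEntry(prev_hash=prev, **fields)
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Check that every entry still points at the digest of its predecessor."""
        prev = GENESIS
        for entry in self._entries:
            if entry.prev_hash != prev:
                return False
            prev = entry.digest()
        return True
```

Editing any entry other than the last one changes its digest and breaks the chain at the next entry. To also detect changes to the most recent entry, the latest digest would have to be stored somewhere outside the log, which this sketch leaves out.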
Underneath all of it sits the design decision nobody has settled: where the line falls between autonomous action, static guardrails, and a human in the loop. You can make that call deliberately now, while it is cheap, or have it made for you later by whoever files the first complaint.
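Making that call deliberately can be as simple as writing the rule down in code. The Python sketch below is only an illustration: the action names, the monetary limit, and the confidence threshold are hypothetical values, and a real policy would come from risk and compliance rather than from a default argument.

```python
from enum import Enum


class Route(Enum):
    AUTONOMOUS = "autonomous"
    BLOCKED = "blocked_by_guardrail"
    HUMAN_REVIEW = "human_review"


# Hypothetical static guardrail: actions the agent may never take on its own.
BLOCKED_ACTIONS = frozenset({"delete_account", "sign_contract"})


def route_action(
    action: str,
    amount_eur: float,
    confidence: float,
    *,
    auto_limit_eur: float = 100.0,
    min_confidence: float = 0.9,
) -> Route:
    """Decide who gets to act: the agent, nobody, or a human reviewer."""
    if action in BLOCKED_ACTIONS:
        return Route.BLOCKED
    if amount_eur > auto_limit_eur or confidence < min_confidence:
        return Route.HUMAN_REVIEW
    return Route.AUTONOMOUS
```

The thresholds matter less than the fact that they are explicit, versioned alongside the code, and reviewable, so the answer to "who authorized this" can point at a specific rule.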
Human comprehension debt
When most code is written by AI, the team’s mental model of the system erodes faster than the system grows. Code shows up that works and that passes review. But comprehension is built through the act of doing, and we are no longer writing anything by hand, from scratch. A study from Anthropic themselves showed engineers using AI assistance scoring 17 percentage points lower on a comprehension quiz about their codebase: 50%, compared to 67% in the control group. Debugging had the steepest decline. While AI coding is a large step forward, we should not underestimate that something is also getting lost. Addy Osmani coined a term for this: comprehension debt. Code exists, understanding does not. Furthermore, the unit tests won’t help you either, because they too were written by AI. They were constructed to pass for what the AI was doing. Maybe they are correct, maybe they are not.
The answer to this dilemma is the least clear at the time of writing. Do we blindly trust the AI? Can we pit the agent that wrote the code against an adversarial agent running in a separate context? Or do we simply have to slow down, so that we can sustainably go faster by actually allowing our comprehension to keep up?
A new problem, an old craft
We are still learning how to do these things. However, before trying to create a whole new playbook, it is worth being precise about what is actually changing.
What is new is the problem: a system that does not do the same thing twice, that fails in ways the old ones never could, and that needs to adapt to regulation that did not exist a few years ago. Building the system is not the hard part; keeping it on purpose once it runs is. The roughly 80/20 split of cost between maintenance and operations on one side and development on the other remains, and the maintenance share may even grow while the development share shrinks.
The building blocks for solving the problem are not new, though. Cost discipline and tracking, evals, observability, audit trails, and code review are all practices we’ve spent years refining for systems we understood well. So when I say the discipline of running AI in production is not yet fully formed, I don’t mean we are missing knowledge. We just haven’t built the habit of aiming what we know at a target that is unpredictable.
The work is to extend the disciplines we have, not invent new ones. It’s still software engineering. It just got a little bit harder, requiring a little bit more discipline. The demo was never the hard part; everything that comes after is.


