Amazon Didn't Lose 6.3 Million Orders to Vibe Coding.
It Lost Them to the Absence of Pre-Generation Governance
Everyone is calling what happened at Amazon a vibe coding failure. That framing is wrong — and the distinction matters more than most people realize.
Here are the facts we know. In November 2025, Amazon mandated that 80% of its engineers use Kiro, its proprietary AI coding assistant, weekly. By December, Kiro had autonomously decided to “delete and recreate” an AWS production environment, triggering a 13-hour outage. By March 2026, the pattern had escalated: two outages in four days, a 99% drop in North American order volume, an estimated 6.3 million lost orders in a single six-hour window, and an internal acknowledgment from SVP Dave Treadwell that “site availability has not been good.” 1,500 engineers signed an internal petition. Amazon is now in a 90-day safety reset across 335 critical systems.
The industry reached for the obvious explanation: this is what happens when you let engineers “give in to the vibes,” shipping AI-generated code without review, without understanding, without accountability. Andrej Karpathy coined the term a year ago. Collins named it Word of the Year. And now it has a body count of 6.3 million lost orders to go with it.
But here is what the vibe coding narrative misses entirely.
The problem wasn't after generation. It was before it
Every post-mortem I’ve read focuses on what happened after the AI produced its output: the code was deployed without approval, the peer review was bypassed, the safeguards were reactive. And those are real failures. But they are downstream of a more fundamental problem that nobody is naming.
Amazon’s engineers had no way to know, before Kiro acted, how confident the model actually was in what it was about to do.
When Kiro decided to “delete and recreate the environment,” it didn’t announce its uncertainty. It acted. The model had, internally, compressed a complex situation into a single token sequence — a decision — without any mechanism to surface the degree of internal ambiguity that preceded that decision. The engineers saw the output. They never saw the intention.
This is precisely what my research on Intention Collapse formalizes.
Every act of language generation — and by extension, every act of agentic AI — compresses a rich internal state into a single token sequence. That process is a many-to-one projection: a vast space of possible intentions collapses into one visible action. The compression is lossy by design. What gets lost in that collapse is exactly the information you would need to decide whether to trust the output: the model’s internal entropy, the effective dimensionality of its reasoning, the degree to which latent knowledge is recoverable from what it produced.
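To make the many-to-one projection concrete, here is a minimal Python sketch. It is not the Intention Collapse implementation: the action vocabulary, the two probability vectors, and the function names `entropy_bits` and `collapse` are all hypothetical, chosen only to show two very different internal states producing the same visible action.

```python
import math

def entropy_bits(probs):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def collapse(probs, vocab):
    """Greedy decoding: keep only the single most probable token."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return vocab[best]

# Hypothetical action vocabulary and two made-up internal states.
vocab = ["recreate", "patch", "rollback", "escalate"]
confident = [0.97, 0.01, 0.01, 0.01]   # one option clearly dominates
uncertain = [0.28, 0.24, 0.24, 0.24]   # nearly a four-way coin flip

# Both states collapse to the same visible action.
same_action = collapse(confident, vocab) == collapse(uncertain, vocab)

print(collapse(confident, vocab), round(entropy_bits(confident), 2))
print(collapse(uncertain, vocab), round(entropy_bits(uncertain), 2))
```

Both states collapse to the same token, yet their entropies differ by roughly a factor of eight (about 0.24 bits versus about 2.0 bits). That gap is the information the output no longer carries.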
Amazon’s governance failure was not that engineers accepted AI output without review. It was that no instrument existed to read the signal before the output was generated. By the time the code existed, the intervention window had already closed.
The pattern Amazon is now trying to fix reactively
Look at what Amazon’s 90-day reset actually mandates: two-person peer review for all production changes, senior engineer sign-off for junior-deployed AI code, mandatory documentation before any deployment, automated compliance enforcement. Every one of these is a post-generation filter. Read the output, assess the output, approve or reject the output.
These are necessary. They are not sufficient.
Post-generation review assumes the generated output carries enough signal to evaluate its own trustworthiness. But a model operating in high-entropy territory (genuinely uncertain, extrapolating beyond its training distribution, or misreading its context) doesn’t produce output that announces its own uncertainty. It produces output that looks confident. The uncertainty is in the internal state, not in the text.
This is the asymmetry that makes AI governance genuinely hard. The more dangerous the output, the less likely it is to surface visible markers of its own danger. A model that is truly uncertain doesn’t write “I’m not sure about this” — it writes the most probable next token given its current state, and that token will look as fluent and coherent as everything else it generates.
The only way to catch that is upstream.
What pre-generation monitoring would have changed
I want to be precise here, because the claim I’m making is empirical, not speculative.
In the Intention Collapse experiments with Mistral 7B on GSM8K problems, we observed that pre-collapse intention entropy, measured before the model begins generating its response, is predictive of output quality. Chain-of-thought regimes that produce correct answers show sharply lower pre-generation entropy: 0.37 bits with CoT vs. 1.42 bits in the direct baseline. The internal state, before any token is produced, already contains information about whether the model is operating in a regime of genuine reasoning or high-uncertainty extrapolation.
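As a rough illustration of what such a measurement can look like, the sketch below computes the Shannon entropy, in bits, of a next-token distribution from raw logits. This is an assumed proxy, not necessarily the exact intention-entropy metric used in the experiments. The function names and the logits are hypothetical toy values; in practice the logits would come from a forward pass over the prompt, before any token is sampled.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pre_generation_entropy_bits(logits):
    """Entropy (bits) of the next-token distribution implied by the logits.

    Assumption: this stands in for "pre-generation entropy"; the article
    does not spell out the exact definition.
    """
    probs = softmax(logits)
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy logits over a four-token vocabulary (hypothetical values).
peaked = [8.0, 1.0, 0.5, 0.2]   # one token dominates
flat = [1.0, 1.0, 1.0, 1.0]     # uniform over four tokens

print(pre_generation_entropy_bits(peaked))
print(pre_generation_entropy_bits(flat))
```

A uniform distribution over four tokens gives the maximum of 2 bits, while the peaked one sits close to zero. The point is only that the number is available before the first token is emitted.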
This measurement is model-agnostic. It doesn’t require access to Kiro’s architecture or Amazon’s training data. It requires instrumentation at the point before generation — which is exactly the intervention window that Amazon’s current governance framework does not address.
If you can measure intention entropy before an agent acts, you can set thresholds. You can say: this deployment modification, in this context, is being initiated by a model operating above acceptable uncertainty levels. Escalate. Require human review not because the code failed review, but because the model’s internal state before generating the code indicated it was operating at the edge of its reliable distribution.
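A threshold gate of this kind could be sketched as follows. Everything here is an assumption for illustration: the `gate_action` function, the `GateDecision` type, and especially the 1.0-bit threshold, which is a placeholder that would need calibration against real outcomes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GateDecision:
    allow: bool
    reason: str

def gate_action(entropy_bits: float, touches_production: bool,
                threshold_bits: float = 1.0) -> GateDecision:
    """Decide, before the agent acts, whether to escalate to a human.

    threshold_bits is a hypothetical placeholder, not a calibrated value.
    """
    if entropy_bits < 0:
        raise ValueError("entropy cannot be negative")
    if touches_production and entropy_bits > threshold_bits:
        return GateDecision(
            allow=False,
            reason=(f"entropy {entropy_bits:.2f} bits exceeds "
                    f"{threshold_bits:.2f}; escalate to human review"),
        )
    return GateDecision(allow=True,
                        reason="within threshold; proceed under normal review")

# The entropy figures quoted above, reused purely as sample inputs.
high = gate_action(1.42, touches_production=True)
low = gate_action(0.37, touches_production=True)

print(high)
print(low)
```

The gate runs on the measurement alone, so it can block or escalate an action before any code or command has been generated.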
That is governance. Everything else is incident response.
Why this matters beyond Amazon
Amazon is the most visible case, not the exceptional one. The same dynamic is present anywhere AI agents are making consequential decisions: financial recommendations, clinical assessments, legal analysis, code deployments. The field’s response has been uniformly reactive — add filters, add reviewers, add approval layers, all downstream of generation.
The governance gap is architectural. We’ve built extraordinarily capable systems for generating outputs and extraordinarily weak systems for reading the internal state that precedes those outputs. We can evaluate what a model said. We cannot routinely measure how confident it was before it said it.
Amazon’s engineers didn’t fail because they trusted an AI. They failed because they had no instrument to know when not to.
The 90-day reset will help. The peer review requirements will catch some failures. And in six months, when the reset is over and the velocity pressure returns, the same unmeasured uncertainty behind these outages will be collapsing new intentions into new token sequences, and nobody will be watching.
A note on what comes next
The Linux Foundation announced a $12.5 million initiative on March 18, backed by Anthropic, AWS, GitHub, Google, Microsoft, and OpenAI, to address the open-source security crisis driven by AI-generated code. The EU Council delayed high-risk AI system enforcement by up to 16 months. Gartner predicts this pattern will accelerate.
The governance frameworks being built right now are almost entirely post-generation. They will be necessary and insufficient in equal measure.
The research agenda that follows from Intention Collapse points toward something different: instrumentation at the level of internal state, before language collapses into output. Not to prevent AI from acting — but to make the uncertainty visible before the action is taken.
Amazon learned that the hard way. The question is whether the rest of the field will learn it before the next 6.3 million orders disappear.


