The Two-Week Shipping Mentality

There is a phrase you hear constantly in Western AI engineering circles: "We're waiting for the model to get better." The agent is not quite reliable enough. The accuracy is not quite high enough. The latency is not quite low enough. So the team waits. They run more evaluations. They test more models. They refine the prompt for the hundredth time. And three months later, the feature is still not in production.

Meanwhile, in Jakarta, in Singapore, in Shenzhen, teams are shipping with models that score 72% on public benchmarks. They are not waiting for 98%. They are not running six-month evaluation cycles. They are building, deploying, measuring, and iterating — often in two-week cycles. And while Western teams are still deciding which model to use, the Asian teams are already on their fourth iteration of a feature that has been generating revenue for months.

This is not a story about technical superiority. It is a story about decision philosophy. The Western default is risk aversion expressed as perfectionism: we ship when the model is "ready." The Asian default, in many production environments, is speed expressed as pragmatism: we ship when the model is "good enough."

The numbers bear this out. A mid-sized logistics company in Singapore deployed an API routing agent using a model that benchmarked at 72% on MMLU. A Western competitor was still evaluating models in the 90%+ range. Six months later, the Singapore company had processed over 10 million successful API calls, identified 1,200 edge cases, refined its prompt against 300 of them, and reduced error rates to below 1%. The Western competitor had not shipped.

This pattern is not unusual. MIT’s 2025 State of AI in Business study found that 95% of enterprise AI pilot projects failed to deliver measurable business impact. The projects that succeeded were not the ones with the highest benchmark scores. They were the ones that shipped fastest and iterated most aggressively. Deloitte’s 2025 survey found that 42% of organizations abandoned at least one AI initiative in 2025, with an average sunk cost of $7.2 million per abandoned project. The primary driver was not technical failure. It was extended pilot phases that never reached production.

The two-week shipping mentality is not about recklessness. It is about recognizing that the cost of waiting is almost always higher than the cost of fixing problems in production. Every month you spend perfecting a model that could have been shipped at 80% accuracy is a month of user feedback you are not collecting, a month of edge cases you are not discovering, and a month of revenue you are not generating.

Here is the calculation the Singapore team ran, which more teams should run:

Cost of shipping early (80% accuracy): Some fraction of requests will fail. Each failure costs some amount in user friction, support tickets, or retries. The total cost is bounded and measurable.

Cost of waiting (98% accuracy): Three months of engineering salaries. Three months of deferred revenue. Three months of market share potentially captured by competitors. Three months of learning that could have been applied to the next iteration.

The Singapore team concluded that the cost of waiting was an order of magnitude higher than the cost of fixing failures in production. So they shipped. And they were right.
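To make the comparison concrete, here is a minimal sketch of that arithmetic in Python. Every number in it is an illustrative assumption, not the Singapore team's actual figure, and the class and function names are hypothetical. It also simplifies: accuracy is held constant over the period, and the harder-to-price items (lost market share, missed learning) are left out, which if anything understates the cost of waiting.

```python
from dataclasses import dataclass


@dataclass
class ShipEarlyCost:
    """Cost of shipping now and absorbing failures (all inputs are assumptions)."""
    monthly_requests: int     # expected request volume per month
    accuracy: float           # fraction of requests handled correctly, 0..1
    cost_per_failure: float   # friction, support tickets, retries per failed request

    def total(self, months: int) -> float:
        failures = self.monthly_requests * (1.0 - self.accuracy) * months
        return failures * self.cost_per_failure


@dataclass
class WaitCost:
    """Cost of holding the feature back while the model improves."""
    monthly_engineering_cost: float   # salaries spent on evaluation and tuning
    monthly_deferred_revenue: float   # revenue the feature would have earned

    def total(self, months: int) -> float:
        return (self.monthly_engineering_cost + self.monthly_deferred_revenue) * months


def waiting_to_shipping_ratio(ship: ShipEarlyCost, wait: WaitCost, months: int) -> float:
    """How many times more expensive waiting is than shipping early."""
    ship_total = ship.total(months)
    if ship_total == 0:
        return float("inf")  # no failures at all: waiting can only cost more
    return wait.total(months) / ship_total


if __name__ == "__main__":
    # Hypothetical inputs chosen only to show the shape of the calculation.
    ship = ShipEarlyCost(monthly_requests=100_000, accuracy=0.80, cost_per_failure=0.50)
    wait = WaitCost(monthly_engineering_cost=60_000.0, monthly_deferred_revenue=40_000.0)
    ratio = waiting_to_shipping_ratio(ship, wait, months=3)
    print(f"Waiting costs about {ratio:.1f}x as much as shipping early")
```

The useful part is not the ratio itself but being forced to write down a cost per failure and a monthly revenue estimate. If the ratio comes out below 1 for your inputs, waiting is the cheaper option and the argument here does not apply to that feature.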

The evidence for this approach is not theoretical. ProjectDiscovery, building a multi-agent security testing platform, did not wait for their models to be perfect. They shipped with a working system, measured cache hit rates climbing from 7% to 84%, and cut costs by 59% through iterative optimization, not through pre-deployment perfectionism. The Jakarta marketplace team shipped a 47-language system in nine person-days, not nine months. The Shenzhen factory with 47 agents in production for nine months did not pilot for a year before deployment. They deployed, measured, and improved.

The Western model-first perfectionism is a luxury of markets where AI features are nice-to-have rather than core to revenue. In markets where AI is the product, and where speed to market determines survival, the calculus changes. Two weeks of deployment and iteration will teach you more about your problem than six months of offline evaluation ever will.

The deeper issue is that benchmark scores are not predictive of production performance. A model that scores 98% on MMLU might fail on your specific task because your data distribution is different, your prompt phrasing varies, or your edge cases are not represented in the benchmark. The model that scores 72% but has been fine-tuned (or even just prompt-engineered) on your actual data will almost always outperform the 98% model in production.

The two-week shipping mentality does not mean accepting broken functionality. It means defining a minimum viable accuracy threshold — say, 80% or 85% — shipping when you hit it, and then iterating in production with real user feedback driving improvements. The alternative — waiting for 98% — means you are optimizing for a metric that does not exist in the real world.
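One way to turn that threshold into something a team can check is a small ship gate that measures accuracy on labeled examples drawn from your own traffic rather than on a public benchmark. The sketch below is an assumption about how such a gate might look; `measured_accuracy`, `ready_to_ship`, and the toy router are hypothetical names, and exact string match is only the simplest possible definition of "correct".

```python
from typing import Callable, Sequence, Tuple


def measured_accuracy(
    predict: Callable[[str], str],
    labeled_examples: Sequence[Tuple[str, str]],
) -> float:
    """Fraction of (input, expected) pairs the system gets exactly right."""
    if not labeled_examples:
        raise ValueError("need at least one labeled example")
    correct = sum(1 for text, expected in labeled_examples if predict(text) == expected)
    return correct / len(labeled_examples)


def ready_to_ship(accuracy: float, threshold: float = 0.80) -> bool:
    """Ship gate: pass once measured accuracy reaches the agreed threshold."""
    return accuracy >= threshold


if __name__ == "__main__":
    # Stand-in for a real model call; it routes on a single keyword.
    def toy_router(text: str) -> str:
        return "billing" if "invoice" in text else "shipping"

    # Hypothetical labeled samples; in practice these come from real requests.
    examples = [
        ("invoice overdue", "billing"),
        ("where is my parcel", "shipping"),
        ("resend invoice", "billing"),
        ("delivery delayed", "shipping"),
        ("refund my payment", "billing"),  # the toy router misses this one
    ]
    accuracy = measured_accuracy(toy_router, examples)
    print(f"accuracy={accuracy:.2f} ship={ready_to_ship(accuracy)}")
```

Five examples is far too few for a real decision. The point of the sketch is the shape: a fixed threshold agreed in advance, measured on your own data, and rechecked after each two-week iteration.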

So here is the question for your team: What feature is currently blocked by "the model isn’t good enough yet"? When was the last time you actually measured what "good enough" means for your users? And if you shipped today at 80% accuracy, how much would the failures cost you compared to the revenue you would start generating?

If you cannot answer those questions, you are not practicing responsible AI engineering. You are practicing indefinite postponement disguised as quality assurance.

*Data sources: MIT 2025 State of AI in Business Study (MLQ.ai, 2025); Deloitte 2025 Go-to-Market AI Survey; Singapore logistics routing API case study (industry documentation, 2025); ProjectDiscovery cost reduction case study (2026).*
