A year ago, a five‑person startup in Toronto spent four months building their AI feature on open‑source models. They had read the blog posts. They believed that proprietary APIs were a trap, that vendor lock‑in was inevitable, and that the only responsible choice was to self‑host. They spun up eight H100s on a cloud provider, loaded Llama 3‑70B, and started building.
Four months later, they had not shipped. Their infrastructure costs were running at $18,000 per month. One engineer was spending 80% of their time on GPU memory management, request batching, and scaling policies. The feature worked in their staging environment, but every time they tried to scale to production traffic, something broke. The CUDA out‑of‑memory errors. The cold‑start latency spikes. The model version upgrades that required a full redeploy.
Then they tried the API.
They picked GPT‑4o‑mini for most of their classification workload, reserved GPT‑4o for the ambiguous 10% of queries, and implemented prompt caching on their system instructions. Two weeks later, they shipped. Their monthly API bill: $3,200. Their engineering time spent on infrastructure: effectively zero. And the feature worked better than their self‑hosted version, because the API provider's inference stack was simply more optimized than anything a five‑person team could build.
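The article does not say how the team decided which queries counted as ambiguous. One common pattern is confidence-based escalation: send everything to the cheaper model first and re-run only low-confidence results on the larger one. The sketch below assumes that pattern. The `Classification` type, the `route` function, the 0.8 threshold, and the stub classifiers are all hypothetical stand-ins for real API calls.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Classification:
    label: str
    confidence: float  # 0.0 to 1.0; how this is obtained is up to the caller


# Hypothetical interface: any function mapping ticket text to a Classification.
Classifier = Callable[[str], Classification]


def route(text: str, small: Classifier, large: Classifier,
          threshold: float = 0.8) -> Tuple[Classification, str]:
    """Try the cheaper model first; escalate only when it is unsure."""
    first = small(text)
    if first.confidence >= threshold:
        return first, "small"
    return large(text), "large"


# Stub classifiers standing in for real API calls (assumptions for illustration).
def small_stub(text: str) -> Classification:
    confident = "refund" in text.lower()
    return Classification("billing" if confident else "other",
                          0.95 if confident else 0.4)


def large_stub(text: str) -> Classification:
    return Classification("technical", 0.9)


result, tier = route("I want a refund for last month", small_stub, large_stub)
print(tier, result.label)
```

Tuning the threshold is what sets the share of traffic that reaches the expensive model, so it is worth measuring against labeled samples rather than guessing.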
The startup's CTO later told a colleague: "We thought we needed control. What we actually needed was enough control to ship."
The control trap
The open‑source vs API debate has become a religious war. One side argues that APIs are a trap—vendor lock‑in, data privacy risks, unpredictable pricing. The other side argues that self‑hosting is a distraction—operational overhead, GPU costs, engineering waste. Both sides have valid points. Both sides are asking the wrong question.
The right question is not "open source or API?" It is: "How much control does your use case actually require?"
For most teams, the answer is: far less than the open‑source community wants you to believe. The Toronto startup learned this the hard way. They assumed that because they could control every layer of the stack, they should. That assumption cost them four months of engineering time, tens of thousands of dollars in GPU spend, and a delayed market entry that their competitors used to capture early customers.
Here is the framework they wish they had used from the start. It asks four questions. Answer them honestly, and the control question resolves itself.
Question 1: Does your data include personally identifiable information, trade secrets, or regulated data that cannot legally leave your infrastructure?
If yes, you may need to self‑host. But check the fine print. Many regulated industries now permit API usage with data masking, encryption in transit, and contractual data processing agreements. HIPAA‑compliant API endpoints exist. GDPR‑compliant data residency options exist. The number of use cases that truly cannot use APIs is smaller than most teams assume. If the answer is no — and for most early‑stage and mid‑market companies, it is — then APIs are on the table.
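To make "data masking" concrete, here is a minimal sketch of redacting obvious identifiers before text leaves your infrastructure. The two regular expressions and the `mask_pii` helper are illustrative assumptions only. Real compliance work needs much broader detection (names, addresses, account numbers) and legal review, which this does not provide.

```python
import re

# Illustrative patterns only; they catch common email and phone formats
# and will miss many real-world variants.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


ticket = "Contact jane.doe@example.com or +1 416-555-0199 about order 42"
print(mask_pii(ticket))
```

Masking runs before the API call, so the provider only ever sees the placeholders.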
Question 2: Do you need to modify the model architecture itself — attention mechanisms, custom kernels, novel training techniques?
If you are building AI products rather than AI infrastructure, the answer is almost certainly no. Model architecture control is relevant to model developers, not application builders. For 95% of production AI workloads — classification, extraction, summarization, routing, structured generation — the model's internal architecture is irrelevant. What matters is the output. And the output is something you can influence through prompts, retrieval, and fine‑tuning APIs, without ever touching the attention mechanism.
Question 3: Does your workload require sub‑50ms latency that cannot tolerate API network overhead?
Real‑time applications — high‑frequency trading, real‑time voice processing, certain robotics control loops — genuinely need the latency guarantees of self‑hosted inference. For everyone else, 100‑500ms API latency is perfectly acceptable. Most applications are not latency‑sensitive enough to justify the operational burden of self‑hosting. The Toronto startup's feature was a document classification system for customer support tickets. Latency tolerance: 2 seconds. The API delivered 400ms. Control over latency was irrelevant.
Question 4: Does your volume exceed the point where API costs outweigh self‑hosting infrastructure costs?
This is the only economic question that matters. And it has a clear answer for most teams.
A single H100 GPU costs about $2.54 per hour on major cloud providers. Running around the clock (about 730 hours a month), that is roughly $1,850 per month per GPU. A 70B‑parameter model requires at least two H100s for reasonable throughput: about $3,700 per month before you account for scaling, redundancy, or any of the other operational requirements of a production system.
Now compare that to API pricing. DeepSeek V3: $0.27 per million input tokens. GPT‑4o‑mini: $0.15 per million input tokens. At 10 million input tokens per month, a substantial workload for most teams, the bill at those list prices is $1.50 to $2.70. At 50 million, it is $7.50 to $13.50. Real bills run higher, because output tokens and larger models such as GPT‑4o are priced above these input rates, but the gap to GPU spend remains wide, and it widens further once you factor in engineering overhead.
The breakeven point where self‑hosting becomes cheaper than APIs is typically above 100 million tokens per month, and even then only if you ignore the cost of the engineering team required to keep the self‑hosted stack running. For the Toronto startup, at 12 million tokens per month, their self‑hosted GPU bill was $18,000 plus engineering overhead. The API solution cost $3,200. The choice was trivial.
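The comparison can be checked with a few lines of arithmetic. The sketch below uses the GPU rate and the input-token list prices quoted above. It assumes a 730-hour month and input tokens only, and it ignores output-token pricing, engineering time, and whether two GPUs could actually serve the volume in question. The function names are hypothetical.

```python
HOURS_PER_MONTH = 730  # assumption: average month, GPUs running around the clock


def gpu_monthly_cost(hourly_rate: float, num_gpus: int) -> float:
    """Monthly rental cost for always-on GPUs."""
    return hourly_rate * HOURS_PER_MONTH * num_gpus


def api_monthly_cost(tokens_per_month: float, price_per_million: float) -> float:
    """Monthly API cost for a given input-token volume."""
    return tokens_per_month / 1_000_000 * price_per_million


def breakeven_tokens(gpu_cost: float, price_per_million: float) -> float:
    """Input-token volume at which the API bill equals the GPU bill."""
    return gpu_cost / price_per_million * 1_000_000


gpu_cost = gpu_monthly_cost(2.54, 2)  # two H100s at the quoted hourly rate
print(f"Two H100s: ${gpu_cost:,.0f} per month")
for name, price in [("GPT-4o-mini", 0.15), ("DeepSeek V3", 0.27)]:
    tokens = breakeven_tokens(gpu_cost, price)
    print(f"{name}: breakeven near {tokens / 1e9:.1f}B input tokens per month")
```

Plugging in your own volume, model mix, and output-token share is the quickest way to see which side of the breakeven you are on.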
What "enough control" actually looks like
After the Toronto startup switched to APIs, they discovered something unexpected: the API gave them more than enough control. They could control the prompt. They could control the temperature, the top‑p, the frequency penalty. They could control the caching behavior. They could control the structured output schema. They could control the fallback logic when the model returned malformed responses.
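The fallback logic for malformed responses is the least obvious item in that list, so here is a minimal sketch of one way to do it: validate the model's reply against an expected shape, retry a bounded number of times, then fall back to a safe default. The label set, the `parse_label` and `classify_with_fallback` helpers, and the `call_model` callable are hypothetical and stand in for a real API client.

```python
import json
from typing import Callable, Optional

ALLOWED_LABELS = {"billing", "technical", "account", "other"}  # hypothetical label set


def parse_label(raw: str) -> Optional[str]:
    """Return the label if raw is valid JSON of the expected shape, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    label = data.get("label")
    if isinstance(label, str) and label in ALLOWED_LABELS:
        return label
    return None


def classify_with_fallback(text: str, call_model: Callable[[str], str],
                           max_attempts: int = 2, default: str = "other") -> str:
    """Retry on malformed output, then fall back to a default label."""
    for _ in range(max_attempts):
        label = parse_label(call_model(text))
        if label is not None:
            return label
    return default


# Stub model: the first reply is malformed, the second is valid JSON.
replies = iter(["Sure! The label is billing", '{"label": "billing"}'])
print(classify_with_fallback("Charged twice this month", lambda _t: next(replies)))
```

All of this lives in application code, which is the point: none of it requires access to the model weights.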
What they could not control was the model weights, the attention implementation, or the inference kernel. And they did not need to. Those layers of control were never relevant to their problem. They were paying for control they did not use.
The Jakarta marketplace team, building a 47‑language system, made the same discovery. They did not need to control the model. They needed to control the prompt, the retrieval examples, and the validation rules. The API gave them all of that. The Shenzhen factory, running 47 agents for nine months, made the same discovery. They needed to control the agent orchestration, the caching policy, and the circuit breakers. The API gave them all of that.
In every case, the control the teams actually needed was fully available from the API provider. The control they thought they needed — the kind that requires self‑hosting — was an illusion.
The decision framework, simplified
Here is the short version of the framework. It fits on a sticky note.
• Do you have a compliance requirement that absolutely forbids APIs? If yes, self‑host. If no, continue.
• Are you building a model, not an application? If yes, self‑host. If no, continue.
• Do you need sub‑50ms latency that APIs cannot provide? If yes, self‑host. If no, continue.
• Is your volume above 100 million tokens per month, and do you have a dedicated infrastructure team? If yes, consider self‑hosting as a cost optimization. If no, use the API.
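The four checks above can be sketched as a single function. This is an illustration of the checklist's order of evaluation, not a substitute for judgment. The `UseCase` fields and the `recommend` function are hypothetical names, and the thresholds are the ones stated in the list.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    compliance_forbids_apis: bool
    building_a_model: bool
    needs_sub_50ms_latency: bool
    monthly_tokens: int
    has_infra_team: bool


def recommend(u: UseCase) -> str:
    """Walk the four questions in order and stop at the first 'yes'."""
    if u.compliance_forbids_apis:
        return "self-host"
    if u.building_a_model:
        return "self-host"
    if u.needs_sub_50ms_latency:
        return "self-host"
    if u.monthly_tokens > 100_000_000 and u.has_infra_team:
        return "consider self-hosting"
    return "use the API"


# The Toronto startup as described: no hard constraints, 12 million tokens a month.
toronto = UseCase(False, False, False, 12_000_000, False)
print(recommend(toronto))
```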
For the overwhelming majority of teams — startups, mid‑market companies, enterprise internal tools — the answer to all four questions is "no." Which means the API is the correct choice. Not because APIs are morally superior to open source. Because the cost of control — in engineering time, in infrastructure spend, in delayed shipping — is simply not worth the benefit.
The Toronto startup learned this after four months of pain. You can learn it in the five minutes it takes to read this article.
*Data sources: H100 GPU pricing (public cloud rates, 2025–2026); DeepSeek V3 pricing via Future AGI (verified June 2, 2026); GPT‑4o‑mini public pricing (OpenAI, 2025–2026); Toronto startup migration case study (industry interview documentation, 2025); Jakarta marketplace case study (industry documentation, 2025); Shenzhen factory deployment case study (internal documentation, 2025–2026).*