📢 This content was created with AI assistance and reviewed by a human editor for accuracy and compliance.

"Free" is a dangerous word in AI infrastructure. It convinces teams to make decisions they would never make if they looked at the total cost of ownership. The open source model costs nothing to license. You download the weights, you spin up some GPUs, and you are running inference without paying per-token fees to a provider. It sounds like the obvious choice for any team with scale.

It is not. And here is the calculation that proves it.

Let us start with the obvious: the model weights are free. But the infrastructure to run them is not. A single H100 GPU costs approximately $2.54 per hour to run on major cloud providers. Run around the clock, that is roughly $1,850 per month per GPU. For a moderately sized open source model, say a 7B parameter model, you might need 40GB of GPU memory once you include KV cache headroom, which fits on a single 80GB H100. For a larger model, say 70B parameters, the weights alone take about 140GB at 16-bit precision, so you need at least two H100s, and a full eight-GPU node runs close to $15,000 per month. That is base compute only, before you account for scaling, failover, or any of the other operational requirements of a production system.
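The GPU arithmetic above can be sketched in a few lines. The hourly rate comes from the text; the 730-hour month, the 80GB card, and 2 bytes per parameter are assumptions, and the function names are illustrative only.

```python
import math

H100_HOURLY_USD = 2.54   # rate quoted in the text
HOURS_PER_MONTH = 730    # assumption: 8,760 hours per year / 12
H100_MEMORY_GB = 80      # assumption: 80GB H100 variant
BYTES_PER_PARAM = 2      # assumption: FP16/BF16 weights, no quantization


def monthly_gpu_cost(num_gpus: int) -> float:
    """Cost of keeping num_gpus H100s running for a full month."""
    return num_gpus * H100_HOURLY_USD * HOURS_PER_MONTH


def min_gpus_for_weights(params_billions: float) -> int:
    """Lower bound on GPUs needed just to hold the model weights.

    Ignores KV cache, activations, and redundancy, so a production
    deployment needs more than this.
    """
    weights_gb = params_billions * BYTES_PER_PARAM  # 1e9 params * bytes / 1e9
    return math.ceil(weights_gb / H100_MEMORY_GB)


for size in (7, 70):
    gpus = min_gpus_for_weights(size)
    print(f"{size}B: at least {gpus} GPU(s), ${monthly_gpu_cost(gpus):,.0f}/month")
```

Treat the result as a floor. Replicas for failover and memory for long contexts multiply it.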

Now compare that to API pricing. DeepSeek V3 charges $0.27 per million input tokens and $1.10 per million output tokens. GPT-4o-mini charges $0.15 per million input tokens and $0.60 per million output tokens. At 100 million tokens per month, the API bill at these rates is roughly $15 to $110, depending on the model and the input/output mix, and lower still with caching. That is a small fraction of the cost of a single dedicated H100. The bill only reaches the thousands of dollars when volume climbs into the billions of tokens per month.
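A minimal sketch for pricing a workload at the per-token rates quoted above. The 80/20 input/output split is an assumption, the dictionary keys are illustrative labels rather than official model identifiers, and caching discounts are ignored.

```python
# Prices in USD per million tokens, as quoted in the text.
PRICES = {
    "deepseek-v3": {"input": 0.27, "output": 1.10},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}


def monthly_api_bill(model: str, input_tokens: int, output_tokens: int) -> float:
    """Monthly bill in USD for a given token volume, before any caching discount."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Assumption: 100M tokens per month, split 80% input / 20% output.
for model in PRICES:
    bill = monthly_api_bill(model, 80_000_000, 20_000_000)
    print(f"{model}: ${bill:,.2f}/month")
```

Swapping in your own token counts and input/output split gives a first-order bill to set against the GPU cost.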

But the real cost differential is not in the base compute. It is in everything else.

The hidden costs of open source

Engineering time. An API is a managed service. You write code, you send requests, you get responses. Self-hosting requires you to manage the entire inference stack: model loading, request batching, GPU memory management, scaling policies, health checks, failover, upgrades, and security patches. Every hour your engineers spend on these tasks is an hour not spent on your product. At typical engineering salaries — $150,000 to $250,000 per engineer — a single engineer spending 20% of their time on infrastructure maintenance costs $30,000 to $50,000 per year. That is real money.
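The engineering overhead is simple to put a number on. This sketch uses the salary range and the 20% time share from the paragraph above. It counts salary only, so benefits and other employment costs (not covered in the text) would push the figure higher.

```python
def infra_maintenance_cost(annual_salary: float, time_fraction: float) -> float:
    """Yearly salary cost of the share of an engineer's time spent on infrastructure."""
    if not 0.0 <= time_fraction <= 1.0:
        raise ValueError("time_fraction must be between 0 and 1")
    return annual_salary * time_fraction


for salary in (150_000, 250_000):
    yearly = infra_maintenance_cost(salary, 0.20)
    print(f"${salary:,} salary: ${yearly:,.0f}/year, ${yearly / 12:,.0f}/month")
```

Expressing the result per month makes it directly comparable to a GPU or API bill.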

Caching and optimization. Provider-native prompt caching is deeply integrated into the inference stack. Self-hosting requires you to implement your own KV cache management, your own request routing for shared prefixes, your own eviction policies. The providers have teams of PhDs optimizing these systems. You have whatever time your engineering team can spare. The result is that API-based teams routinely achieve 75-95% cache hit rates on agent workloads, while self-hosted teams often struggle to break 30-40% without substantial investment.
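To see why the hit rate matters, here is a rough sketch of how it changes the effective cost of input tokens. It assumes cached tokens cost a fixed fraction of uncached ones. The 0.25 ratio is a hypothetical placeholder, not any provider's published discount, and the function name is illustrative.

```python
CACHED_PRICE_RATIO = 0.25  # hypothetical: cached tokens cost 25% of the list price


def effective_input_price(list_price: float, cache_hit_rate: float,
                          cached_price_ratio: float = CACHED_PRICE_RATIO) -> float:
    """Blended price per million input tokens at a given cache hit rate."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    uncached_part = (1.0 - cache_hit_rate) * list_price
    cached_part = cache_hit_rate * list_price * cached_price_ratio
    return uncached_part + cached_part


# Hit rates taken from the ranges mentioned in the text.
for hit_rate in (0.35, 0.85):
    price = effective_input_price(0.27, hit_rate)
    print(f"hit rate {hit_rate:.0%}: ${price:.4f} per million input tokens")
```

Under this assumption, the higher hit rate roughly halves the effective input price. The exact gap depends on the real discount you get.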

Flexibility and iteration. When a new model releases on an API, you can try it immediately, pay as you go, and switch back if it does not work. When you self-host, switching models means redeploying the entire inference stack — new weights, new memory requirements, possibly new hardware. The switching cost alone deters teams from trying better options. The API team tries five models in a month. The self-hosted team tries one.

Opportunity cost. The time you spend debugging GPU memory errors, configuring load balancers, and implementing basic observability is time you are not spending on your actual product — not improving prompts, not building features, not understanding your users. In competitive markets, that opportunity cost is the largest hidden expense of all.

The breakeven calculation

Here is how you actually calculate whether self-hosting makes economic sense:

Step 1: Estimate your monthly API bill at current volume. Include caching benefits. For most teams, this is $5,000 to $50,000 per month.

Step 2: Estimate the infrastructure cost to self-host at the same throughput. Include GPU costs, networking, storage, and the engineering time required to maintain the stack. For most teams, this is $10,000 to $100,000 per month, depending on model size and traffic patterns.

Step 3: Compare the two. If API costs are lower, the answer is trivial. If self-hosting costs are lower, ask whether the difference is large enough to justify the increased engineering overhead and reduced flexibility. For most teams, the answer is no until they are above 50 million to 100 million calls per month.

The breakeven point moves depending on model size and caching efficiency. For small models (7B parameters or fewer), the breakeven is higher because API pricing is already very low. For large models (70B parameters or more), the breakeven is lower because API prices are higher, though the larger GPU footprint that self-hosting requires offsets part of that advantage. In practice, for the majority of teams building AI applications today, the API is cheaper or close enough that the flexibility and reduced maintenance overhead tilt the decision firmly toward APIs.
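The three steps can be sketched as a small comparison. The GPU rate and the 20% engineering share come from earlier in the text. The $200,000 salary, the 730-hour month, and the 30% savings margin that stands in for "large enough to justify the overhead" are hypothetical values to replace with your own.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12


@dataclass
class SelfHostEstimate:
    gpu_count: int
    gpu_hourly_usd: float = 2.54            # H100 rate quoted in the text
    other_infra_monthly_usd: float = 0.0    # networking, storage (your estimate)
    engineer_salary_usd: float = 200_000.0  # hypothetical salary
    engineer_time_fraction: float = 0.20    # share of time spent on the stack

    def monthly_cost(self) -> float:
        """Step 2: GPUs plus other infrastructure plus engineering time."""
        gpu = self.gpu_count * self.gpu_hourly_usd * HOURS_PER_MONTH
        engineering = self.engineer_salary_usd * self.engineer_time_fraction / 12
        return gpu + self.other_infra_monthly_usd + engineering


def recommend(api_monthly_usd: float, self_host: SelfHostEstimate,
              required_savings_ratio: float = 0.30) -> str:
    """Step 3: pick self-hosting only if it beats the API by the required margin."""
    self_cost = self_host.monthly_cost()
    if self_cost >= api_monthly_usd:
        return "api"
    savings_ratio = (api_monthly_usd - self_cost) / api_monthly_usd
    return "self-host" if savings_ratio >= required_savings_ratio else "api"


estimate = SelfHostEstimate(gpu_count=4, other_infra_monthly_usd=500.0)
for api_bill in (5_000.0, 50_000.0):  # Step 1: range quoted in the text
    print(f"API ${api_bill:,.0f} vs self-host ${estimate.monthly_cost():,.0f}: "
          f"{recommend(api_bill, estimate)}")
```

The margin parameter is the judgment call in Step 3. Raising it reflects a higher value placed on flexibility and lower maintenance.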

When open source actually wins

There are legitimate reasons to self-host. Compliance requirements for sensitive data that cannot leave your infrastructure. Extreme latency requirements that cannot tolerate the network round-trip to an API. Massive scale — hundreds of millions to billions of calls per month — where the API bill genuinely exceeds self-hosting infrastructure costs. Or a core competency in model optimization where your team can outperform the providers on efficiency.

But for the vast majority of teams — from early-stage startups to mid-market companies — open source is not cheaper. It is a tax. You pay in engineering time, reduced flexibility, and opportunity cost. And you get nothing in return except the illusion of control.

The Jakarta marketplace team did not self-host. They used APIs. The Singapore logistics team did not self-host. They used APIs. The Shenzhen factory with 47 agents in production for nine months? APIs. The ProjectDiscovery security testing platform? APIs. These are not teams that cannot afford GPUs. They are teams that did the math and realized that "free" was the most expensive option.

*Data sources: H100 GPU pricing (public cloud rates, 2025–2026); DeepSeek V3 pricing via Future AGI (verified June 2, 2026); GPT-4o-mini public pricing (OpenAI, 2025–2026); Jakarta marketplace, Singapore logistics, Shenzhen factory case studies (industry documentation, 2025–2026); ProjectDiscovery cost reduction case study (2026).*