What is inference caching in LLMs?

Inference caching stores previously computed intermediate results (KV cache) from LLM inference. When a similar query arrives, the cached result is reused instead of recomputed, cutting API costs by up to 70% for repetitive workloads.

How much can inference caching reduce API costs?

Production systems with good cache hit rates (60-90%) can reduce API bills by 70% or more. The exact savings depend on query similarity and workload repetition patterns.

Which LLM providers support inference caching?

Major providers like OpenAI, Anthropic, and Google offer prompt caching but often don't prominently advertise it. Implementing it usually requires explicit cache configuration in your API calls.

The 70% Off Your API Bill That Pricing Pages Won't Highlight

You're looking at your monthly LLM API bill, and something doesn't add up. You're not running anything unusual — just normal production traffic, maybe a customer support bot, maybe a code assistant. The volume feels reasonable. The cost doesn't.

And you're right to be suspicious.

Global LLM API spending doubled from \$3.5 billion to \$8.4 billion in 2025^[1], as companies moved from proof-of-concept to production at scale. But a large chunk of that spending isn't coming from model usage alone. It's coming from recomputing the same work over and over again.

The mechanism your providers don't advertise

Here's the architecture behind every LLM API call you pay for. When you send a prompt, the model doesn't just generate output—it first computes attention states for every token in your input, storing them in what's called a Key-Value (KV) cache. That computation isn't trivial. On a full cache miss, the inference engine runs prefill across the entire input, processing every token across every layer at full quadratic attention cost.

Now here's what happens if two requests share a common prefix—say, the same system instructions, the same tool definitions, or the same user context. That KV state can be reused instead of recomputed. The second request effectively gets the prefill work for free.

That's what prefix caching does. It caches attention states across multiple requests, not just within a single session. And when it works, the economics shift dramatically.

A Singapore logistics company recently implemented caching on their routing API. They hit cache rates between 70 and 90 percent^[2]. That means 7 to 9 out of every 10 API calls cost them almost nothing on the input side.

You're probably not there yet. But that number isn't a hallucination—it's what happens when you design your prompts with prefix stability in mind.

The 72 percent problem

A recent survey found that 72 percent of teams using LLM APIs aren't using prompt caching at all^[3]. Not that they're doing it poorly. Not that they're getting low hit rates. They're simply not doing it.

Why? The answer isn't technical. Caching isn't hard to enable. Most major providers have made it automatic or near-automatic—Heroku launched automatic prompt caching in December 2025, enabled by default. Databricks now supports prompt caching for open-source models with no setup required.

The real reason is cognitive. Most engineering teams think of caching as a database problem, not an LLM problem. They look at their API calls, see that each query has slightly different user input, and assume nothing could be reused. That assumption costs them tens of thousands of dollars annually.

But here's what they're missing: the system prompts, the tool definitions, the instruction blocks—these are identical across millions of requests. The user's specific question might change, but the scaffolding around that question shouldn't. And that scaffolding is where most of your input tokens live.

A cache that's designed around prefix stability—placing dynamic content at the end of the system prompt—consistently outperforms naive full-context caching, which can paradoxically increase latency.

What 70 percent actually saves you

Global LLM API prices dropped roughly 80 percent from 2025 to 2026^[4]. GPT-4-level performance costs about \$0.40 per million tokens now, down from \$30 per million in March 2023. That's the good news.

The bad news: inference volume is growing faster than prices are falling. Agentic workflows that make 50 to 200 LLM calls per task turn a cheap per-token price into an expensive per-task cost. Inference now accounts for roughly 70 percent of total AI compute costs^[5].

Now plug caching into that equation.

Take a production system running 10 million API calls per month. Say each call has 8,000 input tokens—a typical configuration for an agent with tool definitions and conversation history. With a 50 percent cache hit rate—conservative for most production workloads, where typical hit rates range from 30 to 50 percent—you're saving on half of those input tokens across every call.

At DeepSeek's current pricing — \$0.14 per million input tokens on cache miss, \$0.0028 per million on cache hit for V4-Flash^[6] — the difference is stark. Cached input tokens cost roughly 2 percent of what you'd pay for uncached processing. That's a 98 percent discount on the portion that hits.

For 10 million calls at 8,000 input tokens each:

Total input tokens: 80 billion
Without caching: ~\$11,200 (at \$0.14/1M)
With 50% cache hit: ~\$5,712 (\$5,600 uncached + \$112 cached)

That's \$5,488 saved per month. Per year, that's nearly \$66,000. For one moderately sized workload.

If you hit 70 percent, as the logistics company did? The saving jumps to \$7,600 monthly. Over \$90,000 annually.

And those numbers assume you're on a relatively low-cost provider. On Anthropic's API, where cached token reads cost 90 percent less than uncached — \$0.03 per million for Claude 3 Haiku versus \$0.30 for a miss^[7] — the multiple is even larger.

The point isn't the specific math. The point is that this isn't fine-tuning or model distillation or any of the heavy optimization work that requires ML expertise. This is fixing how you assemble strings before sending them.

Why the pricing page is silent

Open your provider's pricing dashboard. Look for the row that says "cache hit discount." You'll find it—usually 50 percent for OpenAI, 90 percent for Anthropic, 75 percent for Google Gemini's context caching. But look at the fine print.

OpenAI's caching is automatic but only effective above 1,024 tokens and charges no write fee. Anthropic's caching gives you 90 percent off on reads but charges a 25 percent premium on cache writes—\$6.25 per million instead of \$5.00 for a 5-minute TTL, or double for a 1-hour TTL. Google Gemini charges a per-hour cache storage fee in addition to cache hit pricing.

Every provider's pricing model is different. And none of them prominently advertise the before-and-after effect, since the savings figures are substantial relative to their standard rates.

But the bigger silence isn't on the pricing page — it's in the ecosystem conversation. For every article about fine-tuning techniques or model benchmarks, there are maybe five about inference caching. Prompt caching tends to receive less attention in the optimization conversation because it's not technically glamorous, even though it delivers significant savings.

Care Access, a healthcare organization, recently achieved an 86 percent reduction in data processing costs and 66 percent faster processing by caching static medical records while varying only the analysis questions^[8]. Amazon's ElastiCache semantic caching experiments reduced LLM inference cost by up to 86 percent and improved average latency by up to 88 percent^[9]. These aren't laboratory results. These are production numbers from companies that decided to look at their bills and actually do something about them.

A production agent stack without the three-layer caching pattern — engine prefix cache, API prompt cache, and gateway semantic cache — is carrying 30 to 60 percent avoidable inference bill. That's not a rounding error. That's the difference between a profitable product and one that's absorbing significant avoidable costs.

Now do this today

You don't need a six-month roadmap. You don't need to refactor your entire application. Here's what you do, starting this week:

First, measure your current cache hit ratio. Most teams have no idea what theirs is because they've never looked. A 30 to 50 percent hit rate is typical for unoptimized production workloads; anything below 20 percent means you're paying for expensive redundant computation.

Second, fix your prompt assembly. Caching works by hashing the prefix of your prompt. If the hash matches a recent request, you pay the cheap rate. If it doesn't—if your system prompt changes slightly between calls, if dynamic content is interspersed with static content, if you're generating fresh tool definitions on every request—you pay full price on the entire prefix.

Move everything that stays constant to the front of your prompt. Put user-specific content at the end. That's it. That's most of the optimization.

Third, check your provider's caching semantics. Some require explicit cache control markers. Some cache automatically above a token threshold. Some charge for writes. Read the documentation—it takes 15 minutes and the ROI is thousands of dollars per month.

Strategic cache block placement—not just enabling caching—is what separates teams with 10 percent hit rates from teams with 75 percent. You don't need to be an AI researcher to implement this. You need to care about your bill and understand how prefix hashing works.

And if you're still skeptical, run a small A/B test. Take one endpoint with stable prompts, enable caching, and watch the number. The results will speak for themselves.

📖 Deeper dive? For a complete framework on AI cost engineering — including how to evaluate caching strategies, model selection, and infrastructure TCO — see The AI Tax: 5 Decisions That Stop You From Overpaying for AI on Kindle.

Don't expect providers to highlight this optimization. The pricing page already shows the numbers — it's up to you to connect them with your architecture. That's your job.

📋 Data Authenticity Statement

^[1] SemiAnalysis LLM inference economics report, 2025. Global LLM API spending growth analysis.
^[2] Dev.to LLM cost optimization series, 2026. Singapore logistics company case study.
^[3] Future AGI prompt caching evaluations, 2026. Enterprise LLM caching adoption survey.
^[4] DigitalOcean inference optimization reports, 2026. Historical LLM pricing trend analysis.
^[5] SemiAnalysis / Future AGI, 2026. AI compute cost breakdown.
^[6] DeepSeek API official pricing documentation, June 2026.
^[7] Anthropic API caching pricing page, June 2026.
^[8] AWS Bedrock caching case studies, 2025. Care Access implementation.
^[9] AWS ElastiCache semantic caching experiments, 2025.

Additional sources: Zenodo context caching economic analysis (2026). Pricing figures reflect publicly available provider rate cards as of June 2026; actual costs may vary by provider and usage tier. Pricing is subject to change. All projections are estimates and should not be interpreted as guaranteed cost reductions.