What Is Inference Caching and Why Your Team Isn’t Doing It

If you are running anything more than a weekend prototype on an LLM API, there is a mechanism sitting in your provider’s documentation that you have probably never used. It is not hidden. It is not locked behind an enterprise plan. And its absence from your deployment is likely costing you between 50% and 90% of what you pay for input tokens — not in theory, but in actual, measurable production spend.

The mechanism is called inference caching — specifically, prompt caching or prefix caching at the provider level. When you send a prompt to an LLM, the model computes internal attention states for every token in your input, storing them in a Key-Value cache. On a full cache miss, the inference engine runs prefill across the entire input, processing every token across every layer. That is the expensive part.

But here is the thing most teams miss: if two requests share a common prefix — the same system instructions, the same tool definitions, the same static context — that work can be reused. The second request effectively gets the prefill computation for free. That is prefix caching. It caches attention states across requests, not just within a single session. And when it works, it works spectacularly.

In a real production deployment, an AI cloud provider integrated Alibaba Cloud Tair’s HiCache into their SGLang framework. The cache hit rate jumped from 40% to 80%. Average Time to First Token fell by 56%. Inference queries per second doubled. For workloads where agents share a fixed system prompt and tool definitions across all sessions — which describes nearly every agent you have ever built — cache hit rates consistently land between 75% and 95% on multi-turn conversations.

This is not a laboratory result. Major LLM providers now ship prompt caching with aggressive discounts: Anthropic caches reads at 90% off the base input price, OpenAI delivers 50% cost savings on cached tokens automatically, Google Gemini offers variable pricing based on context window. A 200,000-token system prompt without caching runs about 60 cents per request at Claude input rates; with caching on a warm cache, the same prompt runs around 6 to 8 cents per request. At 10,000 requests per day, that is the difference between \$6,000 and \$600 per day. An order of magnitude. On a single feature.

So why is your team not using it?

The short answer: because you do not know to look. The longer answer: because you, like 72% of teams using LLM APIs, have not implemented prompt caching at all, according to recent production telemetry from over 1,000 organizations. And the reason is not technical. Caching is not hard to enable. Most major providers have made it nearly automatic or require minimal configuration flags.

The real reason is cognitive. Most engineering teams think of caching as a database problem, not an LLM problem. They look at their API calls, see that each query has slightly different user input, and assume nothing could be reused. That assumption costs them tens of thousands of dollars annually — because the system prompts, the tool definitions, the instruction blocks — these are identical across millions of requests. The user’s specific question might change, but the scaffolding around that question should not. And that scaffolding is where most of your input tokens live.

Consider what an agentic workflow actually looks like. Every turn arrives carrying a long, mostly-static context: tool definitions, memory state, and prior conversation turns. A standard inference server treats each request as independent and recomputes attention from scratch — including the tokens that have not changed since the last turn. That is the exact inefficiency that prefix caching eliminates.

ProjectDiscovery, building an autonomous security testing platform with 20–40 LLM steps per task, implemented prompt caching and watched their cache hit rate climb from 7% to 84%. Overall cost savings hit 59% compared to full-rate pricing, then 66% post-optimization, with the last 10 days of their tracking period touching 70%. They served 9.8 billion tokens from cache. Not millions. Billions. Every one of those tokens would have been full-price without caching.

That is the reality check. The providers’ pricing pages do not prominently advertise the before-and-after effect because the numbers are embarrassing for their margins. DeepSeek’s API now charges \$0.27 per million input tokens on a cache miss and \$0.07 per million on a cache hit — a 74% discount on that portion. Tencent Cloud’s price reductions in June 2026 took cached hits down by 97.5% on DeepSeek-V4-Pro. These are not promotional teaser rates. They are structural economics: caching is cheaper because the computation has already been done.

So here is the question you need to ask yourself today: What percentage of your input tokens are repeats — the same system prompts, the same tool definitions, the same static context — that you are currently paying full price to recompute? In batch processing workloads, hit rates reach 92%. In agent workflows and FAQ traffic, typical hit rates land between 30% and 70%. Any number below 30% means you are burning money on something you could have solved in an afternoon.

And if your first instinct is to say "but our prompts are dynamic," go look at your most expensive endpoint right now. Pull up the prompt. Subtract the user’s message. What is left? That is your cacheable prefix.

If you are still not using inference caching at the provider level — not semantic caching at the application layer, but the prefix caching that ships with the API — you are not optimizing. You are subsidizing every other customer on your provider’s shared infrastructure.

This article covered what caching is and why most teams aren’t doing it. The case studies, the hit rates, the price differentials — they are all real, documented, and sitting in production logs right now. If your team’s numbers look worse than these benchmarks, the next step is not more engineering. The next step is opening your provider’s caching documentation and fixing your prompt assembly.

*Data sources: Alibaba Cloud HiCache production case study with Novita AI (Alibaba Cloud Blog); SGLang Production Deployment Guide, Spheron Blog (2026); Prompt Caching Efficiency — Measuring Reuse Across Real Workloads, Zenodo (March 2026); Datadog State of AI Engineering Report (2025); ProjectDiscovery cost reduction case study (2026); Prompt Caching in 2026: Anthropic vs OpenAI vs Gemini for Production Apps, Dev.to (2026); DeepSeek V3 pricing via Future AGI (verified June 2, 2026); Tencent Cloud DeepSeek-V4-Pro price adjustments, Beijing Daily (June 3, 2026).*

What Is Inference Caching and Why Your Team Isn’t Doing It

More analysis like this, weekly.

📚 Further Reading