6 min read · By Ravi Patel

Your AI bill, minus the AI you've already paid for

Most AI traffic is repeated traffic — the same prompts, the same near-duplicates, the same system messages. Caching is the difference between paying once and paying every time. Here's the math, the layers, and where Prism lands.

ai · api · caching · cost-optimization · developer-tools · semantic-cache

If you watch your AI API spend for a week, two things become obvious. The first is that costs scale linearly with traffic, which everyone expects. The second is that a lot of that traffic is the same traffic. The same support question, asked by ten different users in slightly different words. The same system prompt, prepended to every single request. The same internal tool query, run on a cron every five minutes. You're paying full price for the model to compute the same answer it computed yesterday.

Caching is the difference between paying for an AI call once and paying for it every time. It's not a clever optimization — it's the bare minimum. The interesting question is which caching, with what tradeoffs, at what layers.

Three layers, three tradeoffs

There's no single "AI cache." There are three layers that solve overlapping but distinct problems.

Exact match. If the request you're making has been made before, byte-identical, return the previous response. This is how ETags work on the web. It's instant, free, and never wrong. The catch is that it almost never hits in production AI traffic: even tiny variations, like a different timestamp in the prompt, a user-specific name, or a re-ordered tool list, make the requests non-identical. Exact-match cache hit rates in real-world AI APIs are typically in the 5–15% range.
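A minimal sketch of the layer, assuming a local Redis and treating a canonical JSON dump of the request as its identity; the names here are illustrative, not Prism's internals:

```python
import hashlib
import json

import redis

r = redis.Redis()  # assumes a local Redis; swap in your own connection

def fingerprint(request: dict) -> str:
    # Canonicalize first (sorted keys, no whitespace) so two dicts that
    # differ only in key order still hash to the same bytes.
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return "exact:" + hashlib.sha256(canonical.encode()).hexdigest()

def exact_lookup(request: dict) -> str | None:
    cached = r.get(fingerprint(request))
    return cached.decode() if cached else None

def exact_store(request: dict, response: str, ttl_seconds: int = 3600) -> None:
    # Conservative 1-hour TTL, matching the default discussed below.
    r.setex(fingerprint(request), ttl_seconds, response)
```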

Semantic match. If the meaning of the user's question matches a previous question closely enough, return the previous response. "What's your refund policy?" and "How do I get my money back?" are different strings but the same query. Semantic caching embeds both, computes cosine similarity, and returns the cached answer when similarity exceeds some threshold. This catches the 30–60% of traffic that the exact-match cache misses. The tradeoff is correctness: too low a threshold returns wrong answers (false positives); too high a threshold misses real matches. The 0.95 default is conservative; most production setups end up tuning it per use case.
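A sketch of the lookup, using fastembed's BGE-small (the embedding model named later in this post); the in-memory list stands in for a real vector store:

```python
import numpy as np
from fastembed import TextEmbedding

# BGE-small, the embedding model this post mentions Prism uses.
model = TextEmbedding("BAAI/bge-small-en-v1.5")

def embed(text: str) -> np.ndarray:
    return next(iter(model.embed([text])))

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_lookup(query: str,
                    cache: list[tuple[np.ndarray, str]],
                    threshold: float = 0.95) -> str | None:
    # Return the cached response whose stored query is most similar,
    # but only if that similarity clears the threshold.
    q = embed(query)
    best_score, best_response = -1.0, None
    for vec, response in cache:
        score = cosine(q, vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None
```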

Provider-native cache. Anthropic's prompt caching and OpenAI's cached input tokens operate at a different layer. They don't cache the response — they cache the prefix processing. If your system prompt is 2,000 tokens and the user message is 50 tokens, providers can recognize the system prompt from a recent request and skip re-processing those 2,000 tokens. You still pay for output, but input cost drops by 60–90% on cache hits. This works whenever your system message is stable across requests, which is most production traffic.
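On Anthropic's side this is explicit: you mark the stable prefix with cache_control and the usage block reports what was read from cache (OpenAI applies its discount automatically once prompts pass a size threshold). A sketch, assuming the anthropic Python SDK; the short system prompt is a stand-in for your real one:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = "You are a support agent for..."  # the long, stable prefix

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=512,
    system=[{
        "type": "text",
        "text": SYSTEM_PROMPT,
        # Mark the prefix cacheable; repeat requests within the cache
        # window skip re-processing these tokens. Note that providers
        # set a minimum prefix size before caching kicks in.
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "What's your refund policy?"}],
)

# Tokens written to / read from the prefix cache on this request.
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)
```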

The three layers are additive. Exact catches identical requests at zero latency. Semantic catches near-identical requests cheaply. Provider-native catches the system-prompt overhead even when neither cache hits. Stacked together they often cut total spend in half.

The math, conservatively

Let's work an example. Say you run a customer support bot: 50,000 requests per month, average input of 1,500 tokens (system prompt + retrieved context + user message), average output of 300 tokens. On Claude Sonnet at list prices that's roughly $360/month direct, and slightly more through Prism with its markup.

Now apply caching:

  • Exact match: 8% hit rate. $29 saved.
  • Semantic match (threshold 0.95): 22% additional hit rate on what exact missed. $79 saved.
  • Provider-native cache on the stable system prompt prefix: ~70% input-token discount on the remaining 70% of traffic that fell through to the model. $98 saved.

Total savings: $206/month, or 57% off. The exact ratios depend heavily on your traffic mix, but the shape is consistent: each layer compounds on the gaps the previous layer left behind.
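The arithmetic is short enough to check. A sketch using the hit rates above; the input-token share of spend is an assumed parameter (it depends on your token mix and model prices), which is why the provider-native line lands a dollar or two off the figure above:

```python
# Back-of-envelope for the layered savings. Hit rates and the $360
# base are from the worked example; INPUT_SHARE is an assumption.
BASE = 360.00           # $/month before caching
EXACT_HIT = 0.08        # exact-match hit rate
SEMANTIC_HIT = 0.22     # additional semantic hits on what exact missed
NATIVE_DISCOUNT = 0.70  # input-token discount on provider cache hits
INPUT_SHARE = 0.55      # assumed input-token share of per-request cost

exact = BASE * EXACT_HIT
semantic = BASE * SEMANTIC_HIT
remaining = 1.0 - EXACT_HIT - SEMANTIC_HIT  # 70% falls through to the model
native = BASE * remaining * INPUT_SHARE * NATIVE_DISCOUNT

total = exact + semantic + native
print(f"exact ${exact:.0f}, semantic ${semantic:.0f}, native ${native:.0f}")
print(f"total ${total:.0f}/month saved ({total / BASE:.0%})")
# -> exact $29, semantic $79, native $97; total $205/month saved (57%)
```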

Where caching backfires

Caching is one of the few places in software where being too aggressive changes the answer, not just the latency. Three failure modes are worth naming.

Stale answers. Your retrieval index updated. The customer's account state changed. The user's prior question affects this one. If your cache TTL is too long, the cached response is wrong because the world has moved. Default 1-hour TTLs are conservative for a reason. Anything user-state-dependent should bypass the cache entirely.
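In practice that means a TTL policy keyed to how fast each route's ground truth moves, plus an explicit bypass for anything stateful. A sketch; the route names are made up:

```python
# TTL per route, in seconds. 0 means bypass the cache entirely.
TTL_BY_ROUTE = {
    "kb_answer": 3600,     # stable docs: the conservative 1-hour default
    "pricing_quote": 300,  # changes occasionally: short TTL
    "order_status": 0,     # live per-user state: never cache
}

def cache_ttl(route: str) -> int:
    # Unknown routes get the conservative default rather than no cache,
    # a judgment call you may want to invert for safety-critical paths.
    return TTL_BY_ROUTE.get(route, 3600)
```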

Semantic false positives. "How do I cancel my subscription?" and "How do I cancel an order?" are 0.91 similar by some embeddings — close enough that a too-permissive threshold returns the wrong policy. The fix is monitoring: log every semantic hit's similarity score, sample the responses, raise the threshold whenever you see drift. There's no way to set this once and forget it.
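A sketch of the monitoring side, assuming you control the lookup path; the review band is an arbitrary illustration:

```python
import logging

log = logging.getLogger("semantic_cache")

THRESHOLD = 0.95
REVIEW_BAND = 0.02  # hits within this margin of the threshold get sampled

def record_semantic_hit(query: str, matched: str, score: float) -> None:
    # Every hit leaves a trail: the score and both queries.
    log.info("semantic_hit score=%.4f query=%r matched=%r",
             score, query, matched)
    # Near-threshold hits are the likeliest false positives; flag them
    # for human sampling so drift shows up before users report it.
    if score < THRESHOLD + REVIEW_BAND:
        log.warning("near-threshold semantic hit (%.4f); sample this pair",
                    score)
```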

Multi-tenant leakage. If account A's cache entries are visible to account B, you've shipped a privacy bug. Caches must be namespaced — by API key, by project, by organization — and cache lookups must hit only the requester's namespace. This is implementation hygiene more than a tradeoff, but it's the kind of thing that's easy to get wrong.
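The hygiene reduces to key construction: every cache key carries a tenant prefix derived from the credential, so a lookup physically cannot see another account's entries. A sketch:

```python
import hashlib

def namespaced_key(api_key: str, request_fingerprint: str) -> str:
    # Derive the namespace from a hash of the credential, never the raw
    # key, so cache keys don't leak secrets into logs or dashboards.
    tenant = hashlib.sha256(api_key.encode()).hexdigest()[:16]
    return f"cache:{tenant}:{request_fingerprint}"
```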

What Prism does

Prism stacks all three layers. Caching is free on every tier, Free and Paid alike, enabled by default with no header to set. The exact cache runs first (Redis, SHA-256 fingerprint), then the semantic cache (Upstash Vector, cosine over the user's last-message embedding via fastembed BGE-small), then provider-native cache pass-through to Anthropic or OpenAI. Each layer's hit shows up in the response headers (X-Prism-Cache-Status, -Saved-Cents, -Age-Seconds, -Similarity), so you can see exactly which layer caught what.
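If you want to watch the layers work, the headers are visible from any HTTP client. A sketch: the base URL and request shape below are placeholders rather than Prism's documented API; only the header names come from this post.

```python
import requests

resp = requests.post(
    "https://api.prism.example/v1/chat/completions",  # placeholder URL
    headers={"Authorization": "Bearer YOUR_PRISM_KEY"},
    json={
        "model": "claude-sonnet",  # placeholder model id
        "messages": [{"role": "user", "content": "What's your refund policy?"}],
    },
)

# Which layer (if any) served this response, and what it saved.
for name in ("X-Prism-Cache-Status", "X-Prism-Saved-Cents",
             "X-Prism-Age-Seconds", "X-Prism-Similarity"):
    print(name, resp.headers.get(name))
```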

The defaults are conservative: 1-hour TTL, 0.95 similarity threshold, key-scoped (your cache is namespaced to your API key, never cross-account). For most workloads that's the right setting. If you're tuning for a specific use case — tighter TTL on volatile data, lower threshold on a domain where paraphrases are common, project-level scope across multiple keys — those knobs are part of the Pro plan.

The thing that surprised me in the build was how often the semantic layer earns its keep. Exact-match hit rates stall in the single digits to low teens on any realistic traffic; the semantic layer is where the actual savings materialize, and it's the layer that has to be tuned. We default to 0.95 because anything lower is risky out of the box; production users monitor and adjust. The fact that you can monitor at all, that every hit logs a similarity score in usage_logs, is the part that matters.

What this is not

Caching is not magic and it's not a substitute for picking the right model. If you're using Claude Opus for a query that Haiku could handle, no amount of caching closes that 20× cost gap on the misses. Routing comes first. Caching is the layer underneath: it makes the right model call cheaper when traffic repeats. The same way you wouldn't ship a web app without browser caching, you shouldn't ship an AI app without prompt caching. It's just hygiene.

The bill you should be paying is the bill for the AI you actually consumed. Caching is what you do to get there.