The Hidden Cost of Stateless AI APIs
Every AI API is stateless, which means you resend the entire conversation on every call. Here's what that actually costs — and why session memory matters more than you think.
If you're building anything conversational with an AI API — a chatbot, an agent, a multi-turn assistant — there's a cost hiding in plain sight that most developers don't think about until their bill shows up.
Every AI API is stateless. The model has no memory between calls. None. Claude, GPT, Gemini — it doesn't matter which provider. The API has no idea what you said in the previous message. If you want the AI to remember that the user's name is Ravi from message 1 when you send message 5, you have to resend messages 1 through 4 every single time.
And you pay for every resend.
The math nobody talks about
Let me walk through what this actually looks like in practice.
You're building a customer support chatbot using Claude Sonnet. Average conversation is 10 turns back and forth. Each user message is roughly 50 input tokens, each assistant response is roughly 200 output tokens. Pretty typical for a support bot.
Here's what you think you're paying for on message 10:
- 50 input tokens (the new user message)
- 200 output tokens (the response)
Here's what you actually pay for on message 10:
- 50 (message 10 from user)
- 50 + 200 (message 1 user and assistant)
- 50 + 200 (message 2)
- 50 + 200 (message 3)
- 50 + 200 (message 4)
- 50 + 200 (message 5)
- 50 + 200 (message 6)
- 50 + 200 (message 7)
- 50 + 200 (message 8)
- 50 + 200 (message 9)
- 200 output for the new response
That's 2,300 input tokens and 200 output tokens, not 50 and 200. You're paying 46x more for input on message 10 than you thought.
And it gets worse on every subsequent message. Message 15 costs even more. Message 20 is genuinely expensive. At Claude Sonnet pricing ($3 per million input tokens, $15 per million output), a 30-turn conversation with these message sizes costs roughly 4-5x what the same 30 exchanges would cost if each were billed independently — about $0.42 versus $0.09.
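The compounding is easy to verify with a few lines of arithmetic. A minimal sketch, using the token counts and Sonnet prices assumed above (the numbers are illustrative, not a billing formula):

```python
# Illustrative cost model: 50 input tokens per user message,
# 200 output tokens per reply, Claude Sonnet pricing assumed
# at $3 / 1M input and $15 / 1M output.
IN_TOK, OUT_TOK = 50, 200
IN_PRICE, OUT_PRICE = 3 / 1_000_000, 15 / 1_000_000

def input_tokens_for_turn(n):
    """Input tokens billed on turn n: the new message plus
    the full history of the previous n-1 exchanges."""
    return IN_TOK + (n - 1) * (IN_TOK + OUT_TOK)

def conversation_cost(turns):
    """Total cost of a conversation with full-history resends."""
    total_in = sum(input_tokens_for_turn(n) for n in range(1, turns + 1))
    return total_in * IN_PRICE + turns * OUT_TOK * OUT_PRICE

print(input_tokens_for_turn(10))        # 2300 — the number worked out above
print(round(conversation_cost(30), 2))  # 0.42 — vs ~$0.09 for 30 independent exchanges
```

The quadratic growth in `input_tokens_for_turn` is the whole story: every turn re-bills everything before it.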
For a production chatbot handling thousands of conversations a day, this compounding effect is the single biggest driver of your API bill. More than model choice. More than prompt optimization. More than anything else.
Why APIs work this way
The stateless design isn't a bug or an oversight. It's a deliberate architectural choice and it makes sense from the provider's perspective.
Stateless APIs scale horizontally. Any request can hit any server because no server has to know anything about previous requests. They're simpler to operate, easier to debug, and they keep the model's compute profile predictable. Every request is a fresh context window processed from scratch.
But "simpler for the provider" translates directly to "more expensive for the developer." You pay for the provider's architectural convenience, one token at a time, on every request.
The workarounds most developers build
Every developer building a chat application discovers this problem on their own, usually after their first big bill. And then they build one of these four workarounds:
Sliding window. Keep only the last N messages and drop the rest. Simple to implement, but the AI genuinely forgets older context. User gets frustrated when they have to repeat things.
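A sliding window is only a few lines. A sketch of the idea (the window size and the choice to always keep the system prompt are my assumptions, not a standard):

```python
def sliding_window(messages, max_messages=8):
    """Keep the system prompt (if any) plus the last max_messages turns.
    Everything older is silently dropped — the forgetting is the tradeoff."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_messages:]

history = [{"role": "system", "content": "You are helpful."}]
history += [{"role": "user", "content": f"msg {i}"} for i in range(20)]
trimmed = sliding_window(history)
print(len(trimmed))  # 9: system prompt + last 8 messages
```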
Summarization. Periodically summarize older messages into a compact context block. Better quality than sliding window, but adds complexity, introduces another model call, and the summary still costs tokens.
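The summarization pattern looks something like this sketch. The thresholds are arbitrary assumptions, and `summarize` here is a stand-in for the extra call to a cheap model:

```python
SUMMARIZE_AFTER = 12  # assumption: compress once history exceeds this
KEEP_RECENT = 4       # assumption: keep the newest turns verbatim

def summarize(messages):
    """Stand-in for a call to a cheap model that returns a digest."""
    return f"Summary of {len(messages)} earlier messages."

def compact_history(messages):
    """Replace older messages with a single summary block."""
    if len(messages) <= SUMMARIZE_AFTER:
        return messages
    old, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    digest = {"role": "system", "content": summarize(old)}
    return [digest] + recent

history = [{"role": "user", "content": f"msg {i}"} for i in range(20)]
print(len(compact_history(history)))  # 5: one summary block + 4 recent messages
```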
Conversation database. Store messages in Postgres or Redis, retrieve them on each request, assemble the context, send it to the API. This is the right answer for production apps but it's significant engineering work — schema, TTL management, concurrent access, cleanup, monitoring.
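The shape of that pattern, with a plain dict standing in for Postgres or Redis (a real store also needs the TTLs, locking, and cleanup mentioned above):

```python
# A dict stands in for Postgres/Redis; everything hard about the
# real version (TTLs, concurrent access, cleanup) is omitted here.
store = {}

def append_message(session_id, role, content):
    store.setdefault(session_id, []).append({"role": role, "content": content})

def build_context(session_id, system_prompt):
    """Assemble the full message list to send to the API on each request."""
    return [{"role": "system", "content": system_prompt}] + store.get(session_id, [])

append_message("user-123", "user", "My name is Ravi.")
append_message("user-123", "assistant", "Nice to meet you, Ravi.")
ctx = build_context("user-123", "You are a helpful assistant.")
print(len(ctx))  # 3: system prompt + two stored messages
```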
Prompt caching. Anthropic and OpenAI both support prompt caching now. Anthropic lets you mark prompt prefixes as cacheable and bills cache reads at roughly 10% of the normal input rate (with a small surcharge on cache writes); OpenAI discounts cached input automatically at its own rate. This is genuinely helpful and reduces costs significantly. But it requires understanding the caching semantics, structuring your prompts correctly, managing cache breakpoints, and it only helps on repeated prompt prefixes. Dynamic conversation history is harder to cache because it changes on every turn.
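Some rough arithmetic shows what caching buys on the message-10 example from earlier. This assumes Anthropic's ~10% cache-read rate and a best case where everything except the newest message is a cache hit:

```python
# Rough savings model for prompt caching. Assumes cache reads bill at
# 10% of the normal input rate and the entire history is a cache hit —
# a best case, since cache writes and misses cost extra.
IN_PRICE = 3 / 1_000_000  # $/input token, Claude Sonnet assumed

def turn_cost(new_tokens, cached_tokens, cache_read_rate=0.10):
    return new_tokens * IN_PRICE + cached_tokens * IN_PRICE * cache_read_rate

full = turn_cost(2300, 0)     # message 10, no caching: all 2,300 tokens at full rate
cached = turn_cost(50, 2250)  # message 10, history fully cached
print(round(full / cached, 1))  # 8.4 — input cost drops by roughly 8x in the best case
```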
All four of these approaches work. All four of them require the developer to write, maintain, and debug memory management code that isn't really core to their product. Nobody building a customer support chatbot wants to become an expert in conversation history compression.
What memory-as-a-service looks like
Here's the alternative: move the memory problem out of the application entirely.
Instead of your code managing conversation history, a proxy sits between your application and the AI provider. You send the proxy just the new message. The proxy handles everything else — storing the conversation, retrieving it on the next call, assembling the context, forwarding it to the provider, caching what can be cached.
Your code gets simpler. Much simpler. Here's what your client code looks like without session memory:
```python
# Without memory — you manage history yourself
import openai

conversation = [
    {"role": "system", "content": "You are a helpful assistant."}
]

def chat(user_message):
    conversation.append({"role": "user", "content": user_message})
    response = openai.chat.completions.create(
        model="claude-sonnet-4-6",
        messages=conversation,
    )
    assistant_message = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": assistant_message})
    return assistant_message
```
And here's what it looks like with session memory at the proxy level:
```python
# With memory — the proxy handles history
import openai

def chat(user_message):
    response = openai.chat.completions.create(
        model="claude-sonnet-4-6",
        messages=[{"role": "user", "content": user_message}],
        extra_headers={"X-Prism-Session": "user-123"},
    )
    return response.choices[0].message.content
```
That's it. No conversation array. No history management. No database queries. You send one message. The proxy sends the full context to the provider and bills you normally. The complexity is abstracted away.
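On the proxy side, the core loop is conceptually small. A sketch of the flow, not Prism's actual implementation — the real service also needs persistence, TTLs, and caching, and `session_id` would come from the X-Prism-Session header:

```python
# Conceptual proxy-side flow: the client sends one message plus a
# session header; the proxy assembles and forwards the full history.
sessions = {}

def handle_request(session_id, new_message, forward):
    """forward() stands in for the call to the upstream provider."""
    history = sessions.setdefault(session_id, [])
    history.append({"role": "user", "content": new_message})
    reply = forward(history)  # full context goes upstream, billed normally
    history.append({"role": "assistant", "content": reply})
    return reply

# A fake upstream that just reports how much context it received:
fake = lambda msgs: f"saw {len(msgs)} messages"
handle_request("user-123", "hello", fake)
print(handle_request("user-123", "hi again", fake))  # saw 3 messages
```

The asymmetry is the point: the client sends one message, the upstream still sees the whole conversation.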
The honest tradeoff
I want to be clear about what memory-as-a-service does and doesn't solve.
What it solves: the complexity of managing conversation history in your application code. You stop writing memory management code. You stop debugging context window overflow. You stop building conversation databases. Your chat feature becomes a 5-line wrapper around a single API call.
What it doesn't solve: the underlying token cost. The provider still charges you for every input token, including the re-injected history, because the provider still has to process those tokens. Moving memory to a proxy doesn't make the tokens free.
But it does unlock optimizations that are hard to do in application code. The proxy can automatically apply prompt caching across sessions. The proxy can compress older messages when conversations get long, using a cheap model to summarize context before forwarding to the expensive model. The proxy can share cache hits across users with similar system prompts. These are all optimizations that individual developers rarely implement themselves because the engineering effort isn't worth it for a single application.
At scale, a proxy handling millions of sessions can optimize the memory layer in ways that no individual app could. That's the real cost benefit — not the fact that memory exists, but the fact that memory is managed by infrastructure designed specifically to manage it.
When you need this
Not every AI application needs session memory. If you're doing stateless tasks — classification, extraction, translation, summarization of a single document — there's no conversation to maintain. Prism's session memory feature is opt-in exactly because most API calls don't need it.
But if you're building anything conversational, anything multi-turn, anything agentic that maintains context across calls — you need a memory strategy. The question is whether you want to build that strategy yourself or let infrastructure handle it.
For a weekend side project, roll your own sliding window. It's fine. For a production chatbot serving thousands of users, you're going to need something more robust, and "more robust" is exactly the kind of engineering problem that gets delayed, deprioritized, and eventually becomes the thing that breaks in production at 2 AM.
Memory-as-a-service exists because this problem is common enough and painful enough that it's worth solving once, centrally, so that every developer doesn't have to solve it separately.
The takeaway
Stateless AI APIs are convenient for providers and invisible to developers who don't think about the cost implications. But every multi-turn conversation is paying a hidden tax — input tokens for the full conversation history, re-sent on every call, forever.
You can fight this with sliding windows, summarization, conversation databases, prompt caching, or any combination of those. They all work. They all require engineering effort that has nothing to do with your actual product.
Or you can move memory out of your application code entirely and let a proxy handle it. Your code gets simpler. Your mental model gets cleaner. The complexity goes somewhere that's designed to handle it.
Either way, the first step is recognizing that the tax exists. Most developers don't, until the bill arrives.
I built Prism because I wanted session memory without building a conversation database. Add one header (X-Prism-Session) to any OpenAI-compatible request and Prism handles the history for you. Works across Anthropic, OpenAI, and Google. Free tier available. Read the docs.