6 min read · By Ravi Patel

How I Cut My AI API Costs by 40% Without Changing a Single Prompt

Most developers overpay for AI by sending every query to the same model. Here's how intelligent routing across Anthropic, OpenAI, and Google saved me 40% on API costs.

Tags: ai, api, developer-tools, cost-optimization

I was spending about $200 a month on Anthropic API calls while building a product that makes a lot of AI requests, some complex, some dead simple. Every single call went to Claude Sonnet because it was "good enough" and I didn't want to deal with multiple providers.

Then I sat down and actually looked at what those calls were doing.

About 60-70% of my API calls were simple tasks. Summarise this paragraph. Extract the name from this email. Classify this support ticket. Translate this sentence. These don't need Sonnet. A model like Gemini Flash or Claude Haiku handles them perfectly at a fraction of the cost.

The other 30-40% were genuinely complex: analysing financial data, generating reports, multi-step reasoning. Those needed a capable model. But I was paying Sonnet prices for the simple stuff just because I couldn't be bothered to manage multiple providers and figure out which model to use for each call.

So I fixed it. And the fix turned into a product.

The actual cost difference between models

Here's what caught my attention. These are real per-million-token costs as of early 2026:

Gemini 2.5 Flash costs about $0.15 per million input tokens and $0.60 per million output. Claude Haiku is $0.80 input and $4.00 output. GPT-4o-mini is $0.15 input and $0.60 output.

Compare that to the "default" models most developers use: Claude Sonnet at $3.00 input and $15.00 output. GPT-4o at $2.50 input and $10.00 output.

For a simple summarisation task, you're paying 10-20x more than you need to by using Sonnet instead of Flash. And the output quality on straightforward tasks is nearly identical.

The maths is simple. If 65% of your calls can go to cheap models and 35% need the mid-tier ones, your blended cost drops from roughly $8 per million tokens to about $3-4 per million. That's a 50% reduction before you've changed a single prompt.
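That back-of-the-envelope blend is easy to check. A quick sketch, using rough per-million-token prices averaged across input and output (the exact blend depends on your input/output token mix):

```python
# Rough blended-cost estimate: 65% of calls to a cheap model,
# 35% to a mid-tier model. Prices are illustrative averages of the
# per-million-token figures quoted above, not exact billing rates.
cheap_price = 0.40   # ~Gemini Flash / GPT-4o-mini territory ($/M tokens)
mid_price = 9.00     # ~Claude Sonnet, averaging $3 in / $15 out ($/M tokens)

blended = 0.65 * cheap_price + 0.35 * mid_price
print(f"${blended:.2f} per million tokens")  # vs ~$9 for all-Sonnet
```

That lands at about $3.41 per million, squarely in the $3-4 range, roughly half the all-Sonnet figure.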

Why developers don't do this already

Because it's genuinely painful to set up. You need to sign up for Anthropic, OpenAI, and Google. Manage three API keys. Learn three different request formats (they're similar but not identical). Build routing logic. Handle failures when one provider goes down. Track costs across three billing dashboards.

Nobody does this for a side project. Most companies don't do it either — they pick one provider and accept the overspend because the engineering cost of multi-provider routing isn't worth it.

What intelligent routing actually looks like

The approach I built classifies each incoming query before routing it. The classifier looks at signals in the prompt: how long it is, whether it contains code, whether it asks for analysis or reasoning, whether there's a system prompt, and how deep the conversation is.

Based on those signals, each query gets tagged as one of four types: simple, code, reasoning, or complex. Then the routing table maps that type to a model based on your cost-quality preference.
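To make that concrete, here's a toy version of a signal-based classifier. The signal checks and thresholds are invented for illustration, not Prism's actual logic:

```python
# Toy query classifier: tags a prompt as simple / code / reasoning /
# complex using cheap surface signals. Every threshold here is
# illustrative; a real classifier would be tuned against data.
def classify(prompt: str, conversation_depth: int = 0) -> str:
    has_code = "```" in prompt or "def " in prompt or "{" in prompt
    reasoning_words = ("analyse", "analyze", "explain why", "compare",
                       "step by step", "reason about")
    asks_reasoning = any(w in prompt.lower() for w in reasoning_words)
    long_prompt = len(prompt.split()) > 300
    deep_conversation = conversation_depth > 10

    if has_code:
        return "code"
    if (asks_reasoning and long_prompt) or deep_conversation:
        return "complex"
    if asks_reasoning:
        return "reasoning"
    return "simple"

print(classify("Summarise this paragraph: ..."))               # simple
print(classify("Explain why this approach fails under load"))  # reasoning
```

The point isn't the specific heuristics; it's that classification has to be cheap, because it runs on every request before any model is called.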

If you want aggressive cost savings, simple tasks go to Gemini Flash, code goes to Haiku, reasoning goes to Haiku, and only truly complex multi-step queries go to Sonnet. That's the cheapest option that still maintains a quality floor.

If you want the best answer regardless of cost, everything goes to the most capable model available. Simple or not.

The interesting case is the middle ground — you want good answers at a reasonable price. Simple tasks go cheap, but anything with substance goes to a capable model. This is where most production apps should operate.
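Those three modes reduce to a routing table along these lines. The model assignments follow the post; the table structure and mode names are a sketch:

```python
# Routing table: (mode, query_type) -> model. "cost" and "quality"
# mirror the modes described above; "balanced" sends anything with
# substance to a capable model. Model names are illustrative.
ROUTES = {
    "cost": {
        "simple": "gemini-2.5-flash",
        "code": "claude-haiku",
        "reasoning": "claude-haiku",
        "complex": "claude-sonnet",
    },
    "balanced": {
        "simple": "gemini-2.5-flash",
        "code": "claude-sonnet",
        "reasoning": "claude-sonnet",
        "complex": "claude-sonnet",
    },
    "quality": {t: "claude-sonnet"
                for t in ("simple", "code", "reasoning", "complex")},
}

def pick_model(mode: str, query_type: str) -> str:
    return ROUTES[mode][query_type]

print(pick_model("cost", "simple"))        # gemini-2.5-flash
print(pick_model("balanced", "reasoning")) # claude-sonnet
```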

The quality floor matters more than the routing

Here's the thing I learned building this: the routing algorithm is less important than the quality floor. If your cost optimisation ever sends a complex reasoning task to a model that can't handle it, the developer loses trust immediately. One bad answer and they switch back to hardcoding Sonnet.

So the classifier has to be conservative. When in doubt, route to a more capable model. It's better to overpay slightly on an edge case than to return a garbage response. The savings come from volume — getting the easy 60-70% of calls right, not from being aggressive on the hard 30%.

Session memory: the other cost nobody talks about

While building the router, I noticed another source of waste. Every AI API is stateless. If you're building a chatbot or any multi-turn interaction, you resend the entire conversation history on every single call.

Message 1: you send 1 message. Message 5: you send all 5 messages. Message 20: you send all 20. You're paying input token costs on the same messages over and over.
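The waste compounds quadratically. A toy calculation, assuming uniform 100-token messages, shows how the cumulative input you pay for grows over a conversation:

```python
# Cumulative input tokens paid across an n-turn chat when the full
# history is resent on every call. Call k sends k messages, so the
# total is tokens_per_message * (1 + 2 + ... + n).
tokens_per_message = 100  # illustrative uniform message size

def cumulative_input_tokens(n_turns: int) -> int:
    return tokens_per_message * sum(range(1, n_turns + 1))

for n in (1, 5, 20):
    print(n, cumulative_input_tokens(n))
# A 20-turn chat bills 21,000 input tokens, not 2,000, because
# every earlier message is re-billed on every later call.
```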

The fix was adding a memory layer at the proxy level. The developer sends a session identifier with their request. The proxy stores the conversation history and assembles it before forwarding to the provider. The developer sends one new message each time. The proxy handles the rest.

This doesn't reduce the provider cost — the full history still gets sent. But it eliminates the need for the developer to build conversation management infrastructure. No Redis store, no message assembly logic, no context window overflow handling. One header and it works.
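At its core, the memory layer is little more than a keyed message store plus history assembly before forwarding. An in-memory sketch (a real proxy would use a persistent store, record assistant replies too, and handle context-window limits):

```python
# Minimal session-memory layer: the client sends one new message
# plus a session id; the proxy appends it to the stored history and
# forwards the full conversation to the provider.
from collections import defaultdict

sessions: dict[str, list[dict]] = defaultdict(list)

def assemble_request(session_id: str, new_message: dict) -> list[dict]:
    history = sessions[session_id]
    history.append(new_message)
    return list(history)  # full conversation, ready to forward

assemble_request("abc", {"role": "user", "content": "Hi"})
msgs = assemble_request("abc", {"role": "user", "content": "And then?"})
print(len(msgs))  # 2 -- the client only ever sent one message per call
```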

The more interesting cost benefit comes later with compression. When conversations get long, the proxy can summarise older messages using a cheap model before forwarding. The developer doesn't manage this. The summary is transparent. And the input token count on long conversations drops significantly.
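The compression step can be sketched as: once history passes a threshold, collapse the oldest messages into a single summary turn produced by a cheap model. Here `summarise` is a stand-in for that cheap-model call, and the threshold is invented:

```python
# History compression sketch: keep the most recent messages verbatim
# and fold everything older into one summary turn. `summarise` stands
# in for a call to a cheap model like Haiku or Flash.
KEEP_RECENT = 4  # illustrative threshold

def summarise(messages: list[dict]) -> str:
    # Placeholder: a real implementation calls a cheap model here.
    return f"[summary of {len(messages)} earlier messages]"

def compress(history: list[dict]) -> list[dict]:
    if len(history) <= KEEP_RECENT:
        return history
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    summary = {"role": "system", "content": summarise(old)}
    return [summary] + recent

history = [{"role": "user", "content": f"msg {i}"} for i in range(10)]
print(len(compress(history)))  # 5: one summary turn + four recent turns
```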

Real numbers from production

I ran a test with the same prompt across different routing strategies.

A reasonably complex prompt about Indian tax law sent directly to Claude Sonnet: 955 input tokens, 1,115 output tokens, costing roughly $0.02 at Sonnet's per-token rates.

The same prompt through intelligent routing in balanced mode: same quality response, same token counts, but routed to the optimal model. Cost was comparable in this case because the classifier correctly identified it as a reasoning task.

Where the savings show up is in aggregate. Across a mix of simple and complex calls — the kind any production app generates — balanced mode saves 30-50% compared to hardcoding a single mid-tier model. Aggressive cost mode saves 50-70% on workloads that are mostly simple tasks.

The five-minute integration

The whole point is that none of this should require work from the developer. The proxy accepts OpenAI-compatible requests. Switching means changing one URL:

- base_url = "https://api.openai.com/v1"
+ base_url = "https://api.prism.ssimplifi.com/v1"

Existing prompts work unchanged. Existing response parsing works unchanged. The developer adds one header to choose their cost-quality preference, and optionally a session header for memory.

The proxy handles model selection, provider failover, session memory, cost tracking, and response normalisation. The developer's code stays exactly as it was.
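Put together, a request to the proxy looks like this, built with only the standard library. The header names `X-Prism-Mode` and `X-Prism-Session` are illustrative, not Prism's documented API; check the docs for the real ones. The pattern is what matters: an OpenAI-shaped body, one URL swap, and two optional headers:

```python
# Sketch of a proxied request. Header names and the "auto" model
# value are assumptions for illustration -- the real parameter names
# live in the proxy's documentation.
import json

base_url = "https://api.prism.ssimplifi.com/v1"  # was api.openai.com/v1

headers = {
    "Authorization": "Bearer <your-prism-key>",
    "Content-Type": "application/json",
    "X-Prism-Mode": "balanced",    # illustrative: cost / balanced / quality
    "X-Prism-Session": "user-42",  # illustrative: enables session memory
}
body = json.dumps({
    "model": "auto",  # illustrative: let the proxy pick the model
    "messages": [{"role": "user", "content": "Summarise this paragraph"}],
})
print(f"POST {base_url}/chat/completions")
```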

What I'd tell myself six months ago

Stop overpaying for simple tasks. The model landscape has cheap, capable options for straightforward work. Reserve the expensive models for queries that actually need them. And if the engineering overhead of managing multiple providers is what's stopping you, use a proxy that handles it.

The AI API cost problem isn't about any single model being too expensive. It's about using the same model for everything when the tasks have wildly different complexity levels. Fix the routing and the savings follow.


I built Prism to solve this for myself and now it's available for any developer. Free tier available, no credit card required. Read the docs or get an API key.