Prism API

Prism is an OpenAI-compatible HTTP API for chat completions. It classifies each request as simple, code, reasoning, or complex, then routes it to the cheapest model capable of handling it across Anthropic, OpenAI, and Google. Session memory and provider failover are built in.

Base URL

https://api.prism.ssimplifi.com/v1

Quick reference

Auth: prism_sk_ bearer token
Compatibility: OpenAI Chat Completions API
Modes: eco (15%), balanced (20%), sport (30%)
Providers: Anthropic, OpenAI, Google
Streaming: SSE via stream: true
Free tier: 50K input / 10K output tokens/day

Quickstart

Get an API key from the signup page, then make your first call:

curl

curl https://api.prism.ssimplifi.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "X-Prism-Mode: eco" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "any",
    "messages": [{"role": "user", "content": "What is 2+2?"}]
  }'

Python

import requests

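response = requests.post(
    "https://api.prism.ssimplifi.com/v1/chat/completions",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "X-Prism-Mode": "balanced",
    },
    # json= serializes the body and sets Content-Type: application/json
    json={
        "model": "any",
        "messages": [{"role": "user", "content": "Explain quantum computing"}],
    },
    timeout=60,  # avoid hanging indefinitely on network problems
)
response.raise_for_status()  # surface 4xx/5xx errors before reading the body

data = response.json()
print(data["choices"][0]["message"]["content"])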

Authentication

All API requests require a Bearer token in the Authorization header. API keys start with prism_sk_.

Authorization: Bearer prism_sk_a1b2c3d4e5f6g7h8i9j0k1l2m3n4o5p6

Modes

Every request requires the X-Prism-Mode header. This controls the quality/cost tradeoff:

eco

Optimizes aggressively for cost. Simple tasks go cheap. Complex tasks still get capable models. 15% markup.

balanced

Best balance of quality and cost. Smart routing for every query type. 20% markup.

sport

Best model for every task. Quality first, cost second. 30% markup.

Prism classifies your query as simple, code, reasoning, or complex, then picks the optimal model for your mode. All modes maintain a quality floor — Prism never returns a bad answer to save money.
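As a rough illustration of what the markup percentages mean for a bill, the sketch below applies each mode's markup to a provider cost. It assumes the markup is applied multiplicatively to the underlying provider cost, which the text above does not state explicitly; MARKUP and billed_cost are hypothetical names, not part of the Prism API.

```python
from decimal import Decimal

# Markup per mode, taken from the mode descriptions above.
MARKUP = {
    "eco": Decimal("0.15"),
    "balanced": Decimal("0.20"),
    "sport": Decimal("0.30"),
}

def billed_cost(provider_cost_usd: str, mode: str) -> Decimal:
    """Assumed model: billed = provider cost * (1 + markup)."""
    if mode not in MARKUP:
        raise ValueError(f"unknown mode: {mode!r}")
    return Decimal(provider_cost_usd) * (1 + MARKUP[mode])
```

Decimal is used instead of float so that small per-request amounts do not pick up binary rounding noise.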

Chat Completions

POST /v1/chat/completions

OpenAI-compatible chat completion endpoint. Send the same request body you would send to OpenAI.

Headers

Authorization

string, required

Bearer token with your API key.

X-Prism-Mode

string, required

eco, balanced, or sport.

X-Prism-Model-Prefer

string

Pin a specific model. See Model Pinning.

X-Prism-Session

string

Session ID for conversation memory. See Session Memory.

Request body

model

string, required

Any value accepted. Prism selects the model based on mode and classification.

messages

array, required

Array of message objects with role (system/user/assistant) and content.

stream

boolean

Set to true for SSE streaming. Default false.

max_tokens

integer

Maximum tokens to generate. Default 4096.

temperature

number

Sampling temperature, 0 to 2. Default 1.
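The header and body fields above can be checked on the client before a request is sent. The following is a minimal sketch under that assumption; build_chat_request is a hypothetical helper, and the server remains the authority on what it accepts.

```python
def build_chat_request(messages, mode, *, stream=False, max_tokens=4096, temperature=1.0):
    """Return (headers, body) for POST /v1/chat/completions, minus Authorization."""
    if mode not in ("eco", "balanced", "sport"):
        raise ValueError("mode must be eco, balanced, or sport")
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")
    for message in messages:
        if message.get("role") not in ("system", "user", "assistant") or "content" not in message:
            raise ValueError(f"bad message: {message!r}")
    headers = {"X-Prism-Mode": mode, "Content-Type": "application/json"}
    body = {
        "model": "any",  # any value is accepted; Prism picks the model
        "messages": list(messages),
        "stream": stream,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return headers, body
```

The caller adds the Authorization header and passes the result to an HTTP client of its choice.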

Response Format

Responses follow the OpenAI chat completion format:

Response

{
  "id": "prism-a1b2c3d4",
  "object": "chat.completion",
  "created": 1712150400,
  "model": "claude-haiku-4.5",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The answer is 4."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 8,
    "total_tokens": 20
  }
}

Response Headers

Every response includes Prism-specific headers:

X-Prism-Model

string

The actual model used (e.g. claude-haiku, gpt-4o-mini, gemini-flash).

X-Prism-Cost

string

Total cost in USD (e.g. 0.000234).

X-Prism-Tokens-In

string

Input token count.

X-Prism-Tokens-Out

string

Output token count.

X-Prism-Task-Type

string

Classification result: simple, code, reasoning, or complex.

X-Prism-Failover

string

Present and set to "true" only if the request was rerouted to a different provider.

X-Prism-Cache-Status

string

One of hit-exact, hit-semantic, miss, bypass, error, or disabled. Exact hits served from the edge report hit-exact-edge instead (see Edge Routing).

X-Prism-Cache-Saved-Cents

string

USD cents saved on this hit (0 on miss).

X-Prism-Cache-Age-Seconds

string

Age of the cached entry in seconds (only on hits).

X-Prism-Cache-Similarity

string

Cosine similarity for hit-semantic only (e.g. 0.9831).

X-Prism-Feedback-Id

string

UUID for this request. POST it back to /v1/feedback with thumbs / rating / comment to attach feedback. See Feedback.
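Because every header value arrives as a string, a client will usually want to convert them once. The sketch below shows one way to do that for a subset of the headers listed above; PrismMeta and parse_prism_headers are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

@dataclass
class PrismMeta:
    model: Optional[str]
    task_type: Optional[str]
    cost_usd: Optional[float]
    tokens_in: Optional[int]
    tokens_out: Optional[int]
    failover: bool
    cache_status: Optional[str]
    feedback_id: Optional[str]

def parse_prism_headers(headers: Mapping[str, str]) -> PrismMeta:
    # HTTP header names are case-insensitive, so normalize before lookup.
    h = {k.lower(): v for k, v in headers.items()}

    def cast(name, fn):
        raw = h.get(name)
        return fn(raw) if raw is not None else None

    return PrismMeta(
        model=h.get("x-prism-model"),
        task_type=h.get("x-prism-task-type"),
        cost_usd=cast("x-prism-cost", float),
        tokens_in=cast("x-prism-tokens-in", int),
        tokens_out=cast("x-prism-tokens-out", int),
        # The header is only present when a failover happened.
        failover=h.get("x-prism-failover") == "true",
        cache_status=h.get("x-prism-cache-status"),
        feedback_id=h.get("x-prism-feedback-id"),
    )
```

With the requests library, parse_prism_headers(response.headers) works directly, since response.headers is a mapping of strings.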

Streaming

Set stream: true in the request body to receive Server-Sent Events. Chunks follow the OpenAI delta format:

SSE stream

data: {"id":"prism-xxx","object":"chat.completion.chunk","choices":[{"delta":{"content":"Hello"},"index":0}]}

data: {"id":"prism-xxx","object":"chat.completion.chunk","choices":[{"delta":{"content":" world"},"index":0}]}

data: {"id":"prism-xxx","object":"chat.completion.chunk","choices":[{"delta":{},"finish_reason":"stop","index":0}],"usage":{"prompt_tokens":12,"completion_tokens":5,"total_tokens":17}}

data: [DONE]

The final chunk includes usage data. Prism headers (X-Prism-Model, X-Prism-Task-Type) are in the HTTP response headers.
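A client consuming this stream only needs to read the data: lines, join the delta content, and stop at [DONE]. The sketch below works on any iterable of text lines; collect_stream is a hypothetical helper, and treating any chunk with a top-level "error" key as a failure is an assumption based on the stream_error chunk described under Reliability.

```python
import json
from typing import Iterable, Optional, Tuple

def collect_stream(lines: Iterable[str]) -> Tuple[str, Optional[dict]]:
    """Join delta content from SSE lines; return (text, usage)."""
    parts, usage = [], None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        if "error" in chunk:
            # Assumed shape: {"error": {"type": "stream_error", ...}}
            raise RuntimeError(chunk["error"].get("type", "stream_error"))
        for choice in chunk.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
        usage = chunk.get("usage", usage)
    return "".join(parts), usage
```

With the requests library, the lines could come from response.iter_lines(decode_unicode=True) on a request made with stream=True.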

Caching

Prism caches responses automatically. No opt-in, no header to enable it — caching is free for every tier and engages the moment you send traffic. Two layers run in series: an exact-match layer (Redis SHA-256 fingerprint of the request) and a semantic-match layer (cosine similarity over an embedding of the user message).

Inspect cache behavior on any response via the X-Prism-Cache-* headers. Disable caching for a single request with X-Prism-Cache: off.

Free / Paid: 1-hour TTL, 0.95 similarity threshold, key-scoped (per-API-key namespace). Pro tunes TTL (60s–30d), threshold (0.70–0.99), and scope (project-level coming v1.3), plus the cache inspector for browsing and manual eviction.
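To make the two layers concrete, here is a small sketch of an exact-match fingerprint and a semantic-match check. It is not Prism's implementation: which request fields are hashed, how the body is canonicalized, and which embedding model is used are not specified above, so the canonical JSON form and the function names here are assumptions.

```python
import hashlib
import json
import math

def fingerprint(body: dict) -> str:
    """SHA-256 over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def is_semantic_hit(query_vec, cached_vec, threshold: float = 0.95) -> bool:
    # 0.95 is the default threshold quoted above for Free and Paid.
    return cosine(query_vec, cached_vec) >= threshold
```

The exact layer is checked first because a hash lookup is far cheaper than computing an embedding and comparing vectors.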

Request Tags

Attach arbitrary key/value tags to a request via the X-Prism-Tags header. Tags persist on the request log and drive per-feature cost attribution on the Usage dashboard.

Tag a request by feature

curl https://api.prism.ssimplifi.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "X-Prism-Mode: balanced" \
  -H "X-Prism-Tags: feature=onboarding,team=growth" \
  -H "Content-Type: application/json" \
  -d '{"model":"any","messages":[{"role":"user","content":"..."}]}'

Format: key=value,key=value. Whitespace is trimmed; malformed pairs are dropped silently (we never 400 a real completion over tag parsing).

Hard caps per request: max 10 keys, 64 chars per key/value. Anything over is truncated or skipped.

Capturing tags works on all tiers. The dashboard view (cost-per-feature breakdown) unlocks on Pro and Team.
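Since malformed or oversized tags are dropped or truncated without an error, it can help to build the header on the client so nothing is lost silently. The sketch below is one conservative way to do it; format_tags is a hypothetical helper, and skipping values that contain "=" is an extra precaution of this sketch, not a documented rule.

```python
MAX_TAGS = 10   # max keys per request, from the caps above
MAX_LEN = 64    # max characters per key or value

def format_tags(tags: dict) -> str:
    """Build an X-Prism-Tags value, skipping pairs the server would mangle."""
    pairs = []
    for key, value in tags.items():
        key, value = str(key).strip(), str(value).strip()
        if not key or not value:
            continue
        # "," and "=" are the separators, so they cannot appear inside a pair.
        if any(sep in key or sep in value for sep in (",", "=")):
            continue
        if len(key) > MAX_LEN or len(value) > MAX_LEN:
            continue
        pairs.append(f"{key}={value}")
        if len(pairs) == MAX_TAGS:
            break
    return ",".join(pairs)
```

For example, format_tags({"feature": "onboarding", "team": "growth"}) produces the header value used in the curl example above.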

Feedback

Every /v1/chat/completions response includes an X-Prism-Feedback-Id response header (a UUID). POST it to /v1/feedback to attach thumbs / rating / comment / tag.

POST /v1/feedback

curl https://api.prism.ssimplifi.com/v1/feedback \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "feedback_id": "<uuid from X-Prism-Feedback-Id>",
    "thumbs": 1,
    "rating": 5,
    "comment": "Helpful response",
    "tag": "factually-correct"
  }'

Body fields (all optional except feedback_id): thumbs ∈ {-1, 0, 1}, rating ∈ 1–5, comment (≤4000 chars), tag (≤200 chars, free-form).

UPSERT semantics: latest non-null field wins per feedback_id. Send a thumb first, a comment later — both stick.

Unknown feedback_id still returns 200 (no link to a request log, but the feedback is captured).

Aggregates view (thumbs split, rating histogram, recent comments) is on /dashboard/usage → Feedback. Available on all tiers — you can only see your own data anyway.
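The field limits and the "latest non-null field wins" rule can be sketched as two small functions. This is an illustration of the documented semantics, not the server code; validate_feedback and merge_feedback are hypothetical names.

```python
FIELDS = ("thumbs", "rating", "comment", "tag")

def validate_feedback(payload: dict) -> None:
    """Raise ValueError if the payload breaks a documented limit."""
    if not payload.get("feedback_id"):
        raise ValueError("feedback_id is required")
    if payload.get("thumbs") is not None and payload["thumbs"] not in (-1, 0, 1):
        raise ValueError("thumbs must be -1, 0, or 1")
    if payload.get("rating") is not None and not 1 <= payload["rating"] <= 5:
        raise ValueError("rating must be between 1 and 5")
    if len(payload.get("comment") or "") > 4000:
        raise ValueError("comment is limited to 4000 characters")
    if len(payload.get("tag") or "") > 200:
        raise ValueError("tag is limited to 200 characters")

def merge_feedback(stored: dict, update: dict) -> dict:
    """UPSERT: a non-null field in the update replaces the stored one."""
    merged = dict(stored)
    for field in FIELDS:
        if update.get(field) is not None:
            merged[field] = update[field]
    return merged
```

Sending {"thumbs": 1} first and {"comment": "Helpful"} later therefore leaves both fields set on the same feedback_id.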

Policy + Governance

Per-project routing rules + monthly budget caps + an append-only audit log. Configured at /dashboard/policy (Pro/Team only). The hot path enforces in <5 ms; every firing is captured for compliance review.

Routing rules — error.policy_rule

When a rule blocks a request, Prism returns HTTP 403 with a structured envelope. Branch on error.rule to handle each kind gracefully.

403 — denied_mode example

{
  "error": {
    "type": "policy_rule",
    "rule": "denied_mode",
    "message": "Mode 'eco' is denied by project policy",
    "denied_value": "eco",
    "policy_url": "/dashboard/policy"
  }
}

Possible error.rule values:

denied_mode — the request specified a mode in the deny list. Body includes denied_value.
denied_model — the resolved model is in the deny list. Body includes denied_value (the resolved Prism model name).
max_input_tokens — estimated input tokens (4 chars ≈ 1 token) exceeded the cap. Body includes max_input_tokens and estimated_tokens.

Force-model-by-task is the one rule that does not 403 — it silently overrides the model your request would have used for a given task type (simple / code / reasoning / complex). The override surfaces in the response as the actual X-Prism-Model header and lands on the usage log row.
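Branching on error.rule might look like the sketch below. It takes an HTTP status and a parsed JSON body and returns a short explanation for the three blocking rules listed above; explain_policy_block is a hypothetical helper, and the suggested remedies in the messages are this sketch's own wording.

```python
from typing import Optional

def explain_policy_block(status: int, body: dict) -> Optional[str]:
    """Return a readable reason for a policy 403, or None for anything else."""
    error = body.get("error", {})
    if status != 403 or error.get("type") != "policy_rule":
        return None
    rule = error.get("rule")
    if rule == "denied_mode":
        return f"mode {error.get('denied_value')!r} is denied; try another X-Prism-Mode"
    if rule == "denied_model":
        return f"model {error.get('denied_value')!r} is denied; drop or change the pin"
    if rule == "max_input_tokens":
        return (f"input too long: ~{error.get('estimated_tokens')} tokens, "
                f"cap is {error.get('max_input_tokens')}")
    return f"blocked by unrecognized rule {rule!r}"
```

Checking error.type first matters because other 403 responses, such as free_tier_limit, use a different type and carry no rule field.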

Budget caps — error.budget_exceeded

When a project would exceed its monthly USD cap, Prism returns HTTP 402 Payment Required. Soft warns (default 80%) email the project owner once per calendar month but never block.

402 — budget_exceeded example

{
  "error": {
    "type": "budget_exceeded",
    "message": "Project would exceed monthly cap of $50.00 (current $49.87, this request est. $0.18)",
    "monthly_cap_usd": 50.00,
    "current_spend_usd": 49.87,
    "policy_url": "/dashboard/policy"
  }
}

Pre-bill estimate uses max_tokens × output price + tokens_in × input price with a 10% safety margin. Actual usage is usually lower.
Cache hits cost $0 and are never blocked on budget grounds — caching is exactly how you stay under cap.
Mid-stream requests are never killed. Only NEW requests after the cap fires get blocked.
Disable hard-block (alert-only mode) on /dashboard/policy if you want visibility without lockout risk.
The reconciliation job runs daily at 02:00 UTC and corrects any Redis drift against the authoritative usage_logs SUM.
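The pre-bill estimate and the cap check can be written out as a short calculation. The sketch assumes prices are quoted in USD per million tokens and that the 10% margin applies to the whole sum; neither detail is stated above, and the function names are hypothetical.

```python
def prebill_estimate(tokens_in: int, max_tokens: int,
                     input_price_per_mtok: float, output_price_per_mtok: float,
                     margin: float = 0.10) -> float:
    """Worst-case cost in USD: assumes the full max_tokens is generated."""
    base = (max_tokens * output_price_per_mtok
            + tokens_in * input_price_per_mtok) / 1_000_000
    return base * (1 + margin)

def would_exceed(current_spend_usd: float, estimate_usd: float, cap_usd: float) -> bool:
    """True when the request would push the project over its monthly cap."""
    return current_spend_usd + estimate_usd > cap_usd
```

With the figures from the 402 example, would_exceed(49.87, 0.18, 50.00) is true, which is why that request is blocked. Lowering max_tokens shrinks the estimate and can let a request through near the cap.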

Audit log

Every rule change and every enforcement firing is recorded in policy_audit_log with actor, before/after, and details. View on /dashboard/usage → Audit. Retention: Pro 30 days, Team 365 days.

Reliability

The contract: your request is retried, your stream closes cleanly, you get a response or a structured error. No silent hangs, no SSE that never terminates, no "the provider went down and your app froze for 30 seconds." Live health visible at the top of /dashboard.

Failover — X-Prism-Failover

When the primary provider for your routed model 5xxs or times out (30s non-streaming, 10s to first-token for streaming), Prism retries once on the same provider, then walks the fallback chain in order. The response includes X-Prism-Failover: true when this happened. Health observations from the failed attempts feed a rolling 5-minute window — the next request from any caller routes around the unhealthy provider automatically.
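The retry-then-fallback order described above can be sketched as a loop over the provider chain. This is a simplified model, not Prism's router: it assumes fallback providers get a single attempt each (the text only says the primary is retried once), and ProviderError, call_with_failover, and the send callback are hypothetical.

```python
class ProviderError(Exception):
    """Stands in for a 5xx response or a timeout from a provider."""

def call_with_failover(chain, send):
    """Try providers in order; return (response, failover_used).

    chain: provider names in priority order, primary first.
    send:  callable taking a provider name; returns a response or raises ProviderError.
    """
    errors = []
    for index, provider in enumerate(chain):
        attempts = 2 if index == 0 else 1  # the primary is retried once
        for _ in range(attempts):
            try:
                return send(provider), index > 0
            except ProviderError as exc:
                errors.append(f"{provider}: {exc}")
    raise RuntimeError("All providers failed: " + "; ".join(errors))
```

The boolean in the return value corresponds to the X-Prism-Failover header: it is true only when a provider other than the primary produced the response.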

Streaming mid-stream drops

If the upstream provider drops the connection AFTER first-token has flowed (network glitch, provider OOM, etc.), Prism does NOT retry mid-stream — that would corrupt the SSE you're already consuming. Instead the stream emits a final data: {"error": {"type": "stream_error", ...}} followed by data: [DONE], your client sees a clean close, and the failure is recorded in the provider's health window so the NEXT request routes around.

Speculative parallel routing — X-Prism-Speculative (Pro/Team, sport mode)

On X-Prism-Mode: sport, Pro and Team accounts fire the primary AND first healthy fallback in parallel and return whichever responds first. The loser is cancelled. The response includes X-Prism-Speculative: true when this happens. Trade: ~1.3x average token cost on the provider side (loser keeps generating until cancel propagates) for hedged p99 latency + immunity to single-provider degradation. Streaming requests stay serial in v1.5 — hedging two SSE streams is messy. We absorb the loser's token cost; you're only billed for the winner.
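The "fire both, keep the first" pattern is a standard hedged request. A minimal asyncio sketch of the idea follows; hedge is a hypothetical name, and returning the first attempt that finishes without an exception (rather than simply the first to finish) is a choice made for this sketch.

```python
import asyncio

async def hedge(*coros):
    """Run all attempts concurrently; return the first successful result."""
    pending = {asyncio.ensure_future(c) for c in coros}
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task.exception() is None:
                    return task.result()
        raise RuntimeError("all attempts failed")
    finally:
        # Cancel the losers so they stop consuming resources.
        for task in pending:
            task.cancel()
```

Cancellation is not instantaneous on a real provider connection, which is the source of the extra provider-side token cost mentioned above.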

When all providers fail

You get a clean HTTP 502 with {"error": {"type": "provider_error", "message": "All providers failed: ..."}}. This is the only failure mode that escapes the retry/failover loop — and it requires Anthropic, OpenAI, AND Google to all be down for your routed model class at the same moment. Hasn't happened to us since v1.0 launch.

Edge Routing

api.prism.ssimplifi.com is fronted by a Cloudflare Worker. Requests hit the customer's nearest Cloudflare PoP first; the worker handles three things there before deciding whether to forward to Mumbai:

Malformed Authorization headers + unknown API keys → 401 from the edge. The request never reaches Mumbai. Saves ~250ms round-trip on every garbage request / opportunistic scan.
Exact cache lookup happens at the edge. If a previous identical request from the same project is in the cache, the response is served from the edge with X-Prism-Edge-Cache: hit.
Everything else passes through to Mumbai unchanged. Cache misses, streaming, all the v1.4 policy + budget enforcement, the v1.5 router hardening — all of it still happens in Mumbai. The worker is a front door, not a re-implementation.

Response headers

X-Prism-Edge-Cache — hit when served from edge cache, passthrough when forwarded to Mumbai, auth-reject when the edge 401'd the request.
X-Prism-Edge-Region — IATA airport code of the PoP that handled the request (e.g. SIN for Singapore, SFO for San Francisco, FRA for Frankfurt). Useful for debugging routing in multi-region deployments.
X-Prism-Cache-Status: hit-exact-edge on cache hits served from the edge — distinguishes from Mumbai-side hit-exact.

What this means for latency

International customers benefit most. Without the edge layer, every request from San Francisco, London, or Sydney pays a ~600ms round-trip to Mumbai before the response even starts coming back. With the edge:

Bad keys / malformed headers: 401 from the edge, no Mumbai round-trip. Saves ~250ms on every such request.
Cache hits: served from the edge in ~300-500ms anywhere in the world. Still bound by one Mumbai-to-edge hop for the underlying Upstash Redis lookup (single-region today). A future optimization replicates the cache to Cloudflare KV which would get this to ~30-50ms.
Cache misses + streaming + everything else: ~10-30ms of edge overhead on top of normal Mumbai latency. Negligible for non-Indian customers; small tax for Indian ones.

Cache freshness + revocation lag

The edge reads the exact same Redis cache Mumbai writes to, so deleting a cached entry from /dashboard/cache propagates to the edge on the very next read. API key revocations have a 60-second lag at the edge — the worker caches the auth lookup for 60s to avoid pummeling Supabase. During those 60s a revoked key may auth at the edge, but Mumbai re-validates and rejects, so security is unaffected; the request just gets a slightly slower 401.

Session Memory

Add the X-Prism-Session header with any unique string to enable conversation memory. Prism stores message history in Redis and automatically includes prior context in subsequent calls.

Using sessions

# First message: creates the session
curl https://api.prism.ssimplifi.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "X-Prism-Mode: balanced" \
  -H "X-Prism-Session: my-session-123" \
  -H "Content-Type: application/json" \
  -d '{"model":"any","messages":[{"role":"user","content":"My name is Ravi"}]}'

# Second message: Prism includes the history automatically
curl https://api.prism.ssimplifi.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "X-Prism-Mode: balanced" \
  -H "X-Prism-Session: my-session-123" \
  -H "Content-Type: application/json" \
  -d '{"model":"any","messages":[{"role":"user","content":"What is my name?"}]}'
# Response: "Your name is Ravi."

Sessions expire after 30 minutes of inactivity. Each access refreshes the TTL.

Maximum 100 messages per session. At 90 messages, older messages are summarized to free space.

System messages persist automatically across the session without re-sending.

Use GET /v1/sessions/{id} to check session info and DELETE /v1/sessions/{id} to clear it.
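The sliding expiry described above (30 minutes of inactivity, refreshed on each access) behaves like the in-memory sketch below. Prism keeps sessions in Redis; this class only models the timing rule, and SessionStore and its injectable clock are hypothetical.

```python
import time

class SessionStore:
    TTL_SECONDS = 30 * 60  # 30 minutes of inactivity

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # session_id -> (expires_at, messages)

    def get(self, session_id):
        """Return the message list, refreshing the TTL; [] if expired or unknown."""
        entry = self._data.get(session_id)
        now = self._clock()
        if entry is None or entry[0] <= now:
            self._data.pop(session_id, None)
            return []
        self._data[session_id] = (now + self.TTL_SECONDS, entry[1])
        return entry[1]

    def append(self, session_id, message):
        messages = self.get(session_id)
        messages.append(message)
        self._data[session_id] = (self._clock() + self.TTL_SECONDS, messages)
```

A session touched every 29 minutes never expires, while one left alone for more than 30 minutes starts over with no history.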

Model Pinning

Override Prism's routing by setting the X-Prism-Model-Prefer header to a specific model:

Model name	Provider	Actual model
claude-opus	Anthropic	claude-opus-4-6
claude-sonnet	Anthropic	claude-sonnet-4-6
claude-haiku	Anthropic	claude-haiku-4-5-20251001
gpt-4o	OpenAI	gpt-4o
gpt-4o-mini	OpenAI	gpt-4o-mini
gemini-pro	Google	gemini-2.5-pro
gemini-flash	Google	gemini-2.5-flash

Free Tier

Free accounts get daily access with these limits:

Daily input tokens

50,000

Resets at midnight UTC.

Daily output tokens

10,000

Resets at midnight UTC.

Max tokens per request

4,000

max_tokens cannot exceed 4,000.

Modes

eco only

Balanced and sport modes require a paid account.

Streaming

disabled

Free tier does not support streaming.

Session messages

5 max

Each session limited to 5 messages on the free tier.
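A free-tier client can check these limits locally before sending a request and avoid a 403. The sketch below covers the per-request limits only (daily token totals need server-side counters); free_tier_violations is a hypothetical helper, and counting the session limit as "already holds 5 messages" is an assumption about how the cap is applied.

```python
def free_tier_violations(mode: str, body: dict, session_messages: int = 0) -> list:
    """Return a list of free-tier limits this request would break."""
    problems = []
    if mode != "eco":
        problems.append("mode: free tier is eco only")
    if body.get("stream"):
        problems.append("stream: streaming is disabled on the free tier")
    # Only an explicitly provided max_tokens is checked here.
    if body.get("max_tokens", 0) > 4000:
        problems.append("max_tokens: cannot exceed 4,000")
    if session_messages >= 5:
        problems.append("session: limited to 5 messages")
    return problems
```

An empty list means the request passes these local checks; the daily input and output token limits can still reject it.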

All Endpoints

Method	Path	Description
POST	/v1/chat/completions	Send a chat completion request
POST	/v1/feedback	Attach feedback to a completion by feedback_id
GET	/v1/sessions/{id}	Get session info (message count, tokens, created_at)
DELETE	/v1/sessions/{id}	Delete a session
GET	/v1/usage	List usage logs with filters (from, to, mode, limit, offset)
GET	/v1/usage/summary	Aggregated usage summary with cost comparison
GET	/v1/usage/export	Export usage logs as CSV (from, to params)
GET	/v1/balance	Get current balance, tier, and auto-topup status
POST	/v1/keys	Create a new API key
GET	/v1/keys	List active API keys
DELETE	/v1/keys/{id}	Revoke an API key

All endpoints except sessions require API key authentication.

Error Codes

Errors return a consistent JSON format:

Error response

{
  "error": {
    "type": "insufficient_balance",
    "message": "Insufficient balance. Please top up your account."
  }
}

Status	Type	Description
400	invalid_request	Malformed request body or invalid parameters
400	missing_mode_header	X-Prism-Mode header not provided
401	invalid_api_key	Missing, invalid, or revoked API key
402	insufficient_balance	Account balance too low for estimated cost
403	free_tier_limit	Request exceeds free tier limits
429	rate_limited	Too many requests. Check Retry-After header
502	provider_error	All AI providers failed after failover attempts
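For client code, the main decision these codes drive is whether to retry and how long to wait. The sketch below is one possible policy: it treats 429 and 502 as retryable, honors a numeric Retry-After on 429, and otherwise backs off exponentially. The choice to retry 502, the backoff constants, and the name retry_delay are assumptions of this sketch.

```python
from typing import Mapping, Optional

RETRYABLE = {429, 502}

def retry_delay(status: int, headers: Mapping[str, str], attempt: int,
                base: float = 1.0, cap: float = 30.0) -> Optional[float]:
    """Seconds to wait before retrying, or None if the error is not retryable."""
    if status not in RETRYABLE:
        return None  # 400/401/402/403 will not succeed on a plain retry
    retry_after = headers.get("Retry-After")
    if status == 429 and retry_after is not None:
        try:
            return min(float(retry_after), cap)
        except ValueError:
            pass  # the HTTP-date form of Retry-After is not handled here
    return min(base * 2 ** attempt, cap)
```

attempt starts at 0 for the first retry. Pass a case-insensitive mapping such as response.headers from requests, since a plain dict lookup for "Retry-After" is case-sensitive.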

Have questions?

See the Prism FAQ for answers to common developer questions, or email hello@ssimplifi.com.