LLM Cost Control in Production: Multi-Level Caching for AI Products
There's a moment every AI product builder hits, usually around week three of production traffic, where the OpenAI dashboard stops being exciting and starts being alarming. The spend curve is vertical. The unit economics don't work. And the painful realisation sets in that "call the model" is not a cost model — it's a billing surprise waiting for enough users.
I've built three AI-backed products — TaxLens (income tax from bank statements), TrustRail (BNPL underwriting), and BuffByte (AI content optimisation for creators) — and across all three I've had to think carefully about where model calls happen, how often they happen, and how to avoid paying for the same computation twice.
This article is the distillation of those decisions: the specific layers I use, the conditions under which each one applies, and an honest evaluation of what each buys you.
Let's dig in.
Why Caching Is Different for AI Products
In a traditional API, caching is mostly about latency: you cache to serve responses faster. (It can be about cost too, since time costs money.)
In an AI product, caching is primarily about monetary cost: you cache to avoid paying OpenAI for a response you've already paid for.
This distinction changes the calculus. For a fast traditional API response (~20ms), you might cache results that are only marginally expensive to recompute. For an LLM call ($0.06–0.15 per request, 2–5 seconds), you cache aggressively because the cost of a cache miss is substantial.
It also changes the invalidation strategy. Traditional caches invalidate on data changes. LLM response caches must also account for prompt changes, model version changes, and the inherent non-determinism of model outputs. A cache that returns a stale LLM response because the underlying data changed and the cache key didn't capture it is a silent correctness bug.
The design question is not "should I cache?" — you should. The design question is "at which level do I cache, with which key, and with which TTL?"
Level 1: The Gate as a Pre-Computation Firewall
The first and most impactful caching decision isn't a response cache at all — it's the gate pattern described in an earlier article in this series. But it belongs here because its primary function is cost avoidance.
TaxLens's runPipeline runs a cheap gate model before the expensive analysis model:
// Tier 1: cheap model (~$0.002)
const gate = await llmClient.structured({
  tier: 'gate',
  model: env.OPENAI_GATE_MODEL,
  // ...
});
if (!gate.data.valid) {
  // Pipeline terminates. Analysis model never fires.
  return;
}
// Tier 2: expensive model (~$0.06–0.08)
const analysis = await llmClient.structured({
  tier: 'analysis',
  model: env.OPENAI_ANALYSIS_MODEL,
  // ...
});
At a 15% invalid-document rejection rate, this saves ~12% on total model spend with no change in output quality for valid documents: the 15% of analysis calls avoided, minus the gate's own cost of roughly 3% of an analysis call. At a 30% rejection rate, which is plausible for a consumer product where users experiment with non-bank-statement PDFs, it saves ~27%.
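The arithmetic behind those percentages can be sketched as follows. This is an illustrative calculation, not TaxLens code: `gateSavingsFraction` and its field names are hypothetical, and the $0.07 analysis cost is simply the midpoint of the range quoted in the snippet above.

```typescript
// Hypothetical helper: expected spend per submission with and without a gate.
interface GateCosts {
  gateCost: number;      // cost of one gate call, in USD
  analysisCost: number;  // cost of one analysis call, in USD
  rejectionRate: number; // fraction of submissions the gate rejects (0..1)
}

function gateSavingsFraction({ gateCost, analysisCost, rejectionRate }: GateCosts): number {
  // Without a gate, every submission pays for the analysis call.
  const withoutGate = analysisCost;
  // With a gate, every submission pays for the gate; only accepted ones pay for analysis.
  const withGate = gateCost + (1 - rejectionRate) * analysisCost;
  return (withoutGate - withGate) / withoutGate;
}

// Assumed inputs: gate ~$0.002, analysis ~$0.07 (midpoint of $0.06–0.08).
const savingAt15 = gateSavingsFraction({ gateCost: 0.002, analysisCost: 0.07, rejectionRate: 0.15 });
const savingAt30 = gateSavingsFraction({ gateCost: 0.002, analysisCost: 0.07, rejectionRate: 0.3 });
console.log(`${(savingAt15 * 100).toFixed(1)}% / ${(savingAt30 * 100).toFixed(1)}%`);
```

The formula reduces to the rejection rate minus the ratio of gate cost to analysis cost, so the gate pays for itself whenever the rejection rate exceeds that ratio (about 3% under these assumptions).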
The gate also prevents one category of abuse: a user uploading documents in a rapid loop to test the system would trigger a gate failure on non-bank-statement documents without ever burning expensive analysis credits. This isn't a complete abuse prevention system, but it's a meaningful cost control at minimal engineering cost.
Cache key consideration: The gate result is not cached by document content — PDFs are large, content-hashing is expensive, and the same user might legitimately re-upload after fixing a problem. The gate is a per-submission cost, not a per-unique-document cost. For the TaxLens use case (one bank statement per tax process), this is the right model.
Level 2: Exact-Match Response Caching
For queries where the same input is likely to produce the same useful output, exact-match response caching is the most straightforward win.
The Ajala AI SDK (my open-source multi-provider AI integration library) implements this as a first-class feature. When cachable: true is set, the SDK computes a cache key from the hash of system prompt + user prompt + provider + model and returns the cached response if a matching entry exists:
const result = await ai.prompt('Get weather in {{CITY}}', {
  expectJson: true,
  jsonStructure: { temp: 'number', condition: 'string' },
  validateJSON: true,
  cachable: true,
}, { CITY: 'Lagos' });
If this exact prompt, with this exact variable substitution, against this exact model, has been called within the TTL window, the SDK returns the cached response. No API call. No tokens consumed.
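For readers who want to see the mechanism, here is a minimal sketch of an exact-match cache keyed on system prompt + user prompt + provider + model. It is not the Ajala SDK's implementation: `cacheKey`, `TtlCache`, and `cachedCall` are hypothetical names, the store is an in-memory `Map`, and SHA-256 via Node's `crypto` module is an assumed choice of hash.

```typescript
import { createHash } from "node:crypto";

interface CacheKeyParts {
  system: string;
  user: string;     // the prompt after variable substitution
  provider: string;
  model: string;
}

// JSON.stringify on an array keeps field boundaries unambiguous,
// so ("ab", "c") and ("a", "bc") never produce the same key.
function cacheKey(p: CacheKeyParts): string {
  const payload = JSON.stringify([p.system, p.user, p.provider, p.model]);
  return createHash("sha256").update(payload).digest("hex");
}

class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  // `now` is injectable so expiry can be tested without waiting.
  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: treat as a miss
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Wraps any model call: on a hit, `callModel` is never invoked.
async function cachedCall<V>(
  cache: TtlCache<V>,
  parts: CacheKeyParts,
  callModel: () => Promise<V>,
): Promise<V> {
  const key = cacheKey(parts);
  const cached = cache.get(key);
  if (cached !== undefined) return cached;
  const fresh = await callModel();
  cache.set(key, fresh);
  return fresh;
}
```

Because the model name and the full prompt text are part of the key, changing either one produces a different key, so stale entries are never served; they simply age out with the TTL.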
Where This Works Well
Classification tasks. BuffByte's content analyser classifies creator content against a taxonomy of topics, tones, and engagement patterns. A specific YouTube title or hook will receive identical analysis regardless of how many users trigger the analysis. Exact-match caching means the second user to submit the same content gets an instant response at zero incremental cost.
Reference data queries. "What are the NTA 2025 income tax bands?" has a stable answer that doesn't change within a tax year. The TaxLens chat system could cache this kind of reference query and serve it from cache for the duration of the tax year.
Validation queries. The gate call in TaxLens asks "is this document a valid Nigerian bank statement?" The answer for a given document doesn't change between users. If two users upload the same bank statement file (rare, but possible in a shared-access product), exact-match caching would serve the second validation instantly.
Where Exact-Match Fails
User-specific queries. "Why is my tax liability ₦240,000?" is specific to the user's computation object. Two users with different incomes will have different context in their prompts even if the question is identical. Exact-match caching on the full prompt + context is unlikely to produce cache hits because the context varies per user.
Long prompts. If the cache key is a hash of a 20KB prompt (a bank statement CSV embedded in the prompt), the hash computation itself is cheap but the cache may be too specific to produce hits. Segment the cache key: hash the static parts (system prompt, model) separately from the dynamic parts (user content), and cache based on the static + a meaningful key from the dynamic part.
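One way to segment the key is sketched below. Everything here is an assumption for illustration: `segmentedKey` and `normaliseTitle` are hypothetical helpers, and the normalisation rule (trim, lowercase, collapse whitespace) is just one example of deriving a "meaningful key" from the dynamic part. What counts as "the same input" is a product decision.

```typescript
import { createHash } from "node:crypto";

const sha256 = (s: string): string => createHash("sha256").update(s).digest("hex");

// Hypothetical normaliser for a short dynamic input such as a video title.
function normaliseTitle(title: string): string {
  return title.trim().toLowerCase().replace(/\s+/g, " ");
}

// The static half (system prompt + model) can be hashed once at startup;
// the dynamic half is hashed per request from the normalised input.
function segmentedKey(system: string, model: string, dynamicKey: string): string {
  const staticHash = sha256(JSON.stringify([system, model]));
  const dynamicHash = sha256(dynamicKey);
  return `${staticHash}:${dynamicHash}`;
}

const SYSTEM_PROMPT = "Classify the title against the topic taxonomy."; // placeholder prompt
const first = segmentedKey(SYSTEM_PROMPT, "example-model", normaliseTitle("  10 Editing Tricks  "));
const second = segmentedKey(SYSTEM_PROMPT, "example-model", normaliseTitle("10 editing   tricks"));
console.log(first === second); // cosmetic differences map to the same entry
```

The trade-off is that normalisation deliberately treats slightly different inputs as identical, so it only suits tasks where those differences cannot change the correct answer.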
Level 3: Process-Level Response Reuse (Conversation Threading)
TaxLens's pipeline makes two sequential model calls — gate, then analysis. Both calls process the same PDF. Without any mechanism to avoid redundancy, the PDF is sent to the model twice.
OpenAI's Responses API supports previous_response_id: a new call can continue from a previous response, and the API server reuses the cached prior context. TaxLens uses this directly:
// Tier 2 continues from Tier 1's response
const analysis = await llmClient.structured({
  tier: 'analysis',
  previousResponseId: gate.responseId, // ← continues the conversation
  // No PDF attached: the model has it from the prior turn
  // ...
});
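When previous_response_id is set, the model's context from the prior response is reused server-side. The PDF doesn't need to be re-sent, and the request for the second call carries only the new user prompt and system prompt. How the carried-over context is billed depends on the provider (prior-turn tokens can still be counted as input, often at the cached rate), so confirm the saving against recorded inputTokens rather than assuming it.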
The chat tier chains further:
// Chat questions continue from the analysis response
const chat = await llmClient.structured({
  tier: 'chat',
  previousResponseId: process.analysisResponseId,
  user: `${context}\n\nQUESTION: ${question}`,
  // ...
});
Each chat question pays for: the system prompt, the context block (the computed tax figures — a few hundred tokens), and the user's question. It does not re-pay for the bank statement content, the gate verdict, or the analysis output — those are in the cached prior response context.
For a user who asks 10 follow-up questions, this difference is substantial. Without threading, each question re-sends the full context. With threading, the marginal cost per question is just the question tokens and the answer tokens.
Condition: This only works with providers that support server-side response caching and conversation threading. OpenAI's Responses API supports it. Not all providers do. When building with Ajala, check provider capabilities before designing around this pattern.
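A capability check of this kind might look like the sketch below. The `ProviderCapabilities` and `ChatRequest` shapes and the `buildChatRequest` helper are hypothetical; the article does not show the real `llmClient` request type or how Ajala reports provider capabilities.

```typescript
// Hypothetical shapes, for illustration only.
interface ProviderCapabilities {
  supportsResponseThreading: boolean;
}

interface ChatRequest {
  tier: "chat";
  user: string;
  previousResponseId?: string;
}

function buildChatRequest(
  caps: ProviderCapabilities,
  analysisResponseId: string | undefined,
  context: string,
  question: string,
): ChatRequest {
  const request: ChatRequest = {
    tier: "chat",
    // The computed context is always included, so the call still works unthreaded.
    user: `${context}\n\nQUESTION: ${question}`,
  };
  // Thread only when the provider supports it and a prior response exists.
  if (caps.supportsResponseThreading && analysisResponseId) {
    request.previousResponseId = analysisResponseId;
  }
  return request;
}
```

Keeping the context in every user turn means threading stays an optimisation: if the provider lacks the feature, or the prior response is no longer available, the request is still self-sufficient.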
Level 4: Prompt Structure for Provider-Side Cache Hits
Both Anthropic and OpenAI offer server-side prompt caching: if a call's prompt shares a long prefix with a recent call to the same model, the provider charges a reduced rate (often 50–90% less) for the cached prefix tokens.
The key constraint: the cached prefix must be identical across calls. Variable content must come after the stable prefix.
This changes how you structure prompts. The wrong structure:
// Bad: user-specific content mixed into the prefix
const system = `You are TaxLens. The user's gross income is ₦${grossAnnualKobo}.
Answer only personal income tax questions under the NTA 2025.`;
Every user has a different grossAnnualKobo. Every call has a different system prompt. No provider-side cache hits.
The right structure:
// Good: stable instructions as prefix, variable context in the user turn
const system = `You are TaxLens, a grounded assistant for Nigerian PERSONAL income tax
under the Nigeria Tax Act 2025 (NTA 2025), effective 1 January 2026.
HARD RULES:
- Answer ONLY personal income tax questions under the NTA 2025...
- You may ONLY explain the computed numbers provided to you in the CONTEXT below...
- Every substantive answer MUST cite the relevant NTA 2025 section...`;
// Variable context goes in the user turn
const user = `CONTEXT — the only numbers you may discuss:\n${JSON.stringify(computation)}
\n\nQUESTION: ${question}`;
The system prompt is now identical across every chat call from every user. The provider caches it after the first call. Subsequent calls from any user hit the cache for the static portion. Only the variable user turn is charged at full rate.
This is the structure TaxLens uses in ai.service.ts. The SYSTEM constant is defined once at module level — no runtime interpolation, no user-specific content. All variable content goes into the user message.
How Much Does This Save?
System prompts of 500–1,000 tokens at gpt-4o pricing (~$0.0025/1K input tokens) cost ~$0.00125–0.0025 per call at full price. With a 70% cache-hit rate, the saving is at most ~$0.001–0.002 per call if cached tokens were nearly free, and about half that at a 50% cached-token discount. At 10,000 calls/day, that's up to $10–20/day: not enormous, but not trivial. For the analysis system prompt (which is longer and contains classification guidance), the saving is proportionally larger.
More importantly: prompt structure for cache hits is zero-cost to implement once you're aware of the constraint. It's one of the few cost optimisations with no trade-off.
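To estimate the saving for your own prompts, a small calculator helps. This is a sketch with hypothetical names (`savingPerCall`, `dailySaving`); the discount is a parameter because it differs by provider and model, and the example values below are assumptions rather than measured figures.

```typescript
interface PrefixCacheInputs {
  prefixTokens: number;    // tokens in the stable prefix
  pricePer1kInput: number; // USD per 1K uncached input tokens
  cachedDiscount: number;  // fraction taken off cached tokens (0.5 = 50% cheaper)
  hitRate: number;         // fraction of calls that hit the provider's cache
}

function savingPerCall(i: PrefixCacheInputs): number {
  const fullPrefixCost = (i.prefixTokens / 1000) * i.pricePer1kInput;
  // Only calls that hit the cache save anything, and only the discounted share.
  return fullPrefixCost * i.cachedDiscount * i.hitRate;
}

function dailySaving(i: PrefixCacheInputs, callsPerDay: number): number {
  return savingPerCall(i) * callsPerDay;
}

// Assumed example: 1,000-token prefix, $0.0025/1K, 50% discount, 70% hit rate.
const exampleInputs: PrefixCacheInputs = {
  prefixTokens: 1000,
  pricePer1kInput: 0.0025,
  cachedDiscount: 0.5,
  hitRate: 0.7,
};
console.log(dailySaving(exampleInputs, 10_000).toFixed(2));
```

Plugging in the hit rate you actually observe in your audit logs, rather than an assumed one, is what turns this from a guess into a budget line.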
Level 5: Background Processing as a Demand Smoother
TrustRail's statementAnalysisJob doesn't respond to user requests in real time. It processes up to 10 pending applications per minute on a cron schedule:
const pendingApplications = await Application.find({ status: 'PENDING_ANALYSIS' })
  .sort({ submittedAt: 1 })
  .limit(10);
This is a form of demand shaping. A sudden surge of 50 simultaneous application submissions doesn't produce 50 simultaneous GPT-4o calls — it produces a queue that drains at a controlled rate over 5 minutes. The cost curve stays linear and bounded regardless of submission spikes.
For interactive products (TaxLens, BuffByte), this pattern doesn't apply directly — users are waiting for results. But the principle generalises: any AI computation that doesn't need to happen in the request path should be deferred to a background job. Asynchronous content analysis for a creator platform, batch report generation, digest emails — these are all better suited to a queue than to inline model calls.
The cost property: A queue with a throughput ceiling creates a predictable cost ceiling. If the ceiling is 10 calls/minute and each call costs $0.08, the maximum cost is $0.80/minute, or $48/hour, regardless of submission volume. Without the ceiling, cost is fully variable with user behaviour.
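The ceiling is simple enough to compute directly. The two helpers below are hypothetical and assume one model call per queued application.

```typescript
// Worst-case spend per hour for a queue capped at `callsPerMinute`.
function maxHourlyCost(callsPerMinute: number, costPerCall: number): number {
  return callsPerMinute * 60 * costPerCall;
}

// Minutes needed to drain a backlog at the capped rate (one batch per minute).
function drainMinutes(queued: number, callsPerMinute: number): number {
  return Math.ceil(queued / callsPerMinute);
}

console.log(maxHourlyCost(10, 0.08).toFixed(2)); // ceiling at 10 calls/minute, $0.08 per call
console.log(drainMinutes(50, 10));               // backlog of 50 at 10 per minute
```

The same two numbers expose the trade-off: lowering the cap lowers the cost ceiling and lengthens the drain time by the same factor.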
The Token Budget: Constraining the Chat Tier
The chat tier is the highest-risk tier for runaway costs. A single engaged user sending 30 questions in a session is 30 model calls. At $0.01–0.05 per call depending on answer length, that's manageable per user. At 1,000 concurrent users all doing the same thing, it's not.
Practical controls:
Per-session question limits — TaxLens processes are scoped to a single tax computation. The session naturally terminates when the user leaves or the process expires. But for open-ended chat products, a per-session or per-day question limit is a meaningful cost control. Implement it at the application layer before it becomes necessary at the billing layer.
Context compression: The buildContext function in TaxLens passes JSON.stringify(computation) as the user context. For a typical tax computation, this is 1–3KB. If context grows (historical data, multiple analyses), it should be summarised rather than passed in full. A summarised context of 500 tokens instead of 2,000 cuts the context's input cost per question by 75%: at gpt-4o pricing (~$0.0025/1K input tokens), from ~$0.005 to ~$0.00125.
Response length constraints — System prompts that say "keep answers concise" are not just UX guidance — they reduce output tokens. A 500-token answer costs more than a 100-token answer. For a constrained domain (personal income tax questions, not open-ended conversation), short answers are often better answers anyway.
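The first of these controls, a per-session question limit, can be sketched in a few lines. `QuestionBudget` is a hypothetical class, not TaxLens code, and it keeps counts in process memory; a multi-instance deployment would need a shared store instead.

```typescript
class QuestionBudget {
  private used = new Map<string, number>();
  private readonly maxQuestions: number;

  constructor(maxQuestions: number) {
    this.maxQuestions = maxQuestions;
  }

  // Returns true and records the question if the session is under its limit.
  tryConsume(sessionId: string): boolean {
    const count = this.used.get(sessionId) ?? 0;
    if (count >= this.maxQuestions) return false;
    this.used.set(sessionId, count + 1);
    return true;
  }

  remaining(sessionId: string): number {
    return this.maxQuestions - (this.used.get(sessionId) ?? 0);
  }
}

// Usage: check the budget before making the chat-tier model call.
const chatBudget = new QuestionBudget(30);
if (chatBudget.tryConsume("session-abc")) {
  // ...make the model call here...
} else {
  // ...return a "question limit reached" response without calling the model...
}
```

Checking the budget before the model call is the point: a rejected question costs nothing, whereas a limit enforced after the call only reports spend that has already happened.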
Monitoring as a Prerequisite
None of the above can be optimised without visibility into what's actually happening. TaxLens's llm_audit collection records inputTokens, outputTokens, latencyMs, tier, model, and circuitState for every model call.
This makes cost questions answerable:
"Which tier is consuming the most tokens per day?" Query llm_audit, group by tier, sum inputTokens + outputTokens.
"Is any user or process burning disproportionate chat turns?" Group by code, count tier: 'chat', surface outliers.
"Are cache hits registering on the provider's side?" Compare inputTokens for chat calls with previousResponseId vs. without; a significant difference confirms caching is working.
"Is the circuit breaker affecting cost?" Count calls with circuitState: 'open'; if non-zero during business hours, adjust the failure threshold.
The promptHash field (SHA-256 of system + user prompts) enables deduplication analysis: calls with the same hash that produce different outputTokens indicate model non-determinism; calls with the same hash that could be served from a response cache identify caching opportunities you haven't implemented yet.
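Two of these analyses are sketched below over in-memory records. The `AuditRecord` shape uses the field names mentioned above, but its exact types are assumed, and in production the same grouping would more likely run as a database aggregation than in application code.

```typescript
type Tier = "gate" | "analysis" | "chat";

// Assumed subset of an llm_audit document.
interface AuditRecord {
  tier: Tier;
  inputTokens: number;
  outputTokens: number;
  promptHash: string;
}

// Total tokens (input + output) consumed per tier.
function tokensByTier(records: AuditRecord[]): Map<Tier, number> {
  const totals = new Map<Tier, number>();
  for (const r of records) {
    totals.set(r.tier, (totals.get(r.tier) ?? 0) + r.inputTokens + r.outputTokens);
  }
  return totals;
}

// Prompt hashes seen more than once: candidates for a response cache.
function repeatedPromptHashes(records: AuditRecord[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const r of records) {
    counts.set(r.promptHash, (counts.get(r.promptHash) ?? 0) + 1);
  }
  return new Map(Array.from(counts.entries()).filter(([, n]) => n > 1));
}
```

A hash that repeats often is the clearest signal that an exact-match cache would pay off for that call site.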
Putting It Together: Cost Profile for TaxLens
A complete TaxLens analysis + one chat question, with all caching layers applied:
| Step | Model | Tokens (estimated) | Cost |
|---|---|---|---|
| Gate call | gpt-4o-mini | 2K input (PDF), 50 output | ~$0.0004 |
| Analysis call (threading) | gpt-4o | 1K input (prompt only, no PDF re-send), 500 output | ~$0.016 |
| Chat call (threading + cached prefix) | gpt-4o | 300 input (context + question, prefix cached), 150 output | ~$0.004 |
| Total | | | ~$0.021 |
Without caching (no threading, no prefix caching, no gate):
Gate: skipped (1 analysis call does the work)
Analysis: 20K input tokens (PDF + full prompt), 500 output → ~$0.055
Chat: 5K input tokens (full re-sent PDF context + question), 150 output → ~$0.015
Total: ~$0.070
The difference: ~$0.049 per complete user session, or a ~70% cost reduction from the caching layers. At 1,000 sessions/day, that's ~$49/day saved (~$1,470/month) from architecture decisions that don't change the user experience at all.
Trade-offs
Caching introduces correctness risk. A cached LLM response that was correct when computed may be incorrect after a model update or a prompt change. Response caches must be invalidated when prompts change. The Ajala SDK's cache key includes the model version for this reason — a model upgrade automatically invalidates cached responses.
Conversation threading creates longer dependency chains. If the analysis response that subsequent chat turns depend on is deleted (by a reaper job clearing old processes), those chat turns can no longer thread correctly. TaxLens handles this by including the full computation context in every chat user turn as a fallback — the threading is an optimisation, not a requirement.
Background queues don't suit interactive workloads. The throughput ceiling that makes queues cost-predictable also makes them latency-unpredictable. If response time matters to the user, background queuing isn't the right shape.
Provider-side caching has minimum length requirements. OpenAI's prompt caching applies to prompts above a certain length threshold (currently 1,024 tokens). Short system prompts don't benefit. This is only relevant for very terse prompts — in practice, most production system prompts with meaningful instructions clear this threshold.
The Mental Model
LLM calls have three cost dimensions: how often they're made, how many tokens they use, and whether you pay full price or cached price. Every caching layer addresses one of these dimensions.
The gate addresses frequency (some calls are simply never made). Response caching and background queues address frequency from a different angle (calls are deduplicated or rate-limited). Conversation threading and context compression address token count (you send less content per call). Provider-side prompt caching addresses per-token price (you pay a fraction of the normal rate for cached portions).
No single layer is sufficient. A product with a perfect gate but no threading still pays full token cost for every chat question. A product with threading but no gate pays full cost for every invalid document submission. The layers compound.
Design the cost architecture before the first user arrives. The bill does not wait for you to be ready for it.


