Grounding AI in High-Stakes Domains: When the LLM Must Never Produce the Number

A few months ago I shipped two products within weeks of each other. One computes your Nigerian income tax from a bank statement. The other decides whether a small business gets a BNPL loan. Different domains, different users, different stakes — but the backend architecture converged on the same rule: the language model reads the document; deterministic code produces every number that matters.
This isn't a style preference. It's the invariant the entire design is built around. Understanding why it's necessary, where the boundary sits, and what breaks when you cross it is what this article is about.
Let's get cracking.
The Problem with Trusting the Number
Language models are remarkably good at reading. They can parse a scanned bank statement in six different formats, figure out that "SALARY JAN — STAC INTERCONTINENTAL" is employment income and "REVERSAL — INSUFFICIENT FUNDS" is a bounce, and produce structured output from unstructured chaos. This is genuinely hard, and they do it well.
They are not reliable calculators under legal constraints.
The failure mode isn't dramatic. The model doesn't say "I cannot compute tax" and refuse. It produces a number — confidently, fluently, with the same tone it uses when it's completely correct. The number might be ₦180,000 when the NTA 2025 Fourth Schedule says ₦240,000. The trust score might be 78 when the real affordability math says 41. There's no error message. There's no undefined. Just a wrong number, presented as fact.
And then someone acts on it.
A business owner is approved for a loan they cannot afford. A freelancer files a tax return with the wrong liability. The failure is silent, downstream, and by the time it surfaces, the model has been called ten thousand more times with the same architecture.
This is the problem. And the solution is not better prompting.
The Architectural Invariant
The invariant I settled on across both TaxLens and TrustRail:
An LLM may extract, classify, and validate. It may never calculate, decide, or produce a figure the user will act on.
Three permitted operations. Three forbidden ones. The boundary between them is the line between the LLM doing what it's good at and the LLM being asked to do something it will eventually get wrong in a way you can't predict.
Let's walk through exactly how this plays out in both systems.
TaxLens: The Two-Tier Pipeline
TaxLens processes a bank statement PDF to estimate Nigerian personal income tax under the NTA 2025. The user uploads a PDF. A few seconds later, they see their estimated tax liability, the applicable bands, and the inflows the system counted as income.
The pipeline has three tiers: two model tiers followed by a deterministic tax engine. Only the first two involve a language model.
Tier 1: The Gate (cheap model)
The gate model receives the PDF and answers exactly one question:
const GateVerdictSchema = z.object({
  valid: z.boolean(),
  bankName: z.string(),
  monthsCovered: z.number().int().nonnegative(),
  reason: z.string(),
});
Four fields. A boolean, a bank name, a month count, and a reason if invalid. The gate model never sees a tax question. It never computes anything. Its only job is to tell the system whether the document is a genuine, legible Nigerian bank statement that can support an income estimate.
If valid is false, the pipeline terminates. No analysis call fires. The user gets a clear failure reason and a prompt to upload a different document.
This gate serves two purposes simultaneously: it protects the more expensive analysis model from wasted calls on unusable input (a photo of a receipt, a foreign bank statement, a blank page), and it establishes the first hard boundary — before any income extraction begins, a human-interpretable validation has run.
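The short-circuit described above can be sketched as follows. This is a minimal illustration, not the TaxLens source: processStatement, ModelCalls, and PipelineResult are hypothetical names, the shapes only mirror the fields named in this article, and the two model calls are injected as plain functions so the control flow can be exercised with stubs instead of a real API.

```typescript
// Hypothetical shapes that mirror the fields named in the article.
interface GateVerdict {
  valid: boolean;
  bankName: string;
  monthsCovered: number;
  reason: string;
}

interface Analysis {
  grossAnnualKobo: number;
}

type PipelineResult =
  | { status: "failed"; reason: string }
  | { status: "ready"; grossAnnualKobo: number };

// Assumption: both model calls are injected, so tests can pass stubs.
interface ModelCalls {
  runGate(pdf: Uint8Array): Promise<GateVerdict>;
  runAnalysis(pdf: Uint8Array): Promise<Analysis>;
}

async function processStatement(
  pdf: Uint8Array,
  models: ModelCalls
): Promise<PipelineResult> {
  const gate = await models.runGate(pdf);
  if (!gate.valid) {
    // Short-circuit: the analysis model is never called for an invalid document.
    return { status: "failed", reason: gate.reason };
  }
  const analysis = await models.runAnalysis(pdf);
  return { status: "ready", grossAnnualKobo: analysis.grossAnnualKobo };
}
```

Because the gate result is checked before runAnalysis is awaited, a rejected upload costs one cheap call and nothing more.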
Tier 2: The Analysis (better model)
If the gate passes, the analysis model runs. It receives the same PDF and a different question:
const AnalysisSchema = z.object({
  inflows: z.array(InflowSchema),
  grossAnnualKobo: z.number().int().nonnegative(),
});
The inflows array contains every credit the model found, each tagged with a classification: salary, business, transfer, or other. The grossAnnualKobo is the model's annualised income estimate in kobo.
Notice what the analysis model does not return: tax bands, relief amounts, effective rates, tax payable. It returns structured income data. Nothing more.
Tier 3: The Tax Engine (pure code, no LLM)
The tax engine is a pure TypeScript function. It receives grossAnnualKobo and a profile type, and it returns the full computation — band-by-band breakdown, applicable reliefs, both old and new regime figures, recommended regime. Every number the user sees on screen comes from here.
const computation = computeFromGross(profileType, grossAnnualKobo);
computeFromGross calls compareRegimes, a deterministic function that encodes the NTA 2025 Fourth Schedule as code. It takes no model output as a parameter except the gross figure. It has no randomness, no retries, no API calls. Given the same inputs, it produces the same outputs. Forever.
This is the invariant in practice: the LLM hands off a single number (grossAnnualKobo) to a piece of code that knows the law. The code does the legal reasoning.
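To make "the schedule as code" concrete, here is a minimal sketch of a progressive band calculation in integer kobo. The band limits and rates are illustrative placeholders, not the NTA 2025 figures, and computeProgressiveTaxKobo is a hypothetical name rather than the compareRegimes implementation. Reliefs and the regime comparison are left out.

```typescript
interface Band {
  upToKobo: number | null; // cumulative upper limit; null means no upper limit
  rateBps: number;         // rate in basis points (1000 = 10%)
}

interface BandResult {
  rateBps: number;
  taxableKobo: number;
  taxKobo: number;
}

// ILLUSTRATIVE ONLY: placeholder bands, not the statutory schedule.
const ILLUSTRATIVE_BANDS: Band[] = [
  { upToKobo: 100_000_000, rateBps: 0 },    // first ₦1,000,000 at 0%
  { upToKobo: 300_000_000, rateBps: 1000 }, // next ₦2,000,000 at 10%
  { upToKobo: null, rateBps: 2000 },        // remainder at 20%
];

function computeProgressiveTaxKobo(
  grossAnnualKobo: number,
  bands: Band[] = ILLUSTRATIVE_BANDS
): { totalTaxKobo: number; breakdown: BandResult[] } {
  if (!Number.isInteger(grossAnnualKobo) || grossAnnualKobo < 0) {
    throw new RangeError("grossAnnualKobo must be a non-negative integer");
  }
  let lowerKobo = 0;
  let totalTaxKobo = 0;
  const breakdown: BandResult[] = [];
  for (const band of bands) {
    const upperKobo = band.upToKobo ?? Number.POSITIVE_INFINITY;
    // Portion of the gross that falls inside this band (0 if the gross is below it).
    const taxableKobo = Math.max(0, Math.min(grossAnnualKobo, upperKobo) - lowerKobo);
    // Integer arithmetic: basis points avoid floating-point rates.
    const taxKobo = Math.floor((taxableKobo * band.rateBps) / 10_000);
    breakdown.push({ rateBps: band.rateBps, taxableKobo, taxKobo });
    totalTaxKobo += taxKobo;
    lowerKobo = upperKobo;
  }
  return { totalTaxKobo, breakdown };
}
```

Under these placeholder bands, a gross of 500,000,000 kobo (₦5,000,000) yields 60,000,000 kobo: nothing on the first band, 20,000,000 on the second, and 40,000,000 on the remainder. Each band appears in the breakdown, which is what lets the screen show the working rather than only the total.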
TrustRail: The Same Pattern at Different Stakes
TrustRail is a BNPL underwriting platform. A business uploads a bank statement as part of a credit application. The system produces a trust score (0–100) and a decision: APPROVED, FLAGGED_FOR_REVIEW, or DECLINED.
The stakes are different from tax. A wrong tax estimate might result in a corrected filing. A wrong loan approval can trap a business in a debt it can't service.
The architecture is the same.
GPT-4o reads the uploaded PDF and returns a TrustEngineAnalysisResult. But the trust score inside that result is computed by calculateTrustScore — a pure TypeScript function in trustEngineService.ts. Five weighted buckets: income stability (30 points), spending behaviour (25 points), balance health (20 points), transaction behaviour (15 points), affordability (10 points). Pure arithmetic on structured transaction data.
const calculateTrustScore = (
  incomeAnalysis: IncomeAnalysis,
  spendingAnalysis: SpendingAnalysis,
  balanceAnalysis: BalanceAnalysis,
  behaviorAnalysis: BehaviorAnalysis,
  affordabilityAssessment: AffordabilityAssessment,
  installmentAmount: number
): number => {
  let score = 0;

  // Income Stability (30 points)
  // Up to 15 points for consistency (assuming incomeConsistency is a 0-1 fraction).
  score += incomeAnalysis.incomeConsistency * 15;
  // Note: despite the name, this is installment divided by income, so lower is better.
  const incomeToInstallmentRatio = installmentAmount / incomeAnalysis.avgMonthlyIncome;
  if (incomeToInstallmentRatio < 0.2) score += 15;
  else if (incomeToInstallmentRatio < 0.3) score += 10;
  else if (incomeToInstallmentRatio < 0.4) score += 5;

  // ... four more buckets ...

  // Round, then clamp to the 0-100 range.
  return Math.max(0, Math.min(100, Math.round(score)));
};
GPT-4o's job in TrustRail is identical to the analysis model in TaxLens: extract and classify. It reads the document format that the CSV parser can't handle, identifies income patterns, flags bounces, and returns structured data. The scoring is never delegated.
The isValidStatement Guard
Both systems share one more hard boundary: the model's verdict on whether the document is a real bank statement is checked before any extracted figure is used.
In TaxLens, gate.data.valid === false short-circuits to a failed state before any analysis model call fires.
In TrustRail, the application service checks analysisResult.isValidStatement. If false, it calls createInvalidStatementOutput — a function that returns a fully zeroed TrustEngineAnalysisResult with trustScore: 0 and decision: 'DECLINED'. The model's rejection reason is preserved for the audit trail. Everything else is zeroed.
if (analysisResult.isValidStatement === false) {
  analysisResult = createInvalidStatementOutput(
    analysisResult.invalidStatementReason || 'Document is not a valid bank statement',
    application.installmentAmount,
  );
}
This matters for a reason beyond correctness: it limits document gaming. If an applicant uploads a forged or altered statement and the model flags it as invalid, the system doesn't try to extract income from it. It declines. The LLM's "I can't read this properly" is treated as a hard signal, not an error to retry through.
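For illustration, a zeroed result of this kind might look like the sketch below. The real TrustEngineAnalysisResult is not shown in this article, so AnalysisResultSketch is a reduced, hypothetical stand-in, and createInvalidStatementOutputSketch only mirrors the behaviour described above: score 0, decision DECLINED, reason preserved.

```typescript
type Decision = "APPROVED" | "FLAGGED_FOR_REVIEW" | "DECLINED";

// Hypothetical, reduced stand-in for TrustEngineAnalysisResult.
interface AnalysisResultSketch {
  isValidStatement: boolean;
  invalidStatementReason?: string;
  trustScore: number;
  decision: Decision;
  installmentAmount: number;
  avgMonthlyIncome: number;
}

function createInvalidStatementOutputSketch(
  reason: string,
  installmentAmount: number
): AnalysisResultSketch {
  return {
    isValidStatement: false,
    invalidStatementReason: reason, // kept for the audit trail
    trustScore: 0,                  // never derived from an unreadable document
    decision: "DECLINED",
    installmentAmount,              // echoed from the application, not the model
    avgMonthlyIncome: 0,            // every extracted figure is zeroed
  };
}
```

The point of building the result from constants is that nothing the model extracted from a rejected document can leak into the stored decision.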
The Audit Trail Problem
Every financial decision must be explainable and reproducible. If a loan applicant disputes a decline, you need to be able to reconstruct exactly why. If a tax authority questions a filed return, you need to show your working.
A language model's chain-of-thought is neither explainable nor reproducible in that sense. The same prompt, the same document, the same model version can produce different structured outputs on different days. Not wildly different — but measurably different in ways that matter when someone's loan application is on the line.
TaxLens maintains a separate llm_audit collection that records every model call: tier, model, promptHash (a SHA-256 of the system + user prompts — never raw statement text, for PII reasons), inputTokens, outputTokens, latencyMs, and circuitState. This records what the LLM did.
The tax_process document records what the tax engine computed: grossAnnualKobo, the full computation object with every band and relief, inflows with their classifications. This records what the system decided.
They're separate because they answer different accountability questions. The LLM audit answers: did the model behave correctly? The process record answers: how did we arrive at this tax figure? An auditor cares about the second. A debugging session cares about both.
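A sketch of how such an audit record could be assembled, using Node's built-in crypto module for the SHA-256 hash. The field names follow the list above; the allowed values for tier and circuitState, the JSON encoding of the prompt pair, and the helper names hashPrompts and buildAuditRecord are assumptions made for this example.

```typescript
import { createHash } from "node:crypto";

interface LlmAuditRecord {
  tier: "gate" | "analysis";                     // assumed tier labels
  model: string;
  promptHash: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  circuitState: "closed" | "open" | "half-open"; // assumed state labels
}

// Hash the system and user prompts together. JSON-encoding the pair keeps the
// boundary between them unambiguous ("a" + "bc" must not collide with "ab" + "c").
function hashPrompts(systemPrompt: string, userPrompt: string): string {
  return createHash("sha256")
    .update(JSON.stringify([systemPrompt, userPrompt]), "utf8")
    .digest("hex");
}

function buildAuditRecord(input: {
  tier: LlmAuditRecord["tier"];
  model: string;
  systemPrompt: string;
  userPrompt: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  circuitState: LlmAuditRecord["circuitState"];
}): LlmAuditRecord {
  // Only the hash is stored; the raw prompts never reach the record.
  return {
    tier: input.tier,
    model: input.model,
    promptHash: hashPrompts(input.systemPrompt, input.userPrompt),
    inputTokens: input.inputTokens,
    outputTokens: input.outputTokens,
    latencyMs: input.latencyMs,
    circuitState: input.circuitState,
  };
}
```

The hash lets you confirm that two calls used the same prompts without storing either of them.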
Where to Draw the Line
The pattern generalises. Here's the framework I use when deciding what a language model can own in a high-stakes pipeline:
Permitted:
Extraction — reading structure from unstructured input (transaction parsing, field extraction from PDFs, form recognition)
Classification — labelling items against a defined taxonomy (salary vs. transfer, valid vs. invalid, income vs. refund)
Validation — checking whether a document is what it claims to be
Forbidden:
Calculation — arithmetic on figures that have legal, financial, or medical significance
Decision — producing an outcome the user will act on (approved, declined, liable, not liable)
Threshold application — checking a computed value against a rule ("does this score exceed the minimum?")
The test is simple: if an auditor asks "why did you produce this number?", can you answer by pointing to a function call and its inputs? If yes, the number is already where it belongs, in code. If the answer is "because the model said so", the architecture needs to change.
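As a small illustration of keeping threshold application in code, the sketch below maps a trust score to one of the three TrustRail decisions and returns the rule that fired. The cut-offs (70 and 50) and the names decide and HYPOTHETICAL_THRESHOLDS are assumptions for this example; the article does not state TrustRail's real thresholds.

```typescript
type Decision = "APPROVED" | "FLAGGED_FOR_REVIEW" | "DECLINED";

interface DecisionThresholds {
  approveAt: number; // scores at or above this are approved
  reviewAt: number;  // scores at or above this (but below approveAt) go to review
}

// HYPOTHETICAL cut-offs, chosen only to make the example concrete.
const HYPOTHETICAL_THRESHOLDS: DecisionThresholds = { approveAt: 70, reviewAt: 50 };

function decide(
  trustScore: number,
  thresholds: DecisionThresholds = HYPOTHETICAL_THRESHOLDS
): { decision: Decision; rule: string } {
  if (!Number.isInteger(trustScore) || trustScore < 0 || trustScore > 100) {
    throw new RangeError("trustScore must be an integer between 0 and 100");
  }
  if (trustScore >= thresholds.approveAt) {
    return { decision: "APPROVED", rule: `score >= ${thresholds.approveAt}` };
  }
  if (trustScore >= thresholds.reviewAt) {
    return { decision: "FLAGGED_FOR_REVIEW", rule: `score >= ${thresholds.reviewAt}` };
  }
  return { decision: "DECLINED", rule: `score < ${thresholds.reviewAt}` };
}
```

Returning the rule alongside the decision gives the auditor's question a literal answer: the function, its inputs, and the comparison that fired.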
Trade-offs and Honest Limitations
This architecture is not free.
The extraction step can still be wrong. If the analysis model misclassifies an inflow — calling a ₦500,000 monthly salary "transfer" — the tax engine will compute the correct tax on a wrong gross. The invariant protects the calculation. It does not protect the input. This is why TaxLens has a needs_review state (triggered when grossAnnualKobo === 0 while inflows exist) and a manual reclassification flow. The user can correct what the model got wrong.
The gate is a probabilistic validator, not a cryptographic one. A sufficiently realistic forged bank statement will pass the gate. The architecture reduces the blast radius — the scoring remains deterministic and the decision remains auditable — but it does not eliminate the risk of document fraud. That requires additional signals (account verification, BVN matching, real-time bank data feeds) that sit outside the LLM pipeline entirely.
The two-call structure increases latency. Gate plus analysis adds a sequential LLM round-trip before the tax engine can run. On fast connections with the right model tier choices (a cheap, fast model for the gate; a more capable model for analysis), this is tolerable. On slow networks, or when a single high-capability model serves both tiers, the user waits longer than they would with a single call. The right choice depends on your rejection rate: if 20% or more of uploads are invalid documents, the gate saves enough analysis calls to justify the extra latency even in the average case.
Per-process singleton circuit breakers don't survive horizontal scale. TaxLens's CircuitBreaker is in-memory, per Node.js process. On a single instance, it works correctly: three consecutive OpenAI failures trip the breaker, and subsequent calls fast-fail for the cooldown period. On two instances, each has an independent breaker. One instance can be in open state while the other is closed, and the load balancer routes requests to whichever happens to answer. Moving the circuit state to Redis or MongoDB is the obvious fix; it was explicitly deferred as a v2 decision in the design notes, and it's the right call for an early product where single-instance deployments are the norm.
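The per-process behaviour is easy to see in a sketch. The class below is a hypothetical, minimal breaker, not the TaxLens CircuitBreaker: the three-failure threshold comes from the description above, while the 30-second default cooldown, the half-open trial state, and the injectable clock are assumptions added to make the example testable.

```typescript
type CircuitState = "closed" | "open" | "half-open";

class InMemoryCircuitBreaker {
  private consecutiveFailures = 0;
  private openedAtMs: number | null = null;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(options: {
    failureThreshold?: number;
    cooldownMs?: number;
    now?: () => number;
  } = {}) {
    this.failureThreshold = options.failureThreshold ?? 3;
    this.cooldownMs = options.cooldownMs ?? 30_000; // assumed default
    this.now = options.now ?? (() => Date.now());
  }

  getState(): CircuitState {
    if (this.openedAtMs === null) return "closed";
    // After the cooldown, allow a trial call instead of failing fast.
    return this.now() - this.openedAtMs >= this.cooldownMs ? "half-open" : "open";
  }

  canAttempt(): boolean {
    return this.getState() !== "open";
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAtMs = null;
  }

  recordFailure(): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) {
      // Trip (or re-trip after a failed trial call) and restart the cooldown.
      this.openedAtMs = this.now();
    }
  }
}

// Two instances stand in for two Node.js processes: state is not shared.
const instanceA = new InMemoryCircuitBreaker();
const instanceB = new InMemoryCircuitBreaker();
for (let i = 0; i < 3; i++) instanceA.recordFailure();
console.log(instanceA.getState(), instanceB.getState());
```

Everything the breaker knows lives in two private fields, which is exactly why a second process cannot see it. Moving those fields to a shared store is the change the v2 note refers to.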
The A2 Guard: A Real Edge Case
The needs_review state in TaxLens came from a real production observation, not speculation.
The Kuda MFB bank statement format produces credits with narrations like "Stac Intercontinental Ltd transfer" and "Abolarinwa Babafemi transfer." Both of those are income for the person receiving them — regular client payments, freelance work. But the word "transfer" appears in both narrations, and the analysis model, if it applies the classification guidance too literally, tags both as transfer rather than business.
The result: grossAnnualKobo: 0, despite inflows summing to ₦1.6M.
The guard:
const inflowsSumKobo = inflows.reduce((s, f) => s + f.amountKobo, 0);
const needsReview = grossAnnualKobo === 0 && inflowsSumKobo > 0;
If the model extracted credits but counted none as income, the pipeline routes to needs_review instead of ready. The user sees their inflows, selects which ones are actually income, and the system recomputes. The tax engine runs on corrected input.
This is the architectural response to a model classification error: don't try to prompt-engineer it away. Build a recovery path that keeps the human in the loop for the specific case where the model is known to drift.
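The recovery path can be sketched as three small pure functions. The guard is the one shown above; everything else is an assumption made for illustration: the Inflow shape, the rule that only salary and business count as income, the reclassify helper, and the linear annualisation over monthsCovered. The article does not specify how TaxLens annualises.

```typescript
type InflowClass = "salary" | "business" | "transfer" | "other";

interface Inflow {
  id: string;
  narration: string;
  amountKobo: number;
  classification: InflowClass;
}

// Assumption: only these two classes count towards income.
function countsAsIncome(c: InflowClass): boolean {
  return c === "salary" || c === "business";
}

// The guard from the article: credits were extracted but none counted as income.
function needsReview(inflows: Inflow[], grossAnnualKobo: number): boolean {
  const inflowsSumKobo = inflows.reduce((s, f) => s + f.amountKobo, 0);
  return grossAnnualKobo === 0 && inflowsSumKobo > 0;
}

// Apply the user's corrections (inflow id -> corrected class) without mutating the input.
function reclassify(
  inflows: Inflow[],
  corrections: Record<string, InflowClass | undefined>
): Inflow[] {
  return inflows.map((f) => ({
    ...f,
    classification: corrections[f.id] ?? f.classification,
  }));
}

// Assumption: linear annualisation over the months the statement covers.
function annualiseIncomeKobo(inflows: Inflow[], monthsCovered: number): number {
  if (!Number.isInteger(monthsCovered) || monthsCovered <= 0) {
    throw new RangeError("monthsCovered must be a positive integer");
  }
  const incomeKobo = inflows
    .filter((f) => countsAsIncome(f.classification))
    .reduce((s, f) => s + f.amountKobo, 0);
  return Math.round((incomeKobo * 12) / monthsCovered);
}
```

The corrected gross is computed by code from the user's selections, so the model is not asked a second time and the tax engine still receives a single deterministic input.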
Evaluation
Across both systems, the pattern produces measurable properties:
Reproducibility — Given the same extracted inflows or transactions, the tax engine and trust score engine produce the same output every time. This makes regression testing straightforward: fix a snapshot of extracted data, assert on the computed output.
Auditability — Every figure can be traced to a specific function call with specific inputs. "Your trust score is 62" is backed by: here are the five bucket scores, here are the income figures that drove them, here are the transactions the classification step found.
Fault isolation — When the model produces unexpected output (a misclassification, a wrong grossAnnualKobo), the error is contained to the extraction layer. The calculation layer sees structured data and computes correctly on whatever it receives. The bug is findable and fixable without touching the scoring logic.
Testability — The tax engine and the trust scoring functions are pure TypeScript with no I/O. They can be tested exhaustively with fixtures. The LLM calls are tested with a stub transport (more on that in the next article).
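A fixture-based regression check over a pure scoring rule might look like the sketch below. The tier boundaries (0.2, 0.3, 0.4) are taken from the income-stability excerpt earlier in this article; installmentRatioPoints, the Fixture shape, and findRegressions are hypothetical names introduced here, and the fixture amounts are made up.

```typescript
// Points for the installment-to-income ratio, extracted from the scoring excerpt.
function installmentRatioPoints(
  installmentAmount: number,
  avgMonthlyIncome: number
): number {
  const ratio = installmentAmount / avgMonthlyIncome;
  if (ratio < 0.2) return 15;
  if (ratio < 0.3) return 10;
  if (ratio < 0.4) return 5;
  return 0; // also reached when income is 0 (ratio is Infinity or NaN)
}

interface Fixture {
  name: string;
  installmentAmount: number;
  avgMonthlyIncome: number;
  expectedPoints: number;
}

// Snapshot of inputs and expected outputs, including the tier boundaries.
const FIXTURES: Fixture[] = [
  { name: "comfortable", installmentAmount: 10_000, avgMonthlyIncome: 100_000, expectedPoints: 15 },
  { name: "boundary 0.2", installmentAmount: 20_000, avgMonthlyIncome: 100_000, expectedPoints: 10 },
  { name: "boundary 0.3", installmentAmount: 30_000, avgMonthlyIncome: 100_000, expectedPoints: 5 },
  { name: "boundary 0.4", installmentAmount: 40_000, avgMonthlyIncome: 100_000, expectedPoints: 0 },
  { name: "no income", installmentAmount: 50_000, avgMonthlyIncome: 0, expectedPoints: 0 },
];

// Returns a description of every fixture whose output drifted from the snapshot.
function findRegressions(fixtures: Fixture[]): string[] {
  const failures: string[] = [];
  for (const f of fixtures) {
    const actual = installmentRatioPoints(f.installmentAmount, f.avgMonthlyIncome);
    if (actual !== f.expectedPoints) {
      failures.push(`${f.name}: expected ${f.expectedPoints}, got ${actual}`);
    }
  }
  return failures;
}
```

Because the function has no I/O, the same fixtures give the same result on every run, so any entry returned by findRegressions points to a change in the scoring code rather than in the model.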
The Honest Conclusion
The pattern "LLM as parser, code as calculator" is not a clever trick. It's just a clear application of what language models are and are not reliable for, applied to a domain where being wrong has real consequences.
The LLM is a remarkable reader. It can handle document chaos that would require months of parser engineering. But it has no internal representation of the NTA 2025 Fourth Schedule, no guarantee that its arithmetic is consistent, and no awareness that the number it produces will be used to approve or decline someone's loan application.
Code has all of those properties. The combination is what makes the system trustworthy.
The model earns the data. The code earns the answer.


