Operations 7 min read

AI Cost Engineering: How to Cut Your Inference Bill 45–90% Without Touching Your Product

Two levers that move the number: prompt caching and model routing. Together they can make the difference between viable unit economics and a business that cannot scale.

Who this is for: Founders and technical operators of AI-powered products (SaaS, voice agents, chatbots, or any product where LLM API calls appear on the infrastructure bill). Most useful at $1K+/month in AI API spend, but the patterns apply from day one.

The problem

AI SaaS businesses average 52% gross margin in 2026, structurally lower than the 75 to 85% that traditional SaaS enjoys. The primary driver: inference costs are a new cost-of-goods-sold line that traditional SaaS never had, and they scale with usage, not with seats. A single heavy user on a flat-rate plan can generate negative gross margin. One customer paying $299/month and generating 10x typical usage produced -$59/month gross profit in a documented per-customer P&L model.

Most AI founders treat inference costs as a fixed constraint, something determined by which model they chose. It is not. Prompt caching and model routing alone can reduce the API cost component by 45 to 90%, and both can be implemented in days, not months.

Cost engineering is not an optimization task. It is a prerequisite for viable unit economics.

Two Levers That Move the Number

Lever 1: Prompt Caching (roughly 90% savings on cached tokens)

Prompt caching allows you to store a stable portion of a prompt (the system prompt, large context blocks, reference documents) so the model provider only processes it once, then serves cached results on subsequent calls. Anthropic charges approximately 90% less for cached input tokens than freshly processed ones. OpenAI offers approximately 50% reduction.

What qualifies for caching: System prompts, persona definitions, knowledge base injections, product documentation, multi-turn conversation history, and any context that remains stable across a large number of requests.

The cost math in practice: A typical AI assistant call might include 2,000 tokens of system prompt plus 500 tokens of user input. Without caching, you pay for 2,500 tokens on every call. With caching enabled and the system prompt cached, you pay the full rate for 500 tokens and the cache-hit rate for 2,000 tokens. That is a reduction of roughly 70 to 90% of the input token cost.

One documented case: a production AI product cut LLM costs 59 to 66% by enabling prompt caching with no changes to output quality.

Action step

Implementation time: 1 to 2 days for most existing integrations. The change is primarily at the API call layer. No model fine-tuning, no prompt rewriting required.

Lever 2: Model Routing (45–85% savings)

Model routing is the practice of directing different types of queries to different models based on complexity. Route simple, high-frequency queries to smaller, cheaper models. Escalate complex queries to larger, more capable ones.

The cost differential between model tiers is large. At 100 calls per day per customer: Claude Sonnet costs $75 to $150/month in inference; Claude Haiku at the same volume costs $9 to $24/month. That is a 6 to 8x cost difference per customer at equivalent volume.

Query typeRoute toWhy
Structured data extractionSmall model (Haiku, Flash)Pattern-matching task; no reasoning required
Simple FAQ / status responseSmall modelSingle-turn, predictable outputs
Appointment booking (form fill)Small modelDeterministic workflow, not open-ended
Complex reasoning / multi-stepFrontier model (Sonnet, GPT-5)Justified cost for quality-critical paths
Ambiguous intent resolutionFrontier modelError cost exceeds inference savings
Compliance-sensitive outputsFrontier model + evalCannot risk quality degradation

A well-designed routing layer sends 80 to 90% of queries to small models and escalates 10 to 20% to frontier models. The result is 45 to 85% reduction in total inference cost at 90 to 95% quality parity across the full query distribution.

The risk with model routing is quality degradation on the queries that should escalate but don't. Build your routing logic around failure modes, not just cost. Always A/B test routing thresholds before shipping to production.

The Unit Economics Framing

The industry benchmark for sustainable AI SaaS is an Inference Efficiency Ratio (IER), defined as inference cost as a percentage of revenue, below 15%. Without active cost management, the typical AI SaaS product lands at 23% IER before any other COGS. At 23%, your gross margin ceiling is roughly 77% before hosting, support, tooling, and payment processing, which typically land gross margin in the 40 to 55% range. Below 40%, you cannot support a viable go-to-market, customer success, or R&D function.

With prompt caching and model routing, a product starting at 23% IER can reach 8 to 12% IER. That difference is the margin that makes the business fundable, hireable, and acquirable.

Preventing margin destruction from heavy users: Even with cost engineering in place, flat-rate pricing is a margin risk if any users generate 10x+ typical volume. The countermeasure: usage caps framed as "AI credits" included in each plan tier. A cap of 500 credits/month on a $299 plan, with overages at a published per-credit rate, prevents a single heavy user from generating negative gross profit while giving the majority of users a frictionless experience.

Copy this prompt
Audit my AI tool spend. Here are all the AI/LLM services I'm currently paying for: [list each tool/API, monthly cost, and what it's used for]. Calculate my total monthly AI spend, identify any tools with overlapping functionality, and flag the top 3 opportunities to reduce cost through consolidation or elimination.

When to use: Monthly or quarterly. List every AI tool and API subscription. The output is a prioritized cost reduction list, starting with the easiest wins.

Copy this prompt
Calculate the per-task cost of my AI product. My product handles these task types: [list each task type and approximate daily volume]. I'm using [model name] at [price per 1K input tokens / price per 1K output tokens]. For each task type, estimate: average input tokens, average output tokens, cost per task, and monthly cost at current volume. Then show what each would cost if routed to [smaller model name] instead.

When to use: Before implementing model routing. Fill in your actual task types, volumes, and model pricing. The output shows you exactly where routing saves the most money.

Copy this prompt
Identify redundant AI subscriptions in my tech stack. Here are all the software tools I pay for that include AI features: [list each tool, monthly cost, and AI features used]. Flag any tools where the AI feature overlaps with another tool I'm already paying for. Recommend which to keep and which to cancel, with estimated monthly savings.

When to use: During quarterly subscription reviews. Include all tools with AI features, not just dedicated AI APIs. Many SaaS products now bundle AI features that duplicate standalone tools.

How to apply it

  1. Day 1 to 2: Enable prompt caching in your API integration. Identify the stable context block in your most common request type and move it into the cached prefix. Measure: compare token counts and costs before and after. Establish your baseline IER.
  2. Week 1 to 2: Build a simple routing layer. Start with intent classification (3 to 5 intent categories, each mapped to a model tier). Test against your last 500 real queries. A/B test quality at the query level before routing to production.
  3. Ongoing: Track IER weekly. Set a target (below 15%) and a threshold that triggers review (above 20%). Add usage caps to flat-rate plans before your first heavy user appears, not after.

The one decision

The one decision this topic forces: which queries does your product route to a smaller model, and what is the quality floor you will accept?

This is not a technical question. It is a product question. Define the interactions where quality degradation is acceptable (structured data tasks, simple Q&A, predictable workflows) vs. unacceptable (compliance-adjacent outputs, complex reasoning, high-stakes user decisions). Your routing logic flows from that definition.

Starting with 100% frontier model calls and working down is safer than over-routing immediately. Add routes as you validate quality at each tier.

Share on LinkedIn Share on X

Get the companion toolkit

Copy-paste prompts, calculators, and checklists that go with this guide. Yours free.

AI tool cost audit checklist
Per-task cost calculator
Subscription optimization matrix
Work with Brady

Want this done for your business?

I build AI automations for small businesses every week. If you'd rather have someone set this up than DIY it, let's talk.

Join a free workshop

or see the Build Sprint ($497)