
LLM Cost Optimization in Production

Apr 18, 2026 · 13 min read

Scenario

A SaaS company sees its LLM API bill increasing every month and needs a practical strategy to reduce costs without hurting user experience.

Business Problem

LLM bills typically grow because of a few avoidable patterns:

  • oversized prompts and weak token caps
  • using premium models for simple requests
  • no response caching
  • repeated synchronous calls that should be batched
  • no per-team cost attribution

The goal is to reduce unit cost per request while protecting answer quality and latency SLOs.
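Before pulling any lever, it helps to model what unit cost actually looks like as a function of routing mix and cache hit rate. A minimal sketch, where the per-1K-token rates are illustrative placeholders rather than real provider prices:

```python
# Sketch: blended cost per request given a routing mix and cache hit rate.
# Rates below are illustrative placeholders; check current provider pricing.
RATES = {  # USD per 1K tokens: (input, output)
    "small_model": (0.0005, 0.0015),
    "large_model": (0.01, 0.03),
}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    p_in, p_out = RATES[model]
    return prompt_tokens / 1000 * p_in + completion_tokens / 1000 * p_out

def blended_cost(prompt_t: int, completion_t: int,
                 large_share: float, cache_hit_rate: float) -> float:
    # Cache hits cost approximately nothing; misses split across models.
    small = request_cost("small_model", prompt_t, completion_t)
    large = request_cost("large_model", prompt_t, completion_t)
    miss = large_share * large + (1 - large_share) * small
    return (1 - cache_hit_rate) * miss

if __name__ == "__main__":
    # Routing 20% to the large model with a 30% cache hit rate:
    print(round(blended_cost(1000, 300, large_share=0.2, cache_hit_rate=0.3), 6))
```

Plugging your own traffic shape into a model like this shows which lever moves the bill most before you invest engineering time.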

Reference Architecture

graph TD
  U[Users] --> API[FastAPI Gateway]
  API --> ROUTER[Model Router]
  ROUTER --> CACHE[(DynamoDB/Redis Cache)]
  ROUTER --> SMALL[Small Model]
  ROUTER --> LARGE[Large Model]
  API --> BATCH[SQS Batch Queue]
  BATCH --> WORKER[Batch Worker]
  API --> METRICS[Usage Collector]
  METRICS --> CW[CloudWatch]
  CW --> BUD[AWS Budgets + Alerts]

Core Optimization Levers

  1. Token optimization
  2. Dynamic model routing
  3. Semantic caching
  4. Batching and async processing
  5. Prompt compression and context pruning
  6. Budget guardrails and automated alerts

Step-by-Step Tutorial

1) Create usage and cache tables

Bash:

export AWS_REGION=us-east-1
export PROJECT=llm-cost-ops
export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

aws dynamodb create-table \
  --table-name ${PROJECT}-usage \
  --attribute-definitions AttributeName=pk,AttributeType=S AttributeName=ts,AttributeType=S \
  --key-schema AttributeName=pk,KeyType=HASH AttributeName=ts,KeyType=RANGE \
  --billing-mode PAY_PER_REQUEST \
  --sse-specification Enabled=true

aws dynamodb create-table \
  --table-name ${PROJECT}-cache \
  --attribute-definitions AttributeName=cache_key,AttributeType=S \
  --key-schema AttributeName=cache_key,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --sse-specification Enabled=true

PowerShell:

$env:AWS_REGION = "us-east-1"
$env:PROJECT = "llm-cost-ops"
$env:ACCOUNT_ID = (aws sts get-caller-identity --query Account --output text)

aws dynamodb create-table `
  --table-name "$($env:PROJECT)-usage" `
  --attribute-definitions AttributeName=pk,AttributeType=S AttributeName=ts,AttributeType=S `
  --key-schema AttributeName=pk,KeyType=HASH AttributeName=ts,KeyType=RANGE `
  --billing-mode PAY_PER_REQUEST `
  --sse-specification Enabled=true

aws dynamodb create-table `
  --table-name "$($env:PROJECT)-cache" `
  --attribute-definitions AttributeName=cache_key,AttributeType=S `
  --key-schema AttributeName=cache_key,KeyType=HASH `
  --billing-mode PAY_PER_REQUEST `
  --sse-specification Enabled=true

2) Add budget and alerts

aws budgets create-budget \
  --account-id "$ACCOUNT_ID" \
  --budget '{
    "BudgetName":"llm-monthly-prod",
    "BudgetLimit":{"Amount":"2000","Unit":"USD"},
    "TimeUnit":"MONTHLY",
    "BudgetType":"COST"
  }'

Add notification thresholds at 50%, 80%, and 100%, routed to SNS, email, or your on-call channel.
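One way to attach such a threshold, assuming an SNS topic already exists (the topic ARN below is a placeholder):

```shell
# Attach an 80% actual-spend notification to the budget; repeat for 50%/100%.
aws budgets create-notification \
  --account-id "$ACCOUNT_ID" \
  --budget-name llm-monthly-prod \
  --notification NotificationType=ACTUAL,ComparisonOperator=GREATER_THAN,Threshold=80,ThresholdType=PERCENTAGE \
  --subscribers SubscriptionType=SNS,Address=arn:aws:sns:us-east-1:123456789012:llm-budget-alerts
```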

3) Implement token caps and routing policy

router.py

from dataclasses import dataclass

@dataclass
class RequestContext:
    task_type: str
    user_tier: str
    prompt_tokens_estimate: int


def choose_model(ctx: RequestContext) -> str:
    # Simple policy; store in config/SSM in production
    if ctx.task_type in {"classification", "summarization"} and ctx.prompt_tokens_estimate < 1200:
        return "small_model"
    if ctx.user_tier == "enterprise":
        return "large_model"
    return "small_model"


def enforce_token_cap(prompt_tokens: int, max_allowed: int = 4000) -> None:
    if prompt_tokens > max_allowed:
        raise ValueError(f"Prompt token estimate {prompt_tokens} exceeds cap {max_allowed}")
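The comment above suggests storing the policy in config/SSM. One way to do that, sketched here with a JSON policy document whose field names are illustrative (in production, load it from SSM Parameter Store or a versioned config file so routing rules can be reviewed and tested like any other change):

```python
import json

# Illustrative policy document; field names are assumptions, not a standard.
POLICY_JSON = """
{
  "version": 3,
  "rules": [
    {"task_types": ["classification", "summarization"],
     "max_prompt_tokens": 1200, "model": "small_model"},
    {"user_tiers": ["enterprise"], "model": "large_model"}
  ],
  "default_model": "small_model"
}
"""

def choose_model_from_policy(policy: dict, task_type: str,
                             user_tier: str, prompt_tokens: int) -> str:
    # Rules are evaluated in order; first match wins.
    for rule in policy["rules"]:
        if "task_types" in rule:
            if (task_type in rule["task_types"]
                    and prompt_tokens < rule["max_prompt_tokens"]):
                return rule["model"]
        elif "user_tiers" in rule and user_tier in rule["user_tiers"]:
            return rule["model"]
    return policy["default_model"]

policy = json.loads(POLICY_JSON)
print(choose_model_from_policy(policy, "classification", "standard", 800))
print(choose_model_from_policy(policy, "qa", "enterprise", 3000))
```

Because the policy is data rather than code, a routing change becomes a config diff you can version, review, and roll back.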

4) Add semantic cache before calling model

cache_layer.py

import hashlib
import json
import boto3
from datetime import datetime, timedelta, timezone

ddb = boto3.resource("dynamodb")
cache_table = ddb.Table("llm-cost-ops-cache")


def make_key(model: str, prompt: str) -> str:
    return hashlib.sha256(f"{model}:{prompt}".encode("utf-8")).hexdigest()


def get_cached(model: str, prompt: str):
    key = make_key(model, prompt)
    item = cache_table.get_item(Key={"cache_key": key}).get("Item")
    if not item:
        return None
    if item["expires_at"] < int(datetime.now(timezone.utc).timestamp()):
        return None
    return item["response"]


def put_cached(model: str, prompt: str, response: str, ttl_seconds: int = 3600):
    key = make_key(model, prompt)
    expires_at = int((datetime.now(timezone.utc) + timedelta(seconds=ttl_seconds)).timestamp())
    cache_table.put_item(Item={
        "cache_key": key,
        "response": response,
        "expires_at": expires_at
    })
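Two refinements are worth noting. First, DynamoDB's native TTL feature (enabled with `aws dynamodb update-time-to-live` on the `expires_at` attribute) deletes expired items without application code. Second, exact-hash caching only hits on byte-identical prompts; a light normalization step before hashing raises the hit rate for trivially different phrasings. A sketch:

```python
import hashlib
import re

def normalize_prompt(prompt: str) -> str:
    # Collapse runs of whitespace and lowercase, so trivially different
    # phrasings of the same prompt map to one cache entry.
    return re.sub(r"\s+", " ", prompt).strip().lower()

def make_normalized_key(model: str, prompt: str) -> str:
    canonical = normalize_prompt(prompt)
    return hashlib.sha256(f"{model}:{canonical}".encode("utf-8")).hexdigest()

# Both variants now resolve to the same cache key:
print(make_normalized_key("small_model", "What is  FinOps?") ==
      make_normalized_key("small_model", "what is finops?"))
```

True semantic caching (embedding similarity over a vector index) goes further, but normalization is a cheap first step with no false positives beyond case and spacing.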

5) Prompt compression pattern

prompt_compression.py


def compress_context(messages: list[str], max_chars: int = 8000) -> str:
    # Keep the system instruction plus the most recent turns, without
    # duplicating the system message or truncating it away.
    if not messages:
        return ""
    system, history = messages[0], messages[1:]
    recent = history[-8:]
    merged = "\n".join([system] + recent)
    if len(merged) <= max_chars:
        return merged
    # Trim the oldest recent turns first; always keep the system prompt.
    budget = max(0, max_chars - len(system) - 1)
    return system + "\n" + "\n".join(recent)[-budget:]

In production, replace this with summarization-based memory compaction and quality checks.

6) Batch asynchronous work

aws sqs create-queue --queue-name llm-cost-ops-batch --attributes VisibilityTimeout=300

Use synchronous response only where necessary. For report generation or large analysis jobs, enqueue and process asynchronously with worker autoscaling.
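When enqueueing, note that SQS `SendMessageBatch` accepts at most 10 entries per call, so jobs need to be chunked. A sketch of the chunking (the `boto3` call is commented out because it requires AWS credentials and a real queue URL):

```python
import json

def chunk_entries(jobs: list[dict], batch_size: int = 10) -> list[list[dict]]:
    # SQS SendMessageBatch accepts at most 10 entries per request.
    entries = [
        {"Id": str(i), "MessageBody": json.dumps(job)}
        for i, job in enumerate(jobs)
    ]
    return [entries[i:i + batch_size] for i in range(0, len(entries), batch_size)]

# In production (requires AWS credentials and the real queue URL):
# import boto3
# sqs = boto3.client("sqs")
# for batch in chunk_entries(jobs):
#     sqs.send_message_batch(QueueUrl=queue_url, Entries=batch)

batches = chunk_entries([{"report_id": i} for i in range(23)])
print([len(b) for b in batches])  # [10, 10, 3]
```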

7) FastAPI integration example

app.py

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from router import choose_model, enforce_token_cap, RequestContext

app = FastAPI()

class Ask(BaseModel):
    prompt: str
    task_type: str = "qa"
    user_tier: str = "standard"


def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text;
    # swap in the provider's tokenizer for accurate counts.
    return max(1, len(text) // 4)

@app.post("/ask")
def ask(req: Ask):
    est = estimate_tokens(req.prompt)
    enforce_token_cap(est, max_allowed=4000)
    model = choose_model(RequestContext(req.task_type, req.user_tier, est))

    # Plug in cache lookup + model call + usage logging
    return {"selected_model": model, "estimated_tokens": est}

8) Usage tracking and cost reporting script

usage_report.py

from collections import defaultdict

records = [
    {"team": "search", "prompt_tokens": 12000, "completion_tokens": 3000},
    {"team": "support", "prompt_tokens": 8000, "completion_tokens": 1500},
]

summary = defaultdict(lambda: {"prompt": 0, "completion": 0})
for r in records:
    summary[r["team"]]["prompt"] += r["prompt_tokens"]
    summary[r["team"]]["completion"] += r["completion_tokens"]

for team, s in summary.items():
    print(team, s)
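Token sums become dollar attribution once multiplied by per-token rates. Extending the script above with illustrative rates (placeholders, not real provider prices):

```python
from collections import defaultdict

# Illustrative per-1K-token rates; substitute current provider pricing.
RATE_IN, RATE_OUT = 0.01, 0.03  # USD per 1K prompt / completion tokens

records = [
    {"team": "search", "prompt_tokens": 12000, "completion_tokens": 3000},
    {"team": "support", "prompt_tokens": 8000, "completion_tokens": 1500},
]

costs = defaultdict(float)
for r in records:
    costs[r["team"]] += (r["prompt_tokens"] / 1000 * RATE_IN
                         + r["completion_tokens"] / 1000 * RATE_OUT)

for team, usd in sorted(costs.items()):
    print(f"{team}: ${usd:.2f}")
```

In a real deployment the records come from the usage table created in step 1 and are partitioned by team and day, so the same aggregation powers showback dashboards and anomaly alerts.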

Security and Reliability Considerations

  • Keep provider API keys in Secrets Manager or SSM Parameter Store.
  • Use IAM least privilege for usage datastore updates.
  • Redact sensitive payloads before logging.
  • Protect against prompt injection before retrieval/tool calls.

Monitoring and KPIs

Track at minimum:

  • cost per 1,000 requests
  • average prompt tokens and completion tokens
  • cache hit rate
  • routed-to-large-model percentage
  • p95 latency and error rate

Set alarms for:

  • daily spend anomaly
  • sudden token-per-request increase
  • cache hit rate drop
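A daily spend anomaly check does not need machine learning to start with; a trailing mean-plus-z-score baseline catches most spikes. A minimal sketch (the z threshold and window length are tunable assumptions):

```python
import statistics

def is_spend_anomaly(history: list[float], today: float, z: float = 3.0) -> bool:
    # Flag today's spend if it exceeds mean + z * stdev of the trailing window.
    if len(history) < 7:
        return False  # not enough data for a baseline yet
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1e-9
    return today > mean + z * stdev

daily_spend = [41.0, 39.5, 40.2, 42.1, 40.8, 39.9, 41.5]
print(is_spend_anomaly(daily_spend, 42.0))   # normal variance
print(is_spend_anomaly(daily_spend, 80.0))   # likely a spike
```

CloudWatch's built-in anomaly detection can replace this once metrics are flowing, but a scripted baseline is enough to wire the first alert.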

Cost Optimization Playbook (Practical)

  • Cap tokens aggressively for non-critical flows.
  • Route easy tasks to smaller/cheaper models.
  • Cache deterministic outputs.
  • Batch offline jobs.
  • Compress historical context.
  • Move long attachments to retrieval references instead of inline prompt stuffing.

Pricing note: verify current rates directly from provider pages and AWS pricing pages before setting budget thresholds.

Production-readiness checklist

  • Token caps enforced server-side
  • Routing rules versioned and tested
  • Cache TTL strategy defined by use case
  • Per-team usage attribution available
  • Budget alerts active and routed to on-call
  • Anomaly detection baseline established
  • Runbook for cost spike response completed

Final takeaway

Cost optimization is an architecture problem, not a single prompt tweak. Teams that combine routing, caching, batching, and guardrails usually achieve major savings without hurting user experience.