LLM Cost Optimization in Production
Scenario
A SaaS company sees its LLM API bill increasing every month and needs a practical strategy to reduce costs without hurting user experience.
Business Problem
LLM bills typically grow because of a few avoidable patterns:
- oversized prompts and weak token caps
- using premium models for simple requests
- no response caching
- repeated synchronous calls that should be batched
- no per-team cost attribution
The goal is to reduce unit cost per request while protecting answer quality and latency SLOs.
Reference Architecture
Requests pass through a token-capped router, then a semantic cache; only cache misses reach the model provider, and every call writes a usage event to DynamoDB for attribution, budgeting, and alerting.
Core Optimization Levers
- Token optimization
- Dynamic model routing
- Semantic caching
- Batching and async processing
- Prompt compression and context pruning
- Budget guardrails and automated alerts
Step-by-Step Tutorial
1) Create usage and cache tables
Bash (Linux/macOS):
export AWS_REGION=us-east-1
export PROJECT=llm-cost-ops
export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
aws dynamodb create-table \
--table-name ${PROJECT}-usage \
--attribute-definitions AttributeName=pk,AttributeType=S AttributeName=ts,AttributeType=S \
--key-schema AttributeName=pk,KeyType=HASH AttributeName=ts,KeyType=RANGE \
--billing-mode PAY_PER_REQUEST \
--sse-specification Enabled=true
aws dynamodb create-table \
--table-name ${PROJECT}-cache \
--attribute-definitions AttributeName=cache_key,AttributeType=S \
--key-schema AttributeName=cache_key,KeyType=HASH \
--billing-mode PAY_PER_REQUEST \
--sse-specification Enabled=true
PowerShell (Windows):
$env:AWS_REGION = "us-east-1"
$env:PROJECT = "llm-cost-ops"
$env:ACCOUNT_ID = (aws sts get-caller-identity --query Account --output text)
aws dynamodb create-table `
--table-name "$($env:PROJECT)-usage" `
--attribute-definitions AttributeName=pk,AttributeType=S AttributeName=ts,AttributeType=S `
--key-schema AttributeName=pk,KeyType=HASH AttributeName=ts,KeyType=RANGE `
--billing-mode PAY_PER_REQUEST `
--sse-specification Enabled=true
aws dynamodb create-table `
--table-name "$($env:PROJECT)-cache" `
--attribute-definitions AttributeName=cache_key,AttributeType=S `
--key-schema AttributeName=cache_key,KeyType=HASH `
--billing-mode PAY_PER_REQUEST `
--sse-specification Enabled=true
2) Add budget and alerts
aws budgets create-budget \
--account-id "$ACCOUNT_ID" \
--budget '{
"BudgetName":"llm-monthly-prod",
"BudgetLimit":{"Amount":"2000","Unit":"USD"},
"TimeUnit":"MONTHLY",
"BudgetType":"COST"
}'
Add notification thresholds at 50%, 80%, and 100% of the budget, and route alerts to SNS, email, or your on-call rotation.
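As a sketch of those notifications (assuming boto3 is available; the SNS topic ARN below is a placeholder), each threshold maps to one CreateNotification call against the budget created above:

```python
def build_notifications(thresholds=(50, 80, 100)):
    # One ACTUAL-spend notification per threshold percentage of the budget.
    return [
        {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": float(t),
            "ThresholdType": "PERCENTAGE",
        }
        for t in thresholds
    ]

if __name__ == "__main__":
    import boto3  # kept inside the guard so the module imports without AWS deps

    budgets = boto3.client("budgets")
    account_id = boto3.client("sts").get_caller_identity()["Account"]
    for notification in build_notifications():
        budgets.create_notification(
            AccountId=account_id,
            BudgetName="llm-monthly-prod",
            Notification=notification,
            # Placeholder topic ARN; an EMAIL subscriber works here as well.
            Subscribers=[{
                "SubscriptionType": "SNS",
                "Address": "arn:aws:sns:us-east-1:123456789012:llm-budget-alerts",
            }],
        )
```

Creating each threshold as its own notification keeps alert routing independent per severity level.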
3) Implement token caps and routing policy
router.py
from dataclasses import dataclass

@dataclass
class RequestContext:
    task_type: str
    user_tier: str
    prompt_tokens_estimate: int

def choose_model(ctx: RequestContext) -> str:
    # Simple policy; store in config/SSM in production
    if ctx.task_type in {"classification", "summarization"} and ctx.prompt_tokens_estimate < 1200:
        return "small_model"
    if ctx.user_tier == "enterprise":
        return "large_model"
    return "small_model"

def enforce_token_cap(prompt_tokens: int, max_allowed: int = 4000) -> None:
    if prompt_tokens > max_allowed:
        raise ValueError(f"Prompt token estimate {prompt_tokens} exceeds cap {max_allowed}")
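To see why routing matters, a rough back-of-envelope comparison is useful. The per-1K-token prices here are placeholders, not any provider's real rates; check current pricing pages before relying on the numbers:

```python
# Hypothetical per-1K-token prices; substitute real rates from your provider.
PRICES = {
    "small_model": {"prompt": 0.0005, "completion": 0.0015},
    "large_model": {"prompt": 0.01, "completion": 0.03},
}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    # Cost = (tokens / 1000) * rate, summed over prompt and completion.
    p = PRICES[model]
    return (prompt_tokens / 1000) * p["prompt"] + (completion_tokens / 1000) * p["completion"]

# A 1,000-token prompt with a 300-token answer:
small = request_cost("small_model", 1000, 300)
large = request_cost("large_model", 1000, 300)
print(f"small: ${small:.5f}, large: ${large:.5f}, ratio: {large / small:.0f}x")
```

With these illustrative rates, every easy request routed to the small model costs a small fraction of the large-model price, which is why the routing policy defaults to the cheaper model.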
4) Add semantic cache before calling model
cache_layer.py
import hashlib
import boto3
from datetime import datetime, timedelta, timezone

ddb = boto3.resource("dynamodb")
cache_table = ddb.Table("llm-cost-ops-cache")

def make_key(model: str, prompt: str) -> str:
    return hashlib.sha256(f"{model}:{prompt}".encode("utf-8")).hexdigest()

def get_cached(model: str, prompt: str):
    key = make_key(model, prompt)
    item = cache_table.get_item(Key={"cache_key": key}).get("Item")
    if not item:
        return None
    if item["expires_at"] < int(datetime.now(timezone.utc).timestamp()):
        return None
    return item["response"]

def put_cached(model: str, prompt: str, response: str, ttl_seconds: int = 3600):
    key = make_key(model, prompt)
    expires_at = int((datetime.now(timezone.utc) + timedelta(seconds=ttl_seconds)).timestamp())
    # expires_at can also drive DynamoDB's native TTL to auto-delete stale items
    cache_table.put_item(Item={
        "cache_key": key,
        "response": response,
        "expires_at": expires_at,
    })
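Note that the cache above is exact-match: two prompts differing by one character miss. A truly semantic cache compares embeddings instead. A minimal in-memory sketch, assuming an `embed` callable you would back with a real embedding model:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two vectors; 0.0 if either is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, embed, threshold: float = 0.95):
        self.embed = embed          # callable: str -> list[float] (your embedding model)
        self.threshold = threshold  # minimum similarity to count as a hit
        self.entries = []           # list of (vector, response) pairs

    def get(self, prompt: str):
        vec = self.embed(prompt)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best and cosine(vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, prompt: str, response: str):
        self.entries.append((self.embed(prompt), response))
```

In production the linear scan would be replaced by a vector store or ANN index, and the threshold tuned against quality checks, but the hit/miss logic stays the same.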
5) Prompt compression pattern
prompt_compression.py
def compress_context(messages: list[str], max_chars: int = 8000) -> str:
    # Keep critical system instructions plus the latest messages.
    if not messages:
        return ""
    system = messages[0]
    recent = messages[1:][-8:]  # slice after the system message to avoid duplicating it
    merged = "\n".join([system] + recent)
    # Note: left-truncation can clip the system text; cap prompts upstream too.
    return merged[-max_chars:]
In production, replace this with summarization-based memory compaction and quality checks.
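As an intermediate step before full summarization, a token-budget-aware pruner can keep the system prompt plus as many recent messages as fit, reusing the rough chars/4 token estimate from this tutorial:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text
    return max(1, len(text) // 4)

def prune_to_budget(messages: list[str], max_tokens: int = 2000) -> list[str]:
    # Always keep the system message (index 0), then add the most recent
    # messages, newest first, while the token budget allows.
    if not messages:
        return []
    system = messages[0]
    budget = max_tokens - estimate_tokens(system)
    kept: list[str] = []
    for msg in reversed(messages[1:]):
        cost = estimate_tokens(msg)
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return [system] + list(reversed(kept))
```

Unlike the character-based truncation above, this never clips the system message and drops whole messages rather than cutting one mid-sentence.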
6) Batch asynchronous work
aws sqs create-queue --queue-name llm-cost-ops-batch --attributes VisibilityTimeout=300
Reserve synchronous responses for flows where a user is actively waiting. For report generation or large analysis jobs, enqueue the work and process it asynchronously with autoscaled workers; keep the queue's visibility timeout above your worst-case processing time so messages are not redelivered mid-job.
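A sketch of the enqueue side: SQS SendMessageBatch accepts at most 10 messages per call, so jobs are chunked first. The boto3 calls sit under a main guard, and the job payloads are illustrative:

```python
import json

def chunk(items, size=10):
    # SQS SendMessageBatch accepts at most 10 entries per request.
    for i in range(0, len(items), size):
        yield items[i:i + size]

def to_batch_entries(jobs):
    # Ids only need to be unique within a single SendMessageBatch call.
    return [
        [{"Id": str(i), "MessageBody": json.dumps(job)} for i, job in enumerate(batch)]
        for batch in chunk(jobs)
    ]

if __name__ == "__main__":
    import boto3

    sqs = boto3.client("sqs")
    queue_url = sqs.get_queue_url(QueueName="llm-cost-ops-batch")["QueueUrl"]
    jobs = [{"report_id": n} for n in range(25)]  # placeholder job payloads
    for entries in to_batch_entries(jobs):
        sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)
```

Batching cuts per-request API overhead on the queue side; the workers draining the queue can then group prompts for provider-side batch endpoints where available.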
7) FastAPI integration example
app.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from router import choose_model, enforce_token_cap, RequestContext

app = FastAPI()

class Ask(BaseModel):
    prompt: str
    task_type: str = "qa"
    user_tier: str = "standard"

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text
    return max(1, len(text) // 4)

@app.post("/ask")
def ask(req: Ask):
    est = estimate_tokens(req.prompt)
    try:
        enforce_token_cap(est, max_allowed=4000)
    except ValueError as exc:
        # Surface cap violations as a client error instead of a 500
        raise HTTPException(status_code=400, detail=str(exc))
    model = choose_model(RequestContext(req.task_type, req.user_tier, est))
    # Plug in cache lookup + model call + usage logging
    return {"selected_model": model, "estimated_tokens": est}
8) Usage tracking and cost reporting script
usage_report.py
from collections import defaultdict

records = [
    {"team": "search", "prompt_tokens": 12000, "completion_tokens": 3000},
    {"team": "support", "prompt_tokens": 8000, "completion_tokens": 1500},
]

summary = defaultdict(lambda: {"prompt": 0, "completion": 0})
for r in records:
    summary[r["team"]]["prompt"] += r["prompt_tokens"]
    summary[r["team"]]["completion"] += r["completion_tokens"]

for team, s in summary.items():
    print(team, s)
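The token summary turns into dollar attribution by applying per-token rates. The blended rates below are placeholders; plug in your provider's current pricing:

```python
# Hypothetical blended rates per 1,000 tokens; replace with real pricing.
PROMPT_RATE = 0.003
COMPLETION_RATE = 0.006

def team_cost(prompt_tokens: int, completion_tokens: int) -> float:
    # Dollar cost from token counts at the per-1K-token rates above.
    return (prompt_tokens / 1000) * PROMPT_RATE + (completion_tokens / 1000) * COMPLETION_RATE

summary = {
    "search": {"prompt": 12000, "completion": 3000},
    "support": {"prompt": 8000, "completion": 1500},
}

for team, s in summary.items():
    print(f"{team}: ${team_cost(s['prompt'], s['completion']):.2f}")
```

Per-team dollar figures are what make chargeback conversations and per-team budgets actionable, rather than raw token counts.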
Security and Reliability Considerations
- Keep provider API keys in Secrets Manager or SSM Parameter Store.
- Use IAM least privilege for usage datastore updates.
- Redact sensitive payloads before logging.
- Protect against prompt injection before retrieval/tool calls.
Monitoring and KPIs
Track at minimum:
- cost per 1,000 requests
- average prompt tokens and completion tokens
- cache hit rate
- routed-to-large-model percentage
- p95 latency and error rate
Set alarms for:
- daily spend anomaly
- sudden token-per-request increase
- cache hit rate drop
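A minimal sketch of the daily-spend anomaly check: a z-score over a trailing window, with the threshold and window size as tunable assumptions. Managed options like AWS Cost Anomaly Detection can replace this in production:

```python
import statistics

def is_spend_anomaly(history: list[float], today: float, z_threshold: float = 3.0) -> bool:
    # Flag today's spend if it sits more than z_threshold standard deviations
    # above the trailing mean. Requires at least a week of history.
    if len(history) < 7:
        return False
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today > mean * 1.5  # flat history: fall back to a ratio check
    return (today - mean) / stdev > z_threshold

# Example: a week around $60/day, then a $180 day
history = [58.0, 61.5, 60.2, 59.8, 62.0, 60.5, 61.0]
```

The same shape of check works for token-per-request increases and cache hit-rate drops: maintain a trailing baseline per metric and alert on large deviations.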
Cost Optimization Playbook (Practical)
- Cap tokens aggressively for non-critical flows.
- Route easy tasks to smaller/cheaper models.
- Cache deterministic outputs.
- Batch offline jobs.
- Compress historical context.
- Move long attachments to retrieval references instead of inline prompt stuffing.
Pricing note: verify current rates directly from provider pages and AWS pricing pages before setting budget thresholds.
Production-readiness checklist
- Token caps enforced server-side
- Routing rules versioned and tested
- Cache TTL strategy defined by use case
- Per-team usage attribution available
- Budget alerts active and routed to on-call
- Anomaly detection baseline established
- Runbook for cost spike response completed
Final takeaway
Cost optimization is an architecture problem, not a single prompt tweak. Teams that combine routing, caching, batching, and guardrails usually achieve major savings without hurting user experience.