LLM Cost Optimization in Production
Scenario
A SaaS company sees its LLM API bill increasing every month and needs a practical strategy to reduce costs without hurting user experience.
Business Problem
LLM bills typically grow because of a few avoidable patterns:
- oversized prompts and weak token caps
- using premium models for simple requests
- no response caching
- repeated synchronous calls that should be batched
- no per-team cost attribution
The goal is to reduce unit cost per request while protecting answer quality and latency SLOs.
Reference Architecture
Requests pass through a token-capped router, then a semantic cache; only cache misses reach the model provider, and every call writes a usage event to DynamoDB for attribution, budgeting, and alerting.
Core Optimization Levers
- Token optimization
- Dynamic model routing
- Semantic caching
- Batching and async processing
- Prompt compression and context pruning
- Budget guardrails and automated alerts
Step-by-Step Tutorial
1) Create usage and cache tables
Bash (Linux/macOS):
export AWS_REGION=us-east-1
export PROJECT=llm-cost-ops
export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
aws dynamodb create-table \
--table-name ${PROJECT}-usage \
--attribute-definitions AttributeName=pk,AttributeType=S AttributeName=ts,AttributeType=S \
--key-schema AttributeName=pk,KeyType=HASH AttributeName=ts,KeyType=RANGE \
--billing-mode PAY_PER_REQUEST \
--sse-specification Enabled=true
aws dynamodb create-table \
--table-name ${PROJECT}-cache \
--attribute-definitions AttributeName=cache_key,AttributeType=S \
--key-schema AttributeName=cache_key,KeyType=HASH \
--billing-mode PAY_PER_REQUEST \
--sse-specification Enabled=true
PowerShell (Windows):
$env:AWS_REGION = "us-east-1"
$env:PROJECT = "llm-cost-ops"
$env:ACCOUNT_ID = (aws sts get-caller-identity --query Account --output text)
aws dynamodb create-table `
--table-name "$($env:PROJECT)-usage" `
--attribute-definitions AttributeName=pk,AttributeType=S AttributeName=ts,AttributeType=S `
--key-schema AttributeName=pk,KeyType=HASH AttributeName=ts,KeyType=RANGE `
--billing-mode PAY_PER_REQUEST `
--sse-specification Enabled=true
aws dynamodb create-table `
--table-name "$($env:PROJECT)-cache" `
--attribute-definitions AttributeName=cache_key,AttributeType=S `
--key-schema AttributeName=cache_key,KeyType=HASH `
--billing-mode PAY_PER_REQUEST `
--sse-specification Enabled=true
2) Add budget and alerts
aws budgets create-budget \
--account-id "$ACCOUNT_ID" \
--budget '{
"BudgetName":"llm-monthly-prod",
"BudgetLimit":{"Amount":"2000","Unit":"USD"},
"TimeUnit":"MONTHLY",
"BudgetType":"COST"
}'
Add notification thresholds at 50%, 80%, and 100% of the budget, and route alerts to SNS, email, or your on-call rotation.
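As a sketch of those notifications (assuming boto3 is available; the SNS topic ARN below is a placeholder), each threshold maps to one CreateNotification call against the budget created above:

```python
def build_notifications(thresholds=(50, 80, 100)):
    # One ACTUAL-spend notification per threshold percentage of the budget.
    return [
        {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": float(t),
            "ThresholdType": "PERCENTAGE",
        }
        for t in thresholds
    ]

if __name__ == "__main__":
    import boto3  # kept inside the guard so the module imports without AWS deps

    budgets = boto3.client("budgets")
    account_id = boto3.client("sts").get_caller_identity()["Account"]
    for notification in build_notifications():
        budgets.create_notification(
            AccountId=account_id,
            BudgetName="llm-monthly-prod",
            Notification=notification,
            # Placeholder topic ARN; an EMAIL subscriber works here as well.
            Subscribers=[{
                "SubscriptionType": "SNS",
                "Address": "arn:aws:sns:us-east-1:123456789012:llm-budget-alerts",
            }],
        )
```

Creating each threshold as its own notification keeps alert routing independent per severity level.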
3) Implement token caps and routing policy
router.py
from dataclasses import dataclass

@dataclass
class RequestContext:
    task_type: str
    user_tier: str
    prompt_tokens_estimate: int

def choose_model(ctx: RequestContext) -> str:
    # Simple policy; store in config/SSM in production
    if ctx.task_type in {"classification", "summarization"} and ctx.prompt_tokens_estimate < 1200:
        return "small_model"
    if ctx.user_tier == "enterprise":
        return "large_model"
    return "small_model"

def enforce_token_cap(prompt_tokens: int, max_allowed: int = 4000) -> None:
    if prompt_tokens > max_allowed:
        raise ValueError(f"Prompt token estimate {prompt_tokens} exceeds cap {max_allowed}")
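To see why routing matters, a rough back-of-envelope comparison is useful. The per-1K-token prices here are placeholders, not any provider's real rates; check current pricing pages before relying on the numbers:

```python
# Hypothetical per-1K-token prices; substitute real rates from your provider.
PRICES = {
    "small_model": {"prompt": 0.0005, "completion": 0.0015},
    "large_model": {"prompt": 0.01, "completion": 0.03},
}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    # Cost = (tokens / 1000) * rate, summed over prompt and completion.
    p = PRICES[model]
    return (prompt_tokens / 1000) * p["prompt"] + (completion_tokens / 1000) * p["completion"]

# A 1,000-token prompt with a 300-token answer:
small = request_cost("small_model", 1000, 300)
large = request_cost("large_model", 1000, 300)
print(f"small: ${small:.5f}, large: ${large:.5f}, ratio: {large / small:.0f}x")
```

With these illustrative rates, every easy request routed to the small model costs a small fraction of the large-model price, which is why the routing policy defaults to the cheaper model.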
4) Add semantic cache before calling model
cache_layer.py
import hashlib
import boto3
from datetime import datetime, timedelta, timezone

ddb = boto3.resource("dynamodb")
cache_table = ddb.Table("llm-cost-ops-cache")

def make_key(model: str, prompt: str) -> str:
    return hashlib.sha256(f"{model}:{prompt}".encode("utf-8")).hexdigest()

def get_cached(model: str, prompt: str):
    key = make_key(model, prompt)
    item = cache_table.get_item(Key={"cache_key": key}).get("Item")
    if not item:
        return None
    if item["expires_at"] < int(datetime.now(timezone.utc).timestamp()):
        return None
    return item["response"]

def put_cached(model: str, prompt: str, response: str, ttl_seconds: int = 3600):
    key = make_key(model, prompt)
    expires_at = int((datetime.now(timezone.utc) + timedelta(seconds=ttl_seconds)).timestamp())
    # expires_at can also drive DynamoDB's native TTL to auto-delete stale items
    cache_table.put_item(Item={
        "cache_key": key,
        "response": response,
        "expires_at": expires_at,
    })
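Note that the cache above is exact-match: two prompts differing by one character miss. A truly semantic cache compares embeddings instead. A minimal in-memory sketch, assuming an `embed` callable you would back with a real embedding model:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two vectors; 0.0 if either is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, embed, threshold: float = 0.95):
        self.embed = embed          # callable: str -> list[float] (your embedding model)
        self.threshold = threshold  # minimum similarity to count as a hit
        self.entries = []           # list of (vector, response) pairs

    def get(self, prompt: str):
        vec = self.embed(prompt)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best and cosine(vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, prompt: str, response: str):
        self.entries.append((self.embed(prompt), response))
```

In production the linear scan would be replaced by a vector store or ANN index, and the threshold tuned against quality checks, but the hit/miss logic stays the same.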
5) Prompt compression pattern
prompt_compression.py
def compress_context(messages: list[str], max_chars: int = 8000) -> str:
    # Keep critical system instructions plus the latest messages.
    if not messages:
        return ""
    system = messages[0]
    recent = messages[1:][-8:]  # slice after the system message to avoid duplicating it
    merged = "\n".join([system] + recent)
    # Note: left-truncation can clip the system text; cap prompts upstream too.
    return merged[-max_chars:]
In production, replace this with summarization-based memory compaction and quality checks.
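As an intermediate step before full summarization, a token-budget-aware pruner can keep the system prompt plus as many recent messages as fit, reusing the rough chars/4 token estimate from this tutorial:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text
    return max(1, len(text) // 4)

def prune_to_budget(messages: list[str], max_tokens: int = 2000) -> list[str]:
    # Always keep the system message (index 0), then add the most recent
    # messages, newest first, while the token budget allows.
    if not messages:
        return []
    system = messages[0]
    budget = max_tokens - estimate_tokens(system)
    kept: list[str] = []
    for msg in reversed(messages[1:]):
        cost = estimate_tokens(msg)
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return [system] + list(reversed(kept))
```

Unlike the character-based truncation above, this never clips the system message and drops whole messages rather than cutting one mid-sentence.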
6) Batch asynchronous work
aws sqs create-queue --queue-name llm-cost-ops-batch --attributes VisibilityTimeout=300
Reserve synchronous responses for flows where a user is actively waiting. For report generation or large analysis jobs, enqueue the work and process it asynchronously with autoscaled workers; keep the queue's visibility timeout above your worst-case processing time so messages are not redelivered mid-job.
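A sketch of the enqueue side: SQS SendMessageBatch accepts at most 10 messages per call, so jobs are chunked first. The boto3 calls sit under a main guard, and the job payloads are illustrative:

```python
import json

def chunk(items, size=10):
    # SQS SendMessageBatch accepts at most 10 entries per request.
    for i in range(0, len(items), size):
        yield items[i:i + size]

def to_batch_entries(jobs):
    # Ids only need to be unique within a single SendMessageBatch call.
    return [
        [{"Id": str(i), "MessageBody": json.dumps(job)} for i, job in enumerate(batch)]
        for batch in chunk(jobs)
    ]

if __name__ == "__main__":
    import boto3

    sqs = boto3.client("sqs")
    queue_url = sqs.get_queue_url(QueueName="llm-cost-ops-batch")["QueueUrl"]
    jobs = [{"report_id": n} for n in range(25)]  # placeholder job payloads
    for entries in to_batch_entries(jobs):
        sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)
```

Batching cuts per-request API overhead on the queue side; the workers draining the queue can then group prompts for provider-side batch endpoints where available.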
7) FastAPI integration example
app.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from router import choose_model, enforce_token_cap, RequestContext

app = FastAPI()

class Ask(BaseModel):
    prompt: str
    task_type: str = "qa"
    user_tier: str = "standard"

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text
    return max(1, len(text) // 4)

@app.post("/ask")
def ask(req: Ask):
    est = estimate_tokens(req.prompt)
    try:
        enforce_token_cap(est, max_allowed=4000)
    except ValueError as exc:
        # Surface cap violations as a client error instead of a 500
        raise HTTPException(status_code=400, detail=str(exc))
    model = choose_model(RequestContext(req.task_type, req.user_tier, est))
    # Plug in cache lookup + model call + usage logging
    return {"selected_model": model, "estimated_tokens": est}
8) Usage tracking and cost reporting script
usage_report.py
from collections import defaultdict

records = [
    {"team": "search", "prompt_tokens": 12000, "completion_tokens": 3000},
    {"team": "support", "prompt_tokens": 8000, "completion_tokens": 1500},
]

summary = defaultdict(lambda: {"prompt": 0, "completion": 0})
for r in records:
    summary[r["team"]]["prompt"] += r["prompt_tokens"]
    summary[r["team"]]["completion"] += r["completion_tokens"]

for team, s in summary.items():
    print(team, s)
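The token summary turns into dollar attribution by applying per-token rates. The blended rates below are placeholders; plug in your provider's current pricing:

```python
# Hypothetical blended rates per 1,000 tokens; replace with real pricing.
PROMPT_RATE = 0.003
COMPLETION_RATE = 0.006

def team_cost(prompt_tokens: int, completion_tokens: int) -> float:
    # Dollar cost from token counts at the per-1K-token rates above.
    return (prompt_tokens / 1000) * PROMPT_RATE + (completion_tokens / 1000) * COMPLETION_RATE

summary = {
    "search": {"prompt": 12000, "completion": 3000},
    "support": {"prompt": 8000, "completion": 1500},
}

for team, s in summary.items():
    print(f"{team}: ${team_cost(s['prompt'], s['completion']):.2f}")
```

Per-team dollar figures are what make chargeback conversations and per-team budgets actionable, rather than raw token counts.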
Security and Reliability Considerations
- Keep provider API keys in Secrets Manager or SSM Parameter Store.
- Use IAM least privilege for usage datastore updates.
- Redact sensitive payloads before logging.
- Protect against prompt injection before retrieval/tool calls.
Monitoring and KPIs
Track at minimum:
- cost per 1,000 requests
- average prompt tokens and completion tokens
- cache hit rate
- routed-to-large-model percentage
- p95 latency and error rate
Set alarms for:
- daily spend anomaly
- sudden token-per-request increase
- cache hit rate drop
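A minimal sketch of the daily-spend anomaly check: a z-score over a trailing window, with the threshold and window size as tunable assumptions. Managed options like AWS Cost Anomaly Detection can replace this in production:

```python
import statistics

def is_spend_anomaly(history: list[float], today: float, z_threshold: float = 3.0) -> bool:
    # Flag today's spend if it sits more than z_threshold standard deviations
    # above the trailing mean. Requires at least a week of history.
    if len(history) < 7:
        return False
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today > mean * 1.5  # flat history: fall back to a ratio check
    return (today - mean) / stdev > z_threshold

# Example: a week around $60/day, then a $180 day
history = [58.0, 61.5, 60.2, 59.8, 62.0, 60.5, 61.0]
```

The same shape of check works for token-per-request increases and cache hit-rate drops: maintain a trailing baseline per metric and alert on large deviations.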
Cost Optimization Playbook (Practical)
- Cap tokens aggressively for non-critical flows.
- Route easy tasks to smaller/cheaper models.
- Cache deterministic outputs.
- Batch offline jobs.
- Compress historical context.
- Move long attachments to retrieval references instead of inline prompt stuffing.
Pricing note: verify current rates directly from provider pages and AWS pricing pages before setting budget thresholds.
Production-readiness checklist
- Token caps enforced server-side
- Routing rules versioned and tested
- Cache TTL strategy defined by use case
- Per-team usage attribution available
- Budget alerts active and routed to on-call
- Anomaly detection baseline established
- Runbook for cost spike response completed
Final takeaway
Cost optimization is an architecture problem, not a single prompt tweak. Teams that combine routing, caching, batching, and guardrails usually achieve major savings without hurting user experience.