
Building a Production Monitoring Stack: Alarms, Dashboards, and Incident Response

May 24, 2026·4 min read

Founder and Editor, Smash The Exam

Reviewed: 2026-05-26

This guide breaks a production monitoring stack into practical decisions, shows what to validate at each step, and explains how to apply it in real engineering workflows.

AWS · Monitoring · DevOps

Building a Production Monitoring Stack on AWS: Alarms, Dashboards, and Incident Response

Consolidated from real monitoring setup, traffic spike investigations, and alerting pipeline sessions.

Observability Focus 1: Implementation details that change outcomes for predictable operations (Building Production Monitoring)

This article covers building a complete observability stack for a web application on AWS — from CloudWatch alarms and dashboards to SNS notification formatting, traffic spike investigation, and automated incident response.

Editorial review note for Building Production Monitoring

This section was reviewed by a human editor to keep the recommendations actionable and technically grounded. Reviewed by: Med Amine Mahmoud. Last editorial review: 2026-05-26T16:10:01Z.

Observability Focus 3: How this maps to real exam objectives for cleaner ownership (Building Production Monitoring)

When an Alarm Fires

Step-by-step investigation process proven on real incidents:

Step 1: Check Metrics Context

# Request count around alarm time (5-min granularity)
aws cloudwatch get-metric-statistics `
--namespace AWS/ApplicationELB `
--metric-name RequestCount `
--dimensions "Name=LoadBalancer,Value=app/my-alb/EXAMPLE" "Name=TargetGroup,Value=targetgroup/my-tg/EXAMPLE" `
--start-time 2026-05-23T02:00:00Z `
--end-time 2026-05-23T03:00:00Z `
--period 300 --statistics Sum

# HTTP status code distribution
aws cloudwatch get-metric-statistics `
--metric-name HTTPCode_Target_5XX_Count ...
aws cloudwatch get-metric-statistics `
--metric-name HTTPCode_Target_4XX_Count ...
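When several of these queries are run against the same incident, it helps to compute the time window once. The following is a minimal sketch, assuming the alarm time is available as an ISO-8601 string like the ones passed to --start-time and --end-time above. The helper name query_window and its 30-minute defaults are illustrative, not part of the original tooling.

```python
from datetime import datetime, timedelta, timezone

def query_window(alarm_time: str, before_min: int = 30, after_min: int = 30) -> tuple:
    """Return (start, end) UTC timestamps bracketing an alarm time."""
    # Replace a trailing 'Z' so older Python versions can parse the string
    t = datetime.fromisoformat(alarm_time.replace("Z", "+00:00"))
    if t.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC
        t = t.replace(tzinfo=timezone.utc)
    t = t.astimezone(timezone.utc)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    start = (t - timedelta(minutes=before_min)).strftime(fmt)
    end = (t + timedelta(minutes=after_min)).strftime(fmt)
    return start, end

start, end = query_window("2026-05-23T02:30:00Z")
print(f"--start-time {start} --end-time {end}")
```

Reusing the same pair for every metric keeps the request count, error codes, and latency graphs aligned on one window.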

Step 2: Check Service Health

# Target response time during spike
aws cloudwatch get-metric-statistics `
--metric-name TargetResponseTime `
--statistics Average Maximum ...

# Healthy host count
aws cloudwatch get-metric-statistics `
--metric-name HealthyHostCount ...

Step 3: Analyze ALB Access Logs

# List log files for the spike window
aws s3 ls s3://my-alb-logs/alb/AWSLogs/ACCOUNT_ID/elasticloadbalancing/us-east-1/2026/05/23/

# Download and analyze
aws s3 cp s3://my-alb-logs/alb/.../T0230Z_....log.gz spike.gz

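# PowerShell: decompress and parse
# Resolve-Path gives .NET an absolute path; .NET's working directory can differ from the PowerShell location
$stream = [System.IO.File]::OpenRead((Resolve-Path "spike.gz").Path)
$gz = New-Object System.IO.Compression.GZipStream($stream, [System.IO.Compression.CompressionMode]::Decompress)
$reader = New-Object System.IO.StreamReader($gz)
$content = $reader.ReadToEnd()
$reader.Dispose()  # also closes the underlying gzip and file streams
$lines = $content -split "`n" | Where-Object { $_ -ne "" }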

# Top IPs
$lines | ForEach-Object { ($_ -split ' ')[3] -replace ':\d+$','' } |
Group-Object | Sort-Object Count -Descending | Select-Object -First 10
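The same top-IP count can be done in Python when the analysis needs to run outside PowerShell. This is a sketch rather than the tooling used in the investigations: the function names are illustrative, and it relies on the same assumption as the PowerShell above, namely that the fourth space-separated field of an ALB access log entry is client:port.

```python
import gzip
from collections import Counter

def top_client_ips(lines, n=10):
    """Count requests per client IP from ALB access log lines."""
    counts = Counter()
    for line in lines:
        fields = line.split(" ")
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        # Field 4 is client:port; rsplit drops only the trailing port
        client_ip = fields[3].rsplit(":", 1)[0]
        counts[client_ip] += 1
    return counts.most_common(n)

def top_client_ips_from_file(path, n=10):
    """Stream a gzip-compressed log file without loading it all into memory."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        return top_client_ips(f, n)

# Sample lines using documentation-range IPs (not real traffic)
sample = [
    'https 2026-05-23T02:30:01.000000Z app/my-alb/EXAMPLE 198.51.100.7:51234 10.0.0.5:80 0.001 0.002 0.000 200 200 34 366 "HEAD https://example.com:443/a HTTP/1.1"',
    'https 2026-05-23T02:30:02.000000Z app/my-alb/EXAMPLE 198.51.100.7:51235 10.0.0.5:80 0.001 0.002 0.000 200 200 34 366 "HEAD https://example.com:443/b HTTP/1.1"',
    'https 2026-05-23T02:30:03.000000Z app/my-alb/EXAMPLE 203.0.113.9:40000 10.0.0.5:80 0.001 0.002 0.000 404 404 34 366 "GET https://example.com:443/wp-admin HTTP/1.1"',
    "",
]
for ip, count in top_client_ips(sample):
    print(ip, count)
```

Streaming line by line avoids holding a large log file in memory, which matters more as the spike window grows.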

Step 4: IP Intelligence

curl.exe -s "https://ipinfo.io/SUSPICIOUS_IP/json"

Real Investigation Results

Incident	Source	Method	Verdict
1,230 reqs/5min	Single residential IP (Tunisia)	HEAD, no UA, 1186 unique paths	Benign crawler
75 403s in burst	Coordinated IPs hitting ALB directly	Credential scanner	Blocked by rules
18 404s	Google Cloud IP	WordPress vuln scanner (/wp-admin)	Internet noise
500+ errors	2 IPs using ffuf/feroxbuster	Directory enumeration	Malicious, blocked

Classification Framework

Signal	Likely Benign	Likely Malicious
Method	HEAD, GET only	POST, PUT, various
Paths	Sitemap-matching	/wp-admin, /.env, /admin
User-Agent	Known bot or empty	Tool signatures (ffuf, sqlmap)
Rate	Steady, <10 req/s	Bursting, >50 req/s
Response codes	All 200	Mixed 4xx/5xx
Duration	Minutes	Hours/recurring
Source	Residential ISP	Cloud provider IPs
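The table can be turned into a quick first-pass check. The sketch below is only an illustration: the TrafficSample fields and the function name are hypothetical, the thresholds are copied from the table, and the Duration and Source rows are left out because they need data beyond a single log window. It lists which "likely malicious" signals are present and leaves the verdict to a human.

```python
from dataclasses import dataclass

# Values taken from the "Likely Malicious" column of the table
SUSPICIOUS_PATHS = ("/wp-admin", "/.env", "/admin")
TOOL_SIGNATURES = ("ffuf", "sqlmap")

@dataclass
class TrafficSample:
    methods: set          # HTTP methods seen from the source
    paths: list           # requested paths
    user_agent: str       # empty string if none was sent
    peak_rps: float       # highest observed requests per second
    status_codes: list    # response codes returned

def malicious_signals(s: TrafficSample) -> list:
    """Return the names of table rows where the sample looks malicious."""
    signals = []
    if s.methods - {"GET", "HEAD"}:
        signals.append("method")
    if any(p.startswith(SUSPICIOUS_PATHS) for p in s.paths):
        signals.append("paths")
    if any(sig in s.user_agent.lower() for sig in TOOL_SIGNATURES):
        signals.append("user-agent")
    if s.peak_rps > 50:
        signals.append("rate")
    if any(code >= 400 for code in s.status_codes):
        signals.append("response codes")
    return signals

# Illustrative samples, not data from the incidents above
crawler = TrafficSample({"HEAD"}, ["/", "/blog/post-1"], "", 4.1, [200, 200])
scanner = TrafficSample({"GET", "POST"}, ["/wp-admin/install.php", "/.env"],
                        "ffuf/2.1.0", 80.0, [404, 403])
print(malicious_signals(crawler))
print(malicious_signals(scanner))
```

Rates between 10 and 50 req/s fall between the two columns, so the sketch flags only the clearly malicious side and treats everything else as unflagged.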

Observability Focus 4: Failure modes and quick prevention for measurable outcomes (Building Production Monitoring)

Auto-Scaling Response

The auto-scaling policies respond to traffic spikes automatically:

Normal: 1 task
↓ CPU > 60% for 3 min OR RequestCount > 1000/target
Scale Out: up to 8 tasks (adds 1-2 at a time)
↓ Metrics below threshold for 15 min
Scale In: back to 1 task
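To reason about how this flow behaves during a spike, it can be written down as a toy decision function. This is a simplified model of the flow above and not how Application Auto Scaling evaluates policies internally. The function name, its parameters, and the single-task step are assumptions made for illustration; the thresholds and the 1 to 8 task range come from the diagram.

```python
def next_task_count(current: int, cpu_pct: float, requests_per_target: float,
                    minutes_breaching: int, minutes_quiet: int,
                    min_tasks: int = 1, max_tasks: int = 8, step: int = 1) -> int:
    """Toy model: decide the next task count from the current metrics."""
    breaching = cpu_pct > 60 or requests_per_target > 1000
    if breaching and minutes_breaching >= 3:
        # Scale out, capped at the maximum
        return min(current + step, max_tasks)
    if not breaching and minutes_quiet >= 15:
        # Scale in, never below the minimum
        return max(current - step, min_tasks)
    return current

print(next_task_count(1, cpu_pct=75, requests_per_target=200,
                      minutes_breaching=3, minutes_quiet=0))
```

Walking a recorded spike through a function like this makes it easier to see how long the service runs above threshold before capacity catches up.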

IP Blocking Automation

# alb-source-ip-access/run.py
import boto3, sys

def block_ip(ip: str, listener_arn: str):
    """Add IP to ALB deny rule"""
    elbv2 = boto3.client('elbv2')
    # Get current rules, find deny rule, add IP to condition
    ...

def status(listener_arn: str):
    """Show current IP access rules"""
    elbv2 = boto3.client('elbv2')
    rules = elbv2.describe_rules(ListenerArn=listener_arn)
    for rule in rules['Rules']:
        conditions = rule.get('Conditions', [])
        for c in conditions:
            # Only source-ip conditions carry the allow/deny IP lists
            if c['Field'] == 'source-ip':
                print(f"Rule {rule['Priority']}: {c['SourceIpConfig']['Values']}")
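The "add IP to condition" step is elided in the script. One way that piece might look is sketched below as a pure function, so it can be tested without AWS access. The function name is hypothetical and this is not the script's actual implementation. It assumes conditions shaped like the describe_rules output used in status(), and it normalizes a bare address to CIDR notation because source-ip values are CIDR blocks.

```python
import ipaddress

def add_ip_to_conditions(conditions: list, ip: str) -> list:
    """Return a copy of rule conditions with `ip` added to the source-ip values."""
    # "203.0.113.7" becomes "203.0.113.7/32"; invalid input raises ValueError
    cidr = str(ipaddress.ip_network(ip, strict=False))
    updated = []
    found = False
    for c in conditions:
        if c.get("Field") == "source-ip":
            values = list(c["SourceIpConfig"]["Values"])
            if cidr not in values:
                values.append(cidr)
            updated.append({"Field": "source-ip",
                            "SourceIpConfig": {"Values": values}})
            found = True
        else:
            # Other conditions pass through untouched
            updated.append(c)
    if not found:
        updated.append({"Field": "source-ip",
                        "SourceIpConfig": {"Values": [cidr]}})
    return updated

existing = [{"Field": "source-ip",
             "SourceIpConfig": {"Values": ["198.51.100.0/24"]}}]
print(add_ip_to_conditions(existing, "203.0.113.7"))
```

The result would then be handed to the ELBv2 modify_rule call as its Conditions argument. Non-source-ip conditions returned by describe_rules may need normalizing before being sent back, which this sketch does not attempt.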

Observability Focus 5: A cleaner way to operate this pattern for fewer incident surprises (Building Production Monitoring)

CloudWatch Metrics
├── Alarms (6 configured)
│ ├── ALARM → SNS → Lambda Formatter → SES → Email
│ ├── ALARM → SNS → Lambda Auto-Shutdown (dev/pg)
│ └── OK → SNS → Lambda Formatter → SES → Email (recovery notice)
├── Dashboard (12 widgets, auto-refresh)
└── ALB Access Logs → S3 (for forensic analysis)

Auto-Scaling
├── CPU target tracking (60%)
└── Request count target tracking (1000/target)

Observability Focus 6: What to automate first for this workload (Building Production Monitoring)

Alarm Strategy

We implemented six tiered alarms covering the main failure modes, from traffic and error spikes to idle non-production environments:

Alarm	Metric	Threshold	Action
Traffic Spike	ALB RequestCount (Sum)	>1000 in 5 min	Alert + investigate
High Error Rate	ALB HTTP 5XX	>10 in 5 min	Alert + page on-call
CPU Saturation	ECS CPU Utilization	>80% for 5 min	Alert + auto-scale
Dev Idle	ALB RequestCount (Sum)	=0 for 30 min	Auto-shutdown
DB Connections	RDS DatabaseConnections	>80 for 5 min	Alert
PgAdmin Idle	ALB RequestCount (Sum)	=0 for 60 min	Auto-shutdown

Alarm Configuration Pattern

aws cloudwatch put-metric-alarm `
--alarm-name "myapp-prod-traffic-spike" `
--alarm-description "High incoming traffic burst" `
--metric-name RequestCount `
--namespace AWS/ApplicationELB `
--statistic Sum `
--period 300 `
--evaluation-periods 1 `
--threshold 1000 `
--comparison-operator GreaterThanThreshold `
--treat-missing-data notBreaching `
--dimensions "Name=TargetGroup,Value=targetgroup/my-tg/EXAMPLE" "Name=LoadBalancer,Value=app/my-alb/EXAMPLE" `
--alarm-actions "arn:aws:sns:us-east-1:ACCOUNT_ID:myapp-prod-alerts" `
--ok-actions "arn:aws:sns:us-east-1:ACCOUNT_ID:myapp-prod-alerts"
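With six alarms sharing the same shape, the CLI call above can also be expressed as data. The sketch below builds the keyword arguments for boto3's put_metric_alarm from a few inputs. The helper names alarm_params and apply_alarm are illustrative, and the fixed comparison operator and missing-data setting simply mirror the command above; the idle alarms would need different values.

```python
def alarm_params(name, description, metric, namespace, threshold,
                 dimensions, topic_arn, statistic="Sum",
                 period=300, evaluation_periods=1):
    """Build put_metric_alarm keyword arguments mirroring the CLI flags."""
    return {
        "AlarmName": name,
        "AlarmDescription": description,
        "MetricName": metric,
        "Namespace": namespace,
        "Statistic": statistic,
        "Period": period,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        # ALARM and OK both publish to the same topic
        "AlarmActions": [topic_arn],
        "OKActions": [topic_arn],
    }

def apply_alarm(params):
    """Create or update the alarm (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so alarm_params works without AWS access
    boto3.client("cloudwatch").put_metric_alarm(**params)

params = alarm_params(
    name="myapp-prod-traffic-spike",
    description="High incoming traffic burst",
    metric="RequestCount",
    namespace="AWS/ApplicationELB",
    threshold=1000,
    dimensions={"TargetGroup": "targetgroup/my-tg/EXAMPLE",
                "LoadBalancer": "app/my-alb/EXAMPLE"},
    topic_arn="arn:aws:sns:us-east-1:ACCOUNT_ID:myapp-prod-alerts",
)
print(params["AlarmName"], params["Threshold"])
```

Keeping the builder separate from the API call means the alarm definitions can be reviewed and unit-tested before anything touches the account.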

Rich Alarm Descriptions

Each alarm description includes operational context:

Effective meaning: high incoming traffic burst

Resource in alarm: Prod target group behind ALB
Impacted endpoint: Production URL
Metric: AWS/ApplicationELB RequestCount
Condition: Sum > 1000 in 5 minutes
Actions: ALARM and OK both publish to alerts topic
Practical effect: traffic pressure may require scaling actions

Observability Focus 7: How to keep this maintainable at scale for your runbook (Building Production Monitoring)

Dashboard Layout (12 Widgets)

┌─────────────────┬──────────────────┬───────────────────┬────────────────┐
│ ALB Requests │ Response Time │ HTTP Errors │ Target Health │
│ (Sum/5min) │ (p99 + p50) │ (4xx + 5xx) │ (Healthy cnt) │
├─────────────────┼──────────────────┼───────────────────┼────────────────┤
│ Prod ECS CPU │ Prod ECS Memory │ Dev ECS CPU │ Dev ECS Memory │
├─────────────────┼──────────────────┼───────────────────┼────────────────┤
│ RDS CPU │ RDS Connections │ RDS Free Storage │ RDS IOPS │
└─────────────────┴──────────────────┴───────────────────┴────────────────┘

Dashboard as Code

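# dashboard-manager/run.py
import boto3, sys

REGION = 'us-east-1'
DASHBOARD_NAME = 'MyApp-Operations'

def create_or_update(dashboard_name: str, body_file: str):
    # put_dashboard creates the dashboard if it is missing, otherwise overwrites it
    cw = boto3.client('cloudwatch', region_name=REGION)
    with open(body_file) as f:
        body = f.read()
    cw.put_dashboard(DashboardName=dashboard_name, DashboardBody=body)
    print(f"Dashboard '{dashboard_name}' updated.")

def delete(dashboard_name: str):
    # Use the same region as create_or_update so both target the same dashboard
    cw = boto3.client('cloudwatch', region_name=REGION)
    cw.delete_dashboards(DashboardNames=[dashboard_name])
    print(f"Dashboard '{dashboard_name}' deleted.")

if __name__ == '__main__':
    action = sys.argv[1] if len(sys.argv) > 1 else 'update'
    if action == 'update':
        create_or_update(DASHBOARD_NAME, 'dashboard.json')
    elif action == 'delete':
        delete(DASHBOARD_NAME)
    else:
        sys.exit(f"Unknown action: {action}")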
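The dashboard.json file itself can be generated rather than hand-edited. The sketch below lays widgets out on a 4-column grid like the layout above, assuming the standard CloudWatch dashboard body format with a 24-unit-wide grid. The function names and the two example widget specs are illustrative, and the real dashboard's widget properties are not shown in this article.

```python
import json

def metric_widget(title, metrics, col, row, stat="Average", region="us-east-1"):
    """One metric widget occupying a 6x6 cell of the 24-unit-wide grid."""
    return {
        "type": "metric",
        "x": col * 6, "y": row * 6, "width": 6, "height": 6,
        "properties": {"title": title, "metrics": metrics, "stat": stat,
                       "period": 300, "region": region},
    }

def build_body(widget_specs, columns=4):
    """Place (title, metrics, stat) specs left to right, top to bottom."""
    widgets = [metric_widget(title, metrics, i % columns, i // columns, stat)
               for i, (title, metrics, stat) in enumerate(widget_specs)]
    return json.dumps({"widgets": widgets}, indent=2)

# Example specs; dimension values are the placeholders used elsewhere in this article
specs = [
    ("ALB Requests",
     [["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/my-alb/EXAMPLE"]],
     "Sum"),
    ("Response Time",
     [["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "app/my-alb/EXAMPLE"]],
     "p99"),
]

with open("dashboard.json", "w") as f:
    f.write(build_body(specs))
```

Generating the body from a short list of specs keeps all twelve widgets consistent in size and period, and makes layout changes a one-line edit.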

Observability Focus 8: Pragmatic guardrails for day two ops for production readiness (Building Production Monitoring)

Problem: Unreadable AWS Alarm Emails

By default, CloudWatch alarm notifications delivered through SNS arrive as raw JSON, which is hard to read in an inbox, especially during an incident.

Solution: Lambda Email Formatter

A Lambda function subscribes to all SNS topics and reformats messages before forwarding via SES:

import boto3, json, os
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

ses = boto3.client('ses', region_name=os.environ.get('SES_REGION', 'us-east-1'))
RECIPIENT = os.environ['ALERT_RECIPIENT']
SENDER = os.environ['ALERT_SENDER']

def handler(event, context):
    for record in event['Records']:
        sns_message = json.loads(record['Sns']['Message'])
        subject = record['Sns']['Subject'] or 'AWS Alert'

        # Format based on alarm type
        if 'AlarmName' in sns_message:
            body = format_alarm(sns_message)
        elif 'source' in sns_message:
            body = format_event(sns_message)
        else:
            body = json.dumps(sns_message, indent=2)

        send_email(subject, body)

def format_alarm(alarm):
    return f"""
🚨 {alarm['AlarmName']}

State: {alarm['OldStateValue']} → {alarm['NewStateValue']}
Reason: {alarm['NewStateReason']}
Time: {alarm['StateChangeTime']}

Description: {alarm.get('AlarmDescription') or 'N/A'}

Metric: {alarm['Trigger']['MetricName']}
Namespace: {alarm['Trigger']['Namespace']}
Threshold: {alarm['Trigger']['Threshold']}
"""

def format_event(event):
    # Minimal formatter, assuming an EventBridge-style event
    # with source, detail-type, and detail fields
    return (
        f"Source: {event['source']}\n"
        f"Type: {event.get('detail-type', 'N/A')}\n\n"
        f"{json.dumps(event.get('detail', {}), indent=2)}"
    )

def send_email(subject, body):
    msg = MIMEMultipart()
    msg['Subject'] = subject
    msg['From'] = SENDER
    msg['To'] = RECIPIENT
    msg.attach(MIMEText(body, 'plain'))

    ses.send_raw_email(
        Source=SENDER,
        Destinations=[RECIPIENT],
        RawMessage={'Data': msg.as_string()}
    )
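The handler calls json.loads on every SNS message, so a plain-text message published to one of the topics would raise before any email is sent. A defensive variant of the dispatch step could look like the sketch below. The function name parse_sns_message is hypothetical, and the 'alarm' and 'event' checks mirror the keys the handler already tests.

```python
import json

def parse_sns_message(raw):
    """Return (kind, payload) where kind is 'alarm', 'event', 'json', or 'text'."""
    try:
        payload = json.loads(raw)
    except (TypeError, ValueError):
        # Not JSON (or not a string): forward the message as-is
        return "text", raw
    if not isinstance(payload, dict):
        return "json", payload
    if "AlarmName" in payload:
        return "alarm", payload
    if "source" in payload:
        return "event", payload
    return "json", payload

print(parse_sns_message('{"AlarmName": "myapp-prod-traffic-spike"}')[0])
print(parse_sns_message("Disk cleanup finished")[0])
```

With this in place the handler can choose a formatter from the returned kind and fall back to sending the raw text, so an unexpected message still reaches the inbox instead of failing the Lambda invocation.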

SNS Topic Architecture

CloudWatch Alarms → SNS Topic (prod-alerts) → Lambda Formatter → SES → Email
                  → SNS Topic (dev-idle) → Lambda (shutdown) + Lambda Formatter
                  → SNS Topic (billing) → Lambda Formatter → SES → Email