Building a Production Monitoring Stack: Alarms, Dashboards, and Incident Response

May 24, 2026·4 min read
Med Amine Mahmoud
Founder and Editor, Smash The Exam
Reviewed: 2026-05-26

This article breaks production monitoring on AWS into practical decisions, shows what to validate at each step, and explains how to apply the result in real engineering workflows.

Tags: AWS, Monitoring, DevOps

Consolidated from real monitoring setup, traffic spike investigations, and alerting pipeline sessions.

Overview

This article covers building a complete observability stack for a web application on AWS — from CloudWatch alarms and dashboards to SNS notification formatting, traffic spike investigation, and automated incident response.


Editorial review note for Building Production Monitoring

This section was reviewed by a human editor to keep the recommendations actionable and technically grounded. Reviewed by: Med Amine Mahmoud. Last editorial review: 2026-05-26T16:10:01Z.

Incident Investigation

When an Alarm Fires

Step-by-step investigation process proven on real incidents:

Step 1: Check Metrics Context

# Request count around alarm time (5-min granularity)
aws cloudwatch get-metric-statistics `
--namespace AWS/ApplicationELB `
--metric-name RequestCount `
--dimensions "Name=LoadBalancer,Value=app/my-alb/EXAMPLE" "Name=TargetGroup,Value=targetgroup/my-tg/EXAMPLE" `
--start-time 2026-05-23T02:00:00Z `
--end-time 2026-05-23T03:00:00Z `
--period 300 --statistics Sum

# HTTP status code distribution
aws cloudwatch get-metric-statistics `
--metric-name HTTPCode_Target_5XX_Count ...
aws cloudwatch get-metric-statistics `
--metric-name HTTPCode_Target_4XX_Count ...
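When scripting this step, keep in mind that `get-metric-statistics` does not return datapoints in chronological order. The following is a minimal Python sketch, assuming the CLI output has been saved or captured as JSON. The helper names and the sample values are illustrative and not part of the original setup.

```python
import json
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp as printed by the AWS CLI."""
    # Older Python versions reject a trailing "Z" in fromisoformat().
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def find_spike_buckets(response: dict, threshold: float, stat: str = "Sum") -> list:
    """Return (timestamp, value) pairs above threshold, oldest first."""
    points = sorted(
        response.get("Datapoints", []),
        key=lambda p: parse_ts(p["Timestamp"]),
    )
    return [(p["Timestamp"], p[stat]) for p in points if p[stat] > threshold]


# Hypothetical sample in the shape returned by get-metric-statistics.
sample = json.loads("""
{"Label": "RequestCount", "Datapoints": [
  {"Timestamp": "2026-05-23T02:35:00Z", "Sum": 310.0, "Unit": "Count"},
  {"Timestamp": "2026-05-23T02:25:00Z", "Sum": 140.0, "Unit": "Count"},
  {"Timestamp": "2026-05-23T02:30:00Z", "Sum": 1230.0, "Unit": "Count"}
]}
""")

spikes = find_spike_buckets(sample, threshold=1000)
print(spikes)
```

Sorting first makes it easy to see which 5-minute bucket crossed the threshold and what the neighbouring buckets looked like.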

Step 2: Check Service Health

# Target response time during spike
aws cloudwatch get-metric-statistics `
--metric-name TargetResponseTime `
--statistics Average Maximum ...

# Healthy host count
aws cloudwatch get-metric-statistics `
--metric-name HealthyHostCount ...

Step 3: Analyze ALB Access Logs

# List log files for the spike window
aws s3 ls s3://my-alb-logs/alb/AWSLogs/ACCOUNT_ID/elasticloadbalancing/us-east-1/2026/05/23/

# Download and analyze
aws s3 cp s3://my-alb-logs/alb/.../T0230Z_....log.gz spike.gz

# PowerShell: decompress and parse
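# Resolve-Path avoids .NET resolving the file against a different working directory
$stream = [System.IO.File]::OpenRead((Resolve-Path "spike.gz").Path)
$gz = New-Object System.IO.Compression.GZipStream($stream, [System.IO.Compression.CompressionMode]::Decompress)
$reader = New-Object System.IO.StreamReader($gz)
$content = $reader.ReadToEnd()
$reader.Dispose()  # also closes the underlying gzip and file streams
$lines = $content -split "`n" | Where-Object { $_ -ne "" }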

# Top IPs
$lines | ForEach-Object { ($_ -split ' ')[3] -replace ':\d+$','' } |
Group-Object | Sort-Object Count -Descending | Select-Object -First 10
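The same top-IP count can be done in Python, which may be convenient when the analysis runs outside Windows. This is a sketch that relies on the client address being the fourth space-separated field of an ALB access log entry, as in the PowerShell above. The sample lines are hypothetical and truncated.

```python
import gzip
from collections import Counter


def client_ip(line: str) -> str:
    """Extract the client IP from one ALB access log entry."""
    # Field 3 (0-based) is "client:port"; strip the port after the last colon.
    return line.split(" ")[3].rsplit(":", 1)[0]


def top_client_ips(lines, n: int = 10):
    """Return the n most frequent client IPs as (ip, count) pairs."""
    return Counter(client_ip(line) for line in lines if line.strip()).most_common(n)


def read_log(path: str):
    """Yield lines from a gzip-compressed ALB access log."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        yield from fh


# Hypothetical, truncated entries; real ALB log lines carry many more fields.
sample_lines = [
    "https 2026-05-23T02:30:01.000000Z app/my-alb/EXAMPLE 203.0.113.7:50001 10.0.1.5:8080 0.001 0.002 0.000 200",
    "https 2026-05-23T02:30:02.000000Z app/my-alb/EXAMPLE 203.0.113.7:50002 10.0.1.5:8080 0.001 0.002 0.000 200",
    "https 2026-05-23T02:30:03.000000Z app/my-alb/EXAMPLE 198.51.100.9:41000 10.0.1.5:8080 0.001 0.003 0.000 404",
]

top = top_client_ips(sample_lines)
print(top)
```

Against a downloaded file, the call would be `top_client_ips(read_log("spike.gz"))`.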

Step 4: IP Intelligence

curl.exe -s "https://ipinfo.io/SUSPICIOUS_IP/json"

Real Investigation Results

| Incident | Source | Method | Verdict |
|---|---|---|---|
| 1,230 reqs/5min | Single residential IP (Tunisia) | HEAD, no UA, 1186 unique paths | Benign crawler |
| 75 403s in burst | Coordinated IPs hitting ALB directly | Credential scanner | Blocked by rules |
| 18 404s | Google Cloud IP | WordPress vuln scanner (/wp-admin) | Internet noise |
| 500+ errors | 2 IPs using ffuf/feroxbuster | Directory enumeration | Malicious, blocked |

Classification Framework

| Signal | Likely Benign | Likely Malicious |
|---|---|---|
| Method | HEAD, GET only | POST, PUT, various |
| Paths | Sitemap-matching | /wp-admin, /.env, /admin |
| User-Agent | Known bot or empty | Tool signatures (ffuf, sqlmap) |
| Rate | Steady, <10 req/s | Bursting, >50 req/s |
| Response codes | All 200 | Mixed 4xx/5xx |
| Duration | Minutes | Hours/recurring |
| Source | Residential ISP | Cloud provider IPs |
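Several of these signals can be checked mechanically once the log lines for one source IP have been summarised. The sketch below is one possible encoding, not the process used in the investigations above. The `TrafficSample` type, the two-signal cutoff, and the sample values are assumptions, and the Duration and Source signals are left out because they need data beyond a single log window.

```python
from dataclasses import dataclass

# Path prefixes and tool names taken from the table above.
SCANNER_PATHS = ("/wp-admin", "/.env", "/admin")
TOOL_SIGNATURES = ("ffuf", "sqlmap")


@dataclass
class TrafficSample:
    """Summary of one source's traffic during a spike window (hypothetical shape)."""
    methods: set
    paths: list
    user_agent: str
    requests_per_second: float
    status_codes: list


def malicious_signals(s: TrafficSample) -> list:
    """Return the names of the table rows that point to malicious traffic."""
    signals = []
    if s.methods - {"HEAD", "GET"}:
        signals.append("method")
    if any(p.startswith(SCANNER_PATHS) for p in s.paths):
        signals.append("paths")
    if any(sig in s.user_agent.lower() for sig in TOOL_SIGNATURES):
        signals.append("user-agent")
    if s.requests_per_second > 50:
        signals.append("rate")
    if any(code >= 400 for code in s.status_codes):
        signals.append("response codes")
    return signals


def classify(s: TrafficSample, min_signals: int = 2) -> str:
    """Assumed rule of thumb: two or more malicious signals flag the source."""
    return "likely malicious" if len(malicious_signals(s)) >= min_signals else "likely benign"


crawler = TrafficSample({"HEAD"}, ["/blog/a", "/blog/b"], "", 4.1, [200, 200])
scanner = TrafficSample({"GET", "POST"}, ["/.env", "/wp-admin/setup.php"], "ffuf/2.1.0", 80.0, [404, 403, 200])

print(classify(crawler), "|", classify(scanner))
```

A score like this is only a triage aid. A single signal, such as an empty User-Agent, showed up in benign traffic in the results table, which is why the sketch asks for more than one.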

Automated Response

Auto-Scaling Response

The auto-scaling policies respond to traffic spikes automatically:

Normal: 1 task
↓ CPU > 60% for 3 min OR RequestCount > 1000/target
Scale Out: up to 8 tasks (adds 1-2 at a time)
↓ Metrics below threshold for 15 min
Scale In: back to 1 task
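For an ECS service, a policy set like this is typically registered through Application Auto Scaling. The sketch below only builds the request parameters, so it can be reviewed without AWS access. The cluster, service, and policy names and both cooldown values are assumptions, while the 1 to 8 task range and the 60% and 1000 targets come from the flow above.

```python
def scalable_target(resource_id: str, min_tasks: int = 1, max_tasks: int = 8) -> dict:
    """Parameters for application-autoscaling register_scalable_target."""
    return {
        "ServiceNamespace": "ecs",
        "ResourceId": resource_id,
        "ScalableDimension": "ecs:service:DesiredCount",
        "MinCapacity": min_tasks,
        "MaxCapacity": max_tasks,
    }


def target_tracking_policy(name, resource_id, metric_type, target_value,
                           resource_label=None,
                           scale_out_cooldown=60, scale_in_cooldown=300) -> dict:
    """Parameters for application-autoscaling put_scaling_policy."""
    spec = {"PredefinedMetricType": metric_type}
    if resource_label:
        # Required for ALBRequestCountPerTarget: "<alb part>/<target group part>".
        spec["ResourceLabel"] = resource_label
    return {
        "PolicyName": name,
        "ServiceNamespace": "ecs",
        "ResourceId": resource_id,
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_value,
            "PredefinedMetricSpecification": spec,
            "ScaleOutCooldown": scale_out_cooldown,  # assumed value
            "ScaleInCooldown": scale_in_cooldown,    # assumed value
        },
    }


RESOURCE_ID = "service/my-cluster/my-service"  # hypothetical cluster and service

target = scalable_target(RESOURCE_ID)
cpu_policy = target_tracking_policy(
    "cpu-60", RESOURCE_ID, "ECSServiceAverageCPUUtilization", 60.0)
request_policy = target_tracking_policy(
    "requests-1000", RESOURCE_ID, "ALBRequestCountPerTarget", 1000.0,
    resource_label="app/my-alb/EXAMPLE/targetgroup/my-tg/EXAMPLE")

# To apply (requires boto3 and credentials):
#   aas = boto3.client("application-autoscaling")
#   aas.register_scalable_target(**target)
#   aas.put_scaling_policy(**cpu_policy)
#   aas.put_scaling_policy(**request_policy)
print(cpu_policy["TargetTrackingScalingPolicyConfiguration"]["TargetValue"])
```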

IP Blocking Automation

# alb-source-ip-access/run.py
import boto3, sys

def block_ip(ip: str, listener_arn: str):
    """Add IP to ALB deny rule"""
    elbv2 = boto3.client('elbv2')
    # Get current rules, find deny rule, add IP to condition
    ...

def status(listener_arn: str):
    """Show current IP access rules"""
    elbv2 = boto3.client('elbv2')
    rules = elbv2.describe_rules(ListenerArn=listener_arn)
    for rule in rules['Rules']:
        conditions = rule.get('Conditions', [])
        for c in conditions:
            # Only source-ip conditions carry the blocked/allowed CIDR list
            if c['Field'] == 'source-ip':
                print(f"Rule {rule['Priority']}: {c['SourceIpConfig']['Values']}")
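The elided body of `block_ip` comes down to rewriting the deny rule's `source-ip` condition and sending it back with `modify_rule`. Below is a sketch of the condition rewrite only. The function names are hypothetical, and it assumes the deny rule already exists and that any other conditions on the rule can be passed back unchanged.

```python
import ipaddress


def to_cidr(ip: str) -> str:
    """Normalise a bare IP or a CIDR string to CIDR notation."""
    # source-ip conditions take CIDR blocks, so a bare address becomes a /32 (or /128).
    return str(ipaddress.ip_network(ip, strict=False))


def with_blocked_ip(conditions: list, ip: str) -> list:
    """Return a copy of a rule's conditions with ip added to the source-ip values."""
    cidr = to_cidr(ip)
    updated = []
    for c in conditions:
        if c.get("Field") == "source-ip":
            values = list(c["SourceIpConfig"]["Values"])
            if cidr not in values:
                values.append(cidr)
            updated.append({"Field": "source-ip", "SourceIpConfig": {"Values": values}})
        else:
            updated.append(c)
    return updated


# Hypothetical conditions in the shape returned by describe_rules.
current = [{"Field": "source-ip", "SourceIpConfig": {"Values": ["198.51.100.0/24"]}}]
new_conditions = with_blocked_ip(current, "203.0.113.7")
print(new_conditions)

# To apply (requires boto3 and credentials):
#   elbv2.modify_rule(RuleArn=rule["RuleArn"], Conditions=new_conditions)
```

Keeping the rewrite as a pure function makes it testable without touching the load balancer. ALB rules also have quotas on condition values, so a long block list may need several rules or a WAF IP set instead.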

Architecture Overview

CloudWatch Metrics
├── Alarms (6 configured)
│ ├── ALARM → SNS → Lambda Formatter → SES → Email
│ ├── ALARM → SNS → Lambda Auto-Shutdown (dev/pg)
│ └── OK → SNS → Lambda Formatter → SES → Email (recovery notice)
├── Dashboard (12 widgets, auto-refresh)
└── ALB Access Logs → S3 (for forensic analysis)

Auto-Scaling
├── CPU target tracking (60%)
└── Request count target tracking (1000/target)

Alarms

Alarm Strategy

We implemented six alarms covering traffic, error rate, CPU, database connections, and idle non-production environments:

| Alarm | Metric | Threshold | Action |
|---|---|---|---|
| Traffic Spike | ALB RequestCount (Sum) | >1000 in 5 min | Alert + investigate |
| High Error Rate | ALB HTTP 5XX | >10 in 5 min | Alert + page on-call |
| CPU Saturation | ECS CPU Utilization | >80% for 5 min | Alert + auto-scale |
| Dev Idle | ALB RequestCount (Sum) | =0 for 30 min | Auto-shutdown |
| DB Connections | RDS DatabaseConnections | >80 for 5 min | Alert |
| PgAdmin Idle | ALB RequestCount (Sum) | =0 for 60 min | Auto-shutdown |

Alarm Configuration Pattern

aws cloudwatch put-metric-alarm `
--alarm-name "myapp-prod-traffic-spike" `
--alarm-description "High incoming traffic burst" `
--metric-name RequestCount `
--namespace AWS/ApplicationELB `
--statistic Sum `
--period 300 `
--evaluation-periods 1 `
--threshold 1000 `
--comparison-operator GreaterThanThreshold `
--treat-missing-data notBreaching `
--dimensions "Name=TargetGroup,Value=targetgroup/my-tg/EXAMPLE" "Name=LoadBalancer,Value=app/my-alb/EXAMPLE" `
--alarm-actions "arn:aws:sns:us-east-1:ACCOUNT_ID:myapp-prod-alerts" `
--ok-actions "arn:aws:sns:us-east-1:ACCOUNT_ID:myapp-prod-alerts"
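If alarms are managed from Python alongside the other scripts in this article, the same alarm can be expressed as keyword arguments for boto3's `put_metric_alarm`. This is a sketch that mirrors the CLI flags above one to one. The `alb_request_alarm` helper name is hypothetical.

```python
def alb_request_alarm(name, description, threshold, target_group, load_balancer,
                      topic_arn, period=300, evaluation_periods=1) -> dict:
    """Keyword arguments for cloudwatch.put_metric_alarm on ALB RequestCount."""
    return {
        "AlarmName": name,
        "AlarmDescription": description,
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "RequestCount",
        "Statistic": "Sum",
        "Period": period,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # No requests means no datapoints; treat that as healthy, not as an alarm.
        "TreatMissingData": "notBreaching",
        "Dimensions": [
            {"Name": "TargetGroup", "Value": target_group},
            {"Name": "LoadBalancer", "Value": load_balancer},
        ],
        "AlarmActions": [topic_arn],
        "OKActions": [topic_arn],
    }


alarm = alb_request_alarm(
    name="myapp-prod-traffic-spike",
    description="High incoming traffic burst",
    threshold=1000,
    target_group="targetgroup/my-tg/EXAMPLE",
    load_balancer="app/my-alb/EXAMPLE",
    topic_arn="arn:aws:sns:us-east-1:ACCOUNT_ID:myapp-prod-alerts",
)

# To apply (requires boto3 and credentials):
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
print(alarm["AlarmName"], alarm["Threshold"])
```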

Rich Alarm Descriptions

Each alarm description includes operational context:

Effective meaning: high incoming traffic burst

Resource in alarm: Prod target group behind ALB
Impacted endpoint: Production URL
Metric: AWS/ApplicationELB RequestCount
Condition: Sum > 1000 in 5 minutes
Actions: ALARM and OK both publish to alerts topic
Practical effect: traffic pressure may require scaling actions
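A description in this shape can be generated from a few fields so that every alarm follows the same layout. The helper below is a hypothetical sketch. The length check reflects the 1024-character limit CloudWatch applies to alarm descriptions.

```python
MAX_DESCRIPTION_LENGTH = 1024  # CloudWatch limit for AlarmDescription


def render_description(effective_meaning: str, fields: dict) -> str:
    """Render an alarm description: a headline, a blank line, then 'Label: value' lines."""
    lines = [f"Effective meaning: {effective_meaning}", ""]
    lines += [f"{label}: {value}" for label, value in fields.items()]
    text = "\n".join(lines)
    if len(text) > MAX_DESCRIPTION_LENGTH:
        raise ValueError(f"description is {len(text)} chars, limit is {MAX_DESCRIPTION_LENGTH}")
    return text


description = render_description(
    "high incoming traffic burst",
    {
        "Resource in alarm": "Prod target group behind ALB",
        "Impacted endpoint": "Production URL",
        "Metric": "AWS/ApplicationELB RequestCount",
        "Condition": "Sum > 1000 in 5 minutes",
        "Actions": "ALARM and OK both publish to alerts topic",
        "Practical effect": "traffic pressure may require scaling actions",
    },
)
print(description)
```

The rendered text can be passed as the alarm description when the alarm is created.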

Dashboards

Dashboard Layout (12 Widgets)

┌─────────────────┬──────────────────┬───────────────────┬────────────────┐
│ ALB Requests │ Response Time │ HTTP Errors │ Target Health │
│ (Sum/5min) │ (p99 + p50) │ (4xx + 5xx) │ (Healthy cnt) │
├─────────────────┼──────────────────┼───────────────────┼────────────────┤
│ Prod ECS CPU │ Prod ECS Memory │ Dev ECS CPU │ Dev ECS Memory │
├─────────────────┼──────────────────┼───────────────────┼────────────────┤
│ RDS CPU │ RDS Connections │ RDS Free Storage │ RDS IOPS │
└─────────────────┴──────────────────┴───────────────────┴────────────────┘

Dashboard as Code

# dashboard-manager/run.py
import boto3, json, sys

REGION = 'us-east-1'

def create_or_update(dashboard_name: str, body_file: str):
    cw = boto3.client('cloudwatch', region_name=REGION)
    with open(body_file) as f:
        body = f.read()
    # put_dashboard creates the dashboard or replaces an existing one with the same name
    cw.put_dashboard(DashboardName=dashboard_name, DashboardBody=body)
    print(f"Dashboard '{dashboard_name}' updated.")

if __name__ == '__main__':
    action = sys.argv[1] if len(sys.argv) > 1 else 'update'
    if action == 'update':
        create_or_update('MyApp-Operations', 'dashboard.json')
    elif action == 'delete':
        # Use the same region as create_or_update so the delete targets the same dashboard
        boto3.client('cloudwatch', region_name=REGION).delete_dashboards(DashboardNames=['MyApp-Operations'])
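The `dashboard.json` body itself can also be generated, which keeps a 4-column layout consistent as widgets are added. The sketch below places metric widgets on CloudWatch's 24-column grid. The cluster, service, and database identifiers are hypothetical, and only three of the twelve widgets are shown.

```python
import json

GRID_COLUMNS = 24  # CloudWatch dashboards lay widgets out on a 24-column grid


def grid_widgets(specs, per_row=4, height=6, region="us-east-1"):
    """Place metric widgets left to right, per_row widgets in each row."""
    width = GRID_COLUMNS // per_row
    widgets = []
    for i, (title, metrics, stat) in enumerate(specs):
        widgets.append({
            "type": "metric",
            "x": (i % per_row) * width,
            "y": (i // per_row) * height,
            "width": width,
            "height": height,
            "properties": {
                "title": title,
                "metrics": metrics,
                "stat": stat,
                "period": 300,
                "region": region,
                "view": "timeSeries",
            },
        })
    return widgets


# Each metric is [namespace, metric name, dimension name, dimension value, ...].
specs = [
    ("ALB Requests",
     [["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/my-alb/EXAMPLE"]], "Sum"),
    ("Prod ECS CPU",
     [["AWS/ECS", "CPUUtilization", "ClusterName", "my-cluster", "ServiceName", "my-service"]], "Average"),
    ("RDS Connections",
     [["AWS/RDS", "DatabaseConnections", "DBInstanceIdentifier", "my-db"]], "Average"),
]

widgets = grid_widgets(specs)
body = json.dumps({"widgets": widgets}, indent=2)
print(body[:80])
```

Writing `body` to `dashboard.json` would make it usable with the `create_or_update` script above.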

Alarm Notifications

Problem: Unreadable AWS Alarm Emails

Default CloudWatch alarm notifications sent through SNS arrive as raw JSON, which is hard to scan in an email during an incident.

Solution: Lambda Email Formatter

A Lambda function subscribes to all SNS topics and reformats messages before forwarding via SES:

import boto3, json, os
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

ses = boto3.client('ses', region_name=os.environ.get('SES_REGION', 'us-east-1'))
RECIPIENT = os.environ['ALERT_RECIPIENT']
SENDER = os.environ['ALERT_SENDER']

def handler(event, context):
    for record in event['Records']:
        raw = record['Sns']['Message']
        # Subject can be missing or null on some notifications
        subject = record['Sns'].get('Subject') or 'AWS Alert'

        # Not every SNS message is JSON; fall back to the raw text
        try:
            sns_message = json.loads(raw)
        except json.JSONDecodeError:
            sns_message = None

        # Format based on alarm type
        if isinstance(sns_message, dict) and 'AlarmName' in sns_message:
            body = format_alarm(sns_message)
        elif isinstance(sns_message, dict) and 'source' in sns_message:
            body = format_event(sns_message)
        elif sns_message is not None:
            body = json.dumps(sns_message, indent=2)
        else:
            body = raw

        send_email(subject, body)

def format_alarm(alarm):
    return f"""
🚨 {alarm['AlarmName']}

State: {alarm['OldStateValue']} → {alarm['NewStateValue']}
Reason: {alarm['NewStateReason']}
Time: {alarm['StateChangeTime']}

Description: {alarm.get('AlarmDescription', 'N/A')}

Metric: {alarm['Trigger']['MetricName']}
Namespace: {alarm['Trigger']['Namespace']}
Threshold: {alarm['Trigger']['Threshold']}
"""

def format_event(event):
    # Minimal formatter for EventBridge-style events (assumed field selection)
    detail = json.dumps(event.get('detail', {}), indent=2)
    return f"""
Source: {event['source']}
Type: {event.get('detail-type', 'N/A')}
Time: {event.get('time', 'N/A')}

Detail:
{detail}
"""

def send_email(subject, body):
    msg = MIMEMultipart()
    msg['Subject'] = subject
    msg['From'] = SENDER
    msg['To'] = RECIPIENT
    msg.attach(MIMEText(body, 'plain'))

    ses.send_raw_email(
        Source=SENDER,
        Destinations=[RECIPIENT],
        RawMessage={'Data': msg.as_string()}
    )
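To exercise a formatter like this before deploying it, it helps to have a sample event in the shape SNS delivers to Lambda. The builder below is a hypothetical test helper. It includes only the fields the formatter reads, and all values are made up.

```python
import json


def make_alarm_event(alarm_name: str, old_state: str, new_state: str, reason: str) -> dict:
    """Build a minimal SNS-to-Lambda event carrying a CloudWatch alarm message."""
    message = {
        "AlarmName": alarm_name,
        "AlarmDescription": "High incoming traffic burst",
        "OldStateValue": old_state,
        "NewStateValue": new_state,
        "NewStateReason": reason,
        "StateChangeTime": "2026-05-23T02:35:00.000+0000",
        "Trigger": {
            "MetricName": "RequestCount",
            "Namespace": "AWS/ApplicationELB",
            "Threshold": 1000.0,
        },
    }
    return {
        "Records": [{
            "EventSource": "aws:sns",
            "Sns": {
                "Subject": f'{new_state}: "{alarm_name}"',
                # SNS delivers the alarm payload as a JSON string, not a nested object.
                "Message": json.dumps(message),
            },
        }]
    }


event = make_alarm_event(
    "myapp-prod-traffic-spike", "OK", "ALARM",
    "Threshold Crossed: 1 datapoint [1230.0] was greater than the threshold (1000.0).",
)
decoded = json.loads(event["Records"][0]["Sns"]["Message"])
print(decoded["AlarmName"], decoded["NewStateValue"])
```

Calling `handler(event, None)` with the SES client replaced by a stub then shows the formatted body without sending mail.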

SNS Topic Architecture

CloudWatch Alarms → SNS Topic (prod-alerts) → Lambda Formatter → SES → Email
                  → SNS Topic (dev-idle) → Lambda (shutdown) + Lambda Formatter
                  → SNS Topic (billing) → Lambda Formatter → SES → Email