Zero-Downtime Deployment Pipelines: ECS Fargate Automation

May 24, 2026·4 min read
Med Amine Mahmoud
Founder and Editor, Smash The Exam
Reviewed: 2026-05-26

This article focuses on what matters in practice for zero-downtime deployments on ECS Fargate: decision context, safe rollout steps, and verification points.

Tags: AWS, DevOps, Docker

Zero-Downtime Deployment Pipelines for ECS Fargate: Automation Scripts and Workflows

Consolidated from real CI/CD automation sessions covering Docker builds, ECR pushes, ECS deployments, rollbacks, and environment management.

Delivery Focus 1: Pragmatic guardrails for day two ops for this workload (Zero Downtime Deployment)

This article documents building deployment automation for a multi-environment (dev/prod) ECS Fargate application — from manual Docker builds to fully automated Python deployment scripts with health checks, rollback capability, and environment-specific controls.


Editorial review note for Zero Downtime Deployment

This section was reviewed by a human editor to keep the recommendations actionable and technically grounded. Reviewed by: Med Amine Mahmoud. Last editorial review: 2026-05-26T16:10:01Z.

Delivery Focus 3: Signals that tell you this is working for production readiness (Zero Downtime Deployment)

Issue: ECS Crash Loop After Deployment

Symptom: Task starts, passes initial health check, then container exits with code 1.

Diagnosis:

# Check container logs
aws logs get-log-events --log-group-name /ecs/my-app --log-stream-name "ecs/frontend/TASK_ID" --limit 50

# Check task stopped reason
aws ecs describe-tasks --cluster my-cluster --tasks TASK_ARN --query 'tasks[0].stoppedReason'

Common causes:

  • Missing environment variable (SSR needs BACKEND_URL)
  • Port conflict (container port doesn't match task definition)
  • Memory exceeded (OOMKilled, usually reported as exit code 137 rather than 1; increase task memory)
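
The two diagnosis commands can be folded into one helper. The sketch below is a hypothetical `summarize_stopped_task` function that takes a single task entry from an ECS `describe_tasks` response and reads the fields that usually separate the causes above: `stoppedReason`, and each container's `exitCode` and `reason`. The hint strings are my own heuristics, not something ECS returns.

```python
def summarize_stopped_task(task: dict) -> list:
    """Return human-readable lines describing why an ECS task stopped.

    `task` is one element of describe_tasks()['tasks'].
    """
    lines = [f"stoppedReason: {task.get('stoppedReason', 'n/a')}"]
    for container in task.get("containers", []):
        name = container.get("name", "?")
        code = container.get("exitCode")  # absent if the container never started
        reason = container.get("reason", "")
        if code == 137 or "OutOfMemory" in reason:
            hint = "likely out of memory; consider raising task memory"
        elif code is None:
            hint = "never started; check image, architecture, and port mappings"
        elif code != 0:
            hint = "application error; check logs for missing environment variables"
        else:
            hint = "exited cleanly"
        lines.append(f"{name}: exitCode={code} ({hint})")
    return lines


if __name__ == "__main__":
    # Hand-written sample shaped like a describe_tasks entry (not real output)
    sample = {
        "stoppedReason": "Essential container in task exited",
        "containers": [{"name": "frontend", "exitCode": 1}],
    }
    for line in summarize_stopped_task(sample):
        print(line)
```

With boto3, the input would come from `ecs.describe_tasks(cluster=..., tasks=[...])['tasks'][0]`.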

Issue: Database Schema Drift

Symptom: API returns 500 after deployment with "column does not exist"

Fix: Add migration logic to application startup:

# utils/db.py
from sqlalchemy import text
from sqlalchemy.exc import SQLAlchemyError

def _migrate_columns(self):
    """Add columns that may not exist in older schemas."""
    # (table, column, definition) triples; hard-coded, never built from user input
    migrations = [
        ("quiz_sessions", "timed", "BOOLEAN DEFAULT TRUE"),
        ("users", "display_name", "VARCHAR(50)"),
    ]
    with self.session() as db:
        for table, column, definition in migrations:
            try:
                db.execute(text(f"ALTER TABLE {table} ADD COLUMN {column} {definition}"))
                db.commit()
            except SQLAlchemyError:
                # Usually means the column already exists; roll back and continue
                db.rollback()
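
Treating every failure as "column already exists" can hide real problems, such as a typo in a definition or a dropped connection. One alternative is to check which columns exist before altering the table. The sketch below shows that idea with the standard-library `sqlite3` module so that it runs anywhere. The article's database is not stated (the column types suggest PostgreSQL, but that is an assumption); on PostgreSQL the same check would query `information_schema.columns`, or the statement could use `ADD COLUMN IF NOT EXISTS`. The helper names are hypothetical.

```python
import sqlite3

MIGRATIONS = [
    ("quiz_sessions", "timed", "BOOLEAN DEFAULT TRUE"),
    ("users", "display_name", "VARCHAR(50)"),
]


def existing_columns(conn: sqlite3.Connection, table: str) -> set:
    """Return the column names currently defined on `table`."""
    # PRAGMA table_info yields one row per column; index 1 is the column name
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}


def add_missing_columns(conn: sqlite3.Connection, migrations=MIGRATIONS) -> list:
    """Add each (table, column, definition) that is not present yet.

    Returns the "table.column" names that were actually added.
    Identifiers come from the hard-coded list above, never from user input.
    """
    added = []
    for table, column, definition in migrations:
        if column in existing_columns(conn, table):
            continue  # already migrated, nothing to do
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {definition}")
        added.append(f"{table}.{column}")
    conn.commit()
    return added


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE quiz_sessions (id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, display_name VARCHAR(50))")
    print(add_missing_columns(conn))  # first run adds the one missing column
    print(add_missing_columns(conn))  # second run finds nothing to add
```

Because the check runs before the `ALTER TABLE`, any exception that still occurs is a real error and can be allowed to fail startup.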

Issue: Images Built for Wrong Architecture

Symptom: Task fails to start with "exec format error"

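Fix: Build with --platform linux/amd64 so the image matches Fargate's default X86_64 runtime platform (use linux/arm64 only when the task definition sets cpuArchitecture to ARM64):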

docker build --platform linux/amd64 -t myapp-frontend:latest ./frontend
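
A cheap guard is to check the image's platform before pushing it. The sketch below parses the JSON that `docker image inspect` prints, which includes `Os` and `Architecture` fields, and raises if the image does not match the expected platform. The function names are hypothetical, and the sample JSON is trimmed by hand rather than taken from real output.

```python
import json


def image_platform(inspect_output: str) -> str:
    """Return "os/arch" for the first image in `docker image inspect` JSON output."""
    images = json.loads(inspect_output)
    if not images:
        raise ValueError("inspect output contains no images")
    first = images[0]
    return f"{first['Os']}/{first['Architecture']}"


def assert_platform(inspect_output: str, expected: str = "linux/amd64") -> None:
    """Raise RuntimeError if the image was built for a different platform."""
    actual = image_platform(inspect_output)
    if actual != expected:
        raise RuntimeError(f"image is {actual}, expected {expected}")


if __name__ == "__main__":
    # Trimmed, hand-written sample; real inspect output has many more fields
    sample = '[{"Id": "sha256:example", "Os": "linux", "Architecture": "arm64"}]'
    print(image_platform(sample))
    try:
        assert_platform(sample)
    except RuntimeError as err:
        print(f"refusing to push: {err}")
```

In a deploy script, the input string would be the stdout of `docker image inspect myapp-frontend:latest`, captured with `subprocess.run(..., capture_output=True, text=True)`.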

Delivery Focus 4: How to keep cost and reliability aligned for sustained reliability (Zero Downtime Deployment)

Circuit Breaker

# CloudFormation — ECS Service
DeploymentConfiguration:
  MaximumPercent: 200
  MinimumHealthyPercent: 100
  DeploymentCircuitBreaker:
    Enable: true
    Rollback: true
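If the tasks in a new deployment repeatedly fail to start or fail health checks, the circuit breaker marks the deployment as failed and ECS rolls the service back to the last deployment that completed successfully.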
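
A deploy script should be able to tell a finished rollout from a circuit-breaker rollback. The sketch below classifies the `deployments` list from `describe_services` using each deployment's `rolloutState` field (`IN_PROGRESS`, `COMPLETED`, or `FAILED`). The function name, the return labels, and the use of the task definition to detect a rollback are my own choices, not something the article's scripts do.

```python
def rollout_status(deployments: list, expected_task_definition: str = "") -> str:
    """Classify a service's deployments.

    Returns one of: "failed", "rolled_back", "completed", "in_progress".
    `expected_task_definition` is the task definition ARN that was just deployed.
    """
    # Any FAILED deployment means the circuit breaker (or ECS) gave up on it
    if any(d.get("rolloutState") == "FAILED" for d in deployments):
        return "failed"

    primary = next((d for d in deployments if d.get("status") == "PRIMARY"), None)
    if primary is None:
        return "in_progress"

    # A different task definition on PRIMARY means we were rolled back or superseded
    if expected_task_definition and primary.get("taskDefinition") != expected_task_definition:
        return "rolled_back"

    if primary.get("rolloutState") == "COMPLETED" and len(deployments) == 1:
        return "completed"
    return "in_progress"


if __name__ == "__main__":
    new = "arn:aws:ecs:us-east-1:123456789012:task-definition/my-app:42"
    sample = [{"status": "PRIMARY", "rolloutState": "COMPLETED", "taskDefinition": new}]
    print(rollout_status(sample, new))
```

Checking the task definition matters because a completed rollback also ends with a single `COMPLETED` deployment, which would otherwise look like success.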

Rollback Strategy

# Manual rollback: revert to previous task definition
$current = aws ecs describe-services --cluster my-cluster --services my-service --query 'services[0].taskDefinition' --output text
$revision = [int]($current -split ':')[-1]
$previous = ($current -replace ":\d+$", ":$($revision - 1)")

aws ecs update-service --cluster my-cluster --service my-service --task-definition $previous
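
The same revision arithmetic in Python, as a minimal sketch with a hypothetical function name. It adds the one guard the PowerShell snippet lacks: refusing to go below revision 1. Two caveats apply to both versions. Revision N-1 is not necessarily the revision that was deployed before, and it may have been deregistered. Also, if every revision references the same mutable `:latest` image tag, switching revisions alone does not change which image runs.

```python
import re

# Matches "family:7" as well as a full task definition ARN ending in ":7"
_REVISION = re.compile(r"^(?P<prefix>.+):(?P<revision>\d+)$")


def previous_task_definition(current: str) -> str:
    """Return the task definition identifier with its revision decremented by one."""
    match = _REVISION.match(current.strip())
    if match is None:
        raise ValueError(f"no revision suffix in {current!r}")
    revision = int(match.group("revision"))
    if revision <= 1:
        raise ValueError("revision 1 has no earlier revision to roll back to")
    return f"{match.group('prefix')}:{revision - 1}"


if __name__ == "__main__":
    arn = "arn:aws:ecs:us-east-1:123456789012:task-definition/my-app:42"
    print(previous_task_definition(arn))
```

The result can be passed to `aws ecs update-service --task-definition`, exactly as in the manual command above.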

Production Gate

Production deployments require explicit approval (enforced by convention, not automation):

# prod/run.py
import sys

if __name__ == '__main__':
    print("⚠️ PRODUCTION DEPLOYMENT")
    print("This will deploy to https://www.example.com/")
    confirm = input("Type 'deploy-prod' to confirm: ")
    if confirm != 'deploy-prod':
        print("Aborted.")
        sys.exit(1)
    # proceed with deploy...
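
The gate is easier to test, and safer in non-interactive shells, when the prompt is a function with an injectable input source. This is a minimal sketch; `confirm_production` and its parameters are hypothetical names, not part of the article's script.

```python
import sys


def confirm_production(prompt_fn=input, expected: str = "deploy-prod") -> bool:
    """Return True only if the operator types the exact confirmation phrase."""
    try:
        answer = prompt_fn(f"Type '{expected}' to confirm: ")
    except EOFError:
        # No stdin (for example, when run from a scheduler): treat as a refusal
        return False
    return answer.strip() == expected


if __name__ == "__main__":
    print("PRODUCTION DEPLOYMENT")
    if not confirm_production():
        print("Aborted.")
        sys.exit(1)
    print("Confirmed, proceeding with deploy...")
```

Treating a missing stdin as a refusal means that an accidental run from a pipeline or cron job aborts instead of crashing with a traceback.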

Delivery Focus 5: What to document for your team for secure delivery (Zero Downtime Deployment)

Final Platform Structure

platform/
├── local/
│   ├── compose/
│   │   ├── compose.yaml          ← Local dev stack
│   │   └── compose.dev.yaml      ← Dev overrides
│   └── env/
│       └── .env.example
├── aws/
│   ├── cloudformation/
│   │   ├── stack.yaml            ← Production CFN
│   │   └── stack-dev.yaml        ← Dev CFN
│   ├── runtime/
│   │   ├── lambda/               ← Lambda function code
│   │   └── task-definitions/     ← ECS task def templates
│   ├── deploy/
│   │   ├── scripts/              ← Shared utilities
│   │   └── workflows/            ← Per-action automation
│   │       ├── deploy-via-ec2/
│   │       ├── dev-env-control/
│   │       ├── pg-env-control/
│   │       ├── alb-source-ip-access/
│   │       └── ecr-cleanup/
│   └── monitoring/
│       ├── dashboard.json
│       └── dashboard-manager/
├── security/
│   └── tests/                    ← sqlmap, OWASP ZAP configs
└── docs/
    ├── runbooks/
    └── incidents/

Delivery Focus 6: Where this architecture earns its value for predictable operations (Zero Downtime Deployment)

Dev Environment Start/Stop

# dev-env-control/run.py
import sys

import boto3

CLUSTER = "my-cluster"
SERVICE = "my-dev-service"
REGION = "us-east-1"

def get_status():
    """Return (desiredCount, runningCount) for the dev service."""
    ecs = boto3.client('ecs', region_name=REGION)
    resp = ecs.describe_services(cluster=CLUSTER, services=[SERVICE])
    svc = resp['services'][0]
    return svc['desiredCount'], svc['runningCount']

def scale(count: int):
    """Set the desired task count: 0 stops the environment, 1 starts it."""
    ecs = boto3.client('ecs', region_name=REGION)
    ecs.update_service(cluster=CLUSTER, service=SERVICE, desiredCount=count)
    print(f"Scaled {SERVICE} to desiredCount={count}")

if __name__ == '__main__':
    action = sys.argv[1] if len(sys.argv) > 1 else 'status'

    if action == 'status':
        desired, running = get_status()
        print(f"Dev: desired={desired}, running={running}")
    elif action == 'up':
        scale(1)
    elif action == 'down':
        scale(0)
    else:
        # Fail loudly instead of silently doing nothing on a typo
        print(f"Unknown action '{action}'. Use: status | up | down")
        sys.exit(2)
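
One small refinement, sketched under my own assumptions: decide the target count in a pure function, so the script can skip `update_service` when the service is already at the requested size. `plan_scale` and the `ACTIONS` table are hypothetical names.

```python
from typing import Optional

# Maps each control action to the desiredCount it should produce
ACTIONS = {"up": 1, "down": 0}


def plan_scale(action: str, current_desired: int) -> Optional[int]:
    """Return the desiredCount to set, or None if no change is needed."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action {action!r}; expected one of {sorted(ACTIONS)}")
    target = ACTIONS[action]
    return None if target == current_desired else target


if __name__ == "__main__":
    print(plan_scale("up", current_desired=0))    # needs scaling
    print(plan_scale("down", current_desired=0))  # already stopped
```

In the control script, this would be called with the `desired` value from `get_status()`, and `scale()` would run only when the result is not `None`.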

IP Allowlist Management

# alb-source-ip-access/run.py
import sys

import boto3

LISTENER_ARN = "arn:aws:elasticloadbalancing:us-east-1:ACCOUNT:listener/app/my-alb/EXAMPLE/EXAMPLE"
RULE_ARNS = {
    "dev": "arn:aws:elasticloadbalancing:...:listener-rule/EXAMPLE1",
    "pg": "arn:aws:elasticloadbalancing:...:listener-rule/EXAMPLE2",
}

def get_current_ips(rule_arn: str) -> list:
    """Return the CIDR blocks in the rule's source-ip condition, if any."""
    elbv2 = boto3.client('elbv2')
    rules = elbv2.describe_rules(RuleArns=[rule_arn])
    for condition in rules['Rules'][0]['Conditions']:
        if condition['Field'] == 'source-ip':
            return condition['SourceIpConfig']['Values']
    return []

def update_ip(rule_arn: str, new_ip: str):
    """Point the rule's source-ip condition at a single /32 address."""
    elbv2 = boto3.client('elbv2')
    # Conditions replaces the rule's entire condition list, so this call
    # assumes source-ip is the only condition on the rule.
    elbv2.modify_rule(
        RuleArn=rule_arn,
        Conditions=[
            {'Field': 'source-ip', 'SourceIpConfig': {'Values': [f'{new_ip}/32']}}
        ]
    )

if __name__ == '__main__':
    if len(sys.argv) < 2 or (sys.argv[1] == 'update' and len(sys.argv) < 3):
        print("Usage: run.py status | update <ip>")
        sys.exit(2)
    action = sys.argv[1]  # 'status' or 'update'
    if action == 'status':
        for name, arn in RULE_ARNS.items():
            ips = get_current_ips(arn)
            print(f"{name}: {ips}")
    elif action == 'update':
        new_ip = sys.argv[2]
        for name, arn in RULE_ARNS.items():
            update_ip(arn, new_ip)
            print(f"Updated {name} → {new_ip}/32")
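
Two details are worth guarding in an allowlist script: the address should be validated before it reaches the API, and any other conditions on the rule (a host header, for instance) should survive the update. The sketch below handles both with the standard-library `ipaddress` module. The function names are hypothetical. Dropping the legacy top-level `Values` key rests on my understanding that `describe_rules` returns it alongside the `*Config` block while `modify_rule` accepts only one of the two; verify this against the ELBv2 API reference before relying on it.

```python
import ipaddress


def to_cidr(value: str) -> str:
    """Normalize an address or network to CIDR form (single hosts get /32 or /128)."""
    if "/" in value:
        # strict=False lets "203.0.113.5/24" normalize to its network address
        return str(ipaddress.ip_network(value, strict=False))
    addr = ipaddress.ip_address(value)  # raises ValueError on bad input
    return f"{addr}/{addr.max_prefixlen}"


def merge_source_ip(conditions: list, new_ip: str) -> list:
    """Return a new condition list with source-ip replaced and the rest kept."""
    merged = []
    for cond in conditions:
        if cond.get("Field") == "source-ip":
            continue  # replaced below
        kept = dict(cond)
        # Assumption: keep only the *Config form when both representations exist
        if any(key.endswith("Config") for key in kept):
            kept.pop("Values", None)
        merged.append(kept)
    merged.append({"Field": "source-ip", "SourceIpConfig": {"Values": [to_cidr(new_ip)]}})
    return merged


if __name__ == "__main__":
    existing = [
        {"Field": "host-header", "HostHeaderConfig": {"Values": ["dev.example.com"]}},
        {"Field": "source-ip", "SourceIpConfig": {"Values": ["198.51.100.1/32"]}},
    ]
    print(merge_source_ip(existing, "203.0.113.7"))
```

The merged list would be passed as `Conditions=` to `modify_rule`, with `existing` taken from `describe_rules(...)['Rules'][0]['Conditions']`.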

Delivery Focus 7: Operational notes from real-world usage for exam and field confidence (Zero Downtime Deployment)

Architecture

deploy/workflows/
├── deploy-via-ec2/
│   ├── dev/
│   │   └── run.py        ← Deploy to dev
│   └── prod/
│       └── run.py        ← Deploy to prod (requires approval)
├── dev-env-control/
│   └── run.py            ← Start/stop dev environment
├── pg-env-control/
│   └── run.py            ← Start/stop PgAdmin service
└── alb-source-ip-access/
    └── run.py            ← Update IP allowlist

Core Deployment Script

#!/usr/bin/env python3
"""Deploy local Docker images to ECS Fargate environment."""

import json
import subprocess
import sys
import time

# Configuration
CLUSTER = "my-cluster"
REGION = "us-east-1"
ACCOUNT_ID = "EXAMPLE_ACCOUNT"
ECR_BASE = f"{ACCOUNT_ID}.dkr.ecr.{REGION}.amazonaws.com"

SERVICES = {
    "dev": "my-dev-service",
    "prod": "my-prod-service",
}

IMAGES = ["myapp-frontend", "myapp-backend", "myapp-nginx"]

def run(cmd: str, check=True) -> subprocess.CompletedProcess:
    """Execute shell command with output."""
    print(f" → {cmd}")
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    if check and result.returncode != 0:
        print(f" ✗ FAILED: {result.stderr}")
        sys.exit(1)
    return result

def ecr_login():
    """Authenticate Docker to ECR."""
    print("\n[1/5] ECR Login...")
    # Pipe the token into docker login; --password-stdin reads it from stdin
    run(
        f"aws ecr get-login-password --region {REGION} "
        f"| docker login --username AWS --password-stdin {ECR_BASE}"
    )

def build_images():
    """Build all Docker images for linux/amd64."""
    print("\n[2/5] Building images...")
    for img in IMAGES:
        context = img.replace("myapp-", "")
        run(f"docker build --platform linux/amd64 -t {img}:latest ./{context}")

def push_images():
    """Tag and push all images to ECR."""
    print("\n[3/5] Pushing to ECR...")
    for img in IMAGES:
        ecr_uri = f"{ECR_BASE}/{img}:latest"
        run(f"docker tag {img}:latest {ecr_uri}")
        run(f"docker push {ecr_uri}")

def deploy(env: str, timeout: int = 600) -> bool:
    """Force new ECS deployment and wait for stability."""
    service = SERVICES[env]
    print(f"\n[4/5] Deploying to {env} ({service})...")
    run(f"aws ecs update-service --cluster {CLUSTER} --service {service} --force-new-deployment --region {REGION}")

    print(f" Waiting for rollout (timeout: {timeout}s)...")
    start = time.time()

    while time.time() - start < timeout:
        # Double quotes around the JMESPath query work in both cmd.exe and POSIX shells
        result = run(f'aws ecs describe-services --cluster {CLUSTER} --services {service} --region {REGION} --query "services[0].deployments" --output json', check=False)
        if result.returncode != 0:
            # Likely a transient API or credential error: report it and poll again
            print(f" ! describe-services failed: {result.stderr.strip()}")
            time.sleep(15)
            continue
        deployments = json.loads(result.stdout) or []
        primary = next((d for d in deployments if d['status'] == 'PRIMARY'), None)

        # Done when only the PRIMARY deployment remains and all its tasks are running
        if primary and primary['runningCount'] == primary['desiredCount'] and len(deployments) == 1:
            print(f" ✓ Rollout complete ({int(time.time()-start)}s)")
            return True

        time.sleep(15)

    print(" ✗ Timeout waiting for deployment!")
    return False

def verify(env: str) -> bool:
    """Health check the deployed environment; return True only if every check passes."""
    print(f"\n[5/5] Verifying {env}...")
    urls = {
        "dev": "https://dev.example.com",
        "prod": "https://www.example.com",
    }
    base = urls[env]

    checks = [
        (f"{base}/", 200),
        (f"{base}/api/questions/stats/AZ-900", 200),
    ]

    all_ok = True
    for url, expected in checks:
        # curl.exe and NUL are Windows-specific; on Linux/macOS use curl and /dev/null
        result = run(f'curl.exe -s -o NUL -w "%{{http_code}}" {url}', check=False)
        raw = result.stdout.strip()
        code = int(raw) if raw.isdigit() else 0  # 0 means no HTTP response at all
        ok = code == expected
        all_ok = all_ok and ok
        print(f" {'✓' if ok else '✗'} {url} → {code}")
    return all_ok

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--skip-build", action="store_true")
    # Accepted for compatibility; not used anywhere in this excerpt
    parser.add_argument("--skip-env-check", action="store_true")
    parser.add_argument("--timeout", type=int, default=600)
    args = parser.parse_args()

    env = "dev"  # or "prod" based on script location
    ecr_login()
    if not args.skip_build:
        build_images()
    push_images()
    success = deploy(env, timeout=args.timeout)
    # A rollout that finishes but fails its health checks counts as a failed deploy
    if success and verify(env):
        print(f"\n✓ Deploy to {env} COMPLETE")
    else:
        print(f"\n✗ Deploy to {env} FAILED - check ECS console")
        sys.exit(1)
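
The polling loop in `deploy()` mixes three concerns: the clock, the sleep, and the readiness condition. Separating them makes the timeout logic testable without waiting. The sketch below is a generic `wait_until` helper (a hypothetical name) that takes the clock and sleep functions as parameters, and uses `time.monotonic` by default so that wall-clock adjustments cannot shorten or extend the timeout.

```python
import time


def wait_until(predicate, timeout: float, interval: float = 15.0,
               clock=time.monotonic, sleep=time.sleep) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds have elapsed.

    Returns True as soon as the predicate succeeds, and False when another
    sleep would run past the deadline.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() + interval > deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    # Simulated rollout that becomes ready on the third poll, using a fake clock
    now = [0.0]
    polls = []

    def ready() -> bool:
        polls.append(now[0])
        return len(polls) >= 3

    ok = wait_until(ready, timeout=600, interval=15,
                    clock=lambda: now[0],
                    sleep=lambda s: now.__setitem__(0, now[0] + s))
    print(ok, polls)
```

In the deployment script, the predicate would wrap the `describe-services` call and the PRIMARY deployment check, and `timeout` would come from `--timeout`.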

Delivery Focus 8: How to avoid expensive rework for cleaner ownership (Zero Downtime Deployment)

The 7-Step Manual Process

# 1. Build images for linux/amd64 (required for Fargate)
docker build --platform linux/amd64 -t myapp-frontend:latest ./frontend
docker build --platform linux/amd64 -t myapp-backend:latest ./backend
docker build --platform linux/amd64 -t myapp-nginx:latest ./nginx

# 2. Tag for ECR
docker tag myapp-frontend:latest ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/myapp-frontend:latest
docker tag myapp-backend:latest ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/myapp-backend:latest
docker tag myapp-nginx:latest ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/myapp-nginx:latest

# 3. Login to ECR
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com

# 4. Push all images
docker push ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/myapp-frontend:latest
docker push ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/myapp-backend:latest
docker push ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/myapp-nginx:latest

# 5. Force new ECS deployment
aws ecs update-service --cluster my-cluster --service my-service --force-new-deployment

# 6. Wait for rollout
aws ecs wait services-stable --cluster my-cluster --services my-service

# 7. Verify health
curl.exe -s https://www.example.com/ -o NUL -w "%{http_code}"
curl.exe -s https://www.example.com/api/questions/stats/AZ-900

Problems: The process is tedious and error-prone, has no rollback and no environment separation, and takes about 15 minutes of manual work per deploy.
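
Step 7 also ties the process to Windows, since `curl.exe` and `NUL` do not exist elsewhere. A portable alternative is to do the health check in Python with the standard-library `urllib`. This is a sketch under my own assumptions: the function names are hypothetical, and unlike plain `curl -s`, `urlopen` follows redirects, so a 301 to a healthy page is reported as 200.

```python
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a GET request, or 0 if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx and 5xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, timeout, and similar


def run_checks(checks, fetch=http_status) -> bool:
    """Print one line per (url, expected_status) pair; return True if all match."""
    all_ok = True
    for url, expected in checks:
        code = fetch(url)
        ok = code == expected
        all_ok = all_ok and ok
        print(f" {'ok  ' if ok else 'FAIL'} {url} -> {code}")
    return all_ok


if __name__ == "__main__":
    # Stubbed fetcher so the demo does not touch the network
    fake = {"https://www.example.com/": 200}
    print(run_checks([("https://www.example.com/", 200)], fetch=lambda u: fake.get(u, 0)))
```

Passing `fetch` as a parameter keeps the check logic testable offline; a real run would use the default `http_status`.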


Delivery Focus 9: Where teams usually get this wrong for measurable outcomes (Zero Downtime Deployment)

  1. Automate everything — Manual deploys are error-prone and slow
  2. Separate environments — Dev and prod should be independently deployable
  3. Circuit breaker is essential — Bad deploys auto-rollback
  4. Health checks after deploy — Never assume success without verification
  5. Environment control scripts — Start/stop dev to save costs
  6. IP allowlists as code — Dynamic IP updates for restricted environments
  7. Platform linux/amd64 — Always specify architecture for Fargate
  8. Schema migrations at startup — Handle drift gracefully