Running 63 Docker Containers on a 16GB AWS Instance: A Post-Mortem
My production server (ctmprod) runs 63 Docker containers on a single t3a.xlarge EC2 instance with 16GB RAM. It serves 100+ users across 48 different services. And it's been rock solid for 18 months.
Here's what I learned from running Docker at this scale on a budget.
🧠 Why Self-Host?
The SaaS tax is real. Grafana Cloud wants $200/month. Auth0 charges per user. Uptime monitoring tools cost $50/month for basic checks.
My entire AWS bill for ctmprod: $87/month. That includes:
- Grafana + Loki + Prometheus (monitoring stack)
- Authentik SSO (replaces Auth0/Okta)
- n8n (workflow automation)
- Windmill (code-first workflow engine)
- Ghost (this blog)
- PostgreSQL, Redis, RabbitMQ
- ...and 57 more containers
The equivalent SaaS subscriptions would cost $1,200+/month. Self-hosting saves me $13k/year.
⚙️ The Stack
Host: AWS EC2 t3a.xlarge (4 vCPU, 16GB RAM, 100GB EBS)
OS: Ubuntu 22.04 LTS
Reverse proxy: Traefik v2.11
Container runtime: Docker 25.0 with docker-compose
SSL: Automatic Let's Encrypt via Traefik
Networking: Tailscale mesh + public internet via Cloudflare proxy
Traefik is the MVP here. It handles routing for 48+ services, automatic SSL renewal, middleware (auth, rate limiting, compression), and service discovery via Docker labels. Zero manual nginx configs.
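The labels-based routing described here presupposes a small static Traefik config defining the entrypoints and the Docker provider. The author's actual file isn't shown, so treat this as an illustrative sketch (the `letsencrypt` certificate resolver referenced by services would be defined in the same file):

```yaml
# traefik.yml — sketch of the static config this setup implies (not the author's file)
entryPoints:
  web:
    address: ":80"
    http:
      redirections:
        entryPoint:
          to: websecure      # force HTTP -> HTTPS
          scheme: https
  websecure:
    address: ":443"

providers:
  docker:
    exposedByDefault: false  # only containers with traefik.enable=true get routed
```

With `exposedByDefault: false`, nothing is published accidentally; each service opts in via its own labels.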
🚀 How 16GB Handles 63 Containers
The secret: most containers are idle most of the time.
```shell
# Current memory usage on ctmprod
$ docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}" | head -20
NAME                    MEM USAGE
traefik                 180MB / 16GB
grafana                 350MB / 16GB
postgres-main           420MB / 16GB
redis                    12MB / 16GB
authentik-server        280MB / 16GB
authentik-worker        140MB / 16GB
n8n                     320MB / 16GB
windmill-server         180MB / 16GB
windmill-workers        240MB / 16GB
ghost                   110MB / 16GB
loki                    380MB / 16GB
prometheus              520MB / 16GB
uptime-kuma              95MB / 16GB
vaultwarden              45MB / 16GB
portainer                85MB / 16GB
watchtower               32MB / 16GB
...47 more containers  ~3.2GB total
```

Total usage: 6.8GB / 16GB (42%)

Key strategies:
1. Memory limits for every container
```yaml
# docker-compose.yml example
services:
  grafana:
    image: grafana/grafana:latest
    restart: unless-stopped   # so an OOM-killed container comes back automatically
    mem_swappiness: 0         # disable swap for critical services
    deploy:
      resources:
        limits:
          memory: 512M
        reservations:
          memory: 256M
```

This prevents runaway processes from killing the host. Grafana gets 512MB max. If it tries to use more, the kernel OOM-kills the container and the restart policy brings it straight back.
2. Shared databases
Instead of 15 PostgreSQL instances (one per app), I run one Postgres container with multiple databases. Same for Redis.
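One way to provision those databases is an init script that the official `postgres` image runs when its data directory is first initialized. This is a sketch with illustrative names, not the author's actual setup:

```shell
#!/bin/sh
# Hypothetical init script, mounted at /docker-entrypoint-initdb.d/10-create-dbs.sh;
# the official postgres image executes it on first start.
set -e
for db in grafana_db authentik_db n8n_db ghost_db windmill_db; do
  psql -v ON_ERROR_STOP=1 -U "$POSTGRES_USER" -c "CREATE DATABASE $db;"
done
```

Each app then gets its own connection string pointing at `postgres-main` with its own database name.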
```text
# One Postgres, many databases
postgres-main:
  - grafana_db
  - authentik_db
  - n8n_db
  - ghost_db
  - windmill_db
  - ...15 more databases

# Memory footprint: 420MB (vs 6GB+ for separate instances)
```

3. Alpine-based images
Alpine Linux images are 5-10x smaller than Ubuntu-based ones. Less disk space, faster pulls, lower memory overhead.
```text
# Image size comparison
redis:latest (Debian)      116MB
redis:alpine                41MB
node:18 (Debian)           910MB
node:18-alpine             173MB
postgres:15 (Debian)       376MB
postgres:15-alpine         238MB
```

4. Lazy loading with autoscaling
Services like Windmill workers scale from 0 to 3 replicas based on queue depth. When idle, they consume zero resources.
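The post doesn't show the scaling mechanism. One minimal approximation is a cron-driven script that maps queue depth to a replica count via `docker compose --scale`; the queue-depth endpoint below is a placeholder, not a real Windmill API:

```shell
#!/bin/sh
# Hypothetical autoscaler: scale windmill-workers by queue depth.
DEPTH=$(curl -sf http://windmill-server:8000/queue/depth || echo 0)  # placeholder endpoint
if   [ "$DEPTH" -gt 10 ]; then REPLICAS=3
elif [ "$DEPTH" -gt 0  ]; then REPLICAS=1
else                           REPLICAS=0
fi
docker compose up -d --scale windmill-workers="$REPLICAS" windmill-workers
```

At `--scale ...=0` the worker containers are stopped entirely, which is where the "zero resources when idle" claim comes from.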
💡 Traefik Configuration
Here's how I route 48 services through Traefik with zero manual config:
```yaml
# docker-compose.yml for a service
services:
  grafana:
    image: grafana/grafana:latest
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.grafana.rule=Host(`analytics.petieclark.com`)"
      - "traefik.http.routers.grafana.entrypoints=websecure"
      - "traefik.http.routers.grafana.tls.certresolver=letsencrypt"
      - "traefik.http.routers.grafana.middlewares=authentik@docker"
      - "traefik.http.services.grafana.loadbalancer.server.port=3000"
    networks:
      - traefik-public

networks:
  traefik-public:
    external: true
```

That's it. Traefik reads Docker labels, creates routes, provisions SSL certs, and applies SSO middleware. Add a new service? Just add labels. No restarts, no config files.
📊 Real-World Performance
Metrics from the past 90 days:
- Uptime: 99.97% (one planned reboot for kernel update)
- Average CPU: 18% (spikes to 60% during n8n workflows)
- Average RAM: 7.2GB / 16GB (45%)
- Network: 180GB egress/month (well under AWS limits)
- Container restarts: 3 total (all OOM kills on containers whose memory limits were set too low; fixed by raising the limits)
The system handles 10,000+ API requests/day across all services. Response times are sub-200ms for most endpoints (helped by Redis caching and aggressive CDN use).
🔥 What Went Wrong
1. Loki's log explosion (Sept 2025)
Loki filled the 100GB EBS volume in 3 days due to verbose debug logging from a misbehaving app. The host became unresponsive.
Fix: Added log retention limits (7 days) and volume alerts in Grafana.
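The retention fix maps to two pieces of Loki config: a retention period in `limits_config`, and the compactor actually doing the deleting. A sketch for Loki 2.x (check your version's docs — key names have shifted across releases):

```yaml
# loki-config.yaml (partial) — 7-day retention via the compactor
limits_config:
  retention_period: 168h          # 7 days
compactor:
  working_directory: /loki/compactor
  retention_enabled: true         # without this, expired chunks are never deleted
  retention_delete_delay: 2h
```

Pair this with a Grafana alert on the EBS volume's free space so the disk can never silently fill again.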
2. Ghost OOM crashes (Oct 2025)
Ghost's Node.js process leaked memory until hitting 2GB and crashing. No memory limit meant it took down other containers.
Fix: Set 512MB limit. Ghost now auto-restarts before causing issues.
3. Let's Encrypt rate limits (Nov 2025)
I hit Let's Encrypt's limit of 50 certificates per registered domain per week after adding 20 new subdomains in one day. Services went down with invalid certs.
Fix: Switched to wildcard certs (*.petieclark.com). One cert for unlimited subdomains.
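Wildcard certificates require the DNS-01 challenge, so Traefik needs API access to the DNS provider. A sketch assuming Cloudflare DNS (consistent with the Cloudflare proxying mentioned earlier); the email and token handling are placeholders, not the author's config:

```yaml
# traefik.yml (partial) — switch the resolver to the DNS-01 challenge
certificatesResolvers:
  letsencrypt:
    acme:
      email: admin@example.com          # placeholder
      storage: /letsencrypt/acme.json
      dnsChallenge:
        provider: cloudflare            # expects CF_DNS_API_TOKEN in the environment
```

Routers then request the wildcard explicitly via `tls.domains` labels (`main=petieclark.com`, `sans=*.petieclark.com`), so every new subdomain reuses the one cert.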
🧠 Lessons Learned
- Memory limits are non-negotiable. Set them for every container, even trusted ones.
- Traefik scales effortlessly. It's handling 48 routes without breaking a sweat. Could easily do 100+.
- Shared databases save massive RAM. One Postgres beats 15 separate instances.
- Alpine images are worth it. Smaller, faster, cheaper. The occasional compatibility issue is a fair trade.
- Monitoring pays for itself. Grafana alerts caught 3 incidents before users noticed.
- Wildcard certs are essential at scale. Don't hit Let's Encrypt rate limits like I did.
- Backups are boring until they're critical. Daily snapshots to S3 (encrypted, versioned) saved me twice.
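The post doesn't include the backup script itself; a minimal sketch of the nightly job it describes, with a hypothetical bucket name and paths:

```shell
#!/bin/sh
# Nightly: dump the shared Postgres, compress, ship to versioned S3 with server-side encryption.
set -e
STAMP=$(date +%F)
docker exec postgres-main pg_dumpall -U postgres | gzip > "/backups/pg-$STAMP.sql.gz"
aws s3 cp "/backups/pg-$STAMP.sql.gz" "s3://example-backups/ctmprod/" --sse aws:kms
```

With bucket versioning enabled, even an overwritten or corrupted dump can be rolled back.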
Self-hosting isn't for everyone. It requires time, discipline, and a willingness to wake up at 3am when Postgres crashes. But the cost savings, control, and learning are worth it.
I'm running 100+ containers across 4 hosts now (ctmprod, overseer, Mac Studio, and a Proxmox cluster). Total monthly cost: $150. SaaS equivalent: $3,000+.
Next up: I'm migrating Prometheus to VictoriaMetrics for better long-term storage efficiency. Stay tuned.