My production server (ctmprod) runs 63 Docker containers on a single t3a.xlarge EC2 instance with 16GB RAM. It serves 100+ users across 48 different services. And it's been rock solid for 18 months.

Here's what I learned from running Docker at this scale on a budget.


🧠 Why Self-Host?

The SaaS tax is real. Grafana Cloud wants $200/month. Auth0 charges per user. Uptime monitoring tools cost $50/month for basic checks.

My entire AWS bill for ctmprod: $87/month. That includes:

  • Grafana + Loki + Prometheus (monitoring stack)
  • Authentik SSO (replaces Auth0/Okta)
  • n8n (workflow automation)
  • Windmill (code-first workflow engine)
  • Ghost (this blog)
  • PostgreSQL, Redis, RabbitMQ
  • ...and 57 more containers

The equivalent SaaS subscriptions would cost $1,200+/month. Self-hosting saves me $13k/year.

⚙️ The Stack

Host: AWS EC2 t3a.xlarge (4 vCPU, 16GB RAM, 100GB EBS)

OS: Ubuntu 22.04 LTS

Reverse proxy: Traefik v2.11

Container runtime: Docker 25.0 with docker-compose

SSL: Automatic Let's Encrypt via Traefik

Networking: Tailscale mesh + public internet via Cloudflare proxy

Traefik is the MVP here. It handles routing for 48+ services, automatic SSL renewal, middleware (auth, rate limiting, compression), and service discovery via Docker labels. Zero manual nginx configs.
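
For context, the Traefik service itself is just another container. Here's a simplified sketch of that definition rather than my exact file (the email and volume paths are placeholders):

# docker-compose.yml for Traefik itself (simplified sketch)
services:
  traefik:
    image: traefik:v2.11
    command:
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      - "--entrypoints.web.address=:80"
      - "--entrypoints.websecure.address=:443"
      - "--certificatesresolvers.letsencrypt.acme.email=admin@example.com"
      - "--certificatesresolvers.letsencrypt.acme.storage=/letsencrypt/acme.json"
      - "--certificatesresolvers.letsencrypt.acme.tlschallenge=true"
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro  # label-based service discovery
      - ./letsencrypt:/letsencrypt                    # cert storage survives restarts
    networks:
      - traefik-public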

🚀 How 16GB Handles 63 Containers

The secret: most containers are idle most of the time.

# Current memory usage on ctmprod
$ docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}" | head -20

NAME                    MEM USAGE
traefik                 180MB / 16GB
grafana                 350MB / 16GB
postgres-main           420MB / 16GB
redis                   12MB / 16GB
authentik-server        280MB / 16GB
authentik-worker        140MB / 16GB
n8n                     320MB / 16GB
windmill-server         180MB / 16GB
windmill-workers        240MB / 16GB
ghost                   110MB / 16GB
loki                    380MB / 16GB
prometheus              520MB / 16GB
uptime-kuma             95MB / 16GB
vaultwarden             45MB / 16GB
portainer               85MB / 16GB
watchtower              32MB / 16GB
...47 more containers    ~3.2GB total

Total usage: 6.8GB / 16GB (42%)

Key strategies:

1. Memory limits for every container

# docker-compose.yml example
services:
  grafana:
    image: grafana/grafana:latest
    restart: unless-stopped  # so an OOM-killed container comes straight back
    deploy:
      resources:
        limits:
          memory: 512M
        reservations:
          memory: 256M
    mem_swappiness: 0  # Minimize swapping for critical services (cgroup v1 only)

This prevents runaway processes from killing the host. Grafana gets 512MB max; if it tries to use more, the container gets OOM-killed and the restart policy brings it straight back.

2. Shared databases

Instead of 15 PostgreSQL instances (one per app), I run one Postgres container with multiple databases. Same for Redis.

# One Postgres, many databases
postgres-main:
  - grafana_db
  - authentik_db
  - n8n_db
  - ghost_db
  - windmill_db
  - ...15 more databases

# Memory footprint: 420MB (vs 6GB+ for separate instances)
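
How do the databases get created? The official Postgres image runs anything in /docker-entrypoint-initdb.d/ on first startup, so a small init script does the job. A rough sketch of an init-databases.sh mounted into that directory (database names match the list above; POSTGRES_USER comes from the container's environment):

#!/bin/bash
# Runs once, on the very first start of the postgres-main container.
set -e
for db in grafana_db authentik_db n8n_db ghost_db windmill_db; do
  psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" -c "CREATE DATABASE $db;"
done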

3. Alpine-based images

Alpine-based images are dramatically smaller than their Debian-based counterparts: from roughly 40% smaller (Postgres) to 5x+ smaller (Node). Less disk space, faster pulls, smaller attack surface.

# Image size comparison
redis:latest (Debian)       116MB
redis:alpine                 41MB

node:18 (Debian)            910MB
node:18-alpine              173MB

postgres:15 (Debian)        376MB
postgres:15-alpine          238MB
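
Easy to verify locally if you're skeptical:

# Compare image sizes yourself
$ docker pull redis:latest && docker pull redis:alpine
$ docker images redis --format "{{.Repository}}:{{.Tag}}  {{.Size}}"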

4. Lazy loading with autoscaling

Services like Windmill workers scale from 0 to 3 replicas based on queue depth. When idle, they consume zero resources.
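
Compose won't scale anything on its own, so a small scheduled script drives it. This is a rough sketch of the idea rather than my exact setup; the compose file path and the Prometheus metric name (queue_pending_jobs) are placeholders for however you measure queue depth:

#!/bin/bash
# Scale windmill-workers between 0 and 3 based on queue depth (run from cron).
set -euo pipefail

# Placeholder metric: swap in your real queue-depth query.
DEPTH=$(curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=queue_pending_jobs' | jq -r '.data.result[0].value[1] // "0"')
DEPTH=${DEPTH%.*}  # Prometheus returns strings like "5"; strip any decimals

REPLICAS=$(( DEPTH > 3 ? 3 : DEPTH ))
docker compose -f /opt/windmill/docker-compose.yml up -d --scale windmill-workers=$REPLICAS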

💡 Traefik Configuration

Here's how I route 48 services through Traefik without hand-writing a single proxy config:

# docker-compose.yml for a service
services:
  grafana:
    image: grafana/grafana:latest
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.grafana.rule=Host(`analytics.petieclark.com`)"
      - "traefik.http.routers.grafana.entrypoints=websecure"
      - "traefik.http.routers.grafana.tls.certresolver=letsencrypt"
      - "traefik.http.routers.grafana.middlewares=authentik@docker"
      - "traefik.http.services.grafana.loadbalancer.server.port=3000"
    networks:
      - traefik-public

networks:
  traefik-public:
    external: true

That's it. Traefik reads Docker labels, creates routes, provisions SSL certs, and applies SSO middleware. Add a new service? Just add labels. No restarts, no config files.
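
One piece that isn't visible above: the authentik@docker middleware has to be defined somewhere. Mine is declared as labels on the Authentik server container, roughly like this (the address follows Authentik's embedded-outpost convention; adjust host, port, and headers to your deployment):

# Labels on the authentik-server container (sketch)
labels:
  - "traefik.enable=true"
  - "traefik.http.middlewares.authentik.forwardauth.address=http://authentik-server:9000/outpost.goauthentik.io/auth/traefik"
  - "traefik.http.middlewares.authentik.forwardauth.trustForwardHeader=true"
  - "traefik.http.middlewares.authentik.forwardauth.authResponseHeaders=X-authentik-username,X-authentik-groups,X-authentik-email"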

📊 Real-World Performance

Metrics from the past 90 days:

  • Uptime: 99.97% (one planned reboot for kernel update)
  • Average CPU: 18% (spikes to 60% during n8n workflows)
  • Average RAM: 7.2GB / 16GB (45%)
  • Network: 180GB egress/month (well under AWS limits)
  • Container restarts: 3 total (all OOM kills on containers whose memory limits were set too tight; fixed by right-sizing those limits)

The system handles 10,000+ API requests/day across all services. Response times are sub-200ms for most endpoints (helped by Redis caching and aggressive CDN use).

🔥 What Went Wrong

1. Loki's log explosion (Sept 2025)

Loki filled the 100GB EBS volume in 3 days due to verbose debug logging from a misbehaving app. The host became unresponsive.

Fix: Added log retention limits (7 days) and volume alerts in Grafana.
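
For reference, retention in Loki is driven by a limit plus the compactor, roughly like this (exact compactor fields vary between Loki versions, so treat this as a sketch):

# loki config snippet: keep 7 days of logs
limits_config:
  retention_period: 168h
compactor:
  working_directory: /loki/compactor
  retention_enabled: true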

2. Ghost OOM crashes (Oct 2025)

Ghost's Node.js process leaked memory until hitting 2GB and crashing. No memory limit meant it took down other containers.

Fix: Set 512MB limit. Ghost now auto-restarts before causing issues.

3. Let's Encrypt rate limits (Nov 2025)

I hit Let's Encrypt's limit of 50 certificates per registered domain per week after adding 20 new subdomains in one day. Services went down with invalid-certificate errors.

Fix: Switched to wildcard certs (*.petieclark.com). One cert for unlimited subdomains.
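
Wildcards require the DNS-01 challenge, which Traefik handles natively; since the domain's DNS already sits behind Cloudflare, that's the provider I point it at. A sketch of the change (env var and router labels are illustrative, shown for one service):

# Traefik: switch the resolver from per-host certs to a DNS-01 wildcard
command:
  - "--certificatesresolvers.letsencrypt.acme.dnschallenge=true"
  - "--certificatesresolvers.letsencrypt.acme.dnschallenge.provider=cloudflare"
environment:
  - CF_DNS_API_TOKEN=${CF_DNS_API_TOKEN}  # scoped Cloudflare API token

# And on each router, request the shared wildcard instead of a per-subdomain cert:
  - "traefik.http.routers.grafana.tls.domains[0].main=petieclark.com"
  - "traefik.http.routers.grafana.tls.domains[0].sans=*.petieclark.com"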

🧠 Lessons Learned

  • Memory limits are non-negotiable. Set them for every container, even trusted ones.
  • Traefik scales effortlessly. It's handling 48 routes without breaking a sweat. Could easily do 100+.
  • Shared databases save massive RAM. One Postgres beats 15 separate instances.
  • Alpine images are worth it. Smaller, faster, cheaper. The occasional compatibility issue is a fair trade.
  • Monitoring pays for itself. Grafana alerts caught 3 incidents before users noticed.
  • Wildcard certs are essential at scale. Don't hit Let's Encrypt rate limits like I did.
  • Backups are boring until they're critical. Daily snapshots to S3 (encrypted, versioned) saved me twice; a minimal version of the cron job is sketched below.
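
For the curious, the backup job is nothing fancy. Something along these lines, with the bucket name and paths as placeholders:

#!/bin/bash
# Nightly cron: dump Postgres and push it to an encrypted, versioned S3 bucket.
set -euo pipefail
STAMP=$(date +%F)
docker exec postgres-main pg_dumpall -U postgres | gzip > /tmp/pg-$STAMP.sql.gz
aws s3 cp /tmp/pg-$STAMP.sql.gz s3://my-backup-bucket/postgres/pg-$STAMP.sql.gz --sse AES256
rm /tmp/pg-$STAMP.sql.gz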

Self-hosting isn't for everyone. It requires time, discipline, and a willingness to wake up at 3am when Postgres crashes. But the cost savings, control, and learning are worth it.

I'm running 100+ containers across 4 hosts now (ctmprod, overseer, Mac Studio, and a Proxmox cluster). Total monthly cost: $150. SaaS equivalent: $3,000+.

Next up: I'm migrating Prometheus to VictoriaMetrics for better long-term storage efficiency. Stay tuned.