[Image: Server on fire]
/* Actual photo of my server during the incident (not really) */

It was a Tuesday morning. The coffee was fresh, the terminal was open, and I was about to make what would become one of the most expensive mistakes of my career.

The Setup

I needed to monitor some API endpoints for a client. Simple health checks every minute. "I'll just write a quick bash script," I thought. Famous last words.

bash
#!/bin/bash

# Endpoints to monitor (example URLs)
endpoints=(
    "https://api.example.com/health"
    "https://api.example.com/status"
)

# Check every endpoint, email an alert on anything that isn't a 200,
# then wait a minute and do it all again. Forever.
while true; do
    for endpoint in "${endpoints[@]}"; do
        status=$(curl -s -o /dev/null -w "%{http_code}" "$endpoint")
        if [ "$status" -ne 200 ]; then
            echo "ALERT: $endpoint is down!" | mail -s "API Alert" [email protected]
        fi
    done
    sleep 60
done
/* Looks innocent enough, right? WRONG. */

The Disaster

I deployed the script to production using cron. The mistake? I forgot to include the sleep 60 in the cron job itself, so the loop fired requests as fast as curl could return them, while cron kept starting a fresh copy of the script every minute on top of the instances already running. The result?

500,000 requests per minute to our API endpoints. Our monitoring graphs looked like a ski jump.
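In hindsight, the script never needed its own loop at all. Here's a minimal sketch of what the deployment could have looked like instead: a one-shot check that cron invokes once a minute, guarded by flock so overlapping runs can't stack up. The filenames, crontab entry, and placeholder address are mine for illustration, not the fix we actually shipped that night.

bash
#!/bin/bash
# check_endpoints.sh - hypothetical one-shot version: no while loop, no sleep.
# cron supplies the "once a minute", and flock keeps instances from piling up:
#   * * * * * flock -n /tmp/api-monitor.lock /usr/local/bin/check_endpoints.sh

endpoints=(
    "https://api.example.com/health"
)

for endpoint in "${endpoints[@]}"; do
    status=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$endpoint")
    if [ "$status" -ne 200 ]; then
        # Placeholder address; point this at your real alert inbox
        echo "ALERT: $endpoint is down!" | mail -s "API Alert" [email protected]
    fi
done

With this shape, even a botched crontab can only fire as often as cron itself does.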

Lessons Learned

  1. Always test scripts in staging - My "it works on my machine" mentality backfired spectacularly
  2. Implement rate limiting - Both in your scripts and your APIs (see the sketch after this list)
  3. Use proper monitoring tools - Don't roll your own unless you really need to
  4. Coffee doesn't fix everything - Though it did help during the 3am firefighting
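On lesson 2, the cheapest fix would have been a guard inside the script itself. Below is a rough sketch of the kind of client-side rate limiting I mean; the interval, paths, and function name are invented for illustration, so treat it as a starting point rather than the exact code we deployed.

bash
#!/bin/bash
# Hypothetical client-side rate limiter: never hit the same endpoint more
# than once every MIN_INTERVAL seconds, no matter who calls this function.

MIN_INTERVAL=60
STAMP_DIR=/tmp/api-monitor-stamps
mkdir -p "$STAMP_DIR"

check_endpoint() {
    local endpoint=$1
    # One timestamp file per endpoint (hashed so the URL is filename-safe)
    local stamp="$STAMP_DIR/$(echo -n "$endpoint" | md5sum | cut -d' ' -f1)"
    local now last
    now=$(date +%s)
    last=$(cat "$stamp" 2>/dev/null || echo 0)

    # Too soon since the last check? Skip quietly instead of hammering the API.
    if (( now - last < MIN_INTERVAL )); then
        return 0
    fi
    echo "$now" > "$stamp"

    curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$endpoint"
}

# Example: check_endpoint "https://api.example.com/health"

Even this crude version would have capped the damage at one request per endpoint per minute, no matter how many copies of the script cron managed to start.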

/* The server is fine now (mostly) */