
/* Actual photo of my server during the incident (not really) */
It was a Tuesday morning. The coffee was fresh, the terminal was open, and I was about to make what would become one of the most expensive mistakes of my career.
## The Setup
I needed to monitor some API endpoints for a client. Simple health checks every minute. "I'll just write a quick bash script," I thought. Famous last words.
```bash
#!/bin/bash
# Monitor API endpoints with a simple health-check loop
endpoints=(
  "https://api.example.com/health"   # placeholder URL; the real list was client-specific
)

while true; do
  for endpoint in "${endpoints[@]}"; do
    status=$(curl -s -o /dev/null -w "%{http_code}" "$endpoint")
    if [ "$status" -ne 200 ]; then
      # placeholder address; the original recipient was redacted
      echo "ALERT: $endpoint is down!" | mail -s "API Alert" alerts@example.com
    fi
  done
  sleep 60
done
```
/* Looks innocent enough, right? WRONG. */
## The Disaster
I deployed the script to production using cron. The mistake? I forgot to include the `sleep 60` in the cron job itself, which left nothing to throttle the loop: it fired requests back to back. The result?
500,000 requests per minute to our API endpoints. Our monitoring graphs looked like a ski jump.
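For the curious, the failure mode looked roughly like this. The paths and schedule below are a hypothetical reconstruction, not the actual production crontab:

```bash
# Hypothetical crontab entry -- illustrative only.
# Cron launched the script every minute:
* * * * * /usr/local/bin/check_endpoints.sh

# check_endpoints.sh still contained the `while true` loop, so each
# cron run spawned another long-lived copy of it, and the copy that
# shipped was missing the `sleep 60`. Every copy hammered the
# endpoints back to back, and new copies piled up minute after minute.
```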
## Lessons Learned
- Always test scripts in staging - My "it works on my machine" mentality backfired spectacularly
- Implement rate limiting - Both in your scripts and your APIs (see the sketch after this list)
- Use proper monitoring tools - Don't roll your own unless you really need to
- Coffee doesn't fix everything - Though it did help during the 3am firefighting
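
To make the rate-limiting point concrete, here's a minimal sketch of how I'd write the check today. The endpoint, lock path, and alert address are placeholders, not the client's real setup: run one pass per invocation, refuse to overlap with a previous run, and pace the requests.

```bash
#!/bin/bash
# Minimal sketch -- placeholder endpoints, lock file, and address.
# Designed to be run by cron once a minute, doing one pass per run.

# Refuse to overlap with a still-running previous invocation.
exec 9>/tmp/api_monitor.lock
flock -n 9 || exit 0

endpoints=("https://api.example.com/health")

for endpoint in "${endpoints[@]}"; do
  # Cap each request at 5 seconds and pause between requests,
  # so a slow or flapping endpoint can't turn the check into a flood.
  status=$(curl -s -o /dev/null --max-time 5 -w "%{http_code}" "$endpoint")
  if [ "$status" -ne 200 ]; then
    echo "ALERT: $endpoint returned $status" | mail -s "API Alert" alerts@example.com
  fi
  sleep 1
done
```

Because the loop runs exactly once, the crontab entry controls the schedule, and the flock guard means a hung curl can't stack up extra copies.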
/* The server is fine now (mostly) */