Our CDN returned 504 and 502 HTTP errors between 19:08 UTC and 19:17 UTC, affecting about 7.8% of total non-cached traffic over those 9 minutes. The errors were caused by a crash on one of our front servers, which left it unable to handle traffic, and by the CDN not immediately taking into account the DNS change that excluded the faulty front server.
19:08 UTC: one of our front servers stops handling traffic; 504 HTTP errors appear on our CDN
19:09 UTC: our DNS failover record excludes the faulty front server so that all traffic should go to the other healthy front servers (the DNS record TTL is 20 seconds)
19:10 UTC: a first engineer is notified of the incident
19:11 UTC: 504 HTTP errors end, but our CDN now returns 502 HTTP errors
19:16 UTC: the alert escalates to a second engineer
19:17 UTC: 502 HTTP errors end on our CDN
19:20 UTC: crashed processes on the faulty front server are restarted
19:21 UTC: our DNS failover record again includes the previously unavailable front server
Time to detect: 7 minutes
Time to resolve: 12 minutes
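
As a cross-check, the durations can be recomputed from the timeline timestamps. The sketch below is illustrative only: the dictionary keys and the `minutes_between` helper are hypothetical names, and it assumes that "time to resolve" runs from the first errors (19:08 UTC) to the restart of the crashed processes (19:20 UTC). Time to detect is not recomputed, because the timeline does not state which event counts as detection.

```python
from datetime import datetime

# Timestamps copied from the timeline (all UTC, same day).
# The key names are illustrative, not taken from any monitoring tool.
TIMELINE = {
    "errors_504_start": "19:08",
    "errors_502_start": "19:11",
    "errors_end": "19:17",
    "processes_restarted": "19:20",
}


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes elapsed between two HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Window during which the CDN returned 504 errors.
window_504 = minutes_between(TIMELINE["errors_504_start"], TIMELINE["errors_502_start"])
# Window during which the CDN returned 502 errors.
window_502 = minutes_between(TIMELINE["errors_502_start"], TIMELINE["errors_end"])
# Total user-facing error window.
error_window = minutes_between(TIMELINE["errors_504_start"], TIMELINE["errors_end"])
# Assumed definition: first errors until the crashed processes are restarted.
time_to_resolve = minutes_between(TIMELINE["errors_504_start"], TIMELINE["processes_restarted"])

print(f"504 window: {window_504} min")            # 3 min
print(f"502 window: {window_502} min")            # 6 min
print(f"Total error window: {error_window} min")  # 9 min
print(f"Time to resolve: {time_to_resolve} min")  # 12 min
```

Under these assumptions the 9-minute error window and the 12-minute time to resolve match the figures reported above.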