One of our front servers is down
Incident Report for Fasterize
Postmortem

Description

Between 19:08 UTC and 19:17 UTC, our CDN returned 504 and 502 HTTP errors for about 7.8% of total non-cached traffic, over 9 minutes. The errors had two causes: one of our front servers crashed and could no longer handle traffic, and the CDN did not immediately take into account the DNS failover that excluded the faulty front server.

Timeline

19:08 UTC: one of our front servers stops handling traffic. 504 HTTP errors appear on our CDN

19:09 UTC: our DNS failover record excludes the faulty front server, so all traffic should go to the other healthy front servers (the DNS record TTL is 20 seconds)

19:10 UTC: a first engineer is notified of the incident

19:11 UTC: 504 HTTP errors end, but our CDN now returns 502 HTTP errors

19:16 UTC: the alert escalates to a second engineer

19:17 UTC: 502 HTTP errors end on our CDN

19:20 UTC: crashed processes on the faulty front server are restarted

19:21 UTC: our DNS failover record again includes the previously unavailable front server

Time to detect: 7 minutes

Time to resolve: 12 minutes

Action plan

  • Ensure our CDN takes our DNS updates into account as soon as possible (after TTL expiration)
  • Identify the root cause of our HTTP server crash
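
To illustrate what the first action item expects from a DNS client, here is a minimal sketch of a cache that re-resolves a record once its TTL has elapsed. It is not the CDN's actual implementation, which is unknown to us. The class name TtlCache, the lookup callback, the hostname front.example.test and the 192.0.2.x addresses are all hypothetical, chosen only for the example; the 20-second TTL matches the one on our failover record.

```python
import time
from typing import Callable, Dict, List, Tuple


class TtlCache:
    """Hypothetical DNS answer cache that re-resolves once the record TTL has elapsed."""

    def __init__(
        self,
        lookup: Callable[[str], Tuple[List[str], float]],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lookup = lookup  # assumed to return (addresses, ttl_seconds)
        self._clock = clock    # injectable so the behaviour can be checked without sleeping
        self._entries: Dict[str, Tuple[List[str], float]] = {}

    def resolve(self, name: str) -> List[str]:
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and now < entry[1]:
            # Cached answer is still within its TTL: serve it as-is.
            return list(entry[0])
        # No entry, or TTL expired: ask the authoritative source again.
        addresses, ttl = self._lookup(name)
        self._entries[name] = (list(addresses), now + ttl)
        return list(addresses)


if __name__ == "__main__":
    now = [0.0]  # fake clock, in seconds
    zone = {"front.example.test": ["192.0.2.1", "192.0.2.2"]}
    cache = TtlCache(lambda n: (zone[n], 20.0), clock=lambda: now[0])

    print(cache.resolve("front.example.test"))  # both front servers
    zone["front.example.test"] = ["192.0.2.2"]  # failover excludes the faulty server
    now[0] = 19.0
    print(cache.resolve("front.example.test"))  # still the cached answer
    now[0] = 20.0
    print(cache.resolve("front.example.test"))  # TTL elapsed: only the healthy server
```

With a client behaving this way, the excluded front server would stop receiving traffic at most one TTL (20 seconds) after the failover record changes.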
Posted Apr 17, 2018 - 12:28 CEST

Resolved
The HTTP server crashed on the unavailable front.
Everything is back to normal now.
Posted Apr 12, 2018 - 21:50 CEST
Investigating
One of our front servers is down; all traffic has been rerouted to the other front servers.
Posted Apr 12, 2018 - 21:19 CEST