Post mortem: HTTP 592 errors (05/02/17)
HTTP 592 errors were emitted by one of our front servers between 15:55 UTC and 17:15 UTC. We estimate that about one percent of traffic was impacted. The cause was a human error while adding new servers to the platform: the new servers began receiving traffic before they were fully provisioned.
Before 15:55 UTC, new cache servers were provisioned in a disabled state.
At 15:55 UTC, a first front server was updated to include the new cache servers in its configuration. These cache servers were not fully disabled.
At 16:31 UTC, a low-priority alert signaled an abnormal request error ratio.
At 16:51 UTC, one customer created a support ticket.
At 16:58 UTC, engineers started the investigation.
At 17:11 UTC, the mis-configuration in the impacted front layer was identified and fixed.
At 17:14 UTC, one customer called the on-call number.
At 17:14 UTC, the new cache servers were removed from the updated front server's configuration.
At 17:37 UTC, a notification was posted on status.fasterize.com.
Time to detect: 36 minutes
Time to resolve: 79 minutes
Our operating procedure requires disabling a server before performing any operation on it. We recently introduced two disable modes: full and partial. The partial mode disables only one service.
The procedure was not updated accordingly, and the new servers were only partially disabled.
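The distinction between the two disable modes, and the loophole it opened, can be sketched as follows. This is an illustrative model only: the names `DisableMode` and `is_eligible_for_traffic` are assumptions, not Fasterize's actual code.

```python
# Illustrative sketch of the full vs. partial disable modes described above.
from enum import Enum

class DisableMode(Enum):
    NONE = "none"        # server fully in rotation
    PARTIAL = "partial"  # only one service disabled; others still serve traffic
    FULL = "full"        # server receives no traffic at all

def is_eligible_for_traffic(mode: DisableMode) -> bool:
    """A front server routes to any cache that is not FULLY disabled.
    This is exactly the loophole in the incident: a PARTIAL-disabled
    cache still receives traffic for its remaining services."""
    return mode != DisableMode.FULL

# The new cache servers were provisioned PARTIAL instead of FULL:
assert is_eligible_for_traffic(DisableMode.PARTIAL)   # traffic flows
assert not is_eligible_for_traffic(DisableMode.FULL)  # what the procedure intended
```

Under this model, updating the procedure means provisioning new servers in `FULL` mode until every service is ready.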
Traffic started to be sent to those cache servers, but HTTPS traffic could not be forwarded to the proxy servers because the secure network layer was not yet ready. The cache servers therefore returned HTTP 592 errors to our front servers.
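One way to catch this class of failure is a readiness probe that a provisioning script runs before enabling a new cache server: report the server ready only once a real TLS handshake succeeds. This is a hedged sketch; the function name and defaults are illustrative, not part of our actual tooling.

```python
# Sketch of a pre-enable probe: succeeds only if a TCP connection AND a TLS
# handshake both complete, i.e. the "secure network layer" is actually ready.
import socket
import ssl

def https_layer_ready(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True only when a full TLS handshake with the server succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            ctx = ssl.create_default_context()
            with ctx.wrap_socket(sock, server_hostname=host):
                return True
    except (OSError, ssl.SSLError):
        return False
```

A front server's configuration update could then gate on this probe returning True for every new cache server before sending it HTTPS traffic.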
We had implemented an HTTPS fallback that bypasses the cache and proxy layers when the stack returns errors. This mechanism worked correctly for most of our customers, but for four customers the origin addresses were mis-configured, so the fallback was not effective.
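The fallback logic, and why a mis-configured origin defeats it, can be sketched as below. The function names, the `(status, body)` convention, and the stub behaviours are assumptions for illustration, not our actual internals.

```python
# Minimal sketch of the HTTPS fallback described above: try the cache/proxy
# stack first; on a 592, bypass it and fetch directly from the customer origin.
def fetch_with_fallback(path, fetch_through_stack, fetch_from_origin, origin_addr):
    status, body = fetch_through_stack(path)
    if status != 592:
        return status, body
    if not origin_addr:
        # Mis-configured origin address: the fallback cannot help,
        # and the 592 leaks through to the client.
        return 592, body
    return fetch_from_origin(origin_addr, path)

# Stub behaviours mimicking the incident:
broken_stack = lambda path: (592, b"")                    # cache layer not ready
origin = lambda addr, path: (200, b"served from origin")  # direct origin fetch

fetch_with_fallback("/", broken_stack, origin, "203.0.113.10")  # -> (200, ...)
fetch_with_fallback("/", broken_stack, origin, None)            # -> (592, ...)
```

This matches the observed behaviour: most customers were shielded by the fallback, while the four with bad origin configuration kept returning 592s until the configuration was fixed.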
About 1.4% of our traffic received an HTTP 592 error until the configuration was fixed.