Incident Report for Fasterize

Post mortem: Errors 592, 05/02/18

Description

HTTP 592 errors were emitted by one of our front servers between 15:55 UTC and 17:15 UTC. We estimate that one percent of the traffic was impacted. This was due to a human error during the addition of new servers to the platform: the new servers started to receive traffic even though they were not completely provisioned.

Timeline

Before 15:55 UTC, new cache servers were provisioned in a disabled state.

At 15:55 UTC, a first front server was updated to include the new cache servers in its configuration. These cache servers were not fully disabled.

At 16:31 UTC, a low-priority alert signaled an abnormal request error ratio.

At 16:51 UTC, one customer created a support ticket.

At 16:58 UTC, engineers started the investigation.

At 17:11 UTC, the misconfiguration in the impacted front layer was identified and fixed.

At 17:14 UTC, one customer called the on-call number.

At 17:14 UTC, the new cache servers were removed from the updated front server's configuration.

At 17:37 UTC, a notification was posted on status.fasterize.com.

Time to detect: 36 minutes

Time to resolve: 79 minutes

Root cause

Our operations procedure requires any server to be disabled before operations are performed on it. We recently introduced two disabled modes: full and partial. Partial mode disables only a single service.

The operations procedure had not been updated accordingly, and the new servers were only partially disabled.
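
To make the gap concrete, here is a minimal sketch (hypothetical names, not our actual tooling) of the difference between the two disabled modes and the kind of guard the updated procedure should apply before a cache server is enrolled into a front server's configuration:

```python
from enum import Enum

class DisableMode(Enum):
    NONE = "enabled"      # server receives traffic normally
    PARTIAL = "partial"   # only one service is disabled, other traffic still flows
    FULL = "full"         # server receives no traffic at all

def safe_to_enroll(mode: DisableMode, fully_provisioned: bool) -> bool:
    """A cache server may only be added to a front server's configuration
    if it is fully provisioned or fully disabled; a partial disable still
    lets some traffic (here, HTTPS) reach it."""
    return fully_provisioned or mode is DisableMode.FULL

# During the incident, the new cache servers were only partially disabled
# and not fully provisioned, so this guard would have rejected them.
assert not safe_to_enroll(DisableMode.PARTIAL, fully_provisioned=False)
assert safe_to_enroll(DisableMode.FULL, fully_provisioned=False)
```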

Traffic started to be sent to those cache servers, but HTTPS traffic could not be forwarded to the proxy servers because the secure network layer was not yet ready. The cache servers then replied with HTTP 592 errors to our front servers.

We have implemented an HTTPS fallback that bypasses the cache and proxy layers in case of errors in the stack. This mechanism worked correctly for most of our customers, but for four customers it was not effective because their origin addresses were misconfigured.
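
For illustration, the sketch below (hypothetical function and URLs, not our production code) shows the fallback idea: when the optimization stack answers with an error such as a 592, the request is retried directly against the customer's configured origin, which only helps if that origin address is correct.

```python
import urllib.error
import urllib.request

STACK_ERROR = 592  # returned when the cache layer cannot forward HTTPS traffic

def fetch_with_fallback(stack_url: str, origin_url: str, timeout: float = 5.0) -> bytes:
    """Try the optimization stack first; on a stack error, bypass the
    cache and proxy layers and fetch directly from the customer's origin."""
    try:
        with urllib.request.urlopen(stack_url, timeout=timeout) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code != STACK_ERROR:
            raise
    # The fallback is only effective if the configured origin address is
    # correct; for four customers it was misconfigured, so this step failed.
    with urllib.request.urlopen(origin_url, timeout=timeout) as resp:
        return resp.read()
```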

About 1.4% of our traffic received an HTTP 592 error until the configuration was fixed.

Action plan

Short term

  • Standardize the origin update across all layers (proxy, HTTPS fallback and DNS fallback). This work is almost done and will be deployed in February.
  • Improve detection of server status: enabled, fully disabled, partially disabled.
  • Make full disabled mode the default when disabling a server (see the sketch after this list).
  • Update the operations procedure for provisioning a new server in the cluster.
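
As a sketch of the "full by default" change (a hypothetical command-line tool, for illustration only), the disabling tool would perform a full disable unless a partial disable is explicitly requested:

```python
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(description="Disable a server before an operation")
    parser.add_argument("server", help="name of the server to disable")
    # Full disable is the default; partial disable must be asked for explicitly.
    parser.add_argument("--mode", choices=["full", "partial"], default="full",
                        help="disable mode (default: full)")
    parser.add_argument("--service", help="service to stop when --mode=partial")
    args = parser.parse_args()

    if args.mode == "partial" and not args.service:
        parser.error("--mode=partial requires --service")

    print(f"Disabling {args.server} in {args.mode} mode")

if __name__ == "__main__":
    main()
```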

Long term

  • Add an HTTP fallback alongside the HTTPS fallback.
  • Review the platform architecture to simplify and fully automate the addition of a new instance to the cache cluster.
Posted Feb 06, 2018 - 14:33 CET

Resolved
HTTP 592 errors have been emitted by one of our front servers between 16:55 and 18:15 CET.
We estimate that one percent of the traffic has been impacted.
This was due to a human error during the addition of new servers on the platform.
The new servers started to receive traffic even though they were not completely provisioned.
Posted Feb 05, 2018 - 18:41 CET