Traffic temporarily redirected to origin
Incident Report for Fasterize
Postmortem

During the migration of the website configuration database, traffic was redirected to customers’ origin web servers for 32 minutes.

For the majority of websites, the origin served the traffic correctly. For a few websites, however, the origin could not do so because of its configuration.

Facts and Timeline

All times are UTC+2.

  • 3:30pm: Migration from the legacy database to the new database starts, after several days of testing on a fraction of the traffic in production and staging environments.
  • 4:02pm: Platform health check status gradually drops from 100% to 0%. Traffic is automatically routed to the customers’ origin.
  • 4:03pm: An alert indicates that some health checks are red. Our tech team acknowledges the alert immediately and sets up a crisis team.
  • 4:14pm: The root cause is identified in the new database: the record holding the information needed for the health check response is incorrect.
  • 4:35pm: Health check configurations are changed to allow traffic to return to the platform. Incident ends for clients.
  • 5:45pm: Missing record fixed in the database.
  • October 4: Health checks are set back to their original settings.

Analysis

On October 3, 2023, we deployed a new database holding website configurations. During the deployment, the platform health checks switched to an unhealthy state.

Platform health checks consist of multiple monitors sending requests to the platform at regular intervals to validate that all layers in the platform are functional. When these requests fail, the traffic is automatically routed to the customers’ origin.
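The failover decision described above can be sketched as follows. This is an illustration only: the names `ProbeResult`, `healthy_ratio`, and `routing_target`, and the 50% threshold, are assumptions and are not taken from the Fasterize platform.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Outcome of one monitor request sent to the platform (hypothetical type)."""
    monitor: str
    status_code: int


def healthy_ratio(results):
    """Fraction of probes that got a non-error response (2xx or 3xx)."""
    if not results:
        return 0.0
    ok = sum(1 for r in results if 200 <= r.status_code < 400)
    return ok / len(results)


def routing_target(results, min_healthy_ratio=0.5):
    """Route to the platform while enough probes succeed, else to the origin.

    The 0.5 threshold is an assumed value chosen for the example.
    """
    if healthy_ratio(results) >= min_healthy_ratio:
        return "platform"
    return "origin"
```

With every probe failing, as happened when health dropped to 0%, `routing_target` returns `"origin"`.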

After the migration, the health checks received 521 errors (meaning that the relevant configuration for a given requested domain was not found).

The issue occurred because the deployment brought in a change in the logic involved in config loading. In the previous release, a request from the health checks was satisfied even if no configuration matched. In the current version, this is not possible. To quickly fix the issue, we created a configuration for health checks.
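The behavior change can be pictured with a simplified sketch. The function names, the dictionary-based configuration store, and the host name `healthcheck.example` are hypothetical; only the 521 status for a missing configuration comes from the analysis above.

```python
class ConfigNotFound(Exception):
    """Raised when no configuration matches the requested host."""


def load_config_legacy(configs, host):
    # Previous release (as described): a request is satisfied even when
    # no configuration matches, modeled here as an empty default.
    return configs.get(host, {})


def load_config_strict(configs, host):
    # Current release (as described): a missing configuration is an error.
    try:
        return configs[host]
    except KeyError:
        raise ConfigNotFound(host) from None


def handle_request(configs, host, loader):
    """Return the HTTP status a health check probe would observe."""
    try:
        loader(configs, host)
    except ConfigNotFound:
        return 521  # configuration for the requested domain was not found
    return 200


configs = {"www.example.com": {}}  # no entry for the health check host
legacy_status = handle_request(configs, "healthcheck.example", load_config_legacy)
strict_status = handle_request(configs, "healthcheck.example", load_config_strict)
```

In this sketch, adding an entry for the health check host to `configs` makes the strict loader succeed again, which mirrors the quick fix of creating a configuration for health checks.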

This issue was not detected in our testing phases for the following reasons:

  • No alert is configured for health checks in our staging environment.
  • The health checks are not adequately covered by automated tests.

By design, redirecting browser traffic to the origin when the platform is considered down is correct. However, we are seeing more and more cases where the origin cannot accept the traffic sent by browsers, for reasons such as firewalls or incorrect certificates. We will improve our API to manage these edge cases.
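One possible shape for such a safeguard is sketched below. It is purely illustrative and does not describe the planned API: `OriginStatus`, its fields, and `failover_target` are assumed names, and how the origin's readiness would be measured is left open.

```python
from dataclasses import dataclass


@dataclass
class OriginStatus:
    """Hypothetical per-website record of whether the origin can take browser traffic."""
    reachable: bool          # e.g. not blocked by a firewall
    certificate_valid: bool  # e.g. serves a certificate valid for the public host name


def failover_target(platform_healthy, origin):
    """Fail over to the origin only when it is known to accept browser traffic."""
    if platform_healthy:
        return "platform"
    if origin.reachable and origin.certificate_valid:
        return "origin"
    # The origin cannot serve browsers directly, so failover is skipped.
    return "platform"
```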

Impacts

  • Number of customers impacted: all

Countermeasures

Short term

  1. Set up alerting on our staging environment for health checks
  2. Add a test covering the health check routes
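A test for the second item could look like the sketch below. It runs against a stand-in `respond` function rather than the real platform, and the host name and configuration layout are assumptions made for the example.

```python
HEALTH_CHECK_HOST = "healthcheck.example"  # hypothetical host used by the monitors


def respond(configs, host):
    """Stand-in for the platform: 200 when a configuration matches, 521 otherwise."""
    return 200 if host in configs else 521


def migrated_configs():
    """Stand-in for the configurations present after a migration."""
    return {
        "www.example.com": {"enabled": True},
        HEALTH_CHECK_HOST: {"enabled": True},
    }


def test_health_check_route_is_served():
    # The health check host must resolve to a configuration after migration.
    assert respond(migrated_configs(), HEALTH_CHECK_HOST) == 200


def test_missing_health_check_config_is_detected():
    # Dropping the health check record must surface as a 521, not pass silently.
    configs = migrated_configs()
    del configs[HEALTH_CHECK_HOST]
    assert respond(configs, HEALTH_CHECK_HOST) == 521
```

Written this way, the functions can be collected by a test runner such as pytest and run against staging data before a migration.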

Medium term

  • Design a way to avoid origin failover when the origin doesn’t support it
Posted Oct 04, 2023 - 18:26 CEST

Resolved
This incident was resolved at 16:35. A postmortem is being prepared and will be available tomorrow.
Posted Oct 03, 2023 - 17:49 CEST
Monitoring
The situation is restored and all traffic is now routed through Fasterize as expected. A postmortem of the incident will be provided shortly.
Posted Oct 03, 2023 - 16:22 CEST
Identified
Traffic is temporarily redirected to the origin of customer websites. The situation will be restored very soon.
Posted Oct 03, 2023 - 16:16 CEST
This incident affected: Acceleration.