Configuration issues
Incident Report for Fasterize
Postmortem

Error 521 for some customers

Date: May 4, 2023

Description of incident

Some customers experienced 521 errors on their resources. In Fasterize, a 521 error means that the engine could not find the configuration for the request. After an engine update, some proxies failed to load some configurations in V2 format.

Facts and Timeline

  • 16:38: Engine update launched after validation on the staging environment and then in canary mode.
  • 17:10: First alert: High proxy error ratio detected.
  • 17:12: 521 errors start to appear.
  • 17:27: The technical team turns off the problematic proxies.
  • 17:30: Traffic is back to normal.
  • 17:36: A message is published on StatusPage.
  • 17:56: The worker rollback is triggered.
  • 18:25: The technical team completes the fix by rolling back to the previous version of the engine.

Analysis

On February 15, 2023, the deployment of the website-config package (4.14.1) changed the JSON schema used for client configurations in order to introduce a new key. This change should not have been included in the package because the feature was not finished.

The new version of the website-config package moved this new key to another location in the JSON schema.

During deployment, removing the key from its original, incorrect location in the validation schema invalidated every V2 config that contained it. Yet the API had been adding this key automatically whenever it was not present.
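The effect of such a schema change can be illustrated with a minimal sketch. It assumes a strict validator that rejects any key the schema does not declare. The key names (`domain`, `rules`, `experimental_feature`) and the `validate` helper are hypothetical, since the real website-config schema is not shown in this report.

```python
def validate(config: dict, allowed_keys: set[str]) -> list[str]:
    """Return a list of validation errors; an empty list means the config is valid."""
    return [f"unexpected key: {key}" for key in sorted(config) if key not in allowed_keys]


# Hypothetical schema from the earlier package: the unfinished key is declared at the top level.
SCHEMA_BEFORE = {"domain", "rules", "experimental_feature"}
# Hypothetical schema after the update: the key is no longer declared at that location.
SCHEMA_AFTER = {"domain", "rules"}

# A stored V2 config to which the key had been added automatically.
stored_config = {"domain": "www.example.com", "rules": [], "experimental_feature": False}

errors_before = validate(stored_config, SCHEMA_BEFORE)  # config accepted
errors_after = validate(stored_config, SCHEMA_AFTER)    # config rejected
```

The stored config did not change between the two checks; only the schema did, which is why configs that nobody had edited became invalid at deployment time.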

The update did introduce a mechanism to load a configuration even when it is not valid, but this mechanism did not work. When processing requests associated with unloaded configurations, the engine responded with a 521 error.
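A minimal sketch of the intended behavior follows, assuming a loader with a `lenient` switch. All names here (`is_valid`, `load_configs`, `status_for`, the key names) are hypothetical and do not reflect the actual engine code.

```python
import logging

logger = logging.getLogger("engine")

# Hypothetical stand-in for the real JSON schema validation.
ALLOWED_KEYS = {"domain", "rules"}


def is_valid(config: dict) -> bool:
    return set(config) <= ALLOWED_KEYS


def load_configs(raw_configs: list[dict], lenient: bool) -> dict[str, dict]:
    """Index configs by domain, skipping invalid ones unless lenient is set."""
    loaded = {}
    for config in raw_configs:
        valid = is_valid(config)
        if not valid and not lenient:
            logger.error("config for %s rejected", config.get("domain"))
            continue
        if not valid:
            logger.warning("config for %s loaded despite validation errors", config.get("domain"))
        loaded[config["domain"]] = config
    return loaded


def status_for(host: str, loaded: dict[str, dict]) -> int:
    """Return 521 when no configuration is loaded for the host, else 200."""
    return 200 if host in loaded else 521


raw = [{"domain": "www.example.com", "rules": [], "experimental_feature": False}]
strict_status = status_for("www.example.com", load_configs(raw, lenient=False))
lenient_status = status_for("www.example.com", load_configs(raw, lenient=True))
```

In this sketch the strict path reproduces the incident (the config is never loaded, so the request gets a 521), while the lenient path serves the request and only logs a warning.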

The fallback mechanism at the front level mitigated the problem at the cache layer: a 521 error triggers a second attempt on another proxy. However, there is no fallback to origin for 521 errors (to prevent the discovery of configurations).
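The retry behavior described above can be sketched as follows. This is an illustration under assumptions, not the actual front implementation: a proxy is modeled as a function that takes a host and returns an HTTP status, and `handle` and `max_attempts` are hypothetical names.

```python
from typing import Callable, Sequence

# A proxy is modeled as a function from host to HTTP status code.
Proxy = Callable[[str], int]


def handle(host: str, proxies: Sequence[Proxy], max_attempts: int = 2) -> int:
    """Try up to max_attempts proxies and return the first non-521 status."""
    status = 521
    for proxy in proxies[:max_attempts]:
        status = proxy(host)
        if status != 521:
            return status
    # Every attempt returned 521: the error reaches the client,
    # because there is no fallback to origin for this status.
    return status


def healthy(host: str) -> int:
    return 200


def faulty(host: str) -> int:
    return 521


mitigated = handle("www.example.com", [faulty, healthy])
unmitigated = handle("www.example.com", [faulty, faulty])
```

The second attempt only helps when it lands on a proxy that loaded the configuration; when both attempts hit faulty proxies, the 521 is returned to the client.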

The message for 521 errors is not clear enough and should render a page like the one used for 592 or 594 errors.

During the rollback, retrieving the commit corresponding to version N-1 was not straightforward. The rollback could not be run through the CI because it took too long to execute there, so it was executed from a developer workstation.

Metrics

Error 521

  • a first peak around 5:05 p.m. (which triggered the alert)
  • from 5:10 p.m. to 5:30 p.m., a large number of requests per second returned a 521 error

As a percentage of all traffic:

Over the duration of the incident

Only on impacted customers

Impacts

  • Number of customers impacted: 12 sites (< 2%)
  • Percentage of requests impacted across all customers

    • Maximum: 1.5%
    • Over the duration of the incident: 0.32%
  • Percentage of requests impacted among impacted customers

    • Maximum: 7.3%
    • Over the duration of the incident: 1.54%

Countermeasures

Short term

  1. Fix the engine and the faulty package to remove the breaking change
  2. Secure changes to V2 config schema validation
  3. Enable fallback to origin on 521 errors
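One possible way to secure schema validation changes is a pre-deployment gate that checks every stored config against the candidate schema and blocks the release if any would become invalid. The sketch below is only an assumption about what such a gate could look like; `incompatible_configs`, `check_schema_change`, and the sample data are hypothetical.

```python
def incompatible_configs(stored: dict[str, dict], allowed_keys: set[str]) -> list[str]:
    """Return the ids of stored configs that the candidate schema would reject."""
    return sorted(cid for cid, cfg in stored.items() if not set(cfg) <= allowed_keys)


def check_schema_change(stored: dict[str, dict], allowed_keys: set[str]) -> None:
    """Raise if deploying the candidate schema would invalidate any stored config."""
    broken = incompatible_configs(stored, allowed_keys)
    if broken:
        raise RuntimeError(
            f"schema change would invalidate {len(broken)} config(s): {', '.join(broken)}"
        )


# Hypothetical stored configs and candidate schema.
stored_configs = {
    "site-a": {"domain": "a.example.com", "rules": []},
    "site-b": {"domain": "b.example.com", "rules": [], "experimental_feature": False},
}
candidate_schema = {"domain", "rules"}

broken_ids = incompatible_configs(stored_configs, candidate_schema)
```

Run in CI before an engine or package release, a check of this kind would surface the breaking change while it is still cheap to fix.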

Medium term

  1. Set up a system for migrating V2 configs from one schema version to another.
  2. Improve some documentation (rollback, release)
  3. Improve internal crisis organization
  4. Add an extra step that runs the canary phase with actual production traffic before triggering the rest of the update
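A config migration system could take the form of a chain of per-version migration steps, applied in order until a config reaches the target schema version. The sketch below is a hypothetical illustration: the `schema_version` field, the key names, and the functions are assumptions, not the planned Fasterize design.

```python
def migrate_1_to_2(config: dict) -> dict:
    """Move the hypothetical top-level key to its new location under 'features'."""
    migrated = {k: v for k, v in config.items() if k != "experimental_feature"}
    if "experimental_feature" in config:
        features = dict(migrated.get("features", {}))
        features["experimental"] = config["experimental_feature"]
        migrated["features"] = features
    migrated["schema_version"] = 2
    return migrated


# One migration step per source version; each step returns a config at version + 1.
MIGRATIONS = {1: migrate_1_to_2}


def migrate(config: dict, target_version: int) -> dict:
    """Apply migration steps in order until the config reaches target_version."""
    current = dict(config)
    while current.get("schema_version", 1) < target_version:
        step = MIGRATIONS[current.get("schema_version", 1)]
        current = step(current)
    return current


old_config = {"domain": "www.example.com", "experimental_feature": True, "schema_version": 1}
new_config = migrate(old_config, target_version=2)
```

With such a chain, moving a key in the schema ships together with the step that rewrites existing configs, so stored configs stay valid across the change.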
Posted May 05, 2023 - 13:12 CEST

Resolved
The rollback has been completed. We will publish a postmortem tomorrow (5th of May) to clarify the root cause of the incident.
Posted May 04, 2023 - 18:25 CEST
Monitoring
No errors have been generated since 5:42 p.m. (UTC+2), and errors have been down 90% since 5:30 p.m. Impacts were limited to websites using config version 2.
Posted May 04, 2023 - 17:49 CEST
Identified
The issue has been identified. A mitigation has been deployed to replace faulty proxies. The root cause is being diagnosed.
Posted May 04, 2023 - 17:37 CEST
Investigating
Some proxies have loaded invalid configurations and cannot process incoming requests.
We are currently investigating this issue.
Posted May 04, 2023 - 17:36 CEST
This incident affected: Acceleration.