Acceleration was disabled between 10:45 AM GMT+1 and 11 AM GMT+1

Incident Report for Fasterize

Postmortem

Healthcheck in error

Date of post mortem : 25/02/2025

Participants of the post mortem : Anthony Barré

Description of the incident

Following a change in the Fasterize API, the health checks of the platform turned red, indicating an unavailability of the platform while it was actually healthy. These health checks were configured to query a URL associated with a domain name that was no longer present in the configuration database.

Immediate corrective action

  • Temporary deactivation of health checks.
  • Rollback of the API configuration change.

Facts & Timeline

All times are UTC+1.

Analysis

Technical context

  • The health checks validate the availability of the platform through several layers (proxy, load balancer, workers, etc …)
  • The failure of the health checks was due to errors at the proxy level, a symptom of the absence of configuration associated with the domain used. 
  • At the proxy level, the health check response is adapted by region to respond to the health checks associated with each region.
  • An outdated process, that was still running despite no longer being required, caused issues when it updated the configuration used by proxies with incorrect data.

Root cause analysis

The API configuration change caused an unstable state. Indeed, an API update did not take place immediately. Thus, some API pods were reading the health check configuration from one environment and others from another environment.

This corrupted the database used by the proxies of our main environment. The proxies no longer responded with 200 but with error code to health check requests.

Impact of incident

🏠 Affected customers : All customers were impacted by the lack of acceleration.

🔴 Specific issue : Two customers experienced major disruptions because their origins did not allow them to receive traffic directly from the Internet.

📊 Incident metrics

  • Severity: 1 (service shutdown impacting a large number of users).
  • Detection time: 4 minutes.
  • 🛠 Resolution time: 20 minutes.

Action Plan 

Short term (immediate - 1 week)

  • Review the platform's health checks so that they are more robust 
  • Disable and delete the obsolete monitoring process ✅
  • Fix the deployment of configmaps in the helm charts.

Medium term

  • Review the API configuration to clarify the use of region-related fields and avoid side effects.
Posted Feb 25, 2025 - 18:15 CET

Resolved

Incident is now closed. Sorry for the inconvenience. A post-mortem will follow.
Posted Feb 25, 2025 - 18:06 CET

Monitoring

Fix has been deployed, everything is back to normal (the platform is properly monitored again). We're still monitoring.
Posted Feb 25, 2025 - 17:37 CET

Update

Root cause has been identified. A post-mortem is being written. We're gonna set up more robust global platform health checks.
Posted Feb 25, 2025 - 16:37 CET

Identified

Acceleration was disabled between 10:45 AM GMT+1 and 11 AM GMT+1 due to a misconfiguration of the global platform health checks.
These health checks are currently being reviewed (but the platform is actually healthy).
Posted Feb 25, 2025 - 11:10 CET
This incident affected: Acceleration.