Healthcheck in error
Date of post-mortem: 25/02/2025
Post-mortem participants: Anthony Barré
Description of the incident
Following a change in the Fasterize API, the platform's health checks turned red, reporting the platform as unavailable although it was actually healthy. The health checks were configured to query a URL on a domain name that was no longer present in the configuration database.
Immediate corrective action
- Temporary deactivation of health checks.
- Rollback of the API configuration change.
Facts & Timeline
All times are UTC+1.
Analysis
Technical context
- The health checks validate the availability of the platform through several layers (proxy, load balancer, workers, etc.).
- The health checks failed because of errors at the proxy level, a symptom of the missing configuration for the domain they used.
- At the proxy level, the health check response depends on the region, so that each region answers the health checks associated with it.
- An outdated process that was still running, although no longer required, caused issues when it updated the configuration used by the proxies with incorrect data.
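The failure mode described above can be sketched as follows. This is a minimal illustration, not the actual proxy code: the function name, the `regions` field, the in-memory dictionary standing in for the configuration database, and the 503 status are all assumptions (the post-mortem only mentions "200" and "error code").

```python
from typing import Mapping, Sequence

# Assumed status codes: the post-mortem only says "200" and "error code".
OK = 200
UNAVAILABLE = 503

# Hypothetical shape of the configuration database read by the proxies:
# host name -> domain configuration.
ConfigDb = Mapping[str, Mapping[str, Sequence[str]]]


def health_check_status(host: str, proxy_region: str, config_db: ConfigDb) -> int:
    """Return the status a proxy would send for a health check request."""
    domain = config_db.get(host)
    if domain is None:
        # The domain is absent from the configuration: the proxy answers
        # with an error even though the platform itself is healthy.
        return UNAVAILABLE
    if proxy_region not in domain.get("regions", ()):
        # The domain exists but is not associated with this proxy's region.
        return UNAVAILABLE
    return OK
```

With this model, removing the health check domain from the configuration is enough to turn every check red, whatever the real state of the proxy, load balancer, and workers.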
Root cause analysis
The API configuration change left the platform in an unstable state because the API update was not applied to all pods immediately: some API pods were reading the health check configuration from one environment while others were reading it from another.
This corrupted the database used by the proxies of our main environment. The proxies then stopped answering health check requests with 200 and returned an error code instead.
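One way to picture the mixed state is a guard that refuses to write the proxies' configuration while API pods disagree on their source environment. The sketch below is purely illustrative: the pod names, environment labels, and function names are hypothetical, and nothing here reflects how the Fasterize API actually reports its environment.

```python
from typing import Mapping, Set


def environments_in_use(pod_environments: Mapping[str, str]) -> Set[str]:
    """Return the distinct environments the API pods are reading from."""
    return set(pod_environments.values())


def safe_to_write_proxy_config(pod_environments: Mapping[str, str]) -> bool:
    """Allow a write only when every pod reads from the same environment."""
    return len(environments_in_use(pod_environments)) == 1


# Hypothetical snapshot taken in the middle of a rollout: two pods already
# run the new configuration, one still runs the old one.
mid_rollout = {"api-1": "main", "api-2": "main", "api-3": "other"}
after_rollout = {"api-1": "main", "api-2": "main", "api-3": "main"}

print(safe_to_write_proxy_config(mid_rollout))    # False
print(safe_to_write_proxy_config(after_rollout))  # True
```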
Impact of incident
🏠 Affected customers: All customers were impacted by the lack of acceleration.
🔴 Specific issue: Two customers experienced major disruptions because their origins do not accept traffic directly from the Internet.
📊 Incident metrics
- Severity: 1 (service shutdown impacting a large number of users).
- ⏱ Detection time: 4 minutes.
- 🛠 Resolution time: 20 minutes.
Action Plan
Short term (immediate - 1 week)
- Review the platform's health checks to make them more robust.
- Disable and delete the obsolete monitoring process ✅
- Fix the deployment of ConfigMaps in the Helm charts.
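As one possible direction for the health check review, and only as an assumption about what "more robust" could mean, the verdict could require several independent probes to fail before the platform is declared unavailable. The probe names and the threshold below are hypothetical.

```python
from typing import Mapping


def platform_is_down(probe_results: Mapping[str, bool], min_failures: int = 2) -> bool:
    """Declare the platform down only if enough independent probes fail.

    probe_results maps a probe name to True (passed) or False (failed).
    """
    failures = sum(1 for passed in probe_results.values() if not passed)
    return failures >= min_failures
```

With a threshold of 2, a single probe that fails because of its own configuration (for example `{"proxy": False, "load_balancer": True, "workers": True}`) would not mark the platform as down. The trade-off is slower detection of failures that affect only one layer.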
Medium term
- Review the API configuration to clarify the use of region-related fields and avoid side effects.