Platform unavailability

Incident Report for Fasterize

Postmortem

An in-memory database cluster failure occurred, leading to service unavailability across multiple Fasterize components, primarily the Optimisation Engine and the API.

At 09:48, the in-memory database cluster failed after multiple nodes were restarted to release a new engine version. The cluster began an automatic failover sequence, but each time a new node was promoted as primary, it crashed under excessive connection load.

This cluster serves as a cache layer providing access to configurations. During the outage, Engine instances attempted to reconnect at a very high frequency and fell back to retrieving data directly from the main database. This fallback mechanism worked as intended until 10:32, allowing our optimisation engine to continue operating in degraded mode.
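For illustration, here is a minimal sketch of that cache-with-fallback pattern, assuming a Redis cache in front of the main configuration database. The function names, endpoints and the use of the redis-py client are assumptions made for the example, not Fasterize's actual implementation.

```python
# Minimal sketch of the cache-with-fallback pattern described above.
# Host names and function names are hypothetical.
import json
import redis

cache = redis.Redis(host="redis-cluster.internal", port=6379, socket_timeout=0.2)

def fetch_config_from_primary_db(site_id: str) -> dict:
    # Placeholder for the real main-database lookup (the slow path).
    return {"site_id": site_id, "optimizations": "defaults"}

def get_config(site_id: str) -> dict:
    key = f"config:{site_id}"
    try:
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
    except redis.RedisError:
        # Cache cluster unreachable: degrade to the main database,
        # as the engine did between 09:48 and 10:32.
        pass
    config = fetch_config_from_primary_db(site_id)
    try:
        cache.set(key, json.dumps(config), ex=60)  # best-effort repopulation
    except redis.RedisError:
        pass
    return config
```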

At 10:32, however, the proxy layer of our optimisation engine saturated its network resources, rendering it unreachable from the front layer.

When the proxy layer is unreachable, the platform automatically unplugs the CDN and sites are served directly from their origin servers, without Fasterize optimizations.
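The decision behind that automatic failover can be sketched roughly as follows. The health-check endpoints and target names are hypothetical, and the "cdn-passthrough" branch reflects the improvement listed in the action plan below rather than the behaviour in place during the incident.

```python
# Rough sketch of the DNS failover decision described above.
# Endpoints and target names are hypothetical.
import requests

PROXY_HEALTH_URL = "https://proxy.example.internal/healthz"  # assumed endpoint
FRONT_HEALTH_URL = "https://front.example.internal/healthz"  # assumed endpoint

def is_healthy(url: str) -> bool:
    try:
        return requests.get(url, timeout=2).status_code == 200
    except requests.RequestException:
        return False

def choose_dns_target() -> str:
    # Behaviour described above: when the proxy layer is unreachable,
    # the CDN is unplugged and traffic goes straight to origin.
    if is_healthy(PROXY_HEALTH_URL):
        return "optimised-platform"
    # The action plan below proposes keeping traffic on the CDN/front layer
    # whenever the fronts themselves can still accept it:
    if is_healthy(FRONT_HEALTH_URL):
        return "cdn-passthrough"
    return "origin"
```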

Working with our hosting provider, we reduced the Optimisation Engine cluster size at 11:45 to limit reconnection attempts to Redis. By 12:16, the Redis cluster had stabilised and full service was restored.

Later, at 14:15, we detected that the API was unable to write to the Redis cluster. The root cause was a security patch applied by the hosting provider, which restricted the use of some commands in the cluster. The API was patched to remove usage of these commands, and full functionality was restored by 20:15.
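The report does not name the restricted commands, so the sketch below uses the commonly hardened KEYS command purely as a stand-in example of the kind of change involved: catching the server's rejection and switching to an alternative command that remains available.

```python
# Hedged illustration only: KEYS vs. SCAN is used as a typical example of
# a command that security hardening may disable; the actual restricted
# commands are not named in the report.
import redis

r = redis.Redis(host="redis-cluster.internal", port=6379)

def config_keys(pattern: str = "config:*") -> list[bytes]:
    try:
        # A disabled or renamed command is answered with an error.
        return r.keys(pattern)
    except redis.exceptions.ResponseError:
        # Fall back to the incremental SCAN command.
        return [key for key in r.scan_iter(match=pattern, count=500)]
```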

Impact

  • Duration: 10:32 – 12:16 (platform outage); API issue until 20:15
  • Affected components: Optimisation Engine, API
  • User impact: Most websites temporarily served unoptimised content directly from origin; some websites experienced unavailability due to a dysfunctional DNS fallback mechanism; API write operations failed, blocking configuration updates.

Resolution Timeline

  • 09:48 – In-memory database cluster failure following the engine release restarts; the engine falls back to the main database and keeps running in degraded mode.
  • 10:32 – The proxy layer saturates its network resources; the CDN is unplugged and traffic is served directly from origin.
  • 11:45 – Optimisation Engine cluster size reduced, with our hosting provider, to limit reconnection attempts.
  • 12:16 – In-memory database cluster stabilised; full acceleration service restored.
  • 14:15 – The API is found unable to write to the cluster, following a security patch applied by the hosting provider.
  • 20:15 – The API is patched to avoid the restricted commands; full functionality restored.

Action plan

Short term:

  • Improve alerting and visibility on in-memory database cluster health and failover events.
  • Review the engine's in-memory database connection logic to limit reconnection attempts after a disconnection and to ensure the engine process can still start during an in-memory database outage (see the backoff sketch after this list).
  • Adjust the DNS failover logic to avoid redirecting traffic to origin when the CDN/front layer can still accept it.
  • Upscale the in-memory database cluster so it can accept more connections.
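As a rough illustration of the reconnection change proposed in the first item above, a capped, jittered backoff might look like the sketch below; the helper name and parameter values are illustrative only.

```python
# Minimal sketch of capped, jittered reconnection backoff.
import random
import time
import redis

def connect_with_backoff(host: str, max_delay: float = 30.0) -> redis.Redis:
    delay = 0.5
    while True:
        try:
            client = redis.Redis(host=host, port=6379, socket_timeout=0.5)
            client.ping()  # confirm the node actually answers
            return client
        except redis.RedisError:
            # Wait before retrying so many engine instances do not hammer
            # a freshly promoted primary all at once.
            time.sleep(random.uniform(0, delay))
            delay = min(delay * 2, max_delay)
```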

Medium term: 

  • Review and test disaster recovery procedures for in-memory database cache clusters, including the ability to quickly activate a passive cluster.

Long term:

  • Re-architect the engine to reduce the number of connections to the in-memory database cluster.
Posted Oct 13, 2025 - 14:19 CEST

Resolved

This incident has been resolved.
The API and the dashboard are now able to correctly handle configuration updates and cache purges.

As mentioned earlier, a postmortem will be written and available here.
Posted Oct 09, 2025 - 20:23 CEST

Identified

The issue has been identified and a workaround is being implemented.
In the meantime, configuration updates and cache purges are unfortunately still unavailable.
The workaround will degrade API performance, but it should be temporary until we roll out a definitive solution.

As soon as the fix is delivered, we'll monitor it to be sure that everything's ok.

The postmortem will follow, as we also need to gather further elements from our provider.

We deeply apologize for this incident and we're actively working to solve this issue as soon as possible.
Posted Oct 09, 2025 - 19:04 CEST

Investigating

Acceleration is now fully operational.
Configuration updates are still unavailable, and we are investigating the issue.
Cache purges are also currently unavailable.
Posted Oct 09, 2025 - 17:35 CEST

Update

Configuration updates are not possible from the Fasterize console. We are currently investigating.
Posted Oct 09, 2025 - 15:20 CEST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Oct 09, 2025 - 12:34 CEST

Update

The failing component is gradually recovering, and we are monitoring it until full restoration.
Posted Oct 09, 2025 - 12:27 CEST

Identified

We are still investigating and trying to restore the failing component.
Posted Oct 09, 2025 - 11:54 CEST

Investigating

Our platform is experiencing an outage; where possible, traffic is automatically routed to origin to mitigate the incident.
We have identified the issue and are working on a fix to restore availability.
Posted Oct 09, 2025 - 11:01 CEST
This incident affected: API, Dashboard, and Acceleration.