A failure of our in-memory database (Redis) cluster led to service unavailability across multiple Fasterize components, primarily the Optimisation Engine and the API.
At 09:48, the Redis cluster failed after multiple nodes were restarted to deploy a new engine version. The cluster began an automatic failover sequence, but each time a new node was promoted to primary, it crashed under excessive connection load.
This cluster serves as a cache layer providing access to configurations. During the outage, Engine instances attempted to reconnect at a very high frequency and fell back to retrieving configurations directly from the main database. This fallback mechanism worked as intended until 10:32, allowing the Optimisation Engine to continue operating in degraded mode, as sketched below.
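To illustrate that degraded-mode path, here is a minimal sketch of a configuration lookup that tries the Redis cache first and falls back to the main database, with a capped, jittered reconnection strategy. The client library (ioredis), the endpoint name, and the loadConfigFromMainDatabase helper are assumptions for illustration, not the actual engine code.

```typescript
import Redis from "ioredis";

// Hypothetical config-cache client: Redis first, main database as fallback.
// A capped, jittered retry strategy keeps a failover from turning into a
// reconnection storm.
const redis = new Redis({
  host: "redis-cluster.internal", // hypothetical endpoint
  maxRetriesPerRequest: 1,        // fail fast so the fallback can kick in
  retryStrategy: (attempt) => {
    const delay = Math.min(attempt * 500, 10_000); // back off, capped at 10 s
    return delay + Math.floor(Math.random() * 250); // add jitter
  },
});

// Stand-in for the real main-database query.
declare function loadConfigFromMainDatabase(siteId: string): Promise<string>;

// Return the site configuration, degrading to the main database when the
// Redis cluster is unreachable.
async function getSiteConfig(siteId: string): Promise<string> {
  try {
    const cached = await redis.get(`config:${siteId}`);
    if (cached !== null) return cached;
  } catch {
    // Redis unavailable: continue in degraded mode.
  }
  return loadConfigFromMainDatabase(siteId);
}
```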
At 10:32, however, the proxy layer of the Optimisation Engine saturated its network resources, rendering it unreachable from the front layer.
When the proxy layer is unreachable, the platform automatically unplugs the CDN and sites are served directly from their origin servers, without Fasterize optimizations.
Working with our hosting provider, we reduced the Optimisation Engine cluster size at 11:45 to limit reconnection attempts to Redis. By 12:16, the Redis cluster had stabilised and full service was restored.
Later, at 14:15, we detected that the API was unable to write to the Redis cluster. The root cause was a security patch applied by the hosting provider, which restricted the use of some commands in the cluster. The API was patched to remove its use of these commands, and full functionality was restored by 20:15.
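The restricted commands are not named here, but as a hypothetical example of the kind of change involved, the sketch below replaces a blocking KEYS pattern lookup with an incremental SCAN, again assuming an ioredis-style client.

```typescript
import Redis from "ioredis";

const redis = new Redis({ host: "redis-cluster.internal" }); // hypothetical endpoint

// Before (hypothetically rejected by the hardened cluster): redis.keys(pattern)
// After: the same lookup using incremental SCAN, which avoids the blocked command.
async function findConfigKeys(pattern: string): Promise<string[]> {
  const found: string[] = [];
  let cursor = "0";
  do {
    // SCAN walks the keyspace in batches instead of one blocking call.
    const [next, keys] = await redis.scan(cursor, "MATCH", pattern, "COUNT", 100);
    found.push(...keys);
    cursor = next;
  } while (cursor !== "0");
  return found;
}
```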
Short term:
Medium term:
Long term: