Performance degradation

Incident Report for Fasterize

Postmortem

Description

On Thursday, October 19th, between 4:55 PM UTC+2 and 6:25 PM UTC+2, the Fasterize European platform was unable to optimize web pages for any customer. The original, unoptimized version of each page was delivered instead.

We discovered that between 4:45 PM UTC+2 and 5:50 PM UTC+2, a specific request was made that caused a failure in the Fasterize engine during optimization and left the process in a non-functional state.

The number of functional processes then decreased until it fell below a critical threshold. At that point, our engine automatically switched to a degraded mode in which pages were no longer optimized and were served as-is, without delay.
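The threshold-based switch described above can be illustrated with a minimal sketch. This is not the Fasterize engine's code: the `WorkerPool` class, the `min_functional_ratio` value, and the `serve` function are hypothetical names chosen for illustration, and the real threshold is not stated in this report.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class WorkerPool:
    """Hypothetical view of the optimization processes."""
    total: int
    functional: int
    # Assumed value for illustration; the real critical threshold is not public.
    min_functional_ratio: float = 0.5

    def is_degraded(self) -> bool:
        # An empty pool cannot optimize anything, so treat it as degraded.
        if self.total <= 0:
            return True
        return self.functional / self.total < self.min_functional_ratio


def serve(pool: WorkerPool, original: str, optimize: Callable[[str], str]) -> str:
    """Return the original page in degraded mode, the optimized page otherwise."""
    if pool.is_degraded():
        # Degraded mode: bypass optimization and deliver the origin response.
        return original
    return optimize(original)
```

In this sketch, degraded mode trades optimization for availability: visitors still receive the page, only without acceleration.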

At 5:29 PM UTC+2, the on-call team manually added capacity to the platform to return to a stable state, but this did not definitively improve the situation. Starting from 6:15 PM UTC+2, the optimization processes gradually resumed handling traffic. The engine then returned to its normal mode of operation.

To prevent any further incidents, the request has been excluded from optimizations and a fix on the optimization engine is being developed.

Action plan

Short term:

Fix the engine so that it can optimize the offending request without crashing

Medium term:

Review the health check system at the engine level to automatically restart non-functional processes
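As a rough illustration of what such a health check could look like, the sketch below restarts a process after a number of consecutive failed probes. It is an assumption-laden example, not the planned implementation: `Supervisor`, `probe`, `restart`, and `max_failures` are hypothetical names, and the actual probing and restart mechanisms are not described in this report.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Supervisor:
    """Hypothetical supervisor that restarts processes failing health probes."""
    probe: Callable[[str], bool]    # returns True when the process responds correctly
    restart: Callable[[str], None]  # restarts the process with the given id
    max_failures: int = 3           # assumed number of consecutive failures tolerated
    failures: Dict[str, int] = field(default_factory=dict)

    def check(self, process_ids: List[str]) -> List[str]:
        """Probe every process once and return the ids that were restarted."""
        restarted = []
        for pid in process_ids:
            if self.probe(pid):
                # A successful probe resets the failure counter.
                self.failures[pid] = 0
                continue
            self.failures[pid] = self.failures.get(pid, 0) + 1
            if self.failures[pid] >= self.max_failures:
                self.restart(pid)
                self.failures[pid] = 0
                restarted.append(pid)
        return restarted
```

Requiring several consecutive failures before restarting avoids recycling a process on a single slow response, at the cost of a slightly longer detection time.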

Posted Oct 23, 2023 - 23:31 CEST

Resolved

This incident was resolved at 18:25 (Paris time). A postmortem will follow.

Posted Oct 20, 2023 - 09:22 CEST

Monitoring

We're monitoring the results but everything's fine. Seems to be related to a schema change in a storage component (to be confirmed after the RCA).

Posted Oct 19, 2023 - 18:49 CEST

Identified

We have mitigated the issue. Performance is back to normal. Still investigating the root cause.

Posted Oct 19, 2023 - 18:35 CEST

Investigating

We currently have some issues on our European infrastructure. A fix is in progress. Slight impact on acceleration: some pages may be slower, and some optimizations are disabled.

Posted Oct 19, 2023 - 18:04 CEST

This incident affected: Acceleration.