Workers were overloaded by an unexpectedly large volume of image optimizations.
Some page optimizations timed out.
All our customers were impacted for about one hour.
The optimization engine received far more images to optimize than the workers could handle. The workers became saturated and the optimization queue filled up.
Because the same workers optimize both pages and static assets, this load also affected HTML page optimizations.
We decided to stop optimizations for the configuration involved in order to return to a normal state.
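The saturation described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the platform's actual engine: it assumes a single bounded FIFO queue shared by image and page jobs, and the function name, queue capacity, and worker throughput are all hypothetical.

```python
from collections import deque


def simulate(arrivals, queue_capacity, workers_per_tick):
    """Toy model of one bounded FIFO queue shared by all job kinds.

    arrivals: one list per tick, each holding job kinds ("image" or "page").
    Returns two dicts: jobs completed and jobs rejected, per kind.
    """
    queue = deque()
    done = {"image": 0, "page": 0}
    rejected = {"image": 0, "page": 0}
    for tick_jobs in arrivals:
        for kind in tick_jobs:
            if len(queue) < queue_capacity:
                queue.append(kind)
            else:
                # Queue is full: in this model the job is lost (a timeout).
                rejected[kind] += 1
        # Workers drain the queue in arrival order, whatever the job kind.
        for _ in range(min(workers_per_tick, len(queue))):
            done[queue.popleft()] += 1
    return done, rejected


# Hypothetical workloads: steady traffic, then an image burst ahead of each page.
normal = [["page", "image"]] * 5
burst = [["image"] * 10 + ["page"]] * 5

print(simulate(normal, queue_capacity=4, workers_per_tick=2))
print(simulate(burst, queue_capacity=4, workers_per_tick=2))
```

In this model the steady workload completes every job, while under the burst the images fill the queue and every page job is rejected, even though page traffic itself did not change.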
21:32 : Huge traffic starts to arrive at the platform.
21:35 : The alerting system (PagerDuty, via Pingdom probe) sends an alert to the on-call ops phone.
22:02 : Workers start to recover
22:38 : Optimization is stopped on the configuration involved, customer is contacted
23:40 : The problematic website is rerouted to the origin, in agreement with the customer
Severity: 2 (performance degradation for the majority of our customers)
Time to detect: 2 minutes
Time to resolve: 2h05
Some of our customers experienced slower response times.