Degradation of performance
Incident Report for Fasterize
Postmortem

16/11/2016 Outage

  • Date: 18/11/2016
  • Writers: Tech team
  • Times are Paris times (UTC+1)

Description

Conditions

Workers were overloaded by an unexpected, very large volume of image optimizations. Some page optimizations timed out.
All our customers were impacted for about one hour.

Cause

The optimization engine received more images to optimize than the workers could handle. The workers became saturated and the optimization queue filled up. Because workers optimize both pages and static assets, this load also affected HTML page optimizations.
We decided to stop optimizations for the configuration involved in order to return to a normal state.

Timeline

21:32 : A very large volume of traffic starts to arrive at the platform.
21:35 : The alerting system (PagerDuty, via a Pingdom probe) sends an alert to the on-call ops phone.
22:02 : Workers start to recover
22:38 : Optimization is stopped on the configuration involved, customer is contacted
23:40 : The problematic website is rerouted to the origin, in agreement with the customer

Metrics

Severity: 2 (performance degradation for the majority of our customers)
Time to detect: 2 minutes
Time to resolve: 2h05

Impacts

Some of our customers were impacted by slower response times.

Countermeasures

  • Improve communication about websites being plugged in unexpectedly by partners.
  • Limit the size of the optimization task queue.
  • Configure some workers to handle page optimizations only.
  • Check the implementation of our availability probes.
  • To be investigated: ignore single requests to avoid potentially useless optimizations.
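
As an illustration of the two queue-related countermeasures, here is a minimal sketch in Python. It is not the actual Fasterize implementation: the class name, the two-queue layout, the capacities and the `pages_only` flag are all assumptions made for the example. It shows one possible way to bound the task queue and to keep some workers reserved for page optimizations, so that a burst of image tasks cannot starve HTML pages.

```python
import queue

PAGE, IMAGE = "page", "image"


class OptimizationDispatcher:
    """Hypothetical dispatcher: bounded queues, with page tasks kept separate."""

    def __init__(self, page_capacity: int, asset_capacity: int):
        # Bounded queues: once full, new tasks are rejected instead of piling up.
        self.page_queue = queue.Queue(maxsize=page_capacity)
        self.asset_queue = queue.Queue(maxsize=asset_capacity)

    def submit(self, kind: str, url: str) -> bool:
        """Return True if queued, False if the queue for this kind is full.

        On False, the caller would serve the unoptimized resource (assumption).
        """
        target = self.page_queue if kind == PAGE else self.asset_queue
        try:
            target.put_nowait((kind, url))
            return True
        except queue.Full:
            return False

    def next_task(self, pages_only: bool):
        """Page-only workers read only the page queue.

        General workers check the page queue first, then the asset queue.
        Returns None when there is nothing to do.
        """
        sources = [self.page_queue] if pages_only else [self.page_queue, self.asset_queue]
        for source in sources:
            try:
                return source.get_nowait()
            except queue.Empty:
                continue
        return None
```

With this layout, a flood of image tasks fills only the asset queue: further image submissions are rejected, while page tasks are still accepted and picked up by the page-only workers.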
Posted Nov 18, 2016 - 18:52 CET

Investigating
At approximately 9:30 PM (Paris time), workers were overloaded by an unexpected, very large volume of image optimizations.
Posted Nov 16, 2016 - 13:00 CET