Temporary Platform Unavailability
Incident Report for Fasterize
Postmortem

Post Mortem: Temporary Platform Unavailability

Event Date: May 15, 2024

Incident Duration: 11:29 AM to 11:55 AM

Incident Description:

The platform experienced an outage from 11:29 AM to 11:55 AM. Traffic was automatically routed to origins. Customers therefore lost the benefit of the solution, but the sites remained available during the incident.

Adding a large number of configurations to the platform increased both the memory consumption and the startup time of the front-layer services. Some services stopped and failed to restart correctly.

Event Timeline:

  • 11:17 AM: Addition of new configurations.
  • 11:21 AM: Detection of a memory shortage on a service, leading to the shutdown of a critical process.
  • 11:34 AM: Additional services become unavailable.
  • 11:38 AM: Widespread detection of the incident; automatic traffic redirection.
  • 11:45 AM: Attempts to restart services, partially successful.
  • 12:00 PM - 12:15 PM: Assessment and decision-making on corrective actions.
  • 12:33 PM: Modification of startup configurations to improve tolerance to startup time.

Analysis:

Two main factors led to this incident:

  • Our HTTP server requires a reload to take new configurations into account. During this reload, the number of processes for this service is doubled, which creates a risk of memory exhaustion.
  • The start timeout for the HTTP service was left at its default value, and we had no monitor alerting us that the HTTP service start time was approaching that limit.
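
The first factor can be illustrated with a rough headroom check. This is only a minimal sketch, assuming that old and new worker processes coexist for the duration of a reload and that each worker uses about the same amount of memory. The function names, the safety margin, and all figures are hypothetical and do not come from the platform's actual tooling.

```python
def peak_memory_during_reload_mb(worker_count: int, per_worker_mb: float) -> float:
    """Estimate the peak footprint of a reload.

    Assumption: old and new workers run side by side until the reload
    completes, so the number of processes (and their memory) doubles.
    """
    return 2 * worker_count * per_worker_mb


def reload_is_safe(available_mb: float, worker_count: int, per_worker_mb: float,
                   safety_margin: float = 0.1) -> bool:
    """Return True if the doubled footprint fits in memory, minus a margin.

    safety_margin is a hypothetical fraction of memory kept in reserve.
    """
    budget_mb = available_mb * (1 - safety_margin)
    return peak_memory_during_reload_mb(worker_count, per_worker_mb) <= budget_mb


if __name__ == "__main__":
    # Hypothetical figures: 8 workers at 500 MB each on a 16 GB host.
    print(reload_is_safe(available_mb=16000, worker_count=8, per_worker_mb=500))
    # Per-worker memory grows as configurations are added; at 1 GB each
    # the same reload no longer fits.
    print(reload_is_safe(available_mb=16000, worker_count=8, per_worker_mb=1000))
```

A check of this kind, run before configurations are added, would flag when per-process memory growth makes the next reload risky.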

Impact:

All users of the platform were affected by this incident.

Corrective and Preventive Measures:

  • Short term: Review of alert systems and adjustment of service startup configurations.
  • Medium term: Improvement of configuration management to reduce the number of configurations, and better monitoring of service startup.
  • Long term: Research into an alternative HTTP server that can apply configuration updates without impacting performance or memory consumption.
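
The startup monitoring mentioned above could take the form of a simple ratio check between the measured start time and the configured start timeout. The sketch below is illustrative only: the function name, the 70% warning threshold, and the 90-second timeout in the usage example are assumptions, not the values used on the platform.

```python
def startup_alert_level(start_seconds: float, timeout_seconds: float,
                        warn_ratio: float = 0.7) -> str:
    """Classify a measured service start time against its start timeout.

    Returns "critical" when the start time reaches the timeout, "warning"
    when it reaches warn_ratio of the timeout (hypothetical threshold),
    and "ok" otherwise.
    """
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")
    ratio = start_seconds / timeout_seconds
    if ratio >= 1.0:
        return "critical"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Hypothetical measurements against an assumed 90-second start timeout.
    for measured in (30, 70, 95):
        print(measured, startup_alert_level(measured, timeout_seconds=90))
```

Alerting at the warning level leaves time to raise the timeout or reduce the number of configurations before a service fails to start.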

Conclusion:

This incident highlights the importance of constant monitoring and proactive resource management to prevent outages. The measures taken should enhance the stability and reliability of the platform.

Posted May 15, 2024 - 17:54 CEST

Resolved
The platform experienced an outage from 11:29 AM to 11:55 AM. Traffic was automatically routed to origins. Customers therefore lost the benefit of the solution, but the sites remained available during the incident.
Posted May 15, 2024 - 11:30 CEST