Post Mortem: Temporary Platform Unavailability
Event Date: May 15, 2023
Incident Duration: 11:29 AM to 11:55 AM
Incident Description:
The platform experienced an outage from 11:29 AM to 11:55 AM. Traffic was automatically routed to the origins, so customer sites remained available throughout the incident, but customers lost the benefit of the solution.
Adding a large number of configurations to the platform increased both the memory consumption and the startup time of the front-layer services. Some services stopped and did not start correctly afterwards.
Event Timeline:
- 11:17 AM: Addition of new configurations.
- 11:21 AM: Detection of a memory shortage on a service, leading to the shutdown of a critical process.
- 11:34 AM: Additional services become unavailable.
- 11:38 AM: Widespread detection of the incident; automatic traffic redirection.
- 11:45 AM: Attempts to restart services, partially successful.
- 12:00 PM - 12:15 PM: Assessment and decision-making on corrective actions.
- 12:33 PM: Modification of startup configurations to improve tolerance to startup time.
Analysis:
Two main factors led to this incident:
- Our HTTP server requires a reload to take new configuration into account. During this reload, the number of processes for this service is doubled, which creates a risk of memory exhaustion.
- The start timeout for the HTTP service was left at its default value, and we did not have a monitor alerting us that the HTTP service start time was close to that limit.
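The first factor can be illustrated with a rough headroom check. This is a minimal sketch, not the check used on the platform: the function names, the 10% safety margin, and all figures in the usage note are hypothetical. It assumes the worst case, in which the old and new process sets coexist for the whole reload.

```python
def peak_reload_memory_mb(workers: int, mem_per_worker_mb: float) -> float:
    """Worst-case memory during a reload, assuming the old and new
    process sets run side by side (process count doubled)."""
    return 2 * workers * mem_per_worker_mb


def reload_is_safe(
    workers: int,
    mem_per_worker_mb: float,
    available_mb: float,
    safety_margin: float = 0.1,  # hypothetical: keep 10% of memory free
) -> bool:
    """Return True if the doubled process set fits in the memory budget."""
    budget_mb = available_mb * (1 - safety_margin)
    return peak_reload_memory_mb(workers, mem_per_worker_mb) <= budget_mb
```

With illustrative numbers, `reload_is_safe(8, 400, 8192)` holds because the peak of 6400 MB fits the budget, while `reload_is_safe(8, 500, 8192)` does not. Per-process memory grows with the number of loaded configurations, so a reload that was safe before a large batch of additions may no longer be safe after it.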
Impact:
All users of the platform were affected by this incident.
Corrective and Preventive Measures:
- Short term: Review of alert systems and adjustment of service startup configurations.
- Medium term: Improvement of configuration management to reduce the number of configurations, and improvement of service startup monitoring.
- Long term: Research into an alternative HTTP server to improve update management without impacting performance or memory consumption.
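The startup monitoring mentioned in these measures could take the form of a threshold check on the ratio between observed start time and the configured start timeout. The sketch below is only an illustration: the function name, the 70% and 90% thresholds, and the timeout values are hypothetical and are not taken from the platform's configuration.

```python
def startup_alert_level(
    start_seconds: float,
    timeout_seconds: float,
    warn_ratio: float = 0.7,   # hypothetical warning threshold
    crit_ratio: float = 0.9,   # hypothetical critical threshold
) -> str:
    """Classify how close a service's start time is to its start timeout."""
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")
    ratio = start_seconds / timeout_seconds
    if ratio >= crit_ratio:
        return "critical"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"
```

For example, with an assumed 90-second timeout, a 30-second start is classified as "ok", a 70-second start as "warning", and an 85-second start as "critical". Alerting on the ratio rather than on an absolute duration keeps the check meaningful if the timeout itself is later raised.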
Conclusion:
This incident highlights the importance of constant monitoring and proactive resource management to prevent outages. The measures taken should enhance the stability and reliability of the platform.