Partial outage due to an unavailable front
Incident Report for Fasterize
Resolved
Date of post-mortem: 21/03/2017

Writers: David, Chakib, Stéphane

All times are in Paris time (UTC+1)

# Description

A newly added front server became inaccessible, making some of the sites optimized by Fasterize partially unavailable.

One of our Elasticsearch nodes stopped indexing logs on 16/03/2017 at 20:29 UTC+1. The log message queues then became blocked, and the agents collecting logs on each server began buffering in memory the logs destined for this node.
On the server that became unavailable, memory usage increased continuously and spilled into swap until memory was exhausted on 19/03/2017 at 23:22 UTC+1. At that point, system CPU usage reached 100% and severely impacted all processes on the server, which was then unable to respond to HTTP and HTTPS traffic.
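
The failure mode described above, an agent buffering without limit while its downstream is blocked, can be contrasted with a bounded buffer. The following is a minimal sketch in Python and is not node-logstash code: the class `BoundedLogBuffer`, its `max_size` parameter, and the policy of refusing new lines when full are all assumptions made for illustration.

```python
from collections import deque


class BoundedLogBuffer:
    """Hypothetical in-memory log buffer with a hard size limit."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._items = deque()
        self.dropped = 0  # number of lines refused because the buffer was full

    def offer(self, line):
        """Store a line if there is room; otherwise refuse it and count the drop."""
        if len(self._items) >= self.max_size:
            self.dropped += 1
            return False
        self._items.append(line)
        return True

    def drain(self, n):
        """Remove and return up to n lines, oldest first."""
        out = []
        while self._items and len(out) < n:
            out.append(self._items.popleft())
        return out

    def __len__(self):
        return len(self._items)


# Downstream is blocked: only the first three lines are kept in memory.
buf = BoundedLogBuffer(max_size=3)
accepted = [buf.offer(f"line {i}") for i in range(5)]
```

With a limit of this kind, memory usage stays capped while the indexing node is down, at the cost of losing (or having to re-read from disk) the lines that were refused.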

# Facts / Timeline

- 16/03 at 20:29 UTC+1: indexing of logs stops on log02-dc1
- 19/03 at 23:22 UTC+1: steep increase in CPU load on front03-dc1. HAProxy traffic on this front then stops.
- 20/03 at 00:54 UTC+1: restart of the log collector and release of the memory.
- 20/03 at 00:56 UTC+1: resumption of traffic on front03-dc1
- 20/03 at 10:00 UTC+1: manual exclusion of front03-dc1, which was not properly monitored by our failover system
- 21/03 at 14:10 UTC+1: restart of the Elasticsearch cluster and of the log collector agents completed. The log indexing rate returns to normal.

# Metrics

Severity 1: Unplanned outage that affects a significant number of users
Time to detect: 1h22min
Time to resolve: 1h34min

# Countermeasures

- Prioritize the monitoring of front03-dc1 by our failover system
- Prioritize changing node-logstash so that it stops collecting logs when its message queue is full
- Detect Elasticsearch indexing problems
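
As an illustration of the last countermeasure, an indexing stall can be detected by sampling the document count of the cluster at regular intervals and alerting when it stops growing. The sketch below is only one possible approach and makes several assumptions: the function name `indexing_stalled`, the `(timestamp, doc_count)` sample format, and the threshold are hypothetical, and how the counts are collected from Elasticsearch is deliberately left out.

```python
def indexing_stalled(samples, max_stall_seconds):
    """Return True if the document count has been flat for at least max_stall_seconds.

    samples: list of (unix_timestamp, doc_count) tuples, oldest first.
    """
    if len(samples) < 2:
        return False  # not enough data to decide
    last_ts, last_count = samples[-1]
    # Walk backwards to find the earliest consecutive sample with the same count.
    stall_start = last_ts
    for ts, count in reversed(samples[:-1]):
        if count != last_count:
            break
        stall_start = ts
    return last_ts - stall_start >= max_stall_seconds


# Hypothetical samples taken at irregular intervals (timestamps in seconds).
healthy = [(0, 100), (60, 150), (120, 210)]
stalled = [(0, 100), (60, 150), (120, 150), (180, 150), (360, 150)]
healthy_result = indexing_stalled(healthy, max_stall_seconds=300)
stalled_result = indexing_stalled(stalled, max_stall_seconds=300)
```

A check like this would only be meaningful on a platform where logs arrive continuously; on a quiet system, a flat count is not necessarily a fault.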
Posted Mar 19, 2017 - 13:00 CET