All Systems Operational
API: Operational
Dashboard: Operational
Acceleration: Operational
Website: Operational
Collect: Operational
Past Incidents
Mar 23, 2017

Our hosting provider upgraded some network equipment (https://status.online.net/index.php?do=details&task_id=843), which caused network disruptions: around 3% of requests timed out, and response times increased for the remaining requests. There were two disruption periods, from 07:03 to 07:15 and from 08:00 to 08:11 (UTC+1).

Mar 22, 2017

No incidents reported.

Mar 21, 2017

No incidents reported.

Mar 20, 2017
Postmortem - Read details
Mar 21, 11:55 CET
Resolved - The configuration of our front SSL has been fixed.
The situation is back to normal. We will write a post-mortem.
Mar 20, 18:19 CET
Update - We have disabled Fasterize, and all traffic will be rerouted to your origins.
Mar 20, 18:12 CET
Investigating - We are currently investigating this issue.
Mar 20, 18:08 CET
Mar 19, 2017

Date of post-mortem: 21/03/2017

Writers: David, Chakib, Stéphane

Times are Paris times (UTC+1)

# Description

A newly added front server became inaccessible, making some of the sites optimized by Fasterize partially unavailable.

One of our Elasticsearch nodes stopped indexing logs on 16/03/2017 at 20:29 UTC+1. The log message queues then became blocked, and the agents collecting the logs on each server began buffering in memory the logs destined for this node.
On the server that became unavailable, memory usage grew continuously and spilled into swap until memory was exhausted on 19/03/2017 at 23:22 UTC+1. At that point, system CPU usage reached 100% and severely impacted all server processes. The server was then unable to respond to HTTP and HTTPS traffic.
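
The failure mode described above, an agent buffering logs in memory without any limit while its destination is blocked, can be contrasted with a bounded buffer. The following is a minimal Python sketch under stated assumptions, not the actual log collector code: the class `BoundedLogBuffer` and its methods are hypothetical names introduced only for illustration.

```python
from collections import deque


class BoundedLogBuffer:
    """Hypothetical in-memory buffer that refuses new lines once it is full."""

    def __init__(self, max_lines):
        self.max_lines = max_lines
        self._lines = deque()
        self.rejected = 0  # number of lines refused because the buffer was full

    def offer(self, line):
        """Store a line if there is room; return False to signal backpressure."""
        if len(self._lines) >= self.max_lines:
            self.rejected += 1
            return False
        self._lines.append(line)
        return True

    def drain(self, count):
        """Remove and return up to `count` lines once the destination recovers."""
        drained = []
        while self._lines and len(drained) < count:
            drained.append(self._lines.popleft())
        return drained

    def __len__(self):
        return len(self._lines)


# The destination is blocked, so nothing is drained while five lines arrive.
buffer = BoundedLogBuffer(max_lines=3)
accepted = [buffer.offer(f"line {i}") for i in range(5)]
```

With a fixed capacity, memory usage stays capped however long the destination is unavailable. The trade-off is that the caller has to decide what to do with refused lines (pause reading, drop, or spill to disk).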

# Facts / Timeline

- 16/03 at 20:29 UTC+1: log indexing stops on log02-dc1
- 19/03 at 23:22 UTC+1: steep increase in CPU load on front03-dc1. HAProxy traffic on this front then stops.
- 20/03 at 00:54 UTC+1: the log collector is restarted and the memory is released
- 20/03 at 00:56 UTC+1: traffic resumes on front03-dc1
- 20/03 at 10:00 UTC+1: front03-dc1, which was not properly monitored by our failover system, is manually excluded
- 21/03 at 14:10 UTC+1: the restart of the Elasticsearch cluster and the log collector agents is complete. The log indexing rate returns to normal.

# Metrics

Severity 1: Unplanned incident that affects a significant number of users
Time to detect: 1h22min
Time to resolve: 1h34min
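
As a sanity check, the time to resolve can be recomputed from the timeline. This Python sketch assumes that the resumption of traffic at 00:56 falls on 20/03, the day after the CPU load increase at 23:22 on 19/03. The helper `format_duration` is a hypothetical name used only for illustration.

```python
from datetime import datetime

# Timestamps taken from the timeline (Paris time, UTC+1).
# Assumption: the 00:56 resumption is on 20/03, the day after the 23:22 event.
impact_start = datetime(2017, 3, 19, 23, 22)
traffic_resumed = datetime(2017, 3, 20, 0, 56)


def format_duration(start, end):
    """Format the elapsed time between two datetimes as e.g. '1h34min'."""
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h{minutes:02d}min"


time_to_resolve = format_duration(impact_start, traffic_resumed)
print(time_to_resolve)
```

Under that assumption, the result matches the 1h34min reported above.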

# Countermeasures

- Prioritize adding front03-dc1 to the monitoring performed by our failover system
- Prioritize the change to node-logstash so that it stops collecting logs when its message queue is full
- Detect Elasticsearch indexing problems
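
One possible way to approach the last countermeasure is to sample the document count of an index at a fixed interval and raise an alert when it stops growing. The Python sketch below only illustrates that idea and is not the check actually deployed. The function `indexing_stalled`, its parameters, and the sample values are hypothetical, and how the counts are collected from Elasticsearch is left open.

```python
def indexing_stalled(doc_counts, min_new_docs=1):
    """Return True if the index grew by fewer than `min_new_docs` documents.

    `doc_counts` is a chronological list of document counts for one index,
    sampled at a fixed interval (hypothetical input, collection not shown).
    """
    if len(doc_counts) < 2:
        return False  # not enough samples to judge
    return doc_counts[-1] - doc_counts[0] < min_new_docs


# Made-up sample values, for illustration only.
healthy_samples = [1000, 1450, 1900, 2300]
stuck_samples = [2300, 2300, 2300, 2300]

healthy_result = indexing_stalled(healthy_samples)
stuck_result = indexing_stalled(stuck_samples)
```

The threshold and the sampling window would need tuning for indices that legitimately receive little traffic, otherwise quiet periods would trigger false alerts.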

Mar 18, 2017

No incidents reported.

Mar 17, 2017

No incidents reported.

Mar 16, 2017

No incidents reported.

Mar 15, 2017

No incidents reported.

Mar 14, 2017

No incidents reported.

Mar 13, 2017

No incidents reported.

Mar 12, 2017

No incidents reported.

Mar 11, 2017

No incidents reported.

Mar 10, 2017
Completed - The scheduled maintenance has been completed.
Mar 10, 04:04 CET
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Mar 9, 23:00 CET
Scheduled - To give a better picture of deployment impact, we use a scale from 0 to 3:
0: not critical (no impact is expected after the deployment of the feature or fix; it is often related to the Fasterize website, API or auxiliary tools)
1: minor (it may have a minor impact on the operation of optimized websites during the deployment; it is often related to the optimization engine or the cache layer)
2: major (it may have a major impact on optimized websites; it is advisable to monitor the websites during this deployment)

Fix
[1] [engine] Fix HTTP query params integrity
[0] [engine] Improve handling of resources with an empty body
[1] [engine] Improve custom HTML tag insertion
[1] [engine] Improve corrupted image detection

Feature
[0] [api] Expose customer HTTP access logs
Mar 8, 20:37 CET