🧾 Incident Summary
As part of preparations for the summer sales and in response to a load-handling issue observed on June 12, the tech team planned an update to a DNS record pointing to the platform’s frontend layer.
The goal was to change how CDN Edge servers route traffic to the platform.
Since the DNS record could not be edited in place, it was deleted and immediately recreated with a new routing policy.
During the brief deletion window (~10 seconds), some DNS resolvers fell back to a default wildcard record with a 3600s TTL that pointed to decommissioned IPs, instead of returning NXDOMAIN.
As a result, several CDN Edge servers cached these incorrect IPs and continued to use them for up to an hour, causing a global service outage on the affected POPs.
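To make the failure mode concrete, the following minimal Python sketch simulates it; the record names, IPs, and TTL values are illustrative assumptions, not our actual zone data. A caching resolver that matched the wildcard during the deletion window keeps serving the stale answer until the 3600s TTL expires, even after the correct record is back.

```python
# Simulation of the incident's DNS failure mode (names, IPs and timings are
# assumptions for illustration only).

WILDCARD_TTL = 3600  # TTL of the legacy wildcard record, in seconds

zone = {
    "frontend.example.com.": ("203.0.113.10", 60),              # record being replaced
    "*.example.com.":        ("198.51.100.99", WILDCARD_TTL),   # decommissioned IP
}

cache = {}  # per-resolver (per-POP) cache: name -> (ip, expiry_time)

def resolve(name: str, now: float) -> str:
    """Resolve like a caching resolver: cache first, exact match, then wildcard."""
    if name in cache and cache[name][1] > now:
        return cache[name][0]                      # cached answer, stale or not
    record = zone.get(name) or zone.get("*.example.com.")
    if record is None:
        raise LookupError("NXDOMAIN")
    ip, ttl = record
    cache[name] = (ip, now + ttl)                  # cached for the record's full TTL
    return ip

# t=0s: record deleted; t=5s: a resolver queries during the ~10s window
del zone["frontend.example.com."]
print(resolve("frontend.example.com.", now=5))     # 198.51.100.99 (wildcard, cached 3600s)

# t=10s: record recreated with the new routing policy
zone["frontend.example.com."] = ("203.0.113.20", 60)
print(resolve("frontend.example.com.", now=600))   # still 198.51.100.99 (served from cache)
print(resolve("frontend.example.com.", now=3700))  # 203.0.113.20 (cache finally expired)
```

Because each POP's resolvers cache independently, only the POPs whose resolvers happened to query during the window were affected, which matches what we observed.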
Some health-check probes also resolved to the wrong IPs, but not enough of them failed at the same time to trip the health-check thresholds, which require consecutive failures before switching to a degraded state.
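The threshold behaviour described above roughly corresponds to the logic below; this is a simplified sketch with an assumed threshold and made-up probe results, not our actual health-check configuration.

```python
# Simplified "consecutive failures" health-check threshold
# (threshold value and probe results are assumptions for illustration).

CONSECUTIVE_FAILURES_TO_DEGRADE = 3

def evaluate(probe_results: list[bool]) -> str:
    """Return 'degraded' only if enough *consecutive* probes failed."""
    streak = 0
    for ok in probe_results:
        streak = 0 if ok else streak + 1
        if streak >= CONSECUTIVE_FAILURES_TO_DEGRADE:
            return "degraded"
    return "healthy"

# Only some probes resolved to the wrong IPs, so failures were interleaved
# with successes and the streak never reached the threshold:
print(evaluate([True, False, True, False, False, True, False]))  # healthy
# A sustained, uniform failure would have tripped it:
print(evaluate([False, False, False, False]))                    # degraded
```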
This outage was not due to an infrastructure failure but to human error during a maintenance operation carried out without sufficient safeguards.
We consider this a major governance failure in infrastructure change management.
The affected DNS zone was created over 12 years ago and still holds legacy records; over time, several have become outdated or unused. However, because any cleanup carries a risk of accidentally deleting records that are still in use, with unexpected impact, no regular cleanup has been performed.
This lack of maintenance allowed a misconfigured wildcard record to persist, which was unintentionally triggered during the temporary deletion of a critical DNS record.
We’ve now initiated a full audit of this DNS zone to identify, document, and progressively remove obsolete records. A cross-validation policy will be enforced before any future changes.
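As an illustration of the kind of cross-validation guard we have in mind, the sketch below shows a change script that refuses to apply a destructive DNS change unless it has been approved by someone other than its author and never leaves a deletion without a matching recreation in the same batch. The record names, approval model, and helper types are hypothetical, not our actual tooling.

```python
# Hypothetical pre-change guard for DNS zone changes (illustrative only).

from dataclasses import dataclass

@dataclass
class Change:
    action: str   # "DELETE" or "CREATE"
    name: str     # record name, e.g. "frontend.example.com."

def validate(batch: list[Change], approvals: set[str], author: str) -> None:
    """Reject un-reviewed changes and deletions not paired with a recreation."""
    reviewers = approvals - {author}
    if not reviewers:
        raise PermissionError("change requires approval from someone other than the author")
    deleted = {c.name for c in batch if c.action == "DELETE"}
    created = {c.name for c in batch if c.action == "CREATE"}
    orphaned = deleted - created
    if orphaned:
        raise ValueError(f"deletion without recreation in the same batch: {sorted(orphaned)}")

# Example: a lone deletion with only the author's approval is rejected.
batch = [Change("DELETE", "frontend.example.com.")]
try:
    validate(batch, approvals={"alice"}, author="alice")
except (PermissionError, ValueError) as exc:
    print("rejected:", exc)
```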
Although a staging environment exists and is used to validate infrastructure changes, this specific scenario, involving live CDN traffic and DNS resolver behavior during the few seconds of deletion, was not anticipated.
Because DNS propagation and caching happen asynchronously and independently on each POP, the issue would have been very difficult to reproduce in a staging environment.
Short-term:
Medium-term:
This incident highlights how even short-lived DNS misconfigurations can cause major disruptions in distributed systems such as CDNs. Rigorous TTL management and better anticipation of critical DNS dependencies are essential to avoid similar outages in the future.
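As one concrete example of TTL management ahead of a planned change (the numbers below are illustrative assumptions, not a committed procedure), lowering the TTL of the affected record well in advance bounds how long any resolver can keep serving a stale answer, whatever goes wrong during the change itself.

```python
# Illustrative TTL planning for a scheduled DNS change (values are assumptions).

current_ttl = 3600   # seconds: worst-case staleness with today's records
lowered_ttl = 60     # temporary TTL applied before the maintenance window

# The lowered TTL only takes effect once every cached copy of the *old* TTL has
# expired, so the change must wait at least `current_ttl` after lowering it.
min_wait_before_change = current_ttl

print(f"lower TTL to {lowered_ttl}s at least {min_wait_before_change}s before the change")
print(f"worst-case stale window during the change drops from {current_ttl}s to {lowered_ttl}s")
```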
Despite this incident, the technical and customer success teams remain fully committed to delivering a smooth and successful sales period for our clients. A temporary change freeze is in place, and additional planning and capacity measures are being taken to ensure high availability and reliability throughout this peak traffic period.