Unavailability of the origin on several CDN Edge Servers leading to 502/504 errors for end users

Incident Report for Fasterize

Postmortem

🧾 Incident Summary

As part of preparations for the summer sales, and in response to a load-handling issue observed on June 12, the tech team planned an update to a DNS record pointing to the platform’s frontend layer.

The goal was to revise the way CDN Edge servers route traffic to the platform.

Since the DNS record could not be edited in place, it was deleted and immediately recreated with a new routing policy.

During the brief deletion (~10 seconds), some DNS resolvers fell back to a default wildcard record with a 3600s TTL, pointing to decommissioned IPs instead of returning NXDOMAIN.

As a result, several CDN Edge servers cached these incorrect IPs and continued to use them for up to an hour, causing a complete service outage on the affected POPs.
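To illustrate the failure mode, here is a minimal sketch (in Python, with hypothetical zone contents, hostnames, and IPs — not Fasterize’s real configuration) of how a resolver that matches a wildcard record returns stale addresses instead of NXDOMAIN while the explicit record is absent:

```python
# Minimal sketch of wildcard fallback during a delete/recreate window.
# Zone contents, hostnames and IPs are hypothetical, for illustration only.

zone = {
    "frontend.example-platform.net.": ["203.0.113.10", "203.0.113.11"],  # explicit record, short TTL
    "*.example-platform.net.": ["198.51.100.99"],  # legacy wildcard, 3600 s TTL, decommissioned IP
}

def resolve(qname: str) -> list[str]:
    """Return A records for qname, falling back to a covering wildcard when no exact match exists."""
    if qname in zone:
        return zone[qname]
    # No exact match: a wildcard covering the name answers instead of NXDOMAIN.
    wildcard = "*." + qname.split(".", 1)[1]
    if wildcard in zone:
        return zone[wildcard]
    raise LookupError("NXDOMAIN")

# Normal operation: the explicit record wins.
print(resolve("frontend.example-platform.net."))  # ['203.0.113.10', '203.0.113.11']

# During the ~10 s deletion window the explicit record is gone, so resolvers
# match the wildcard and cache the decommissioned IP for its full 3600 s TTL.
del zone["frontend.example-platform.net."]
print(resolve("frontend.example-platform.net."))  # ['198.51.100.99']
```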

Some health check probes also resolved to the wrong IPs, but not enough of them failed at once to trip the health check thresholds (which require consecutive failures to switch to a degraded state).
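To show why interleaved successes kept the checks green, here is a minimal sketch of a consecutive-failure threshold; the threshold value and probe outcomes below are hypothetical, not the platform’s actual settings:

```python
# Sketch of a health check that only declares the origin degraded after
# N consecutive failed probes. Threshold and probe results are hypothetical.

CONSECUTIVE_FAILURES_REQUIRED = 3

def origin_state(probe_results: list[bool]) -> str:
    """probe_results: chronological probe outcomes (True = healthy)."""
    streak = 0
    for healthy in probe_results:
        streak = 0 if healthy else streak + 1
        if streak >= CONSECUTIVE_FAILURES_REQUIRED:
            return "degraded"
    return "healthy"

# Only some probes resolved the wrong IPs, so failures were interleaved with
# successes and the consecutive-failure threshold was never reached.
print(origin_state([True, False, True, False, True, False]))  # healthy
print(origin_state([False, False, False, True]))              # degraded
```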


📅 Timeline

🧩 Root Cause Analysis

This outage was not caused by an infrastructure failure but by a human error during a maintenance operation carried out without sufficient safeguards.

We consider this a major governance failure in infrastructure change management.

The affected DNS zone was created over 12 years ago and currently holds legacy records. Over time, several have become outdated or unused. However, due to the risk of accidental deletion and unexpected impact, no regular cleanup has been performed.

This lack of maintenance allowed a misconfigured wildcard record to persist, which was unintentionally triggered during the temporary deletion of a critical DNS record.

We’ve now initiated a full audit of this DNS zone to identify, document, and progressively remove obsolete records. A cross-validation policy will be enforced before any future changes.

Although a staging environment exists and is used to validate infrastructure changes, the specific scenario here — related to active CDN traffic and DNS behavior during the few seconds of deletion — was not anticipated.

Due to the distributed and asynchronous nature of CDN DNS propagation and caching (per POP), the issue would have been very difficult to replicate in a staging environment.

✅ Immediate Fixes

  • Re-creation of the DNS record with the correct configuration
  • Manual checks to ensure resolvers and CDN POPs are now resolving to the correct origin (see the resolution-check sketch below)
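Below is a minimal sketch of the kind of manual resolution check that can be run against a resolver or from a POP’s vantage point; the hostname and expected IPs are hypothetical placeholders, not the actual origin addresses:

```python
# Sketch of a manual check: compare what the local resolver returns against
# the expected origin IPs. Hostname and IPs are hypothetical placeholders.
import socket

HOSTNAME = "frontend.example-platform.net"
EXPECTED_IPS = {"203.0.113.10", "203.0.113.11"}

def check_resolution(hostname: str, expected: set[str]) -> bool:
    try:
        _, _, resolved = socket.gethostbyname_ex(hostname)
    except socket.gaierror as exc:
        print(f"FAIL: {hostname} does not resolve ({exc})")
        return False
    unexpected = set(resolved) - expected
    if unexpected:
        print(f"STALE: {hostname} still resolves to {sorted(unexpected)}")
        return False
    print(f"OK: {hostname} -> {sorted(resolved)}")
    return True

check_resolution(HOSTNAME, EXPECTED_IPS)
```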

🔒 Preventive countermeasures

Short-term:

  • Freeze on infrastructure changes for two weeks

Medium-term:

  • Improve the staging environment to better simulate CDN-specific issues
  • Clean up outdated records in the platform's DNS zone
  • Formalize an emergency fallback procedure
  • Introduce more logical zones to avoid widespread impact across clients


Conclusion

This incident highlights how even short-lived DNS misconfigurations can cause major disruptions in distributed systems like CDNs. Rigorous TTL management and better anticipation of how critical DNS records are used are essential to avoid similar outages in the future.

Despite this incident, the technical and customer success teams remain fully committed to delivering a smooth and successful sales period for our clients. A temporary change freeze is in place, and additional planning and capacity measures are being taken to ensure high availability and reliability throughout this peak traffic period.

Posted Jun 19, 2025 - 10:39 CEST

Resolved

This incident is now resolved. A post-mortem will follow, but to sum up: the root cause was a change to a DNS record. During that change, the record pointing to our DC temporarily took a wrong value, which was picked up by some edge servers and cached for one hour. This affected only a subset of edge servers and only a subset of the health checkers responsible for triggering the failover mechanism, which explains why the failover mechanism wasn't fully triggered.
Posted Jun 18, 2025 - 10:55 CEST

Update

The failover mechanism didn't trigger automatically. We have triggered it manually.
Posted Jun 18, 2025 - 09:49 CEST

Investigating

We are currently experiencing issues on one of our European DCs. A fix is in progress. Traffic is interrupted for a large portion of customers. We are really sorry for the inconvenience.
Posted Jun 18, 2025 - 09:41 CEST
This incident affected: Acceleration and CDN.