This week, Facebook suffered its worst outage since 2008, remaining unreachable for roughly six hours. The economic cost of this outage will be large, in both dollars and customer goodwill, and Facebook will undoubtedly spend many person-hours on post-mortems and on repairs to prevent a recurrence. While the root cause was not a DNS failure, it is worth considering how the operators of websites and Internet services can protect themselves against unexpected events, such as interruptions and outages in the DNS.
Every DNS record set has a TTL (time-to-live) value that specifies how long a resolver is allowed to keep that record set in its cache. The TTLs on critical DNS records are very important, yet there is little solid guidance on how to choose them. Intuitively, service operators often feel that low TTLs provide agility. That can be true, but even slightly longer TTLs provide an important buffer against disaster. This article examines some of the pitfalls that come with low TTLs and will help you select appropriate TTL values.
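To make this concrete, here is a minimal sketch of inspecting the TTL on a record set. It assumes the third-party dnspython library (installable with "pip install dnspython"), and the hostname is a placeholder:

```python
# Minimal sketch: inspect the TTL on a record set using dnspython
# (a third-party library; this is an illustrative assumption, not
# part of the standard library).
import dns.resolver

answer = dns.resolver.resolve("www.example.com", "A")  # placeholder name
# Note: a recursive resolver reports the *remaining* cache lifetime,
# so repeating this query shows the TTL counting down until the
# record set expires and is fetched again from the authoritative servers.
print(f"{answer.rrset.name} A, TTL = {answer.rrset.ttl} seconds")
for rr in answer:
    print(rr.address)
```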
A short TTL for agility
Service operators sometimes view DNS caching as an enemy rather than a friend. When it becomes necessary to move a website or service to another set of machines or another hosting provider, its IP address changes. If this change is unplanned and must occur suddenly, then clients may be directed at the old address, and so experience an outage, for up to the TTL of the address record. One obvious response to this potential problem is to set TTLs to a low value, like 15 seconds. This bounds the damage from the change itself: no matter what infrastructure changes happen in the future, caches converge on the new address within 15 seconds, so clients never see more than a brief interruption. The ability to make agile infrastructure changes in a timely manner is certainly important, and it is true that low TTLs offer more agility.
Resolution latency
It may seem that all DNS TTLs should therefore be low, but there are pitfalls. The first is resolution latency. If the DNS records for a service are constantly expiring from caches around the world, resolvers must continually re-fetch them from the authoritative DNS servers. Latency, especially during web browsing, is a critical metric for many services. Realistically, the latency of an extra DNS resolution is usually not a huge problem: applications resolve and fetch in parallel, and some resolvers prefetch records as they approach cache expiration. But if latency is important, remember that lower TTLs will result in slightly higher average latency.
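A rough way to observe this cost yourself is to time a cache miss against a cache hit. The sketch below (again assuming dnspython, with a placeholder hostname) typically shows the first query forcing the local resolver to contact the authoritative servers, while the immediate repeat is answered from its cache; exact numbers depend entirely on your local resolver:

```python
# Rough sketch: compare a likely cache miss against a likely cache hit.
# Caching here happens in the local recursive resolver, not in dnspython.
import time
import dns.resolver

def timed_lookup(name: str) -> float:
    """Return the wall-clock time of one A-record lookup, in milliseconds."""
    start = time.perf_counter()
    dns.resolver.resolve(name, "A")
    return (time.perf_counter() - start) * 1000

name = "www.example.com"  # placeholder
print(f"first lookup:  {timed_lookup(name):.1f} ms")  # likely a cache miss
print(f"second lookup: {timed_lookup(name):.1f} ms")  # likely a cache hit
```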
Longer TTLs as defense against DNS outages
The second pitfall is more insidious and more dangerous: how will a website or service be impacted by a DDoS attack, DNS-impacting network issue, or even a brief DNS outage lasting a few minutes? The DNS is the lifeblood of the Internet, but it can experience periods of unavailability caused by DDoS attacks on DNS providers or networks, as well as by outages caused by bad networking changes and bugs.
This is where the choice of TTL arguably matters most. A TTL of 15 seconds means that suddenly moving the service to a new IP address results in a worst-case outage of 15 seconds. However, it also means that if the authoritative DNS servers become unreachable, cached copies of the record start expiring within 15 seconds. If the DNS were interrupted for much longer than that, virtually all clients of the service would experience a total outage!
Of course, one can argue that this is unlikely to occur. Most of the time, the DNS is extremely reliable. As the owner of a website or service, you might operate for months or years with 15-second TTLs and never notice a problem. However, even a relatively brief disruption in the DNS will take a service backed by 15-second TTLs effectively offline. A longer TTL provides some protection against this scenario: the ability to survive a short outage caused by a DDoS attack, a network issue, or a DNS provider failure means fewer headaches and happier customers.
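As a back-of-envelope illustration (a simplified model of my own, not a measurement): if resolvers last refreshed a record at roughly uniformly random times, then during an authoritative outage of a given length, approximately min(outage / TTL, 1) of caches expire and stop answering.

```python
# Simplified, illustrative model (an assumption, not measured data):
# assuming caches refreshed the record at uniformly random times, the
# fraction that expire during an authoritative outage is roughly
# min(outage / ttl, 1.0).
def fraction_affected(ttl_seconds: float, outage_seconds: float) -> float:
    return min(outage_seconds / ttl_seconds, 1.0)

for ttl in (15, 300, 3600):
    pct = fraction_affected(ttl, outage_seconds=120) * 100
    print(f"TTL {ttl:>5}s, 2-minute outage: ~{pct:.0f}% of caches expire")
# TTL 15s -> ~100%; TTL 300s -> ~40%; TTL 3600s -> ~3%
```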
The DNS industry is hard at work trying to mitigate the impact of outages on short-TTL records by serving stale records when the authoritative DNS servers cannot be reached (RFC 8767, "Serving Stale Data to Improve DNS Resiliency"). However, the major DNS resolvers have not yet implemented this Proposed Standard, and while it is very promising, it is not a complete solution.
TTLs and load balancing
Another factor that must be considered when selecting TTLs is load balancing. Load balancing is sometimes performed by handing out different DNS answers to different clients in a near-random fashion. Higher TTLs can interfere with this, because each cached answer pins a resolver's entire client population to one set of addresses for the duration of the TTL, making traffic shifts slower and coarser. If your service relies on this type of load balancing, you may not have much flexibility to increase your TTL values, but some increase can probably still be made.
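You can observe this style of load balancing with a few repeated queries against a name that returns multiple address records. A small sketch (assuming dnspython, placeholder hostname) follows; against a caching resolver you will usually see the same set, possibly reordered, with the remaining TTL counting down, which is exactly why long TTLs slow this kind of traffic steering:

```python
# Small sketch: repeat a query and watch the answer set and remaining TTL.
# A caching resolver serves the same cached set for the whole TTL,
# though it may rotate the order of the addresses between answers.
import dns.resolver

for i in range(3):
    answer = dns.resolver.resolve("www.example.com", "A")  # placeholder name
    addresses = [rr.address for rr in answer]
    print(f"query {i + 1} (TTL {answer.rrset.ttl:>4}s): {addresses}")
```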
A different TTL per DNS record type
The last point to consider is the TTLs on the different types of DNS data in a zone. For records that rarely change, such as SOA, NS, and the TXT records that carry SPF data, longer TTLs are almost always a good idea. A common recommendation for generally static records is a TTL of an hour or even longer, up to a few days. This is especially true for NS records: they are critical to the operation of a zone, so they should be relatively static and carry somewhat longer TTLs than other records.
Final thoughts
Be cautious when using short TTLs. They may work perfectly well under normal conditions, but they leave a website or service unable to survive even the briefest interruption in the DNS. Instead, consider a longer TTL, ideally the longest the service can realistically tolerate. If agility is absolutely needed on some DNS records, consider a TTL of perhaps five minutes instead of a few seconds. For records that are unlikely to need urgent changes, use a TTL of an hour or even longer. This may seem like conservative advice, but consider how Google manages its DNS presence. At the time of writing, the TTL on the NS records for google.com is two days, the TTL on the SPF (TXT) record set for google.com is one hour, and the TTL on the A record for www.google.com is five minutes. These values strike a reasonable balance between agility and outage survivability.
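If you want to check these values for yourself, the quick audit sketch below (assuming dnspython) queries the TTLs on a few record types in the spirit of the google.com example. Keep in mind that a recursive resolver reports the remaining cache time; to see the configured values, query the zone's authoritative servers directly:

```python
# Quick audit sketch: print the TTL on several record types for a zone.
# Values from a recursive resolver reflect remaining cache time, not
# necessarily the TTL configured in the zone.
import dns.resolver

zone = "google.com"
for name, rtype in ((zone, "NS"), (zone, "TXT"), ("www." + zone, "A")):
    answer = dns.resolver.resolve(name, rtype)
    print(f"{name} {rtype}: TTL {answer.rrset.ttl} seconds")
```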
In typical situations, the following table gives a reasonable rule of thumb for setting TTL values.
| RECORD TYPE   | CONSERVATIVE TTL | RECOMMENDED TTL | AGGRESSIVE TTL |
|---------------|------------------|-----------------|----------------|
| A/AAAA        | 1 hour           | 5 minutes       | 60 seconds     |
| NS            | > 2 days         | 2 days          | 12 hours       |
| MX            | 1 day            | 4 hours         | 1 hour         |
| TXT (for SPF) | 1 day            | 1 hour          | 5 minutes      |