Stop using low DNS TTLs (blog.apnic.net)
csense 14 hours ago [-]
DNS is something you rarely change but that has costly consequences if you mess it up: a mistake can bring down an entire domain and keep it down until the TTL expires.

If you set your TTL to an hour, it raises the costs of DNS issues a lot: A problem that you fix immediately turns into an hour-long downtime. A problem that you don't fix on the first attempt and have to iteratively try multiple fixes turns into an hour-per-iteration downtime.

Setting a low TTL is an extra packet and round-trip per connection; that's too cheap to meter [1].

When I first started administering servers I set TTL high to try to be a good netizen. Then after several instances of having to wait a long time for DNS to update, I started setting TTL low. Theoretically it causes more friction and resource usage but in practice it really hasn't been noticeable to me.

[1] For the vast majority of companies / applications. I wouldn't be surprised to learn someone somewhere has some "weird" application where high TTL is critical to their functionality or unit economics but I would be very surprised if such applications were relevant to more than 5% of websites.

compumike 19 hours ago [-]
The big thing that articles like this miss completely is that we are no longer in the brief HTTP/1.0 era (1996) where every request is a new TCP connection (and therefore possibly a new DNS query).

In the HTTP/1.1 (1997) or HTTP/2 era, the TCP connection is made once and then stays open (Connection: Keep-Alive) for multiple requests. This greatly reduces the number of DNS lookups per HTTP request.

If the web server is configured for a sufficiently long Keep-Alive idle period, then this period is far more relevant than a short DNS TTL.

If the server dies or disconnects in the middle of a Keep-Alive, the client/browser will open a new connection, and at this point, a short DNS TTL can make sense.

(I have not investigated how this works with HTTP/3, i.e. QUIC over UDP: how often does the client/browser do a DNS lookup? My suspicion is that it also does a DNS query only on the initial connection and then sends UDP packets to the same resolved IP address for the life of that connection, so it behaves exactly like the TCP Keep-Alive case.)
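
A rough sketch of the Keep-Alive point, using only Python's standard library (host and paths are placeholders): one HTTPSConnection means the hostname is resolved once, and every request after that rides the same TCP connection for as long as the server keeps it open.

  import http.client

  # One connection object: the hostname is resolved when the first request
  # opens the socket, and later requests reuse that socket (HTTP/1.1
  # keep-alive) for as long as the server allows.
  conn = http.client.HTTPSConnection("example.com", timeout=10)
  for path in ("/", "/about", "/contact"):
      conn.request("GET", path)
      resp = conn.getresponse()
      resp.read()  # drain the body so the connection can be reused
      print(path, resp.status)
  conn.close()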

hannasm 18 hours ago [-]

  > patched an Encrypted DNS Server to store the original TTL of a response, defined as the minimum TTL of its records, for each incoming query

The article seems to be based on capturing live DNS data from a real network. While it may be true that persistent connections reduce the number of DNS lookups, the article's measurements would already account for that, unless their network is only using HTTP/1.0 for some reason.

I agree that a low TTL could help during an outage if you actually wanted to move your workload somewhere else, and I didn't see that mentioned in the article. But I've never actually seen it done in my experience; setting the TTL extremely low for some sort of extreme DR scenario smells like an anti-pattern to me.

Consider the counterpoint: a high TTL can prevent your service from going down if the DNS server crashes or loses connectivity.

tracker1 1 days ago [-]
I usually set mine to between an hour and a day, unless I'm planning to update/change them "soon" ... though I've been meaning to go from a /29 to /28 on my main server for a while, just been putting off switching all the domains/addresses over.

Maybe this weekend I'll finally get the energy up to just do it.

Neywiny 1 days ago [-]
I guess I'm not sure I understand the solution. I use a low value (idk, 15 minutes maybe?) because I don't have a static IP and I don't want that to cause issues. It's just me connecting to my home server, so I'm not adding noticeable traffic like a real company or something, but what am I supposed to do? Is there a way for me to send an update such that all online caches get updated without needing to wait for them to time out?
viraptor 1 days ago [-]
For a private server with not many users this is mostly irrelevant. Use a low TTL if you want to, since you're putting basically zero load on the DNS system.

> such that all online caches get updated

There's no such thing. Apart from millions of dedicated caching servers, each end device will have its own cache. You can't invalidate DNS entries at that scope.

zamadatix 1 days ago [-]
I used to get more excited about this, but even when browsers don't do a DNS prefetch (or even a complete preload), the latency for lookups is usually still so low on the list of performance-impacting design decisions that it is unlikely to ever outweigh even the slightest advantages (or be worth correcting misperceived advantages) until we all switch to writing really, really, REALLY optimized web solutions.
gertop 1 days ago [-]
The irony here is that news.ycombinator.com has a 1 second TTL. One DNS query per page load and they don't care, yay!
a012 1 days ago [-]
Joke's on them, because I use NextDNS with caching, so all TTLs are 3600s.
garciasn 2 days ago [-]
Could it be because folks set it low for initial propagation and then never change it back after everything is set up?
fukawi2 2 days ago [-]
That's not how TTL works. Or do you mean propagation after changing an existing RR?

It's "common" to lower a TTL in preparation for a change to an existing RR, but you need to make sure you lower it at least one full current-TTL period before the change. Keeping the TTL low after the change isn't beneficial unless you're planning for the possibility of reverting the change.

A low TTL on a new record will not speed propagation. Resolvers either have the new record cached or they don't. If it's cached, the TTL doesn't matter because the resolver already has the record (propagated). If it isn't cached, then the resolver doesn't know the TTL yet, so it doesn't matter if it's 1 second or 1 month.
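
A minimal sketch of the "lower, wait, change" procedure above, assuming dnspython is installed (the hostname is a placeholder). Note that a recursive resolver reports the remaining cached TTL rather than the authoritative value, so query the authoritative server directly if you want to be precise.

  import time
  import dns.resolver  # pip install dnspython

  def wait_out_current_ttl(name: str, rtype: str = "A") -> None:
      # After lowering the TTL in the zone, wait at least one old-TTL period
      # so every resolver that cached the long TTL has expired it.
      answer = dns.resolver.resolve(name, rtype)
      ttl = answer.rrset.ttl
      print(f"{name} {rtype}: waiting {ttl}s before applying the RR change")
      time.sleep(ttl)

  # wait_out_current_ttl("www.example.com")
  # ...now make the actual record change, then raise the TTL again later.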

garciasn 1 days ago [-]
I meant both. Initial (which you say doesn't matter; TIL) and edits after-the-fact. I learned something new today and I've been doing DNS crap for decades; I feel like a doofus.
bigstrat2003 1 days ago [-]
Technically the initial propagation does depend somewhat on TTL. If you query the server and get the response that the record doesn't exist, that negative response gets cached too (based on the TTL of the SOA record). But it's pretty unusual for that to matter if you're standing up a new server.
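
For the curious, a hedged sketch (dnspython assumed, domain is a placeholder) of where that negative-caching TTL comes from: per RFC 2308 it is the lesser of the SOA record's own TTL and its MINIMUM field.

  import dns.resolver  # pip install dnspython

  soa_answer = dns.resolver.resolve("example.com", "SOA")
  soa = soa_answer.rrset[0]
  # A cached NXDOMAIN/NODATA answer lives for min(SOA TTL, SOA MINIMUM);
  # a recursive resolver will report an already-decremented SOA TTL.
  print("negative-cache TTL ~", min(soa_answer.rrset.ttl, soa.minimum), "seconds")
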
deceptionatd 2 days ago [-]
Maybe, but I don't think TTL matters for speed of initial propagation. I do set it low when I first configure a website so I don't have to wait hours to correct a mistake I might not have noticed.
kevincox 1 days ago [-]
Yes. Statistically the most likely time to change a record is shortly after previously changing it. So it is a good idea to use a low TTL when you change it, then after a stability period raise the TTL as you are less likely to change it in the future.
deceptionatd 2 days ago [-]
I have mine set low on some records because I want to be able to change the IP associated with specific RTMP endpoints if a provider goes down. The client software doesn't use multiple A records even if I provide them, so I can't use that approach; and I don't always have remote admin access to the systems in question so I can't just use straight IPs or a hostfile.
jurschreuder 1 days ago [-]
It's because updating DNS does not work reliably, so it's always a lot of trial and error, which you can only see after the caches update.
joelthelion 21 hours ago [-]
Could you make your changes with a low TTL and switch to a longer one once you are satisfied with the results?
throw20251220 14 hours ago [-]
How does that help you if you have to wait for the long TTL to expire before the short one takes effect?
GuinansEyebrows 1 days ago [-]
i was taught this as a matter of professional courtesy in my first job working for an ISP that did DNS hosting and ran its own DNS servers (15+ years ago). if you have a cutover scheduled, lower the TTL at $cutover_time - $current_ttl. then bring the TTL back up within a day or two in order to minimize DNS chatter. simple!

of course, as internet speeds increase and resources are cheaper to abuse, people lose sight of the downstream impacts of impatience and poor planning.
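
the same rule in a few lines of python, with made-up example values: lower the ttl no later than $cutover_time - $current_ttl, and put it back up once the change has settled.

  from datetime import datetime, timedelta

  cutover_time = datetime(2025, 6, 1, 2, 0)   # planned cutover window (example)
  current_ttl = timedelta(hours=4)            # ttl currently published on the record

  # lower the ttl at or before this moment so every cache has expired the
  # old, long-ttl copy by the time the cutover happens
  lower_by = cutover_time - current_ttl
  print(f"lower the TTL by {lower_by:%Y-%m-%d %H:%M}, restore it a day or two after cutover")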

effnorwood 1 days ago [-]
Sometimes they need to be low if you use the values to send messages to people.
1970-01-01 1 days ago [-]
(2019)
zamadatix 1 days ago [-]
Discussed in 2022 (106 comments) https://news.ycombinator.com/item?id=33527642

And a similar version of the same blog post on a personal blog in 2019 https://news.ycombinator.com/item?id=21436448 (thanks to ChrisArchitect for noting this in the only comment on a copy from 2024).

ece 21 hours ago [-]
I've never changed the default; Squarespace, GoDaddy, Cloudflare, and Porkbun have all been an hour or so.
bjourne 2 days ago [-]
I don't understand why the author doesn't consider load balancing and failover legitimate use cases for low ttl. Cause it wrecks their argument?
kevincox 1 days ago [-]
Because unless your TTL is exceptionally long you will almost always have a sufficient supply of new users to balance. Basically you almost never need to move old users to a new target for balancing reasons. The natural churn of users over time is sufficient to deal with that.

Failover is different and more of a concern, especially if the client doesn't respect multiple returned IPs.

mannyv 13 hours ago [-]
You are misunderstanding how HA works with DNS TTLs.

Now there are multiple kinds of HA, so we'll go over a bunch of them here.

Case 1: You have one host (host A) on the internet and it dies, and you have another server somewhere (host B) that's a mirror but with a different IP. When host A dies you update DNS so clients can still connect, but now they connect to host B. In that case the client will not connect to the new IP until their DNS resolver gets the new IP. This was "failover" back in the day. That is dependent on the DNS TTL (and the resolver, because many resolvers and caches ignore the TTL and use their own).

In this case a high TTL is bad, because the user won't be able to connect to your site for TTL seconds + some other amount of time. This is how everyone learned it worked, because this is the way it worked when the inter webs were new.

Case 2: instead of one DNS record with one host you have a DNS record with both hosts. The clients will theoretically choose one host or the other (round robin). In reality it's unclear if they actually do that. Anecdotal evidence shows that it worked until it didn't, usually during a demo to the CEO. But even if it did, that means 50% of your requests will hit an X-second timeout as the clients try to connect to a dead host. That's bad, which is why nobody in their right mind did it. And some clients always picked the first host, because that's how DNS clients are sometimes.
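
For case 2, the client-side behaviour you'd want looks roughly like this standard-library sketch (hostname is a placeholder): resolve every address up front and try each one with a short timeout instead of hanging on the first dead host. (socket.create_connection already does essentially this.)

  import socket

  def connect_with_fallback(host: str, port: int, timeout: float = 3.0) -> socket.socket:
      last_err = None
      # Try every A/AAAA result in order; a dead host costs one short timeout
      # instead of taking the whole request down.
      for family, socktype, proto, _, sockaddr in socket.getaddrinfo(host, port, type=socket.SOCK_STREAM):
          try:
              s = socket.socket(family, socktype, proto)
              s.settimeout(timeout)
              s.connect(sockaddr)
              return s
          except OSError as exc:
              last_err = exc
      raise last_err if last_err else OSError("no addresses returned")

  # conn = connect_with_fallback("www.example.com", 443)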

Putting a load balancer in front of your hosts solves this. Do load balancers die? Yeah, they do. So you need two load balancers...which brings you back to case 1.

These are the basic scenarios that a low DNS TTL fixes. There are other, more complicated solutions, but they're really specialized and require more control of the network infrastructure...which most people don't have.

This isn't an "urban legend" as the author states. These are hard-won lessons from the early days of the internet. You can also not have high availability, which is totally fine.

johntash 1 days ago [-]
I'm assuming OP means cloud-based load balancers (listening on public ips). Some providers scale load balancers pretty often depending on traffic which can result in a set of new IPs.
electroly 20 hours ago [-]
Being specific: AWS load balancers use a 60 second DNS TTL. I think the burden of proof is on TFA to explain why AWS is following an "urban legend" (to use TFA's words). I'm not convinced by what is written here. This seems like a reasonable use case by AWS.
BitPirate 2 days ago [-]
Why do you need a low ttl for those? You can add multiple IPs to your A/AAAA records for very basic load balancing. And DNS is a pretty bad idea for any kind of failover. You can set a very low ttl, but providers might simply enforce a larger one.
toast0 2 days ago [-]
You don't want to add too many A/AAAA records, or your response gets too big and you run into fun times. IIRC, you can do about 8 of each before you get to the magic 512 byte length (yeah, you're supposed to be able to do more, 1232 bytes as of 2020-10-01, but if you can fit in 512 bytes, you might have better results on a few networks that never got the memo)

And then if you're dealing with browsers, they're not the best at trying everything, or they may wait a long time before trying another host if the first is non-responsive. For browsers and rotations that really do change, I like a 60 second TTL. If it's pretty stable most of the time, 15 minutes most of the time, and crank it down before intentional changes.

If you've got a smart client that will get all the answers, and reasonably try them, then 5-60 minutes seems reasonable, depending on how often you make big changes.

All that said, some caches will keep your records basically forever, and there's not much you can do about that. Just gotta live with it.
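
A deliberately naive back-of-the-envelope sketch of that 512-byte budget. It ignores the EDNS OPT record, CNAME chains, and longer owner names, all of which eat into the budget, which is why a conservative rule of thumb like the one above lands much lower than these best-case counts.

  HEADER = 12                                # fixed DNS header size in bytes

  def question_size(name: str) -> int:
      # each label gets a length byte, plus the terminating root byte, plus QTYPE+QCLASS
      return len(name) + 2 + 4

  A_RR = 2 + 2 + 2 + 4 + 2 + 4               # compressed name ptr, type, class, TTL, rdlen, IPv4 = 16
  AAAA_RR = 2 + 2 + 2 + 4 + 2 + 16           # same shape with a 16-byte IPv6 address = 28

  budget = 512 - HEADER - question_size("www.example.com")
  print(budget // A_RR, "A records fit")       # 29 in the best case
  print(budget // AAAA_RR, "AAAA records fit") # 17 in the best case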

deceptionatd 1 days ago [-]
It's not good as a first line of defense for failover, but with some client software and/or failure mechanisms there aren't any better approaches I'm aware of. Some of the software I administer doesn't understand multiple A/AAAA records.

And a BGP failure is a good example too. It doesn't matter how resilient the failover mechanisms for one IP are if the routing tables are wrong.

Agreed about some providers enforcing a larger one, though. DNS propagation is wildly inconsistent.

Matheus28 2 days ago [-]
If you add multiple IPs to a record, a lot of resolvers will simply use the first one. So even in that case you need a low TTL to shuffle them constantly
bjourne 1 days ago [-]
What if your site is slashdotted or HN hugged to death? Most requests will hit the IP of the first record, while the others idle.
Bender 2 days ago [-]
Perhaps because most these days are using Anycast [1] to do failovers. It's faster and not subject to all the oddities that come with every application having its own interpretation of the DNS RFCs (most notably Java and all its work-arounds that people may or may not be using) and all the assorted recursive cache servers that also have their own quirks, making Anycast a more reliable and predictable choice.

[1] - https://en.wikipedia.org/wiki/Anycast

c45y 2 days ago [-]
Probably an expectation for floating IPs for load balancing instead of DNS.

Relatively simple inside a network range you control, but I have no idea how that works across different networks in geographically redundant setups.

deceptionatd 2 days ago [-]
Agreed; I have no idea how you'd implement that across multiple ASNs, which is definitely a requirement for multi-cloud or geo-redundant architectures.

Seems like you'd be trying to work against the basic design principles of Internet routing at that point.

_bernd 24 hours ago [-]
You can configure your assigned network numbers so that other ASes are allowed to announce certain networks of yours. Not uncommon for, for example, authoritative name server addresses.
deceptionatd 23 hours ago [-]
TIL, I always thought IP:ASN mappings were 1:1.
_bernd 23 hours ago [-]
With cloud providers and such the wording could also be "bring your own address".
preisschild 2 days ago [-]
Anycast pretty much
UltraSane 13 hours ago [-]
DNS TTLs are a terrible method of directing traffic because you can't rely on clients to honor it.