Series: System Design · Networking & Protocols — Pillar 2 of 8

DNS: The Phone Book That Runs the Internet

Systems Design

#	Post	What it covers
00	Networking & Protocols: How Bytes Actually Travel	Before you can design systems that scale, you need to understand how bytes actually travel. Eight concepts every backend engineer must know.
01	The OSI Model: The Map Every Engineer Needs	The OSI model isn't just interview theory: it's the map that tells you exactly where in the stack a network problem lives. Here's how to use it.
02	TCP vs UDP: Reliability vs Speed at the Transport Layer	TCP guarantees delivery. UDP doesn't look back. Understanding why each exists, and when to reach for each, is fundamental to network design.
03	HTTP vs HTTPS: The Language of the Web and Its Secure Version	301 Moved Permanently, 302 Found, 304 Not Modified
04	TLS/SSL: How HTTPS Actually Works Under the Hood	TLS is what puts the S in HTTPS. Here's how the handshake works, what a certificate actually contains, and why TLS 1.3 matters for performance.
05	DNS: The Phone Book That Runs the Internet ← you are here	DNS is the phone book of the internet, and one of the most misunderstood layers in the stack. Here's how it works and how it fails.
06	DNS Load Balancing: Traffic Distribution at the Name Layer	DNS load balancing distributes traffic before a single packet reaches your servers. Here's how it works, where it excels, and where it falls short.
07	Anycast Routing: One Address, Everywhere at Once	One IP address, dozens of locations, zero client configuration. Anycast is how the fastest global infrastructure works. Here's the mechanism behind it.
08	CDN: Moving Content Closer to the People Who Need It	A CDN isn't just a cache in front of your server. Here's how content delivery networks work, when they help, and when they add complexity for nothing.
09	Networking & Protocols: Wrap-Up	A complete recap of the eight core networking concepts (OSI, TCP, HTTP, TLS, DNS, CDN) and how they connect into a complete picture.

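In the URL shortener: sho.rt needs an A record (or AAAA for IPv6) pointing to the load balancer or CDN. If using a CDN like Cloudflare, you'd typically point a CNAME at the CDN's hostname and let the CDN manage the underlying A records. Because sho.rt is a zone apex, where a standard CNAME isn't allowed, this relies on the provider's CNAME flattening or ALIAS-style records; a subdomain such as www.sho.rt can use a plain CNAME. A CAA record restricting certificate issuance to your chosen CA is a low-effort security hardening step worth adding.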
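As a rough illustration of that record layout, the sketch below models the zone as plain tuples and checks one standard DNS rule: a name that holds a CNAME cannot hold any other record type. The addresses are taken from the documentation ranges, and the CDN hostname and CA name are placeholders invented for the example.

```python
from collections import defaultdict

# Hypothetical record set for the sho.rt zone. Addresses come from the
# documentation ranges (203.0.113.0/24, 2001:db8::/32); the CA and CDN
# hostnames are placeholders, not recommendations.
RECORDS = [
    ("sho.rt.", "A", "203.0.113.10"),
    ("sho.rt.", "AAAA", "2001:db8::10"),
    ("sho.rt.", "CAA", '0 issue "ca.example"'),
    ("www.sho.rt.", "CNAME", "sho-rt.cdn.example."),
]


def cname_conflicts(records):
    """Return names that hold a CNAME alongside any other record type."""
    types_by_name = defaultdict(set)
    for name, rtype, _value in records:
        types_by_name[name].add(rtype)
    return sorted(
        name
        for name, types in types_by_name.items()
        if "CNAME" in types and len(types) > 1
    )


# Adding a CNAME at the apex collides with the A, AAAA, and CAA records there.
BAD_RECORDS = RECORDS + [("sho.rt.", "CNAME", "sho-rt.cdn.example.")]

print(cname_conflicts(RECORDS))
print(cname_conflicts(BAD_RECORDS))
```

A real apex also carries SOA and NS records, which is why a CNAME there conflicts even in an otherwise empty zone.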

TTL: the caching contract

Every DNS record has a TTL (Time To Live) — the number of seconds resolvers are permitted to cache the answer before asking again. TTL is the single most important DNS configuration decision for operational flexibility.

TTL tradeoff:

High TTL (86400 = 24 hours)
  ✓ Fast resolution (served from cache almost always)
  ✓ Lower load on authoritative nameservers
  ✗ Changes take up to TTL seconds to propagate globally
  ✗ Decommissioning old IPs is risky during propagation window

Low TTL (60 = 1 minute)
  ✓ Changes propagate quickly
  ✓ Failover and traffic shifting are fast
  ✗ Higher query load on authoritative nameservers
  ✗ More uncached lookups, marginally higher latency for users
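To put rough numbers on the tradeoff, here is a back-of-the-envelope sketch. It assumes a deliberately simple model in which each caching resolver refetches the record at most once per TTL, so the query rate at the authoritative servers is bounded by resolvers divided by TTL. The resolver count is a made-up figure for illustration.

```python
def max_authoritative_qps(active_resolvers: int, ttl_seconds: int) -> float:
    """Upper bound on queries/second if every resolver refetches once per TTL."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return active_resolvers / ttl_seconds


RESOLVERS = 50_000  # hypothetical number of distinct caching resolvers

for ttl in (86_400, 3_600, 300, 60):
    qps = max_authoritative_qps(RESOLVERS, ttl)
    # Worst-case staleness after a change equals the TTL itself.
    print(f"TTL {ttl:>6}s  stale for up to {ttl:>6}s  at most {qps:8.2f} queries/s")
```

The same number that bounds how long a stale answer can survive also sets how often resolvers come back, which is why the two columns move in opposite directions.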

The standard operational pattern: run a high TTL (3600–86400 seconds) during normal operation for performance. Drop the TTL to 60–300 seconds ahead of a planned change (infrastructure migrations, IP changes, failover tests), at least one full period of the old TTL in advance so that caches holding the long-lived answer have expired. When you then make the change, propagation is fast. Raise the TTL again after the change stabilises.

The mistake that causes incidents: making a change with a high TTL already in place, then discovering that rollback requires waiting out the full TTL window while users hit broken infrastructure.
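The timing behind this pattern can be sketched as simple date arithmetic. The function below is illustrative and assumes resolvers honour TTLs: after lowering the TTL you wait one full old-TTL period before changing the record, and the old answer then disappears within one new-TTL period of the change. The timestamp is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def change_window(lowered_at: datetime, old_ttl: int, new_ttl: int):
    """Return (earliest_safe_change, old_answer_gone_by) for a TTL drop.

    Assumes TTL-honouring resolvers and that the record change is made
    at the earliest safe moment.
    """
    earliest_change = lowered_at + timedelta(seconds=old_ttl)
    gone_by = earliest_change + timedelta(seconds=new_ttl)
    return earliest_change, gone_by


# Hypothetical schedule: TTL lowered from 24 hours to 60 seconds.
lowered = datetime(2024, 1, 10, 9, 0, tzinfo=timezone.utc)
change, gone = change_window(lowered, old_ttl=86_400, new_ttl=60)

print("make the change no earlier than:", change.isoformat())
print("old answer expired everywhere by:", gone.isoformat())
```

Skipping the wait means the rollback window is the old TTL, not the new one, which is the incident scenario described above.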

The resolver chain

Between your application and the authoritative nameserver, there are typically multiple layers of caching:

Application
    │ (check local cache)
    ▼
OS DNS cache (nscd, systemd-resolved)
    │ (check OS cache)
    ▼
Recursive Resolver (your ISP, 8.8.8.8, 1.1.1.1)
    │ (check resolver cache)
    ▼
Authoritative Nameserver

Each layer caches independently up to the TTL. Lowering your TTL doesn't instantly clear caches that already hold the old answer — it only affects lookups that happen after the existing cached entry expires. This is why "lower the TTL first, then make the change" is the correct operational sequence.
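A small simulation makes this concrete. The classes below are a toy model, not a real resolver: each caching layer stores the answer until the TTL it received runs out and hands the remaining TTL to the layer above it. Class names, addresses, and timings are all invented for the example.

```python
class Authoritative:
    """Source of truth: always returns the current answer and its full TTL."""

    def __init__(self, answer, ttl):
        self.answer, self.ttl = answer, ttl

    def resolve(self, now):
        return self.answer, self.ttl


class CachingLayer:
    """One cache in the chain (OS cache, recursive resolver, ...)."""

    def __init__(self, upstream):
        self.upstream = upstream
        self.cached = None  # (answer, expires_at)

    def resolve(self, now):
        if self.cached is None or now >= self.cached[1]:
            answer, ttl = self.upstream.resolve(now)
            self.cached = (answer, now + ttl)
        answer, expires_at = self.cached
        return answer, expires_at - now  # pass on the remaining TTL


auth = Authoritative("203.0.113.10", ttl=3600)
os_cache = CachingLayer(CachingLayer(auth))  # OS cache -> recursive -> authoritative

first = os_cache.resolve(now=0)            # old answer cached for 3600s
auth.answer, auth.ttl = "203.0.113.20", 60  # record changed, TTL lowered too late
stale = os_cache.resolve(now=120)          # caches never see the new TTL
fresh = os_cache.resolve(now=3600)         # old entry finally expires

print(first, stale, fresh, sep="\n")
```

Two minutes after the change the client still gets the old address with 3480 seconds left on it; the new 60-second TTL only takes effect once the hour-long entry has expired.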

DNS failure modes

DNS is deceptively simple in the happy path. The failure modes are where engineers get caught out:

Propagation delay. DNS changes are not instantaneous. Even with a 60-second TTL, some resolvers (particularly ISP resolvers in certain regions) don't respect TTLs and cache records longer than specified. Plan for global propagation of a DNS change to take at least 5–15 minutes, however low the TTL is set.
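One way to reason about this is to model a resolver that enforces a minimum cache time as applying the larger of the record's TTL and its own floor. The sketch below does exactly that; the floor values are hypothetical, since real resolver policies vary and are rarely published.

```python
def effective_ttl(record_ttl: int, resolver_floor: int = 0) -> int:
    """TTL a resolver actually applies when it enforces a minimum cache time."""
    return max(record_ttl, resolver_floor)


# Hypothetical resolver policies, in seconds.
RESOLVER_FLOORS = {
    "honours TTL": 0,
    "5-minute floor": 300,
    "15-minute floor": 900,
}

RECORD_TTL = 60
for label, floor in RESOLVER_FLOORS.items():
    print(f"{label:>16}: cached for up to {effective_ttl(RECORD_TTL, floor)}s")

# The slowest resolver in the population sets the real propagation time.
worst_case = max(effective_ttl(RECORD_TTL, f) for f in RESOLVER_FLOORS.values())
print("plan for at least", worst_case, "seconds")
```

Under these assumed floors, a 60-second TTL still means a 15-minute wait before the last resolver picks up the change.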

NXDOMAIN negative caching. When a lookup returns NXDOMAIN because the name doesn't exist, resolvers cache that negative result too. Under RFC 2308 the negative-cache lifetime is the lesser of the zone's SOA record TTL and the SOA MINIMUM field. This means that if you fix a DNS misconfiguration, clients that already received NXDOMAIN may continue to get it from cache for minutes or longer, depending on those values.
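The lifetime of a cached NXDOMAIN can be worked out from the zone's SOA record. The sketch below follows the RFC 2308 rule, which takes the smaller of the SOA record's own TTL and its MINIMUM field; the SOA values shown are hypothetical.

```python
def negative_cache_ttl(soa_ttl: int, soa_minimum: int) -> int:
    """Seconds a resolver may cache NXDOMAIN, per the RFC 2308 rule."""
    return min(soa_ttl, soa_minimum)


# Hypothetical SOA settings: (TTL of the SOA record, SOA MINIMUM field).
for soa_ttl, soa_minimum in [(3600, 300), (120, 3600), (86_400, 86_400)]:
    ttl = negative_cache_ttl(soa_ttl, soa_minimum)
    print(f"SOA TTL {soa_ttl:>6}s, MINIMUM {soa_minimum:>6}s -> NXDOMAIN cached up to {ttl}s")
```

The last row is the one to watch for: a zone with day-long SOA values can keep a mistaken "does not exist" answer alive for a day after the record is created.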

DNS as a single point of failure. If your authoritative DNS provider experiences an outage, clients cannot resolve your domain once their cached answers expire, even if your servers are perfectly healthy. The October 2016 DDoS attack on Dyn made Twitter, Reddit, Spotify, and dozens of other major services unreachable for many users at the same time, because they relied on the same DNS provider with no secondary. Using multiple DNS providers (or a secondary DNS provider) is the mitigation.

DNS hijacking and poisoning. DNS responses can be spoofed or poisoned — an attacker inserts false records into a resolver's cache, redirecting traffic to a malicious server. DNSSEC (DNS Security Extensions) adds cryptographic signatures to DNS records, allowing resolvers to verify their authenticity. Adoption is still incomplete but growing.

Split-horizon DNS. A common production pattern where the same hostname resolves to different IPs depending on whether the query comes from inside or outside the network — internal queries get a private IP, external queries get a public IP. Works well until an internal service makes an external DNS query and gets an answer it can't route to, or until debugging becomes confusing because the same hostname returns different results depending on where you run the lookup.
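The view selection behind split-horizon DNS can be sketched with the standard `ipaddress` module. The networks, addresses, and function name below are assumptions made for the example; real DNS servers implement this through their own view or policy configuration.

```python
import ipaddress

# Hypothetical internal address space and per-view answers for one hostname.
INTERNAL_NETS = [ipaddress.ip_network("10.0.0.0/8")]
VIEWS = {"internal": "10.0.5.20", "external": "203.0.113.10"}


def answer_for(client_ip: str) -> str:
    """Pick the answer based on where the query appears to come from."""
    addr = ipaddress.ip_address(client_ip)
    is_internal = any(addr in net for net in INTERNAL_NETS)
    return VIEWS["internal" if is_internal else "external"]


print(answer_for("10.1.2.3"))       # query from inside the network
print(answer_for("198.51.100.7"))   # query from the public internet

# Failure mode: an internal service that resolves through a public resolver
# is seen as an outside client, so it receives the external view.
print(answer_for("198.51.100.7") == VIEWS["external"])
```

The decision depends only on the source address of the query, not on who is actually asking, which is the root of both failure modes described above.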

The tradeoffs

Managed DNS vs self-hosted. Self-hosting authoritative nameservers gives maximum control but introduces operational risk — your DNS infrastructure becomes a SPOF you're responsible for. Managed DNS providers (Cloudflare, Route 53, Google Cloud DNS) provide global Anycast infrastructure, DDoS protection, and high availability for a modest cost. For almost all teams, managed DNS is the right choice.

Single DNS provider vs multi-provider. A single managed DNS provider is a SPOF at the DNS layer, as the Dyn incident demonstrated. Multi-provider DNS, where the zone's NS records list nameservers from two providers, removes that single point of failure. The operational complexity is higher: changes must be applied to both providers, and zone file synchronisation must be managed. The tradeoff is worth it for services where DNS unavailability is catastrophic.
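The synchronisation burden is easy to underestimate, so a drift check is worth automating. The sketch below compares two zones that have already been exported into dictionaries; the data shape and the example records are assumptions, and fetching the records from each provider's API is left out.

```python
def zone_drift(primary: dict, secondary: dict) -> dict:
    """Return {(name, type): (primary_values, secondary_values)} where they differ."""
    drift = {}
    for key in sorted(primary.keys() | secondary.keys()):
        a = primary.get(key, frozenset())
        b = secondary.get(key, frozenset())
        if a != b:
            drift[key] = (a, b)
    return drift


# Hypothetical exports: (name, type) -> set of record values.
provider_a = {
    ("sho.rt.", "A"): frozenset({"203.0.113.20"}),
    ("www.sho.rt.", "CNAME"): frozenset({"sho.rt."}),
}
provider_b = {
    ("sho.rt.", "A"): frozenset({"203.0.113.10"}),  # change not applied here
    ("www.sho.rt.", "CNAME"): frozenset({"sho.rt."}),
}

drift = zone_drift(provider_a, provider_b)
for (name, rtype), (a, b) in drift.items():
    print(f"{name} {rtype}: provider A={sorted(a)} provider B={sorted(b)}")
```

Because resolvers pick among all listed nameservers, any drift like this means some fraction of users get the stale answer until both providers agree.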

DNS-based health checking. Many DNS providers offer health checks that automatically remove unhealthy IP addresses from DNS responses. This provides basic load balancing and failover at the DNS layer. The limitations are real: health checks have latency, TTL means changes don't propagate instantly, and DNS cannot recall an answer it has already handed out, so a client that received an IP before it was removed will keep using it until its cached record expires or its connection drops. DNS health checking is a useful tool, not a substitute for proper load balancing.
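A minimal sketch of the idea, with a rough estimate of how long failover takes end to end. The function names, check interval, and failure threshold are hypothetical, and returning the full pool when every target looks unhealthy is a design choice of this sketch, not a claim about any particular provider.

```python
def healthy_answers(pool, is_healthy):
    """Return only healthy IPs; fail open if every target looks unhealthy."""
    healthy = [ip for ip in pool if is_healthy(ip)]
    return healthy or list(pool)


def worst_case_failover_seconds(check_interval, failures_needed, ttl):
    """Time to detect the failure plus time for cached answers to expire."""
    return check_interval * failures_needed + ttl


POOL = ["203.0.113.10", "203.0.113.11", "203.0.113.12"]
down = {"203.0.113.11"}  # hypothetical failed target

print(healthy_answers(POOL, lambda ip: ip not in down))
print(healthy_answers(POOL, lambda ip: False))  # all failing: serve everything

# Hypothetical settings: 30s checks, 3 consecutive failures, 60s TTL.
print(worst_case_failover_seconds(check_interval=30, failures_needed=3, ttl=60), "seconds")
```

Even with a 60-second TTL, the detection delay pushes worst-case failover to two and a half minutes under these assumed settings, before accounting for resolvers that cache longer than the TTL.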

The one thing to remember

DNS is the first thing that happens for every user request, and the last thing most engineers think about. Your TTL is a contract with the world's resolvers — understand what you've promised before you need to change it under pressure. Lower TTL before planned changes, not during them. Use a managed DNS provider with multiple nameservers. And treat your DNS provider as a critical dependency worthy of the same resilience thinking you apply to your databases and load balancers.

← Previous: TLS/SSL. The previous post went inside the TLS handshake: how encryption is negotiated, what a certificate actually contains, and why TLS 1.3 matters for performance.

→ Next: DNS Load Balancing — DNS resolves names to IPs. The next post covers how to use that resolution step itself as a traffic distribution mechanism — and where it works brilliantly and where it falls short.
