DNS Deep Dive - Mohammadali Bazyar

What This Article Is About

You type example.com into a browser. A second later you see a webpage. In that one second, somewhere between five and twenty different servers around the world cooperated to figure out which IP address actually belongs to "example.com", and the answer was cached at four or five different layers along the way. That whole machinery is DNS.

DNS is one of those systems where everything seems boring until something breaks. Then suddenly your site is "down" even though every server is healthy, your CDN is fine, and your code is working. A large share of internet outages have a DNS root cause somewhere. If you understand DNS well, you become the person who actually finds the bug instead of guessing.

This is a complete walkthrough. Every layer, every record type that matters, the failure modes, and the practical patterns you'll actually use.

The Core Problem DNS Solves

Computers route packets using IP addresses (numbers like 93.184.216.34 or 2606:2800:220:1:248:1893:25c8:1946). Humans cannot remember those. Humans want names like example.com.

DNS is a globally distributed key-value store, where the keys are names and the values are mostly IP addresses (plus a bunch of other useful data). It needs to be:

Globally consistent enough. Every device in the world should be able to look up example.com and get the same answer (within reason).
Decentralized. No single company owns the namespace. Anyone can buy a domain.
Fast. Every web request, every email, every API call needs DNS first. Slow DNS means slow internet.
Scalable. Trillions of queries per day across hundreds of millions of names.

It achieves all of this through hierarchy, delegation, and aggressive caching.

The Hierarchy

DNS is shaped like a tree. The root is at the top, then top-level domains (TLDs), then second-level domains, then subdomains. Each level is administered by different people.

DNS Tree (Top to Bottom)

. (root zone)
    delegates to
Top-Level Domains (TLDs): .com, .org, .io, .uk
    delegates to
Second-Level (the domain you buy): example.com, mabazyar.com
    delegates to
Subdomains (you control): www.example.com, api.example.com, mail.example.com

Root servers (.): there are 13 logical root servers, named A through M. Each is anycast across multiple sites, from a handful to several hundred, adding up to well over a thousand physical instances around the world. They're operated by different organizations (Verisign, ICANN, NASA, the US Department of Defense, etc.). Their only job: tell you which servers handle which TLD.

TLD servers (.com, .org, .io): operated by registries (Verisign for .com and .net, Public Interest Registry for .org, etc.). Their job: tell you which authoritative nameservers handle a specific second-level domain.

Authoritative nameservers: the ones with the actual answers for your domain. Either run by you or, more commonly, by a DNS provider (Route 53, Cloudflare, NS1, Dyn, GoDaddy). Their job: return the actual records.

The genius is the delegation. The root doesn't know your domain's IP. It just knows who knows who knows.
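As a rough illustration of "who knows who knows", the sketch below lists the chain of names a lookup walks through, from the root down. The function name is made up for this example, and it assumes every dot is a delegation point, which is not always true in real zones (a single zone can contain several labels).

```python
def delegation_chain(name: str) -> list[str]:
    """Return the names consulted from the root down to the full name."""
    labels = [label for label in name.rstrip(".").split(".") if label]
    chain = ["."]  # the root zone
    # Build suffixes from right to left: com., example.com., www.example.com.
    for i in range(len(labels) - 1, -1, -1):
        chain.append(".".join(labels[i:]) + ".")
    return chain


print(delegation_chain("www.example.com"))
# ['.', 'com.', 'example.com.', 'www.example.com.']
```

Reading the output left to right is the same as reading the tree top to bottom: each entry is handled by servers that the previous entry points to.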

How a Lookup Actually Works

Let's trace a fresh lookup of www.example.com, assuming nothing is cached anywhere.

Recursive DNS Resolution Flow

Your Browser
    1. give me www.example.com
OS Stub Resolver
    2. forward query
Recursive Resolver (1.1.1.1)
    3a. who handles .com? -> Root Server
    3b. ask .com TLD -> .com TLD Server
    3c. ask example.com NS -> Authoritative NS for example.com
    3d. returns A record -> Resolver caches answer
    4. answer back to you
Browser connects to IP

Step by step:

1. Your browser asks the OS for the IP of www.example.com. The OS has a small "stub resolver".

2. The stub forwards the query to a recursive resolver. This is configured per machine (DHCP usually pushes one from your ISP). Common public ones: 1.1.1.1 (Cloudflare), 8.8.8.8 (Google), 9.9.9.9 (Quad9).

3. The resolver does the recursive work:

3a. It asks a root server: "where do I find .com info?" Root says "ask any of these .com TLD servers".
3b. It asks a .com TLD server: "where is example.com?" TLD says "ask ns1.example-dns.com or ns2.example-dns.com". This is called a "referral".
3c. It asks the authoritative nameserver: "what's the A record for www.example.com?" The authoritative server returns 93.184.216.34.
3d. The resolver caches the answer (for the TTL specified) and returns it.

4. Browser connects to 93.184.216.34 over TCP/443 and starts the TLS handshake.

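If anything is already cached, layers are skipped. In practice, most queries never get past the resolver's cache at the start of step 3, and those that do usually skip 3a and 3b because the resolver already has the TLD and nameserver referrals cached.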
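The walk above can be sketched as a toy simulation. Nothing here touches the network: the SERVERS table, the server labels, and the function names are all invented for illustration, and a real resolver also handles TTLs, timeouts, multiple nameservers, and much more.

```python
# Toy stand-ins for real servers. All names and data here are illustrative.
SERVERS = {
    "root": {"com.": ("referral", "tld-com")},
    "tld-com": {"example.com.": ("referral", "ns1.example-dns.com")},
    "ns1.example-dns.com": {"www.example.com.": ("answer", "93.184.216.34")},
}


def query(server: str, name: str) -> tuple:
    """Ask one server; it replies from the longest suffix it knows about."""
    zone = SERVERS[server]
    matches = [s for s in zone if name == s or name.endswith("." + s)]
    if not matches:
        return ("nxdomain", None)
    return zone[max(matches, key=len)]


def resolve(name: str, cache: dict) -> tuple:
    """Follow referrals from the root. Returns (ip, servers_contacted)."""
    if name in cache:
        return cache[name], []      # answered from cache, no walk at all
    server, contacted = "root", []
    while True:
        contacted.append(server)
        kind, value = query(server, name)
        if kind == "answer":
            cache[name] = value     # a real resolver would also store the TTL
            return value, contacted
        if kind == "nxdomain":
            raise LookupError(f"{name} does not exist")
        server = value              # referral: ask the next server down


cache = {}
print(resolve("www.example.com.", cache))  # full walk: root, TLD, authoritative
print(resolve("www.example.com.", cache))  # second call: no servers contacted
```

The second call returning an empty list of contacted servers is the whole point of the caching layers: the expensive walk happens once, then everyone behind that resolver reuses the answer.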

The Caching Layers

DNS is fast because almost nothing actually does the full walk. Caching happens at every layer.

Browser cache: Chrome, Firefox, etc. cache DNS lookups in memory. Visit chrome://net-internals/#dns to look up a host in Chrome or clear its cache.
OS cache: Windows has a DNS Client service. macOS has mDNSResponder. Linux varies (systemd-resolved on most modern distros).
Recursive resolver cache: the heaviest cache. ISP and public resolvers cache popular records aggressively.
Authoritative servers: they don't cache in the usual sense (they're the source of truth), although secondary servers hold a copy of the zone that they obtain from the primary by zone transfer.

When you change a DNS record, all these caches still hold old values until their TTL expires. This is what people mean by "DNS propagation". Nothing is being pushed anywhere. The change is instant at your authoritative server. Caches around the world simply don't know yet.
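A minimal sketch of what one of those caches does, assuming a made-up TtlCache class with an injectable clock so the example can fast-forward time instead of sleeping. Real caches add size limits, negative entries, and per-record-type keys.

```python
import time


class TtlCache:
    """Keeps each answer until its TTL runs out, then forgets it."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # name -> (value, expires_at)

    def put(self, name, value, ttl):
        self._entries[name] = (value, self._clock() + ttl)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]  # expired: the next lookup must go upstream
            return None
        return value


now = [0.0]                                  # fake clock we can move by hand
cache = TtlCache(clock=lambda: now[0])
cache.put("www.example.com", "93.184.216.34", ttl=300)

now[0] = 299.0
print(cache.get("www.example.com"))          # still the old answer
now[0] = 300.0
print(cache.get("www.example.com"))          # None: time to ask again
```

Notice that nothing tells the cache the record changed. It keeps serving the stored value until the clock passes the expiry time, which is exactly the "propagation" delay people observe.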

Record Types You Actually Use

The DNS database supports dozens of record types. The ones you'll actually deal with:

A: maps a name to an IPv4 address. example.com IN A 93.184.216.34. The most common record.

AAAA: maps a name to an IPv6 address. Same purpose as A, just for v6. (Pronounced "quad A".)

CNAME: alias from one name to another. www.example.com CNAME example.com means "www is just another name for the apex". Important rule: a CNAME cannot coexist with other records at the same name. You can't have a CNAME and an MX on example.com simultaneously. This causes endless config grief.

MX: mail exchange. Tells other mail servers "to send mail to @example.com, deliver it here". Has a priority field; lower is higher priority.

TXT: arbitrary text. Originally for human notes, now used for everything: SPF (anti-spoofing for email), DKIM (email signing), domain ownership verification (Google, AWS, etc. ask you to add a TXT record to prove you own the domain), DMARC policy, and more.

NS: nameserver records. They name the authoritative servers for a zone; the copies held in the parent zone are what delegate the domain to those servers. When you "change DNS providers", you're changing the NS records that your registrar publishes in the parent zone.

SOA: "Start of Authority". Metadata about the zone: primary nameserver, admin email, serial number, refresh/retry/expire timers. There's exactly one SOA per zone.

CAA: "Certificate Authority Authorization". Lists which CAs can issue TLS certs for your domain. example.com CAA 0 issue "letsencrypt.org" means only Let's Encrypt is allowed to issue certs. CAs check this before issuing.

SRV: service records. Encode service location, port, priority, weight. Used by SIP, XMPP, Kerberos, some Microsoft services. Less common in web stacks.

PTR: reverse DNS. Maps an IP back to a name. Mail servers check PTR records to detect spam (does the sender's IP have matching forward and reverse DNS?).

ALIAS / ANAME (provider-specific): like a CNAME but works at the apex (where CNAMEs aren't allowed). The provider does CNAME-like resolution server-side. Not a standard DNS record; provider-specific feature.
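The CNAME coexistence rule is easy to check mechanically. The sketch below assumes a simplified record syntax of "name [IN] TYPE data" with no TTL field, which is narrower than a real zone file, and both function names are invented for this example.

```python
from collections import defaultdict


def parse_record(line: str) -> tuple:
    """Parse 'name [IN] TYPE data' into (name, type, data)."""
    fields = line.split()
    if len(fields) > 1 and fields[1].upper() == "IN":
        fields.pop(1)  # the class field is optional in this simplified syntax
    name = fields[0].lower().rstrip(".")
    return name, fields[1].upper(), " ".join(fields[2:])


def cname_conflicts(lines: list[str]) -> list[str]:
    """Return names that have a CNAME alongside any other record type."""
    types_by_name = defaultdict(set)
    for line in lines:
        name, rtype, _ = parse_record(line)
        types_by_name[name].add(rtype)
    return sorted(
        name for name, types in types_by_name.items()
        if "CNAME" in types and len(types) > 1
    )


zone = [
    "example.com IN A 93.184.216.34",
    "example.com IN MX 10 mail.example.com",
    "www.example.com CNAME example.com",
    "shop.example.com CNAME shops.example.net",
    'shop.example.com TXT "v=spf1 -all"',
]
print(cname_conflicts(zone))  # ['shop.example.com']
```

Most DNS providers run a check like this for you and reject the second record, which is where the "config grief" usually shows up.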

TTL: The Knob That Matters Most

Every record has a TTL (Time To Live), in seconds. It tells caches "you may keep this answer for up to N seconds before asking again".

This is the most consequential setting in DNS. Get it wrong and either your site is slow or your changes don't take effect.

Short TTL (60-300 seconds): changes propagate fast. But every cache miss = a fresh query. Costs more, slower for end users (more lookups), more load on authoritative servers.

Medium TTL (1 hour): nice balance. Common default.

Long TTL (24 hours+): cheap, fast for users. But emergencies (failover, accidental misconfig) take a long time to fix.

The standard pattern for planned changes: a few days before the change, lower the TTL to 300 seconds. Wait at least one full old TTL so every cache has picked up the shorter value. Make the change. After the change is verified, raise the TTL back up. This way you get fast deploy + cheap normal operation.
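The timing of that pattern is simple arithmetic. Here is a small sketch, assuming a hypothetical helper named ttl_change_plan and well-behaved caches that honor TTLs exactly.

```python
from datetime import datetime, timedelta


def ttl_change_plan(change_at: datetime, old_ttl: int, low_ttl: int = 300) -> dict:
    """Work out the key moments around a planned record change."""
    return {
        # Lower the TTL at least one full old TTL before the change, so every
        # cache has re-fetched the record (now with the short TTL) by then.
        "lower_ttl_by": change_at - timedelta(seconds=old_ttl),
        "make_change_at": change_at,
        # After the change, a cache can hold the old value for at most low_ttl.
        "old_value_gone_by": change_at + timedelta(seconds=low_ttl),
    }


plan = ttl_change_plan(datetime(2030, 1, 10, 12, 0), old_ttl=86400)
for step, when in plan.items():
    print(f"{step}: {when}")
```

With a 24-hour TTL and a change planned for noon on the 10th, the TTL has to come down by noon on the 9th, and the old value should be gone five minutes after the change.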

The "DNS Propagation" Confusion

Beginners and tutorials say "wait for DNS to propagate" as if there's a process pushing your record change to servers around the world. There isn't.

What actually happens:

1. You update the record at your authoritative server. This takes effect immediately. Your authoritative server now returns the new value to anyone who asks.

2. Recursive resolvers around the world have the OLD value cached (if anyone recently looked it up). They will return the old value until their cached entry's TTL expires.

3. As cached entries expire, those resolvers fetch the new value from your authoritative server.

So "propagation time" = "longest remaining TTL of any cache that has the old value". Worst case: you set TTL=86400 (24 hours), changed the record one second after a popular ISP cached the old value. That ISP will return the old value for almost 24 hours.

Tools like whatsmydns.net ask resolvers around the world simultaneously and show whose cache has updated. Useful for confirming "yes, the change is visible globally now".

DNS Over UDP (Mostly)

DNS queries are tiny (typically well under 512 bytes, the classic size limit for DNS over UDP). UDP fits perfectly: no handshake, one packet out, one packet back, a single round trip to the resolver when the answer is cached.

Why UDP works for DNS:

1. Queries are small.
2. If a UDP packet is lost, the resolver just retries.
3. No connection state to maintain across millions of queries per second.

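For larger responses (DNSSEC signatures, big TXT records), the server sets the truncation (TC) flag and the client retries over TCP. Zone transfers (AXFR) always use TCP. EDNS0 also lets a client advertise a larger UDP buffer (4096 bytes was a long-time default; many operators now use about 1232 bytes to avoid IP fragmentation) before falling back.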
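To see just how small a query is, the sketch below builds one by hand using the wire format from RFC 1035. The function names and the query ID are arbitrary choices for this example, and nothing is sent over the network.

```python
import struct


def build_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Encode a single-question DNS query (qtype 1 = A, class 1 = IN)."""
    # Header: ID, flags (0x0100 = recursion desired), QDCOUNT=1, then
    # ANCOUNT, NSCOUNT, ARCOUNT all zero.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def is_truncated(message: bytes) -> bool:
    """True if the TC flag is set, meaning the client should retry over TCP."""
    (flags,) = struct.unpack("!H", message[2:4])
    return bool(flags & 0x0200)


packet = build_query("www.example.com")
print(len(packet), "bytes")     # 33 bytes
print(is_truncated(packet))     # False
```

A 33-byte question fits in a single UDP datagram with room to spare. Responses are where size becomes a problem, which is why the TC flag and the TCP retry exist.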

Modern privacy variants:

DoT (DNS over TLS): DNS queries over TLS on port 853. Encrypted, but a network observer can still see you're doing DNS.
DoH (DNS over HTTPS): DNS queries inside HTTPS on port 443. Hard to distinguish from normal web traffic. Browsers (Firefox, Chrome) increasingly use DoH, in some cases by default, to defeat ISP-level DNS surveillance and hijacking.
DoQ (DNS over QUIC): newer. Combines TLS-level privacy with QUIC's low-latency connection setup.

The privacy upside: ISPs (and anyone on your network) can't read your DNS queries, though destination IPs and, usually, the TLS SNI field still reveal a lot about where you connect. The infrastructure downside: corporate networks lose visibility into DNS, which is sometimes useful for security and content filtering. Many enterprises block DoH externally and force their own internal resolvers.
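For a feel of what "DNS inside HTTPS" means, here is a sketch of the GET form described in RFC 8484: the ordinary binary query, base64url-encoded without padding, in a dns= parameter. The function name is invented, the endpoint shown is Cloudflare's public DoH URL used only as a default, and the code builds the URL without sending anything.

```python
import base64
import struct


def doh_get_url(name: str, endpoint: str = "https://cloudflare-dns.com/dns-query") -> str:
    """Build an RFC 8484 style GET URL for an A-record query."""
    # Same wire format as plain DNS. RFC 8484 suggests an ID of 0 so that
    # identical queries produce identical, cache-friendly URLs.
    header = struct.pack("!HHHHHH", 0, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    wire = header + qname + struct.pack("!HH", 1, 1)  # type A, class IN
    encoded = base64.urlsafe_b64encode(wire).rstrip(b"=").decode("ascii")
    return f"{endpoint}?dns={encoded}"


print(doh_get_url("www.example.com"))
# https://cloudflare-dns.com/dns-query?dns=AAABAAABAAAAAAAAA3d3dwdleGFtcGxlA2NvbQAAAQAB
```

A client would fetch that URL with an Accept: application/dns-message header and get the usual binary DNS response back in the HTTP body. To anyone watching the wire, it is one more HTTPS request.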

Anycast: How a Single IP Lives Everywhere

Public resolvers like 1.1.1.1 often answer within a few milliseconds from almost anywhere in the world. How?

Anycast. Multiple physical servers around the world all announce the same IP via BGP. When you send a packet to 1.1.1.1, the internet's routing protocols deliver it to whichever instance is "closest" (by BGP path, not necessarily geography). Each instance has its own cache.

Root servers and most large DNS providers use anycast. The "13 root servers" are 13 logical servers, each anycast to multiple sites, adding up to well over a thousand physical instances.

GeoDNS

The same name can return different IPs depending on who's asking. This is called GeoDNS.

Use case: serving users from the closest CDN edge. cdn.example.com resolves to a US east coast IP for users in New York and to a Frankfurt IP for users in Berlin.

How resolvers see you: by default, the authoritative server only sees the resolver's IP, not the end user's. So if a user in Brazil uses 8.8.8.8 (Google's resolver, which might be in the US), they get routed as if they're in the US.

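EDNS Client Subnet (ECS): resolvers can pass a truncated prefix of the user's IP (typically a /24 for IPv4) to the authoritative server. This lets GeoDNS make decisions based on the actual user, not the resolver. Support is uneven: Google Public DNS sends ECS, while Cloudflare's 1.1.1.1 deliberately does not, for privacy reasons.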
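A toy version of that decision, assuming an invented REGIONS table and documentation-range addresses in place of real edge IPs. Real GeoDNS uses large IP-geolocation databases rather than a two-entry list.

```python
import ipaddress

# Hypothetical mapping of client networks to the nearest edge address.
REGIONS = [
    (ipaddress.ip_network("198.51.100.0/24"), "192.0.2.10"),  # "New York" edge
    (ipaddress.ip_network("203.0.113.0/24"), "192.0.2.20"),   # "Frankfurt" edge
]
DEFAULT_EDGE = "192.0.2.30"


def client_subnet(client_ip: str, prefix_len: int = 24):
    """Truncate a client IP to the prefix a resolver would send via ECS."""
    return ipaddress.ip_network(f"{client_ip}/{prefix_len}", strict=False)


def geo_answer(source) -> str:
    """Pick an edge for the network the query appears to come from."""
    for network, edge in REGIONS:
        if source.subnet_of(network):
            return edge
    return DEFAULT_EDGE


# With ECS, the authoritative server sees the user's /24.
print(geo_answer(client_subnet("203.0.113.77")))        # 192.0.2.20
# Without ECS, it only sees the resolver's own address.
print(geo_answer(ipaddress.ip_network("8.8.8.8/32")))   # 192.0.2.30
```

The same user gets a different answer depending on whether the resolver forwards the subnet, which is the Brazil-via-8.8.8.8 problem in miniature.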

Round-Robin DNS

Multiple A records for the same name. example.com IN A 1.2.3.4 and example.com IN A 1.2.3.5. The authoritative server returns both, often in random or rotated order. Clients pick the first.

Crude load balancing. Problems:

If one IP is dead, half your users still get sent there until they fail and retry.
Caching means the same client keeps hitting the same IP for the TTL.
No real health checks.

Use it as a poor man's HA. Use a real load balancer (or health-checked DNS) for anything serious.
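A quick simulation of the rotation, assuming a server that shifts the record order by one on every query and clients that always take the first address. The function name is made up for this example.

```python
from collections import Counter
from itertools import cycle


def round_robin_answers(addresses: list[str], queries: int):
    """Yield the record list for each query, rotated by one each time."""
    rotation = cycle(range(len(addresses)))
    for _ in range(queries):
        start = next(rotation)
        yield addresses[start:] + addresses[:start]


addresses = ["1.2.3.4", "1.2.3.5"]
first_choice = Counter(
    answer[0] for answer in round_robin_answers(addresses, 1000)
)
print(first_choice)  # each address comes first in 500 of the 1000 answers
```

If 1.2.3.5 is dead, the server keeps handing it out anyway, so those 500 clients hit a dead address first. That is the "no real health checks" problem.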

Health-Checked DNS

The authoritative server runs health checks against your backends. Only IPs of currently-healthy backends are returned. Failed servers get pulled out of rotation automatically.

Route 53 calls this "DNS failover". NS1, Dyn, and Cloudflare have similar features.

Common use: multi-region failover. Primary region in US East, secondary in US West. Health check pings both. If primary fails, DNS returns secondary IPs. Users start hitting the secondary as their cached entries expire.

The TTL trap: failover speed is bounded by health-check detection time plus your TTL. If TTL is 300, expect up to 5 minutes of partial outage after the failure is detected before traffic fully shifts.
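Both halves of this section, the answer selection and the timing, fit in a few lines. This is a sketch with invented function names, documentation-range IPs, and illustrative check settings (a 30-second interval and 3 failures before a backend is marked down); providers differ in the details.

```python
def healthy_answers(primary: list, secondary: list, health: dict) -> list:
    """Return primary IPs that pass their health check, else healthy secondaries."""
    up = [ip for ip in primary if health.get(ip, False)]
    return up or [ip for ip in secondary if health.get(ip, False)]


def worst_case_failover_seconds(check_interval: int, failures_needed: int, ttl: int) -> int:
    """Detection time plus the time for the last cached answer to expire."""
    return check_interval * failures_needed + ttl


primary = ["192.0.2.10", "192.0.2.11"]   # hypothetical US East backends
secondary = ["198.51.100.10"]            # hypothetical US West backend
health = {"192.0.2.10": False, "192.0.2.11": False, "198.51.100.10": True}

print(healthy_answers(primary, secondary, health))   # ['198.51.100.10']
print(worst_case_failover_seconds(30, 3, 300))       # 390
```

With those numbers, a user who cached the primary's address just before the outage can be stuck for about six and a half minutes, even though DNS itself "failed over" after 90 seconds.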

DNSSEC

Plain DNS has no authentication. A man-in-the-middle (or a corrupt resolver) can return forged answers, redirecting your traffic to a phishing site. This is "DNS spoofing", or "cache poisoning" when the forged answer ends up stored in a resolver's cache.

DNSSEC adds cryptographic signatures. Each record set is signed with a private key, and the signature is published as an RRSIG record. The public key is published as a DNSKEY record. The parent zone publishes a hash of your DNSKEY as a DS record, signed with the parent's own key. Resolvers can verify the chain from root down to your record.
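The link between parent and child is just a hash, which can be sketched with the standard library. Per RFC 4034, the DS digest covers the owner name in wire format followed by the DNSKEY record data; digest type 2 is SHA-256. The key bytes below are made up, so the result is not a real DS record, and the sketch leaves out signature verification entirely.

```python
import hashlib
import struct


def wire_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed lowercase labels."""
    labels = name.rstrip(".").lower().split(".")
    return b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"


def dnskey_rdata(flags: int, protocol: int, algorithm: int, public_key: bytes) -> bytes:
    """DNSKEY record data: flags, protocol, algorithm, then the key itself."""
    return struct.pack("!HBB", flags, protocol, algorithm) + public_key


def ds_digest_sha256(owner: str, rdata: bytes) -> str:
    """Digest the parent publishes in the DS record (digest type 2)."""
    return hashlib.sha256(wire_name(owner) + rdata).hexdigest()


# Flags 257 = key-signing key, protocol 3, algorithm 13 = ECDSA P-256.
fake_key = bytes(range(64))  # placeholder bytes, NOT a real public key
rdata = dnskey_rdata(257, 3, 13, fake_key)
print(ds_digest_sha256("example.com", rdata))
```

This is why key rotation is fraught: if you replace the DNSKEY without updating the DS record at the parent, the hash no longer matches and validating resolvers reject the whole zone.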

Used everywhere? No. Adoption is uneven. Reasons:

Operationally complex. Key rotation is fraught. Misconfiguration breaks your domain entirely.
Many resolvers don't validate.
DoH/DoT achieve confidentiality + a different trust model (you trust the resolver, not the chain), which many find sufficient.

Enable it where you can, especially for high-stakes domains. But it isn't a silver bullet.

Why DNS Causes Outages

DNS is famously the source of strange production incidents. Categories:

1. Misconfiguration. Wrong A record. CNAME pointing to a deleted resource. Updated NS records that don't match your provider's actual servers. Your site disappears for everyone, anywhere from minutes (short TTL) to days (long TTL with bad cached value).

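2. Provider outages. October 21, 2016: Dyn was hit by a large DDoS attack and Twitter, Reddit, GitHub, Netflix, and Spotify became unreachable for many users. None of them had broken servers. They had Dyn as their DNS provider. November 18, 2025: a software bug took down much of Cloudflare's network for hours. Cloudflare's postmortem traced it to its proxy layer rather than to DNS itself, but the lesson is the same: when a shared provider fails, healthy sites behind it disappear.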

3. DDoS on your authoritative servers. If your nameservers go down, every cache eventually expires and your domain becomes unreachable.

4. Domain expiration. Forgot to renew. The domain stops resolving, and after the grace and redemption periods it goes back on the market. Possibly someone else buys it. Well-known companies have been caught out this way (Marketo, Foursquare, others).

5. Registrar account compromise. Attacker gets into your registrar account, changes NS records to point to their nameservers, can now MitM all your traffic. Defense: registrar lock, two-factor authentication, separate credentials for registry-locked domains.

6. TTL miscalculation. You changed a record but forgot you'd left the TTL at 86400. Your fix takes a day to fully roll out.

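7. Negative caching. Resolvers cache "this name doesn't exist" answers (NXDOMAIN) too, for a negative TTL taken from the zone's SOA record. If someone queried a name before you created it (or while it was misspelled in the zone), resolvers that cached the NXDOMAIN keep returning it until that entry expires.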
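The negative-caching case is worth a quick calculation because it surprises people. Per RFC 2308, the negative TTL is the smaller of the SOA record's own TTL and its MINIMUM field. The helper names below are invented, times are plain seconds, and some resolvers apply their own lower cap.

```python
def negative_ttl(soa_ttl: int, soa_minimum: int) -> int:
    """Seconds a resolver may cache an NXDOMAIN answer (RFC 2308 rule)."""
    return min(soa_ttl, soa_minimum)


def visible_after(nxdomain_cached_at: int, record_created_at: int,
                  soa_ttl: int, soa_minimum: int) -> int:
    """Earliest time a resolver holding the NXDOMAIN can see the new record."""
    expiry = nxdomain_cached_at + negative_ttl(soa_ttl, soa_minimum)
    return max(expiry, record_created_at)


# Someone queried the name at t=0; you created the record at t=60.
print(visible_after(0, 60, soa_ttl=3600, soa_minimum=900))  # 900
```

The record has existed since the one-minute mark, but that resolver keeps saying "no such name" until the fifteen-minute mark.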

Best Practices

Use multiple DNS providers. Don't have all four NS records at one provider. Split between two (Route 53 + NS1, for example). If one provider has an outage, the other answers. The 2016 Dyn outage made this advice mainstream.

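Lock your registrar account. Enable registrar lock (blocks transfers and, at most registrars, nameserver changes until you unlock it). For high-value domains, ask about registry lock, which requires out-of-band verification before anything changes. Enable 2FA. Use a strong, unique password. Treat the registrar account like the most important credential in your company; for many companies it is.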

Auto-renew domains. Forgetting to renew is the most preventable disaster. Auto-renew, with multiple expiry warnings, with backup payment methods.

Monitor your DNS. External monitoring service that resolves your records every few minutes from multiple regions and alerts on changes or failures. Also alert when records change unexpectedly.

Set sane TTLs. Default 1 to 24 hours for stable records. 60 to 300 seconds for actively-managed records (failover, deploys). Lower briefly before planned changes.

Enable DNSSEC where the upside justifies the operational complexity. High-value domains, financial services, government.

Have a runbook for DNS emergencies. "What do I do if our DNS provider is down?" "What do I do if our domain is hijacked?" Both are rare and high-stakes; you don't want to be googling at 3 AM.
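The monitoring practice above can start as something very small. This sketch compares what a name resolves to against what you expect; the function names are invented, and the resolver is passed in so the check can be exercised without a network. It uses the system resolver by default, so it sees cached answers from one vantage point only; a real monitor would query from several regions and ask the authoritative servers directly.

```python
import socket


def system_resolve(name: str) -> set:
    """Resolve a name to its IPv4 addresses using the OS resolver."""
    infos = socket.getaddrinfo(name, None, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return {info[4][0] for info in infos}


def check_records(expected: dict, resolve=system_resolve) -> list:
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    for name, want in expected.items():
        try:
            got = resolve(name)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            problems.append(f"{name}: lookup failed ({exc})")
            continue
        if got != want:
            problems.append(f"{name}: expected {sorted(want)}, got {sorted(got)}")
    return problems


# Offline demonstration with a stand-in resolver and made-up data.
fake_dns = {"www.example.com": {"192.0.2.10"}, "api.example.com": {"192.0.2.99"}}
expected = {"www.example.com": {"192.0.2.10"}, "api.example.com": {"192.0.2.20"}}
for problem in check_records(expected, resolve=lambda name: fake_dns[name]):
    print(problem)
```

Run something like this on a schedule and page on any non-empty result. An unexpected change is often the first visible sign of a hijack or a fat-fingered edit.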

Tools You'll Use

dig: the swiss army knife. dig example.com A, dig example.com MX, dig +trace example.com shows the full recursion.
nslookup: older, simpler, available on almost every system (including Windows).
host: simpler still. host example.com.
kdig: like dig but with DoH/DoT/DoQ support.
whatsmydns.net: web tool. Shows current A/AAAA/MX/etc. across resolvers worldwide.
dnsviz.net: visualizes the DNSSEC chain. Helps debug DNSSEC issues.
mxtoolbox: general-purpose web-based DNS troubleshooting.

Edge Cases and Gotchas

Apex CNAMEs are illegal. You cannot have example.com CNAME some-cdn.example.net. A CNAME cannot coexist with other records at the same name (RFC 1034), and the apex always carries SOA and NS records. Workarounds: ALIAS/ANAME records (provider-specific), flattening at the edge (Cloudflare), or using A records pointing directly at IPs.

Stale negative caches. Misspelled subdomains can be cached as NXDOMAIN. Your real fix waits for the negative TTL.

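Resolver behavior varies. Some resolvers don't honor TTLs exactly (they clamp them to their own minimum or maximum). Validating resolvers return SERVFAIL when DNSSEC validation fails, for example when a stale DS record is left at the parent, so a signing mistake makes the domain vanish only for users of those resolvers.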

Browser DNS prefetch. Browsers preemptively resolve names found in HTML. Sometimes unexpected requests hit your nameservers.

Local hosts files. /etc/hosts on Linux/macOS, C:\Windows\System32\drivers\etc\hosts on Windows. Overrides DNS. Useful for testing. Sometimes a forgotten entry there is the bug.

Search domains. Some networks configure search domains so foo is auto-completed to foo.corp.example.com. Surprising when it kicks in unexpectedly.

VPN DNS leaks. Connected to a VPN but DNS still uses the local resolver. Can be a privacy issue or a routing oddity.
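Of these gotchas, the hosts file is the easiest to check programmatically. The sketch below parses hosts-file text into a lookup table so a forgotten override stands out; the function name and sample contents are made up, and it treats the first entry for a name as the winner, which matches common resolver behavior but is worth confirming on your platform.

```python
def parse_hosts(text: str) -> dict:
    """Map each hostname in hosts-file text to its address."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            table.setdefault(name.lower(), address)  # first entry wins
    return table


SAMPLE = """
127.0.0.1   localhost
# staging override left over from a test
203.0.113.7 www.example.com api.example.com   # forgotten entry
"""

overrides = parse_hosts(SAMPLE)
print(overrides.get("www.example.com"))  # 203.0.113.7
```

To check a real machine, read /etc/hosts (or the Windows path above) and pass its contents to the same function. Any production hostname that shows up in the result deserves a second look.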

The One Thing to Remember

DNS looks simple from the outside ("turn name into IP") but is actually a globally distributed cache hierarchy with delegation, TTLs, multiple record types, and lots of failure modes. Most "is the site down?" questions are actually DNS questions in disguise. Master the hierarchy (root, TLD, authoritative), the resolution flow, the TTL knob, and the common record types, and you'll diagnose 80% of internet weirdness faster than the average engineer. Treat your DNS provider, registrar account, and TTL settings as production-critical infrastructure, because they are.