The Problem That Hides Real Complexity
Build a system that systematically visits web pages, extracts their content and links, and follows those links to discover more pages. At Google scale: hundreds of billions of pages, refreshed continuously, without overloading any single website, while respecting their crawl preferences.
The naive description fits in one paragraph. The actual implementation is one of the largest distributed systems anyone has ever built. The problems are not algorithmic; they are operational. Every assumption about the web is violated by some weird corner of it. Every politeness rule has exceptions. Every optimization has a failure mode that creates an outage somewhere.
This article walks through how to design a crawler that handles all of it: politeness, dedup, dynamic content, freshness, scale, and the inevitable bad actors.
Step 1: Requirements
Functional Requirements
Fetch pages, extract their content and outgoing links, and follow those links to discover new URLs. Respect robots.txt and per-site crawl preferences. Re-crawl pages to keep the corpus fresh. Hand fetched content to the downstream indexing pipeline.
Non-Functional Requirements
Scale: hundreds of billions of unique pages over time. Tens of billions of fresh fetches per day at maturity.
Politeness: never hit a single domain too fast. A typical default is one request every few seconds per domain.
Robustness: the web breaks in every conceivable way. The crawler must recover gracefully.
Cost-effective: at this scale, every wasted request adds up to real money.
Distributed: thousands of crawler nodes coordinating without serializing on any shared bottleneck.
Step 2: Capacity Estimation
A quick sanity check, assuming an average page of 100 KB: 200,000 fetches per second (about 17 billion per day) works out to roughly 20 GB/s of sustained bandwidth, and a corpus of 100 billion pages is 10 PB of raw HTML, on the order of 1 PB compressed. Nodes number in the thousands. This is industrial-scale infrastructure.
Step 3: The Core Loop
Every web crawler implements the same conceptual loop:
1. Pop a URL from the frontier (queue of URLs to visit).
2. Check it is allowed by robots.txt and not in the dedup set.
3. Fetch the page.
4. Parse HTML, extract content and outgoing links.
5. Save the content for indexing.
6. Add discovered links to the frontier.
7. Repeat.
Now make this work for billions of URLs across thousands of machines without crashing the websites being crawled. The simplicity disappears immediately.
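Here is the loop as a minimal single-machine sketch: an in-memory frontier and dedup set, no politeness, no robots.txt, no persistence, and a hypothetical `save_content` placeholder standing in for the content store. Everything the rest of this article covers is exactly what this version lacks.

```python
# Minimal single-machine sketch of the core loop. Illustrative only.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def save_content(url, html):
    pass  # hypothetical placeholder: a real system writes to the content store

def crawl(seed_url, max_pages=100):
    frontier = deque([seed_url])              # step 1: queue of URLs to visit
    seen = {seed_url}                         # step 2: naive in-memory dedup
    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                          # step 3: fetch; skip failures
        parser = LinkExtractor()
        parser.feed(html)                     # step 4: parse, extract links
        save_content(url, html)               # step 5: hand off for indexing
        for link in parser.links:             # step 6: enqueue new URLs
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
```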
Step 4: The URL Frontier
The frontier is the queue of URLs waiting to be visited. It is the most important data structure in the system. It must support:
Politeness: never pop two URLs from the same domain in rapid succession. There must be at least N seconds between requests to example.com.
Priority: important sites should be crawled more often than obscure ones.
Freshness: recently-changing pages should be re-visited sooner.
Distribution: work spread across thousands of crawler workers.
Persistence: if a node fails, the queue survives.
Frontier Implementation
A common design: per-domain queues. The frontier is a collection of queues, one per active domain, each with a "next allowed crawl time" timestamp.
Workers don't pop a single global queue. Instead, they pop from a "ready" set: domains whose next-allowed-time has passed. Then within that domain's queue, they pop the highest-priority URL.
This naturally enforces politeness: each domain is rate-limited to its own queue's release rate.
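A sketch of one frontier shard under those assumptions. The class shape, the 5-second default delay, and the contract that callers invoke `mark_fetched` after every fetch are all illustrative, not a reference implementation:

```python
# Sketch of a single frontier shard: per-domain priority queues plus a
# "ready" heap keyed by each domain's next-allowed-crawl time.
import heapq
import time
from collections import defaultdict

class Frontier:
    def __init__(self, default_delay=5.0):
        self.default_delay = default_delay
        self.queues = defaultdict(list)  # domain -> heap of (priority, url)
        self.ready = []                  # heap of (next_allowed_time, domain)
        self.scheduled = set()           # domains present in the ready heap

    def push(self, domain, url, priority):
        heapq.heappush(self.queues[domain], (priority, url))
        if domain not in self.scheduled:
            heapq.heappush(self.ready, (time.time(), domain))
            self.scheduled.add(domain)

    def pop(self):
        """Return (domain, url) for a domain whose delay has elapsed, else None."""
        while self.ready:
            next_allowed, domain = self.ready[0]
            if next_allowed > time.time():
                return None              # earliest domain not ready; caller waits
            heapq.heappop(self.ready)
            if self.queues[domain]:      # lowest number = highest priority
                return domain, heapq.heappop(self.queues[domain])[1]
            self.scheduled.discard(domain)  # domain drained; drop it
        return None

    def mark_fetched(self, domain, delay=None):
        """Reschedule a domain after a fetch; callers must always call this."""
        heapq.heappush(self.ready,
                       (time.time() + (delay or self.default_delay), domain))
```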
Priority Tiers
Within each domain, URLs have priority. Tiers like:
Tier 0: Critical. News homepages, major sites' index pages. Re-crawled every few minutes.
Tier 1: Important. Active blogs, popular content. Daily.
Tier 2: Standard. Most of the web. Weekly to monthly.
Tier 3: Cold. Archived or rarely-updated pages. Yearly.
Priority is computed from signals like inbound link count, recent change frequency, social engagement (for news), and explicit metadata (sitemap declarations of update frequency).
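One way to sketch this is a weighted score over those signals, bucketed into tiers. The signal names, weights, and cutoffs below are entirely hypothetical:

```python
# Hypothetical priority scoring: a weighted sum of crawl signals,
# bucketed into the tiers above. Weights and thresholds are invented.
import math

def priority_tier(inbound_links, changes_per_week, in_sitemap_daily):
    score = (
        2.0 * math.log10(1 + inbound_links)      # popularity
        + 1.5 * math.log2(1 + changes_per_week)  # observed churn
        + (1.0 if in_sitemap_daily else 0.0)     # site's own declaration
    )
    if score > 8: return 0   # critical: minutes
    if score > 5: return 1   # important: daily
    if score > 2: return 2   # standard: weekly/monthly
    return 3                 # cold: yearly
```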
Step 5: Politeness Rules
The most fundamental constraint. Get this wrong and you get blocked everywhere.
Robots.txt
Every site can publish a robots.txt file at its root (for example, example.com/robots.txt) declaring which paths may be crawled by which user agents. Honor it strictly: disallowed paths must never be fetched.
Robots.txt files are themselves cached (with their own TTL, usually a few hours). The crawler fetches and parses each site's robots.txt before crawling any of its pages.
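Python's standard library ships a robots.txt parser, so the check itself is short. A production crawler would cache the parsed file per domain with a TTL rather than re-fetching it on every check, as described above; the "ExampleBot" user agent is a placeholder:

```python
# Checking robots.txt with the standard library's parser.
from urllib.robotparser import RobotFileParser

def allowed(url, domain, user_agent="ExampleBot"):
    rp = RobotFileParser(f"https://{domain}/robots.txt")
    rp.read()                            # fetch and parse robots.txt
    delay = rp.crawl_delay(user_agent)   # honor Crawl-delay if declared
    return rp.can_fetch(user_agent, url), delay
```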
Per-Domain Rate Limiting
Default: at most one request every few seconds per domain. The exact value varies. Robots.txt may specify a Crawl-delay directive. Larger sites might tolerate (or even welcome) faster crawling.
The frontier enforces this by setting the next-allowed-time after every fetch.
User-Agent Identification
The crawler sends a unique User-Agent header so site owners can identify it. Major crawlers (Googlebot, Bingbot) have published user agents and IP ranges so sites can verify and block if needed.
Respecting HTTP Signals
If a site returns 429 (Too Many Requests) or 503 (Service Unavailable), back off aggressively. If it returns 5xx errors persistently, drop the domain temporarily.
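A sketch of that backoff, building on the hypothetical Frontier from Step 4. The doubling factor, the 300-second cap, and the one-hour park are illustrative defaults:

```python
# Backoff on throttling signals, reusing the Frontier sketch's
# mark_fetched(domain, delay) hook. All constants are assumptions.
def handle_status(frontier, domain, status, current_delay):
    if status in (429, 503):
        # Server is explicitly telling us to slow down: double the delay.
        frontier.mark_fetched(domain, delay=min(current_delay * 2, 300))
    elif 500 <= status < 600:
        # Persistent server errors: park the domain for an hour.
        frontier.mark_fetched(domain, delay=3600)
    else:
        frontier.mark_fetched(domain)  # normal politeness delay
```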
Why This Matters
Bad crawlers get blocked, IP-banned, sued, or all three. Good crawlers (Googlebot is the gold standard) follow conventions strictly because their long-term ability to crawl depends on site owners trusting them.
Step 6: Deduplication
The same URL appears in many places. The same content appears under different URLs. Without deduplication, you waste bandwidth and storage.
URL Deduplication
Before adding a URL to the frontier, check whether it has been seen before. With billions of URLs, a hash map doesn't fit in memory. Use a Bloom filter: a probabilistic data structure with a small false-positive rate (occasionally a genuinely new URL looks already-seen and is skipped) and no false negatives (a URL that was seen is never reported as new).
At a 1% false-positive rate a Bloom filter needs about 9.6 bits per element, so 100 billion URLs fit in roughly 120 GB. At that size the filter is distributed, one Bloom filter per frontier shard.
URL canonicalization first: http://example.com, http://example.com/, and https://example.com/?foo=1#anchor might all be the same page. Normalize URLs before hashing.
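A minimal canonicalizer using the standard library, assuming a few site-independent rules: lowercase scheme and host, drop the fragment, drop ports 80/443, sort query parameters, and normalize an empty path to "/". Real crawlers layer per-site rules on top, such as stripping known tracking parameters:

```python
# Minimal URL canonicalizer; the normalization rules are assumptions.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonicalize(url):
    parts = urlsplit(url)
    host = parts.hostname or ""              # hostname is lowercased for us
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"        # keep only non-default ports
    query = urlencode(sorted(parse_qsl(parts.query)))  # stable param order
    path = parts.path or "/"                 # "" and "/" are the same page
    return urlunsplit((parts.scheme.lower(), host, path, query, ""))
```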
Content Deduplication
Different URLs sometimes serve the same content (mirror sites, syndicated articles, shopping sites with multiple categories pointing to the same product). After fetching, hash the content. If the hash matches something seen before, skip indexing.
Storage: a content-hash store, also Bloom-filtered for the hot path.
Near-Duplicate Detection
Pages can be near-duplicates: same article with different ad placements, slightly different boilerplate. Locality-sensitive hashing (LSH) techniques like SimHash can detect these. Used in indexing, not the crawl loop.
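For reference, SimHash itself is compact. This bare-bones 64-bit version assumes simple whitespace tokenization; near-duplicates land within a small Hamming distance of each other (a commonly cited threshold is 3 bits for 64-bit fingerprints):

```python
# Bare-bones 64-bit SimHash sketch. Tokenization is a naive assumption.
import hashlib

def simhash(text, bits=64):
    vector = [0] * bits
    for token in text.split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if vector[i] > 0:                # majority vote per bit position
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    return bin(a ^ b).count("1")         # small distance = near-duplicate
```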
Step 7: Full Architecture
The main components: a frontier service (per-domain queues with politeness enforcement), a fleet of crawler workers (thousands of nodes), and a content store (raw HTML in object storage).
Worker Lifecycle
A typical crawler worker:
1. Pop a URL from the frontier.
2. Look up the domain's robots.txt (cache).
3. Check robots.txt allows this URL.
4. DNS lookup (cache).
5. HTTP fetch with timeout.
6. Parse response. Status check.
7. Hash content. Check content Bloom filter.
8. Save HTML to content store.
9. Extract links. Push new ones to frontier (after dedup check).
10. Update domain's next-allowed-time.
11. Loop.
Step 8: The DNS Bottleneck
Every fetch requires a DNS lookup. With 200,000 fetches per second, that is 200,000 DNS queries per second. Public DNS servers will rate-limit you.
Solutions:
Aggressive DNS caching: per-worker and shared cluster cache, with TTLs from response.
Distributed resolvers: spread queries across many resolver IPs.
Self-hosted resolver: at very high scale, run your own DNS resolver to avoid public DNS rate limits entirely.
Pre-resolution: when adding a URL to the frontier, resolve its domain in advance.
Google's crawler resolves through its own DNS infrastructure, hitting authoritative servers directly. Smaller crawlers use public resolvers with caching.
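A sketch of the per-worker cache layer, using the third-party dnspython package (an assumption; any resolver library that exposes record TTLs works). A shared Redis tier would sit behind it:

```python
# TTL-respecting per-worker DNS cache sketch.
import time
import dns.resolver  # pip install dnspython

_cache = {}  # domain -> (ip, expires_at)

def resolve(domain):
    entry = _cache.get(domain)
    if entry and entry[1] > time.time():
        return entry[0]                   # cache hit, TTL still valid
    answer = dns.resolver.resolve(domain, "A")
    ip = answer[0].address
    _cache[domain] = (ip, time.time() + answer.rrset.ttl)  # honor the TTL
    return ip
```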
Step 9: Handling the Dynamic Web
Modern web pages are JavaScript-heavy. The HTML returned by a fetch might be a near-empty shell; the actual content appears only after JavaScript runs and modifies the DOM. Twitter, Reddit, and many React/Vue/Angular sites work this way.
HTML-Only Crawling
Fast and cheap. Suitable for the long tail of static and server-rendered content. Misses JS-rendered text entirely.
Headless Browser Crawling
Use a real browser (Chromium via Puppeteer or Playwright) to render the page. Wait for the DOM to settle. Extract content from the rendered DOM.
Pros: captures all content, including JS-rendered.
Cons: 10-100x slower per page. 10-100x more expensive in CPU and memory. Browser fingerprinting issues; some sites block headless browsers.
The Hybrid Approach
Production crawlers do both. A first pass uses cheap HTML fetching. If that returns a thin page (low text, lots of JS), the URL is queued for the headless rendering pipeline. The expensive pipeline is reserved for important sites or sites known to require JS.
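The routing decision needs a thin-page heuristic. A crude sketch follows; the word threshold and script count are invented, and production systems use much richer signals:

```python
# Illustrative thin-page heuristic for routing between pipelines.
import re

def needs_rendering(html):
    # Strip scripts, styles, and tags; measure what text remains.
    text = re.sub(r"(?s)<(script|style).*?</\1>", "", html)
    text = re.sub(r"<[^>]+>", "", text)
    visible_words = len(text.split())
    script_tags = html.lower().count("<script")
    return visible_words < 100 and script_tags > 5  # thin text, heavy JS
```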
Pre-Rendering
Some sites publish a server-side rendered version specifically for crawlers. They detect Googlebot and serve full HTML. This is "dynamic rendering" or "prerendering."
Step 10: Refresh Policy
The web changes constantly. Pages need to be re-crawled. But not all pages change at the same rate.
Adaptive Re-Crawl Frequency
Different sites change at different rates:
News homepages: re-crawl every few minutes.
Active blogs: daily.
Most pages: weekly to monthly.
Static pages: rarely.
Long-tail pages: yearly or never.
The system tracks observed change frequency per page (or per domain). Each re-crawl compares the new content hash to the previous one. If they match (no change), reduce frequency. If they differ, maintain or increase frequency.
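A common shape for this policy is multiplicative: shrink the interval when content changed, grow it when it did not. The bounds and factors here are illustrative assumptions:

```python
# Adapt-on-change re-crawl interval sketch. Constants are assumptions.
MIN_INTERVAL = 300           # 5 minutes
MAX_INTERVAL = 365 * 86400   # 1 year

def next_interval(current, old_hash, new_hash):
    if new_hash != old_hash:
        return max(MIN_INTERVAL, current / 2)  # changed: come back sooner
    return min(MAX_INTERVAL, current * 1.5)    # unchanged: back off
```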
Sitemap declarations help: many sites publish sitemaps with `lastmod` and `changefreq` metadata. Use them as hints, not ground truth.
Detecting Change Without Re-Crawling
Some signals can hint at change without a full re-crawl:
An HTTP HEAD request (or a conditional GET with If-None-Match / If-Modified-Since) returns the ETag and Last-Modified validators. If they are unchanged, skip the body.
RSS/Atom feeds tell you when new content appears.
Sitemap pings notify search engines of updates.
Use these to prioritize the frontier without paying full crawl cost.
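A conditional-GET sketch using the standard library: send the stored validators back, and a 304 response means the body was never transferred.

```python
# Conditional GET: re-fetch only if the server says the page changed.
import urllib.request
from urllib.error import HTTPError

def fetch_if_changed(url, etag=None, last_modified=None):
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.read()    # changed (or no validators): full body
    except HTTPError as e:
        if e.code == 304:
            return None           # unchanged: nothing to re-index
        raise
```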
Step 11: Storage
Crawl data is enormous. Petabytes of raw HTML. Stored efficiently:
Raw HTML store: distributed file system (HDFS) or object storage (S3, GCS). Compressed (gzip or zstd) for ~10x size reduction.
URL metadata store: last fetch time, fetch result, content hash, priority, error count. Sharded SQL or wide-column (Bigtable, Cassandra). Used by the frontier and refresh planner.
Bloom filters: in-memory in the frontier service, persisted periodically. Sharded by URL prefix.
Robots.txt cache: Redis. Per-domain, with TTL.
DNS cache: in-memory in workers, plus a shared Redis cache.
The raw HTML feeds the indexing pipeline downstream. The crawler itself usually doesn't query its own storage; it just produces the data.
Step 12: Edge Cases and Operational Concerns
Spider Traps
Some sites accidentally or maliciously generate infinite URL spaces: a calendar whose "next month" link goes on forever, or a site whose filter parameters (sort order, facets, pagination) combine without bound. Without protection, a crawler can sink unbounded effort into a single site.
Defenses: per-domain crawl quotas (no domain gets more than X URLs); URL canonicalization to dedup query parameter combinations; depth limits per site; pattern-detection for parameter explosion.
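A sketch of the frontier-side checks; the quota, depth cap, and parameter limit are illustrative values that real systems tune per domain:

```python
# Spider-trap admission checks at frontier-push time. All limits are
# illustrative assumptions.
from collections import Counter

MAX_URLS_PER_DOMAIN = 1_000_000
MAX_DEPTH = 15
MAX_QUERY_PARAMS = 4

domain_counts = Counter()

def admit(domain, url, depth):
    if domain_counts[domain] >= MAX_URLS_PER_DOMAIN:
        return False                     # per-domain crawl quota exhausted
    if depth > MAX_DEPTH:
        return False                     # too deep: likely a trap
    if url.count("=") > MAX_QUERY_PARAMS:
        return False                     # crude parameter-explosion guard
    domain_counts[domain] += 1
    return True
```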
Slow Servers
Some servers respond slowly. A worker waiting 30 seconds for one page is wasting capacity. Aggressive timeouts (5-10 seconds for HTML, more for headless render) keep workers productive.
Server-Side Anti-Bot
Cloudflare, Akamai, and CAPTCHAs deliberately block automated traffic. Big crawlers negotiate access (Googlebot is whitelisted). Smaller crawlers must respect the block.
Politeness Variations
One request per second is fine for big sites. For small WordPress blogs on shared hosting, even one per second can be too aggressive. Adaptive politeness: monitor response times and slow down if the server is struggling.
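One way to sketch this: track an exponentially weighted average of response times per domain and stretch the delay when it climbs. The EWMA weight, the 2-second threshold, and the multiplier are assumptions:

```python
# Adaptive politeness sketch: slow down when the server slows down.
class AdaptiveDelay:
    def __init__(self, base_delay=5.0):
        self.base_delay = base_delay
        self.avg_ms = None     # exponentially weighted response time

    def observe(self, response_ms):
        if self.avg_ms is None:
            self.avg_ms = response_ms
        else:
            self.avg_ms = 0.8 * self.avg_ms + 0.2 * response_ms

    def delay(self):
        if self.avg_ms and self.avg_ms > 2000:   # server seems to struggle
            return self.base_delay * 3
        return self.base_delay
```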
Distributed Coordination
Thousands of crawler workers must share the frontier and dedup state without serializing. Sharding the frontier by domain (one worker handles all of a domain at a time) avoids most coordination. Bloom filters allow distributed dedup with eventual consistency.
Geographic Diversity
Some content is geo-fenced. To see what users in Japan see, the crawler may need to fetch from a Japanese IP. Crawlers operate from datacenters in multiple regions.
Politeness Failures = Public Incidents
If a crawler accidentally hits a small site with thousands of requests per second, it could DDoS them. Site operators complain on social media. The reputation hit is real. Mature crawlers have multiple safety nets and conservative defaults.
Crawl Budget Allocation
You can crawl X pages per day. How do you allocate? More on important domains, less on long tail. Per-domain priority is computed from PageRank-like signals, recent traffic, and update frequency.
Step 13: Recap of Key Decisions
Per-domain politeness queues. Politeness enforced by the queue structure itself.
Bloom filters for URL and content dedup. Probabilistic but extremely memory-efficient.
Aggressive DNS caching, sometimes self-hosted resolver. DNS would otherwise become a bottleneck.
Adaptive re-crawl frequency per page. Pages that change often get re-crawled often.
Hybrid HTML + headless rendering. Most pages get cheap HTML; important or JS-heavy pages get rendered.
Robots.txt strictly honored. Long-term ability to crawl depends on respecting it.
Spider trap defenses. Per-domain quotas, URL canonicalization, parameter dedup.
Adaptive rate limiting. Slow down if the server seems to struggle.
The One Thing to Remember
A web crawler is mostly about politeness and dedup, not about fetching speed. The actual fetch-and-parse logic is straightforward; you can write a single-machine crawler in an afternoon. The hard parts are operational: not crashing the internet, not crawling the same URL twice, visiting important pages often without ignoring the long tail, surviving spider traps, and respecting the social contract that lets you keep crawling. Decisions like crawl rate, refresh policy, JavaScript rendering, and politeness defaults define how good your search index will be. The crawler is the front door to everything downstream (indexing, ranking, search). Get it wrong and the whole search product suffers.