The Problem At Civilization Scale
500 hours of video uploaded every minute. Billions of users watching, each getting smooth playback regardless of their device or network. Storage measured in exabytes. Bandwidth that could fill ocean cables many times over. Most watch time driven by recommendations, not search. Live streaming, content moderation, copyright, monetization all layered on top.
YouTube is not just a video site. It is an industrial-scale content delivery network with a sophisticated upload pipeline, recommendation system, and trust-and-safety apparatus on top. The scale of engineering required is nearly unique in the consumer internet.
This article walks through how to build it.
Step 1: Requirements
Functional Requirements
Upload videos (multi-gigabyte files, resumable). Watch them with smooth playback at multiple qualities on any device. Search and browse. Personalized recommendations. Likes, comments, subscriptions. Live streaming. Creator analytics and monetization.
Non-Functional Requirements
Latency: first-frame playback within 1-2 seconds. Adaptive bitrate prevents buffering thereafter.
Availability: 99.99%. YouTube being down is news.
Scale: billions of users, hundreds of hours uploaded per minute, exabytes total storage.
Global reach: users on every continent. Latency and quality must be acceptable everywhere.
Content quality: no buffering on a normal connection. 4K playback supported.
Cost: bandwidth dominates. Storage is large but bandwidth is the line item that matters.
Step 2: Capacity Estimation
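A quick back-of-envelope makes the scale concrete. The watch-time, bitrate, and per-hour storage figures below are assumptions for illustration, not published YouTube numbers; the point is the order of magnitude.

```python
# Back-of-envelope scale estimate. Every input here is an assumption, not a YouTube figure.
WATCH_HOURS_PER_DAY = 2e9        # assumed global daily watch time, in hours
AVG_BITRATE_BPS     = 5e6        # assumed average delivered bitrate (mixed-resolution blend)

viewer_seconds = WATCH_HOURS_PER_DAY * 3600              # 7.2e12 seconds of playback per day
egress_per_day = viewer_seconds * AVG_BITRATE_BPS / 8    # ~4.5e18 bytes ≈ 4.5 EB delivered per day
egress_rate    = egress_per_day / 86_400                 # ~5e13 B/s, i.e. tens of TB per second

UPLOAD_HOURS_PER_DAY = 500 * 60 * 24                     # 500 hours/minute -> 720,000 hours/day
STORED_GB_PER_HOUR   = 3 * 7                             # assumed: ~3 GB original, derivatives ~5-10x
new_storage_per_day  = UPLOAD_HOURS_PER_DAY * STORED_GB_PER_HOUR * 1e9   # ~1.5e16 B ≈ 15 PB/day

print(f"egress ≈ {egress_rate / 1e12:.0f} TB/s sustained, "
      f"new storage ≈ {new_storage_per_day / 1e15:.0f} PB/day (exabytes per year)")
```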
The egress number explains why YouTube is built around a CDN. 60 TB/sec sustained globally cannot come from origin; it must come from edge caches close to users. Most engineering effort goes into making this work cheaply.
Step 3: The Two Pipelines
YouTube splits cleanly into two pipelines with very different characteristics:
Upload + Processing: rare events (each upload happens once). High CPU per event (transcoding). Asynchronous.
Playback: common events (billions per day). Low CPU per event (just serve bytes). Synchronous and latency-critical.
These pipelines need totally different infrastructure. Upload uses a transcoding farm. Playback uses a CDN. Storage is the bridge.
Step 4: The Upload Pipeline
Upload flow: the client pushes the file in chunks, the raw video is stored as the source of truth, and the transcoding farm fans it out into multiple formats for delivery.
Resumable Uploads
Multi-GB videos take time. Networks fail. The upload protocol must support resume: client uploads in chunks, server tracks which chunks succeeded, on retry only re-sends missing chunks.
Standard protocols exist for this, such as tus (tus.io) and Google-style resumable uploads; most modern upload clients and SDKs implement one.
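A minimal client-side sketch of the resume loop, loosely in the spirit of those protocols. The endpoint, the Content-Range/Range header scheme, and the probe request are illustrative assumptions, not a specific vendor API.

```python
import os
import requests  # generic HTTP client; the server behavior assumed here is illustrative

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MB per chunk

def committed_offset(upload_url: str, total: int) -> int:
    """Ask the server how many bytes it has durably stored (empty probe request)."""
    resp = requests.put(upload_url, headers={"Content-Range": f"bytes */{total}"})
    rng = resp.headers.get("Range")               # e.g. "bytes=0-8388607"
    return int(rng.rsplit("-", 1)[1]) + 1 if rng else 0

def resumable_upload(path: str, upload_url: str) -> None:
    total = os.path.getsize(path)
    offset = committed_offset(upload_url, total)  # resume where the server left off
    with open(path, "rb") as f:
        while offset < total:
            f.seek(offset)
            chunk = f.read(CHUNK_SIZE)
            end = offset + len(chunk) - 1
            try:
                requests.put(upload_url, data=chunk,
                             headers={"Content-Range": f"bytes {offset}-{end}/{total}"},
                             ).raise_for_status()
                offset = end + 1                  # chunk accepted; advance
            except requests.RequestException:
                offset = committed_offset(upload_url, total)  # network blip: re-sync and retry
```

Only the missing byte range is ever re-sent, so a failed multi-GB upload costs at most one chunk of wasted work.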
Original Storage
The raw upload is stored as-is in object storage. This is the source of truth for all derivatives.
Transcoding Queue
After upload, the system enqueues a "transcode this video" job in a queue (Kafka or equivalent). Workers consume the queue.
Each video produces many derivative jobs:
- One per resolution (240p, 360p, 480p, 720p, 1080p, 1440p, 2160p/4K).
- One per codec (H.264 for compatibility, VP9 for efficiency on YouTube, AV1 for emerging support).
- One per bitrate per resolution (multiple bitrate variants for adaptive streaming).
- Audio tracks separately.
- Thumbnails (many candidates).
- Optional: captions via speech recognition.
A 10-minute video might generate 50+ derivative files totaling 5-10x the size of the original.
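A sketch of that fan-out, assuming a Kafka-style producer and an illustrative bitrate ladder (the resolutions, codecs, and bitrates here are placeholders, not YouTube's actual ladder):

```python
import itertools, json, uuid

RESOLUTIONS = ["240p", "360p", "480p", "720p", "1080p", "1440p", "2160p"]
CODECS      = ["h264", "vp9", "av1"]
BITRATES_MBPS = {"240p": [0.3], "360p": [0.5], "480p": [1.0], "720p": [2.5, 4.0],
                 "1080p": [5.0, 8.0], "1440p": [10.0], "2160p": [20.0, 35.0]}

def enqueue_transcode_jobs(producer, video_id: str, source_uri: str) -> int:
    """Fan one upload out into independent derivative jobs that workers consume in parallel."""
    jobs = 0
    for res, codec in itertools.product(RESOLUTIONS, CODECS):
        for mbps in BITRATES_MBPS[res]:
            job = {"job_id": str(uuid.uuid4()), "video_id": video_id,
                   "source": source_uri,            # the stored original is the input every time
                   "resolution": res, "codec": codec, "bitrate_mbps": mbps}
            producer.send("transcode-jobs", json.dumps(job).encode())
            jobs += 1
    # Audio renditions, thumbnails, and caption extraction would be enqueued as separate job types.
    return jobs
```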
The Transcoding Farm
Transcoding is CPU-intensive. Software transcoding can take many minutes for a 10-minute video. Hardware acceleration (specialized ASICs, GPUs with video encoding hardware) speeds this up by 10-100x.
YouTube uses massive farms of transcoder workers. Some are dedicated to specific codecs or resolutions because the optimization characteristics differ.
Side Pipelines: Thumbnails, Captions, Moderation, Copyright
Thumbnails: automatically extract candidate frames. Run them through a thumbnail-quality classifier. Creator can also upload custom.
Captions: automatic speech recognition produces timed text. Translation to many languages. Quality is moderate; creators can edit.
Content moderation: ML classifiers run on every uploaded video. Detect violations: nudity, violence, hate speech, misinformation. Borderline cases queued for human review. Clear violations blocked or removed.
Content ID (copyright): compare against a database of copyrighted material (audio fingerprints, video fingerprints). Owners can choose to block, monetize, or track matches. This is the system that lets music labels claim revenue from videos using their songs.
All these run in parallel after the original is stored. The video doesn't go live until they pass.
Step 5: Adaptive Streaming (HLS / DASH)
The video player doesn't download one giant file. It downloads small segments (2-10 seconds each), choosing the resolution that matches current network speed.
How Adaptive Streaming Works
For each video, segments exist at multiple resolutions and bitrates. A "manifest" file lists all available variants and segment URLs.
The player:
1. Downloads the manifest. Sees the available qualities.
2. Picks an initial quality based on network estimate.
3. Downloads the first segment. Times the download.
4. Adapts: if download was fast, try higher quality next; if slow, lower quality.
5. Repeats per segment.
Network drops? Next segment is fetched at lower quality. Network improves? Quality climbs back. The video keeps playing without buffering. This is the core insight that makes streaming feel smooth.
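A toy version of that loop. Real players use more sophisticated buffer-aware algorithms; this sketch only shows the measure-then-adapt shape, with an assumed bitrate ladder and hypothetical segment URLs.

```python
import time
import requests  # stand-in HTTP client; segment URLs come from the manifest

LADDER = [("240p", 3e5), ("360p", 7e5), ("480p", 1.2e6), ("720p", 2.5e6), ("1080p", 5e6)]  # bps

def pick_quality(throughput_bps: float, safety: float = 0.8) -> int:
    """Highest rung whose bitrate fits inside a safety margin of measured throughput."""
    best = 0
    for i, (_, bps) in enumerate(LADDER):
        if bps <= throughput_bps * safety:
            best = i
    return best

def play(segments: dict[str, list[str]]) -> None:
    """segments maps quality label -> ordered segment URLs (same count per quality)."""
    quality, throughput = 0, LADDER[0][1]            # start conservative
    for i in range(len(segments[LADDER[0][0]])):
        url = segments[LADDER[quality][0]][i]
        start = time.monotonic()
        data = requests.get(url).content             # one 2-10 second segment
        elapsed = max(time.monotonic() - start, 1e-3)
        throughput = 0.7 * throughput + 0.3 * (len(data) * 8 / elapsed)  # smoothed estimate
        quality = pick_quality(throughput)           # adapt before fetching the next segment
        # ...hand `data` to the decoder / playback buffer here...
```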
HLS vs DASH
Two competing standards.
HLS (HTTP Live Streaming): Apple's standard. M3U8 manifest format. Default on iOS.
DASH (Dynamic Adaptive Streaming over HTTP): open standard. MPD manifest. Default on most non-Apple browsers.
Both are essentially the same idea with different file formats. Most video services support both, choosing based on the client.
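For concreteness, an illustrative HLS master playlist (paths and bandwidth values are made up). The player downloads this first, then fetches segments from whichever variant playlist it selects:

```
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=700000,RESOLUTION=640x360
360p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720
720p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
1080p/index.m3u8
```

DASH's MPD manifest carries the same information as XML.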
Segment Storage
Each segment is a small file in object storage. Per video, hundreds of files (across resolutions, bitrates, segments). Per segment, just a few MB.
Segment files are static. They never change after creation. This makes them perfectly cacheable at the CDN.
Step 6: The CDN — Where 80% of Engineering Goes
Most of YouTube's engineering effort goes into the CDN. Why: bandwidth is the largest cost, and serving video from the user's nearest edge is the only way to keep streaming smooth globally.
Google built their own CDN (the Edge Network). It is one of the largest private networks on Earth, with thousands of points of presence worldwide. Smaller services use Cloudflare, Akamai, Fastly.
How a Playback Request Flows
1. User clicks a video. Player loads the manifest from a CDN edge.
2. Manifest contains segment URLs. Player requests the first segment.
3. CDN edge checks its cache. Hit? Serve immediately.
4. Cache miss? Edge fetches from regional cache. Regional cache fetches from origin. Eventually serves the user, populates the cache for next time.
5. Subsequent users requesting the same segment get a cache hit at the edge and are served within a few milliseconds.
Hot videos (trending, viral) replicated to many edges. Cold videos (old, rarely watched) might not be cached anywhere; first request takes a few hundred ms.
Predictive Caching
The trick: predict which videos will be popular and pre-warm the cache.
ML models predict views per region per hour. Popular videos get pushed to edge nodes before users request them. Concert clips post-event, news footage during breaking events, etc.
For routine videos, the first cache miss in a region populates the cache for everyone else who asks afterward. The first viewer pays a small latency cost; the millions after pay none.
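A sketch of the pre-warm loop under those assumptions. The predictor, edge API, and threshold are all hypothetical:

```python
PREWARM_THRESHOLD = 50_000   # assumed cut-off: predicted views per region in the next hour

def prewarm_cycle(predictor, edges, catalog) -> None:
    for video in catalog.recently_published():
        for region in edges.regions():
            if predictor.predicted_views(video.id, region, horizon_hours=1) > PREWARM_THRESHOLD:
                # Push the commonly requested renditions before the first viewer asks.
                edges.push(region, video.segment_urls(resolutions=["480p", "720p", "1080p"]))
```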
Cache Hierarchies
Modern CDNs are tiered:
Edge POP: closest to user. Smaller cache.
Regional cache: larger, fewer locations.
Origin shield: a single caching layer directly in front of origin that absorbs misses from every region.
Origin: the actual storage.
Even on a cache miss at the user's edge, the request likely hits a regional cache before reaching origin. Origin gets very few requests.
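The read path through that hierarchy, as a minimal sketch (the cache and origin interfaces are assumed):

```python
def get_segment(key: str, edge, regional, origin) -> bytes:
    data = edge.get(key)
    if data is not None:
        return data                    # edge hit: the overwhelmingly common case for hot videos
    data = regional.get(key)
    if data is None:
        data = origin.get(key)         # rare: only the first request from an entire region
        regional.put(key, data)
    edge.put(key, data)                # populate the edge so the next nearby viewer hits
    return data
```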
Step 7: Storage
Originals plus all transcoded outputs add up to exabytes of data, stored in distributed object storage (proprietary at Google scale; smaller services use S3, GCS, or ADLS).
Tiering
Storage costs vary by access tier:
Hot (recent, viral): replicated across many regions for low-latency access. Higher cost per byte.
Cold (older, less-watched): stored in cheaper tiers. Single replica, maybe in fewer regions.
Frozen (very old, archival): some videos go to extremely cold storage (tape archives) where retrieval takes hours; in exchange, storage is ~10x cheaper than the hot tier.
Tiering decisions: based on age and recent view count. A video viewed yesterday stays hot. A video unviewed for years can be cooled.
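Expressed as a rule, with thresholds that are purely illustrative:

```python
from datetime import datetime, timedelta, timezone

def choose_tier(uploaded_at: datetime, views_last_30d: int) -> str:
    """uploaded_at must be timezone-aware; the thresholds below are assumptions."""
    age = datetime.now(timezone.utc) - uploaded_at
    if views_last_30d > 10_000 or age < timedelta(days=30):
        return "hot"       # multi-region replicas, lowest latency
    if views_last_30d > 10 or age < timedelta(days=3 * 365):
        return "cold"      # cheaper tier, fewer replicas
    return "frozen"        # archival storage; retrieval may take hours
```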
Replication for Durability
Videos must never be lost. The hot tier uses 3x replication or erasure coding. Cold tiers can use wider erasure codes (e.g., 12-of-18: any 12 of 18 fragments reconstruct the data, tolerating 6 losses at 1.5x storage overhead instead of 3x).
Database for Metadata
Video metadata (title, description, channel, view count, like count, upload date, tags, transcription) lives in a sharded SQL database, plus heavy caching. Per-video records are small but read very frequently.
Step 8: Recommendations
Most YouTube watch time comes from recommendations, not search. The recommendation pipeline is similar to news feed (covered in another article):
1. Candidate generation: from billions of videos, narrow to thousands. Sources include videos similar to what the user has watched, videos popular among similar users, trending in the user's locale.
2. Ranker: ML model scores each candidate. Predicts watch probability and watch duration.
3. Re-ranker: applies diversity, dedup, freshness, content policy filters.
4. Returns the final list (autoplay queue, sidebar suggestions, homepage feed).
Models are personalized per user. Massive feature stores. Critical to the business: a small lift in watch time per user multiplies to enormous revenue.
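The funnel in sketch form. The candidate sources, the ranking model, and the policy filters are stand-ins for much larger systems:

```python
def recommend(user, candidate_sources, ranker, policy_filters, k: int = 20) -> list:
    # 1. Candidate generation: billions -> thousands, from several cheap sources.
    candidates = set()
    for source in candidate_sources:      # e.g. similar-to-watched, co-watched, trending-in-locale
        candidates.update(source.fetch(user, limit=500))

    # 2. Ranking: one expensive model call per candidate, predicting expected watch time.
    scored = [(ranker.predict_watch_time(user, video), video) for video in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)

    # 3. Re-ranking: diversity, dedup against history, freshness, content policy.
    result = []
    for _, video in scored:
        if all(f.allows(user, video, result) for f in policy_filters):
            result.append(video)
        if len(result) == k:
            break
    return result
```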
Watch History as Signal
Every video watched, every dwell time, every skip, every like, every comment feeds into the engagement stream. Real-time updates to streaming features mean recommendations adapt within minutes.
Cold Start
New users with no history get demographic-based defaults: popular videos in their country, recent uploads from broad-appeal channels. After a few sessions, personal recommendations take over.
Step 9: Live Streaming
Even harder than VOD (Video on Demand). The same architecture but the upload pipeline runs continuously and in real time.
Live Architecture
1. Creator's encoder pushes RTMP stream to ingest server.
2. Ingest server receives the stream, splits it into chunks.
3. Real-time transcoder produces multiple resolutions of each chunk.
4. Chunks pushed to CDN.
5. Players consume HLS/DASH manifest, fetch chunks as they appear.
Latency from "happens" to "viewer sees" is typically 10-30 seconds for standard live. Low-Latency HLS and LL-DASH (smaller chunks, delivered to players before the full segment is finished) bring it under 5 seconds.
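Where those seconds come from, roughly. The component values are assumptions; the structural point is that buffered whole segments dominate, which is why low-latency modes shrink or split them:

```python
SEGMENT_SECONDS  = 4    # length of each HLS/DASH segment (assumed)
PLAYER_BUFFER    = 3    # segments the player buffers before it starts rendering (assumed)
ENCODE_INGEST    = 4    # capture + ingest + real-time transcode, in seconds (assumed)
CDN_PROPAGATION  = 2    # push to edges + manifest update, in seconds (assumed)

glass_to_glass = ENCODE_INGEST + CDN_PROPAGATION + SEGMENT_SECONDS * PLAYER_BUFFER
print(glass_to_glass)   # ~18 s with these numbers; shrinking segments is the main lever
```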
Live Chat
Comments during live streams are a separate real-time system: WebSocket server, fan-out to viewers, rate limiting, moderation. At very large concurrent live (millions of viewers), this is its own scaling challenge.
DVR / Replay
While live, users can rewind to earlier in the stream. The CDN holds recent chunks even after they have been consumed. After the stream ends, it converts to a regular VOD video.
Step 10: Edge Cases and Operational Concerns
The Bandwidth Cost Problem
Bandwidth is enormously expensive. YouTube spends billions on it. Optimization tactics:
Better codecs: VP9 saves 30-50% of bits over H.264 at the same quality; AV1 saves more (a rough cost sketch follows this list).
Adaptive bitrate (don't deliver 4K to a 360p phone screen).
Caching at every edge.
Network peering with ISPs (saves transit fees).
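To put rough numbers on the codec item above (every figure here is an assumption; real unit costs and volumes are not public):

```python
EGRESS_EB_PER_YEAR = 1500     # assumed delivered video per year (a few EB/day, annualized)
COST_PER_GB        = 0.002    # assumed blended $/GB after peering and owned fiber
CODEC_SAVINGS      = 0.35     # VP9 vs H.264: roughly 30-50% fewer bits at equal quality

annual_bill = EGRESS_EB_PER_YEAR * 1e9 * COST_PER_GB       # EB -> GB -> dollars
print(f"baseline ≈ ${annual_bill / 1e9:.1f}B/yr, "
      f"codec switch saves ≈ ${annual_bill * CODEC_SAVINGS / 1e9:.1f}B/yr")
```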
Content Moderation at Scale
Hundreds of hours uploaded per minute. Even a 0.1% problematic rate is unmanageable for human review alone. ML scales the work; humans handle borderline cases.
The infrastructure: classifier models for major policy categories, decision pipelines that combine classifier scores with metadata signals, queues for human reviewers, dashboards for trust-and-safety operators.
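The decision step, reduced to a sketch. The categories, thresholds, and metadata signal are illustrative:

```python
def moderation_decision(category_scores: dict[str, float], metadata: dict) -> str:
    """category_scores: per-policy-category classifier outputs in [0, 1]."""
    worst = max(category_scores.values())
    if worst > 0.95:
        return "block"           # clear violation: the video never goes live
    if worst > 0.60 or metadata.get("channel_strikes", 0) > 0:
        return "human_review"    # borderline: queue for a trust-and-safety reviewer
    return "approve"
```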
Copyright (Content ID)
Content ID compares every uploaded video against a fingerprint database. Audio and video fingerprints. Matches are surfaced to copyright owners; they choose to block, monetize, or track.
The system runs at scale: every video, against millions of registered claims. Hash-based audio fingerprinting and perceptual video fingerprinting (robust to compression, cropping, etc.) make this feasible.
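A toy illustration of the matching step, nowhere near the real Content ID but showing the shape: compute one compact fingerprint per second, look each up in the claims index, and report only sustained runs of matches:

```python
def match_claims(upload_fps: list[int], claim_index: dict[int, str],
                 min_consecutive_seconds: int = 30) -> set[str]:
    """upload_fps: one perceptual fingerprint per second of the uploaded video.
    claim_index: fingerprint -> claim_id for registered reference material."""
    matches, run, last_claim = set(), 0, None
    for fp in upload_fps:
        claim = claim_index.get(fp)
        if claim is not None and claim == last_claim:
            run += 1
        else:
            run, last_claim = (1, claim) if claim is not None else (0, None)
        if claim is not None and run >= min_consecutive_seconds:
            matches.add(claim)    # sustained match: surface to the rights holder
    return matches
```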
Comments and Community
Each video's comment thread is its own scaling problem. Popular videos get tens of thousands of comments. Nested replies, ranking, spam filtering, abuse handling. A separate system from the video itself.
Search Index
Indexes title, description, captions, channel, plus engagement signals. Updated continuously. Distinct from recommendation: search starts from a query; recommendation has none.
Monetization
Ads inserted at chosen positions (pre-roll, mid-roll). Ad selection itself is a real-time auction (a separate large system). Revenue split with creators tracked at scale.
Analytics for Creators
Creators see detailed analytics: views, watch time, demographics, traffic sources, retention curves. This data comes from the engagement stream, aggregated into per-video and per-channel reports.
Geographic Restrictions
Some videos are blocked in certain countries (legal, regional rights). The CDN enforces this at the edge: requests from blocked geos return 403 instead of the video.
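In sketch form (the request and policy objects are hypothetical):

```python
def serve_segment(request, policy_store, cache):
    policy = policy_store.get(request.video_id)
    if request.country in policy.blocked_countries:
        return 403, b""                          # enforced at the edge; origin never sees it
    return 200, cache.get(request.segment_key)
```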
Privacy and Controls
Private and unlisted videos. Age-restricted content. Comment controls. All layered as access policies on top of the basic infrastructure.
Step 11: Recap of Key Decisions
Two pipelines: upload (transcoding farm) and playback (CDN). Different requirements; different infrastructure.
Adaptive streaming with HLS/DASH. Network-aware playback prevents buffering.
Massive transcoding farm. Each upload produces dozens of derivatives via hardware-accelerated workers.
Predictive CDN caching. ML predicts popularity; popular videos get pre-warmed.
Cache hierarchies (edge to regional to origin). Cache hit rate dominates economics.
Tiered storage by access pattern. Hot videos replicated; cold videos cheap.
ML-driven recommendations. Drive most watch time. Multi-stage funnel like news feed.
Real-time live streaming as parallel infrastructure. Same primitives at lower latency.
Content moderation and Content ID at scale. ML classifiers plus human review.
Better codecs save bandwidth = save money. Even modest codec efficiency gains translate to billions saved.
The One Thing to Remember
YouTube is not really a video site. It is a CDN with a sophisticated upload pipeline and a recommendation system on top. The streaming part (manifest plus segments plus adaptive bitrate) is well-understood and largely solved by HLS/DASH. The hard parts are storing and distributing exabytes of video to billions of users without their connections buckling, and predicting what each individual will want to watch next. Most of the engineering goes into the transcoding farm, the predictive caching at the edge, and the recommendation system that keeps users watching. Everything else (live streaming, comments, monetization, content moderation) sits on top of this foundation. The lesson: if your problem looks like "deliver bytes to users globally," the answer almost always involves a CDN with smart caching and adaptive delivery. Building the rest matters less than getting that foundation right.