The Problem At Civilization Scale
500 hours of video uploaded every minute. Billions of users watching, each getting smooth playback regardless of their device or network. Storage measured in exabytes. Bandwidth that could fill ocean cables many times over. Most watch time driven by recommendations, not search. Live streaming, content moderation, copyright, monetization all layered on top.
YouTube is not just a video site. It is an industrial-scale content delivery network with a sophisticated upload pipeline, recommendation system, and trust-and-safety apparatus on top. The scale of engineering required is nearly unique in the consumer internet.
This article walks through how to build it.
Step 1: Requirements
Functional Requirements
Upload videos (multi-gigabyte files, resumable). Watch them with smooth playback at multiple qualities on any device. Search and browse. Personalized recommendations. Likes, comments, subscriptions. Live streaming. Creator analytics and monetization.
Non-Functional Requirements
Latency: first-frame playback within 1-2 seconds. Adaptive bitrate prevents buffering thereafter.
Availability: 99.99%. YouTube being down is news.
Scale: billions of users, hundreds of hours uploaded per minute, exabytes total storage.
Global reach: users on every continent. Latency and quality must be acceptable everywhere.
Content quality: no buffering on a normal connection. 4K playback supported.
Cost: bandwidth dominates. Storage is large but bandwidth is the line item that matters.
Step 2: Capacity Estimation
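A quick back-of-envelope makes the scale concrete. The watch-time, bitrate, and per-hour storage figures below are assumptions for illustration, not published YouTube numbers; the point is the order of magnitude.

```python
# Back-of-envelope scale estimate. Every input here is an assumption, not a YouTube figure.
WATCH_HOURS_PER_DAY = 2e9        # assumed global daily watch time, in hours
AVG_BITRATE_BPS     = 5e6        # assumed average delivered bitrate (mixed-resolution blend)

viewer_seconds = WATCH_HOURS_PER_DAY * 3600              # 7.2e12 seconds of playback per day
egress_per_day = viewer_seconds * AVG_BITRATE_BPS / 8    # ~4.5e18 bytes ≈ 4.5 EB delivered per day
egress_rate    = egress_per_day / 86_400                 # ~5e13 B/s, i.e. tens of TB per second

UPLOAD_HOURS_PER_DAY = 500 * 60 * 24                     # 500 hours/minute -> 720,000 hours/day
STORED_GB_PER_HOUR   = 3 * 7                             # assumed: ~3 GB original, derivatives ~5-10x
new_storage_per_day  = UPLOAD_HOURS_PER_DAY * STORED_GB_PER_HOUR * 1e9   # ~1.5e16 B ≈ 15 PB/day

print(f"egress ≈ {egress_rate / 1e12:.0f} TB/s sustained, "
      f"new storage ≈ {new_storage_per_day / 1e15:.0f} PB/day (exabytes per year)")
```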
The egress number explains why YouTube is built around a CDN. 60 TB/sec sustained globally cannot come from origin; it must come from edge caches close to users. Most engineering effort goes into making this work cheaply.
Step 3: The Two Pipelines
YouTube splits cleanly into two pipelines with very different characteristics:
Upload + Processing: rare events (each upload happens once). High CPU per event (transcoding). Asynchronous.
Playback: common events (billions per day). Low CPU per event (just serve bytes). Synchronous and latency-critical.
These pipelines need totally different infrastructure. Upload uses a transcoding farm. Playback uses a CDN. Storage is the bridge.
Step 4: The Upload Pipeline
Upload flow: the client pushes the file in chunks, the raw video is stored as the source of truth, and the transcoding farm fans it out into multiple formats for delivery.
Resumable Uploads
Multi-GB videos take time. Networks fail. The upload protocol must support resume: client uploads in chunks, server tracks which chunks succeeded, on retry only re-sends missing chunks.
Standard protocols exist for this, such as tus (tus.io) and Google-style resumable uploads; most modern upload clients and SDKs implement one.
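A minimal client-side sketch of the resume loop, loosely in the spirit of those protocols. The endpoint, the Content-Range/Range header scheme, and the probe request are illustrative assumptions, not a specific vendor API.

```python
import os
import requests  # generic HTTP client; the server behavior assumed here is illustrative

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MB per chunk

def committed_offset(upload_url: str, total: int) -> int:
    """Ask the server how many bytes it has durably stored (empty probe request)."""
    resp = requests.put(upload_url, headers={"Content-Range": f"bytes */{total}"})
    rng = resp.headers.get("Range")               # e.g. "bytes=0-8388607"
    return int(rng.rsplit("-", 1)[1]) + 1 if rng else 0

def resumable_upload(path: str, upload_url: str) -> None:
    total = os.path.getsize(path)
    offset = committed_offset(upload_url, total)  # resume where the server left off
    with open(path, "rb") as f:
        while offset < total:
            f.seek(offset)
            chunk = f.read(CHUNK_SIZE)
            end = offset + len(chunk) - 1
            try:
                requests.put(upload_url, data=chunk,
                             headers={"Content-Range": f"bytes {offset}-{end}/{total}"},
                             ).raise_for_status()
                offset = end + 1                  # chunk accepted; advance
            except requests.RequestException:
                offset = committed_offset(upload_url, total)  # network blip: re-sync and retry
```

Only the missing byte range is ever re-sent, so a failed multi-GB upload costs at most one chunk of wasted work.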
Original Storage
The raw upload is stored as-is in object storage. This is the source of truth for all derivatives.
Transcoding Queue
After upload, the system enqueues a "transcode this video" job in a queue (Kafka or equivalent). Workers consume the queue.
Each video produces many derivative jobs:
- One per resolution (240p, 360p, 480p, 720p, 1080p, 1440p, 2160p/4K).
- One per codec (H.264 for compatibility, VP9 for efficiency on YouTube, AV1 for emerging support).
- One per bitrate per resolution (multiple bitrate variants for adaptive streaming).
- Audio tracks separately.
- Thumbnails (many candidates).
- Optional: captions via speech recognition.
A 10-minute video might generate 50+ derivative files totaling 5-10x the size of the original.
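A sketch of that fan-out, assuming a Kafka-style producer and an illustrative bitrate ladder (the resolutions, codecs, and bitrates here are placeholders, not YouTube's actual ladder):

```python
import itertools, json, uuid

RESOLUTIONS = ["240p", "360p", "480p", "720p", "1080p", "1440p", "2160p"]
CODECS      = ["h264", "vp9", "av1"]
BITRATES_MBPS = {"240p": [0.3], "360p": [0.5], "480p": [1.0], "720p": [2.5, 4.0],
                 "1080p": [5.0, 8.0], "1440p": [10.0], "2160p": [20.0, 35.0]}

def enqueue_transcode_jobs(producer, video_id: str, source_uri: str) -> int:
    """Fan one upload out into independent derivative jobs that workers consume in parallel."""
    jobs = 0
    for res, codec in itertools.product(RESOLUTIONS, CODECS):
        for mbps in BITRATES_MBPS[res]:
            job = {"job_id": str(uuid.uuid4()), "video_id": video_id,
                   "source": source_uri,            # the stored original is the input every time
                   "resolution": res, "codec": codec, "bitrate_mbps": mbps}
            producer.send("transcode-jobs", json.dumps(job).encode())
            jobs += 1
    # Audio renditions, thumbnails, and caption extraction would be enqueued as separate job types.
    return jobs
```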
The Transcoding Farm
Transcoding is CPU-intensive. Software transcoding can take many minutes for a 10-minute video. Hardware acceleration (specialized ASICs, GPUs with video encoding hardware) speeds this up by 10-100x.
YouTube uses massive farms of transcoder workers. Some are dedicated to specific codecs or resolutions because the optimization characteristics differ.
Side Pipelines: Thumbnails, Captions, Moderation, Copyright
Thumbnails: automatically extract candidate frames. Run them through a thumbnail-quality classifier. Creator can also upload custom.
Captions: automatic speech recognition produces timed text. Translation to many languages. Quality is moderate; creators can edit.
Content moderation: ML classifiers run on every uploaded video. Detect violations: nudity, violence, hate speech, misinformation. Borderline cases queued for human review. Clear violations blocked or removed.
Content ID (copyright): compare against a database of copyrighted material (audio fingerprints, video fingerprints). Owners can choose to block, monetize, or track matches. This is the system that lets music labels claim revenue from videos using their songs.
All these run in parallel after the original is stored. The video doesn't go live until they pass.
Step 5: Adaptive Streaming (HLS / DASH)
The video player doesn't download one giant file. It downloads small segments (2-10 seconds each), choosing the resolution that matches current network speed.
How Adaptive Streaming Works
For each video, segments exist at multiple resolutions and bitrates. A "manifest" file lists all available variants and segment URLs.
The player:
1. Downloads the manifest. Sees the available qualities.
2. Picks an initial quality based on network estimate.
3. Downloads the first segment. Times the download.
4. Adapts: if download was fast, try higher quality next; if slow, lower quality.
5. Repeats per segment.
Network drops? Next segment is fetched at lower quality. Network improves? Quality climbs back. The video keeps playing without buffering. This is the core insight that makes streaming feel smooth.
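A toy version of that loop. Real players use more sophisticated buffer-aware algorithms; this sketch only shows the measure-then-adapt shape, with an assumed bitrate ladder and hypothetical segment URLs.

```python
import time
import requests  # stand-in HTTP client; segment URLs come from the manifest

LADDER = [("240p", 3e5), ("360p", 7e5), ("480p", 1.2e6), ("720p", 2.5e6), ("1080p", 5e6)]  # bps

def pick_quality(throughput_bps: float, safety: float = 0.8) -> int:
    """Highest rung whose bitrate fits inside a safety margin of measured throughput."""
    best = 0
    for i, (_, bps) in enumerate(LADDER):
        if bps <= throughput_bps * safety:
            best = i
    return best

def play(segments: dict[str, list[str]]) -> None:
    """segments maps quality label -> ordered segment URLs (same count per quality)."""
    quality, throughput = 0, LADDER[0][1]            # start conservative
    for i in range(len(segments[LADDER[0][0]])):
        url = segments[LADDER[quality][0]][i]
        start = time.monotonic()
        data = requests.get(url).content             # one 2-10 second segment
        elapsed = max(time.monotonic() - start, 1e-3)
        throughput = 0.7 * throughput + 0.3 * (len(data) * 8 / elapsed)  # smoothed estimate
        quality = pick_quality(throughput)           # adapt before fetching the next segment
        # ...hand `data` to the decoder / playback buffer here...
```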
HLS vs DASH
Two competing standards.
HLS (HTTP Live Streaming): Apple's standard. M3U8 manifest format. Default on iOS.
DASH (Dynamic Adaptive Streaming over HTTP): open standard. MPD manifest. Default on most non-Apple browsers.
Both are essentially the same idea with different file formats. Most video services support both, choosing based on the client.
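For concreteness, an illustrative HLS master playlist (paths and bandwidth values are made up). The player downloads this first, then fetches segments from whichever variant playlist it selects:

```
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=700000,RESOLUTION=640x360
360p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720
720p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
1080p/index.m3u8
```

DASH's MPD manifest carries the same information as XML.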
Segment Storage
Each segment is a small file in object storage. Per video, hundreds of files (across resolutions, bitrates, segments). Per segment, just a few MB.
Segment files are static. They never change after creation. This makes them perfectly cacheable at the CDN.
Step 6: The CDN — Where 80% of Engineering Goes
Most of YouTube's engineering effort goes into the CDN. Why: bandwidth is the largest cost, and serving video from the user's nearest edge is the only way to keep streaming smooth globally.
Google built their own CDN (the Edge Network). It is one of the largest private networks on Earth, with thousands of points of presence worldwide. Smaller services use Cloudflare, Akamai, Fastly.
How a Playback Request Flows
1. User clicks a video. Player loads the manifest from a CDN edge.
2. Manifest contains segment URLs. Player requests the first segment.
3. CDN edge checks its cache. Hit? Serve immediately.
4. Cache miss? Edge fetches from regional cache. Regional cache fetches from origin. Eventually serves the user, populates the cache for next time.
5. Subsequent users requesting the same segment get a cache hit at the edge and are served within a few milliseconds.
Hot videos (trending, viral) replicated to many edges. Cold videos (old, rarely watched) might not be cached anywhere; first request takes a few hundred ms.
Predictive Caching
The trick: predict which videos will be popular and pre-warm the cache.
ML models predict views per region per hour. Popular videos get pushed to edge nodes before users request them. Concert clips post-event, news footage during breaking events, etc.
For routine videos, the first cache miss in a region populates the cache for everyone else who asks afterward. The first viewer pays a small latency cost; the millions after pay none.
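A sketch of the pre-warm loop under those assumptions. The predictor, edge API, and threshold are all hypothetical:

```python
PREWARM_THRESHOLD = 50_000   # assumed cut-off: predicted views per region in the next hour

def prewarm_cycle(predictor, edges, catalog) -> None:
    for video in catalog.recently_published():
        for region in edges.regions():
            if predictor.predicted_views(video.id, region, horizon_hours=1) > PREWARM_THRESHOLD:
                # Push the commonly requested renditions before the first viewer asks.
                edges.push(region, video.segment_urls(resolutions=["480p", "720p", "1080p"]))
```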
Cache Hierarchies
Modern CDNs are tiered:
Edge POP: closest to user. Smaller cache.
Regional cache: larger, fewer locations.
Origin shield: a single caching layer directly in front of origin that absorbs misses from every region.
Origin: the actual storage.
Even on a cache miss at the user's edge, the request likely hits a regional cache before reaching origin. Origin gets very few requests.
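The read path through that hierarchy, as a minimal sketch (the cache and origin interfaces are assumed):

```python
def get_segment(key: str, edge, regional, origin) -> bytes:
    data = edge.get(key)
    if data is not None:
        return data                    # edge hit: the overwhelmingly common case for hot videos
    data = regional.get(key)
    if data is None:
        data = origin.get(key)         # rare: only the first request from an entire region
        regional.put(key, data)
    edge.put(key, data)                # populate the edge so the next nearby viewer hits
    return data
```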
Step 7: Storage
Originals plus all transcoded outputs add up to exabytes of data, stored in distributed object storage (proprietary at Google scale; smaller services use S3, GCS, or ADLS).
Tiering
Storage costs vary by access tier:
Hot (recent, viral): replicated across many regions for low-latency access. Higher cost per byte.
Cold (older, less-watched): stored in cheaper tiers. Single replica, maybe in fewer regions.
Frozen (very old, archival): some videos go to extremely cold storage (tape archives) where retrieval takes hours; in exchange, storage is ~10x cheaper than the hot tier.
Tiering decisions: based on age and recent view count. A video viewed yesterday stays hot. A video unviewed for years can be cooled.
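Expressed as a rule, with thresholds that are purely illustrative:

```python
from datetime import datetime, timedelta, timezone

def choose_tier(uploaded_at: datetime, views_last_30d: int) -> str:
    """uploaded_at must be timezone-aware; the thresholds below are assumptions."""
    age = datetime.now(timezone.utc) - uploaded_at
    if views_last_30d > 10_000 or age < timedelta(days=30):
        return "hot"       # multi-region replicas, lowest latency
    if views_last_30d > 10 or age < timedelta(days=3 * 365):
        return "cold"      # cheaper tier, fewer replicas
    return "frozen"        # archival storage; retrieval may take hours
```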
Replication for Durability
Videos must never be lost. The hot tier uses 3x replication or erasure coding. Cold tiers can use wider erasure codes (e.g., 12-of-18: any 12 of 18 fragments reconstruct the data, tolerating 6 losses at 1.5x storage overhead instead of 3x).
Database for Metadata
Video metadata (title, description, channel, view count, like count, upload date, tags, transcription) lives in a sharded SQL database, plus heavy caching. Per-video records are small but read very frequently.
Step 8: Recommendations
Most YouTube watch time comes from recommendations, not search. The recommendation pipeline is similar to news feed (covered in another article):
1. Candidate generation: from billions of videos, narrow to thousands. Sources include videos similar to what the user has watched, videos popular among similar users, trending in the user's locale.
2. Ranker: ML model scores each candidate. Predicts watch probability and watch duration.
3. Re-ranker: applies diversity, dedup, freshness, content policy filters.
4. Returns the final list (autoplay queue, sidebar suggestions, homepage feed).
Models are personalized per user. Massive feature stores. Critical to the business: a small lift in watch time per user multiplies to enormous revenue.
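The funnel in sketch form. The candidate sources, the ranking model, and the policy filters are stand-ins for much larger systems:

```python
def recommend(user, candidate_sources, ranker, policy_filters, k: int = 20) -> list:
    # 1. Candidate generation: billions -> thousands, from several cheap sources.
    candidates = set()
    for source in candidate_sources:      # e.g. similar-to-watched, co-watched, trending-in-locale
        candidates.update(source.fetch(user, limit=500))

    # 2. Ranking: one expensive model call per candidate, predicting expected watch time.
    scored = [(ranker.predict_watch_time(user, video), video) for video in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)

    # 3. Re-ranking: diversity, dedup against history, freshness, content policy.
    result = []
    for _, video in scored:
        if all(f.allows(user, video, result) for f in policy_filters):
            result.append(video)
        if len(result) == k:
            break
    return result
```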
Watch History as Signal
Every video watched, every dwell time, every skip, every like, every comment feeds into the engagement stream. Real-time updates to streaming features mean recommendations adapt within minutes.
Cold Start
New users with no history get demographic-based defaults: popular videos in their country, recent uploads from broad-appeal channels. After a few sessions, personal recommendations take over.
Step 9: Live Streaming
Even harder than VOD (Video on Demand). The same architecture but the upload pipeline runs continuously and in real time.
Live Architecture
1. Creator's encoder pushes RTMP stream to ingest server.
2. Ingest server receives the stream, splits it into chunks.
3. Real-time transcoder produces multiple resolutions of each chunk.
4. Chunks pushed to CDN.
5. Players consume HLS/DASH manifest, fetch chunks as they appear.
Latency from "happens" to "viewer sees" is typically 10-30 seconds for standard live. Low-Latency HLS and LL-DASH (smaller chunks, delivered to players before the full segment is finished) bring it under 5 seconds.
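Where those seconds come from, roughly. The component values are assumptions; the structural point is that buffered whole segments dominate, which is why low-latency modes shrink or split them:

```python
SEGMENT_SECONDS  = 4    # length of each HLS/DASH segment (assumed)
PLAYER_BUFFER    = 3    # segments the player buffers before it starts rendering (assumed)
ENCODE_INGEST    = 4    # capture + ingest + real-time transcode, in seconds (assumed)
CDN_PROPAGATION  = 2    # push to edges + manifest update, in seconds (assumed)

glass_to_glass = ENCODE_INGEST + CDN_PROPAGATION + SEGMENT_SECONDS * PLAYER_BUFFER
print(glass_to_glass)   # ~18 s with these numbers; shrinking segments is the main lever
```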
Live Chat
Comments during live streams are a separate real-time system: WebSocket server, fan-out to viewers, rate limiting, moderation. At very large concurrent live (millions of viewers), this is its own scaling challenge.
DVR / Replay
While live, users can rewind to earlier in the stream. The CDN holds recent chunks even after they have been consumed. After the stream ends, it converts to a regular VOD video.
Step 10: Edge Cases and Operational Concerns
The Bandwidth Cost Problem
Bandwidth is enormously expensive. YouTube spends billions on it. Optimization tactics:
Better codecs: VP9 saves 30-50% of bits over H.264 at the same quality; AV1 saves more (a rough cost sketch follows this list).
Adaptive bitrate (don't deliver 4K to a 360p phone screen).
Caching at every edge.
Network peering with ISPs (saves transit fees).
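To put rough numbers on the codec item above (every figure here is an assumption; real unit costs and volumes are not public):

```python
EGRESS_EB_PER_YEAR = 1500     # assumed delivered video per year (a few EB/day, annualized)
COST_PER_GB        = 0.002    # assumed blended $/GB after peering and owned fiber
CODEC_SAVINGS      = 0.35     # VP9 vs H.264: roughly 30-50% fewer bits at equal quality

annual_bill = EGRESS_EB_PER_YEAR * 1e9 * COST_PER_GB       # EB -> GB -> dollars
print(f"baseline ≈ ${annual_bill / 1e9:.1f}B/yr, "
      f"codec switch saves ≈ ${annual_bill * CODEC_SAVINGS / 1e9:.1f}B/yr")
```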
Content Moderation at Scale
Hundreds of hours uploaded per minute. Even a 0.1% problematic rate is unmanageable for human review alone. ML scales the work; humans handle borderline cases.
The infrastructure: classifier models for major policy categories, decision pipelines that combine classifier scores with metadata signals, queues for human reviewers, dashboards for trust-and-safety operators.
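The decision step, reduced to a sketch. The categories, thresholds, and metadata signal are illustrative:

```python
def moderation_decision(category_scores: dict[str, float], metadata: dict) -> str:
    """category_scores: per-policy-category classifier outputs in [0, 1]."""
    worst = max(category_scores.values())
    if worst > 0.95:
        return "block"           # clear violation: the video never goes live
    if worst > 0.60 or metadata.get("channel_strikes", 0) > 0:
        return "human_review"    # borderline: queue for a trust-and-safety reviewer
    return "approve"
```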
Copyright (Content ID)
Content ID compares every uploaded video against a fingerprint database. Audio and video fingerprints. Matches are surfaced to copyright owners; they choose to block, monetize, or track.
The system runs at scale: every video, against millions of registered claims. Hash-based audio fingerprinting and perceptual video fingerprinting (robust to compression, cropping, etc.) make this feasible.
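A toy illustration of the matching step, nowhere near the real Content ID but showing the shape: compute one compact fingerprint per second, look each up in the claims index, and report only sustained runs of matches:

```python
def match_claims(upload_fps: list[int], claim_index: dict[int, str],
                 min_consecutive_seconds: int = 30) -> set[str]:
    """upload_fps: one perceptual fingerprint per second of the uploaded video.
    claim_index: fingerprint -> claim_id for registered reference material."""
    matches, run, last_claim = set(), 0, None
    for fp in upload_fps:
        claim = claim_index.get(fp)
        if claim is not None and claim == last_claim:
            run += 1
        else:
            run, last_claim = (1, claim) if claim is not None else (0, None)
        if claim is not None and run >= min_consecutive_seconds:
            matches.add(claim)    # sustained match: surface to the rights holder
    return matches
```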
Comments and Community
Each video's comment thread is its own scaling problem. Popular videos get tens of thousands of comments. Nested replies, ranking, spam filtering, abuse handling. A separate system from the video itself.
Search Index
Indexes title, description, captions, channel, plus engagement signals. Updated continuously. Distinct from recommendation: search starts from a query; recommendation has none.
Monetization
Ads inserted at chosen positions (pre-roll, mid-roll). Ad selection itself is a real-time auction (a separate large system). Revenue split with creators tracked at scale.
Analytics for Creators
Creators see detailed analytics: views, watch time, demographics, traffic sources, retention curves. This data comes from the engagement stream, aggregated into per-video and per-channel reports.
Geographic Restrictions
Some videos are blocked in certain countries (legal, regional rights). The CDN enforces this at the edge: requests from blocked geos return 403 instead of the video.
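In sketch form (the request and policy objects are hypothetical):

```python
def serve_segment(request, policy_store, cache):
    policy = policy_store.get(request.video_id)
    if request.country in policy.blocked_countries:
        return 403, b""                          # enforced at the edge; origin never sees it
    return 200, cache.get(request.segment_key)
```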
Privacy and Controls
Private and unlisted videos. Age-restricted content. Comment controls. All layered as access policies on top of the basic infrastructure.
Step 11: Recap of Key Decisions
Two pipelines: upload (transcoding farm) and playback (CDN). Different requirements; different infrastructure.
Adaptive streaming with HLS/DASH. Network-aware playback prevents buffering.
Massive transcoding farm. Each upload produces dozens of derivatives via hardware-accelerated workers.
Predictive CDN caching. ML predicts popularity; popular videos get pre-warmed.
Cache hierarchies (edge to regional to origin). Cache hit rate dominates economics.
Tiered storage by access pattern. Hot videos replicated; cold videos cheap.
ML-driven recommendations. Drive most watch time. Multi-stage funnel like news feed.
Real-time live streaming as parallel infrastructure. Same primitives at lower latency.
Content moderation and Content ID at scale. ML classifiers plus human review.
Better codecs save bandwidth = save money. Even modest codec efficiency gains translate to billions saved.
The One Thing to Remember
YouTube is not really a video site. It is a CDN with a sophisticated upload pipeline and a recommendation system on top. The streaming part (manifest plus segments plus adaptive bitrate) is well-understood and largely solved by HLS/DASH. The hard parts are storing and distributing exabytes of video to billions of users without their connections buckling, and predicting what each individual will want to watch next. Most of the engineering goes into the transcoding farm, the predictive caching at the edge, and the recommendation system that keeps users watching. Everything else (live streaming, comments, monetization, content moderation) sits on top of this foundation. The lesson: if your problem looks like "deliver bytes to users globally," the answer almost always involves a CDN with smart caching and adaptive delivery. Building the rest matters less than getting that foundation right.