The Problem That Looks Simple

You open Uber. You tap "Request a ride." Within a few seconds, the app shows a nearby driver, an ETA of 4 minutes, and a price. The driver is heading toward you. This experience feels effortless.

Behind it sits one of the hardest real-time geo problems in computer science. Millions of drivers around the world, all moving constantly, sending location pings every few seconds. Millions of riders requesting rides at any moment. The system must instantly know "where are the closest available drivers to this exact lat/lon, right now?" and answer in milliseconds.

"Find the closest drivers to a point" sounds like a database query. At scale, that database query melts every database. This article walks through how to actually build it.

Step 1: Requirements

Functional Requirements

Driver Location Tracking
Active drivers report their location every few seconds. The system stores the latest position of each driver.
Find Nearby Drivers
Given a rider's lat/lon, find the closest N available drivers within some radius (e.g., 5 km).
Dispatch
Offer a ride to the closest driver. If declined, offer to the next closest. Repeat until accepted.
ETA
Estimate how long until the driver arrives. Account for real road network and traffic.
Surge Pricing
When demand exceeds supply in an area, multiply prices to balance the market.
Trip Lifecycle
Track a trip from request to pickup to drop-off. Persist for billing and history.

Non-Functional Requirements

Latency: matching a rider to a driver must complete in under 1-2 seconds end-to-end. Each subsystem (location lookup, dispatch decision, ETA computation) gets a tight slice of that budget.
Availability: 99.99%. An outage in a major market (San Francisco at rush hour) is a public incident.
Scale: millions of active drivers globally, tens of millions of riders, peaking on Friday/Saturday nights.
Geographic distribution: users are everywhere. Latency from Tokyo to a US data center is 200ms+ before any work happens. Regional deployment is a hard requirement.
Strong consistency on dispatch: a driver cannot be dispatched to two riders at once. Race conditions here are catastrophic.

Step 2: Capacity Estimation

Metric | Calculation | Result
Active drivers | given | ~5 million worldwide
Location ping interval | given | ~4 seconds
Location updates / sec | 5M / 4 | ~1.25 million/sec
Ride requests / sec (peak) | global peak estimate | ~5,000/sec
Ride lookups / sec | requests + idle map browsing | ~50,000/sec

Two things stand out. First, the location update rate is staggering. 1.25 million writes per second, just for "where is each driver right now." No relational database can sustain that.

Second, riders also browse the map even before requesting a ride. Looking at a map = querying nearby drivers. So lookups are 10x ride requests. The lookup path must be cheap.

Step 3: Why The Naive Approach Fails

The obvious design:

1. Store every driver's lat/lon in a database.
2. To find nearby drivers, query: WHERE distance(lat, lon, my_lat, my_lon) < 5km.
3. Return.

This dies for three reasons:

Computing distance is expensive at scale. Even with a spatial index (PostGIS), running this query tens of thousands of times per second on millions of rows is unsustainable.

Updates kill it. Over a million writes per second to a single database with a spatial index? No. Maintaining the index alone can't keep up.

Global tables don't shard well by location. If you shard by user_id, every "find nearby drivers" query has to fan out to every shard. If you shard by location, you have to figure out HOW to shard location without one shard becoming a hotspot.

The answer to all three problems is the same: don't compute distance at query time. Pre-bucket the world into discrete cells, and route queries to the cell that matters.

Step 4: Geohashing and S2 Cells

The core idea: convert continuous lat/lon coordinates into a discrete cell ID. Two coordinates in the same cell are physically near each other. Looking up "drivers in cell X" is a single hash lookup, not a distance computation.

Geohash

An older technique. Recursively divide the world into rectangles. Each subdivision adds one character to the cell ID. A 6-character geohash is roughly 1 km square. A 7-character one is about 150 meters.

Cells with shared prefixes are close: 9q8yyk1 and 9q8yyk2 are neighbors. To find nearby cells, just look at prefix neighbors.
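
To make the prefix property concrete, here is a minimal, self-contained geohash encoder (the standard interleaved-bisection algorithm, no library needed). Nearby points produce strings that share a prefix:

```python
# Minimal geohash encoder: interleave longitude/latitude bisection bits
# and pack them 5 at a time into base32 characters.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat: float, lon: float, precision: int = 7) -> str:
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits = []
    use_lon = True  # geohash starts with a longitude bit
    while len(bits) < precision * 5:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append(1)
                lon_lo = mid
            else:
                bits.append(0)
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1)
                lat_lo = mid
            else:
                bits.append(0)
                lat_hi = mid
        use_lon = not use_lon
    chars = []
    for i in range(0, len(bits), 5):
        value = 0
        for b in bits[i:i + 5]:
            value = (value << 1) | b
        chars.append(BASE32[value])
    return "".join(chars)

# Two points a couple hundred meters apart in San Francisco
# share a long prefix:
print(geohash_encode(37.7749, -122.4194))  # e.g. 9q8yyk...
print(geohash_encode(37.7757, -122.4187))  # shares the 9q8yy prefix
```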

The problem with geohash: the rectangles distort badly near the poles, and two physically adjacent points can sit on opposite sides of a prefix boundary and end up with totally different geohash strings. Workarounds exist, but they are messy.

S2 Cells (What Uber Uses)

Google's improvement. It projects the sphere onto a cube and uses a Hilbert curve to map 2D space onto a 1D ordering. Cells are roughly square (much less distortion) and form a hierarchy: a level-12 S2 cell is about 3 km on a side; a level-15 cell is about 0.4 km.

The key win: numerically nearby S2 IDs are physically nearby. This makes range queries efficient. And the cell hierarchy lets you adapt: dense urban areas use small cells (level 15+), sparse rural areas use big cells (level 10).

How Driver Locations Get Indexed

Every time a driver pings their location, the ingest path does three things (a code sketch follows the list):

1. The server takes the lat/lon and computes the S2 cell ID at the chosen level (say level 12).
2. It stores the driver in a per-cell data structure: cell_id => set of (driver_id, current_lat, current_lon, status, last_update).
3. If the driver moved into a new cell, the old cell removes them and the new cell adds them.
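
A minimal sketch of that ingest path, assuming the open-source s2sphere Python bindings for S2 and a plain in-process dict as the per-cell store (production systems use Redis or a custom service, as described later):

```python
import time
from collections import defaultdict

import s2sphere  # assumed dependency: pip install s2sphere

CELL_LEVEL = 12  # fixed indexing level for this sketch

# cell_id (int) -> {driver_id: (lat, lon, status, last_update)}
cell_index = defaultdict(dict)
# driver_id -> cell_id currently holding that driver
driver_cell = {}

def cell_for(lat: float, lon: float, level: int = CELL_LEVEL) -> int:
    """Map a lat/lon to the 64-bit S2 cell ID at the given level."""
    point = s2sphere.LatLng.from_degrees(lat, lon)
    return s2sphere.CellId.from_lat_lng(point).parent(level).id()

def on_driver_ping(driver_id: str, lat: float, lon: float, status: str) -> None:
    """Handle one location ping: update the cell index, moving the
    driver between cells if they crossed a cell boundary."""
    new_cell = cell_for(lat, lon)
    old_cell = driver_cell.get(driver_id)
    if old_cell is not None and old_cell != new_cell:
        cell_index[old_cell].pop(driver_id, None)
    cell_index[new_cell][driver_id] = (lat, lon, status, time.time())
    driver_cell[driver_id] = new_cell
```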

Now "find drivers near (lat, lon)" becomes:

1. Compute the S2 cell for the rider's location.
2. Look up that cell, plus neighboring cells (to handle boundary cases).
3. Return all drivers in those cells.
4. Optionally compute exact distance to filter by radius and rank.

This is a single hash lookup plus a few neighbor lookups, on the order of a hundred microseconds, and it scales trivially.
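
A sketch of the lookup side, again assuming s2sphere and the same cell_index dict shape as the previous sketch. Exact distances (haversine) are computed only for the handful of candidates pulled from the cells:

```python
import math

import s2sphere  # assumed dependency: pip install s2sphere

EARTH_RADIUS_M = 6_371_000

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def nearby_drivers(cell_index, lat, lon, level=12, radius_m=5_000, limit=10):
    """Return up to `limit` available drivers within `radius_m`,
    searching the rider's cell plus its neighbors."""
    rider_cell = s2sphere.CellId.from_lat_lng(
        s2sphere.LatLng.from_degrees(lat, lon)).parent(level)
    # get_all_neighbors is the assumed s2sphere API for same-level neighbors
    cells = [rider_cell] + list(rider_cell.get_all_neighbors(level))
    candidates = []
    for cell in cells:
        for driver_id, (d_lat, d_lon, status, _) in cell_index.get(cell.id(), {}).items():
            if status != "available":
                continue
            dist = haversine_m(lat, lon, d_lat, d_lon)
            if dist <= radius_m:
                candidates.append((dist, driver_id))
    candidates.sort()  # closest first
    return candidates[:limit]
```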

Step 5: The Full Architecture

[Diagram: Geo Service Architecture]
Driver apps ping every 4 seconds over WebSocket, through an edge load balancer, into a regional Location Ingest service. Ingest writes to the Geo State layer: an in-memory Cell Index (cell_id to drivers) and a Driver State DB (the full driver record). The Geo State layer is queried by the Match layer: the Dispatch Service, the ETA Service (routing engine), and the Pricing Service (surge multiplier). Side outputs feed the Trip Events Stream, the Trip History DB, and Analytics.

Components Explained

Location Ingest accepts driver pings via WebSocket. Each ping is a small message: (driver_id, lat, lon, timestamp, status). The ingest service computes the S2 cell ID and updates the Cell Index.

Cell Index is the heart of the system. It is an in-memory data store keyed by S2 cell ID, with each cell holding the current set of drivers in that area. It can be implemented as Redis sorted sets (one per cell) or as a custom in-memory service for higher throughput.
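
One possible Redis layout, sketched with redis-py: a sorted set per cell scored by the last ping time (so stale drivers can be excluded with a score range), plus a small hash per driver for its latest position. Cell transitions (removing the driver from the old cell's set) and key TTLs are omitted for brevity.

```python
import time

import redis  # redis-py

r = redis.Redis()
HEARTBEAT_WINDOW_S = 10  # drivers silent longer than this are ignored

def record_ping(cell_id: int, driver_id: str, lat: float, lon: float) -> None:
    now = time.time()
    # Membership in the cell, scored by last ping time.
    r.zadd(f"cell:{cell_id}", {driver_id: now})
    # Latest known position for this driver.
    r.hset(f"driver:{driver_id}", mapping={
        "lat": lat, "lon": lon, "last_update": now,
    })

def drivers_in_cell(cell_id: int) -> list:
    """Driver IDs in this cell with a recent heartbeat."""
    cutoff = time.time() - HEARTBEAT_WINDOW_S
    return [d.decode() for d in r.zrangebyscore(f"cell:{cell_id}", cutoff, "+inf")]
```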

Driver State DB holds the complete driver record (vehicle type, status, ratings, etc.) for queries that need more than just location.

Dispatch Service handles ride requests. Given a rider's location, it looks up the cell, gets candidate drivers, picks one based on distance/rating/wait-time, and offers the trip.

ETA Service computes how long the chosen driver will take, using a road network and traffic model.

Pricing Service computes the fare, including any surge multiplier for the cell.

Trip Events Stream records every state change of every trip (requested, accepted, picked up, dropped off, paid, rated). Used for analytics, fraud detection, and post-hoc audits.

Step 6: The Dispatch Flow in Detail

When a rider taps "Request":

1. Their app sends a request: {rider_id, pickup_lat, pickup_lon, dest_lat, dest_lon, vehicle_type}.
2. Dispatch Service computes the S2 cell for the pickup location.
3. It looks up that cell + 8 neighbors (to handle drivers near the boundary).
4. It filters: driver must be available, of the right vehicle type, and within reasonable distance.
5. It calls ETA Service to compute realistic arrival times for the candidates.
6. It picks the best driver. "Best" means a weighted combination of distance, ETA, driver acceptance rate, and other signals (see the sketch after this list).
7. The driver gets a push: "Trip request, 4 min away, $12 fare." They have ~10 seconds to accept.
8. If accepted: trip is created, both parties notified, driver enroute.
9. If declined or timed out: try the next best candidate.
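
A simplified sketch of steps 6-9. The ranking weights, timeout, candidate fields, and the offer_trip/create_trip callbacks are illustrative, not Uber's actual values or APIs:

```python
OFFER_TIMEOUT_S = 10

def score(candidate: dict) -> float:
    """Lower is better: a hypothetical weighted blend of the signals
    described above (ETA, distance, acceptance rate)."""
    return (0.5 * candidate["eta_s"]
            + 0.3 * candidate["distance_m"] / 100
            - 0.2 * candidate["acceptance_rate"] * 60)

def dispatch(request: dict, candidates: list, offer_trip, create_trip):
    """Offer the trip to candidates in score order until one accepts.

    offer_trip(driver_id, request, timeout) and create_trip(request, driver_id)
    are hypothetical hooks into the push/notification and trip systems.
    """
    for candidate in sorted(candidates, key=score):
        accepted = offer_trip(candidate["driver_id"], request,
                              timeout=OFFER_TIMEOUT_S)
        if accepted:
            return create_trip(request, candidate["driver_id"])
    return None  # no driver accepted; surface "no drivers available"
```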

Why Cell Hierarchy Matters

The right cell size depends on context. Manhattan: tens of drivers per small cell, lookups should use small cells. Wyoming: maybe one driver per 10-cell radius, lookups should use big cells.

Production systems either pick a fixed level globally (simpler) or adapt the level dynamically based on regional driver density (more efficient but complex).
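
A toy version of the adaptive approach, with made-up density thresholds; the levels match the approximate cell sizes mentioned earlier:

```python
def query_level(drivers_per_km2: float) -> int:
    """Pick an S2 level for nearby-driver lookups based on local density.
    Thresholds here are illustrative only."""
    if drivers_per_km2 >= 50:   # dense urban core
        return 15               # ~0.4 km cells
    if drivers_per_km2 >= 5:    # typical city
        return 12               # ~3 km cells
    return 10                   # sparse / rural, much larger cells
```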

Step 7: The Hot Cell Problem

Times Square at 8 PM Friday. Hundreds of drivers and thousands of ride requests in one cell. That single cell becomes a hotspot.

Symptoms:

The cell's data structure is constantly read and written.
If sharded onto one node, that node is overloaded.
Latency spikes for everyone whose request touches that cell.

Mitigations

Read replicas for hot cells: the writes still go to one place, but reads are served by many replicas. Slight staleness is acceptable (drivers move every few seconds anyway).

Sub-cell sharding: dynamically split a hot cell into smaller cells. Cell-level-12 becomes four level-13 cells. Each smaller cell handles a fraction of the load.

Aggressive caching: for the "show me nearby drivers" map preview (not the actual dispatch), cache the cell's driver list for 1-2 seconds. Most map browsing tolerates stale-by-1-second.

Pre-allocation in dense areas: known dense zones (airports, downtown areas) get extra resources permanently.
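
Of these, the caching mitigation is the simplest to sketch: a short TTL cache in front of the Cell Index, used only for the map-preview path. The loader callback below is a hypothetical hook into the real cell lookup.

```python
import time

class CellPreviewCache:
    """Serves 'drivers near me' map previews from a short-lived cache.
    Dispatch never reads from here; it always hits the live Cell Index."""

    def __init__(self, ttl_s: float = 1.5):
        self.ttl_s = ttl_s
        self._entries = {}  # cell_id -> (expires_at, drivers)

    def get(self, cell_id, loader):
        now = time.time()
        entry = self._entries.get(cell_id)
        if entry and entry[0] > now:
            return entry[1]               # cache hit, at most ttl_s stale
        drivers = loader(cell_id)         # fall through to the Cell Index
        self._entries[cell_id] = (now + self.ttl_s, drivers)
        return drivers
```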

Step 8: Computing ETA

"Driver is 4 minutes away" sounds simple. It is not. Doing it right requires:

Road graph: a graph of roads, intersections, one-way streets, turn restrictions. OpenStreetMap or proprietary.
Routing engine: finds the optimal path from the driver's current location to the rider's pickup. Algorithms like A* or Contraction Hierarchies for speed.
Traffic model: historical and real-time speed data per road segment. Slows the driver down at rush hour.
Vehicle type adjustments: trucks are slower than cars; bikes use bike lanes.

Routing services like Mapbox or Google Maps, or an in-house routing engine, do this work (Uber's open-source deck.gl and kepler.gl are visualization tools, not routers). Each ETA call typically costs 30-100ms.

For dispatch, ETA is computed once per candidate driver. With 5 candidates per request, that's 5 routing calls, so the ~5,000 requests/sec estimate translates to ~25,000 routing calls/sec at peak. The routing service must be horizontally scalable.
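
A sketch of that fan-out with asyncio, bounding the time spent on routing per request. eta_service.estimate is a hypothetical async client for the routing engine:

```python
import asyncio

ROUTING_BUDGET_S = 0.3  # illustrative per-request budget for all ETA calls

async def etas_for_candidates(eta_service, pickup, candidates):
    """Fetch ETAs for all candidate drivers concurrently. Candidates whose
    routing call fails or exceeds the budget simply get no ETA."""
    async def one(candidate):
        try:
            return await asyncio.wait_for(
                eta_service.estimate(candidate["position"], pickup),
                timeout=ROUTING_BUDGET_S)
        except Exception:  # timeout or routing failure
            return None

    results = await asyncio.gather(*(one(c) for c in candidates))
    return {c["driver_id"]: eta
            for c, eta in zip(candidates, results) if eta is not None}
```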

Step 9: Surge Pricing

Surge is a feedback loop. When demand (ride requests) exceeds supply (available drivers) in a cell, the price goes up. Riders see the higher price; some give up. Drivers see the higher fare; some flip from "off" to "on" or drive into the surge area. Equilibrium restored.

Implementation:

A separate Pricing Service watches each cell's recent metrics (request count, driver count, completion rate) every minute or so. It computes a multiplier (1.0x normal, 1.5x busy, 3x crisis). The multiplier is stored per cell.
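
One way that multiplier could be derived from a cell's recent demand/supply ratio; the clamping and smoothing constants are illustrative only:

```python
def surge_multiplier(requests_last_min: int, available_drivers: int,
                     previous: float = 1.0, max_surge: float = 3.0) -> float:
    """Map a cell's demand/supply imbalance to a price multiplier,
    smoothed against the previous value to avoid jumpy prices."""
    demand_supply = requests_last_min / max(available_drivers, 1)
    raw = min(max(demand_supply, 1.0), max_surge)  # clamp to 1.0x .. max_surge
    smoothed = 0.7 * previous + 0.3 * raw          # exponential smoothing
    return round(smoothed, 1)
```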

When a rider requests a ride, Dispatch calls Pricing to apply the multiplier to the base fare. The rider sees the surged price upfront.

Surge zones are deliberately fuzzed to prevent gaming. If you display "exactly cell X is at 2x", drivers cluster at the boundary instead of moving into the cell. So surge is shown as a wider zone with smoothed transitions.

Step 10: Storage Choices

Cell Index: in-memory. Either Redis with sorted sets (one set per cell, scored by distance to cell center or by recency) or a custom in-process service. The data is ephemeral; it can be rebuilt from driver pings within seconds if a node fails.

Driver State: Cassandra or a similar wide-column store. Sharded by driver_id. Holds vehicle, status, current trip, ratings.

Trip History: after a trip ends, persisted to a SQL database (sharded by user_id or by trip_id). This is where billing, support tickets, and analytics pull from.

Surge state: Redis. Key per cell, value is the current multiplier. TTL of a few minutes; recomputed continuously.

Map data + road graph: read-heavy, rarely changes. Replicated heavily, often in-memory in the routing service.

Step 11: Edge Cases and Operational Concerns

Drivers at Cell Boundaries

A driver standing 50 meters from a cell border might be the closest to a rider whose cell is on the other side of the border. The lookup must include the rider's cell PLUS its neighboring cells. For a level-12 cell, that's 8 neighbors.

Driver Disconnects

Driver app loses internet for 30 seconds. The system must not dispatch them while disconnected. Solution: the Cell Index requires a recent heartbeat. Any driver whose last ping is older than ~10 seconds is treated as unavailable.

Race Conditions on Dispatch

Two riders request rides at nearly the same time, both targeting the same closest driver. Without protection, both might be told "your driver is on the way" while the driver only knows about one.

Fix: dispatch uses a per-driver lock. The first request to claim a driver wins. The second request gets the next-closest driver. This is critical for correctness and is enforced via Redis or a similar atomic store.
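
With Redis, the claim can be a single atomic SET NX with an expiry, so a crashed dispatcher cannot hold a driver hostage. The key naming and TTL here are illustrative:

```python
import redis  # redis-py

r = redis.Redis()
CLAIM_TTL_S = 15  # slightly longer than the driver's accept window

def try_claim_driver(driver_id: str, trip_request_id: str) -> bool:
    """Atomically claim a driver for one trip request.
    Returns False if another request already holds the claim."""
    return bool(r.set(f"dispatch_lock:{driver_id}", trip_request_id,
                      nx=True, ex=CLAIM_TTL_S))

def release_driver(driver_id: str, trip_request_id: str) -> None:
    """Release the claim if we still own it (driver declined or timed out).
    Note: a production system would use a Lua script to make this
    check-and-delete atomic."""
    key = f"dispatch_lock:{driver_id}"
    if r.get(key) == trip_request_id.encode():
        r.delete(key)
```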

Geographic Latency

A rider in Singapore should not wait for a US-based server. The system runs region-by-region. Each region has its own Cell Index, its own Dispatch Service, its own data centers. Cross-region traffic is rare (the rider and driver are in the same city).

Failure of a Cell Node

The Cell Index is sharded across many nodes. If one fails, the cells it owned go dark. Drivers in those cells cannot be dispatched until the index is rebuilt.

Recovery is fast because the data is ephemeral. Replicas exist; failover takes seconds. Drivers themselves keep pinging, so the index repopulates within a few seconds anyway.

Multi-Region Failover

For disaster scenarios, traffic from a failed region can be routed to a neighbor. Quality degrades (latency, possibly fewer cars), but the service stays up.

Step 12: Recap of Key Decisions

Spatial partitioning via S2 cells. Turns continuous lat/lon into discrete shard keys. The single most important design choice.
In-memory Cell Index. Updates 1.25M times/sec. No disk-based store can match this.
Decoupled subsystems. Location ingest, dispatch, ETA, pricing, surge all scale independently.
Regional deployment. Latency and resilience.
Per-driver locks on dispatch. Prevents race conditions during the critical "who gets this driver?" step.
Hot cell mitigation built in. Replicas, sub-cell sharding, and caching for known dense areas.

The One Thing to Remember

The Uber problem is solved by a single key insight: discretize space. Once continuous lat/lon coordinates become discrete cell IDs, the system's hardest operations (find nearby, dispatch, surge) all become simple hash lookups. Most location-aware systems solve their version of this problem the same way: food delivery, scooter sharing, dating apps, ad targeting. The hard work isn't matching itself; it is the location update rate, the hot cells in dense urban areas, and the boundary cases at the edges of every cell. Get the partitioning right and the rest of the architecture follows naturally. Get it wrong and no amount of horizontal scaling will save you.