Why Data Pipelines Even Exist

Imagine you run an online store. Every second, things are happening. Customers click on products, add items to carts, complete purchases. Your warehouse system updates inventory. Your payment processor confirms transactions. Your support team logs tickets. Your marketing tools track ad clicks.

All of this generates data. Lots of it.

Now imagine your CEO asks a simple question on Monday morning: "How were sales last weekend, broken down by region?" Easy question. Hard answer. The data lives in 10 different places. Some of it is in your operational database. Some is in log files. Some is in your payment processor's system. Some is in your shipping provider's API.

To answer that question reliably, you need to collect data from all those places, clean it up, combine it, store it somewhere queryable, and serve it to whoever needs it. Doing this manually for every question is impossible.

A data pipeline is the system that does this automatically. It moves data from where it is created to where it can be analyzed, transformed, or used by other systems. It runs continuously, handles failures, and keeps the freshest version of the world available to the people and systems that depend on it.

Every modern company that does anything with data (AI, machine learning, analytics, dashboards, or recommendations) has data pipelines underneath. Often dozens. Sometimes thousands.

The Five Phases of Every Data Pipeline

Almost every data pipeline, no matter how simple or complex, has the same five phases:

1. Collect: get data from where it lives.
2. Ingest: bring it into your system.
3. Store: persist it somewhere durable.
4. Compute: clean, transform, aggregate.
5. Consume: make it useful to humans and systems.

Let us walk through each phase with the e-commerce example.

Phase 1: Collect

This is where data is generated and pulled from its source. The challenge: data lives everywhere, in totally different formats, with different rates of change.

For our e-commerce store, sources might include:

Operational databases (PostgreSQL, MySQL) holding orders, customers, products.
Application logs spitting out every click, every error, every API call.
Third-party APIs like Stripe for payments, Shopify for products, Salesforce for customers.
Event streams from real-time systems like checkout events, cart updates.
IoT or device data if you have warehouse scanners or delivery trackers.
Files dropped by partners or vendors (CSVs, Excel, Parquet).

Each of these has a different way of being collected. Some you query (databases). Some you tail (logs). Some you subscribe to (event streams). Some you poll (APIs). Some you watch a folder for (file drops).
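Polling an API, for example, usually means tracking a cursor so each poll only fetches what is new. Here is a minimal sketch in Python, with a hypothetical in-memory source standing in for a real API like Stripe or Shopify:

```python
# Hypothetical in-memory "source": records with creation timestamps.
# A real collector would make an HTTP call such as GET /orders?created_after=<cursor>.
SOURCE = [
    {"id": 1, "created_at": 100, "total": 19.99},
    {"id": 2, "created_at": 105, "total": 5.00},
    {"id": 3, "created_at": 110, "total": 42.50},
]

def fetch_since(cursor):
    """Stand-in for the API call: return records created after the cursor."""
    return [r for r in SOURCE if r["created_at"] > cursor]

def poll_once(cursor):
    """One polling cycle: fetch new records, then advance the cursor."""
    batch = fetch_since(cursor)
    if batch:
        cursor = max(r["created_at"] for r in batch)
    return batch, cursor

# The first poll sees everything; the advanced cursor makes later polls incremental.
batch, cursor = poll_once(0)
assert len(batch) == 3 and cursor == 110
batch, cursor = poll_once(cursor)
assert batch == []  # nothing new since the last poll
```

The cursor is the whole trick: persist it between runs and the collector becomes incremental instead of re-reading the full history every time.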

Phase 2: Ingest

Once you have collected data, you need to bring it into your data platform in a controlled way. This is ingestion.

Why is this its own phase? Because data does not come at a polite, predictable rate. Sometimes you get a flood (Black Friday traffic). Sometimes you get a trickle (3am Tuesday). Sometimes the source system has issues and stops sending data for an hour, then dumps everything at once.

Ingestion systems sit between your sources and your storage, absorbing these spikes and turning them into a steady, ordered stream. The most common pattern is a message queue or event streaming platform.

Why You Need an Ingestion Layer

Without an ingestion buffer: Source → (spiky traffic) → Storage. Direct writes from spiky sources can overwhelm storage and crash it during peaks.

With an ingestion buffer: Source → (spiky traffic) → Queue → (steady rate) → Storage. The queue absorbs spikes, smooths the flow, and gives storage a constant rate.

The ingestion layer also gives you something else valuable: decoupling. The source does not need to know where the data ends up. It just writes to the queue. Multiple downstream systems can read from the same queue. If you add a new system tomorrow, it just subscribes. No source changes required.
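The buffering behavior is easy to sketch in plain Python, with a deque standing in for Kafka or Kinesis (the event shapes and rates here are invented for illustration):

```python
from collections import deque

class Buffer:
    """Tiny stand-in for a message queue: producers append, consumers drain
    at their own pace, so a traffic spike never hits storage directly."""
    def __init__(self):
        self.q = deque()

    def publish(self, event):
        self.q.append(event)  # the source only knows about the queue

    def drain(self, max_items):
        out = []
        while self.q and len(out) < max_items:
            out.append(self.q.popleft())
        return out

buf = Buffer()
# Spike: 10 events arrive at once (think Black Friday checkouts).
for i in range(10):
    buf.publish({"event": "checkout", "order_id": i})

# Storage pulls at a steady rate of 3 events per tick.
ticks = 0
while buf.q:
    buf.drain(3)
    ticks += 1
assert ticks == 4  # 3 + 3 + 3 + 1: the spike was smoothed over four ticks
```

A real queue adds durability and replay on top of this, but the core idea is the same: the writer's rate and the reader's rate are decoupled.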

Phase 3: Store

Now your data is flowing into your platform. You need to put it somewhere persistent. Where you put it depends on what you want to do with it.

There are three common types of storage in modern data platforms:

Data Warehouse: structured, schema-on-write, optimized for SQL analytics. Best for business intelligence, reporting, and dashboards. Examples: Snowflake, BigQuery, Redshift.

Data Lake: raw files in any format, schema-on-read, cheap. Best for ML training data, raw logs, and archives. Examples: S3, Azure Data Lake, GCS.

Data Lakehouse: a hybrid offering lake economics with warehouse query speed. Best for unified analytics and ML on the same data. Examples: Databricks, Iceberg, Delta Lake.

Quick rule of thumb:

If your data is structured and queries are well-defined, use a warehouse.
If your data is raw, semi-structured, and you do not know yet how it will be used, use a lake.
If you want both worlds, use a lakehouse.

Most large companies use all three. The lake holds raw data cheaply. The warehouse holds curated data for reporting. The lakehouse blurs the line between them.
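One storage convention worth knowing: lakes typically organize raw files into Hive-style partitioned paths, so query engines can skip whole directories when filtering by date or region. A sketch of that layout (the table and file names are made up):

```python
from datetime import date

def partition_path(table, d, region):
    """Hive-style partition layout commonly used in data lakes (S3/GCS keys)."""
    return f"{table}/dt={d.isoformat()}/region={region}/part-0000.parquet"

p = partition_path("orders", date(2024, 6, 1), "EU")
assert p == "orders/dt=2024-06-01/region=EU/part-0000.parquet"
```

A query like "EU sales on June 1" then only touches files under that one prefix instead of scanning the whole table.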

Phase 4: Compute

Raw data is rarely useful as-is. It needs to be cleaned, transformed, joined with other data, and shaped into something a human or model can actually use. This is the compute phase.

Typical compute work for our e-commerce example:

Cleaning: remove duplicate orders, fix bad email addresses, standardize country codes.
Joining: combine orders with customer data and product data to make a unified view.
Aggregation: total sales per region per day. Average cart size. Top 10 products.
Format conversion: turn JSON logs into Parquet for fast querying.
Partitioning: split data by date, region, or other useful keys for faster reads.
Enrichment: add geolocation from IP, currency conversion, customer segments.
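In production this would be a Spark or dbt job, but the clean-join-aggregate logic fits in a few lines of plain Python. The records below are made up for illustration:

```python
from collections import defaultdict

# Hypothetical raw records, as a collector might deliver them.
orders = [
    {"order_id": 1, "customer_id": "c1", "total": 20.0, "day": "2024-06-01"},
    {"order_id": 1, "customer_id": "c1", "total": 20.0, "day": "2024-06-01"},  # duplicate
    {"order_id": 2, "customer_id": "c2", "total": 35.0, "day": "2024-06-01"},
]
customers = {"c1": {"region": "EU"}, "c2": {"region": "US"}}

# Cleaning: drop duplicate orders by primary key.
seen, clean = set(), []
for o in orders:
    if o["order_id"] not in seen:
        seen.add(o["order_id"])
        clean.append(o)

# Joining: attach each customer's region to their orders.
for o in clean:
    o["region"] = customers[o["customer_id"]]["region"]

# Aggregation: total sales per region per day.
sales = defaultdict(float)
for o in clean:
    sales[(o["region"], o["day"])] += o["total"]

assert sales[("EU", "2024-06-01")] == 20.0
assert sales[("US", "2024-06-01")] == 35.0
```

The frameworks exist to run exactly this shape of logic over billions of rows instead of three.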

Compute happens in two main flavors: batch and streaming. The choice between them is one of the most important architectural decisions in any data platform, so let us look at it carefully.

Batch vs Streaming

Batch Processing: process a lot at once, every so often.
How it works: collect data over a window (1 hour, 1 day), then process it all at once.
Latency: minutes to hours.
Cost: cheap; you only run compute when needed.
Complexity: simple, like a scheduled job.
Use cases: daily sales reports, ML training, billing.
Tools: Spark, Hadoop, Airflow, dbt.

Stream Processing: process every event as it arrives.
How it works: each event is processed as soon as it shows up.
Latency: milliseconds to seconds.
Cost: higher; compute runs continuously.
Complexity: hard; late data, ordering, and state management.
Use cases: fraud detection, live dashboards, real-time alerts.
Tools: Kafka Streams, Flink, Spark Streaming.

Most real systems combine both. Streaming for things that need to be fresh (live order monitoring, fraud). Batch for everything else (daily reports, weekly retraining of ML models). The reason: streaming is more expensive and more error-prone, so you only use it where freshness actually matters.
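The core difference is that a stream processor keeps state across events. A tumbling-window count, roughly the "hello world" of stream processing, can be sketched like this (Flink or Kafka Streams would add fault-tolerant state, watermarks, and late-data handling on top):

```python
from collections import defaultdict

WINDOW = 60  # tumbling one-minute windows, keyed by window start time

def window_start(ts):
    """Map an event timestamp (seconds) to the start of its window."""
    return ts - (ts % WINDOW)

# State lives across events; a batch job has no equivalent of this.
counts = defaultdict(int)

def process(event):
    """Handle one event as it arrives: bump the count for its window."""
    counts[window_start(event["ts"])] += 1

for ts in [3, 10, 59, 61, 125]:  # event timestamps in seconds
    process({"ts": ts})

assert counts[0] == 3    # events at 3, 10, 59 fall in window [0, 60)
assert counts[60] == 1   # the event at 61 falls in [60, 120)
assert counts[120] == 1
```

Everything that makes streaming hard lives in the gap between this sketch and production: what to do when the event at ts=59 arrives after the window has closed, and how to keep `counts` from being lost when the process crashes.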

ETL vs ELT vs EtLT

You will hear these terms thrown around constantly. They describe the order of operations in your pipeline.

ETL: Extract → Transform → Load. Old school. Transform data before loading it. The transformation happens on a separate compute server. The warehouse only ever sees clean, ready data.

ELT: Extract → Load → Transform. Modern. Load raw data into the warehouse first, then transform inside it using SQL. Cloud warehouses are powerful enough to handle the transformation work.

EtLT: Extract → light transform → Load → Transform. Hybrid. Light cleanup on extract (PII removal, format normalization), then load, then heavy transforms in the warehouse.

The shift from ETL to ELT happened because cloud data warehouses became cheap and powerful. Older warehouses were expensive to scale and struggled with heavy transformations, so you cleaned data on separate compute before loading. Now warehouses scale almost infinitely and storage is cheap, so you load everything raw and transform inside.

This shift is one of the biggest changes in modern data engineering. Tools like dbt exist specifically to do transformation inside the warehouse using SQL.
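The ELT pattern is easy to demonstrate with SQLite standing in for the cloud warehouse; the table names here are invented for the example:

```python
import sqlite3

# SQLite stands in for a cloud warehouse like BigQuery or Snowflake.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE raw_orders (order_id INT, region TEXT, total REAL)")

# E + L: extract from the source and load it completely raw, no cleanup yet.
db.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "EU", 20.0), (2, "US", 35.0), (3, "US", 15.0)],
)

# T: transform inside the warehouse with SQL, the way a dbt model would.
db.execute("""
    CREATE TABLE sales_by_region AS
    SELECT region, SUM(total) AS total_sales
    FROM raw_orders
    GROUP BY region
""")

rows = dict(db.execute("SELECT region, total_sales FROM sales_by_region"))
assert rows == {"EU": 20.0, "US": 50.0}
```

Note that `raw_orders` is never modified: the raw layer stays intact, and the curated table is derived from it. That is the key ELT property, and it means you can rebuild the derived table whenever the transformation logic changes.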

Lambda vs Kappa Architectures

When you combine batch and streaming, you face a question: should you build two separate pipelines (one batch, one streaming) or just one streaming pipeline that does everything?

This led to two famous architectural patterns.

Lambda Architecture

Two parallel layers. A batch layer processes complete, accurate data on a delay. A speed layer processes recent data in real time, even if not perfectly. The serving layer combines both views.

Lambda Architecture: Source → Batch Layer (store all data, process periodically: accurate, slow) and Source → Speed Layer (stream events, real-time compute: approximate, fast). Both feed a Serving Layer that combines the two views.

Good: both fresh and accurate. Mature, proven.
Bad: two systems to build, maintain, and keep in sync. Logic duplicated.

Kappa Architecture

One streaming layer that does everything. Reprocess historical data by replaying the stream from the beginning if needed.

Kappa Architecture: Source → Streaming Layer (handles real-time processing AND historical reprocessing) → Serving Layer.

Good: one system. Logic written once. Simpler operations.
Bad: all complexity in one place. Needs a stream that supports replay (like Kafka). Stream processing is harder than batch.
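Kappa hinges on one property: the stream must be replayable. A toy version of that property, with a Python list standing in for a Kafka topic:

```python
class ReplayableLog:
    """Minimal stand-in for a Kafka topic: an append-only log that any
    consumer can re-read from offset 0 to reprocess history."""
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

    def read(self, from_offset=0):
        return self.events[from_offset:]

log = ReplayableLog()
for total in [10.0, 20.0, 30.0]:
    log.append({"total": total})

def running_total(events):
    """The single stream job; a v2 of the logic would simply replay from 0."""
    return sum(e["total"] for e in events)

assert running_total(log.read()) == 60.0               # full historical reprocess
assert running_total(log.read(from_offset=2)) == 30.0  # resume mid-stream
```

Fixing a bug in the job means deploying the new code and replaying from offset 0, instead of maintaining a second batch codebase for reprocessing.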

Which to use? If you can keep all your data in a replayable stream and your team is comfortable with stream processing, Kappa is simpler. If you have heavy historical analytics that work fine in batch, Lambda is more pragmatic.

In practice, most teams end up somewhere in the middle. Critical real-time paths (fraud, alerting) are streaming. Heavy analytics are batch. The two coexist.

Phase 5: Consume

Finally, the data is clean, joined, aggregated, and stored. Now somebody needs to use it. The consume phase is about exposing your processed data to the people and systems that depend on it.

Common consumers:

BI dashboards like Tableau, Looker, Power BI for human-facing reports.
SQL clients for ad hoc analysis by analysts and data scientists.
Machine learning training pipelines that pull cleaned data to train models.
ML serving systems that need fresh features for real-time predictions.
User-facing apps that show personalized content, recommendations, or analytics.
External APIs exposing data to partners or customers.
Reverse ETL tools that push data back into operational systems (Salesforce, Hubspot, etc).

This is also where data quality becomes visible. If your pipeline silently corrupted data, the dashboard shows wrong numbers, the ML model makes bad predictions, the recommendations are off. Consumers are the first to notice. That is why monitoring and validation throughout the pipeline matter so much.

The Tools at Each Stage

The data ecosystem has exploded over the last decade. Here is a rough mapping of common tools to each phase. You do not need to use all of these. A small team might use 3 to 5 tools total. A large company might use 30.

Collect: Fivetran, Airbyte, Stitch, Debezium, custom scripts.
Ingest: Kafka, Kinesis, Pulsar, RabbitMQ, Pub/Sub.
Store: Snowflake, BigQuery, Redshift, S3, Databricks, Iceberg, Delta Lake.
Compute: Spark, Flink, dbt, Airflow, Dagster, Prefect, Beam.
Consume: Tableau, Looker, Power BI, Metabase, Hightouch, Census.

Common Failure Modes (And How to Handle Them)

Pipelines fail. Not occasionally, constantly. Source APIs go down. Schemas change unexpectedly. Bad data slips through. Network blips happen. Servers crash mid-job. Production data engineering is mostly about making the pipeline resilient to these failures.

Back-pressure: downstream systems cannot keep up with the upstream data flow. Solution: a queue with proper sizing, consumer parallelism, and throttling at the source if needed.

Dead letter queue: bad messages that cannot be processed. Instead of crashing the whole pipeline, route them to a separate queue for inspection and replay later.

Idempotency: pipelines often retry on failure, which means the same record might arrive twice. Make every operation safe to run multiple times, so the same input always produces the same output.

Late-arriving data: events from yesterday that show up today. Handle with windowing strategies, watermarks, or backfill jobs that revisit old partitions.

Schema drift: a source system adds a new column or changes a type without warning. Use schema registries, validation at ingest, and alerting on unexpected changes.

Silent data loss: records dropped without errors. The worst kind of failure, because nobody notices. Solution: row counts, checksums, and end-to-end audits comparing source totals to destination totals.
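Idempotency in particular is worth internalizing, because retries are everywhere. The usual trick is to key writes by a natural primary key, so a redelivered message overwrites rather than duplicates. A minimal sketch (the record shape is invented):

```python
# State keyed by a natural primary key makes the write idempotent:
# applying the same record twice leaves the store unchanged.
store = {}

def upsert(record):
    """Idempotent write: keyed by order_id, so retries are harmless."""
    store[record["order_id"]] = record

msg = {"order_id": 42, "total": 99.0}
upsert(msg)
upsert(msg)  # a retry after a timeout delivers the same record again

assert len(store) == 1
assert store[42]["total"] == 99.0
```

Contrast this with an append-only insert, where the retry would have produced a duplicate row and a double-counted sale in every downstream report.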

When You Need a Pipeline vs When a Cron Job Works

Not every data movement needs a real pipeline. Sometimes a simple scheduled script is enough.

A cron job is fine when:

Data volume is small (megabytes, not gigabytes).
Failure can be handled by re-running the job.
The schedule is predictable (daily, hourly).
Only one source, one destination, simple transformation.

You need a real pipeline when:

Data volume is large or growing fast.
Multiple sources need to be combined.
Some processing must be near real-time.
Failures must be handled automatically and audited.
Schemas evolve and need versioning.
Many downstream consumers depend on the data.
Data quality and lineage need to be tracked.

The honest truth: most companies start with cron jobs and outgrow them painfully. The transition from "a few scripts" to "a real platform" is one of the hardest moments in a data team's life. If you can predict that you will outgrow scripts, invest in pipeline infrastructure earlier.

The One Thing to Remember

A data pipeline is not a single tool. It is a chain of stages, each handling a specific job, and the quality of the whole chain depends on the quality of every link.

The same five phases (collect, ingest, store, compute, consume) appear in almost every data platform, regardless of size or industry. The choices that matter are about how you implement each phase, what tools you pick, where you draw the line between batch and streaming, and how you handle failures.

The companies that get this right are the ones whose dashboards always show the right numbers, whose ML models always have fresh data, and whose decisions are grounded in reality. The ones who do not, eventually find themselves making important calls based on data they cannot trust.

Data pipelines are infrastructure. Boring when they work. Catastrophic when they do not. Worth investing in.