Why Data Pipelines Even Exist
Imagine you run an online store. Every second, things are happening. Customers click on products, add items to carts, complete purchases. Your warehouse system updates inventory. Your payment processor confirms transactions. Your support team logs tickets. Your marketing tools track ad clicks.
All of this generates data. Lots of it.
Now imagine your CEO asks a simple question on Monday morning: "How were sales last weekend, broken down by region?" Easy question. Hard answer. The data lives in 10 different places. Some of it is in your operational database. Some is in log files. Some is in your payment processor's system. Some is in your shipping provider's API.
To answer that question reliably, you need to collect data from all those places, clean it up, combine it, store it somewhere queryable, and serve it to whoever needs it. Doing this manually for every question is impossible.
A data pipeline is the system that does this automatically. It moves data from where it is created to where it can be analyzed, transformed, or used by other systems. It runs continuously, handles failures, and keeps the freshest version of the world available to the people and systems that depend on it.
Every modern company that does anything with data (AI, machine learning, analytics, dashboards, recommendations) has data pipelines underneath. Often dozens. Sometimes thousands.
The Five Phases of Every Data Pipeline
Almost every data pipeline, no matter how simple or complex, has the same five phases: collect, ingest, store, compute, and consume.
Let us walk through each phase with the e-commerce example.
Phase 1: Collect
This is where data is generated and pulled from its source. The challenge: data lives everywhere, in totally different formats, with different rates of change.
For our e-commerce store, sources might include:
Operational databases (PostgreSQL, MySQL) holding orders, customers, products.
Application logs spitting out every click, every error, every API call.
Third-party APIs like Stripe for payments, Shopify for products, Salesforce for customers.
Event streams from real-time systems like checkout events, cart updates.
IoT or device data if you have warehouse scanners or delivery trackers.
Files dropped by partners or vendors (CSVs, Excel, Parquet).
Each of these has a different way of being collected. Some you query (databases). Some you tail (logs). Some you subscribe to (event streams). Some you poll (APIs). Some you watch a folder for (file drops).
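To make the "query and poll" idea concrete, here is a minimal sketch of incremental collection from an operational database. It assumes a PostgreSQL orders table with an updated_at column; the table, columns, and connection details are hypothetical.

```python
# Minimal incremental "collect" sketch: pull only rows changed since the last run.
# Assumes a hypothetical PostgreSQL `orders` table with an `updated_at` column.
import psycopg2

def collect_new_orders(conn, last_watermark):
    """Fetch orders updated since the last watermark, returning rows plus a new watermark."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, customer_id, total, updated_at "
            "FROM orders WHERE updated_at > %s ORDER BY updated_at",
            (last_watermark,),
        )
        rows = cur.fetchall()
    # The new watermark is the latest timestamp we saw; persist it between runs.
    new_watermark = rows[-1][3] if rows else last_watermark
    return rows, new_watermark

conn = psycopg2.connect("dbname=shop user=etl")  # hypothetical credentials
rows, watermark = collect_new_orders(conn, last_watermark="2024-01-01 00:00:00")
```

The watermark pattern is what separates a real collector from a naive full-table dump: each run picks up only what changed since the last one.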
Phase 2: Ingest
Once you have collected data, you need to bring it into your data platform in a controlled way. This is ingestion.
Why is this its own phase? Because data does not come at a polite, predictable rate. Sometimes you get a flood (Black Friday traffic). Sometimes you get a trickle (3am Tuesday). Sometimes the source system has issues and stops sending data for an hour, then dumps everything at once.
Ingestion systems sit between your sources and your storage, absorbing these spikes and turning them into a steady, ordered stream. The most common pattern is a message queue or event streaming platform.
[Diagram: sources emit spiky traffic; without a buffer the downstream system crashes, while a queue buffers the spikes into a steady stream.]
The ingestion layer also gives you something else valuable: decoupling. The source does not need to know where the data ends up. It just writes to the queue. Multiple downstream systems can read from the same queue. If you add a new system tomorrow, it just subscribes. No source changes required.
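Here is what that decoupling looks like from the source's side, as a minimal sketch using kafka-python against a hypothetical local broker. The source writes one event and is done; any number of downstream consumers can subscribe to the topic independently.

```python
# Minimal ingestion sketch: the source just writes events to a topic and moves on.
# Broker address and topic name are illustrative.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# The source does not know (or care) who consumes this downstream.
producer.send("checkout-events", {"order_id": 123, "total": 59.90, "region": "EU"})
producer.flush()  # block until the event is actually handed to the broker
```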
Phase 3: Store
Now your data is flowing into your platform. You need to put it somewhere persistent. Where you put it depends on what you want to do with it.
There are three common types of storage in modern data platforms: the data warehouse, the data lake, and the lakehouse.
Quick rule of thumb:
If your data is structured and queries are well-defined, use a warehouse.
If your data is raw, semi-structured, and you do not know yet how it will be used, use a lake.
If you want both worlds, use a lakehouse.
Most large companies use all three. The lake holds raw data cheaply. The warehouse holds curated data for reporting. The lakehouse blurs the line between them.
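As a small illustration of the lake pattern, here is a sketch that lands raw events as date-partitioned Parquet using pandas (with pyarrow underneath). The local path stands in for an object store like S3.

```python
# Minimal "store" sketch: land raw events in a lake as date-partitioned Parquet.
# Assumes pandas + pyarrow are installed; the path is a local stand-in for s3://...
import pandas as pd

events = pd.DataFrame([
    {"order_id": 1, "region": "EU", "total": 59.90, "date": "2024-06-01"},
    {"order_id": 2, "region": "US", "total": 12.50, "date": "2024-06-02"},
])

# partition_cols writes one directory per date, so readers can skip irrelevant days
events.to_parquet("lake/raw/orders", partition_cols=["date"])
```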
Phase 4: Compute
Raw data is rarely useful as-is. It needs to be cleaned, transformed, joined with other data, and shaped into something a human or model can actually use. This is the compute phase.
Typical compute work for our e-commerce example (a few of these are sketched in code after the list):
Cleaning: remove duplicate orders, fix bad email addresses, standardize country codes.
Joining: combine orders with customer data and product data to make a unified view.
Aggregation: total sales per region per day. Average cart size. Top 10 products.
Format conversion: turn JSON logs into Parquet for fast querying.
Partitioning: split data by date, region, or other useful keys for faster reads.
Enrichment: add geolocation from IP, currency conversion, customer segments.
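A minimal batch sketch of three of these steps (cleaning, joining, aggregating) using pandas; the inline DataFrames stand in for tables read from storage.

```python
# Minimal batch "compute" sketch: dedupe orders, join with customers,
# and aggregate sales per region.
import pandas as pd

orders = pd.DataFrame([
    {"order_id": 1, "customer_id": 10, "total": 59.90},
    {"order_id": 1, "customer_id": 10, "total": 59.90},  # duplicate to be removed
    {"order_id": 2, "customer_id": 11, "total": 12.50},
])
customers = pd.DataFrame([
    {"customer_id": 10, "region": "EU"},
    {"customer_id": 11, "region": "US"},
])

clean = orders.drop_duplicates(subset=["order_id"])        # cleaning
joined = clean.merge(customers, on="customer_id")          # joining
sales_by_region = joined.groupby("region")["total"].sum()  # aggregation
print(sales_by_region)
```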
Compute happens in two main flavors: batch and streaming. The choice between them is one of the most important architectural decisions in any data platform, so let us look at it carefully.
Batch vs Streaming
Batch processing runs on a schedule over accumulated chunks of data; stream processing handles each event within seconds of arrival. Most real systems combine both. Streaming for things that need to be fresh (live order monitoring, fraud). Batch for everything else (daily reports, weekly retraining of ML models). The reason: streaming is more expensive and more error-prone, so you only use it where freshness actually matters.
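For a feel of the streaming flavor, here is a minimal sketch that consumes checkout events and keeps a running per-minute count, using kafka-python against a hypothetical broker. Real stream processors (Flink, Spark Structured Streaming) add windowing, state management, and fault tolerance on top of this idea.

```python
# Minimal streaming sketch: consume checkout events as they arrive and keep
# a per-minute order count in memory.
import json
from collections import Counter
from datetime import datetime
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "checkout-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

orders_per_minute = Counter()
for message in consumer:  # blocks, processing each event as it arrives
    minute = datetime.now().strftime("%H:%M")
    orders_per_minute[minute] += 1
    print(minute, orders_per_minute[minute])
```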
ETL vs ELT vs EtLT
You will hear these terms thrown around constantly. They describe the order of operations in your pipeline. ETL (extract, transform, load) cleans data before loading it into the warehouse. ELT (extract, load, transform) loads raw data first and transforms it inside the warehouse. EtLT adds a light transform (masking sensitive fields, basic cleanup) before the load, then does the heavy transformation afterwards.
The shift from ETL to ELT happened because cloud data warehouses became cheap and powerful. Older warehouses were expensive to scale and struggled with big transformations, so you cleaned data before loading it. Now warehouses scale almost infinitely and storage is cheap, so you load everything raw and transform inside.
This shift is one of the biggest changes in modern data engineering. Tools like dbt exist specifically to do transformation inside the warehouse using SQL.
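Here is the ELT idea in miniature: the raw data is already loaded, and the transform is plain SQL executed inside the warehouse. sqlite3 stands in for Snowflake or BigQuery; dbt essentially manages many such SQL models, with dependencies and tests on top.

```python
# Minimal ELT sketch: raw data is loaded first, then transformed with SQL
# inside the (stand-in) warehouse.
import sqlite3

wh = sqlite3.connect(":memory:")
wh.executescript("""
    CREATE TABLE raw_orders (order_id INT, region TEXT, total REAL);
    INSERT INTO raw_orders VALUES (1, 'EU', 59.90), (2, 'US', 12.50), (3, 'EU', 20.00);

    -- The 'T' of ELT: a curated table derived entirely in SQL, inside the warehouse.
    CREATE TABLE sales_by_region AS
    SELECT region, SUM(total) AS total_sales
    FROM raw_orders
    GROUP BY region;
""")
print(wh.execute("SELECT * FROM sales_by_region").fetchall())
```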
Lambda vs Kappa Architectures
When you combine batch and streaming, you face a question: should you build two separate pipelines (one batch, one streaming) or just one streaming pipeline that does everything?
This led to two famous architectural patterns.
Lambda Architecture
Two parallel layers. A batch layer processes complete, accurate data on a delay. A speed layer processes recent data in real time, even if not perfectly accurate. The serving layer combines both views.
[Diagram: the source feeds a batch layer and a speed layer in parallel; the serving layer combines both views.]
Kappa Architecture
One streaming layer that does everything. Reprocess historical data by replaying the stream from the beginning if needed.
[Diagram: the source feeds a single streaming layer that handles real-time processing and historical reprocessing alike.]
Which to use? If you can keep all your data in a replayable stream and your team is comfortable with stream processing, Kappa is simpler. If you have heavy historical analytics that work fine in batch, Lambda is more pragmatic.
In practice, most teams end up somewhere in the middle. Critical real-time paths (fraud, alerting) are streaming. Heavy analytics are batch. The two coexist.
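Kappa's reprocessing story is worth seeing in code. A minimal sketch with kafka-python: rewind the stream to offset zero and replay history through the same path that handles live events. The topic, partition, and broker are hypothetical, and this only works if the topic retains its full history.

```python
# Minimal Kappa-style reprocessing sketch: replay a topic from the beginning.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
partition = TopicPartition("checkout-events", 0)
consumer.assign([partition])
consumer.seek_to_beginning(partition)  # rewind to offset zero

for message in consumer:
    # In Kappa, this is the same handler that processes live traffic.
    print(message.offset, message.value)
```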
Phase 5: Consume
Finally, the data is clean, joined, aggregated, and stored. Now somebody needs to use it. The consume phase is about exposing your processed data to the people and systems that depend on it.
Common consumers (a small serving sketch follows the list):
BI dashboards like Tableau, Looker, Power BI for human-facing reports.
SQL clients for ad hoc analysis by analysts and data scientists.
Machine learning training pipelines that pull cleaned data to train models.
ML serving systems that need fresh features for real-time predictions.
User-facing apps that show personalized content, recommendations, or analytics.
External APIs exposing data to partners or customers.
Reverse ETL tools that push data back into operational systems (Salesforce, HubSpot, etc.).
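As one concrete serving example, here is a minimal sketch that exposes a curated table over HTTP with Flask; sqlite3 stands in for the warehouse, and the table name matches the earlier ELT sketch.

```python
# Minimal "consume" sketch: expose a curated table through a small HTTP API.
# In production this might be a BI tool, a feature store, or a reverse ETL sync.
import sqlite3
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/sales/<region>")
def sales(region):
    wh = sqlite3.connect("warehouse.db")  # hypothetical curated database
    row = wh.execute(
        "SELECT total_sales FROM sales_by_region WHERE region = ?", (region,)
    ).fetchone()
    return jsonify({"region": region, "total_sales": row[0] if row else 0})

if __name__ == "__main__":
    app.run(port=8000)
```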
This is also where data quality becomes visible. If your pipeline silently corrupted data, the dashboard shows wrong numbers, the ML model makes bad predictions, the recommendations are off. Consumers are the first to notice. That is why monitoring and validation throughout the pipeline matter so much.
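A minimal validation sketch, with illustrative thresholds: cheap sanity checks that make bad data fail loudly inside the pipeline instead of silently on a dashboard. Tools like Great Expectations and dbt tests formalize this pattern.

```python
# Minimal data quality sketch: validate a batch before publishing it downstream.
import pandas as pd

def validate_orders(df: pd.DataFrame) -> None:
    """Raise if the batch looks wrong; the checks here are illustrative."""
    assert len(df) > 0, "empty batch: upstream source may be down"
    assert df["order_id"].is_unique, "duplicate order_ids slipped through"
    assert df["total"].ge(0).all(), "negative order totals found"
    assert df["region"].notna().all(), "orders missing a region"

batch = pd.DataFrame([{"order_id": 1, "region": "EU", "total": 59.90}])
validate_orders(batch)  # raises AssertionError on bad data
```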
The Tools at Each Stage
The data ecosystem has exploded over the last decade. Here is a rough mapping of common tools to each phase:

Collect: Fivetran, Airbyte, Debezium, custom API connectors.
Ingest: Kafka, AWS Kinesis, Google Pub/Sub, RabbitMQ.
Store: S3, Snowflake, BigQuery, Redshift, Delta Lake on Databricks.
Compute: Spark, Flink, dbt, plus orchestrators like Airflow or Dagster.
Consume: Tableau, Looker, Power BI, feature stores, reverse ETL tools like Hightouch or Census.

You do not need to use all of these. A small team might use 3 to 5 tools total. A large company might use 30.
Common Failure Modes (And How to Handle Them)
Pipelines fail. Not occasionally, constantly. Source APIs go down. Schemas change unexpectedly. Bad data slips through. Network blips happen. Servers crash mid-job. Production data engineering is mostly about making the pipeline resilient to these failures.
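The workhorse defense against transient failures is retry with exponential backoff and jitter. A minimal sketch, with illustrative limits:

```python
# Minimal resilience sketch: retry a flaky source call with exponential backoff
# and jitter, so transient failures (network blips, brief API outages) heal
# themselves instead of killing the run.
import random
import time

def fetch_with_retries(fetch, max_attempts=5, base_delay=1.0):
    """Call fetch(); on failure, wait base_delay * 2^attempt (+ jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # out of retries: fail loudly so the orchestrator can alert
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)
```

The jitter matters: if every worker retries on the same schedule, they all hammer the recovering source at the same instant.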
When You Need a Pipeline vs When a Cron Job Works
Not every data movement needs a real pipeline. Sometimes a simple scheduled script is enough.
A cron job is fine when:
Data volume is small (megabytes, not gigabytes).
Failure can be handled by re-running the job.
The schedule is predictable (daily, hourly).
Only one source, one destination, simple transformation.
You need a real pipeline when (a sketch of what that buys you follows this list):
Data volume is large or growing fast.
Multiple sources need to be combined.
Some processing must be near real-time.
Failures must be handled automatically and audited.
Schemas evolve and need versioning.
Many downstream consumers depend on the data.
Data quality and lineage need to be tracked.
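To make the contrast concrete, here is a minimal Airflow 2.x-style sketch of what "a real pipeline" buys over a cron entry: scheduling, explicit dependencies, automatic retries, and an audited history of runs. The DAG id, schedule, and task bodies are illustrative stubs.

```python
# Minimal orchestration sketch: two dependent tasks with retries and a schedule.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from sources")

def transform():
    print("clean and aggregate")

with DAG(
    dag_id="daily_sales",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    extract_task = PythonOperator(
        task_id="extract",
        python_callable=extract,
        retries=3,                          # automatic, audited retry on failure
        retry_delay=timedelta(minutes=5),
    )
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task  # explicit dependency, unlike two cron entries
```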
The honest truth: most companies start with cron jobs and outgrow them painfully. The transition from "a few scripts" to "a real platform" is one of the hardest moments in a data team's life. If you can predict that you will outgrow scripts, invest in pipeline infrastructure earlier.
The One Thing to Remember
A data pipeline is not a single tool. It is a chain of stages, each handling a specific job, and the quality of the whole chain depends on the quality of every link.
The same five phases (collect, ingest, store, compute, consume) appear in almost every data platform, regardless of size or industry. The choices that matter are about how you implement each phase, what tools you pick, where you draw the line between batch and streaming, and how you handle failures.
The companies that get this right are the ones whose dashboards always show the right numbers, whose ML models always have fresh data, and whose decisions are grounded in reality. The ones who do not, eventually find themselves making important calls based on data they cannot trust.
Data pipelines are infrastructure. Boring when they work. Catastrophic when they do not. Worth investing in.