The Problem That Led Here

For decades, companies that wanted to do anything serious with data had to pick between two very different tools.

On one side, the data warehouse. Fast, structured, reliable. Perfect for finance reports, executive dashboards, anything where you need clean, trustworthy answers from SQL queries. The downside: expensive, rigid, and not built for raw or unstructured data.

On the other side, the data lake. Cheap, flexible, massive. You can dump anything into it: logs, images, JSON blobs, video, IoT data. Perfect for machine learning and exploration. The downside: no schema, no transactions, no consistency guarantees. It is easy to fill a lake with garbage.

For years, large organizations ran both. Operational data flowed into the lake for archival and ML training. Curated business data lived in the warehouse for analytics. Two pipelines, two governance models, two sets of quality checks, two bills.

The Two-System Problem (diagram): data sources are copied into both systems. The data warehouse (structured, fast, expensive) serves analysts and BI; the data lake (raw, cheap, flexible) serves ML and engineers. The pain: the same data in two places, two quality checks, two governance models, two bills, and constant sync drift.

This was the state of the world for over a decade. Then around 2020, a different approach started to take hold: what if you could keep one shared storage layer, but get the reliability features of a warehouse on top of it? That idea is the data lakehouse.

Warehouse vs Lake vs Lakehouse: Side by Side

Before going deeper, here is how the three approaches compare directly:

                    Data Warehouse       Data Lake            Data Lakehouse
Storage Cost        High                 Very low             Very low
Data Format         Structured only      Anything             Anything (with structured tables on top)
Schema              Schema-on-write      Schema-on-read       Both supported
ACID Transactions   Yes                  No                   Yes
Query Speed         Fast                 Slow without tuning  Fast (with the right table format)
ML Workloads        Limited              Excellent            Excellent
Operational Burden  Low (managed)        Medium               High (more moving parts)
Examples            Snowflake, BigQuery  S3, ADLS, GCS        Databricks, Iceberg, Delta Lake

What Exactly Is a Lakehouse?

A data lakehouse is an architecture, not a product. It is built by stacking several technologies together so that one shared storage layer can serve both warehouse-style and lake-style workloads.

The core promise: one copy of the data, reliable and structured for analytics, raw and flexible for ML, all stored cheaply in object storage.

To make this work, you need three layers stacked on top of each other.

The Three Building Blocks

Lakehouse Architecture Stack (read top to bottom for query flow, bottom to top for build order):

Compute Engines (Spark, Trino, Flink, Snowflake, DuckDB): query and process data from multiple tools.
Shared Catalog (Unity Catalog, AWS Glue, Hive Metastore, Polaris): the single source of truth for table names, schemas, and locations.
Open Table Format (Apache Iceberg, Delta Lake, Apache Hudi): adds ACID transactions, snapshots, and time travel.
Object Storage (S3, Azure Data Lake Storage, GCS): cheap, durable storage for files (Parquet, ORC, Avro).

Each layer adds capabilities to the layer below it.

Block 1: Object Storage

The foundation. Cheap, virtually infinite, durable. Think of it as a giant hard drive in the cloud where you can dump any file: raw JSON logs, polished Parquet tables, images, videos, anything.

The most popular object stores are Amazon S3, Azure Data Lake Storage (ADLS), and Google Cloud Storage (GCS).

Why object storage is the right foundation:

It is cheap (often around $0.02 per GB per month).
It is durable (S3 promises 11 nines of durability).
It scales infinitely without manual provisioning.
It supports any file format.
Multiple compute engines can read from it without copying data.

The catch: object storage by itself has no concept of tables, transactions, or consistency. If two processes write to the same path at the same time, you can end up with corrupted or partial data. That is why you need the next layer.
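To make concrete what "just files in object storage" means, here is a minimal sketch of writing and reading a Parquet file on S3 with pyarrow. The bucket, path, and region are placeholders, and credentials are assumed to come from the environment; the point is that at this layer there are only files, with no table, schema, or transaction semantics.

    # Minimal sketch: object storage only stores files. There is no notion of
    # a table, a schema registry, or a transaction at this layer.
    # Assumes pyarrow is installed and AWS credentials are available in the
    # environment; the bucket, path, and region below are hypothetical.
    import pyarrow as pa
    import pyarrow.parquet as pq
    import pyarrow.fs as pafs

    s3 = pafs.S3FileSystem(region="us-east-1")

    table = pa.table({"user_id": [1, 2, 3], "event": ["view", "click", "view"]})

    # Write a single Parquet file to a path. Nothing stops another job from
    # writing to the same path at the same moment; that coordination is
    # exactly what the table format layer adds.
    pq.write_table(table, "my-bucket/raw/events/part-000.parquet", filesystem=s3)

    # Read it back; any engine that can reach the bucket can do the same.
    events = pq.read_table("my-bucket/raw/events/part-000.parquet", filesystem=s3)
    print(events.num_rows)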

Block 2: Open Table Formats

This is the breakthrough that made lakehouses possible. Open table formats sit on top of raw files in object storage and add database-like properties to them.

The three main contenders:

Apache Iceberg: originally from Netflix. Strengths: strong schema evolution, hidden partitioning, broad engine support. Used by Apple, Netflix, Adobe, and Stripe.
Delta Lake: originally from Databricks. Strengths: tightest integration with Spark, mature ecosystem, optimized for streaming. Used by Databricks customers and many enterprises.
Apache Hudi: originally from Uber. Strengths: optimized for upserts and streaming ingestion at scale. Used by Uber, Robinhood, and Walmart.

What all three give you:

ACID transactions: writes are atomic. No more partial files. No more "wait, did that job actually finish?"
Snapshots: every write creates a new version of the table. You can read the table as it was 10 minutes ago, or 10 days ago.
Time travel: roll back a bad ingest job by querying yesterday's snapshot.
Schema evolution: add columns, drop columns, change types, without rewriting the entire table.
Hidden partitioning: users do not need to know how the data is physically partitioned to write efficient queries.
Commit history: every change to the table is logged, like git for your data.

These are warehouse-grade features. But they are now available on top of cheap object storage. That is the magic.
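Here is a minimal sketch of what two of these features (atomic writes and time travel) look like in practice, using Delta Lake with PySpark. The table path is a placeholder and the session configuration is illustrative; Iceberg and Hudi expose the same ideas with their own syntax.

    # Sketch of ACID writes plus time travel with Delta Lake on PySpark.
    # Assumes the delta-spark package is installed; the path is hypothetical.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("lakehouse-sketch")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    path = "s3://my-bucket/tables/orders"  # hypothetical location

    # Each write is an atomic commit that produces a new table version.
    df = spark.createDataFrame([(1, "pending"), (2, "shipped")], ["order_id", "status"])
    df.write.format("delta").mode("append").save(path)

    # Time travel: read the table as it was at an earlier version or timestamp.
    earlier = (
        spark.read.format("delta")
        .option("versionAsOf", 0)   # or .option("timestampAsOf", "2024-01-01")
        .load(path)
    )
    earlier.show()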

Block 3: Shared Catalog

You have your storage. You have your table format. But you still need to answer questions like:

"What tables exist?"
"Where does the customers table actually live?"
"What columns does it have?"
"Who is allowed to read it?"

That is the job of the catalog. It is the directory that maps human-readable table names to physical locations and metadata. Without it, every tool would have to know the exact storage path of every table, which becomes impossible at scale.

The catalog is the single source of truth that lets multiple engines collaborate on the same data. Spark reads from it during ingestion. Trino reads from it for SQL queries. A BI tool reads from it for dashboards. They all see the same table definitions, so nobody gets lost.

The Catalog Connects Everything (diagram): Spark (ingestion), Trino (SQL queries), Flink (streaming), and BI tools (dashboards) all reference the same shared catalog of table names, schemas, locations, and permissions, which in turn points to the actual data files in object storage.

Common catalog options:

AWS Glue Data Catalog: the default for AWS-based lakehouses.
Hive Metastore: the original, still widely used in legacy stacks.
Databricks Unity Catalog: tightly integrated with Databricks, with strong governance.
Apache Polaris: open source, recent, designed for Iceberg.
Project Nessie: brings git-like branching to data catalogs.
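With those options in mind, here is a minimal sketch of what using a shared catalog looks like from an engine's point of view, using Spark with an Iceberg catalog. The catalog name, REST endpoint, warehouse path, and table names are placeholders; the sketch assumes the Iceberg Spark runtime is on the classpath.

    # Sketch: point a Spark session at a shared Iceberg catalog, then refer to
    # tables by name. The catalog name, URI, warehouse, and tables below are
    # hypothetical; the spark.sql.catalog.* keys are how Iceberg's Spark
    # integration is configured.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("catalog-sketch")
        .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.lakehouse.type", "rest")
        .config("spark.sql.catalog.lakehouse.uri", "https://catalog.example.com")
        .config("spark.sql.catalog.lakehouse.warehouse", "s3://my-bucket/warehouse")
        .getOrCreate()
    )

    # Because the catalog resolves names to schemas and physical locations,
    # no engine needs to know the storage path of any table.
    spark.sql("SHOW TABLES IN lakehouse.sales").show()
    spark.sql("SELECT * FROM lakehouse.sales.customers LIMIT 10").show()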

Governance and Security

Once you have multiple teams using the same lakehouse, you need rules. Who can read what? Who can write what? Where did this data come from? Was sensitive data masked correctly? This is the governance layer, and it sits across the whole stack.

The Governance Layer:

Access Control: row-level, column-level, and table-level permissions, tied to user identity from your SSO provider.
Lineage Tracking: for any column, know exactly where its values came from and which downstream tables depend on it.
PII Protection: auto-detect sensitive fields and apply masking, encryption, or tokenization before unauthorized eyes see them.
Quality Checks: rules that data must satisfy, alerts when constraints break, and blocking of bad data before it reaches consumers.
Data Discovery: a searchable directory of every table and column with descriptions, owners, and tags.
Retention & Deletion: enforce policies for how long data lives. Critical for GDPR, CCPA, and other regulations.

Common tools that provide this layer:

AWS Lake Formation: permissions and lineage for AWS-based lakehouses.
Databricks Unity Catalog: all-in-one governance for Databricks.
Apache Atlas: open source, traditional Hadoop ecosystem.
OpenLineage: open standard for tracking data lineage across tools.
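To make one slice of this concrete (the quality-check piece), here is a hedged sketch of a pre-write quality gate in PySpark. The column names and rules are hypothetical; in practice such rules are usually declared in the catalog or a dedicated constraints or expectations tool rather than hand-rolled.

    # Sketch of a quality gate that blocks bad data before it reaches a
    # curated table. Columns and rules are hypothetical placeholders.
    from pyspark.sql import DataFrame
    from pyspark.sql import functions as F

    def quality_gate(df: DataFrame) -> None:
        """Raise if the batch violates basic constraints, blocking the write."""
        null_ids = df.filter(F.col("customer_id").isNull()).count()
        if null_ids:
            raise ValueError(f"{null_ids} rows with null customer_id; write blocked")

        negative = df.filter(F.col("amount") < 0).count()
        if negative:
            raise ValueError(f"{negative} rows with negative amount; write blocked")

    # Usage (hypothetical table path):
    # quality_gate(batch_df)
    # batch_df.write.format("delta").mode("append").save("s3://my-bucket/tables/payments")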

This layer is often what differentiates a "data swamp" (a lake nobody trusts) from a real lakehouse that the business depends on.

The Trade-offs Nobody Tells You About

The lakehouse story sounds great in marketing slides. In production, there are real costs.

You Take On Operational Work

A managed warehouse like Snowflake handles everything: storage, compute, query optimization, indexing, vacuuming. You hand it SQL and pay the bill.

A lakehouse is more like Lego. You assemble it from open components. You get more flexibility and lower costs, but you are now responsible for keeping the parts working together. That includes:

The Small File Problem

Object storage performs poorly when you have millions of tiny files. A typical streaming write produces lots of small Parquet files. Over time, query performance degrades because the engine has to open thousands of files to read a single table.

The fix is to run periodic compaction jobs that merge small files into bigger ones. This is invisible to users but essential for performance. Lakehouse table formats provide tools for this (Delta has OPTIMIZE, Iceberg has rewrite_data_files), but you have to remember to run them.
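As a rough sketch of what that maintenance looks like, here is how those compaction commands might be run from a scheduled job. The table and catalog names are placeholders, and the Spark session is assumed to already be configured for the relevant format, as in the earlier sketches.

    # Sketch of a scheduled compaction job. Assumes a SparkSession already
    # configured for Delta or Iceberg; table and catalog names are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Delta Lake: merge small files into larger ones, optionally clustering
    # frequently filtered columns together.
    spark.sql("OPTIMIZE sales.orders")
    # spark.sql("OPTIMIZE sales.orders ZORDER BY (customer_id)")

    # Iceberg: the equivalent maintenance procedure, invoked through the catalog.
    spark.sql("CALL lakehouse.system.rewrite_data_files(table => 'sales.orders')")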

Schema Stability Across Tools

Multiple engines reading the same table need to agree on what the schema is. If Spark writes a new column but Trino's catalog cache is stale, queries can fail or return wrong results.

Coordinating schema changes across teams and tools requires discipline: schema versioning, deprecation timelines, communication. Much of this is organizational rather than technical, but it is real work.

Concurrent Writers

When two jobs write to the same table at the same time, the table format has to handle the conflict. All three (Iceberg, Delta, Hudi) handle this, but you need to understand their conflict resolution strategy and design your pipelines around it.
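A common way to design around this is optimistic retry: attempt the commit, and if the format reports a conflicting concurrent write, back off and try again. The sketch below shows the shape of that pattern around a Delta-style write; the exact exception class to catch depends on the format, so a generic handler is used here only to keep the example self-contained.

    import time

    # Sketch of an optimistic-concurrency retry loop around a table write.
    # In practice you would catch the format's specific conflict exception
    # (Delta and Iceberg each define their own); the path is hypothetical.
    def write_with_retry(df, path, attempts=3):
        for attempt in range(1, attempts + 1):
            try:
                df.write.format("delta").mode("append").save(path)
                return
            except Exception:
                if attempt == attempts:
                    raise                 # give up after the last attempt
                time.sleep(2 ** attempt)  # back off before retrying the commit

    # write_with_retry(batch_df, "s3://my-bucket/tables/orders")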

Cost Optimization

Storage is cheap. Compute is not. With a lakehouse, you control which engine you use for which workload. That can be a big advantage (use cheap engines for batch, expensive ones only for interactive queries) but it also means you need to actively manage the cost. With a managed warehouse, the vendor optimizes for you.

Decision Tree: Which Should You Pick?

Not every team needs a lakehouse. Pick based on your actual situation, not the trend.

Picking the Right Architecture:

Do you have unstructured or ML data alongside structured analytics data?

No -> Do you want zero infrastructure work?
    Yes -> Data Warehouse. Snowflake, BigQuery. Pay a premium, get simplicity.
    No -> Lakehouse (lighter setup). Just structured tables on Iceberg or Delta. Save on storage.

Yes -> Do you have a dedicated data engineering team?
    No -> Warehouse + Lake. Run them separately. Less elegant, but operationally simpler than a full lakehouse.
    Yes -> Full Lakehouse. Best of both worlds. Single source of truth. Maintenance is your responsibility.

When to Choose What

Pick a Data Warehouse if: you mostly do structured SQL analytics, you do not have heavy ML or unstructured data needs, and you would rather pay more for less operational burden. Snowflake, BigQuery, and Redshift are battle-tested for this.

Pick a Data Lake if: you primarily need cheap storage for raw data and ML training. You are okay with limited transactional guarantees, and you do not need fast SQL on top of it. S3 plus Spark is enough for many teams.

Pick a Data Lakehouse if: you need both worlds. You have varied workloads (analytics, ML, streaming, batch). You have a dedicated engineering team that can handle the maintenance. You want to avoid duplicating data and pipelines. The savings on storage and the unification of governance pay off at scale.

The One Thing to Remember

The lakehouse is not just a new product. It is a real architectural shift made possible by three things coming together: cheap object storage, open table formats that brought ACID to the lake, and shared catalogs that let multiple engines collaborate.

It eliminates the longstanding gap between two parallel worlds. But it does so by pushing more responsibility onto the team building it. There is no free lunch. You trade a managed vendor for your own engineering work, and lower infrastructure bills for more operational complexity.

For organizations with the right scale and the right team, that trade is worth it. For smaller teams, a managed warehouse is still often the better answer. Knowing your situation honestly is more important than picking the trendiest architecture.