The Problem That Led Here
For decades, companies that wanted to do anything serious with data had to pick between two very different tools.
On one side, the data warehouse. Fast, structured, reliable. Perfect for finance reports, executive dashboards, anything where you need clean, trustworthy answers from SQL queries. The downside: expensive, rigid, and not built for raw or unstructured data.
On the other side, the data lake. Cheap, flexible, massive. You can dump anything into it: logs, images, JSON blobs, video, IoT data. Perfect for machine learning and exploration. The downside: no schema, no transactions, no consistency guarantees. It is easy to fill a lake with garbage.
For years, large organizations ran both. Operational data flowed into the lake for archival and ML training. Curated business data lived in the warehouse for analytics. Two pipelines, two governance models, two sets of quality checks, two bills.
This was the state of the world for over a decade. Then around 2020, a different approach started to take hold: what if you could keep one shared storage layer, but get the reliability features of a warehouse on top of it? That idea is the data lakehouse.
Warehouse vs Lake vs Lakehouse: Side by Side
Before going deeper, here is how the three approaches compare directly:
| | Data Warehouse | Data Lake | Data Lakehouse |
|---|---|---|---|
| Storage Cost | High | Very low | Very low |
| Data Format | Structured only | Anything | Anything (with structured tables on top) |
| Schema | Schema-on-write | Schema-on-read | Both supported |
| ACID Transactions | Yes | No | Yes |
| Query Speed | Fast | Slow without tuning | Fast (with right table format) |
| ML Workloads | Limited | Excellent | Excellent |
| Operational Burden | Low (managed) | Medium | High (more moving parts) |
| Examples | Snowflake, BigQuery | S3, ADLS, GCS | Databricks, Iceberg, Delta Lake |
What Exactly Is a Lakehouse?
A data lakehouse is an architecture, not a product. It is built by stacking several technologies together so that one shared storage layer can serve both warehouse-style and lake-style workloads.
The core promise: one copy of the data, reliable and structured for analytics, raw and flexible for ML, all stored cheaply in object storage.
To make this work, you need three layers stacked on top of each other.
The Three Building Blocks
Block 1: Object Storage
The foundation. Cheap, virtually infinite, durable. Think of it as a giant hard drive in the cloud where you can dump any file: raw JSON logs, polished Parquet tables, images, videos, anything.
The most popular object stores are Amazon S3, Azure Data Lake Storage (ADLS), and Google Cloud Storage (GCS).
Why object storage is the right foundation:
It is cheap (often around $0.02 per GB per month).
It is durable (S3 promises 11 nines of durability).
It scales infinitely without manual provisioning.
It supports any file format.
Multiple compute engines can read from it without copying data.
The catch: object storage by itself has no concept of tables, transactions, or consistency. If two processes write to the same path at the same time, you can end up with corrupted or partial data. That is why you need the next layer.
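To make this concrete, here is a minimal sketch of reading raw files straight out of object storage with PySpark. The bucket, paths, and column names are made up for illustration, and it assumes the cluster already has S3 credentials and the s3a connector configured.

```python
# Minimal sketch: object storage is just files, and any engine can read them directly.
# Bucket, paths, and columns are hypothetical; S3 credentials and the s3a connector
# are assumed to be set up on the cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-reads").getOrCreate()

# Raw JSON logs dumped by an upstream service -- no schema was enforced on write.
logs = spark.read.json("s3a://example-datalake/raw/clickstream/2024-06-01/")

# Curated Parquet files written by a batch job -- columnar, but still just files.
orders = spark.read.parquet("s3a://example-datalake/curated/orders/")

logs.printSchema()
orders.groupBy("status").count().show()
```

Note that nothing here prevents a second job from overwriting those paths halfway through a read, which is exactly the gap the next layer closes.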
Block 2: Open Table Formats
This is the breakthrough that made lakehouses possible. Open table formats sit on top of raw files in object storage and add database-like properties to them.
The three main contenders are Apache Iceberg, Delta Lake, and Apache Hudi.
What all three give you:
ACID transactions: writes are atomic. No more partial files. No more "wait, did that job actually finish?"
Snapshots: every write creates a new version of the table. You can read the table as it was 10 minutes ago, or 10 days ago.
Time travel: roll back a bad ingest job by querying yesterday's snapshot.
Schema evolution: add columns, drop columns, change types, without rewriting the entire table.
Hidden partitioning (Iceberg in particular): users do not need to know how the data is physically partitioned to write efficient queries.
Commit history: every change to the table is logged, like git for your data.
These are warehouse-grade features. But they are now available on top of cheap object storage. That is the magic.
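Here is a minimal sketch of what that looks like in practice, using Delta Lake through Spark SQL as one example. The table name, location, and values are made up, and it assumes a Spark session built with the delta-spark package and its extensions enabled; Iceberg and Hudi expose equivalent operations with their own syntax.

```python
# A minimal sketch of warehouse-style features on top of object storage, using
# Delta Lake via Spark SQL. Table name, path, and data are hypothetical; the
# delta-spark package is assumed to be installed.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# A table is just Parquet files plus a transaction log, living in object storage.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (id BIGINT, kind STRING, ts TIMESTAMP)
    USING DELTA LOCATION 's3a://example-datalake/tables/events'
""")

# Writes commit atomically: either the whole insert lands or none of it does.
spark.sql("INSERT INTO events VALUES (1, 'click', TIMESTAMP '2024-06-01 12:00:00')")

# Schema evolution without rewriting the table.
spark.sql("ALTER TABLE events ADD COLUMNS (country STRING)")

# Every commit is a snapshot; time travel reads the table as of an older version.
spark.sql("DESCRIBE HISTORY events").show()
spark.sql("SELECT * FROM events VERSION AS OF 0").show()
```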
Block 3: Shared Catalog
You have your storage. You have your table format. But you still need to answer questions like:
"What tables exist?"
"Where does the customers table actually live?"
"What columns does it have?"
"Who is allowed to read it?"
That is the job of the catalog. It is the directory that maps human-readable table names to physical locations and metadata. Without it, every tool would have to know the exact storage path of every table, which becomes impossible at scale.
The catalog is the single source of truth that lets multiple engines collaborate on the same data. Spark reads from it during ingestion. Trino reads from it for SQL queries. A BI tool reads from it for dashboards. They all see the same table definitions, so nobody gets lost.
[Diagram: ingestion jobs, SQL queries, streaming pipelines, and dashboards all go through the catalog, which holds table names, schemas, locations, and permissions and points each engine to the actual data files.]
Common catalog options:
AWS Glue Data Catalog: the default for AWS-based lakehouses.
Hive Metastore: the original, still widely used in legacy stacks.
Databricks Unity Catalog: tightly integrated with Databricks, with strong governance.
Apache Polaris: open source, recent, designed for Iceberg.
Project Nessie: brings git-like branching to data catalogs.
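To make the collaboration above concrete, here is a sketch of pointing one engine (Spark) at a shared catalog, using Iceberg with the AWS Glue Data Catalog as one common pairing. The catalog name, bucket, and table are assumptions, and the Iceberg runtime and AWS jars are assumed to be on the classpath.

```python
# A minimal sketch of one engine resolving tables through a shared catalog:
# Spark + Iceberg + AWS Glue Data Catalog. Catalog name, bucket, and table
# are hypothetical; the iceberg-spark runtime and AWS bundle are assumed.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("catalog-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.lake.warehouse", "s3://example-datalake/warehouse")
    .getOrCreate()
)

# The catalog resolves 'lake.sales.customers' to a physical location and schema.
spark.sql("SELECT count(*) FROM lake.sales.customers").show()
```

A Trino cluster or a BI tool configured against the same Glue catalog resolves `sales.customers` to the same metadata, which is the whole point: nobody hardcodes storage paths.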
Governance and Security
Once you have multiple teams using the same lakehouse, you need rules. Who can read what? Who can write what? Where did this data come from? Was sensitive data masked correctly? This is the governance layer, and it sits across the whole stack.
Common tools that provide this layer:
AWS Lake Formation: permissions and lineage for AWS-based lakehouses.
Databricks Unity Catalog: all-in-one governance for Databricks.
Apache Atlas: open source, traditional Hadoop ecosystem.
OpenLineage: open standard for tracking data lineage across tools.
This layer is often what differentiates a "data swamp" (a lake nobody trusts) from a real lakehouse that the business depends on.
The Trade-offs Nobody Tells You About
The lakehouse story sounds great in marketing slides. In production, there are real costs.
You Take On Operational Work
A managed warehouse like Snowflake handles everything: storage, compute, query optimization, indexing, vacuuming. You hand it SQL and pay the bill.
A lakehouse is more like Lego. You assemble it from open components. You get more flexibility and lower costs, but you are now responsible for keeping the parts working together. The rest of this section walks through what that responsibility looks like in practice.
The Small File Problem
Object storage performs poorly when you have millions of tiny files. A typical streaming write produces lots of small Parquet files. Over time, query performance degrades because the engine has to open thousands of files to read a single table.
The fix is to run periodic compaction jobs that merge small files into bigger ones. This is invisible to users but essential for performance. Lakehouse table formats provide tools for this (Delta has OPTIMIZE, Iceberg has rewrite_data_files), but you have to remember to run them.
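A scheduled compaction job might look something like the sketch below, using the two commands mentioned above. The table names, catalog name, and target file size are made up, and it assumes a Spark session already configured for whichever format you actually run; in practice you would pick one of the two statements, not both.

```python
# A sketch of a periodic (e.g. nightly) compaction job. Table and catalog names
# and the target file size are hypothetical; the session is assumed to be
# configured for the relevant table format.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nightly-compaction").getOrCreate()

# Delta Lake: merge small files into larger ones.
spark.sql("OPTIMIZE events")

# Iceberg: rewrite small data files for a table registered in the 'lake' catalog.
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table => 'sales.customers',
        options => map('target-file-size-bytes', '536870912')
    )
""")
```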
Schema Stability Across Tools
Multiple engines reading the same table need to agree on what the schema is. If Spark writes a new column but Trino's catalog cache is stale, queries can fail or return wrong results.
Coordinating schema changes across teams and tools requires discipline: schema versioning, deprecation timelines, communication. Most of this is organizational rather than purely technical, but it is real work.
Concurrent Writers
When two jobs write to the same table at the same time, the table format has to handle the conflict. All three (Iceberg, Delta, Hudi) handle this, but you need to understand their conflict resolution strategy and design your pipelines around it.
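In practice that usually means wrapping writes in a retry. The sketch below assumes Delta Lake's Python bindings (delta-spark), which surface commit conflicts as exceptions in `delta.exceptions`; the table name and the specific exception caught are illustrative, since the exact exception depends on the operations that collided.

```python
# A sketch of optimistic-concurrency handling, assuming delta-spark's Python
# bindings. Table name and backoff policy are hypothetical; the exact exception
# type raised depends on what the conflicting writer did.
import time
from delta.exceptions import ConcurrentAppendException

def append_with_retry(df, table_name, max_attempts=3):
    """Append a DataFrame, retrying if another writer committed first."""
    for attempt in range(1, max_attempts + 1):
        try:
            df.write.format("delta").mode("append").saveAsTable(table_name)
            return
        except ConcurrentAppendException:
            if attempt == max_attempts:
                raise
            # Back off briefly; the retry re-validates against the new snapshot.
            time.sleep(2 ** attempt)
```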
Cost Optimization
Storage is cheap. Compute is not. With a lakehouse, you control which engine you use for which workload. That can be a big advantage (use cheap engines for batch, expensive ones only for interactive queries) but it also means you need to actively manage the cost. With a managed warehouse, the vendor optimizes for you.
Decision Tree: Which Should You Pick?
Not every team needs a lakehouse. Pick based on your actual situation, not the trend.
When to Choose What
Pick a Data Warehouse if: you mostly do structured SQL analytics, you do not have heavy ML or unstructured data needs, and you would rather pay more for less operational burden. Snowflake, BigQuery, and Redshift are battle-tested for this.
Pick a Data Lake if: you primarily need cheap storage for raw data and ML training. You are okay with limited transactional guarantees, and you do not need fast SQL on top of it. S3 plus Spark is enough for many teams.
Pick a Data Lakehouse if: you need both worlds. You have varied workloads (analytics, ML, streaming, batch). You have a dedicated engineering team that can handle the maintenance. You want to avoid duplicating data and pipelines. The savings on storage and the unification of governance pay off at scale.
The One Thing to Remember
The lakehouse is not just a new product. It is a real architectural shift made possible by three things coming together: cheap object storage, open table formats that brought ACID to the lake, and shared catalogs that let multiple engines collaborate.
It eliminates the longstanding gap between two parallel worlds. But it does so by pushing more responsibility onto the team building it. There is no free lunch. You trade vendor management for engineering work, and infrastructure costs for operational complexity.
For organizations with the right scale and the right team, that trade is worth it. For smaller teams, a managed warehouse is still often the better answer. Knowing your situation honestly is more important than picking the trendiest architecture.