The Problem That Led Here
For decades, companies that wanted to do anything serious with data had to pick between two very different tools.
On one side, the data warehouse. Fast, structured, reliable. Perfect for finance reports, executive dashboards, anything where you need clean, trustworthy answers from SQL queries. The downside: expensive, rigid, and not built for raw or unstructured data.
On the other side, the data lake. Cheap, flexible, massive. You can dump anything into it: logs, images, JSON blobs, video, IoT data. Perfect for machine learning and exploration. The downside: no schema, no transactions, no consistency guarantees. It is easy to fill a lake with garbage.
For years, large organizations ran both. Operational data flowed into the lake for archival and ML training. Curated business data lived in the warehouse for analytics. Two pipelines, two governance models, two sets of quality checks, two bills.
This was the state of the world for over a decade. Then around 2020, a different approach started to take hold: what if you could keep one shared storage layer, but get the reliability features of a warehouse on top of it? That idea is the data lakehouse.
Warehouse vs Lake vs Lakehouse: Side by Side
Before going deeper, here is how the three approaches compare directly:
| | Data Warehouse | Data Lake | Data Lakehouse |
|---|---|---|---|
| Storage Cost | High | Very low | Very low |
| Data Format | Structured only | Anything | Anything (with structured tables on top) |
| Schema | Schema-on-write | Schema-on-read | Both supported |
| ACID Transactions | Yes | No | Yes |
| Query Speed | Fast | Slow without tuning | Fast (with right table format) |
| ML Workloads | Limited | Excellent | Excellent |
| Operational Burden | Low (managed) | Medium | High (more moving parts) |
| Examples | Snowflake, BigQuery | S3, ADLS, GCS | Databricks, Iceberg, Delta Lake |
What Exactly Is a Lakehouse?
A data lakehouse is an architecture, not a product. It is built by stacking several technologies together so that one shared storage layer can serve both warehouse-style and lake-style workloads.
The core promise: one copy of the data, reliable and structured for analytics, raw and flexible for ML, all stored cheaply in object storage.
To make this work, you need three layers stacked on top of each other.
The Three Building Blocks
Block 1: Object Storage
The foundation. Cheap, virtually infinite, durable. Think of it as a giant hard drive in the cloud where you can dump any file: raw JSON logs, polished Parquet tables, images, videos, anything.
The most popular object stores are Amazon S3, Azure Data Lake Storage (ADLS), and Google Cloud Storage (GCS).
Why object storage is the right foundation:
It is cheap (often around $0.02 per GB per month).
It is durable (S3 promises 11 nines of durability).
It scales infinitely without manual provisioning.
It supports any file format.
Multiple compute engines can read from it without copying data.
The catch: object storage by itself has no concept of tables, transactions, or consistency. If two processes write to the same path at the same time, you can end up with corrupted or partial data. That is why you need the next layer.
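To make this concrete, here is a minimal sketch of reading raw files straight out of object storage with PySpark. The bucket, paths, and column names are made up for illustration, and it assumes the cluster already has S3 credentials and the s3a connector configured.

```python
# Minimal sketch: object storage is just files, and any engine can read them directly.
# Bucket, paths, and columns are hypothetical; S3 credentials and the s3a connector
# are assumed to be set up on the cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-reads").getOrCreate()

# Raw JSON logs dumped by an upstream service -- no schema was enforced on write.
logs = spark.read.json("s3a://example-datalake/raw/clickstream/2024-06-01/")

# Curated Parquet files written by a batch job -- columnar, but still just files.
orders = spark.read.parquet("s3a://example-datalake/curated/orders/")

logs.printSchema()
orders.groupBy("status").count().show()
```

Note that nothing here prevents a second job from overwriting those paths halfway through a read, which is exactly the gap the next layer closes.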
Block 2: Open Table Formats
This is the breakthrough that made lakehouses possible. Open table formats sit on top of raw files in object storage and add database-like properties to them.
The three main contenders are Apache Iceberg, Delta Lake, and Apache Hudi.
What all three give you:
ACID transactions: writes are atomic. No more partial files. No more "wait, did that job actually finish?"
Snapshots: every write creates a new version of the table. You can read the table as it was 10 minutes ago, or 10 days ago.
Time travel: roll back a bad ingest job by querying yesterday's snapshot.
Schema evolution: add columns, drop columns, change types, without rewriting the entire table.
Hidden partitioning (Iceberg in particular): users do not need to know how the data is physically partitioned to write efficient queries.
Commit history: every change to the table is logged, like git for your data.
These are warehouse-grade features. But they are now available on top of cheap object storage. That is the magic.
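Here is a minimal sketch of what that looks like in practice, using Delta Lake through Spark SQL as one example. The table name, location, and values are made up, and it assumes a Spark session built with the delta-spark package and its extensions enabled; Iceberg and Hudi expose equivalent operations with their own syntax.

```python
# A minimal sketch of warehouse-style features on top of object storage, using
# Delta Lake via Spark SQL. Table name, path, and data are hypothetical; the
# delta-spark package is assumed to be installed.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# A table is just Parquet files plus a transaction log, living in object storage.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (id BIGINT, kind STRING, ts TIMESTAMP)
    USING DELTA LOCATION 's3a://example-datalake/tables/events'
""")

# Writes commit atomically: either the whole insert lands or none of it does.
spark.sql("INSERT INTO events VALUES (1, 'click', TIMESTAMP '2024-06-01 12:00:00')")

# Schema evolution without rewriting the table.
spark.sql("ALTER TABLE events ADD COLUMNS (country STRING)")

# Every commit is a snapshot; time travel reads the table as of an older version.
spark.sql("DESCRIBE HISTORY events").show()
spark.sql("SELECT * FROM events VERSION AS OF 0").show()
```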
Block 3: Shared Catalog
You have your storage. You have your table format. But you still need to answer questions like:
"What tables exist?"
"Where does the customers table actually live?"
"What columns does it have?"
"Who is allowed to read it?"
That is the job of the catalog. It is the directory that maps human-readable table names to physical locations and metadata. Without it, every tool would have to know the exact storage path of every table, which becomes impossible at scale.
The catalog is the single source of truth that lets multiple engines collaborate on the same data. Spark reads from it during ingestion. Trino reads from it for SQL queries. A BI tool reads from it for dashboards. They all see the same table definitions, so nobody gets lost.
[Diagram: ingestion jobs, SQL queries, streaming pipelines, and dashboards all go through the catalog, which holds table names, schemas, locations, and permissions and points each engine to the actual data files.]
Common catalog options:
AWS Glue Data Catalog: the default for AWS-based lakehouses.
Hive Metastore: the original, still widely used in legacy stacks.
Databricks Unity Catalog: tightly integrated with Databricks, with strong governance.
Apache Polaris: open source, recent, designed for Iceberg.
Project Nessie: brings git-like branching to data catalogs.
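To make the collaboration above concrete, here is a sketch of pointing one engine (Spark) at a shared catalog, using Iceberg with the AWS Glue Data Catalog as one common pairing. The catalog name, bucket, and table are assumptions, and the Iceberg runtime and AWS jars are assumed to be on the classpath.

```python
# A minimal sketch of one engine resolving tables through a shared catalog:
# Spark + Iceberg + AWS Glue Data Catalog. Catalog name, bucket, and table
# are hypothetical; the iceberg-spark runtime and AWS bundle are assumed.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("catalog-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.lake.warehouse", "s3://example-datalake/warehouse")
    .getOrCreate()
)

# The catalog resolves 'lake.sales.customers' to a physical location and schema.
spark.sql("SELECT count(*) FROM lake.sales.customers").show()
```

A Trino cluster or a BI tool configured against the same Glue catalog resolves `sales.customers` to the same metadata, which is the whole point: nobody hardcodes storage paths.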
Governance and Security
Once you have multiple teams using the same lakehouse, you need rules. Who can read what? Who can write what? Where did this data come from? Was sensitive data masked correctly? This is the governance layer, and it sits across the whole stack.
Common tools that provide this layer:
AWS Lake Formation: permissions and lineage for AWS-based lakehouses.
Databricks Unity Catalog: all-in-one governance for Databricks.
Apache Atlas: open source, traditional Hadoop ecosystem.
OpenLineage: open standard for tracking data lineage across tools.
This layer is often what differentiates a "data swamp" (a lake nobody trusts) from a real lakehouse that the business depends on.
The Trade-offs Nobody Tells You About
The lakehouse story sounds great in marketing slides. In production, there are real costs.
You Take On Operational Work
A managed warehouse like Snowflake handles everything: storage, compute, query optimization, indexing, vacuuming. You hand it SQL and pay the bill.
A lakehouse is more like Lego. You assemble it from open components. You get more flexibility and lower costs, but you are now responsible for keeping the parts working together. The rest of this section walks through what that responsibility looks like in practice.
The Small File Problem
Object storage performs poorly when you have millions of tiny files. A typical streaming write produces lots of small Parquet files. Over time, query performance degrades because the engine has to open thousands of files to read a single table.
The fix is to run periodic compaction jobs that merge small files into bigger ones. This is invisible to users but essential for performance. Lakehouse table formats provide tools for this (Delta has OPTIMIZE, Iceberg has rewrite_data_files), but you have to remember to run them.
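A scheduled compaction job might look something like the sketch below, using the two commands mentioned above. The table names, catalog name, and target file size are made up, and it assumes a Spark session already configured for whichever format you actually run; in practice you would pick one of the two statements, not both.

```python
# A sketch of a periodic (e.g. nightly) compaction job. Table and catalog names
# and the target file size are hypothetical; the session is assumed to be
# configured for the relevant table format.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nightly-compaction").getOrCreate()

# Delta Lake: merge small files into larger ones.
spark.sql("OPTIMIZE events")

# Iceberg: rewrite small data files for a table registered in the 'lake' catalog.
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table => 'sales.customers',
        options => map('target-file-size-bytes', '536870912')
    )
""")
```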
Schema Stability Across Tools
Multiple engines reading the same table need to agree on what the schema is. If Spark writes a new column but Trino's catalog cache is stale, queries can fail or return wrong results.
Coordinating schema changes across teams and tools requires discipline: schema versioning, deprecation timelines, communication. Most of this is organizational rather than purely technical, but it is real work.
Concurrent Writers
When two jobs write to the same table at the same time, the table format has to handle the conflict. All three (Iceberg, Delta, Hudi) handle this, but you need to understand their conflict resolution strategy and design your pipelines around it.
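In practice that usually means wrapping writes in a retry. The sketch below assumes Delta Lake's Python bindings (delta-spark), which surface commit conflicts as exceptions in `delta.exceptions`; the table name and the specific exception caught are illustrative, since the exact exception depends on the operations that collided.

```python
# A sketch of optimistic-concurrency handling, assuming delta-spark's Python
# bindings. Table name and backoff policy are hypothetical; the exact exception
# type raised depends on what the conflicting writer did.
import time
from delta.exceptions import ConcurrentAppendException

def append_with_retry(df, table_name, max_attempts=3):
    """Append a DataFrame, retrying if another writer committed first."""
    for attempt in range(1, max_attempts + 1):
        try:
            df.write.format("delta").mode("append").saveAsTable(table_name)
            return
        except ConcurrentAppendException:
            if attempt == max_attempts:
                raise
            # Back off briefly; the retry re-validates against the new snapshot.
            time.sleep(2 ** attempt)
```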
Cost Optimization
Storage is cheap. Compute is not. With a lakehouse, you control which engine you use for which workload. That can be a big advantage (use cheap engines for batch, expensive ones only for interactive queries) but it also means you need to actively manage the cost. With a managed warehouse, the vendor optimizes for you.
Decision Tree: Which Should You Pick?
Not every team needs a lakehouse. Pick based on your actual situation, not the trend.
When to Choose What
Pick a Data Warehouse if: you mostly do structured SQL analytics, you do not have heavy ML or unstructured data needs, and you would rather pay more for less operational burden. Snowflake, BigQuery, and Redshift are battle-tested for this.
Pick a Data Lake if: you primarily need cheap storage for raw data and ML training. You are okay with limited transactional guarantees, and you do not need fast SQL on top of it. S3 plus Spark is enough for many teams.
Pick a Data Lakehouse if: you need both worlds. You have varied workloads (analytics, ML, streaming, batch). You have a dedicated engineering team that can handle the maintenance. You want to avoid duplicating data and pipelines. The savings on storage and the unification of governance pay off at scale.
The One Thing to Remember
The lakehouse is not just a new product. It is a real architectural shift made possible by three things coming together: cheap object storage, open table formats that brought ACID to the lake, and shared catalogs that let multiple engines collaborate.
It eliminates the longstanding gap between two parallel worlds. But it does so by pushing more responsibility onto the team building it. There is no free lunch. You trade vendor management for engineering work, and infrastructure costs for operational complexity.
For organizations with the right scale and the right team, that trade is worth it. For smaller teams, a managed warehouse is still often the better answer. Knowing your situation honestly is more important than picking the trendiest architecture.