Lakehouse vs warehouse in fabric: which one do you actually need
Fabric gives you two options for storing data: lakehouses and warehouses. Both store tables. Both let you query with sql. But they're fundamentally different and picking the wrong one causes problems later.
I spent way too much time trying to figure out which one to use. The Microsoft docs explain what they are but not really when to use which. Here's what I learned from actually using both.
What's a lakehouse in fabric
A lakehouse stores data as files in OneLake using delta format. It gives you both file access and sql query access to the same data.
Under the hood:
- Data stored as parquet files organized by delta lake protocol
- You get a sql endpoint automatically for querying
- Access via spark (notebooks) or sql (endpoint)
- Tables are delta tables with versioning and time travel
Think of it as a folder structure with database functionality layered on top.
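For example, writing a dataframe from a notebook creates the parquet files, the delta log, and a sql-queryable table in one go. A minimal sketch (the spark session already exists in fabric notebooks, and the sales_demo table name is made up for illustration):
# create a delta table in the default lakehouse from a notebook
from pyspark.sql import Row
df = spark.createDataFrame([
    Row(sale_id=1, sale_amount=120.50),
    Row(sale_id=2, sale_amount=89.99),
])
# saveAsTable writes parquet files plus a _delta_log folder under Tables/
df.write.format("delta").mode("overwrite").saveAsTable("sales_demo")
# the same table is immediately queryable from spark and from the sql endpoint
spark.table("sales_demo").show()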
What's a warehouse in fabric
A warehouse is a traditional sql database experience. You work with it entirely through t-sql and never touch the underlying files.
Under the hood:
- Traditional relational database engine
- Access via t-sql only
- Tables are sql tables with constraints (defined, though not enforced) and no user-created indexes
- Data still lands in OneLake as delta parquet under the covers, but the warehouse engine owns those files and you never touch them directly
Think of it as azure sql database built into fabric.
Key differences that matter
Here's what actually matters when deciding between them:
Access methods
Lakehouse:
- Spark notebooks (python, scala, r)
- Sql endpoint (read only)
- Direct file access to parquet files
- Power bi direct lake mode
Warehouse:
- T-sql only
- Read and write via sql
- Power bi directquery or import
- No direct file access
If you need spark processing you need a lakehouse. If you only use sql you can use either.
Write operations
Lakehouse: You can't write to lakehouse tables via the sql endpoint. It's read only.
To write data you need to use:
- Notebooks with spark
- Dataflows gen2
- Data pipelines with copy activity
Warehouse: Full read/write access via t-sql. You can insert, update, delete, merge just like any sql database.
This is a big deal. If your workflow is built around sql procedures and you need to write data via sql statements you probably need a warehouse.
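If you stay on the lakehouse side, the equivalent of a sql merge happens in a notebook with the delta api. A rough sketch, assuming made-up customers and staging_customers tables keyed on customer_id:
# upsert into a lakehouse delta table from a notebook (the sql endpoint can't do this)
from delta.tables import DeltaTable
target = DeltaTable.forName(spark, "customers")
updates = spark.table("staging_customers")
(target.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())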
Performance characteristics
Lakehouse:
- Fast for large scale scans (reading millions of rows)
- Optimized for analytics and aggregations
- Slower for single row lookups
- No traditional indexes (uses delta statistics instead)
Warehouse:
- Fast for point queries (finding specific rows)
- Handles frequent small queries and updates well
- Relies on automatic statistics and caching rather than indexes you create yourself
- Can be slower for full table scans on huge datasets
If you're doing analytics (summarizing large datasets) lakehouse is usually faster. If you're doing transactional lookups warehouse might be better.
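To make that concrete, here's the kind of work each pattern maps to on the lakehouse side, sketched against a made-up fact_sales table:
# big scan + aggregate: what delta tables are built for
sales = spark.table("fact_sales")
sales.groupBy("product_id").sum("sales_amount").show()
# single row lookup: works, but with no index spark still has to prune and scan files
sales.filter(sales.order_id == 123456).show()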
Direct lake mode
Only lakehouses support direct lake mode for power bi. This is the feature where your semantic model queries the storage directly without importing data. If you want direct lake you must use a lakehouse not a warehouse.
Data formats
Lakehouse: Delta format. Each table is a folder of parquet files plus a transaction log.
Benefits:
- Time travel (query previous versions)
- Schema evolution (add columns without breaking things)
- Acid transactions
- Merge and upsert operations
Warehouse: also delta parquet in OneLake under the covers, but the engine manages the files for you. You only ever work with tables, never the files.
Benefits:
- Full t-sql write access (insert, update, delete, merge)
- Primary and foreign key constraints (defined, though not enforced)
- Familiar to sql developers
Both sit on the same open format, but the lakehouse hands you the delta tables directly (time travel, spark merges, file access) while the warehouse wraps them in a classic sql database experience.
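A couple of those delta features in practice, from a notebook (silver_sales is a made-up table, version 0 is just an example, and the syntax assumes a recent fabric spark runtime):
# time travel: query the table as it was at an earlier version
old = spark.sql("SELECT * FROM silver_sales VERSION AS OF 0")
# the full change history (writes, merges, optimize runs)
spark.sql("DESCRIBE HISTORY silver_sales").show()
# schema evolution: append rows that carry a brand new column
from pyspark.sql import Row
new_rows = spark.createDataFrame([Row(sale_id=3, sale_amount=42.0, channel="web")])
(new_rows.write.format("delta").mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("silver_sales"))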
When to use a lakehouse
Use a lakehouse when:
You need spark processing
If any part of your workflow needs python, scala, or spark transformations you need a lakehouse. Warehouses don't support notebooks.
You want direct lake mode
For power bi reports that need to query large datasets without importing, direct lake only works with lakehouses.
You're building a medallion architecture
The bronze/silver/gold pattern works naturally with lakehouses. Raw files in bronze, cleaned delta tables in silver and gold.
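A stripped-down bronze-to-silver hop in a notebook looks roughly like this (bronze_orders, silver_orders, and the cleanup rules are placeholders):
from pyspark.sql import functions as F
# bronze: data exactly as it landed
raw = spark.table("bronze_orders")
# silver: de-duplicated and typed
clean = (raw
    .dropDuplicates(["order_id"])
    .withColumn("order_date", F.to_date("order_date"))
    .filter(F.col("order_id").isNotNull()))
clean.write.format("delta").mode("overwrite").saveAsTable("silver_orders")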
You have large scale analytics workloads
Scanning billions of rows to aggregate data performs better on lakehouse delta tables than warehouse tables.
You need time travel or versioning
Delta tables track every change. You can query data as it was yesterday or last week. Warehouses don't have this built in.
You want to mix structured and unstructured data
Lakehouses let you store files (pdfs, images, json) alongside delta tables in the same location.
When to use a warehouse
Use a warehouse when:
You only know t-sql
If your team is sql experts and nobody wants to learn spark, warehouse lets you stay in familiar territory.
You need full sql write access
If your processes rely on sql procedures, updates, and deletes, warehouse gives you full dml access.
You're migrating from sql server
If you're lifting and shifting a sql server database, warehouse is the closer match. Less transformation required.
You need traditional database features
Things like primary and foreign key constraints (defined, though not enforced), multi-statement transactions, and the broader t-sql surface live in the warehouse, not the lakehouse.
You have high concurrency point queries
Lots of users doing lookups by id or key? The warehouse engine generally handles this better than scanning delta files through the lakehouse sql endpoint.
Your data volumes are modest
If you're working with gigabytes not terabytes, the warehouse simplicity might outweigh lakehouse flexibility.
Migration difficulty
It's easier to move from warehouse to lakehouse later than lakehouse to warehouse. If you're unsure and don't have a specific need for warehouse features, start with lakehouse.
Can you use both?
Yes and honestly most real projects do.
Common pattern:
- Lakehouse for analytics data - your main data platform, large datasets, delta tables
- Warehouse for dimension tables - small reference tables, lookups, things that need sql updates
- Cross-database queries - you can query both in the same sql statement
Example:
-- query lakehouse and warehouse together
SELECT
f.sales_amount,
d.product_name
FROM lakehouse_sales.dbo.fact_sales f
JOIN warehouse_ref.dbo.dim_product d
ON f.product_id = d.product_id
The fabric sql endpoint lets you join across lakehouses and warehouses seamlessly.
Storage and cost
Both lakehouses and warehouses store data in OneLake so storage costs are similar.
The real cost difference is compute:
Lakehouse compute:
- Spark notebooks consume capacity when running
- Sql endpoint queries use capacity
- No cost when not actively querying
Warehouse compute:
- Always consuming some capacity even when idle
- Queries consume additional capacity
- More like a traditional database that's always on
If you have sporadic workloads lakehouse is more cost effective. If you have constant querying the warehouse cost model might make more sense.
Real world decision framework
Here's how I actually decide for new projects:
Start with these questions:
- Do I need spark or notebooks? If yes → lakehouse
- Do I need to write data via sql? If yes → warehouse
- Do I want direct lake mode? If yes → lakehouse
- Is my team sql only? If yes → warehouse
- Am I dealing with terabytes of data? If yes → lakehouse
If multiple answers point different directions:
- Default to lakehouse unless you have a specific warehouse requirement
- You can always create both and use them for different purposes
- Moving from warehouse to lakehouse is possible later, the reverse is harder
Start simple
For your first fabric project pick one and stick with it. Don't try to architect a complex multi-storage solution right away. Get comfortable with one approach first.
Common mistakes
Mistake 1: choosing warehouse because sql is familiar
I get it, sql is comfortable. But if you're using fabric for analytics workloads you're limiting yourself. Take a week to learn basic spark. It's not that hard and opens up way more capabilities.
Mistake 2: using lakehouse and then trying to update rows via sql
The lakehouse sql endpoint is read only. If you need to update data you have to use notebooks with merge operations. Don't design a workflow that depends on sql updates then choose lakehouse.
Mistake 3: picking lakehouse without understanding delta
Delta tables aren't just files. They have transaction logs, versioning, and optimization requirements. If you ignore this stuff your queries will be slow.
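The basic maintenance is two commands from a notebook (silver_sales is a made-up table and the retention window shown is just the default):
# compact lots of small parquet files into fewer, bigger ones
spark.sql("OPTIMIZE silver_sales")
# remove old files no longer referenced by the transaction log (168 hours = 7 days)
spark.sql("VACUUM silver_sales RETAIN 168 HOURS")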
Mistake 4: creating both for the same data
Don't store the same data in both lakehouse and warehouse thinking you get best of both worlds. You get data duplication, sync issues, and higher costs. Pick one as the source of truth.
Loading data patterns
How you load data differs between them:
Into lakehouse:
- Dataflows gen2 with lakehouse destination
- Notebooks writing delta tables
- Copy activity in pipelines landing to files section
- Direct file upload for small datasets
Into warehouse:
- Dataflows gen2 with warehouse destination
- T-sql insert/merge statements
- Copy activity with warehouse destination
- Bulk insert from files
Most etl tools in fabric work with both but the patterns are slightly different.
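As an example of the file-landing pattern on the lakehouse side: a pipeline drops a csv in the Files section, then a notebook registers it as a delta table (the path and table name here are made up):
# read a file a pipeline landed in the Files section of the default lakehouse
raw = (spark.read
    .option("header", "true")
    .csv("Files/landing/sales_2024.csv"))
# save it as a delta table so the sql endpoint and power bi can see it
raw.write.format("delta").mode("overwrite").saveAsTable("raw_sales")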
Querying patterns
Lakehouse sql endpoint:
-- read only queries
SELECT * FROM bronze.raw_sales
WHERE sale_date >= '2024-01-01'
Lakehouse notebook:
# full read/write with spark
df = spark.table("bronze.raw_sales")
df = df.filter(df.sale_date >= "2024-01-01")
# write back to delta
df.write.format("delta").mode("append").saveAsTable("silver.sales")
Warehouse t-sql:
-- full read/write
UPDATE dbo.customers
SET status = 'active'
WHERE last_purchase > DATEADD(month, -6, GETDATE())
-- insert/merge works too
MERGE INTO dbo.customers AS target
USING staging.customers AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...
Notice the lakehouse example requires a notebook for writes. The warehouse does everything in sql.
Power bi integration
This is important if you're coming from power bi like most people reading this.
Lakehouse with direct lake mode:
- Semantic model queries delta tables directly
- No data import, always current
- Best performance for large datasets
- Needs a fabric capacity big enough for the direct lake guardrails (row limits per table)
Warehouse with directquery:
- Semantic model sends sql queries to warehouse
- No data import, always current
- Performance depends on query complexity and how well the warehouse can optimize it (statistics, caching)
- Works on any capacity
Either with import mode:
- Standard power bi import
- Data refreshes on schedule
- Works exactly like traditional power bi
- Fastest query performance but stale data
For new projects targeting large datasets, direct lake with a lakehouse is the better path.
Migration path
If you pick wrong you're not stuck forever.
Warehouse to lakehouse:
- Export warehouse tables to files
- Create lakehouse tables from those files
- Rewrite any sql procedures as notebooks
- Switch your semantic models and reports over
Medium difficulty. The sql to spark translation is the hard part.
Lakehouse to warehouse:
- Copy delta tables to staging
- Create warehouse schema
- Load data via t-sql
- Rewrite spark notebooks as sql procedures
Also medium difficulty. You lose time travel and versioning along the way.
Neither migration is trivial but both are doable if you realize you need to switch.
Final decision guide
If you're still unsure here's the tldr:
Choose lakehouse if:
- You want direct lake mode
- You're doing large scale analytics
- You need spark processing
- You're comfortable learning new tools
- This is a new project
Choose warehouse if:
- You only use t-sql
- You need full sql write access
- You're migrating from sql server
- Your team refuses to learn spark
- You have traditional oltp patterns
When in doubt pick lakehouse. It's more flexible and where fabric is investing development effort.
Getting started
Whichever you pick:
- Create a test workspace
- Create one lakehouse or warehouse
- Load some sample data via a dataflow gen2
- Try querying it different ways
- Build a simple power bi report on top
Actually using them for a week teaches you more than reading docs.
If you're new to fabric start with my intro guide for power bi developers to understand the overall platform first.
The lakehouse vs warehouse decision matters but it's not permanent. Pick one based on your immediate needs and switch later if required. Don't overthink it to the point where you never start.