Delta lake optimization in fabric: the maintenance nobody tells you about
Delta tables aren't magic. They get slower over time if you don't maintain them.
Every write creates new files. Updates create more files. Deletes create even more files. Eventually you have thousands of tiny files and queries that used to take 10 seconds now take 2 minutes.
Had a table go from 50 files to 4000 files over 3 months. Query performance dropped by 10x. Ran optimize and z-order, back to normal in 15 minutes.
Delta maintenance isn't optional. It's required. Here's what you need to know.
The small files problem
Delta tables store data as parquet files. Each write operation creates new files.
# first write: creates 5 files
df.write.format("delta").mode("overwrite").saveAsTable("sales")
# append data: adds 3 more files
df2.write.format("delta").mode("append").saveAsTable("sales")
# another append: adds 2 more files
df3.write.format("delta").mode("append").saveAsTable("sales")
# now you have 10 files total
With frequent appends you end up with hundreds or thousands of small files.
Why this is bad:
- Each file requires a separate read operation
- List operations slow down with more files
- Query planning overhead increases
- Metadata becomes huge
- Storage costs are higher (minimum block size)
Real example: table with 2000 files vs optimized table with 20 files, same data. The 2000 file version was 7x slower to query.
What optimize does
The optimize command reads small files and rewrites them into larger files.
# optimize a table
spark.sql("OPTIMIZE sales")
# optimize with a filter (only optimize recent partitions)
spark.sql("OPTIMIZE sales WHERE date >= '2024-01-01'")
What actually happens:
- Reads all small files in the table (or partition)
- Combines them into larger files (default target: 1GB per file)
- Writes the new files
- Updates transaction log
- Old files are marked for deletion
After optimize your 2000 small files become 20 large files. Queries are faster, metadata is smaller.
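You can verify what a run actually did from the table history. A minimal check, assuming the standard delta history output (the exact metric names vary a bit between delta versions):
# inspect the most recent operation recorded in the transaction log
last = spark.sql("DESCRIBE HISTORY sales LIMIT 1").collect()[0]
print(last.operation)           # e.g. OPTIMIZE
print(last.operationMetrics)    # includes files removed/added and bytes rewritten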
optimize is safe
Optimize doesn't change data, just file organization. It's safe to run on production tables. Time travel still works, old versions reference the old files until you vacuum.
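If you want to confirm that, time travel is plain Spark SQL against the delta table (the version number and timestamp below are just illustrative):
# read an older version of the table
old_sales = spark.sql("SELECT * FROM sales VERSION AS OF 3")
# or pick a point in time
old_sales = spark.sql("SELECT * FROM sales TIMESTAMP AS OF '2024-01-01'")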
When to run optimize
Don't optimize after every write. That's wasteful. But don't wait until performance is terrible either.
Good triggers:
After bulk appends:
# load a bunch of data
df.write.format("delta").mode("append").saveAsTable("sales")
# optimize immediately
spark.sql("OPTIMIZE sales")
On a schedule:
# in a notebook run weekly
# optimize partitions older than 7 days
spark.sql("""
OPTIMIZE sales
WHERE date < current_date() - interval 7 days
""")
When file count grows:
# check file count
file_count = spark.sql("""
DESCRIBE DETAIL sales
""").select("numFiles").collect()[0][0]
if file_count > 1000:
    spark.sql("OPTIMIZE sales")
For tables with frequent writes optimize weekly. For tables with occasional writes optimize monthly or after big loads.
Auto optimize
Fabric supports auto optimize. It runs optimize automatically during writes.
# enable auto optimize for a table
spark.sql("""
ALTER TABLE sales SET TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true'
)
""")
What this does:
- optimizeWrite: writes fewer, larger files during the write operation
- autoCompact: automatically compacts small files after writes
Sounds great but has tradeoffs:
Pros:
- No manual optimize needed
- Tables stay performant automatically
- Less maintenance work
Cons:
- Write operations take longer
- Uses more compute during writes
- Not as optimized as manual optimize with z-order
My recommendation: use auto optimize for tables with very frequent small writes. Use manual optimize with z-order for tables with batch loads.
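If you already know a table will take frequent small writes, you can set the same properties when the table is created instead of altering it later. A sketch with an illustrative table name and schema:
# hypothetical table, shown only to illustrate setting the properties at create time
spark.sql("""
CREATE TABLE IF NOT EXISTS sales_events (
    event_id BIGINT,
    customer_id BIGINT,
    amount DOUBLE,
    event_date DATE
)
USING DELTA
TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact' = 'true'
)
""")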
Z-order clustering
This is the secret weapon. Z-order physically organizes data within files based on column values.
# optimize with z-order on frequently filtered columns
spark.sql("OPTIMIZE sales ZORDER BY (customer_id, product_category)")
Z-order makes filtered queries way faster.
Without z-order:
- Query filters on customer_id
- Spark reads all files
- Filters data after reading
- Slow for large tables
With z-order on customer_id:
- Query filters on customer_id
- Data for same customer is clustered together
- Spark skips files that don't contain that customer
- Way less data read
Real numbers: query filtering on customer_id went from 45 seconds to 6 seconds after z-order. Same query, same data, 7.5x faster.
Picking z-order columns
Choose columns you filter on most frequently. Usually:
Good z-order candidates:
- ID columns (customer_id, order_id)
- Category columns (product_category, region)
- Status columns (order_status, user_type)
Bad z-order candidates:
- Timestamp columns (too high cardinality)
- Columns you never filter on
- Columns with very few distinct values
You can z-order on multiple columns (up to 4-5 is reasonable):
# z-order on multiple columns
spark.sql("""
OPTIMIZE sales
ZORDER BY (region, product_category, customer_id)
""")
Order matters. Put the most frequently filtered column first.
Z-order cost
Z-order is more expensive than regular optimize. It reads and rewrites all data sorting by the z-order columns.
For a 100GB table:
- Regular optimize: 2-5 minutes
- Optimize with z-order: 10-20 minutes
Worth it if your queries filter on those columns. Not worth it for tables you only scan entirely.
z-order after bulk loads
Best time to z-order is after loading large amounts of data. Run it once on the full table then only on new partitions going forward.
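Assuming the table is partitioned by date, the incremental version combines the partition filter with z-order in one statement:
# z-order only the partition that was just loaded (assumes partitioning by date)
spark.sql("""
OPTIMIZE sales
WHERE date = '2024-01-15'
ZORDER BY (customer_id, product_category)
""")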
Vacuum: cleaning up old files
Optimize creates new files but doesn't delete old ones immediately. They're kept for time travel.
Vacuum deletes old files that are no longer needed.
# delete files older than retention period (default 7 days)
spark.sql("VACUUM sales")
# vacuum with custom retention
spark.sql("VACUUM sales RETAIN 168 HOURS") # 7 days in hours
What vacuum does:
- Lists all files in the table
- Checks transaction log for which files are current
- Deletes files not referenced by versions within retention period
After vacuum your old versions are gone. Time travel only works back to the retention period.
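If you want to see what a vacuum would remove before committing to it, delta supports a dry run:
# list the files that would be deleted, without deleting anything
spark.sql("VACUUM sales RETAIN 168 HOURS DRY RUN").show(truncate=False)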
Vacuum retention period
Default retention is 7 days. You can't vacuum files less than 7 days old (safety feature).
# this errors unless you override safety
spark.sql("VACUUM sales RETAIN 0 HOURS")
# need to disable check (dangerous)
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql("VACUUM sales RETAIN 0 HOURS")
Don't override the safety unless you know what you're doing. If you vacuum too aggressively you can corrupt concurrent queries reading old file versions.
Retention period guidelines:
- 7 days: safe default
- 30 days: if you need longer time travel
- 90 days: for compliance or audit requirements
- 24 hours: only if you're sure no queries run longer than that
Storage is cheap. Keep retention at 7 days unless you have a specific reason to change it.
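If a table genuinely needs longer time travel, set it on the table itself rather than remembering a custom RETAIN on every vacuum call. This is the same delta.deletedFileRetentionDuration property covered in the table properties section below:
# keep deleted files for 30 days on this specific table
spark.sql("""
ALTER TABLE sales SET TBLPROPERTIES (
    'delta.deletedFileRetentionDuration' = 'interval 30 days'
)
""")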
When to vacuum
Vacuum periodically to reclaim storage. Don't need to vacuum after every optimize.
# good pattern: optimize weekly, vacuum monthly
from datetime import datetime
# weekly maintenance notebook
if datetime.now().day <= 7:
    # first week of month: vacuum
    spark.sql("VACUUM sales RETAIN 168 HOURS")
else:
    # other weeks: just optimize
    spark.sql("OPTIMIZE sales")
Vacuum is less urgent than optimize. Optimize helps performance. Vacuum helps storage costs.
Partitioning strategy
How you partition tables affects optimization strategy.
# partition by date
df.write.format("delta") \
.partitionBy("date") \
.saveAsTable("sales")
With partitioned tables you can optimize specific partitions:
# only optimize recent partition
spark.sql("OPTIMIZE sales WHERE date = '2024-01-15'")
# optimize last 7 days
spark.sql("OPTIMIZE sales WHERE date >= current_date() - interval 7 days")
This is way faster than optimizing the entire table.
Over-partitioning problem
Too many partitions is worse than no partitions.
Bad example:
# partitioned by customer_id (millions of values)
df.write.format("delta") \
.partitionBy("customer_id") \
.saveAsTable("sales")
This creates millions of directories. Listing operations become slow. Queries that scan multiple customers are slower than non-partitioned.
Good partitioning:
- Date columns (day, month, year)
- Low cardinality categories (10-1000 distinct values)
- Columns you filter on in most queries
Don't partition if:
- Table is under 10GB
- You don't have a clear partition column
- Most queries scan the full table anyway
For small tables partitioning adds overhead without benefit.
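If you want date-based partitioning but daily partitions would be too fine-grained, one common pattern is partitioning by a derived month column. A sketch (column and table names here are illustrative):
# derive a low-cardinality partition column instead of partitioning by an id
from pyspark.sql import functions as F
df_by_month = df.withColumn("order_month", F.date_format("date", "yyyy-MM"))
df_by_month.write.format("delta") \
    .partitionBy("order_month") \
    .saveAsTable("sales_by_month")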
Table properties that matter
Delta tables have properties that control behavior.
# view current properties
spark.sql("SHOW TBLPROPERTIES sales")
# set useful properties
spark.sql("""
ALTER TABLE sales SET TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true',
'delta.targetFileSize' = '134217728',
'delta.deletedFileRetentionDuration' = 'interval 7 days',
'delta.logRetentionDuration' = 'interval 30 days'
)
""")
Important properties:
delta.targetFileSize: target file size after optimize (default 1GB; the example above sets 134217728 bytes, which is 128MB)
- Increase for very large tables
- Decrease for tables with selective queries
delta.autoOptimize.optimizeWrite: write fewer, larger files
- Good for frequent appends
- Slower writes but better read performance
delta.autoOptimize.autoCompact: auto-compact small files
- Runs compaction after writes
- Good for streaming or micro-batch patterns
delta.deletedFileRetentionDuration: how long to keep old files
- Default 7 days
- Increase if you need longer time travel
delta.logRetentionDuration: how long to keep transaction log
- Default 30 days
- Controls how far back time travel works
Don't change these unless you have a reason. Defaults work for most cases.
Monitoring table health
Check your table stats regularly.
# get table details
spark.sql("DESCRIBE DETAIL sales").show()
# see version history
spark.sql("DESCRIBE HISTORY sales").show()
# check file count and size
details = spark.sql("DESCRIBE DETAIL sales").collect()[0]
print(f"Files: {details.numFiles}")
print(f"Size: {details.sizeInBytes / 1e9:.2f} GB")
Signs your table needs optimization:
- File count over 1000
- Average file size under 100MB
- Query times increasing over time
- Storage growing faster than data size
Set up monitoring:
# weekly health check notebook
def check_table_health(table_name):
    details = spark.sql(f"DESCRIBE DETAIL {table_name}").collect()[0]
    num_files = details.numFiles
    size_gb = details.sizeInBytes / 1e9
    avg_file_size_mb = details.sizeInBytes / num_files / 1e6
    print(f"Table: {table_name}")
    print(f"Files: {num_files}")
    print(f"Size: {size_gb:.2f} GB")
    print(f"Avg file size: {avg_file_size_mb:.2f} MB")
    if num_files > 1000:
        print("WARNING: Too many files, consider optimize")
    if avg_file_size_mb < 100:
        print("WARNING: Files too small, consider optimize")
check_table_health("sales")
Scheduled maintenance pattern
Put it all together in a maintenance notebook.
# maintenance notebook run weekly
from datetime import datetime
tables_to_maintain = ["sales", "customers", "orders", "products"]
for table in tables_to_maintain:
    print(f"Maintaining {table}...")
    # check if table needs optimize
    details = spark.sql(f"DESCRIBE DETAIL {table}").collect()[0]
    num_files = details.numFiles
    if num_files > 500:
        print(f"  Optimizing ({num_files} files)...")
        # optimize with z-order on key columns
        if table == "sales":
            spark.sql(f"OPTIMIZE {table} ZORDER BY (customer_id, date)")
        elif table == "customers":
            spark.sql(f"OPTIMIZE {table} ZORDER BY (customer_id)")
        else:
            spark.sql(f"OPTIMIZE {table}")
    # vacuum monthly (first week)
    if datetime.now().day <= 7:
        print("  Vacuuming...")
        spark.sql(f"VACUUM {table} RETAIN 168 HOURS")
print("Maintenance complete")
Schedule this via a fabric pipeline to run weekly.
maintenance during low traffic
Run optimize and vacuum during low-traffic periods. They consume capacity and can slow down concurrent queries. 2am on Sunday is better than 2pm on Tuesday.
Performance before and after
Real example from a production table:
Before optimization:
- Files: 3500
- Total size: 45GB
- Average file size: 13MB
- Query time (filter on customer_id): 38 seconds
After optimize + z-order:
- Files: 45
- Total size: 45GB (same data)
- Average file size: 1GB
- Query time (same query): 5 seconds
7.6x performance improvement just from file organization. Same data, same query, different physics.
Storage impact after vacuum:
- Before vacuum: 67GB (old and new files)
- After vacuum: 45GB (reclaimed 22GB)
File organization matters more than most code optimizations.
Common mistakes
Mistake 1: Never running optimize
Worst mistake. Your tables get slower every month and you wonder why fabric is slow. It's not fabric, it's file fragmentation.
Mistake 2: Optimizing too frequently
Running optimize after every tiny write wastes compute. Batch your writes or use auto optimize instead.
Mistake 3: Vacuuming too aggressively
Setting retention to 0 hours and wondering why queries randomly fail. Old file versions are needed for concurrent readers.
Mistake 4: Z-ordering on wrong columns
Z-ordering on timestamp columns with millions of distinct values. Doesn't help and takes forever. Pick low-to-medium cardinality columns you filter on.
Mistake 5: Over-partitioning
Partitioning by user_id when you have millions of users. Creates a directory nightmare and slows everything down.
Mistake 6: Ignoring maintenance entirely
Treating delta like a normal database that maintains itself. It doesn't. You need to run optimize and vacuum or performance degrades.
Integration with data pipelines
Optimize and vacuum fit naturally into data pipelines.
# pipeline pattern
# step 1: load new data
df = extract_from_source()
df.write.format("delta").mode("append").saveAsTable("sales")
# step 2: optimize new partition
from datetime import date
today = date.today()
spark.sql(f"OPTIMIZE sales WHERE date = '{today}'")
# step 3: continue with downstream processing
transformed = spark.table("sales").filter(...)
Don't optimize the entire table on every run. Optimize just what changed.
# weekly full maintenance in separate pipeline
spark.sql("OPTIMIZE sales ZORDER BY (customer_id)")
spark.sql("VACUUM sales RETAIN 168 HOURS")
Separate incremental optimization (daily) from full maintenance (weekly).
Fabric specific considerations
Fabric manages spark clusters differently than databricks.
What this means for maintenance:
- Optimize/vacuum runs use your capacity units
- Jobs compete with other fabric workloads
- Schedule maintenance during low-usage times
- Consider your capacity SKU when planning maintenance windows
On F16 capacity optimizing a 100GB table takes 10-15 minutes. On F8 it might take 30-40 minutes. Budget accordingly.
Working with lakehouses
In fabric your delta tables live in lakehouses.
Maintenance commands work the same:
# reference table in lakehouse
spark.sql("USE my_lakehouse")
spark.sql("OPTIMIZE sales")
# or use full path
spark.sql("OPTIMIZE my_lakehouse.sales")
Lakehouse files are in OneLake. After vacuum the old files are deleted from OneLake and you reclaim storage.
Final maintenance checklist
For each delta table:
Weekly:
- Check file count
- Run optimize if files > 500
- Use z-order on optimize for key tables
Monthly:
- Run vacuum to reclaim storage
- Review table properties
- Check query performance trends
After bulk loads:
- Always optimize immediately
- Consider z-order if table supports filtered queries
Ongoing:
- Monitor file counts in critical tables
- Track query performance over time
- Adjust maintenance frequency based on write patterns
Real impact
Had a lakehouse with 15 tables. Never ran any maintenance for 6 months. Query times degraded 3-10x. Storage was 2x what it should be.
Ran a maintenance pass:
- Optimize + z-order on all tables: 2 hours
- Vacuum everything: 30 minutes
- Queries back to normal speed
- Reclaimed 200GB of storage
Total cost of maintenance: 2.5 hours of capacity once. Ongoing benefit: faster queries and lower storage costs forever.
Maintenance isn't optional overhead. It's required operations for delta tables.
Learning more
Delta maintenance ties into overall spark optimization. File organization is part of the physics that makes queries fast.
If you're new to fabric start with the intro guide for power bi developers first. Then learn about lakehouses vs warehouses. Then come back to optimization.
The concepts here apply to any delta lake implementation. Works the same in databricks, fabric, or standalone delta. You're learning transferable skills.
Delta tables are powerful but they're not self-maintaining. Run optimize regularly, use z-order for filtered queries, vacuum periodically. Your tables stay fast and your storage stays clean.
Build maintenance into your pipelines from day one. Way easier than trying to fix performance problems later.