Getting started with Microsoft Fabric: a practical guide
Tried migrating a 2 TB data warehouse to Fabric last month. Everyone said Direct Lake mode would solve the performance issues. Some of that was true, some wasn't. Here's what actually happened and the stuff the docs don't tell you.
What is Fabric, actually?
Fabric is Microsoft's attempt at building an all-in-one data platform. You get a lakehouse, a warehouse, notebooks, pipelines, and Power BI all in one place. Sounds great on paper.
The reality is a bit more nuanced. It's still relatively new, and some features are rougher around the edges than others. But the core functionality is solid and getting better every month.
Setting up your first lakehouse
The first thing you need to do is set up a lakehouse. Not a warehouse, a lakehouse. Fabric has both, and they're different. Took me a week to figure out which one I actually needed. Spoiler: I needed the lakehouse.
Here's the quick breakdown:
- Lakehouse: stores data in Delta tables, queried with Spark or the SQL analytics endpoint
- Warehouse: T-SQL only, a more traditional data warehouse experience
- When to use which: if you need Spark processing, go lakehouse. If you only know T-SQL and don't need Spark, the warehouse might be simpler. (See the sketch after this list.)
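To make the split concrete, here's a minimal sketch, assuming a Delta table named sales already exists in the lakehouse. The same table shows up in both worlds: Spark in a notebook, and read-only T-SQL through the SQL analytics endpoint.
# In a notebook attached to the lakehouse, Spark queries the Delta table directly
spark.sql("SELECT * FROM sales LIMIT 10").show()
# The same table is also exposed (read-only) through the SQL analytics endpoint,
# where a client runs T-SQL instead, e.g.: SELECT TOP 10 * FROM sales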
To create a lakehouse:
- Go to your workspace
- Click New > Lakehouse
- Give it a name (keep it simple, you'll type this a lot)
- That's it. You now have a lakehouse.
Loading your first data
The easiest way to get data in is with a notebook. Here's a quick example of loading a CSV:
# Read the CSV from the lakehouse Files section
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("Files/raw/sales_data.csv")

# Write it out as a Delta table
df.write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("sales")
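It's worth sanity-checking that the write actually landed. A quick check in the same notebook, using the sales table from the snippet above:
# Confirm the Delta table exists and count the rows that landed
spark.sql("SELECT COUNT(*) AS row_count FROM sales").show()
# Peek at the schema Spark inferred from the CSV
spark.table("sales").printSchema()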
Pro tip
Always use the Delta format for your tables. Plain Parquet works, but you lose the versioning and MERGE capabilities that make Delta useful.
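To show what you'd be giving up, here's a rough sketch of both features against the sales table from above. The staging path (Files/staging/sales_updates.csv) and the id join key are made up for illustration:
from delta.tables import DeltaTable

# Versioning: every write to a Delta table is logged and can be inspected
spark.sql("DESCRIBE HISTORY sales").show(truncate=False)

# Merge: upsert staged changes into the table
updates = spark.read.format("csv") \
    .option("header", "true") \
    .load("Files/staging/sales_updates.csv")

DeltaTable.forName(spark, "sales").alias("t").merge(
    updates.alias("u"),
    "t.id = u.id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()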
Direct Lake mode: what worked
Direct Lake mode is where things get interesting. Instead of importing data into Power BI, it reads your Delta tables directly. No data duplication, always fresh data.
What worked well:
- Reports update automatically when data changes
- No import refresh schedules to manage
- Storage costs went down since I'm not duplicating data
Direct Lake mode: what didn't work
Watch out
Direct Lake mode has row limits that depend on your capacity SKU; on my development workspace the limit worked out to about 1 million rows. Learned that one the hard way when my 500 GB table took down the entire workspace.
Other stuff I ran into:
- V-Order optimization is crucial for performance. Without it, queries are slow (see the sketch after this list)
- Not all DAX functions behave the same way as they do in import mode
- Fallback to DirectQuery happens more often than you'd expect
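Enabling V-Order was just a couple of settings for me. A rough sketch below; the session config name and the OPTIMIZE syntax are what I used at the time, so double-check them against the current Fabric docs:
# Turn on V-Order for writes in this Spark session
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")
# Rewrite an existing table's Parquet files with V-Order applied
spark.sql("OPTIMIZE sales VORDER")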
Spark configuration that actually matters
Spent a lot of time tuning Spark configs. Most of them don't matter much, but these do:
# Set these or your Spark jobs will be slow
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# I set the skew threshold to 1 GB for my workload;
# the 256 MB default was too aggressive
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "1073741824")
Adaptive query execution (AQE) makes a huge difference. Enable it and forget about it.
What I would do differently
If I were starting over:
- Start with the lakehouse, not the warehouse, unless you refuse to learn anything besides T-SQL
- Enable V-Order from day one on all your Delta tables
- Test Direct Lake limits early; don't find out about row limits in production
- Use notebooks for ETL; Data Factory works, but notebooks give you more control
Next steps
That's the basics to get you started. In the next post I'll cover incremental loading patterns and how I handle slowly changing dimensions in Fabric.
If you have questions, hit me up on Twitter. Always happy to help people avoid the mistakes I made.