Getting started with Microsoft Fabric: a practical guide
Tried migrating a 2 TB data warehouse to Fabric last month. Everyone said Direct Lake mode would solve the performance issues. Some of that was true, some wasn't. Here's what actually happened and the stuff the docs don't tell you.
What is Fabric, actually?
Fabric is Microsoft's attempt at building an all-in-one data platform. You get a lakehouse, a warehouse, notebooks, pipelines, and Power BI all in one place. Sounds great on paper.
The reality is a bit more nuanced. It's still relatively new, and some features are rougher around the edges than others. But the core functionality is solid and getting better every month.
Setting up your first lakehouse
The first thing you need to do is set up a lakehouse. Not a warehouse, a lakehouse. Fabric has both, and they're different. Took me a week to figure out which one I actually needed. Spoiler: I needed the lakehouse.
Here's the quick breakdown:
- Lakehouse: stores data in Delta tables, queried with Spark or the SQL analytics endpoint
- Warehouse: T-SQL only, a more traditional data warehouse experience
- When to use which: if you need Spark processing, go lakehouse. If you only know T-SQL and don't need Spark, the warehouse might be simpler. (See the sketch after this list.)
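To make the split concrete, here's a minimal sketch, assuming a Delta table named sales already exists in the lakehouse. The same table shows up in both worlds: Spark in a notebook, and read-only T-SQL through the SQL analytics endpoint.
# In a notebook attached to the lakehouse, Spark queries the Delta table directly
spark.sql("SELECT * FROM sales LIMIT 10").show()
# The same table is also exposed (read-only) through the SQL analytics endpoint,
# where a client runs T-SQL instead, e.g.: SELECT TOP 10 * FROM sales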
To create a lakehouse:
- Go to your workspace
- Click New > Lakehouse
- Give it a name (keep it simple, you'll type this a lot)
- That's it. You now have a lakehouse.
Loading your first data
The easiest way to get data in is with a notebook. Here's a quick example of loading a CSV:
# Read the CSV from the lakehouse Files section
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("Files/raw/sales_data.csv")

# Write it out as a Delta table
df.write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("sales")
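It's worth sanity-checking that the write actually landed. A quick check in the same notebook, using the sales table from the snippet above:
# Confirm the Delta table exists and count the rows that landed
spark.sql("SELECT COUNT(*) AS row_count FROM sales").show()
# Peek at the schema Spark inferred from the CSV
spark.table("sales").printSchema()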
Pro tip
Always use the Delta format for your tables. Plain Parquet works, but you lose the versioning and MERGE capabilities that make Delta useful.
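To show what you'd be giving up, here's a rough sketch of both features against the sales table from above. The staging path (Files/staging/sales_updates.csv) and the id join key are made up for illustration:
from delta.tables import DeltaTable

# Versioning: every write to a Delta table is logged and can be inspected
spark.sql("DESCRIBE HISTORY sales").show(truncate=False)

# Merge: upsert staged changes into the table
updates = spark.read.format("csv") \
    .option("header", "true") \
    .load("Files/staging/sales_updates.csv")

DeltaTable.forName(spark, "sales").alias("t").merge(
    updates.alias("u"),
    "t.id = u.id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()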
Direct Lake mode: what worked
Direct Lake mode is where things get interesting. Instead of importing data into Power BI, it reads your Delta tables directly. No data duplication, always fresh data.
What worked well:
- Reports update automatically when data changes
- No import refresh schedules to manage
- Storage costs went down since I'm not duplicating data
Direct Lake mode: what didn't work
Watch out
Direct Lake mode has row limits that depend on your capacity SKU; on my development workspace the limit worked out to about 1 million rows. Learned that one the hard way when my 500 GB table took down the entire workspace.
Other stuff I ran into:
- V-Order optimization is crucial for performance. Without it, queries are slow (see the sketch after this list)
- Not all DAX functions behave the same way as they do in import mode
- Fallback to DirectQuery happens more often than you'd expect
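Enabling V-Order was just a couple of settings for me. A rough sketch below; the session config name and the OPTIMIZE syntax are what I used at the time, so double-check them against the current Fabric docs:
# Turn on V-Order for writes in this Spark session
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")
# Rewrite an existing table's Parquet files with V-Order applied
spark.sql("OPTIMIZE sales VORDER")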
Spark configuration that actually matters
Spent a lot of time tuning Spark configs. Most of them don't matter much, but these do:
# Set these or your Spark jobs will be slow
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# I set the skew threshold to 1 GB for my workload;
# the 256 MB default was too aggressive
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "1073741824")
Adaptive query execution (AQE) makes a huge difference. Enable it and forget about it.
What I would do differently
If I were starting over:
- Start with the lakehouse, not the warehouse, unless you refuse to learn anything besides T-SQL
- Enable V-Order from day one on all your Delta tables
- Test Direct Lake limits early; don't find out about row limits in production
- Use notebooks for ETL; Data Factory works, but notebooks give you more control
Next steps
That's the basics to get you started. In the next post I'll cover incremental loading patterns and how I handle slowly changing dimensions in Fabric.
If you have questions, hit me up on Twitter. Always happy to help people avoid the mistakes I made.