Spark optimization in fabric notebooks: the logic vs physics split
Most people write spark notebooks and just hope they run fast. Then they're surprised when processing 10GB takes 20 minutes or costs way more capacity than expected.
The problem is treating spark like regular python. It's not. Your code is just instructions. The actual work happens in a distributed system that you need to configure properly.
There's a split between logic and physics. Logic is your code, what transformations you want. Physics is the spark cluster, how those transformations actually execute. You need to understand both.
The logic vs physics mental model
Logic: the transformations you write in your notebook
df = spark.read.parquet("files/raw_data")
df = df.filter(df.date >= "2024-01-01")
df = df.groupBy("customer_id").agg(sum("amount"))
df.write.saveAsTable("aggregated_sales")
Physics: the cluster running that code
- How many nodes
- How much memory per node
- How many cores
- Network between nodes
- Storage I/O limits
Your code might be perfect but if the physics is wrong it'll still be slow. Or your cluster might be powerful but inefficient code wastes it.
Most performance problems are physics issues not logic issues. People optimize the wrong layer.
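You can peek at the physics from inside a notebook. A minimal sketch using standard spark config keys (fabric manages some of these, so a few may come back unset):
# quick look at the physics from inside a session
sc = spark.sparkContext
print("executor memory:", spark.conf.get("spark.executor.memory", "not set"))
print("executor cores:", spark.conf.get("spark.executor.cores", "not set"))
print("default parallelism:", sc.defaultParallelism)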
Fabric capacity SKUs: what you actually get
Fabric doesn't let you configure clusters node by node like databricks. Instead you get capacity units based on your SKU and fabric manages the spark resources.
Here's what each SKU gives you for spark workloads:
F2 (dev/experiments)
- 2 capacity units
- Effectively 1-2 spark executors
- ~8GB usable memory total
- Good for: learning, small datasets under 1GB
F4
- 4 capacity units
- 2-4 executors
- ~16GB memory
- Good for: small production workloads, datasets 1-5GB
F8
- 8 capacity units
- 4-6 executors
- ~32GB memory
- Good for: most standard workloads, datasets 5-20GB
F16
- 16 capacity units
- 8-12 executors
- ~64GB memory
- Good for: larger production workloads, datasets 20-50GB
F32
- 32 capacity units
- 16-24 executors
- ~128GB memory
- Good for: heavy processing, datasets 50-200GB
F64 and higher
- 64+ capacity units
- 32+ executors
- 256GB+ memory
- Good for: very large scale, datasets 200GB+
These are approximations. Fabric doesn't expose exact executor counts and they vary based on workload type. But this gives you a mental model of what scale each tier handles.
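If you want that mental model as code, here's a purely illustrative helper that just encodes the rough dataset tiers above, nothing official:
# map a dataset size to the rough tier from the table above (illustrative only)
def suggest_sku(dataset_gb):
    if dataset_gb <= 1: return "F2"
    if dataset_gb <= 5: return "F4"
    if dataset_gb <= 20: return "F8"
    if dataset_gb <= 50: return "F16"
    if dataset_gb <= 200: return "F32"
    return "F64 or higher"
print(suggest_sku(80))  # F32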
Starter pools
Fabric has starter pools that spin up in 5-10 seconds vs 2-3 minutes for custom pools. Starter pools are smaller (usually 2-4 executors) but fine for most notebooks. Only use custom pools if you need specific configurations or more resources.
Small datasets: avoid spark entirely
If your data is under 1GB don't use spark dataframes at all. Use pandas.
Spark has overhead. Starting executors, shuffling data between them, coordinating tasks. For small data this overhead is bigger than the actual processing time.
# slow for small data
df = spark.read.parquet("files/small_data.parquet")
df = df.filter(df.status == "active")
df_pandas = df.toPandas() # converts to pandas anyway
# faster
import pandas as pd
df = pd.read_parquet("files/small_data.parquet")
df = df[df.status == "active"]
The pandas version runs on one machine in memory. No distribution overhead. For datasets under 1GB it's usually 2-5x faster than spark.
You can still save pandas dataframes as delta tables:
# write a pandas df to a delta table by converting to a spark dataframe first
spark_df = spark.createDataFrame(df)
spark_df.write.format("delta").mode("overwrite").saveAsTable("my_table")
Rule: if it fits in memory on one machine use pandas. Only use spark when you actually need distribution.
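A minimal sketch of that rule, assuming the default lakehouse is mounted at /lakehouse/default/ and the file path is hypothetical:
import os
import pandas as pd
# hypothetical file reached via the default lakehouse's local mount, adjust to your setup
local_path = "/lakehouse/default/Files/small_data.parquet"
size_gb = os.path.getsize(local_path) / 1024**3
if size_gb < 1:
    df = pd.read_parquet(local_path)  # one machine, no spark overhead
else:
    df = spark.read.parquet("Files/small_data.parquet")  # big enough to distribute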
Medium datasets: default settings work fine
For datasets between 1-50GB the default spark configuration in fabric works fine. Don't over-optimize.
What actually matters at this scale:
1. Read only the columns you need
# caches and counts every column before selecting (forces a full-width read)
df = spark.read.parquet("files/large_table")
df.cache().count()
df = df.select("customer_id", "amount")
# reads only the needed columns (column pruning happens in the scan)
df = spark.read.parquet("files/large_table") \
    .select("customer_id", "amount")
Parquet is columnar. Reading 2 columns from a 50 column table is 20x less data to scan.
2. Filter early
# runs a job over the unfiltered table first (display, count and cache all force this), then filters
df = spark.read.parquet("files/sales")
display(df)
df = df.filter(df.date >= "2024-01-01")
# attaches the filter to the read (predicate pushdown)
df = spark.read.parquet("files/sales") \
    .filter("date >= '2024-01-01'")
The second version only reads files that contain data from 2024. Can skip entire partitions.
3. Cache only when reusing
# loads same data twice (slow)
df = spark.table("sales")
result1 = df.filter(df.region == "US").count()
result2 = df.filter(df.region == "EU").count()
# cache when reading multiple times
df = spark.table("sales")
df.cache()
result1 = df.filter(df.region == "US").count()
result2 = df.filter(df.region == "EU").count()
df.unpersist()
Caching stores the dataframe in memory. Only helps if you actually reuse it. Don't cache and then only use it once.
For 1-50GB workloads these three things matter more than fancy spark configs.
Large datasets: configuration starts mattering
Above 50GB you need to think about spark configuration. The defaults aren't optimized for large scale.
Shuffle partitions
This is the most important setting nobody changes.
Default is 200 shuffle partitions. This is way too low for large data and way too high for small data.
# check current setting
spark.conf.get("spark.sql.shuffle.partitions")
# set based on your data size
# rule of thumb: aim for 128MB per partition
# for 10GB of data after shuffle
# 10GB / 128MB = ~80 partitions
spark.conf.set("spark.sql.shuffle.partitions", "80")
# for 100GB of data
# 100GB / 128MB = ~800 partitions
spark.conf.set("spark.sql.shuffle.partitions", "800")
Shuffle happens during groupBy, join, repartition operations. Too few partitions means tasks are too big and run out of memory. Too many means overhead from managing thousands of tiny tasks.
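If you want that rule of thumb as a reusable snippet, here's an illustrative helper using the same ~128MB target:
# pick shuffle partitions from an estimated post-shuffle size (illustrative helper)
def shuffle_partitions(shuffle_gb, target_mb=128):
    return max(1, int(shuffle_gb * 1024 / target_mb))
spark.conf.set("spark.sql.shuffle.partitions", str(shuffle_partitions(10)))  # ~80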
Adaptive query execution
Make sure this is on. Recent spark runtimes enable it by default, but it costs nothing to set explicitly and it makes almost everything faster.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
AQE adjusts the query plan at runtime based on actual data sizes. Automatically combines small partitions, adjusts join strategies, handles skew.
Setting it explicitly also makes the dependency obvious to whoever runs the notebook next.
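The skew handling mentioned above has its own switch too (on by default whenever AQE is on in spark 3.x), if you want to be explicit about it:
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")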
Broadcast joins
When joining a large table to a small table spark can broadcast the small one to all executors instead of shuffling both.
# automatic broadcast for tables under 10MB (default)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760")
# increase if you have memory
# good for dimension tables up to 100MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "104857600")
# or force broadcast manually
from pyspark.sql.functions import broadcast
large_df = spark.table("fact_sales")
small_df = spark.table("dim_product")
result = large_df.join(broadcast(small_df), "product_id")
Broadcasting avoids expensive shuffles. Way faster for small dimension tables joining to large fact tables.
Memory limits
Don't broadcast tables larger than 20% of your executor memory. If executors have 8GB don't broadcast tables over 1-2GB. You'll run out of memory and everything crashes.
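One way to sanity-check before forcing a broadcast, sketched against the dim_product table from the join above. DESCRIBE DETAIL reports a delta table's compressed on-disk size, so treat the threshold conservatively:
from pyspark.sql.functions import broadcast
# on-disk size of the delta table (it will be larger in memory once decompressed)
size_mb = spark.sql("DESCRIBE DETAIL dim_product").collect()[0]["sizeInBytes"] / 1024**2
fact = spark.table("fact_sales")
dim = spark.table("dim_product")
# only force the broadcast when the table is comfortably small (200MB here is a judgment call)
if size_mb < 200:
    result = fact.join(broadcast(dim), "product_id")
else:
    result = fact.join(dim, "product_id")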
Dynamic partition pruning
This one is on by default in spark 3.x, but it costs nothing to confirm.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
When you filter on a dimension table then join to a partitioned fact table, this skips reading unnecessary partitions. Can reduce data scanned by 10-100x.
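This is the query shape it helps with, sketched with hypothetical dim_date and date_key names alongside the fact_sales table from earlier:
dim = spark.table("dim_date").filter("year = 2024")  # dim_date and date_key are hypothetical
fact = spark.table("fact_sales")  # assume it's partitioned by date_key
result = fact.join(dim, "date_key")  # only the matching partitions get scanned
result.explain()  # look for dynamic pruning in the partition filters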
Reading and writing efficiently
File formats
Always use parquet or delta for analytics. Never use csv in production.
# csv: slow, no schema, string types, huge files
df = spark.read.csv("files/data.csv", header=True, inferSchema=True)
# parquet: fast, schema, compressed, columnar
df = spark.read.parquet("files/data.parquet")
# delta: parquet + versioning + optimizations
df = spark.read.format("delta").load("Tables/data")
CSV is text. Every read parses strings into types. Parquet stores data in binary format already typed and compressed.
Real numbers: a 10GB csv lands around 2GB as parquet and reads roughly 5x faster.
Writing patterns
How you write data affects read performance forever.
# creates tons of small files (slow to read later)
df.write.format("delta").mode("append").saveAsTable("my_table")
# repartition first for better file sizes
df.repartition(10).write.format("delta").mode("append").saveAsTable("my_table")
# or coalesce to reduce partition count without triggering another shuffle
df.coalesce(10).write.format("delta").mode("append").saveAsTable("my_table")
Each partition writes one file. Too many partitions means thousands of tiny files. Too few means giant files that can't parallelize reads.
Target file size: 128MB to 1GB per file depending on scale.
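As a rough sketch, you can back into a file count from an estimated output size; the 5GB and 256MB figures here are made-up placeholders, and df is whatever dataframe you're about to write:
est_output_gb = 5  # made-up estimate, plug in your own
target_file_mb = 256
num_files = max(1, int(est_output_gb * 1024 / target_file_mb))  # 20 files
df.repartition(num_files).write.format("delta").mode("append").saveAsTable("my_table")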
Partitioning tables
Partition delta tables by columns you filter on frequently.
# partition by date (most common pattern)
df.write.format("delta") \
.partitionBy("date") \
.saveAsTable("sales")
# queries that filter on date skip entire partitions
filtered = spark.table("sales").filter("date >= '2024-01-01'")
Only partition on low cardinality columns. Date is good (365 values). Customer ID is bad (millions of values).
Don't partition if your table is under 10GB. The overhead isn't worth it.
Real example: optimizing a slow notebook
Had a notebook processing 80GB of sales data. Took 45 minutes on F16 capacity. Here's what actually helped:
Before optimization:
from pyspark.sql.functions import sum
# read everything
sales = spark.read.parquet("files/sales")
# filter after reading
sales = sales.filter(sales.date >= "2024-01-01")
# join to product catalog (5MB table)
products = spark.read.parquet("files/products")
result = sales.join(products, "product_id")
# aggregate
final = result.groupBy("category").agg(sum("amount"))
final.write.saveAsTable("category_sales")
- Runtime: 45 minutes
- Capacity consumed: 12 CU-hours
After optimization:
from pyspark.sql.functions import sum, broadcast
# enable AQE and resize shuffle partitions
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions", "400")
# filter during read
sales = spark.read.parquet("files/sales") \
.filter("date >= '2024-01-01'")
# broadcast small table
products = spark.read.parquet("files/products")
result = sales.join(broadcast(products), "product_id")
# rest same
final = result.groupBy("category").agg(sum("amount"))
final.write.saveAsTable("category_sales")
- Runtime: 12 minutes
- Capacity consumed: 3.2 CU-hours
3.75x faster by changing physics not logic. The actual transformations stayed identical.
What you can't control in fabric
Unlike databricks you don't get full cluster control in fabric. Some things are managed:
Can't configure:
- Exact executor count
- Individual node types
- Custom docker images
- Network settings
- Storage mount points
Can configure:
- All spark configs via spark.conf.set
- Library installations
- Environment variables
- Session timeout settings
This is actually fine for most use cases. Less configuration means less to break. If you need granular control over every cluster setting, use databricks instead.
When to use notebooks vs dataflows
People ask this a lot. Here's my rule:
Use dataflows gen2 when:
- You're comfortable with power query
- Transformations are straightforward (filters, joins, basic aggregations)
- Data is under 20GB
- Your team already knows power query
Use notebooks when:
- You need python or complex logic
- Data is over 20GB
- You need specific spark optimizations
- You want full control over the physics
Dataflows gen2 run on managed compute you can't tune. For large scale work notebooks give you the control you need.
Storage matters too
Your spark cluster is only half the physics. Storage I/O is the other half.
Delta tables stored in OneLake sit on ADLS gen2 (azure blob storage) underneath. This has limits:
- Throughput caps based on fabric capacity
- Latency ~10-50ms per operation
- Works best with large sequential reads
What this means:
Good pattern:
# large scan, sequential read
df = spark.table("large_table").filter("date >= '2024-01-01'")
result = df.groupBy("category").agg(sum("amount"))
Bad pattern:
# lots of small random reads
for customer in customer_list:
    df = spark.table("sales").filter(f"customer_id = {customer}")
    # process each customer separately
The second pattern makes thousands of small I/O requests. Storage becomes the bottleneck not compute.
If you need to process data per-key use groupBy not loops. Let spark handle the distribution.
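A minimal sketch of the loop above rewritten as one distributed job, borrowing the column names from earlier examples (the output aliases are just for illustration):
from pyspark.sql.functions import sum as sum_, count
per_customer = (
    spark.table("sales")
    .groupBy("customer_id")
    .agg(sum_("amount").alias("total_amount"), count("*").alias("orders"))
)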
Monitoring what's actually happening
Fabric gives you spark UI access. Use it.
In your notebook after running cells:
- Click the spark application link at top
- Opens spark UI
- Check stages, tasks, data shuffle
Look for:
- Tasks that take way longer than others (skew)
- Large shuffle read/write (too much data movement)
- Lots of tasks (too many partitions)
- Few tasks (too few partitions)
The UI shows you the physics. Your code doesn't tell you why it's slow, the execution metrics do.
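Two quick checks you can run without leaving the notebook, using the sales table from earlier as a stand-in:
df = spark.table("sales").filter("date >= '2024-01-01'")
df.explain()  # did the filter and any broadcast actually make it into the plan?
print(df.rdd.getNumPartitions())  # rough idea of how many tasks the next stage gets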
Quick wins checklist
If your notebook is slow try these in order:
- Check data size, if under 1GB use pandas
- Read only columns you need
- Filter as early as possible
- Enable adaptive query execution
- Set shuffle partitions based on data size (data_gb / 0.128 = partitions)
- Broadcast small tables in joins
- Check for skew in spark UI
- Repartition before writing (target 128MB-1GB files)
These handle 80% of performance problems.
Cost optimization
Performance and cost are connected in fabric. Faster jobs cost less because they use fewer CU-hours.
Real cost example on F16 capacity:
Slow notebook:
- Runtime: 45 minutes
- CU consumption: 12 CU-hours
- If F16 costs $X/hour, this costs 0.75 * X
Optimized notebook:
- Runtime: 12 minutes
- CU consumption: 3.2 CU-hours
- Costs 0.2 * X
Same output, about 73% lower cost just from optimization.
This is why understanding the physics matters. You're paying for compute time. Efficient physics means less time means lower cost.
Learning more about spark
If you're serious about fabric notebooks you need to understand spark fundamentals. Can't optimize what you don't understand.
Key concepts to learn:
- Transformations vs actions
- Wide vs narrow transformations
- Shuffle operations
- Catalyst optimizer
- Tungsten execution engine
You don't need to be a spark expert but understanding the execution model helps you write better notebooks.
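For example, the first concept on that list, transformations vs actions, fits in two lines:
df = spark.table("sales").filter("amount > 0")  # transformation: nothing executes yet
total = df.count()  # action: the whole pipeline runs here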
The lakehouse architecture also matters. How you structure your delta tables affects query performance as much as spark configuration.
Final thoughts
Your notebook code is just half the story. The spark cluster running it is the other half. Most people only think about logic and wonder why things are slow.
Understanding the physics, what resources you have at each SKU level, and how to configure spark properly makes everything faster and cheaper.
Start with the basics: read less data, filter early, enable AQE. Then move to advanced stuff like shuffle partitions and broadcast joins once you understand what's actually happening.
If you're new to fabric check out my intro guide for power bi developers first. Then come back to this when you're writing notebooks that need optimization.
The split between logic and physics is fundamental. Write good code but configure the execution environment properly too. That's when notebooks actually perform well.