Spark optimization in fabric notebooks: the logic vs physics split
Most people write spark notebooks and just hope they run fast. Then they're surprised when processing 10GB takes 20 minutes or costs way more capacity than expected.
The problem is treating spark like regular python. It's not. Your code is just instructions. The actual work happens in a distributed system that you need to configure properly.
There's a split between logic and physics. Logic is your code, what transformations you want. Physics is the spark cluster, how those transformations actually execute. You need to understand both.
The logic vs physics mental model
Logic: the transformations you write in your notebook
df = spark.read.parquet("files/raw_data")
df = df.filter(df.date >= "2024-01-01")
df = df.groupBy("customer_id").agg(sum("amount"))
df.write.saveAsTable("aggregated_sales")
Physics: the cluster running that code
- How many nodes
- How much memory per node
- How many cores
- Network between nodes
- Storage I/O limits
Your code might be perfect but if the physics is wrong it'll still be slow. Or your cluster might be powerful but inefficient code wastes it.
Most performance problems are physics issues not logic issues. People optimize the wrong layer.
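You can peek at the physics from inside a notebook. A minimal sketch using standard spark config keys (fabric manages some of these, so a few may come back unset):
# quick look at the physics from inside a session
sc = spark.sparkContext
print("executor memory:", spark.conf.get("spark.executor.memory", "not set"))
print("executor cores:", spark.conf.get("spark.executor.cores", "not set"))
print("default parallelism:", sc.defaultParallelism)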
Fabric capacity SKUs: what you actually get
Fabric doesn't let you configure clusters node by node like databricks. Instead you get capacity units based on your SKU and fabric manages the spark resources.
Here's what each SKU gives you for spark workloads:
F2 (dev/experiments)
- 2 capacity units
- Effectively 1-2 spark executors
- ~8GB usable memory total
- Good for: learning, small datasets under 1GB
F4
- 4 capacity units
- 2-4 executors
- ~16GB memory
- Good for: small production workloads, datasets 1-5GB
F8
- 8 capacity units
- 4-6 executors
- ~32GB memory
- Good for: most standard workloads, datasets 5-20GB
F16
- 16 capacity units
- 8-12 executors
- ~64GB memory
- Good for: larger production workloads, datasets 20-50GB
F32
- 32 capacity units
- 16-24 executors
- ~128GB memory
- Good for: heavy processing, datasets 50-200GB
F64 and higher
- 64+ capacity units
- 32+ executors
- 256GB+ memory
- Good for: very large scale, datasets 200GB+
These are approximations. Fabric doesn't expose exact executor counts and they vary based on workload type. But this gives you a mental model of what scale each tier handles.
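If you want that mental model as code, here's a purely illustrative helper that just encodes the rough dataset tiers above, nothing official:
# map a dataset size to the rough tier from the table above (illustrative only)
def suggest_sku(dataset_gb):
    if dataset_gb <= 1: return "F2"
    if dataset_gb <= 5: return "F4"
    if dataset_gb <= 20: return "F8"
    if dataset_gb <= 50: return "F16"
    if dataset_gb <= 200: return "F32"
    return "F64 or higher"
print(suggest_sku(80))  # F32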
Starter pools
Fabric has starter pools that spin up in 5-10 seconds vs 2-3 minutes for custom pools. Starter pools are smaller (usually 2-4 executors) but fine for most notebooks. Only use custom pools if you need specific configurations or more resources.
Small datasets: avoid spark entirely
If your data is under 1GB don't use spark dataframes at all. Use pandas.
Spark has overhead. Starting executors, shuffling data between them, coordinating tasks. For small data this overhead is bigger than the actual processing time.
# slow for small data
df = spark.read.parquet("files/small_data.parquet")
df = df.filter(df.status == "active")
df_pandas = df.toPandas() # converts to pandas anyway
# faster
import pandas as pd
df = pd.read_parquet("files/small_data.parquet")
df = df[df.status == "active"]
The pandas version runs on one machine in memory. No distribution overhead. For datasets under 1GB it's usually 2-5x faster than spark.
You can still save pandas dataframes as delta tables:
# write a pandas df to a delta table by converting to a spark dataframe first
spark_df = spark.createDataFrame(df)
spark_df.write.format("delta").mode("overwrite").saveAsTable("my_table")
Rule: if it fits in memory on one machine use pandas. Only use spark when you actually need distribution.
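A minimal sketch of that rule, assuming the default lakehouse is mounted at /lakehouse/default/ and the file path is hypothetical:
import os
import pandas as pd
# hypothetical file reached via the default lakehouse's local mount, adjust to your setup
local_path = "/lakehouse/default/Files/small_data.parquet"
size_gb = os.path.getsize(local_path) / 1024**3
if size_gb < 1:
    df = pd.read_parquet(local_path)  # one machine, no spark overhead
else:
    df = spark.read.parquet("Files/small_data.parquet")  # big enough to distribute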
Medium datasets: default settings work fine
For datasets between 1-50GB the default spark configuration in fabric works fine. Don't over-optimize.
What actually matters at this scale:
1. Read only the columns you need
# caches and counts every column before selecting (forces a full-width read)
df = spark.read.parquet("files/large_table")
df.cache().count()
df = df.select("customer_id", "amount")
# reads only the needed columns (column pruning happens in the scan)
df = spark.read.parquet("files/large_table") \
    .select("customer_id", "amount")
Parquet is columnar. Reading 2 columns from a 50 column table is 20x less data to scan.
2. Filter early
# runs a job over the unfiltered table first (display, count and cache all force this), then filters
df = spark.read.parquet("files/sales")
display(df)
df = df.filter(df.date >= "2024-01-01")
# attaches the filter to the read (predicate pushdown)
df = spark.read.parquet("files/sales") \
    .filter("date >= '2024-01-01'")
The second version only reads files that contain data from 2024. Can skip entire partitions.
3. Cache only when reusing
# loads same data twice (slow)
df = spark.table("sales")
result1 = df.filter(df.region == "US").count()
result2 = df.filter(df.region == "EU").count()
# cache when reading multiple times
df = spark.table("sales")
df.cache()
result1 = df.filter(df.region == "US").count()
result2 = df.filter(df.region == "EU").count()
df.unpersist()
Caching stores the dataframe in memory. Only helps if you actually reuse it. Don't cache and then only use it once.
For 1-50GB workloads these three things matter more than fancy spark configs.
Large datasets: configuration starts mattering
Above 50GB you need to think about spark configuration. The defaults aren't optimized for large scale.
Shuffle partitions
This is the most important setting nobody changes.
Default is 200 shuffle partitions. This is way too low for large data and way too high for small data.
# check current setting
spark.conf.get("spark.sql.shuffle.partitions")
# set based on your data size
# rule of thumb: aim for 128MB per partition
# for 10GB of data after shuffle
# 10GB / 128MB = ~80 partitions
spark.conf.set("spark.sql.shuffle.partitions", "80")
# for 100GB of data
# 100GB / 128MB = ~800 partitions
spark.conf.set("spark.sql.shuffle.partitions", "800")
Shuffle happens during groupBy, join, repartition operations. Too few partitions means tasks are too big and run out of memory. Too many means overhead from managing thousands of tiny tasks.
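If you want that rule of thumb as a reusable snippet, here's an illustrative helper using the same ~128MB target:
# pick shuffle partitions from an estimated post-shuffle size (illustrative helper)
def shuffle_partitions(shuffle_gb, target_mb=128):
    return max(1, int(shuffle_gb * 1024 / target_mb))
spark.conf.set("spark.sql.shuffle.partitions", str(shuffle_partitions(10)))  # ~80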
Adaptive query execution
Make sure this is on. Recent spark runtimes enable it by default, but it costs nothing to set explicitly and it makes almost everything faster.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
AQE adjusts the query plan at runtime based on actual data sizes. Automatically combines small partitions, adjusts join strategies, handles skew.
Setting it explicitly also makes the dependency obvious to whoever runs the notebook next.
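The skew handling mentioned above has its own switch too (on by default whenever AQE is on in spark 3.x), if you want to be explicit about it:
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")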
Broadcast joins
When joining a large table to a small table spark can broadcast the small one to all executors instead of shuffling both.
# automatic broadcast for tables under 10MB (default)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760")
# increase if you have memory
# good for dimension tables up to 100MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "104857600")
# or force broadcast manually
from pyspark.sql.functions import broadcast
large_df = spark.table("fact_sales")
small_df = spark.table("dim_product")
result = large_df.join(broadcast(small_df), "product_id")
Broadcasting avoids expensive shuffles. Way faster for small dimension tables joining to large fact tables.
Memory limits
Don't broadcast tables larger than 20% of your executor memory. If executors have 8GB don't broadcast tables over 1-2GB. You'll run out of memory and everything crashes.
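One way to sanity-check before forcing a broadcast, sketched against the dim_product table from the join above. DESCRIBE DETAIL reports a delta table's compressed on-disk size, so treat the threshold conservatively:
from pyspark.sql.functions import broadcast
# on-disk size of the delta table (it will be larger in memory once decompressed)
size_mb = spark.sql("DESCRIBE DETAIL dim_product").collect()[0]["sizeInBytes"] / 1024**2
fact = spark.table("fact_sales")
dim = spark.table("dim_product")
# only force the broadcast when the table is comfortably small (200MB here is a judgment call)
if size_mb < 200:
    result = fact.join(broadcast(dim), "product_id")
else:
    result = fact.join(dim, "product_id")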
Dynamic partition pruning
This one is on by default in spark 3.x, but it costs nothing to confirm.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
When you filter on a dimension table then join to a partitioned fact table, this skips reading unnecessary partitions. Can reduce data scanned by 10-100x.
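This is the query shape it helps with, sketched with hypothetical dim_date and date_key names alongside the fact_sales table from earlier:
dim = spark.table("dim_date").filter("year = 2024")  # dim_date and date_key are hypothetical
fact = spark.table("fact_sales")  # assume it's partitioned by date_key
result = fact.join(dim, "date_key")  # only the matching partitions get scanned
result.explain()  # look for dynamic pruning in the partition filters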
Reading and writing efficiently
File formats
Always use parquet or delta for analytics. Never use csv in production.
# csv: slow, no schema, string types, huge files
df = spark.read.csv("files/data.csv", header=True, inferSchema=True)
# parquet: fast, schema, compressed, columnar
df = spark.read.parquet("files/data.parquet")
# delta: parquet + versioning + optimizations
df = spark.read.format("delta").load("Tables/data")
CSV is text. Every read parses strings into types. Parquet stores data in binary format already typed and compressed.
Real numbers: a 10GB csv lands around 2GB as parquet and reads roughly 5x faster.
Writing patterns
How you write data affects read performance forever.
# creates tons of small files (slow to read later)
df.write.format("delta").mode("append").saveAsTable("my_table")
# repartition first for better file sizes
df.repartition(10).write.format("delta").mode("append").saveAsTable("my_table")
# or coalesce to reduce partition count without triggering another shuffle
df.coalesce(10).write.format("delta").mode("append").saveAsTable("my_table")
Each partition writes one file. Too many partitions means thousands of tiny files. Too few means giant files that can't parallelize reads.
Target file size: 128MB to 1GB per file depending on scale.
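As a rough sketch, you can back into a file count from an estimated output size; the 5GB and 256MB figures here are made-up placeholders, and df is whatever dataframe you're about to write:
est_output_gb = 5  # made-up estimate, plug in your own
target_file_mb = 256
num_files = max(1, int(est_output_gb * 1024 / target_file_mb))  # 20 files
df.repartition(num_files).write.format("delta").mode("append").saveAsTable("my_table")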
Partitioning tables
Partition delta tables by columns you filter on frequently.
# partition by date (most common pattern)
df.write.format("delta") \
.partitionBy("date") \
.saveAsTable("sales")
# queries that filter on date skip entire partitions
filtered = spark.table("sales").filter("date >= '2024-01-01'")
Only partition on low cardinality columns. Date is good (365 values). Customer ID is bad (millions of values).
Don't partition if your table is under 10GB. The overhead isn't worth it.
Real example: optimizing a slow notebook
Had a notebook processing 80GB of sales data. Took 45 minutes on F16 capacity. Here's what actually helped:
Before optimization:
from pyspark.sql.functions import sum
# read everything
sales = spark.read.parquet("files/sales")
# filter after reading
sales = sales.filter(sales.date >= "2024-01-01")
# join to product catalog (5MB table)
products = spark.read.parquet("files/products")
result = sales.join(products, "product_id")
# aggregate
final = result.groupBy("category").agg(sum("amount"))
final.write.saveAsTable("category_sales")
- Runtime: 45 minutes
- Capacity consumed: 12 CU-hours
After optimization:
from pyspark.sql.functions import sum, broadcast
# enable AQE and resize shuffle partitions
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions", "400")
# filter during read
sales = spark.read.parquet("files/sales") \
.filter("date >= '2024-01-01'")
# broadcast small table
products = spark.read.parquet("files/products")
result = sales.join(broadcast(products), "product_id")
# rest same
final = result.groupBy("category").agg(sum("amount"))
final.write.saveAsTable("category_sales")
- Runtime: 12 minutes
- Capacity consumed: 3.2 CU-hours
3.75x faster by changing physics not logic. The actual transformations stayed identical.
What you can't control in fabric
Unlike databricks you don't get full cluster control in fabric. Some things are managed:
Can't configure:
- Exact executor count
- Individual node types
- Custom docker images
- Network settings
- Storage mount points
Can configure:
- All spark configs via spark.conf.set
- Library installations
- Environment variables
- Session timeout settings
This is actually fine for most use cases. Less configuration means less to break. If you need granular control over every cluster setting, use databricks instead.
When to use notebooks vs dataflows
People ask this a lot. Here's my rule:
Use dataflows gen2 when:
- You're comfortable with power query
- Transformations are straightforward (filters, joins, basic aggregations)
- Data is under 20GB
- Your team already knows power query
Use notebooks when:
- You need python or complex logic
- Data is over 20GB
- You need specific spark optimizations
- You want full control over the physics
Dataflows gen2 run on managed compute you can't tune. For large scale work notebooks give you the control you need.
Storage matters too
Your spark cluster is only half the physics. Storage I/O is the other half.
Delta tables stored in OneLake sit on ADLS gen2 (azure blob storage) underneath. This has limits:
- Throughput caps based on fabric capacity
- Latency ~10-50ms per operation
- Works best with large sequential reads
What this means:
Good pattern:
# large scan, sequential read
df = spark.table("large_table").filter("date >= '2024-01-01'")
result = df.groupBy("category").agg(sum("amount"))
Bad pattern:
# lots of small random reads
for customer in customer_list:
    df = spark.table("sales").filter(f"customer_id = {customer}")
    # process each customer separately
The second pattern makes thousands of small I/O requests. Storage becomes the bottleneck not compute.
If you need to process data per-key use groupBy not loops. Let spark handle the distribution.
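A minimal sketch of the loop above rewritten as one distributed job, borrowing the column names from earlier examples (the output aliases are just for illustration):
from pyspark.sql.functions import sum as sum_, count
per_customer = (
    spark.table("sales")
    .groupBy("customer_id")
    .agg(sum_("amount").alias("total_amount"), count("*").alias("orders"))
)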
Monitoring what's actually happening
Fabric gives you spark UI access. Use it.
In your notebook after running cells:
- Click the spark application link at top
- Opens spark UI
- Check stages, tasks, data shuffle
Look for:
- Tasks that take way longer than others (skew)
- Large shuffle read/write (too much data movement)
- Lots of tasks (too many partitions)
- Few tasks (too few partitions)
The UI shows you the physics. Your code doesn't tell you why it's slow, the execution metrics do.
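Two quick checks you can run without leaving the notebook, using the sales table from earlier as a stand-in:
df = spark.table("sales").filter("date >= '2024-01-01'")
df.explain()  # did the filter and any broadcast actually make it into the plan?
print(df.rdd.getNumPartitions())  # rough idea of how many tasks the next stage gets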
Quick wins checklist
If your notebook is slow try these in order:
- Check data size, if under 1GB use pandas
- Read only columns you need
- Filter as early as possible
- Enable adaptive query execution
- Set shuffle partitions based on data size (data_gb / 0.128 = partitions)
- Broadcast small tables in joins
- Check for skew in spark UI
- Repartition before writing (target 128MB-1GB files)
These handle 80% of performance problems.
Cost optimization
Performance and cost are connected in fabric. Faster jobs cost less because they use fewer CU-hours.
Real cost example on F16 capacity:
Slow notebook:
- Runtime: 45 minutes
- CU consumption: 12 CU-hours
- If F16 costs $X/hour, this costs 0.75 * X
Optimized notebook:
- Runtime: 12 minutes
- CU consumption: 3.2 CU-hours
- Costs 0.2 * X
Same output, about 73% lower cost just from optimization.
This is why understanding the physics matters. You're paying for compute time. Efficient physics means less time means lower cost.
Learning more about spark
If you're serious about fabric notebooks you need to understand spark fundamentals. Can't optimize what you don't understand.
Key concepts to learn:
- Transformations vs actions
- Wide vs narrow transformations
- Shuffle operations
- Catalyst optimizer
- Tungsten execution engine
You don't need to be a spark expert but understanding the execution model helps you write better notebooks.
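For example, the first concept on that list, transformations vs actions, fits in two lines:
df = spark.table("sales").filter("amount > 0")  # transformation: nothing executes yet
total = df.count()  # action: the whole pipeline runs here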
The lakehouse architecture also matters. How you structure your delta tables affects query performance as much as spark configuration.
Final thoughts
Your notebook code is just half the story. The spark cluster running it is the other half. Most people only think about logic and wonder why things are slow.
Understanding the physics, what resources you have at each SKU level, and how to configure spark properly makes everything faster and cheaper.
Start with the basics: read less data, filter early, enable AQE. Then move to advanced stuff like shuffle partitions and broadcast joins once you understand what's actually happening.
If you're new to fabric check out my intro guide for power bi developers first. Then come back to this when you're writing notebooks that need optimization.
The split between logic and physics is fundamental. Write good code but configure the execution environment properly too. That's when notebooks actually perform well.