Dataflows gen2 in fabric: the upgrade you need to understand

#fabric #dataflows #dataflows-gen2 #power-query #lakehouse

If you've used dataflows in power bi, you might think dataflows gen2 in fabric are just a version bump. They're not. They're a complete rebuild that works totally differently.

Took me a while to figure out why microsoft even made a new version. Turns out the old dataflows had some fundamental limitations and gen2 fixes most of them.

What was wrong with regular dataflows

Regular dataflows (now called gen1) were useful but had issues:

  • Performance was hit or miss, especially with large datasets
  • You couldn't really control where the data landed
  • Refresh times were unpredictable
  • Limited transformation capabilities compared to full spark
  • Stored data in dataverse or power bi storage (black box)

They worked fine for small to medium etl jobs but didn't scale well.

What makes gen2 different

Dataflows gen2 run on spark instead of the old power query engine. This changes everything about how they work.

Key differences:

  • Spark backend - way faster for large data transformations
  • Destination control - you choose where data lands (lakehouse or warehouse)
  • Delta table output - data gets written as proper delta tables
  • Better monitoring - actual spark logs and metrics
  • More flexible - can mix power query with custom spark code

The power query interface looks similar, but under the hood your M code is being translated into spark operations.

gen1 still exists

You can still create regular dataflows in fabric. They show up as just "dataflow", not "dataflow gen2". Microsoft hasn't deprecated gen1 yet, but gen2 is clearly the future.

Creating a dataflow gen2

The process is pretty straightforward:

  1. Go to your fabric workspace
  2. Click new > dataflow gen2
  3. Get data and build your transformations (same power query experience)
  4. Configure your data destination
  5. Save and publish

The power query part feels identical to regular dataflows. The magic is in step 4 where you pick the destination.

Writing to a lakehouse

This is the most common pattern. Your dataflow writes data into delta tables in a lakehouse.

To set up the destination:

  1. Click on a query in your dataflow
  2. At the bottom right you'll see "data destination"
  3. Select lakehouse
  4. Pick your target lakehouse
  5. Choose table name
  6. Set the write mode (replace or append)

Write modes explained:

  • Replace - drops existing table and writes new data (full refresh)
  • Append - adds new rows to existing table (incremental)

For most use cases replace is simpler. Append is useful when you only want to load new records but requires you to handle deduplication logic yourself.

# the dataflow generates spark code similar to this
df.write.format("delta") \
    .mode("overwrite") \
    .option("mergeSchema", "true") \
    .saveAsTable("your_table_name")
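
If you pick append instead of replace, the write is roughly the same, just with append mode (same hypothetical table name as above):

# append mode only adds new rows instead of overwriting the table
df.write.format("delta") \
    .mode("append") \
    .saveAsTable("your_table_name")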

After the dataflow runs, your data shows up in the lakehouse as a delta table. You can query it with sql, use it in notebooks, or build semantic models on top of it.
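
For example, a quick sanity check from a notebook (using the same hypothetical table name as above) could look like this:

# count the rows the dataflow just wrote to the lakehouse
spark.sql("SELECT COUNT(*) AS row_count FROM your_table_name").show()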

table naming

Use clear, consistent naming for your tables, something like bronze_raw_sales or silver_cleaned_customers. That makes it obvious which tables hold raw source data and which hold transformed data.

Writing to a warehouse

You can also write dataflow gen2 output to a fabric warehouse instead of a lakehouse.

Same process, but select warehouse as the destination. The data lands as warehouse tables that you query and manage with t-sql.

When to use warehouse destination:

  • You're more comfortable with t-sql than spark
  • Your downstream tools only understand sql databases
  • You don't need spark processing on the data
  • You want traditional database constraints and indexes

Personally I default to lakehouse unless there's a specific reason to use warehouse. Lakehouses are more flexible and integrate better with the rest of fabric.

If you're torn on this decision, I wrote a detailed comparison of lakehouses vs warehouses that might help.

Incremental refresh patterns

One of the best features in dataflows gen2 is handling incremental loads properly.

Old dataflows had incremental refresh, but it was clunky. Gen2 makes it more explicit.

Pattern 1: Simple append

If your source has new records only (no updates):

  1. Add a filter in power query to get only new records (based on date or id)
  2. Set destination to append mode
  3. Each refresh adds only the new rows (a rough notebook-side equivalent is sketched below)
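
If you were doing the same thing in a notebook instead of power query, a minimal sketch looks something like this (the table names and the order_date watermark column are just assumptions for illustration):

from pyspark.sql import functions as F

# find the most recent date already loaded into the target table
max_loaded = spark.table("sales").agg(F.max("order_date")).collect()[0][0]

# read the source and keep only records newer than that watermark
# (assumes the target already has data; handle the first load separately)
new_rows = spark.table("source_sales").filter(F.col("order_date") > F.lit(max_loaded))

# append just the new rows to the target delta table
new_rows.write.format("delta").mode("append").saveAsTable("sales")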

Pattern 2: Upsert logic

If you need to handle updates to existing records, it gets more complex. You have two options:

Option a: Use replace mode and reload everything. Simple but slow for large tables.

Option b: Use a merge pattern with a notebook after the dataflow. The dataflow stages data and the notebook does a delta merge operation.

# example merge pattern in a notebook
from delta.tables import DeltaTable

# your dataflow writes to a staging table
staging_df = spark.table("staging_sales")

# merge into the main table
target_table = DeltaTable.forName(spark, "sales")

target_table.alias("target").merge(
    staging_df.alias("source"),
    "target.order_id = source.order_id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

This pattern is more work to set up but performs way better for large incremental loads.

Scheduling and refresh

Dataflows gen2 don't auto-refresh by default. You need to set up a schedule.

Two ways to trigger a refresh:

Option 1: Built-in schedule

  1. Go to workspace
  2. Find your dataflow gen2
  3. Click the three dots > settings
  4. Set up a refresh schedule

Works fine for simple cases. Limited to standard schedule options like daily or weekly.

Option 2: Data pipeline

More flexible approach. Create a data pipeline that triggers the dataflow.

This lets you:

  • Run the dataflow based on events or dependencies
  • Chain multiple dataflows in order
  • Add error handling and retry logic
  • Run the dataflow with parameters

For anything beyond basic scheduling, I'd use a pipeline.

Performance tips

Gen2 is faster than gen1 but you can still write slow dataflows.

Things that actually matter for performance:

1. Filter early

Push filters as close to the source as possible. Don't load a million rows just to filter down to 1000 later.

Power query's query folding still applies. Keep an eye on whether your steps fold back to the source or end up running in memory.

2. Avoid unnecessary columns

Remove columns you don't need early in the transformation. Less data moving through the pipeline means faster processing.
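
The same two ideas in notebook terms, purely as an illustration (the table and column names are made up): select only the columns you need and filter as early as possible, so every later step works on less data.

from pyspark.sql import functions as F

# keep only the needed columns and filter as early as possible
orders = spark.table("raw_orders") \
    .select("order_id", "order_date", "amount") \
    .filter(F.col("order_date") >= "2024-01-01")

# downstream steps now process a much smaller dataset
daily_totals = orders.groupBy("order_date").agg(F.sum("amount").alias("total_amount"))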

3. Use native connectors

Native fabric connectors (for sql, blob storage, etc) are optimized. Custom connectors or web calls are slower.

4. Split large dataflows

One giant dataflow with 50 queries is slower than 5 smaller dataflows with 10 queries each. You can run multiple dataflows in parallel via a pipeline.

5. Watch for cross-joins

Merging tables without proper join keys can explode your row count. Power query doesn't always warn you about this.

Monitoring and debugging

When a dataflow fails, you actually get useful logs now.

Go to workspace > find your dataflow > click on it > view refresh history

You'll see:

  • Duration for each query
  • Spark metrics (rows processed, data read)
  • Actual error messages with line numbers
  • Query plan information

Way better than gen1 where you just got "refresh failed" and had to guess why.

refresh failures

Most dataflow failures are source connection issues or query folding problems. Check your data source credentials first, then look at which specific query failed in the logs.

When to use dataflows gen2 vs other options

Fabric gives you multiple ways to load data. Here's when dataflows gen2 make sense:

Use dataflows gen2 when:

  • You're comfortable with power query and want to stick with it
  • Your transformations are mostly standard etl (filters, joins, aggregations)
  • You want a visual interface not code
  • Your team already knows power query from power bi
  • Data volumes are small to medium (under a few hundred gb)

Use notebooks instead when:

  • You need complex python or scala transformations
  • You're working with really large datasets
  • You need machine learning or advanced analytics
  • You want full control over spark configurations
  • Power query doesn't support your source system

Use data pipelines when:

  • You need to orchestrate multiple dataflows or notebooks
  • You have conditional logic or branching
  • You need to copy files or call apis
  • You're migrating from azure data factory

For most power bi developers coming to fabric, dataflows gen2 are the natural starting point. You already know power query so there's minimal learning curve.

Migrating from gen1 to gen2

If you have existing gen1 dataflows, there's no automatic migration. You need to rebuild them as gen2.

Process:

  1. Open your old dataflow
  2. Note down all the queries and transformations
  3. Create a new dataflow gen2
  4. Rebuild the queries (mostly copy-paste)
  5. Configure destinations (new step that didn't exist in gen1)
  6. Test thoroughly before switching over

The power query code is mostly compatible, but double-check:

  • Any custom functions might need adjustments
  • Check that query folding still works
  • Verify data types after loading to the destination (a quick check is sketched below)
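
One quick way to do that last check is to look at the schema of the destination table from a notebook (the table name here is just an example):

# print the column names and data types of the loaded delta table
spark.table("silver_cleaned_customers").printSchema()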

Don't migrate everything at once. Start with one dataflow, test it properly, then move the rest.

Common issues and fixes

Problem: dataflow is really slow

Check if query folding is working. Click on a step and look at the query plan. If it says "evaluated in workspace" instead of "evaluated at source", you're loading too much data into memory.

Fix: rewrite queries to push more work to the source.

Problem: data not showing up in lakehouse

Make sure you configured the data destination for each query you want to output. By default, queries don't write anywhere; they're just intermediate steps.

Problem: dataflow fails with memory errors

You're trying to load too much data in one go. Either filter more aggressively or split into multiple dataflows and use incremental loading.

Problem: timestamp columns have wrong timezone

Common issue. Power query and spark handle timezones differently. Be explicit about timezone conversions in your M code.
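
If you'd rather fix it after the load in a notebook instead, a minimal sketch looks like this (the table, column, and timezone are assumptions for illustration):

from pyspark.sql import functions as F

# treat the stored timestamp as utc and add an explicit local-time column
sales = spark.table("sales").withColumn(
    "order_ts_local",
    F.from_utc_timestamp(F.col("order_ts"), "Europe/Amsterdam")
)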

Next steps

Now that you understand dataflows gen2, you can build proper etl pipelines in fabric.

Recommended flow:

  1. Use dataflows gen2 to load raw data into your lakehouse
  2. Use notebooks for complex transformations if needed
  3. Build semantic models on top of the cleaned data
  4. Create power bi reports using direct lake mode

This gives you a proper data pipeline instead of just importing data directly into semantic models.

If you're just getting started with fabric, check out my intro guide for power bi developers first.

Dataflows gen2 are a solid tool once you understand they're not just upgraded dataflows but a completely different implementation. The power query interface makes them approachable but the spark backend gives you real power for data transformation at scale.
