Dataflows gen2 in fabric: the upgrade you need to understand

#fabric #dataflows #dataflows-gen2 #power-query #lakehouse

If you've used dataflows in power bi, you might think dataflows gen2 in fabric are just a version bump. They're not. They're a complete rebuild that works totally differently.

Took me a while to figure out why microsoft even made a new version. Turns out the old dataflows had some fundamental limitations and gen2 fixes most of them.

What was wrong with regular dataflows

Regular dataflows (now called gen1) were useful but had issues:

  • Performance was hit or miss, especially with large datasets
  • You couldn't really control where the data landed
  • Refresh times were unpredictable
  • Limited transformation capabilities compared to full spark
  • Stored data in dataverse or power bi storage (black box)

They worked fine for small to medium etl jobs but didn't scale well.

What makes gen2 different

Dataflows gen2 run on spark instead of the old power query engine. This changes everything about how they work.

Key differences:

  • Spark backend - way faster for large data transformations
  • Destination control - you choose where data lands (lakehouse or warehouse)
  • Delta table output - data gets written as proper delta tables
  • Better monitoring - actual spark logs and metrics
  • More flexible - can mix power query with custom spark code

The power query interface looks similar, but under the hood your M code is being translated into spark operations.

gen1 still exists

You can still create regular dataflows in fabric. They show up as just "dataflow", not "dataflow gen2". Microsoft hasn't deprecated gen1 yet, but gen2 is clearly the future.

Creating a dataflow gen2

The process is pretty straightforward:

  1. Go to your fabric workspace
  2. Click new > dataflow gen2
  3. Get data and build your transformations (same power query experience)
  4. Configure your data destination
  5. Save and publish

The power query part feels identical to regular dataflows. The magic is in step 4 where you pick the destination.

Writing to a lakehouse

This is the most common pattern. Your dataflow writes data into delta tables in a lakehouse.

To set up the destination:

  1. Click on a query in your dataflow
  2. At the bottom right you'll see "data destination"
  3. Select lakehouse
  4. Pick your target lakehouse
  5. Choose table name
  6. Set the write mode (replace or append)

Write modes explained:

  • Replace - drops existing table and writes new data (full refresh)
  • Append - adds new rows to existing table (incremental)

For most use cases replace is simpler. Append is useful when you only want to load new records but requires you to handle deduplication logic yourself.

# the dataflow generates spark code similar to this
df.write.format("delta") \
    .mode("overwrite") \
    .option("mergeSchema", "true") \
    .saveAsTable("your_table_name")
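
If you pick append instead of replace, the write is roughly the same, just with append mode (same hypothetical table name as above):

# append mode only adds new rows instead of overwriting the table
df.write.format("delta") \
    .mode("append") \
    .saveAsTable("your_table_name")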

After the dataflow runs, your data shows up in the lakehouse as a delta table. You can query it with sql, use it in notebooks, or build semantic models on top of it.
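
For example, a quick sanity check from a notebook (using the same hypothetical table name as above) could look like this:

# count the rows the dataflow just wrote to the lakehouse
spark.sql("SELECT COUNT(*) AS row_count FROM your_table_name").show()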

table naming

Use clear, consistent naming for your tables, something like bronze_raw_sales or silver_cleaned_customers. That makes it obvious which tables hold raw source data and which hold transformed data.

Writing to a warehouse

You can also write dataflow gen2 output to a fabric warehouse instead of a lakehouse.

Same process, but select warehouse as the destination. The data lands as warehouse tables that you query and manage with t-sql.

When to use warehouse destination:

  • You're more comfortable with t-sql than spark
  • Your downstream tools only understand sql databases
  • You don't need spark processing on the data
  • You want traditional database constraints and indexes

Personally I default to lakehouse unless there's a specific reason to use warehouse. Lakehouses are more flexible and integrate better with the rest of fabric.

If you're torn on this decision, I wrote a detailed comparison of lakehouses vs warehouses that might help.

Incremental refresh patterns

One of the best features in dataflows gen2 is handling incremental loads properly.

Old dataflows had incremental refresh, but it was clunky. Gen2 makes it more explicit.

Pattern 1: Simple append

If your source has new records only (no updates):

  1. Add a filter in power query to get only new records (based on date or id)
  2. Set destination to append mode
  3. Each refresh adds only the new rows (a rough notebook-side equivalent is sketched below)
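
If you were doing the same thing in a notebook instead of power query, a minimal sketch looks something like this (the table names and the order_date watermark column are just assumptions for illustration):

from pyspark.sql import functions as F

# find the most recent date already loaded into the target table
max_loaded = spark.table("sales").agg(F.max("order_date")).collect()[0][0]

# read the source and keep only records newer than that watermark
# (assumes the target already has data; handle the first load separately)
new_rows = spark.table("source_sales").filter(F.col("order_date") > F.lit(max_loaded))

# append just the new rows to the target delta table
new_rows.write.format("delta").mode("append").saveAsTable("sales")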

Pattern 2: Upsert logic

If you need to handle updates to existing records, it gets more complex. You have two options:

Option a: Use replace mode and reload everything. Simple but slow for large tables.

Option b: Use a merge pattern with a notebook after the dataflow. The dataflow stages data and the notebook does a delta merge operation.

# example merge pattern in a notebook
from delta.tables import DeltaTable

# your dataflow writes to a staging table
staging_df = spark.table("staging_sales")

# merge into the main table
target_table = DeltaTable.forName(spark, "sales")

target_table.alias("target").merge(
    staging_df.alias("source"),
    "target.order_id = source.order_id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

This pattern is more work to set up but performs way better for large incremental loads.

Scheduling and refresh

Dataflows gen2 don't auto-refresh by default. You need to set up a schedule.

Two ways to trigger a refresh:

Option 1: Built-in schedule

  1. Go to workspace
  2. Find your dataflow gen2
  3. Click the three dots > settings
  4. Set up a refresh schedule

Works fine for simple cases. Limited to standard schedule options like daily or weekly.

Option 2: Data pipeline

More flexible approach. Create a data pipeline that triggers the dataflow.

This lets you:

  • Run the dataflow based on events or dependencies
  • Chain multiple dataflows in order
  • Add error handling and retry logic
  • Run the dataflow with parameters

For anything beyond basic scheduling, I'd use a pipeline.

Performance tips

Gen2 is faster than gen1 but you can still write slow dataflows.

Things that actually matter for performance:

1. Filter early

Push filters as close to the source as possible. Don't load a million rows just to filter down to 1000 later.

Power query's query folding still applies. Keep an eye on whether your steps fold back to the source or end up running in memory.

2. Avoid unnecessary columns

Remove columns you don't need early in the transformation. Less data moving through the pipeline means faster processing.
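
The same two ideas in notebook terms, purely as an illustration (the table and column names are made up): select only the columns you need and filter as early as possible, so every later step works on less data.

from pyspark.sql import functions as F

# keep only the needed columns and filter as early as possible
orders = spark.table("raw_orders") \
    .select("order_id", "order_date", "amount") \
    .filter(F.col("order_date") >= "2024-01-01")

# downstream steps now process a much smaller dataset
daily_totals = orders.groupBy("order_date").agg(F.sum("amount").alias("total_amount"))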

3. Use native connectors

Native fabric connectors (for sql, blob storage, etc) are optimized. Custom connectors or web calls are slower.

4. Split large dataflows

One giant dataflow with 50 queries is slower than 5 smaller dataflows with 10 queries each. You can run multiple dataflows in parallel via a pipeline.

5. Watch for cross-joins

Merging tables without proper join keys can explode your row count. Power query doesn't always warn you about this.

Monitoring and debugging

When a dataflow fails, you actually get useful logs now.

Go to workspace > find your dataflow > click on it > view refresh history

You'll see:

  • Duration for each query
  • Spark metrics (rows processed, data read)
  • Actual error messages with line numbers
  • Query plan information

Way better than gen1 where you just got "refresh failed" and had to guess why.

refresh failures

Most dataflow failures are source connection issues or query folding problems. Check your data source credentials first, then look at which specific query failed in the logs.

When to use dataflows gen2 vs other options

Fabric gives you multiple ways to load data. Here's when dataflows gen2 make sense:

Use dataflows gen2 when:

  • You're comfortable with power query and want to stick with it
  • Your transformations are mostly standard etl (filters, joins, aggregations)
  • You want a visual interface not code
  • Your team already knows power query from power bi
  • Data volumes are small to medium (under a few hundred gb)

Use notebooks instead when:

  • You need complex python or scala transformations
  • You're working with really large datasets
  • You need machine learning or advanced analytics
  • You want full control over spark configurations
  • Power query doesn't support your source system

Use data pipelines when:

  • You need to orchestrate multiple dataflows or notebooks
  • You have conditional logic or branching
  • You need to copy files or call apis
  • You're migrating from azure data factory

For most power bi developers coming to fabric, dataflows gen2 are the natural starting point. You already know power query so there's minimal learning curve.

Migrating from gen1 to gen2

If you have existing gen1 dataflows, there's no automatic migration. You need to rebuild them as gen2.

Process:

  1. Open your old dataflow
  2. Note down all the queries and transformations
  3. Create a new dataflow gen2
  4. Rebuild the queries (mostly copy-paste)
  5. Configure destinations (new step that didn't exist in gen1)
  6. Test thoroughly before switching over

The power query code is mostly compatible, but double-check:

  • Any custom functions might need adjustments
  • Check that query folding still works
  • Verify data types after loading to the destination (a quick check is sketched below)
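
One quick way to do that last check is to look at the schema of the destination table from a notebook (the table name here is just an example):

# print the column names and data types of the loaded delta table
spark.table("silver_cleaned_customers").printSchema()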

Don't migrate everything at once. Start with one dataflow, test it properly, then move the rest.

Common issues and fixes

Problem: dataflow is really slow

Check if query folding is working. Click on a step and look at the query plan. If it says "evaluated in workspace" instead of "evaluated at source", you're loading too much data into memory.

Fix: rewrite queries to push more work to the source.

Problem: data not showing up in lakehouse

Make sure you configured the data destination for each query you want to output. By default, queries don't write anywhere; they're just intermediate steps.

Problem: dataflow fails with memory errors

You're trying to load too much data in one go. Either filter more aggressively or split into multiple dataflows and use incremental loading.

Problem: timestamp columns have wrong timezone

Common issue. Power query and spark handle timezones differently. Be explicit about timezone conversions in your M code.
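
If you'd rather fix it after the load in a notebook instead, a minimal sketch looks like this (the table, column, and timezone are assumptions for illustration):

from pyspark.sql import functions as F

# treat the stored timestamp as utc and add an explicit local-time column
sales = spark.table("sales").withColumn(
    "order_ts_local",
    F.from_utc_timestamp(F.col("order_ts"), "Europe/Amsterdam")
)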

Next steps

Now that you understand dataflows gen2, you can build proper etl pipelines in fabric.

Recommended flow:

  1. Use dataflows gen2 to load raw data into your lakehouse
  2. Use notebooks for complex transformations if needed
  3. Build semantic models on top of the cleaned data
  4. Create power bi reports using direct lake mode

This gives you a proper data pipeline instead of just importing data directly into semantic models.

If you're just getting started with fabric, check out my intro guide for power bi developers first.

Dataflows gen2 are a solid tool once you understand they're not just upgraded dataflows but a completely different implementation. The power query interface makes them approachable but the spark backend gives you real power for data transformation at scale.
