Picture this: you're maintaining a Power BI report that pulls from a transactional database with 50 million rows. Every morning at 6 AM, your scheduled refresh kicks off and proceeds to re-download all 50 million rows, apply the same transformations it applied yesterday, and rebuild every aggregation from scratch. By 6:45 AM, your gateway is exhausted, your source system's DBA is sending you passive-aggressive emails, and your report still isn't ready for the 7 AM standup. Meanwhile, only 12,000 rows actually changed overnight.
This is the incremental refresh problem, and it's one of the most common sources of wasted compute, refresh timeouts, and general misery in production Power BI environments. Power Query gives you more control over this than most practitioners realize — and when you combine smart watermark logic, persistent state management, and change tracking patterns, you can build refresh pipelines that only process what's actually new or changed. By the end of this lesson, you'll understand how to design and implement those patterns yourself.
What you'll learn:
You should be comfortable with:
Before writing a single line of M, let's be precise about what we're solving. A full refresh has three failure modes at scale:
Volume failure is the most obvious. If your source has 100 million rows, pulling all of them every refresh cycle is simply not sustainable. Network bandwidth, gateway memory, and source system load all become constraints.
Incremental failure is subtler. Even if volume isn't the issue today, growth makes it one eventually. A table that's manageable at 10 million rows becomes a problem at 100 million. Pipelines that don't account for growth fail gradually, then suddenly.
Change complexity failure is the sneakiest. Some datasets don't just grow — they mutate. Rows get updated, deleted, or backdated. A naive "give me everything after the last ID I saw" approach misses updates entirely and may never detect deletes.
The solution architecture has three components working together:
Power Query handles all three, but it needs help with persistence — and that's where the design gets interesting.
Power Query itself is stateless. Every time a refresh runs, M code executes from scratch with no memory of previous executions. This is by design — it keeps queries deterministic and reproducible. But incremental refresh requires state: specifically, the answer to "what was the last thing I processed?"
The standard solution is to externalize your state. You store your watermark somewhere Power Query can read it at the start of a refresh and write it at the end. Common storage options are:
| Storage Location | Read | Write | Best For |
|---|---|---|---|
| Excel file (SharePoint/OneDrive) | Easy | Manual or Flow | Small teams, low frequency |
| SharePoint List | Easy | Power Automate | Medium complexity |
| SQL table | Easy | Dataflow or stored proc | Production pipelines |
| Azure Blob / Data Lake | Easy | Power Automate / ADF | Enterprise scale |
For this lesson, we'll use a SQL table as the persistent store because it's the most production-realistic and gives us transactional safety. The pattern translates directly to the other options.
Here's the watermark table we'll maintain in SQL:
CREATE TABLE dbo.RefreshWatermarks (
TableName NVARCHAR(100) PRIMARY KEY,
LastWatermark DATETIME2,
LastRunStatus NVARCHAR(20),
LastRunStart DATETIME2,
LastRunEnd DATETIME2
);
-- Seed the initial record
INSERT INTO dbo.RefreshWatermarks VALUES (
'SalesTransactions',
'2024-01-01 00:00:00',
'Success',
NULL,
NULL
);
This table does double duty: it stores the watermark value and tracks whether the last run succeeded. That second capability becomes important when we build failure recovery.
Let's set up the actual source data scenario. We're working with a SalesTransactions table that receives roughly 15,000 new rows per day and has updates to existing rows (price corrections, status changes) on maybe 500 rows per day. Here's the source schema:
CREATE TABLE dbo.SalesTransactions (
TransactionID INT PRIMARY KEY,
CustomerID INT,
ProductID INT,
SaleAmount DECIMAL(10,2),
SaleStatus NVARCHAR(20),
CreatedAt DATETIME2,
ModifiedAt DATETIME2 -- Updated whenever a row changes
);
The ModifiedAt column is our friend. Any system worth its salt tracks when rows were last modified — if yours doesn't, now is the time to advocate loudly for adding that column.
Create a new query called WatermarkValue. This query connects to the watermark table and extracts the current boundary:
let
Source = Sql.Database("your-server.database.windows.net", "YourDatabase"),
WatermarkTable = Source{[Schema="dbo", Item="RefreshWatermarks"]}[Data],
FilteredRow = Table.SelectRows(
WatermarkTable,
each [TableName] = "SalesTransactions"
),
WatermarkDatetime = FilteredRow{0}[LastWatermark]
in
WatermarkDatetime
This query returns a single datetime2 value. Disable load for this query (right-click in the Queries pane, uncheck "Enable Load") — it's a helper, not a destination table.
Tip: Always test this query in isolation first. Open the Advanced Editor, confirm you're getting the right scalar value, and check the data type. A type mismatch between your watermark and your filter column is the #1 cause of silent full-table scans.
Now create the main SalesTransactions_Incremental query. Notice how we reference WatermarkValue directly:
let
Source = Sql.Database("your-server.database.windows.net", "YourDatabase"),
TransactionsTable = Source{[Schema="dbo", Item="SalesTransactions"]}[Data],
// Filter to only rows modified after the watermark
// This captures BOTH new rows and updated rows
FilteredRows = Table.SelectRows(
TransactionsTable,
each [ModifiedAt] > WatermarkValue
),
// Apply your standard transformations
TypedColumns = Table.TransformColumnTypes(
FilteredRows,
{
{"TransactionID", Int64.Type},
{"CustomerID", Int64.Type},
{"ProductID", Int64.Type},
{"SaleAmount", type number},
{"SaleStatus", type text},
{"CreatedAt", type datetime},
{"ModifiedAt", type datetime}
}
),
// Add a processing metadata column
WithProcessingTimestamp = Table.AddColumn(
TypedColumns,
"ProcessedAt",
each DateTime.LocalNow(),
type datetime
)
in
WithProcessingTimestamp
This is where query folding matters enormously. The Table.SelectRows filter against ModifiedAt needs to fold back to the SQL source as a WHERE clause, not execute in Power Query's engine. If it folds correctly, you're asking the database to give you only the changed rows. If it doesn't fold, Power Query downloads the entire table and filters in memory — exactly what you were trying to avoid.
Warning: Check query folding by right-clicking the last applied step and selecting "View Native Query." If the option is grayed out, folding has broken somewhere in your step chain. Common culprits are custom functions,
List.Generatecalls, and referencing other queries mid-chain. Keep your filter steps as early as possible.
When folding works correctly, the native query for the filter step should look something like:
SELECT [TransactionID], [CustomerID], [ProductID], [SaleAmount],
[SaleStatus], [CreatedAt], [ModifiedAt]
FROM [dbo].[SalesTransactions]
WHERE [ModifiedAt] > '2024-09-15 06:00:00'
That's what efficiency looks like. The database handles the filter, returns only the relevant rows, and Power Query just needs to transform the results.
An incremental extract is only half the picture. You need to merge those new/changed rows into your existing dataset. There are three main patterns, and choosing between them depends on your data characteristics.
If your data is genuinely immutable — events are recorded once and never changed — the merge is just an append. This is the simplest case and the one Power BI's built-in Incremental Refresh feature handles for you. We're going further than that here.
This is the most common real-world case. New rows come in, existing rows sometimes get updated, but nothing is ever deleted. The merge logic:
TransactionID appears in the incremental batch (these are the updated rows)In M, assuming you have SalesTransactions_Full as your existing dataset and SalesTransactions_Incremental as the batch:
let
// Load the existing full dataset
Existing = SalesTransactions_Full,
// Get the list of IDs that appear in the incremental batch
IncrementalIDs = List.Buffer(
Table.Column(SalesTransactions_Incremental, "TransactionID")
),
// Remove stale versions of updated rows from existing data
ExistingWithoutUpdated = Table.SelectRows(
Existing,
each not List.Contains(IncrementalIDs, [TransactionID])
),
// Append the incremental batch (new rows + updated row versions)
Merged = Table.Combine({ExistingWithoutUpdated, SalesTransactions_Incremental}),
// Sort to maintain consistent ordering
Sorted = Table.Sort(Merged, {{"TransactionID", Order.Ascending}})
in
Sorted
List.Buffer is doing important work here — it materializes the ID list into memory once so the subsequent List.Contains call doesn't re-evaluate the source query for every row in Existing. Without it, you can hit severe performance degradation on large datasets.
Deletes are the hard problem. If a row disappears from the source, your watermark-based query will never see it — you only see what changed, not what vanished. There are two approaches:
Approach 1: Soft deletes. The source system never physically deletes rows; instead it marks them with a IsDeleted flag and updates ModifiedAt. Your incremental query naturally picks these up, and your report-level filters handle excluding them. This is the cleanest solution and worth negotiating with your source system owners.
Approach 2: Periodic full reconciliation. On a schedule (weekly, monthly), you perform a full key comparison between your local dataset and the source. This is expensive but necessary if soft deletes aren't available:
let
// Pull just the IDs currently in the source (no full data transfer)
SourceIDs = Table.SelectColumns(
Sql.Database("your-server", "YourDatabase"){
[Schema="dbo", Item="SalesTransactions"]
}[Data],
{"TransactionID"}
),
SourceIDList = List.Buffer(Table.Column(SourceIDs, "TransactionID")),
// Remove rows from local dataset that no longer exist in source
ExistingReconciled = Table.SelectRows(
SalesTransactions_Full,
each List.Contains(SourceIDList, [TransactionID])
)
in
ExistingReconciled
Warning: The reconciliation query above will not fold — you're comparing across a live source and an in-memory table. This is expensive. Run it on a separate, less-frequent schedule, not every incremental refresh.
Now let's put persistent state properly in place. The challenge is that Power Query can read state but can't write it — M is a query language, not a scripting language. The write-back step needs to happen outside Power Query.
Here's the production architecture:
[Source DB] --read--> [Power Query: read watermark]
-> [Power Query: extract incremental]
-> [Power Query: merge to full dataset]
-> [Power BI: load data model]
-> [Power Automate: update watermark table]
The Power Automate flow triggers on successful Power BI refresh completion and executes a SQL stored procedure:
CREATE PROCEDURE dbo.UpdateRefreshWatermark
@TableName NVARCHAR(100),
@NewWatermark DATETIME2,
@RunStatus NVARCHAR(20)
AS
BEGIN
UPDATE dbo.RefreshWatermarks
SET
LastWatermark = @NewWatermark,
LastRunStatus = @RunStatus,
LastRunEnd = SYSUTCDATETIME()
WHERE TableName = @TableName;
END;
What value do we pass as @NewWatermark? The safest choice is MAX(ModifiedAt) from the incremental batch that was just loaded. You can expose this as a separate query in Power Query:
// Query: NewWatermarkValue
// Disable Load - used by Power Automate via dataset REST API or measure
let
MaxModified = List.Max(
Table.Column(SalesTransactions_Incremental, "ModifiedAt")
),
// Fall back to current time if incremental batch was empty
Result = if MaxModified = null
then DateTime.UtcNow()
else MaxModified
in
Result
Power Automate can read this value via the Power BI REST API (Get Dataset In Group / Get Tables) or you can surface it as a card visual that Power Automate reads with the "Get a row" action on a SharePoint list that the report writes to.
Tip: There's a simpler option for teams that don't want a Power Automate dependency: use a Power BI dataflow instead of a Power BI Desktop dataset. Dataflows support output to Azure Data Lake and have direct integration with Azure Data Factory for writeback. If you're already in the Azure ecosystem, this pipeline is cleaner end-to-end.
In distributed systems, transactions don't always arrive in order. A sale that happened yesterday might not appear in your source database until today, due to replication lag, ETL pipelines feeding the source, or timezone issues. If your watermark moved past that row's ModifiedAt, you'll never pick it up.
The fix is a lookback window: instead of filtering ModifiedAt > WatermarkValue, filter ModifiedAt > WatermarkValue - #duration(0, 4, 0, 0) (i.e., four hours before the watermark). This overlaps slightly with the previous refresh window, ensuring late arrivals are captured.
// In your SalesTransactions_Incremental query:
LookbackWatermark = WatermarkValue - #duration(0, 4, 0, 0),
FilteredRows = Table.SelectRows(
TransactionsTable,
each [ModifiedAt] > LookbackWatermark
)
The lookback means you'll always pull some duplicate rows. That's fine — your upsert merge logic handles duplicates correctly by replacing older versions with newer ones. The only cost is slightly higher row counts in the incremental batch.
DATETIME2 in SQL Server is timezone-naive. If your Power Query session runs in UTC but your database stores times in US Eastern, your watermark comparison will be off by 4-5 hours depending on DST. Standardize on UTC everywhere and be explicit about conversions:
// Force UTC comparison by converting local Power Query time to UTC
CurrentUTC = DateTimeZone.UtcNow(),
CurrentUTCNaive = DateTime.From(DateTimeZone.RemoveZone(CurrentUTC)),
// Apply as upper bound if you want to avoid partial-hour data
FilteredRows = Table.SelectRows(
TransactionsTable,
each [ModifiedAt] > WatermarkValue
and [ModifiedAt] <= CurrentUTCNaive - #duration(0, 1, 0, 0)
)
The upper bound filter (<= now - 1 hour) prevents you from capturing rows that are in-flight — partially committed transactions that might still be changing. It's a safety buffer that the watermark update will then advance past on the next successful run.
If a refresh fails halfway through — say, the connection drops after extracting data but before Power Automate updates the watermark — your next refresh will re-process the same batch. This is called at-least-once processing, and it's the correct failure mode for this architecture. Because your merge logic is an upsert, re-processing the same rows is safe — you'll just end up with the same result.
What you want to avoid is advancing the watermark before confirming success. Never update the watermark at the start of a refresh run. Always update it at the end, after successful load.
The LastRunStatus column in our watermark table helps you build alerting. If Power Automate sees LastRunStatus = 'Failed' persisted from a previous run, it can trigger a notification or retry logic before attempting an update.
-- Power Automate can check this before updating
SELECT LastRunStatus, LastRunStart, LastRunEnd
FROM dbo.RefreshWatermarks
WHERE TableName = 'SalesTransactions'
AND LastRunStatus = 'Failed'
AND LastRunEnd < DATEADD(HOUR, -2, SYSUTCDATETIME());
-- Alert if this returns any rows
Let's bring everything together in a realistic end-to-end setup. You're building a daily refresh pipeline for a retail analytics report. The source is SQL Server. You want incremental extraction with upsert merging, late-arrival protection, and watermark persistence.
let
Source = Sql.Database("sqlprod01.yourcompany.com", "RetailDW"),
WatermarkTable = Source{[Schema="dbo", Item="RefreshWatermarks"]}[Data],
SalesRow = Table.SelectRows(
WatermarkTable,
each [TableName] = "SalesTransactions"
),
Watermark = SalesRow{0}[LastWatermark],
// Apply lookback window for late-arriving data
LookbackWatermark = Watermark - #duration(0, 4, 0, 0)
in
LookbackWatermark
let
Source = Sql.Database("sqlprod01.yourcompany.com", "RetailDW"),
RawTable = Source{[Schema="dbo", Item="SalesTransactions"]}[Data],
// Filter for changed rows - this MUST fold for performance
ChangedRows = Table.SelectRows(
RawTable,
each [ModifiedAt] > WatermarkValue
),
// Standard type enforcement
TypedTable = Table.TransformColumnTypes(ChangedRows, {
{"TransactionID", Int64.Type},
{"CustomerID", Int64.Type},
{"ProductID", Int64.Type},
{"SaleAmount", Currency.Type},
{"SaleStatus", type text},
{"CreatedAt", type datetime},
{"ModifiedAt", type datetime},
{"StoreID", Int64.Type},
{"RegionCode", type text}
}),
// Enrich with processing metadata
WithBatchID = Table.AddColumn(
TypedTable,
"BatchID",
each Number.From(DateTime.FixedLocalNow()),
type number
)
in
WithBatchID
Note: DateTime.FixedLocalNow() evaluates once per query execution (unlike DateTime.LocalNow() which can evaluate per row). Using it as a BatchID stamps all rows in this delta with the same value, which is useful for debugging and auditing.
let
// Load the persisted full dataset
// In production, this would be your Power BI dataflow output
// or a staging SQL table maintained by your merge logic
ExistingDataSource = Sql.Database(
"sqlprod01.yourcompany.com", "RetailDW"
),
ExistingTable = ExistingDataSource{
[Schema="dbo", Item="SalesTransactions_Processed"]
}[Data],
// Buffer the delta IDs to avoid repeated evaluation
DeltaIDs = List.Buffer(
Table.Column(SalesTransactions_Delta, "TransactionID")
),
// Remove stale versions of rows appearing in this delta
PrunedExisting = Table.SelectRows(
ExistingTable,
each not List.Contains(DeltaIDs, [TransactionID])
),
// Combine pruned existing with fresh delta
MergedTable = Table.Combine({PrunedExisting, SalesTransactions_Delta}),
// Final sort for presentation consistency
SortedTable = Table.Sort(
MergedTable,
{{"TransactionID", Order.Ascending}}
)
in
SortedTable
let
MaxFromDelta = List.Max(
Table.Column(SalesTransactions_Delta, "ModifiedAt")
),
// If no new data, keep the existing watermark (don't advance into future)
FinalWatermark = if MaxFromDelta = null
then WatermarkValue
else MaxFromDelta
in
FinalWatermark
This pattern isn't universally appropriate. Here's an honest assessment:
Use this when:
ModifiedAt or equivalent change tracking columnDon't use this when:
Performance tips:
ModifiedAt column on the source table. Without an index, even a perfectly-folded query will result in a full table scan at the database level.Build a complete incremental refresh pipeline for a fictional CustomerOrders table. Here's your setup:
Source table definition:
CREATE TABLE dbo.CustomerOrders (
OrderID INT PRIMARY KEY,
CustomerID INT,
OrderDate DATE,
TotalAmount DECIMAL(10,2),
OrderStatus NVARCHAR(30),
CreatedAt DATETIME2 DEFAULT SYSUTCDATETIME(),
ModifiedAt DATETIME2 DEFAULT SYSUTCDATETIME()
);
Your tasks:
Create the RefreshWatermarks table and seed it with an initial watermark of 2024-01-01 00:00:00 for CustomerOrders.
Write a WatermarkValue query in Power Query that reads the watermark and applies a 2-hour lookback window.
Write a CustomerOrders_Delta query that:
ModifiedAt > WatermarkValueRefreshTimestamp column using DateTime.FixedLocalNow()Write a CustomerOrders_Full query that merges the delta into an existing base dataset using an upsert pattern. Use List.Buffer on the delta key list.
Write a NewWatermarkValue query that returns MAX(ModifiedAt) from the delta, with a null-safe fallback to the current watermark.
Bonus: Add a record count comparison step that raises an error using error keyword if the delta contains more than 100,000 rows — this acts as a safety check against watermark corruption pulling an unexpectedly large batch:
// Add this check in your delta query before the final output
SafetyCheck = if Table.RowCount(WithBatchID) > 100000
then error Error.Record(
"ExcessiveDeltaSize",
"Delta row count exceeds safety threshold. Check watermark.",
[RowCount = Table.RowCount(WithBatchID)]
)
else WithBatchID
Mistake 1: The watermark never advances
Symptom: Every refresh pulls a massive batch because WatermarkValue returns the same old value every time.
Cause: The Power Automate flow isn't triggering, or the stored procedure is failing silently, or you're updating the wrong TableName in the watermark table.
Fix: Query the watermark table directly after a few refresh cycles. If the LastWatermark value isn't changing, add logging to your Power Automate flow and test the stored procedure call manually.
Mistake 2: Query folding breaks on the filter step
Symptom: Refreshes are slow even though the delta should be small. The gateway shows high memory usage.
Cause: Something before your filter step broke the foldable chain. Common culprits: referencing the WatermarkValue query directly causes folding to break because Power Query can't express a cross-query reference as SQL.
Fix: Convert WatermarkValue to a parameter instead of a query reference. Right-click the WatermarkValue query, select "Convert to Parameter." Power Query can fold parameterized filters because it substitutes the literal value directly into the SQL predicate.
Mistake 3: List.Contains performance on large datasets
Symptom: The merge query is extremely slow despite the source data being manageable.
Cause: List.Contains on an unbuffered list re-evaluates the source query for every row check.
Fix: Always wrap the ID list in List.Buffer(). This materializes the list into memory once:
// Slow:
each not List.Contains(Table.Column(Delta, "TransactionID"), [TransactionID])
// Fast:
BufferedIDs = List.Buffer(Table.Column(Delta, "TransactionID")),
// then:
each not List.Contains(BufferedIDs, [TransactionID])
Mistake 4: Watermark moves backward after a failed refresh
Symptom: Data appears to disappear or duplicate after a refresh failure and recovery.
Cause: Watermark update logic ran before data was fully loaded, or the update logic doesn't distinguish success from failure.
Fix: Always update the watermark after a confirmed successful load. Use Power Automate's "Run after" configuration to trigger the update flow only on success of the Power BI refresh action, never on failure or timeout.
Mistake 5: Timezone offset causes phantom data gaps
Symptom: Data from certain hours of the day is always missing or always duplicated.
Cause: A timezone mismatch between the Power Query execution environment (UTC) and the source database (local time).
Fix: Standardize all timestamp comparisons to UTC. Use DateTimeZone.UtcNow() in Power Query and SYSUTCDATETIME() in SQL Server. If the source can't be changed, apply a fixed offset in your watermark comparison:
// Adjust for UTC+5 source if you can't change the source
AdjustedWatermark = WatermarkValue + #duration(0, 5, 0, 0)
You've built a production-grade incremental refresh pipeline from first principles. The core ideas to carry forward:
List.Buffer keeps this performant.Where to go next:
ModifiedAt-based tracking. CDC tracks row-level changes at the database engine level, giving you precise new/updated/deleted signals without relying on application-maintained timestamps.Diagnostics.ActivityIdentifier, Table.View) gives you programmatic access to folding behavior that goes beyond what "View Native Query" shows.The incremental refresh problem is ultimately a systems design problem dressed up in M code. The patterns you've learned here will serve you not just in Power Query, but in any data pipeline where you're trading compute efficiency against data completeness.
Learning Path: Power Query Essentials