Wicked Smart Data

Loading Data from APIs and Web Pages with Power Query

Power Query · Expert · 23 min read · Apr 9, 2026 · Updated Apr 9, 2026
Table of Contents
  • Prerequisites
  • Understanding Web Data Sources in Power Query
  • Building Your First API Connection
  • Handling API Parameters and Headers
  • Managing Authentication
  • Handling Pagination and Large Datasets
  • Basic Pagination Pattern
  • Cursor-Based Pagination
  • Web Scraping with Advanced HTML Parsing
  • Basic HTML Table Extraction
  • Advanced CSS Selector Parsing
  • Dynamic Content and JavaScript-Rendered Pages
  • Error Handling and Resilience Patterns
  • Performance Optimization and Caching Strategies
  • Building Production-Ready Web Data Pipelines
  • Advanced Integration Patterns
  • Hands-On Exercise

Your quarterly revenue dashboard is pulling from six different systems, and three of them don't offer direct database access. Instead, you're stuck copying data from web portals and wrestling with CSV exports that break every time the vendor updates their interface. Meanwhile, your competitor is releasing market analysis reports twice as fast as your team, and you suspect they've automated their data collection from the same public APIs you're manually checking each week.

This scenario plays out in organizations everywhere. While Power Query excels at connecting to databases and files, the real competitive advantage comes from automating data collection from web sources—APIs that provide real-time market data, customer portals that require authentication, and dynamic web pages that generate reports on demand. Mastering these techniques transforms you from a data processor into a data intelligence architect.

By the end of this lesson, you'll be building robust, automated pipelines that pull data from web sources, handle authentication complexities, and degrade gracefully when APIs change their structure.

What you'll learn:

  • Design resilient API connections that handle rate limits, pagination, and authentication
  • Extract structured data from complex web pages using advanced HTML parsing techniques
  • Implement error handling and retry logic for unstable web sources
  • Build parameterized queries that scale across multiple endpoints and time periods
  • Optimize web data refreshes for performance and reliability in enterprise environments

Prerequisites

You should be comfortable with Power Query's basic data transformation operations and M language syntax. Experience with HTTP concepts (headers, status codes, authentication) and basic understanding of JSON/XML structures will be helpful but not required.

Understanding Web Data Sources in Power Query

Power Query approaches web data through two primary connectors: Web.Contents() for APIs and structured web requests, and Web.Page() for scraping HTML content. While these might seem similar, they operate at different levels of the web stack and require distinct strategies.

Web.Contents() works at the HTTP protocol level, sending requests directly to endpoints and receiving structured responses. This makes it ideal for REST APIs, web services, and any scenario where you're requesting data rather than presentation markup. Web.Page() processes complete HTML documents, parsing them into navigable table structures—perfect for extracting data from human-readable web pages.

The architectural distinction matters because it determines your error handling strategy, refresh behavior, and performance characteristics. API connections typically fail fast with specific error codes, while web scraping fails gradually as page structures change.
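To make the distinction concrete, here is a minimal sketch of both connectors side by side (the URLs are placeholders, not real endpoints):

```m
let
    // Web.Contents: a raw HTTP request — you decide how to parse the response
    ApiData = Json.Document(Web.Contents("https://api.example.com/orders")),
    
    // Web.Page: parses a full HTML document into rows of detected tables
    PageTables = Web.Page(Web.Contents("https://example.com/quarterly-report")),
    FirstTable = PageTables{0}[Data]
in
    // Return both results side by side for comparison
    [Api = ApiData, Scraped = FirstTable]
```

Web.Contents on its own returns binary content; pairing it with Json.Document, Xml.Tables, or Csv.Document is what determines the shape of the result.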

Building Your First API Connection

Let's start with a real-world scenario: connecting to the GitHub API to analyze repository activity across your organization. This example demonstrates core concepts without requiring authentication initially.

let
    // Define the base URL and endpoint
    BaseUrl = "https://api.github.com",
    Endpoint = "/repos/microsoft/powerquery-m/commits",
    
    // Make the initial request
    Source = Json.Document(Web.Contents(BaseUrl & Endpoint)),
    
    // Convert to table for easier manipulation
    CommitsTable = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    
    // Expand the JSON structure
    ExpandedCommits = Table.ExpandRecordColumn(CommitsTable, "Column1", 
        {"sha", "commit", "author", "committer"})
        
in
    ExpandedCommits

This basic pattern—request, convert, expand—forms the foundation of most API connections. But real-world APIs require more sophistication.

Handling API Parameters and Headers

APIs rarely accept bare requests. They expect parameters, authentication headers, and specific content types. Here's how to structure these properly:

let
    // API configuration
    BaseUrl = "https://api.github.com",
    Repository = "microsoft/powerquery-m",
    
    // Build query parameters
    QueryParams = [
        #"since" = "2024-01-01T00:00:00Z",
        #"per_page" = "100",
        #"page" = "1"
    ],
    
    // Construct headers
    Headers = [
        #"Accept" = "application/vnd.github.v3+json",
        #"User-Agent" = "PowerQuery-DataPipeline/1.0"
    ],
    
    // Build the complete URL with parameters
    Url = BaseUrl & "/repos/" & Repository & "/commits?" & 
          Uri.BuildQueryString(QueryParams),
    
    // Make the request with headers
    Source = Json.Document(Web.Contents(Url, [Headers=Headers])),
    
    // Process the response
    ProcessedData = Table.FromList(Source, 
        Splitter.SplitByNothing(), null, null, ExtraValues.Error)
        
in
    ProcessedData

Performance Tip: Power Query caches Web.Contents() responses based on the complete URL and headers. Slight variations in parameter formatting can break caching, so standardize your URL construction patterns.
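One reliable way to standardize construction is to pass the RelativePath and Query options to Web.Contents rather than concatenating strings; this also keeps scheduled refresh working in the Power BI service, which requires a static base URL. Reworking the earlier GitHub request:

```m
let
    Source = Json.Document(
        Web.Contents(
            "https://api.github.com",
            [
                // The base URL stays static; the rest is supplied as options
                RelativePath = "repos/microsoft/powerquery-m/commits",
                // Query values must be text; Power Query encodes them
                // consistently, which keeps the cache key stable
                Query = [since = "2024-01-01T00:00:00Z", per_page = "100"]
            ]
        )
    )
in
    Source
```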

Managing Authentication

Most business APIs require authentication. Power Query supports several methods, each with specific implementation patterns.

API Key Authentication

The simplest approach uses API keys in headers or query parameters:

let
    // Store sensitive data in parameters
    ApiKey = "your-api-key-here", // In production, use Power Query parameters
    
    Headers = [
        #"Authorization" = "Bearer " & ApiKey,
        #"Content-Type" = "application/json"
    ],
    
    Source = Json.Document(Web.Contents(
        "https://api.example.com/data",
        [Headers = Headers]
    ))
in
    Source

Security Warning: Never hardcode API keys in queries. Use Power Query parameters and configure them at the dataset level to keep credentials secure.

OAuth 2.0 Flows

OAuth requires a more complex approach since Power Query can't handle interactive authentication flows directly:

let
    // Pre-obtained access token (from your OAuth flow)
    AccessToken = "your-oauth-token",
    
    // Function to refresh the token when it expires
    RefreshToken = (refresh_token as text) =>
        let
            TokenUrl = "https://oauth.example.com/token",
            Body = "grant_type=refresh_token&refresh_token=" & refresh_token,
            
            Response = Json.Document(Web.Contents(TokenUrl, [
                Headers = [#"Content-Type" = "application/x-www-form-urlencoded"],
                Content = Text.ToBinary(Body)
            ])),
            
            NewToken = Response[access_token]
        in
            NewToken,
    
    // Request helper so the retry path can reuse the same call
    MakeRequest = (token as text) =>
        Json.Document(Web.Contents(
            "https://api.example.com/data",
            [Headers = [
                #"Authorization" = "Bearer " & token,
                #"Accept" = "application/json"
            ]]
        )),
    
    // Try the current token; if the request fails, refresh and retry once
    Source = try MakeRequest(AccessToken)
        otherwise MakeRequest(RefreshToken("your-refresh-token"))
in
    Source

For production OAuth implementations, consider building a custom connector or using Azure Function proxies to manage token lifecycle outside Power Query.

Handling Pagination and Large Datasets

APIs limit response sizes through pagination. Power Query's functional approach makes building pagination logic elegant but requires understanding recursive patterns.

Basic Pagination Pattern

Here's a robust pagination implementation for APIs that use page-based pagination:

let
    // Pagination function
    GetAllPages = (baseUrl as text, initialPage as number) =>
        let
            // Recursive function to fetch pages
            GetPage = (pageNumber as number, accumulated as list) =>
                let
                    Url = baseUrl & "?page=" & Number.ToText(pageNumber) & "&per_page=100",
                    
                    Response = try Json.Document(Web.Contents(Url)) otherwise null,
                    
                    // Check if we got data
                    HasData = Response <> null and List.Count(Response) > 0,
                    
                    // Add current page to accumulated results
                    NewAccumulated = if HasData then 
                        accumulated & Response 
                    else 
                        accumulated,
                    
                    // Decide whether to continue
                    Result = if HasData and List.Count(Response) = 100 then
                        @GetPage(pageNumber + 1, NewAccumulated)
                    else
                        NewAccumulated
                in
                    Result,
            
            // Start the recursion
            AllData = GetPage(initialPage, {})
        in
            AllData,
    
    // Usage
    Source = GetAllPages("https://api.github.com/repos/microsoft/powerquery-m/commits", 1),
    
    // Convert to table
    Table = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error)
    
in
    Table
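Recursion works, but deeply nested calls can hit evaluator limits on very long result sets. List.Generate expresses the same loop iteratively; here is a sketch against the same hypothetical page-based API:

```m
let
    BaseUrl = "https://api.example.com/items",  // hypothetical endpoint
    
    // Fetch one page, returning an empty list on failure
    GetPage = (page as number) =>
        try Json.Document(Web.Contents(
            BaseUrl & "?page=" & Number.ToText(page) & "&per_page=100"))
        otherwise {},
    
    // Each state carries the page number and its data; stop on an empty page
    Pages = List.Generate(
        () => [Page = 1, Data = GetPage(1)],
        each List.Count([Data]) > 0,
        each [Page = [Page] + 1, Data = GetPage([Page] + 1)],
        each [Data]
    ),
    
    AllRecords = List.Combine(Pages)
in
    AllRecords
```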

Cursor-Based Pagination

Many modern APIs use cursor-based pagination for better performance:

let
    GetAllRecords = (baseUrl as text, optional cursor as text) =>
        let
            // Build URL with cursor
            Url = if cursor = null then 
                baseUrl 
            else 
                baseUrl & "?cursor=" & cursor,
            
            Response = Json.Document(Web.Contents(Url)),
            
            // Extract data and pagination info
            Data = Response[data],
            NextCursor = try Response[pagination][next_cursor] otherwise null,
            HasMore = NextCursor <> null,
            
            // Recursive call if more data exists
            Result = if HasMore then
                Data & @GetAllRecords(baseUrl, NextCursor)
            else
                Data
        in
            Result,
    
    Source = GetAllRecords("https://api.example.com/records"),
    Table = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error)
    
in
    Table

Rate Limiting Consideration: Aggressive pagination can trigger API rate limits. Consider adding delays using Function.InvokeAfter() in production scenarios.
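Function.InvokeAfter wraps a call so it runs only after a delay, which is a simple way to space requests out. A sketch with hypothetical endpoints, waiting one second before each request fires:

```m
let
    // Delay each request by one second before it fires
    DelayedGet = (url as text) =>
        Function.InvokeAfter(
            () => Json.Document(Web.Contents(url)),
            #duration(0, 0, 0, 1)
        ),
    
    Urls = {
        "https://api.example.com/reports/1",
        "https://api.example.com/reports/2"
    },
    
    Results = List.Transform(Urls, DelayedGet)
in
    Results
```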

Web Scraping with Advanced HTML Parsing

When APIs aren't available, web scraping becomes necessary. Power Query's HTML parsing capabilities are sophisticated but require understanding the document object model.

Basic HTML Table Extraction

Start with a simple scenario—extracting financial data from a public website:

let
    // Load the web page
    Source = Web.Page(Web.Contents("https://finance.yahoo.com/quote/MSFT/history")),
    
    // Web.Page returns one row per table detected in the document;
    // inspect the preview to find the right index (0 here)
    HistoryTable = Source{0}[Data],
    
    // Clean up the headers
    PromotedHeaders = Table.PromoteHeaders(HistoryTable, [PromoteAllScalars=true]),
    
    // Type the columns appropriately
    TypedTable = Table.TransformColumnTypes(PromotedHeaders, {
        {"Date", type date},
        {"Open", type number},
        {"High", type number},
        {"Low", type number},
        {"Close*", type number},
        {"Adj Close**", type number},
        {"Volume", Int64.Type}
    })
    
in
    TypedTable

Advanced CSS Selector Parsing

For more complex pages, you need granular control over element selection. Power Query doesn't support CSS selectors directly, but you can navigate the parsed HTML structure programmatically. The column names used below (ClassName, TagName, TextContent) are illustrative — inspect the navigator to see how your target page's DOM is actually exposed:

let
    // Load page content
    Source = Web.Page(Web.Contents("https://example.com/data-page")),
    
    // Navigate to the parsed document tree (the row where Source = "Document")
    Body = Source{[Source="Document"]}[Data],
    
    // Function to find elements by class
    FindByClass = (table as table, className as text) =>
        let
            Filtered = Table.SelectRows(table, 
                each try Text.Contains([ClassName], className) otherwise false)
        in
            Filtered,
    
    // Extract data containers
    DataContainers = FindByClass(Body, "data-container"),
    
    // Process each container
    ProcessContainer = (row as record) =>
        let
            // Extract nested data
            NestedData = row[Data],
            
            // Find specific data points
            Values = Table.SelectRows(NestedData, 
                each [TagName] = "span" and Text.Contains([ClassName], "data-value")),
            
            // Extract text content
            ExtractedValues = Table.AddColumn(Values, "Value", 
                each try [TextContent] otherwise "")
        in
            ExtractedValues,
    
    // Apply processing to all containers
    ProcessedData = Table.AddColumn(DataContainers, "ProcessedData", 
        each ProcessContainer(_))
    
in
    ProcessedData

Dynamic Content and JavaScript-Rendered Pages

Power Query's Web.Page() function can't execute JavaScript, which limits its effectiveness on modern single-page applications. However, you can often find the underlying API calls that populate the dynamic content:

let
    // Strategy 1: Find the API endpoint used by the JavaScript
    // Use browser developer tools to identify XHR requests
    
    ApiEndpoint = "https://example.com/api/data-endpoint",
    
    // Strategy 2: Look for JSON data in script tags
    Source = Web.Page(Web.Contents("https://example.com/dynamic-page")),
    
    // Find script tags containing data
    Scripts = Table.SelectRows(Source{[Source="Document"]}[Data], 
        each [TagName] = "script"),
    
    // Extract JSON from script content
    ExtractJson = (scriptContent as text) =>
        let
            // Marker string is page-specific — adjust to match your target
            Marker = "window.initialData = ",
            JsonStart = Text.PositionOf(scriptContent, Marker),
            JsonEnd = Text.PositionOf(scriptContent, ";", Occurrence.Last),
            
            // Guard against the marker being absent before slicing
            JsonText = if JsonStart < 0 or JsonEnd <= JsonStart then null
                else Text.Middle(scriptContent, 
                    JsonStart + Text.Length(Marker), 
                    JsonEnd - JsonStart - Text.Length(Marker)),
            
            ParsedJson = if JsonText = null then null 
                else try Json.Document(JsonText) otherwise null
        in
            ParsedJson,
    
    // Process all scripts to find embedded data
    ProcessedScripts = Table.AddColumn(Scripts, "ParsedData", 
        each ExtractJson([TextContent]))
    
in
    ProcessedScripts

Error Handling and Resilience Patterns

Web data sources are inherently unreliable. Network issues, API changes, and server maintenance create constant failure scenarios. Building resilient pipelines requires sophisticated error handling.

Comprehensive Error Detection

Power Query's try...otherwise construct handles M-level exceptions, but web APIs also signal failure through HTTP status codes, empty responses, and malformed payloads. The ManualStatusHandling option tells Web.Contents not to throw on the listed status codes so you can inspect them yourself:

let
    // Robust API request function
    SafeApiRequest = (url as text, optional options as record) =>
        let
            // Default options
            DefaultOptions = [
                Headers = [#"User-Agent" = "PowerQuery/1.0"],
                Timeout = #duration(0, 0, 2, 0),
                ManualStatusHandling = {400, 401, 403, 404, 429, 500, 502, 503}
            ],
            
            // Merge provided options
            FinalOptions = Record.Combine({DefaultOptions, options ?? []}),
            
            // Make the request
            Response = try Web.Contents(url, FinalOptions) otherwise null,
            
            // Check response validity
            ValidResponse = if Response = null then
                [Success = false, Error = "Network request failed", Data = null]
            else
                let
                    // Get response metadata
                    Metadata = Value.Metadata(Response),
                    StatusCode = Metadata[Response.Status]?,
                    
                    // Evaluate response
                    Result = if StatusCode >= 200 and StatusCode < 300 then
                        [Success = true, Error = null, Data = Response]
                    else if StatusCode = 429 then
                        [Success = false, Error = "Rate limited", Data = null, Retry = true]
                    else if StatusCode >= 500 then
                        [Success = false, Error = "Server error: " & Number.ToText(StatusCode), 
                         Data = null, Retry = true]
                    else
                        [Success = false, Error = "Client error: " & Number.ToText(StatusCode), 
                         Data = null]
                in
                    Result
        in
            ValidResponse,
    
    // Usage with retry logic
    RequestWithRetry = (url as text, maxRetries as number) =>
        let
            AttemptRequest = (attempt as number) =>
                let
                    Response = SafeApiRequest(url),
                    
                    ShouldRetry = Response[Retry]? = true and attempt < maxRetries,
                    
                    Result = if Response[Success] then
                        Response
                    else if ShouldRetry then
                        @AttemptRequest(attempt + 1)
                    else
                        Response
                in
                    Result
        in
            AttemptRequest(1),
    
    // Test the function
    Source = RequestWithRetry("https://api.github.com/repos/microsoft/powerquery-m", 3)
    
in
    Source

Data Validation and Schema Checking

APIs evolve, and schema changes break downstream processes. Implement validation at the point of data ingestion:

let
    // Expected schema definition
    ExpectedSchema = [
        RequiredFields = {"id", "name", "created_at"},
        FieldTypes = [
            id = Int64.Type,
            name = Text.Type,
            created_at = DateTime.Type
        ]
    ],
    
    // Schema validation function
    ValidateSchema = (data as table, schema as record) =>
        let
            // Check required fields exist
            ActualColumns = Table.ColumnNames(data),
            MissingFields = List.Difference(schema[RequiredFields], ActualColumns),
            
            HasMissingFields = List.Count(MissingFields) > 0,
            
            // Validate field types (simplified)
            TypeValidation = if HasMissingFields then
                [Valid = false, Errors = {"Missing fields: " & Text.Combine(MissingFields, ", ")}]
            else
                [Valid = true, Errors = {}],
            
            Result = [
                IsValid = TypeValidation[Valid],
                Errors = TypeValidation[Errors],
                Data = if TypeValidation[Valid] then data else null
            ]
        in
            Result,
    
    // API request with validation
    Source = Json.Document(Web.Contents("https://api.example.com/data")),
    DataTable = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    
    // Expand and validate
    ExpandedData = Table.ExpandRecordColumn(DataTable, "Column1", 
        {"id", "name", "created_at"}),
    
    ValidationResult = ValidateSchema(ExpandedData, ExpectedSchema),
    
    // Return validated data or error
    FinalResult = if ValidationResult[IsValid] then
        ValidationResult[Data]
    else
        error "Schema validation failed: " & Text.Combine(ValidationResult[Errors], "; ")
    
in
    FinalResult

Performance Optimization and Caching Strategies

Web data connections are expensive operations. Optimizing performance requires understanding Power Query's execution model and implementing intelligent caching.

Query Folding and Web Sources

Unlike database connectors, web sources rarely support query folding. Every transformation happens in Power Query's engine, making data volume management critical:

let
    // Inefficient: Download everything, then filter
    Source = Json.Document(Web.Contents("https://api.example.com/large-dataset")),
    AllData = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    FilteredData = Table.SelectRows(AllData, each [status] = "active"),
    
    // Efficient: Filter at the API level
    FilteredUrl = "https://api.example.com/large-dataset?status=active",
    EfficientSource = Json.Document(Web.Contents(FilteredUrl)),
    EfficientData = Table.FromList(EfficientSource, Splitter.SplitByNothing(), null, null, ExtraValues.Error)
    
in
    EfficientData

Incremental Refresh Patterns

For frequently updated web sources, implement incremental refresh to minimize data transfer:

let
    // Get the last refresh timestamp from previous execution
    LastRefresh = try DateTime.From(#"Last Refresh Parameter") otherwise #datetime(2024, 1, 1, 0, 0, 0),
    
    // Convert to API format
    SinceParam = DateTime.ToText(LastRefresh, "yyyy-MM-ddTHH:mm:ss") & "Z",
    
    // Request only new data
    IncrementalUrl = "https://api.example.com/data?since=" & SinceParam,
    
    Source = Json.Document(Web.Contents(IncrementalUrl)),
    NewData = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    
    // Add metadata for next refresh
    CurrentTime = DateTime.UtcNow(),
    DataWithTimestamp = Table.AddColumn(NewData, "RefreshTime", each CurrentTime)
    
in
    DataWithTimestamp

Parallel Processing for Multiple Endpoints

When collecting data from multiple related endpoints, implement parallel processing to improve performance:

let
    // List of endpoints to process
    Endpoints = {
        "https://api.example.com/dataset1",
        "https://api.example.com/dataset2", 
        "https://api.example.com/dataset3",
        "https://api.example.com/dataset4"
    },
    
    // Function to process a single endpoint
    ProcessEndpoint = (url as text) =>
        let
            Source = try Json.Document(Web.Contents(url)) otherwise {},
            Table = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
            WithSource = Table.AddColumn(Table, "SourceUrl", each url)
        in
            WithSource,
    
    // Process all endpoints (Power Query handles parallelization)
    ProcessedEndpoints = List.Transform(Endpoints, ProcessEndpoint),
    
    // Combine results
    CombinedData = Table.Combine(ProcessedEndpoints)
    
in
    CombinedData

Building Production-Ready Web Data Pipelines

Moving from experimentation to production requires addressing enterprise concerns: security, monitoring, governance, and maintainability.

Parameterization and Environment Management

Create flexible pipelines that work across development, staging, and production environments:

let
    // Environment-specific configuration
    Config = [
        Development = [
            BaseUrl = "https://api-dev.example.com",
            RateLimit = 10,
            Timeout = 30
        ],
        Production = [
            BaseUrl = "https://api.example.com", 
            RateLimit = 100,
            Timeout = 60
        ]
    ],
    
    // Get current environment (from parameter)
    Environment = try #"Environment Parameter" otherwise "Development",
    CurrentConfig = Record.Field(Config, Environment),
    
    // Configurable API client
    ApiClient = [
        BaseUrl = CurrentConfig[BaseUrl],
        
        MakeRequest = (endpoint as text, optional params as record) =>
            let
                // Build URL with parameters
                QueryString = if params = null then "" 
                    else "?" & Uri.BuildQueryString(params),
                FullUrl = CurrentConfig[BaseUrl] & endpoint & QueryString,
                
                // Standard headers
                Headers = [
                    #"User-Agent" = "DataPipeline/1.0",
                    #"Accept" = "application/json"
                ],
                
                // Make request with timeout
                Response = Web.Contents(FullUrl, [
                    Headers = Headers,
                    Timeout = #duration(0, 0, 0, CurrentConfig[Timeout])
                ])
            in
                Json.Document(Response)
    ],
    
    // Usage
    Source = ApiClient[MakeRequest]("/data", [limit = "100"])
    
in
    Source

Monitoring and Alerting Integration

Implement monitoring to track API health and data quality:

let
    // Monitoring function
    MonitorApiHealth = (url as text) =>
        let
            StartTime = DateTime.UtcNow(),
            
            // M records evaluate fields lazily, so wrap the request itself in
            // try — wrapping the whole record would let the error escape later
            Attempt = try Json.Document(Web.Contents(url)),
            
            Response = if Attempt[HasError] then [
                Success = false,
                Error = Attempt[Error][Message],
                Data = null,
                ResponseTime = Duration.TotalSeconds(DateTime.UtcNow() - StartTime),
                Timestamp = StartTime
            ] else [
                Success = true,
                Data = Attempt[Value],
                ResponseTime = Duration.TotalSeconds(DateTime.UtcNow() - StartTime),
                Timestamp = StartTime
            ],
            
            // Log metrics (in real implementation, send to monitoring system)
            LogEntry = [
                Url = url,
                Success = Response[Success],
                ResponseTime = Response[ResponseTime],
                Timestamp = Response[Timestamp],
                RecordCount = try List.Count(Response[Data]) otherwise 0
            ]
        in
            LogEntry,
    
    // Monitor multiple endpoints
    EndpointsToMonitor = {
        "https://api.example.com/health",
        "https://api.example.com/data"
    },
    
    HealthChecks = List.Transform(EndpointsToMonitor, MonitorApiHealth),
    HealthTable = Table.FromList(HealthChecks, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    ExpandedHealth = Table.ExpandRecordColumn(HealthTable, "Column1", 
        {"Url", "Success", "ResponseTime", "Timestamp", "RecordCount"})
    
in
    ExpandedHealth

Data Lineage and Documentation

Maintain clear documentation about data sources and transformations:

let
    // Data lineage metadata
    LineageInfo = [
        SourceSystem = "External API",
        ApiVersion = "v1",
        Endpoint = "https://api.example.com/data",
        LastModified = DateTime.UtcNow(),
        Schema = [
            Fields = {"id", "name", "created_at", "status"},
            Types = {"number", "text", "datetime", "text"}
        ],
        Transformations = {
            "Filter active records",
            "Parse datetime fields",
            "Add data quality flags"
        }
    ],
    
    // Main data processing
    Source = Json.Document(Web.Contents("https://api.example.com/data")),
    DataTable = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    
    // Add lineage metadata to table
    WithLineage = Table.AddColumn(DataTable, "DataLineage", each LineageInfo)
    
in
    WithLineage

Advanced Integration Patterns

Enterprise scenarios often require integrating multiple web sources or combining web data with other systems.

Multi-Source Data Fusion

Combine data from multiple APIs with different schemas and update frequencies:

let
    // Source 1: Customer data from CRM API
    CrmData = let
        Source = Json.Document(Web.Contents("https://crm-api.example.com/customers")),
        Table = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
        Expanded = Table.ExpandRecordColumn(Table, "Column1", {"id", "name", "email"}),
        WithSource = Table.AddColumn(Expanded, "Source", each "CRM")
    in WithSource,
    
    // Source 2: Activity data from marketing platform
    MarketingData = let
        Source = Json.Document(Web.Contents("https://marketing-api.example.com/activities")),
        Table = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
        Expanded = Table.ExpandRecordColumn(Table, "Column1", {"customer_id", "activity", "timestamp"}),
        WithSource = Table.AddColumn(Expanded, "Source", each "Marketing")
    in WithSource,
    
    // Standardize schemas
    StandardizedCrm = Table.RenameColumns(CrmData, {{"id", "customer_id"}}),
    
    // Join datasets
    JoinedData = Table.NestedJoin(StandardizedCrm, {"customer_id"}, 
        MarketingData, {"customer_id"}, "MarketingActivities", JoinKind.LeftOuter),
    
    // Expand nested activities
    ExpandedActivities = Table.ExpandTableColumn(JoinedData, "MarketingActivities", 
        {"activity", "timestamp"})
    
in
    ExpandedActivities

Real-Time Data Streaming Simulation

While Power Query doesn't support true streaming, you can simulate real-time updates for dashboards:

let
    // Function to get current data with timestamp
    GetCurrentData = () =>
        let
            CurrentTime = DateTime.UtcNow(),
            
            // Fetch latest data
            Source = Json.Document(Web.Contents("https://api.example.com/live-data")),
            
            // Add timestamp to each record
            WithTimestamp = List.Transform(Source, 
                each Record.AddField(_, "FetchTime", CurrentTime)),
            
            // Convert to table
            Table = Table.FromList(WithTimestamp, Splitter.SplitByNothing(), null, null, ExtraValues.Error)
        in
            Table,
    
    // Get current snapshot
    CurrentData = GetCurrentData(),
    
    // In a real streaming scenario, you might:
    // 1. Store historical data in a separate table
    // 2. Implement change detection
    // 3. Use Power BI's automatic refresh features
    
    FinalResult = CurrentData
    
in
    FinalResult

Hands-On Exercise

Let's build a comprehensive data pipeline that demonstrates advanced web data integration techniques. We'll create a competitive intelligence dashboard that monitors multiple data sources.

Scenario: Your company needs to track competitor pricing, social media sentiment, and market news. You'll build a pipeline that:

  1. Collects pricing data from a competitor's API
  2. Scrapes social media mentions from a news aggregator
  3. Monitors industry news from RSS feeds
  4. Combines all sources into a unified dataset with quality metrics

Step 1: Build the Multi-Source Connector

let
    // Configuration for multiple sources
    Sources = [
        PricingApi = [
            Url = "https://api.competitor.com/pricing",
            Type = "JSON",
            RefreshFrequency = #duration(0, 1, 0, 0), // Every hour
            Headers = [#"Authorization" = "Bearer " & #"API Key Parameter"]
        ],
        
        NewsRss = [
            Url = "https://industry-news.example.com/rss",
            Type = "XML", 
            RefreshFrequency = #duration(0, 0, 15, 0), // Every 15 minutes
            Headers = []
        ],
        
        SocialWeb = [
            Url = "https://social-monitoring.example.com/mentions?company=competitor",
            Type = "HTML",
            RefreshFrequency = #duration(0, 0, 30, 0), // Every 30 minutes  
            Headers = []
        ]
    ],
    
    // Generic source processor
    ProcessSource = (sourceName as text, config as record) =>
        let
            Response = try
                if config[Type] = "JSON" then
                    Json.Document(Web.Contents(config[Url], [Headers = config[Headers]]))
                else if config[Type] = "XML" then
                    Xml.Tables(Web.Contents(config[Url], [Headers = config[Headers]]))
                else if config[Type] = "HTML" then
                    Web.Page(Web.Contents(config[Url], [Headers = config[Headers]]))
                else
                    error "Unsupported source type"
            otherwise null,
            
            // Add metadata
            WithMetadata = [
                Source = sourceName,
                Data = Response,
                FetchTime = DateTime.UtcNow(),
                Success = Response <> null
            ]
        in
            WithMetadata,
    
    // Process all sources
    ProcessedSources = Record.TransformFields(Sources, 
        List.Transform(Record.FieldNames(Sources), 
            each {_, (config) => ProcessSource(_, config)})),
    
    // Extract successful results
    SuccessfulSources = Record.SelectFields(ProcessedSources, 
        List.Select(Record.FieldNames(ProcessedSources), 
            each Record.Field(ProcessedSources, _)[Success] = true))
    
in
    SuccessfulSources

Step 2: Implement Data Quality Scoring

let
    // Data quality assessment function
    AssessDataQuality = (data as any, sourceType as text) =>
        let
            Score = if sourceType = "JSON" then
                let
                    // JSON quality checks
                    RecordCount = try List.Count(data) otherwise 0,
                    HasValidStructure = try data{0}[id] <> null otherwise false,
                    
                    QualityScore = 
                        (if RecordCount > 0 then 30 else 0) +
                        (if HasValidStructure then 40 else 0) +
                        30 // Base score for successful fetch
                in
                    QualityScore
            else if sourceType = "HTML" then
                let
                    // HTML quality checks  
                    HasTables = try List.Count(data{[Name="Tables"]}[Data]) > 0 otherwise false,
                    
                    QualityScore = 
                        (if HasTables then 50 else 0) + 
                        50 // Base score for successful fetch
                in
                    QualityScore
            else
                50, // Default score for other types
                
            QualityGrade = if Score >= 80 then "High"
                else if Score >= 60 then "Medium"  
                else if Score >= 40 then "Low"
                else "Poor"
        in
            [Score = Score, Grade = QualityGrade],
    
    // Apply quality assessment (using previous exercise result)
    PreviousResult = #"Previous Step", // Reference to previous step
    
    WithQuality = Record.TransformFields(PreviousResult,
        List.Transform(Record.FieldNames(PreviousResult),
            each {_, (sourceData) => 
                Record.AddField(sourceData, "Quality", 
                    AssessDataQuality(sourceData[Data], "JSON"))}))
    
in
    WithQuality

Step 3: Create Unified Output Schema

let
    // Transform each source to common schema
    TransformToCommonSchema = (sourceData as record) =>
        let
            SourceName = sourceData[Source],
            
            Transformed = if SourceName = "PricingApi" then
                let
                    Data = sourceData[Data],
                    Table = Table.FromList(Data, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
                    Expanded = Table.ExpandRecordColumn(Table, "Column1", {"product", "price", "date"}),
                    Standardized = Table.AddColumn(Expanded, "Category", each "Pricing")
                in
                    Standardized
            else if SourceName = "NewsRss" then
                let
                    // Process RSS/XML data
                    Data = sourceData[Data],
                    Items = Data{0}[Table],
                    Standardized = Table.AddColumn(Items, "Category", each "News")
                in
                    Standardized
            else
                // Default transformation
                #table({"Category", "Value", "Timestamp"}, 
                    {{"Unknown", "No data", DateTime.UtcNow()}}),
            
            // Add source metadata to every row
            WithMetadata = Table.AddColumn(Transformed, "SourceMetadata", 
                each [
                    Source = SourceName,
                    FetchTime = sourceData[FetchTime],
                    Quality = sourceData[Quality]
                ])
        in
            WithMetadata,
    
    // Apply transformation (using previous result)
    PreviousResult = #"Previous Step",
    
    TransformedTables = Record.TransformFields(PreviousResult,
        List.Transform(Record.FieldNames(PreviousResult),
            each {_, TransformToCommonSchema})),
    
    // Combine all tables
    AllTables = Record.FieldValues(TransformedTables),
    CombinedData = Table.Combine(AllTables)
    
in
    CombinedData

Common Mistakes & Troubleshooting

Authentication Token Expiration

Problem: Queries work initially but fail during scheduled refreshes when OAuth tokens expire.

Solution: Implement token refresh logic and use Power Query parameters for credential management:

let
    // Check token expiration before each request
    CheckTokenValidity = (token as text) =>
        let
            TestUrl = "https://api.example.com/auth/verify",
            TestResponse = try Web.Contents(TestUrl, [
                Headers = [#"Authorization" = "Bearer " & token]
            ]) otherwise null,
            
            IsValid = TestResponse <> null
        in
            IsValid,
    
    CurrentToken = #"OAuth Token Parameter",
    RefreshToken = #"Refresh Token Parameter", 
    
    ValidToken = if CheckTokenValidity(CurrentToken) then
        CurrentToken
    else
        // RefreshAccessToken must be defined separately against your provider's token endpoint
        RefreshAccessToken(RefreshToken),
    
    Source = Json.Document(Web.Contents("https://api.example.com/data", [
        Headers = [#"Authorization" = "Bearer " & ValidToken]
    ]))
in
    Source
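The snippet above calls a RefreshAccessToken helper that isn't defined. A minimal sketch, assuming a standard OAuth 2.0 refresh-token grant — the token endpoint URL, form parameter names, and the access_token response field are assumptions to verify against your provider's documentation:

```
// Hypothetical helper: exchanges a refresh token for a new access token
// via the standard OAuth 2.0 refresh_token grant (endpoint URL is assumed)
RefreshAccessToken = (refreshToken as text) =>
    let
        // Supplying Content makes Web.Contents issue a POST request
        Response = Json.Document(Web.Contents(
            "https://api.example.com/auth/token",
            [
                Headers = [#"Content-Type" = "application/x-www-form-urlencoded"],
                Content = Text.ToBinary(
                    "grant_type=refresh_token&refresh_token=" & refreshToken)
            ])),
        NewToken = Response[access_token]
    in
        NewToken
```

Store the refresh token itself as a Power Query parameter so it never appears in the query text.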

Rate Limit Handling

Problem: APIs return 429 errors during data refresh, especially with pagination.

Solution: Implement exponential backoff and rate limit detection:

let
    // Rate-limited request function
    RateLimitedRequest = (url as text, attempt as number) =>
        let
            MaxAttempts = 5,
            
            // ManualStatusHandling returns 429 responses instead of raising errors,
            // so the status code can be read from the response metadata
            Response = Web.Contents(url, [ManualStatusHandling = {429}]),
            StatusCode = Value.Metadata(Response)[Response.Status],
            
            Result = if StatusCode <> 429 then
                Response
            else if attempt < MaxAttempts then
                let
                    // Exponential backoff: wait 2^attempt seconds before retrying
                    DelaySeconds = Number.Power(2, attempt),
                    
                    // Function.InvokeAfter delays evaluation of the retry
                    RetryResult = Function.InvokeAfter(
                        () => @RateLimitedRequest(url, attempt + 1),
                        #duration(0, 0, 0, DelaySeconds))
                in
                    RetryResult
            else
                error "Request failed after " & Text.From(MaxAttempts) & " attempts"
        in
            Result
in
    RateLimitedRequest("https://api.example.com/data", 1)

Dynamic Schema Changes

Problem: API responses change schema, breaking column expansions and transformations.

Solution: Implement defensive programming with schema flexibility:

let
    // Flexible column expansion
    SafeExpandColumns = (table as table, columnName as text, expectedColumns as list) =>
        let
            // Get sample record to inspect available columns
            SampleRecord = try table{0} otherwise error "Empty table",
            SampleColumn = Record.Field(SampleRecord, columnName),
            
            // Get actual available columns
            // Use the "is" operator for type tests; comparing Value.Type results with = is unreliable
            AvailableColumns = if SampleColumn is record then
                Record.FieldNames(SampleColumn)
            else
                {},
            
            // Only expand columns that exist
            ColumnsToExpand = List.Intersect({expectedColumns, AvailableColumns}),
            
            // Perform expansion if columns are available
            Result = if List.Count(ColumnsToExpand) > 0 then
                Table.ExpandRecordColumn(table, columnName, ColumnsToExpand)
            else
                // Add empty columns for missing expected columns
                Table.AddColumn(table, columnName & "_NoData", each null)
        in
            Result,
    
    Source = Json.Document(Web.Contents("https://api.example.com/data")),
    Table = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    
    // Safely expand with expected columns
    ExpectedColumns = {"id", "name", "created_at", "status"},
    ExpandedTable = SafeExpandColumns(Table, "Column1", ExpectedColumns)
    
in
    ExpandedTable

Web Scraping Reliability Issues

Problem: HTML structure changes break web scraping queries.

Solution: Use multiple fallback strategies and robust element identification:

let
    // Multi-strategy element finder
    FindElementsByMultipleStrategies = (htmlTable as table) =>
        let
            // Strategy 1: Find by exact class name
            Strategy1 = Table.SelectRows(htmlTable, 
                each try [ClassName] = "data-row" otherwise false),
            
            // Strategy 2: Find by partial class match  
            Strategy2 = if Table.RowCount(Strategy1) = 0 then
                Table.SelectRows(htmlTable,
                    each try Text.Contains([ClassName], "data") otherwise false)
            else
                Strategy1,
                
            // Strategy 3: Find by tag name and content pattern
            Strategy3 = if Table.RowCount(Strategy2) = 0 then
                Table.SelectRows(htmlTable,
                    each try [TagName] = "tr" and Text.Contains([TextContent], "$") otherwise false)
            else
                Strategy2,
            
            // Final fallback: get all table rows
            FinalResult = if Table.RowCount(Strategy3) = 0 then
                Table.SelectRows(htmlTable, each [TagName] = "tr")
            else
                Strategy3
        in
            FinalResult,
    
    Source = Web.Page(Web.Contents("https://example.com/data-page")),
    DataTable = Source{[Name="Tables"]}[Data]{0}[Data],
    
    ExtractedData = FindElementsByMultipleStrategies(DataTable)
    
in
    ExtractedData

Summary & Next Steps

You've now mastered the complete spectrum of web data integration with Power Query, from basic API connections to enterprise-grade pipelines with comprehensive error handling and monitoring. The key concepts that distinguish expert-level implementations are:

Architectural thinking: Understanding when to use Web.Contents() versus Web.Page(), how Power Query's execution model affects web requests, and designing for scalability from the start.
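
As a quick reminder of that distinction, here is the same pattern both ways (illustrative URLs):

```
// Web.Contents returns the raw response bytes — pair it with a parser
ApiData = Json.Document(Web.Contents("https://api.example.com/data")),

// Web.Page parses an HTML document and exposes the tables it finds
PageTables = Web.Page(Web.Contents("https://example.com/report"))
```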

Resilience patterns: Building queries that handle the inherent instability of web sources through sophisticated error handling, retry logic, and graceful degradation.

Production readiness: Implementing monitoring, parameterization, and data quality controls that enable reliable operation in enterprise environments.

Integration complexity: Combining multiple web sources with different schemas, authentication methods, and update frequencies into unified datasets.

The techniques you've learned here form the foundation for advanced data integration scenarios. Consider these natural progression paths:

Custom Connector Development: When you're regularly working with specific APIs, building custom Power Query connectors provides better performance, built-in authentication handling, and query folding capabilities.

Real-time Integration Patterns: Explore how Power BI's streaming datasets and Azure Stream Analytics can complement Power Query for true real-time scenarios.

Data Governance and Lineage: Implement comprehensive metadata tracking and automated documentation systems for web-sourced data in enterprise environments.

Advanced Authentication Systems: Deep-dive into enterprise identity providers, certificate-based authentication, and custom authentication schemes for secure web data access.

The competitive advantage in data-driven organizations increasingly comes from automated intelligence gathering. You now have the tools to build systems that transform scattered web data into strategic insights, giving your organization the information edge it needs to succeed.
