Wicked Smart Data
© 2026 Wicked Smart Data. All rights reserved.
Introduction to dbt (Data Build Tool)

Data Engineering · ⚡ Practitioner · 20 min read · Mar 23, 2026 (updated Mar 24, 2026)
Table of Contents
  • Prerequisites
  • Understanding dbt's Role in the Data Stack
  • Key dbt Concepts
  • Setting Up Your First dbt Project
  • Building Your First Models
  • Building Intermediate and Mart Models
  • Implementing Data Quality Tests
  • Documentation and Lineage
  • Real-World Project: E-commerce Analytics Pipeline
  • Best Practices for Production
  • Hands-On Exercise
  • Common Mistakes & Troubleshooting
  • Summary & Next Steps

Imagine you're a data analyst at a growing e-commerce company. Every morning, you run the same sequence of SQL queries to update your sales dashboard: first you clean the raw transaction data, then calculate daily metrics, then aggregate weekly trends, and finally join everything with customer segments. Each step depends on the previous one, and if something fails halfway through, you're manually debugging which tables are stale. Sound familiar?

This is exactly the problem dbt (Data Build Tool) solves. dbt transforms your collection of ad-hoc SQL scripts into a robust, version-controlled data pipeline with automatic dependency management, testing, and documentation. Instead of maintaining a brittle sequence of manual queries, you'll define transformations as models that dbt orchestrates intelligently.

By the end of this lesson, you'll understand how dbt revolutionizes analytics workflows and be able to build your first production-ready data pipeline.

What you'll learn:

  • How dbt fits into the modern data stack and why it matters for analytics workflows
  • Core dbt concepts: models, sources, tests, and documentation
  • Building and running transformations with dependency management
  • Implementing data quality checks and generating documentation
  • Best practices for structuring dbt projects in production environments

Prerequisites

You should be comfortable writing SQL queries (CTEs, joins, window functions) and have basic familiarity with data warehouse concepts. You'll also need:

  • A data warehouse or database (we'll use BigQuery, but Snowflake, Redshift, or PostgreSQL work similarly)
  • Python 3.8+ installed locally (recent dbt versions no longer support 3.7)
  • Basic command line familiarity
  • A code editor (VS Code recommended)

Understanding dbt's Role in the Data Stack

The Modern Data Stack Evolution

Traditional ETL tools move data from operational systems into warehouses, but they're clunky for the transformation layer where analysts work. dbt occupies the "T" in ELT (Extract, Load, Transform), sitting between your raw data warehouse tables and your analytics tools.

Raw Data → [EL Tools] → Data Warehouse → [dbt] → Analytics-Ready Tables → [BI Tools]

Here's what makes dbt different: it's SQL-first, version-controlled, and treats data transformations like software development. Instead of clicking through a GUI to define transformations, you write SQL files that dbt compiles into efficient queries.

Key dbt Concepts

Models are the heart of dbt. A model is simply a SQL file that defines a transformation. When you run dbt, it executes these SQL files in dependency order and materializes the results as tables or views in your warehouse.

Sources represent your raw data tables. Instead of hardcoding table names in your SQL, you reference sources, which makes your code more maintainable and enables data lineage tracking.

Tests are assertions about your data. dbt can automatically check that primary keys are unique, foreign keys exist, or custom business logic holds true.

Documentation is generated automatically from your code, creating a data catalog that stays in sync with your transformations.

Let's see these concepts in action with a realistic example.

Setting Up Your First dbt Project

Installation and Initialization

Install dbt for your warehouse (we'll use BigQuery):

pip install dbt-bigquery

Initialize a new project:

dbt init ecommerce_analytics
cd ecommerce_analytics

This creates a project structure:

ecommerce_analytics/
├── dbt_project.yml          # Project configuration
├── models/                  # Your SQL transformations
├── seeds/                   # Seed CSV files
├── tests/                   # Custom data tests
└── macros/                  # Reusable SQL snippets

Configuring Your Connection

Edit profiles.yml (typically in ~/.dbt/) to connect to your warehouse:

ecommerce_analytics:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: your-gcp-project
      dataset: analytics_dev
      keyfile: path/to/service-account.json
      threads: 4

Test your connection:

dbt debug

You should see "All checks passed!" confirming dbt can connect to your warehouse.

Building Your First Models

Understanding Model Types

dbt models can be materialized in different ways:

  • View (default): Creates a database view; no storage cost, but the transformation reruns every time it's queried
  • Table: Creates a physical table; faster to query, at the cost of storage and rebuild time
  • Incremental: Processes only new or changed records on each run, efficient for large datasets
  • Ephemeral: Inlined into downstream models as a CTE; nothing is stored in the warehouse
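
A model's materialization can be set inline with a config() block at the top of its SQL file, or per folder in dbt_project.yml; anything not configured falls back to the project default of view. A minimal sketch, using a hypothetical model name:

```sql
-- models/staging/stg_example.sql (hypothetical model)
{{ config(materialized='table') }}

select 1 as id
```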

Creating Source Definitions

First, define your raw data sources in models/staging/_sources.yml:

version: 2

sources:
  - name: raw_ecommerce
    description: Raw e-commerce transaction data from our operational database
    tables:
      - name: orders
        description: Individual order records
        columns:
          - name: order_id
            description: Unique identifier for each order
            tests:
              - unique
              - not_null
          - name: customer_id
            description: Reference to customer table
            tests:
              - not_null
          - name: order_date
            description: Date order was placed
          - name: total_amount
            description: Order total in USD

      - name: order_items
        description: Individual items within each order
        columns:
          - name: order_id
            description: Reference to orders table
          - name: product_id
            description: Reference to products table
          - name: quantity
            description: Number of items ordered
          - name: unit_price
            description: Price per unit in USD

      - name: customers
        description: Customer dimension data
        columns:
          - name: customer_id
            description: Unique customer identifier
            tests:
              - unique
              - not_null
          - name: customer_email
            description: Customer email address
          - name: registration_date
            description: Date customer registered
          - name: customer_segment
            description: Customer tier (premium, standard, basic)

Building Staging Models

Create staging models that clean and standardize your raw data. These models do light transformations and establish consistent naming conventions.

models/staging/stg_orders.sql:

{{ config(materialized='view') }}

select
    order_id,
    customer_id,
    order_date,
    total_amount,
    -- Standardize date format
    date(order_date) as order_date_clean,
    -- Extract useful date parts
    extract(year from order_date) as order_year,
    extract(month from order_date) as order_month,
    extract(dayofweek from order_date) as order_day_of_week,
    -- Flag weekend orders
    case 
        when extract(dayofweek from order_date) in (1, 7) then true 
        else false 
    end as is_weekend_order

from {{ source('raw_ecommerce', 'orders') }}
where 
    -- Filter out test orders
    total_amount > 0
    -- Only include orders from the last 3 years
    and order_date >= date_sub(current_date(), interval 3 year)

models/staging/stg_order_items.sql:

{{ config(materialized='view') }}

select
    order_id,
    product_id,
    quantity,
    unit_price,
    -- Calculate line total
    quantity * unit_price as line_total,
    -- Add row number for each item within an order
    row_number() over (partition by order_id order by product_id) as item_sequence

from {{ source('raw_ecommerce', 'order_items') }}
where 
    quantity > 0 
    and unit_price > 0

models/staging/stg_customers.sql:

{{ config(materialized='view') }}

select
    customer_id,
    lower(trim(customer_email)) as customer_email_clean,
    registration_date,
    coalesce(customer_segment, 'unknown') as customer_segment,
    -- Calculate days since registration
    date_diff(current_date(), date(registration_date), day) as days_since_registration,
    -- Categorize customer tenure
    case 
        when date_diff(current_date(), date(registration_date), day) < 30 then 'new'
        when date_diff(current_date(), date(registration_date), day) < 365 then 'established'
        else 'veteran'
    end as customer_tenure_category

from {{ source('raw_ecommerce', 'customers') }}
where customer_email is not null

Running Your First Models

Execute your staging models:

dbt run --select staging

You'll see output like:

Found 3 models, 0 tests, 0 snapshots, 0 analyses, 165 macros, 0 operations, 0 seed files, 3 sources

Completed successfully
Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

dbt automatically determined the execution order and created views in your warehouse. Check your database: you should see the stg_orders, stg_order_items, and stg_customers views.

Building Intermediate and Mart Models

Creating Business Logic Models

Now build intermediate models that implement core business logic. These models join staging tables and calculate key metrics.

models/intermediate/int_order_summary.sql:

{{ config(materialized='table') }}

with order_totals as (
    select
        o.order_id,
        o.customer_id,
        o.order_date_clean,
        o.order_year,
        o.order_month,
        o.is_weekend_order,
        o.total_amount,
        -- Calculate item-level aggregations
        sum(oi.quantity) as total_items,
        count(distinct oi.product_id) as unique_products,
        sum(oi.line_total) as calculated_total,
        -- Check for total amount discrepancies
        abs(o.total_amount - sum(oi.line_total)) as amount_difference

    from {{ ref('stg_orders') }} o
    left join {{ ref('stg_order_items') }} oi
        on o.order_id = oi.order_id
    group by 1, 2, 3, 4, 5, 6, 7
),

order_enriched as (
    select
        *,
        -- Categorize order size
        case 
            when total_amount < 50 then 'small'
            when total_amount < 200 then 'medium'
            when total_amount < 500 then 'large'
            else 'xl'
        end as order_size_category,
        
        -- Flag potential data quality issues
        case 
            when amount_difference > 0.01 then true 
            else false 
        end as has_amount_discrepancy

    from order_totals
)

select * from order_enriched

Notice how we use {{ ref('stg_orders') }} instead of hardcoding table names. This creates dependencies that dbt tracks automatically.
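
At compile time, dbt resolves each ref() to the fully qualified relation for your target. With the BigQuery profile above, the compiled SQL would look roughly like this (project and dataset names are illustrative):

```sql
-- from {{ ref('stg_orders') }} compiles to something like:
from `your-gcp-project`.`analytics_dev`.`stg_orders`
```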

Building Customer Analytics

models/intermediate/int_customer_metrics.sql:

{{ config(materialized='table') }}

with customer_order_stats as (
    select
        c.customer_id,
        c.customer_email_clean,
        c.customer_segment,
        c.customer_tenure_category,
        c.days_since_registration,
        
        -- Order statistics
        count(o.order_id) as total_orders,
        sum(o.total_amount) as total_spent,
        avg(o.total_amount) as avg_order_value,
        min(o.order_date_clean) as first_order_date,
        max(o.order_date_clean) as last_order_date,
        
        -- Calculate days between first and last order
        date_diff(max(o.order_date_clean), min(o.order_date_clean), day) as customer_lifespan_days,
        
        -- Recent activity
        date_diff(current_date(), max(o.order_date_clean), day) as days_since_last_order

    from {{ ref('stg_customers') }} c
    left join {{ ref('int_order_summary') }} o
        on c.customer_id = o.customer_id
    group by 1, 2, 3, 4, 5
),

customer_segments as (
    select
        *,
        -- RFM-style segmentation
        case 
            when total_orders = 0 then 'never_ordered'
            when days_since_last_order <= 30 and total_orders >= 5 then 'champion'
            when days_since_last_order <= 30 and total_orders >= 2 then 'loyal'
            when days_since_last_order <= 90 then 'potential_loyalist'
            when days_since_last_order <= 180 then 'at_risk'
            else 'churned'
        end as behavioral_segment,
        
        -- Spending tier
        case 
            when total_spent >= 1000 then 'high_value'
            when total_spent >= 300 then 'medium_value'
            when total_spent > 0 then 'low_value'
            else 'no_value'
        end as value_segment

    from customer_order_stats
)

select * from customer_segments

Creating Final Mart Models

Mart models are business-facing tables optimized for analytics and reporting.

models/marts/sales_daily_summary.sql:

{{ config(materialized='table') }}

select
    order_date_clean as report_date,
    order_year,
    order_month,
    
    -- Volume metrics
    count(distinct order_id) as orders_count,
    count(distinct customer_id) as unique_customers,
    sum(total_items) as items_sold,
    
    -- Revenue metrics
    sum(total_amount) as total_revenue,
    avg(total_amount) as avg_order_value,
    
    -- Order composition
    sum(case when order_size_category = 'small' then 1 else 0 end) as small_orders,
    sum(case when order_size_category = 'medium' then 1 else 0 end) as medium_orders,
    sum(case when order_size_category = 'large' then 1 else 0 end) as large_orders,
    sum(case when order_size_category = 'xl' then 1 else 0 end) as xl_orders,
    
    -- Weekend vs weekday split
    sum(case when is_weekend_order then total_amount else 0 end) as weekend_revenue,
    sum(case when not is_weekend_order then total_amount else 0 end) as weekday_revenue,
    
    -- Data quality flags
    sum(case when has_amount_discrepancy then 1 else 0 end) as orders_with_discrepancies

from {{ ref('int_order_summary') }}
group by 1, 2, 3
order by report_date desc

Run all your models:

dbt run

dbt automatically determines the correct execution order based on dependencies:

Completed successfully
Done. PASS=6 WARN=0 ERROR=0 SKIP=0 TOTAL=6

Implementing Data Quality Tests

Built-in Tests

dbt includes four built-in tests: unique, not_null, accepted_values, and relationships. Add tests to your models in schema.yml files.

models/marts/schema.yml:

version: 2

models:
  - name: sales_daily_summary
    description: Daily aggregated sales metrics
    columns:
      - name: report_date
        description: Date of sales activity
        tests:
          - unique
          - not_null
      
      - name: orders_count
        description: Number of orders placed
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"
      
      - name: total_revenue
        description: Total revenue for the day
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"

  - name: int_customer_metrics
    description: Customer-level analytics and segmentation
    columns:
      - name: customer_id
        description: Unique customer identifier
        tests:
          - unique
          - not_null
      
      - name: behavioral_segment
        description: Customer behavior classification
        tests:
          - accepted_values:
              values: ['never_ordered', 'champion', 'loyal', 'potential_loyalist', 'at_risk', 'churned']
      
      - name: total_orders
        description: Lifetime order count
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"
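
Of the four built-in tests, only relationships hasn't appeared yet; it checks referential integrity between models. A sketch of how it could be added to the customer_id column of int_customer_metrics, following the same schema.yml conventions:

```yaml
      - name: customer_id
        tests:
          - relationships:
              to: ref('stg_customers')
              field: customer_id
```

Note that the dbt_utils.expression_is_true tests above are not built in: they require the dbt_utils package, declared in packages.yml and installed with dbt deps before running dbt test.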

Custom Tests

Create custom tests for business logic. For example, test that daily revenue matches the sum of individual orders:

tests/assert_daily_revenue_matches_orders.sql:

-- Test that daily revenue aggregation matches sum of individual orders
select
    order_date_clean,
    abs(
        sum(total_amount) - 
        (select total_revenue from {{ ref('sales_daily_summary') }} s 
         where s.report_date = o.order_date_clean)
    ) as revenue_difference

from {{ ref('int_order_summary') }} o
group by 1
having revenue_difference > 0.01

If this query returns any rows, the test fails, indicating a discrepancy in your revenue calculations.

Running Tests

Execute all tests:

dbt test

You'll see results for each test:

Found 6 models, 8 tests, 0 snapshots, 0 analyses, 165 macros, 0 operations, 0 seed files, 3 sources

Completed successfully
Done. PASS=8 WARN=0 ERROR=0 SKIP=0 TOTAL=8

Run tests for specific models:

dbt test --select sales_daily_summary

Pro tip: Set up tests to run automatically in your CI/CD pipeline. Failed tests should block deployments to production, ensuring data quality issues are caught early.
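
A common pattern for such a pipeline is the single command dbt build, which runs models, tests, seeds, and snapshots together in DAG order, so a failing test stops downstream models from building. A hedged sketch of a CI step (the ci target is an assumption about your profiles.yml):

```shell
dbt deps                 # install packages such as dbt_utils
dbt build --target ci    # build and test in dependency order; exits non-zero on failure
```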

Documentation and Lineage

Generating Documentation

dbt automatically generates documentation from your code comments and schema definitions:

dbt docs generate
dbt docs serve

This creates an interactive website showing:

  • All models, sources, and tests
  • Column descriptions and lineage
  • Data freshness information
  • SQL code for each model

Adding Rich Documentation

Enhance your documentation with descriptions and metadata:

models/marts/schema.yml:

version: 2

models:
  - name: sales_daily_summary
    description: |
      ## Daily Sales Summary
      
      This table provides executive-level daily sales metrics including:
      - Revenue and volume KPIs
      - Customer acquisition metrics  
      - Order size distribution
      - Weekend vs weekday performance
      
      **Refresh Schedule**: Daily at 6 AM UTC
      **Data Latency**: Previous day's data available by 8 AM UTC
      **Owner**: Analytics Team
      
    columns:
      - name: report_date
        description: Business date (order placement date in company timezone)
        
      - name: avg_order_value
        description: |
          Average order value calculated as total_revenue / orders_count.
          Excludes refunds and cancelled orders.

The generated documentation becomes your team's single source of truth about data definitions, lineage, and business context.

Exploring Data Lineage

The documentation site includes an interactive lineage graph showing how data flows through your models. Click on any model to see:

  • Upstream dependencies (what feeds this model)
  • Downstream dependencies (what depends on this model)
  • Column-level lineage for complex transformations

This lineage graph becomes invaluable for impact analysis when changing upstream models.

Real-World Project: E-commerce Analytics Pipeline

Let's build a complete analytics pipeline for an e-commerce company that needs to track customer behavior and business performance.

Project Structure

Organize your dbt project using a standard structure:

models/
├── staging/          # Clean, standardized source data
│   ├── _sources.yml
│   ├── stg_orders.sql
│   ├── stg_order_items.sql
│   └── stg_customers.sql
├── intermediate/     # Business logic and complex joins  
│   ├── int_order_summary.sql
│   └── int_customer_metrics.sql
├── marts/           # Business-facing analytics tables
│   ├── sales_daily_summary.sql
│   ├── customer_segments.sql
│   └── schema.yml
└── utils/           # Reusable macros and utilities

Advanced Model: Customer Cohort Analysis

Build a cohort analysis model to track customer retention:

models/marts/customer_cohorts.sql:

{{ config(materialized='table') }}

with customer_first_orders as (
    select
        customer_id,
        min(order_date_clean) as first_order_date,
        date_trunc(min(order_date_clean), month) as cohort_month
    from {{ ref('int_order_summary') }}
    group by 1
),

monthly_activity as (
    select
        o.customer_id,
        date_trunc(o.order_date_clean, month) as activity_month,
        sum(o.total_amount) as monthly_revenue,
        count(o.order_id) as monthly_orders
    from {{ ref('int_order_summary') }} o
    group by 1, 2
),

cohort_sizes as (
    select
        cohort_month,
        count(distinct customer_id) as cohort_size
    from customer_first_orders
    group by 1
),

cohort_data as (
    select
        c.cohort_month,
        a.activity_month,
        date_diff(a.activity_month, c.cohort_month, month) as months_since_first_order,
        
        -- Customer counts
        count(distinct a.customer_id) as active_customers,
        
        -- Revenue metrics  
        sum(a.monthly_revenue) as cohort_revenue,
        avg(a.monthly_revenue) as avg_revenue_per_customer

    from customer_first_orders c
    inner join monthly_activity a
        on c.customer_id = a.customer_id
        and a.activity_month >= c.cohort_month
    group by 1, 2, 3
),

cohort_retention as (
    select
        d.*,
        s.cohort_size,
        -- Retention rate relative to the full cohort, not just active customers
        round(100.0 * d.active_customers / s.cohort_size, 2) as retention_rate,
        
        -- Label cohort periods (BigQuery's concat needs explicit string casts)
        case 
            when d.months_since_first_order = 0 then 'Month 0 (Acquisition)'
            when d.months_since_first_order <= 6 then concat('Month ', cast(d.months_since_first_order as string))
            when d.months_since_first_order <= 12 then concat('Month ', cast(d.months_since_first_order as string), ' (Year 1)')
            else concat('Month ', cast(d.months_since_first_order as string), ' (Year ', cast(div(d.months_since_first_order, 12) + 1 as string), ')')
        end as cohort_period_label

    from cohort_data d
    inner join cohort_sizes s
        on d.cohort_month = s.cohort_month
)

select * from cohort_retention
order by cohort_month desc, months_since_first_order

This model enables powerful retention analysis showing how customer behavior evolves over time by acquisition cohort.

Adding Incremental Models

For large datasets, use incremental models that only process new records:

models/marts/order_events_incremental.sql:

{{ config(
    materialized='incremental',
    unique_key='event_id',
    on_schema_change='fail'
) }}

with order_events as (
    select
        concat(cast(order_id as string), '_placed') as event_id,
        order_id,
        customer_id,
        order_date_clean as event_date,
        'order_placed' as event_type,
        total_amount as event_value,
        order_size_category,
        current_timestamp() as processed_at
        
    from {{ ref('int_order_summary') }}
    
    {% if is_incremental() %}
        -- Only process orders from the last 2 days on incremental runs
        where order_date_clean >= date_sub(current_date(), interval 2 day)
    {% endif %}
)

select * from order_events

The {% if is_incremental() %} block ensures only recent data is processed on subsequent runs, dramatically improving performance for large tables.

Run the incremental model:

dbt run --select order_events_incremental

On the first run, dbt processes all historical data. On subsequent runs, it only processes new records.
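
If you later change the model's logic or need to backfill, the --full-refresh flag tells dbt to drop and rebuild the incremental table from scratch, ignoring the is_incremental() filter:

```shell
dbt run --select order_events_incremental --full-refresh
```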

Best Practices for Production

Model Organization

Structure your project following dbt's recommended conventions:

  1. Staging models clean and standardize raw data with minimal business logic
  2. Intermediate models implement complex business logic and joins
  3. Mart models present final analytics-ready tables optimized for consumption

Use consistent naming:

  • stg_ prefix for staging models
  • int_ prefix for intermediate models
  • dim_ and fct_ prefixes for dimensional and fact tables
  • Descriptive names: customer_order_summary not cos

Performance Optimization

Choose appropriate materializations:

  • Views for lightweight transformations accessed infrequently
  • Tables for frequently-queried aggregations
  • Incremental for large, append-only datasets

Optimize warehouse-specific features:

{{ config(
    materialized='table',
    cluster_by=['order_date', 'customer_segment'],
    partition_by={
        'field': 'order_date',
        'data_type': 'date',
        'granularity': 'day'
    }
) }}

Use macros for repetitive logic:

macros/calculate_recency.sql:

{% macro calculate_recency(date_column) %}
    date_diff(current_date(), {{ date_column }}, day)
{% endmacro %}

Then use it in models: {{ calculate_recency('last_order_date') }} as days_since_last_order
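
For example, a model built on stg_customers might call the macro like this sketch (the model name is hypothetical; dbt expands the macro into the date_diff expression before sending SQL to the warehouse):

```sql
-- models/intermediate/int_customer_recency.sql (hypothetical)
select
    customer_id,
    {{ calculate_recency('date(registration_date)') }} as days_since_registration
from {{ ref('stg_customers') }}
```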

Version Control and Deployment

Structure your Git workflow:

  • Feature branches for new models or changes
  • Pull requests with required approvals
  • Automated testing on CI/CD pipeline
  • Separate dev/staging/prod environments

Use environment-specific configurations in dbt_project.yml:

models:
  ecommerce_analytics:
    staging:
      +materialized: view
    intermediate:
      +materialized: "{{ 'table' if target.name == 'prod' else 'view' }}"
    marts:
      +materialized: table

Pro tip: Use dbt's --defer flag in development to reference production tables for upstream dependencies you haven't changed, reducing compute costs.
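
--defer works together with --state, which points at the artifacts (notably manifest.json) from a production run; refs to unselected upstream models then resolve to the production relations. A hedged sketch (the artifact path is illustrative):

```shell
dbt run --select int_customer_metrics --defer --state path/to/prod-artifacts
```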

Hands-On Exercise

Build a customer lifetime value (CLV) model that calculates predicted customer value using historical purchase patterns.

Requirements

Create a model that calculates:

  1. Historical CLV (actual total spending)
  2. Predicted CLV using a simple linear regression on purchase frequency
  3. Customer risk scoring based on recency and order patterns
  4. CLV segmentation (low, medium, high value)

Starter Code

Create models/marts/customer_lifetime_value.sql with this foundation:

{{ config(materialized='table') }}

with customer_purchase_history as (
    select
        customer_id,
        count(order_id) as total_orders,
        sum(total_amount) as total_spent,
        avg(total_amount) as avg_order_value,
        -- Add your calculations here
        
    from {{ ref('int_order_summary') }}
    group by 1
),

-- Add more CTEs for your CLV calculations

final as (
    select
        customer_id,
        -- Your final CLV metrics here
        
    from customer_purchase_history
)

select * from final

Solution Approach

Your solution should include:

  1. Purchase frequency calculation: Orders per month since first purchase
  2. Trend analysis: Is spending increasing, decreasing, or stable?
  3. Predicted CLV: Simple projection based on historical patterns
  4. Risk scoring: Combine recency, frequency, and monetary metrics
  5. Segmentation: Group customers into actionable segments

The complete solution should enable business users to identify high-value customers, predict churn risk, and optimize marketing spend.

Testing Your Solution

Add these tests to validate your CLV model:

# models/marts/schema.yml
version: 2

models:
  - name: customer_lifetime_value
    tests:
      - dbt_utils.expression_is_true:
          expression: "historical_clv >= 0"
      - dbt_utils.expression_is_true:
          expression: "predicted_clv >= 0"
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: clv_segment
        tests:
          - accepted_values:
              values: ['low', 'medium', 'high']

Common Mistakes & Troubleshooting

1. Circular Dependencies

Mistake: Creating models that reference each other directly or indirectly.

-- models/model_a.sql
select * from {{ ref('model_b') }}

-- models/model_b.sql  
select * from {{ ref('model_a') }}

Why it fails: dbt cannot determine execution order when models have circular dependencies.

Fix: Restructure your models. Usually this means creating a shared upstream model that both can reference:

-- models/shared_base.sql
select * from {{ source('raw_data', 'transactions') }}

-- models/model_a.sql
select * from {{ ref('shared_base') }} where condition_a

-- models/model_b.sql
select * from {{ ref('shared_base') }} where condition_b

2. Incorrect Source References

Mistake: Hardcoding table names instead of using source references.

-- Wrong
select * from raw_database.orders

-- Right  
select * from {{ source('raw_ecommerce', 'orders') }}

Why it fails: Hard-coded references break when table names change and prevent dbt from tracking lineage.

Fix: Always define sources in _sources.yml files and reference them with the {{ source() }} function.

3. Inefficient Incremental Logic

Mistake: Not filtering source data in incremental models.

{{ config(materialized='incremental') }}

with daily_views as (
    -- This aggregation scans the ENTIRE history on every run
    select event_date, count(*) as page_views
    from {{ source('events', 'page_views') }}
    group by 1
)

select * from daily_views

{% if is_incremental() %}
    -- The filter is applied only after the full scan and aggregation
    where event_date >= (select max(event_date) from {{ this }})
{% endif %}

Why it fails: The expensive work still runs over all historical data; the incremental filter only trims the final result.

Fix: Push the filter down to the source scan so only new rows are read:

select event_date, count(*) as page_views
from {{ source('events', 'page_views') }}
{% if is_incremental() %}
where event_date >= (select max(event_date) from {{ this }})
{% endif %}
group by 1

4. Test Failures in Production

Mistake: Tests that work in development but fail in production due to data volume or timing issues.

Why it fails: Production data has edge cases and scale that development data lacks.

Fix: Use more robust test logic and sample data appropriately:

-- Instead of testing all data
select customer_id from {{ ref('customers') }}
group by 1 having count(*) > 1

-- Sample and test subset
select customer_id from {{ ref('customers') }}
where created_date >= date_sub(current_date(), interval 30 day)  -- Only test recent data
group by 1 having count(*) > 1

5. Memory Issues with Large Models

Mistake: Creating models that process too much data without optimization.

Why it fails: Large aggregations can exceed warehouse memory limits.

Fix: Use incremental materialization, partitioning, or break large models into smaller pieces:

{{ config(
    materialized='incremental',
    partition_by={'field': 'event_date', 'data_type': 'date'},
    cluster_by=['customer_id', 'product_category']
) }}

Summary & Next Steps

You've learned how dbt transforms ad-hoc SQL scripts into robust, maintainable data pipelines. The key concepts - models, sources, tests, and documentation - work together to create a development workflow that treats analytics code like software engineering.

The progression from staging through intermediate to mart models creates clear separation of concerns: data cleaning, business logic, and final presentation. Testing ensures data quality, while automatic documentation keeps your team aligned on definitions and lineage.

In production environments, dbt becomes even more powerful with version control, CI/CD integration, and advanced materializations that optimize for your warehouse's specific features.

Next steps to deepen your dbt expertise:

  1. Advanced Testing and Macros - Learn to write custom tests, generic tests, and reusable macros that encapsulate business logic you can apply across models.

  2. dbt Package Ecosystem - Explore community packages like dbt_utils, dbt_expectations, and domain-specific packages that extend dbt's capabilities with pre-built functions and tests.

  3. Production Deployment Patterns - Study environment management, blue/green deployments, and integration with orchestration tools like Airflow or dbt Cloud for automated pipeline execution.

Each builds naturally on the foundation you've established here, moving you toward building production-grade analytics infrastructure that scales with your organization's data needs.

