
Imagine you're a data analyst at a growing e-commerce company. Every morning, you run the same sequence of SQL queries to update your sales dashboard: first you clean the raw transaction data, then calculate daily metrics, then aggregate weekly trends, and finally join everything with customer segments. Each step depends on the previous one, and if something fails halfway through, you're manually debugging which tables are stale. Sound familiar?
This is exactly the problem dbt (Data Build Tool) solves. dbt transforms your collection of ad-hoc SQL scripts into a robust, version-controlled data pipeline with automatic dependency management, testing, and documentation. Instead of maintaining a brittle sequence of manual queries, you'll define transformations as models that dbt orchestrates intelligently.
By the end of this lesson, you'll understand how dbt revolutionizes analytics workflows and be able to build your first production-ready data pipeline.
Prerequisites: You should be comfortable writing SQL queries (CTEs, joins, window functions) and have basic familiarity with data warehouse concepts. You'll also need Python installed (dbt is distributed via pip) and access to a data warehouse; this lesson uses BigQuery.
Traditional ETL tools move data from operational systems into warehouses, but they're clunky for the transformation layer where analysts work. dbt occupies the "T" in ELT (Extract, Load, Transform), sitting between your raw data warehouse tables and your analytics tools.
Raw Data → [EL Tools] → Data Warehouse → [dbt] → Analytics-Ready Tables → [BI Tools]
Here's what makes dbt different: it's SQL-first, version-controlled, and treats data transformations like software development. Instead of clicking through a GUI to define transformations, you write SQL files that dbt compiles into efficient queries.
Models are the heart of dbt. A model is simply a SQL file that defines a transformation. When you run dbt, it executes these SQL files in dependency order and materializes the results as tables or views in your warehouse.
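For a first taste, a complete model can be as small as a single SELECT statement saved as a .sql file (the file and table names below are illustrative, not part of this lesson's project):

```sql
-- models/customer_order_counts.sql (hypothetical example)
-- dbt creates a view or table named customer_order_counts from this query
select
    customer_id,
    count(*) as order_count
from raw_data.orders  -- a real project would use a source() reference instead
group by customer_id
```

There is no boilerplate: no CREATE TABLE, no DDL, no orchestration code. dbt wraps the SELECT in the appropriate DDL for you.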
Sources represent your raw data tables. Instead of hardcoding table names in your SQL, you reference sources, which makes your code more maintainable and enables data lineage tracking.
Tests are assertions about your data. dbt can automatically check that primary keys are unique, foreign keys exist, or custom business logic holds true.
Documentation is generated automatically from your code, creating a data catalog that stays in sync with your transformations.
Let's see these concepts in action with a realistic example.
Install dbt for your warehouse (we'll use BigQuery):
pip install dbt-bigquery
Initialize a new project:
dbt init ecommerce_analytics
cd ecommerce_analytics
This creates a project structure:
ecommerce_analytics/
├── dbt_project.yml # Project configuration
├── models/ # Your SQL transformations
├── data/ # Seed CSV files
├── tests/ # Custom data tests
└── macros/ # Reusable SQL snippets
Edit profiles.yml (typically in ~/.dbt/) to connect to your warehouse:
ecommerce_analytics:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: your-gcp-project
      dataset: analytics_dev
      keyfile: path/to/service-account.json
      threads: 4
Test your connection:
dbt debug
You should see "All checks passed!" confirming dbt can connect to your warehouse.
dbt models can be materialized in different ways: as views (the default; cheap to build, computed at query time), tables (precomputed and fast to query), incremental tables (only new records are processed on each run), or ephemeral models (inlined into downstream queries as CTEs).
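As a quick illustration of the less common option, here is a sketch of an ephemeral model (the file name is an assumption; it builds on the stg_orders model defined later in this lesson). dbt never creates a relation for it in the warehouse:

```sql
-- models/intermediate/int_recent_orders.sql (illustrative sketch)
{{ config(materialized='ephemeral') }}

-- This model is compiled into downstream models as a CTE,
-- so no table or view is created for it in the warehouse
select *
from {{ ref('stg_orders') }}
where order_date_clean >= date_sub(current_date(), interval 90 day)
```

Ephemeral models are handy for lightweight, reusable logic you don't want cluttering the warehouse schema.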
First, define your raw data sources in models/sources.yml:
version: 2

sources:
  - name: raw_ecommerce
    description: Raw e-commerce transaction data from our operational database
    tables:
      - name: orders
        description: Individual order records
        columns:
          - name: order_id
            description: Unique identifier for each order
            tests:
              - unique
              - not_null
          - name: customer_id
            description: Reference to customer table
            tests:
              - not_null
          - name: order_date
            description: Date order was placed
          - name: total_amount
            description: Order total in USD
      - name: order_items
        description: Individual items within each order
        columns:
          - name: order_id
            description: Reference to orders table
          - name: product_id
            description: Reference to products table
          - name: quantity
            description: Number of items ordered
          - name: unit_price
            description: Price per unit in USD
      - name: customers
        description: Customer dimension data
        columns:
          - name: customer_id
            description: Unique customer identifier
            tests:
              - unique
              - not_null
          - name: customer_email
            description: Customer email address
          - name: registration_date
            description: Date customer registered
          - name: customer_segment
            description: Customer tier (premium, standard, basic)
Create staging models that clean and standardize your raw data. These models do light transformations and establish consistent naming conventions.
models/staging/stg_orders.sql:
{{ config(materialized='view') }}
select
order_id,
customer_id,
order_date,
total_amount,
-- Standardize date format
date(order_date) as order_date_clean,
-- Extract useful date parts
extract(year from order_date) as order_year,
extract(month from order_date) as order_month,
extract(dayofweek from order_date) as order_day_of_week,
-- Flag weekend orders
case
when extract(dayofweek from order_date) in (1, 7) then true
else false
end as is_weekend_order
from {{ source('raw_ecommerce', 'orders') }}
where
-- Filter out test orders
total_amount > 0
-- Only include orders from the last 3 years
and order_date >= date_sub(current_date(), interval 3 year)
models/staging/stg_order_items.sql:
{{ config(materialized='view') }}
select
order_id,
product_id,
quantity,
unit_price,
-- Calculate line total
quantity * unit_price as line_total,
-- Add row number for each item within an order
row_number() over (partition by order_id order by product_id) as item_sequence
from {{ source('raw_ecommerce', 'order_items') }}
where
quantity > 0
and unit_price > 0
models/staging/stg_customers.sql:
{{ config(materialized='view') }}
select
customer_id,
lower(trim(customer_email)) as customer_email_clean,
registration_date,
coalesce(customer_segment, 'unknown') as customer_segment,
-- Calculate days since registration
date_diff(current_date(), date(registration_date), day) as days_since_registration,
-- Categorize customer tenure
case
when date_diff(current_date(), date(registration_date), day) < 30 then 'new'
when date_diff(current_date(), date(registration_date), day) < 365 then 'established'
else 'veteran'
end as customer_tenure_category
from {{ source('raw_ecommerce', 'customers') }}
where customer_email is not null
Execute your staging models:
dbt run --models staging
You'll see output like:
Found 3 models, 0 tests, 0 snapshots, 0 analyses, 165 macros, 0 operations, 0 seed files, 3 sources
Completed successfully
Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
dbt automatically determined the execution order and created views in your warehouse. Check your database - you should see stg_orders, stg_order_items, and stg_customers views.
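You can also run slices of the dependency graph using dbt's node selection syntax; the `+` graph operators below are standard dbt selectors:

```shell
dbt run --models stg_orders            # just one model
dbt run --models stg_orders+           # stg_orders and everything downstream of it
dbt run --models +sales_daily_summary  # a model and everything upstream of it
```

This becomes essential once projects grow beyond a handful of models, since you rarely want to rebuild everything on each change.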
Now build intermediate models that implement core business logic. These models join staging tables and calculate key metrics.
models/intermediate/int_order_summary.sql:
{{ config(materialized='table') }}
with order_totals as (
select
o.order_id,
o.customer_id,
o.order_date_clean,
o.order_year,
o.order_month,
o.is_weekend_order,
o.total_amount,
-- Calculate item-level aggregations
sum(oi.quantity) as total_items,
count(distinct oi.product_id) as unique_products,
sum(oi.line_total) as calculated_total,
-- Check for total amount discrepancies
abs(o.total_amount - sum(oi.line_total)) as amount_difference
from {{ ref('stg_orders') }} o
left join {{ ref('stg_order_items') }} oi
on o.order_id = oi.order_id
group by 1, 2, 3, 4, 5, 6, 7
),
order_enriched as (
select
*,
-- Categorize order size
case
when total_amount < 50 then 'small'
when total_amount < 200 then 'medium'
when total_amount < 500 then 'large'
else 'xl'
end as order_size_category,
-- Flag potential data quality issues
case
when amount_difference > 0.01 then true
else false
end as has_amount_discrepancy
from order_totals
)
select * from order_enriched
Notice how we use {{ ref('stg_orders') }} instead of hardcoding table names. This creates dependencies that dbt tracks automatically.
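Under the hood, ref() is a Jinja function that compiles to the fully qualified relation name for the current target. With the profile configured earlier, the compiled query dbt sends to BigQuery would look roughly like this (exact quoting varies by adapter):

```sql
-- Compiled SQL, approximately:
select ...
from `your-gcp-project`.`analytics_dev`.`stg_orders` o
left join `your-gcp-project`.`analytics_dev`.`stg_order_items` oi
    on o.order_id = oi.order_id
```

Because the environment is resolved at compile time, the same model code runs unchanged against dev and prod datasets.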
models/intermediate/int_customer_metrics.sql:
{{ config(materialized='table') }}
with customer_order_stats as (
select
c.customer_id,
c.customer_email_clean,
c.customer_segment,
c.customer_tenure_category,
c.days_since_registration,
-- Order statistics
count(o.order_id) as total_orders,
sum(o.total_amount) as total_spent,
avg(o.total_amount) as avg_order_value,
min(o.order_date_clean) as first_order_date,
max(o.order_date_clean) as last_order_date,
-- Calculate days between first and last order
date_diff(max(o.order_date_clean), min(o.order_date_clean), day) as customer_lifespan_days,
-- Recent activity
date_diff(current_date(), max(o.order_date_clean), day) as days_since_last_order
from {{ ref('stg_customers') }} c
left join {{ ref('int_order_summary') }} o
on c.customer_id = o.customer_id
group by 1, 2, 3, 4, 5
),
customer_segments as (
select
*,
-- RFM-style segmentation
case
when total_orders = 0 then 'never_ordered'
when days_since_last_order <= 30 and total_orders >= 5 then 'champion'
when days_since_last_order <= 30 and total_orders >= 2 then 'loyal'
when days_since_last_order <= 90 then 'potential_loyalist'
when days_since_last_order <= 180 then 'at_risk'
else 'churned'
end as behavioral_segment,
-- Spending tier
case
when total_spent >= 1000 then 'high_value'
when total_spent >= 300 then 'medium_value'
when total_spent > 0 then 'low_value'
else 'no_value'
end as value_segment
from customer_order_stats
)
select * from customer_segments
Mart models are business-facing tables optimized for analytics and reporting.
models/marts/sales_daily_summary.sql:
-- Daily sales metrics for executive reporting
-- (model descriptions belong in schema.yml, not in the config() block)
{{ config(materialized='table') }}
select
order_date_clean as report_date,
order_year,
order_month,
-- Volume metrics
count(distinct order_id) as orders_count,
count(distinct customer_id) as unique_customers,
sum(total_items) as items_sold,
-- Revenue metrics
sum(total_amount) as total_revenue,
avg(total_amount) as avg_order_value,
-- Order composition
sum(case when order_size_category = 'small' then 1 else 0 end) as small_orders,
sum(case when order_size_category = 'medium' then 1 else 0 end) as medium_orders,
sum(case when order_size_category = 'large' then 1 else 0 end) as large_orders,
sum(case when order_size_category = 'xl' then 1 else 0 end) as xl_orders,
-- Weekend vs weekday split
sum(case when is_weekend_order then total_amount else 0 end) as weekend_revenue,
sum(case when not is_weekend_order then total_amount else 0 end) as weekday_revenue,
-- Data quality flags
sum(case when has_amount_discrepancy then 1 else 0 end) as orders_with_discrepancies
from {{ ref('int_order_summary') }}
group by 1, 2, 3
order by report_date desc
Run all your models:
dbt run
dbt automatically determines the correct execution order based on dependencies:
Completed successfully
Done. PASS=6 WARN=0 ERROR=0 SKIP=0 TOTAL=6
dbt includes four built-in tests: unique, not_null, accepted_values, and relationships. Add tests to your models in schema.yml files.
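Tests prefixed with dbt_utils. in the examples below come from the dbt_utils package rather than dbt core. Install it by adding a packages.yml at the project root and running dbt deps:

```yaml
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.0.0", "<2.0.0"]
```

Run `dbt deps` once after editing this file; dbt downloads the package into the dbt_packages/ directory.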
models/marts/schema.yml:
version: 2

models:
  - name: sales_daily_summary
    description: Daily aggregated sales metrics
    columns:
      - name: report_date
        description: Date of sales activity
        tests:
          - unique
          - not_null
      - name: orders_count
        description: Number of orders placed
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"
      - name: total_revenue
        description: Total revenue for the day
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"
  - name: int_customer_metrics
    description: Customer-level analytics and segmentation
    columns:
      - name: customer_id
        description: Unique customer identifier
        tests:
          - unique
          - not_null
      - name: behavioral_segment
        description: Customer behavior classification
        tests:
          - accepted_values:
              values: ['never_ordered', 'champion', 'loyal', 'potential_loyalist', 'at_risk', 'churned']
      - name: total_orders
        description: Lifetime order count
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"
Create custom tests for business logic. For example, test that daily revenue matches the sum of individual orders:
tests/assert_daily_revenue_matches_orders.sql:
-- Test that daily revenue aggregation matches sum of individual orders
select
order_date_clean,
abs(
sum(total_amount) -
(select total_revenue from {{ ref('sales_daily_summary') }} s
where s.report_date = o.order_date_clean)
) as revenue_difference
from {{ ref('int_order_summary') }} o
group by 1
having revenue_difference > 0.01
If this query returns any rows, the test fails, indicating a discrepancy in your revenue calculations.
Execute all tests:
dbt test
You'll see results for each test:
Found 6 models, 8 tests, 0 snapshots, 0 analyses, 165 macros, 0 operations, 0 seed files, 3 sources
Completed successfully
Done. PASS=8 WARN=0 ERROR=0 SKIP=0 TOTAL=8
Run tests for specific models:
dbt test --models sales_daily_summary
Pro tip: Set up tests to run automatically in your CI/CD pipeline. Failed tests should block deployments to production, ensuring data quality issues are caught early.
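A minimal sketch of such a pipeline using GitHub Actions (the workflow name, target name, and credential handling are assumptions; adapt them to your CI system and warehouse):

```yaml
# .github/workflows/dbt_ci.yml (illustrative sketch)
name: dbt CI
on: pull_request

jobs:
  dbt-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install dbt-bigquery
      - run: dbt deps
      # dbt build runs models and tests together; any failing test
      # fails the job and blocks the merge
      - run: dbt build --target ci
        env:
          DBT_PROFILES_DIR: ./ci  # assumes a CI-specific profiles.yml in the repo
```

The key command is `dbt build`, which interleaves runs and tests in dependency order so a failing test stops downstream models from building.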
dbt automatically generates documentation from your code comments and schema definitions:
dbt docs generate
dbt docs serve
This creates an interactive website showing model and column descriptions, raw and compiled SQL for every model, your sources and tests, and a lineage graph of the whole project.
Enhance your documentation with descriptions and metadata:
models/marts/schema.yml:
version: 2

models:
  - name: sales_daily_summary
    description: |
      ## Daily Sales Summary

      This table provides executive-level daily sales metrics including:
      - Revenue and volume KPIs
      - Customer acquisition metrics
      - Order size distribution
      - Weekend vs weekday performance

      **Refresh Schedule**: Daily at 6 AM UTC
      **Data Latency**: Previous day's data available by 8 AM UTC
      **Owner**: Analytics Team
    columns:
      - name: report_date
        description: Business date (order placement date in company timezone)
      - name: avg_order_value
        description: |
          Average order value calculated as total_revenue / orders_count.
          Excludes refunds and cancelled orders.
The generated documentation becomes your team's single source of truth about data definitions, lineage, and business context.
The documentation site includes an interactive lineage graph showing how data flows through your models. Click on any model to see its upstream sources, its downstream dependents, and the columns and tests defined on it.
This lineage graph becomes invaluable for impact analysis when changing upstream models.
Let's build a complete analytics pipeline for an e-commerce company that needs to track customer behavior and business performance.
Organize your dbt project using a standard structure:
models/
├── staging/ # Clean, standardized source data
│ ├── _sources.yml
│ ├── stg_orders.sql
│ ├── stg_order_items.sql
│ └── stg_customers.sql
├── intermediate/ # Business logic and complex joins
│ ├── int_order_summary.sql
│ └── int_customer_metrics.sql
├── marts/ # Business-facing analytics tables
│ ├── sales_daily_summary.sql
│ ├── customer_segments.sql
│ └── schema.yml
└── utils/ # Reusable macros and utilities
Build a cohort analysis model to track customer retention:
models/marts/customer_cohorts.sql:
{{ config(materialized='table') }}
with customer_first_orders as (
select
customer_id,
min(order_date_clean) as first_order_date,
date_trunc(min(order_date_clean), month) as cohort_month
from {{ ref('int_order_summary') }}
group by 1
),
monthly_activity as (
select
o.customer_id,
date_trunc(o.order_date_clean, month) as activity_month,
sum(o.total_amount) as monthly_revenue,
count(o.order_id) as monthly_orders
from {{ ref('int_order_summary') }} o
group by 1, 2
),
cohort_sizes as (
select
cohort_month,
-- Cohort size must be computed per cohort month, separately from activity,
-- otherwise retention would be measured against active customers and always be 100%
count(distinct customer_id) as cohort_size
from customer_first_orders
group by 1
),
cohort_data as (
select
c.cohort_month,
a.activity_month,
date_diff(a.activity_month, c.cohort_month, month) as months_since_first_order,
-- Customer counts
count(distinct a.customer_id) as active_customers,
-- Revenue metrics
sum(a.monthly_revenue) as cohort_revenue,
avg(a.monthly_revenue) as avg_revenue_per_customer
from customer_first_orders c
inner join monthly_activity a
on c.customer_id = a.customer_id
and a.activity_month >= c.cohort_month
group by 1, 2, 3
),
cohort_retention as (
select
d.*,
s.cohort_size,
-- Calculate retention rate against the full cohort size
round(100.0 * d.active_customers / s.cohort_size, 2) as retention_rate,
-- Label cohort periods (BigQuery concat requires string arguments, hence the casts)
case
when d.months_since_first_order = 0 then 'Month 0 (Acquisition)'
when d.months_since_first_order = 1 then 'Month 1'
when d.months_since_first_order <= 6 then concat('Month ', cast(d.months_since_first_order as string))
when d.months_since_first_order <= 12 then concat('Month ', cast(d.months_since_first_order as string), ' (Year 1)')
else concat('Month ', cast(d.months_since_first_order as string), ' (Year ', cast(div(d.months_since_first_order, 12) + 1 as string), ')')
end as cohort_period_label
from cohort_data d
inner join cohort_sizes s
on d.cohort_month = s.cohort_month
)
select * from cohort_retention
order by cohort_month desc, months_since_first_order
This model enables powerful retention analysis showing how customer behavior evolves over time by acquisition cohort.
For large datasets, use incremental models that only process new records:
models/marts/order_events_incremental.sql:
{{ config(
materialized='incremental',
unique_key='event_id',
on_schema_change='fail'
) }}
with order_events as (
select
concat(order_id, '_placed') as event_id,
order_id,
customer_id,
order_date_clean as event_date,
'order_placed' as event_type,
total_amount as event_value,
order_size_category,
current_timestamp() as processed_at
from {{ ref('int_order_summary') }}
{% if is_incremental() %}
-- Only process orders from the last 2 days on incremental runs
where order_date_clean >= date_sub(current_date(), interval 2 day)
{% endif %}
)
select * from order_events
The {% if is_incremental() %} block ensures only recent data is processed on subsequent runs, dramatically improving performance for large tables.
Run the incremental model:
dbt run --models order_events_incremental
On the first run, dbt processes all historical data. On subsequent runs, it only processes new records.
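If you later change the transformation logic and need to rebuild the table from scratch, force a full rebuild with the --full-refresh flag:

```shell
dbt run --models order_events_incremental --full-refresh
```

This drops and recreates the table, reprocessing all historical data as if it were the first run.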
Structure your project following dbt's recommended conventions:
Use consistent naming:
Use consistent naming:
- stg_ prefix for staging models
- int_ prefix for intermediate models
- dim_ and fct_ prefixes for dimension and fact tables
- Descriptive names: customer_order_summary, not cos

Choose appropriate materializations:
- Views for staging models (cheap to build, always fresh)
- Tables for intermediate and mart models that are queried frequently
- Incremental for large, append-heavy event data
Take advantage of warehouse-specific features (here, BigQuery partitioning and clustering):
{{ config(
materialized='table',
cluster_by=['order_date', 'customer_segment'],
partition_by={
'field': 'order_date',
'data_type': 'date',
'granularity': 'day'
}
) }}
Use macros for repetitive logic:
macros/calculate_recency.sql:
{% macro calculate_recency(date_column) %}
date_diff(current_date(), {{ date_column }}, day)
{% endmacro %}
Then use it in models: {{ calculate_recency('last_order_date') }} as days_since_last_order
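Because macros are expanded at compile time, the call above is pure text substitution; the compiled SQL is simply:

```sql
-- Compiled output of {{ calculate_recency('last_order_date') }}:
date_diff(current_date(), last_order_date, day) as days_since_last_order
```

If you later need to change the recency definition (say, to a different date function), you edit one macro instead of every model that uses it.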
Structure your Git workflow: develop each change on a feature branch, open a pull request that builds and tests the affected models in a CI environment, and merge to main only after review and passing tests.
Use environment-specific configurations in dbt_project.yml:
models:
  ecommerce_analytics:
    staging:
      +materialized: view
    intermediate:
      +materialized: "{{ 'table' if target.name == 'prod' else 'view' }}"
    marts:
      +materialized: table
Pro tip: Use dbt's --defer flag in development to reference production tables for upstream dependencies you haven't changed, reducing compute costs.
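For example, to build only one changed model while resolving its unchanged parents against production (the artifacts path is an assumption; point --state at a directory containing a production manifest.json):

```shell
dbt run --select int_customer_metrics --defer --state prod-artifacts/
```

With deferral, references to unmodified upstream models compile to the production relations recorded in the state manifest instead of your development schema.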
Build a customer lifetime value (CLV) model that calculates predicted customer value using historical purchase patterns.
Create a model that calculates: historical CLV (total revenue to date per customer), purchase frequency and average order value, a simple predicted CLV (for example, average order value multiplied by order frequency and an expected customer lifespan), and a CLV segment (low, medium, high).
Create models/marts/customer_lifetime_value.sql with this foundation:
{{ config(materialized='table') }}
with customer_purchase_history as (
select
customer_id,
count(order_id) as total_orders,
sum(total_amount) as total_spent,
avg(total_amount) as avg_order_value,
-- Add your calculations here
from {{ ref('int_order_summary') }}
group by 1
),
-- Add more CTEs for your CLV calculations
final as (
select
customer_id,
-- Your final CLV metrics here
from customer_purchase_history
)
select * from final
Your solution should include: a historical_clv column, a predicted_clv column, and a clv_segment classification with values low, medium, and high (these are the columns the tests below validate).
The complete solution should enable business users to identify high-value customers, predict churn risk, and optimize marketing spend.
Add these tests to validate your CLV model:
# models/marts/schema.yml
  - name: customer_lifetime_value
    tests:
      - dbt_utils.expression_is_true:
          expression: "historical_clv >= 0"
      - dbt_utils.expression_is_true:
          expression: "predicted_clv >= 0"
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: clv_segment
        tests:
          - accepted_values:
              values: ['low', 'medium', 'high']
Mistake: Creating models that reference each other directly or indirectly.
-- models/model_a.sql
select * from {{ ref('model_b') }}
-- models/model_b.sql
select * from {{ ref('model_a') }}
Why it fails: dbt cannot determine execution order when models have circular dependencies.
Fix: Restructure your models. Usually this means creating a shared upstream model that both can reference:
-- models/shared_base.sql
select * from {{ source('raw_data', 'transactions') }}
-- models/model_a.sql
select * from {{ ref('shared_base') }} where condition_a
-- models/model_b.sql
select * from {{ ref('shared_base') }} where condition_b
Mistake: Hardcoding table names instead of using source references.
-- Wrong
select * from raw_database.orders
-- Right
select * from {{ source('raw_ecommerce', 'orders') }}
Why it fails: Hard-coded references break when table names change and prevent dbt from tracking lineage.
Fix: Always define sources in _sources.yml files and use {{ source() }} function.
Mistake: Filtering too late in incremental models, after expensive transformations have already scanned the full history.
{{ config(materialized='incremental') }}
with enriched as (
-- Expensive window function computed over the entire source table
select
*,
row_number() over (partition by session_id order by event_timestamp) as event_seq
from {{ source('events', 'page_views') }}
)
select * from enriched
{% if is_incremental() %}
-- Filter applied only after the full history has been scanned
where event_date >= (select max(event_date) from {{ this }})
{% endif %}
Why it fails: The warehouse still scans and transforms all historical data; the incremental filter only trims the final result.
Fix: Push the filter down to the source scan so partition pruning happens before any heavy work:
{{ config(materialized='incremental') }}
with recent as (
select * from {{ source('events', 'page_views') }}
{% if is_incremental() %}
where event_date >= (select max(event_date) from {{ this }})
{% endif %}
)
select
*,
row_number() over (partition by session_id order by event_timestamp) as event_seq
from recent
Mistake: Tests that work in development but fail in production due to data volume or timing issues.
Why it fails: Production data has edge cases and scale that development data lacks.
Fix: Use more robust test logic and sample data appropriately:
-- Instead of testing all data
select customer_id from {{ ref('customers') }}
group by 1 having count(*) > 1
-- Sample and test subset
select customer_id from {{ ref('customers') }}
where created_date >= current_date - 30 -- Only test recent data
group by 1 having count(*) > 1
Mistake: Creating models that process too much data without optimization.
Why it fails: Large aggregations can exceed warehouse memory limits.
Fix: Use incremental materialization, partitioning, or break large models into smaller pieces:
{{ config(
materialized='incremental',
partition_by={'field': 'event_date', 'data_type': 'date'},
cluster_by=['customer_id', 'product_category']
) }}
You've learned how dbt transforms ad-hoc SQL scripts into robust, maintainable data pipelines. The key concepts - models, sources, tests, and documentation - work together to create a development workflow that treats analytics code like software engineering.
The progression from staging through intermediate to mart models creates clear separation of concerns: data cleaning, business logic, and final presentation. Testing ensures data quality, while automatic documentation keeps your team aligned on definitions and lineage.
In production environments, dbt becomes even more powerful with version control, CI/CD integration, and advanced materializations that optimize for your warehouse's specific features.
Next steps to deepen your dbt expertise:
Advanced Testing and Macros - Learn to write custom tests, generic tests, and reusable macros that encapsulate business logic you can apply across models.
dbt Package Ecosystem - Explore community packages like dbt_utils, dbt_expectations, and domain-specific packages that extend dbt's capabilities with pre-built functions and tests.
Production Deployment Patterns - Study environment management, blue/green deployments, and integration with orchestration tools like Airflow or dbt Cloud for automated pipeline execution.
Each builds naturally on the foundation you've established here, moving you toward building production-grade analytics infrastructure that scales with your organization's data needs.