Wicked Smart Data
LearnArticlesAbout
Sign InSign Up
LearnArticlesAboutContact
Sign InSign Up
Wicked Smart Data

The go-to platform for professionals who want to master data, automation, and AI — from Excel fundamentals to cutting-edge machine learning.

Platform

  • Learning Paths
  • Articles
  • About
  • Contact

Connect

  • Contact Us
  • RSS Feed

© 2026 Wicked Smart Data. All rights reserved.

Privacy PolicyTerms of Service
All Articles
Hero image for Common Table Expressions (CTEs) for Cleaner SQL

Common Table Expressions (CTEs) for Cleaner SQL

SQL⚡ Practitioner23 min readMar 23, 2026Updated Mar 24, 2026
Table of Contents
  • Prerequisites
  • Understanding CTEs: Your First Refactor
  • CTE Syntax and Structure
  • Replacing Complex Subqueries
  • Recursive CTEs for Hierarchical Data
  • Performance Considerations and Optimization
  • Materialization vs. Inlining
  • Indexing Strategy for CTEs
  • When to Use MATERIALIZED CTEs
  • Performance Anti-Patterns to Avoid
  • Real-World Data Pipeline Pattern
  • Hands-On Exercise
  • Common Mistakes & Troubleshooting

Common Table Expressions (CTEs) for Cleaner SQL

You're staring at a SQL query that spans three screens, has nested subqueries five levels deep, and includes the same complex logic repeated in multiple places. Your colleague asks for a "quick modification" to add another metric, and you realize you'll need to scroll through 150 lines of barely comprehensible code to figure out where to make the change. Sound familiar?

This is exactly the problem Common Table Expressions (CTEs) were designed to solve. CTEs let you break complex queries into readable, modular pieces—think of them as temporary named result sets that exist only for the duration of your query. Instead of cramming everything into nested subqueries, you can define intermediate steps with meaningful names, making your SQL as readable as a well-structured program.

By the end of this lesson, you'll be writing SQL that your future self (and your teammates) will actually understand six months from now.

What you'll learn:

  • How to replace nested subqueries with readable CTE chains
  • When to use recursive CTEs for hierarchical data traversal
  • Performance considerations and optimization strategies for CTEs
  • How to modularize complex analytical queries using multiple CTEs
  • Real-world patterns for data transformation pipelines

Prerequisites

You should be comfortable with:

  • Basic SELECT statements, JOINs, and aggregate functions
  • Subqueries and their limitations
  • Window functions (ROW_NUMBER, RANK, LAG/LEAD)

We'll use PostgreSQL syntax throughout, but CTEs work similarly in SQL Server, MySQL 8.0+, and other modern databases.

Understanding CTEs: Your First Refactor

Let's start with a realistic scenario. You're analyzing customer purchase behavior and need to identify high-value customers who've made repeat purchases in the last quarter. Here's how this might look with traditional subqueries:

-- The nested subquery approach (don't do this)
SELECT 
    customer_id,
    customer_name,
    total_orders,
    total_revenue,
    avg_order_value
FROM (
    SELECT 
        c.customer_id,
        c.customer_name,
        COUNT(o.order_id) as total_orders,
        SUM(o.order_total) as total_revenue,
        AVG(o.order_total) as avg_order_value
    FROM customers c
    INNER JOIN (
        SELECT 
            customer_id,
            order_id,
            order_total,
            order_date
        FROM orders 
        WHERE order_date >= CURRENT_DATE - INTERVAL '3 months'
        AND order_total > 0
    ) o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.customer_name
    HAVING COUNT(o.order_id) >= 2
) customer_summary
WHERE total_revenue > 1000
ORDER BY total_revenue DESC;

This works, but it's hard to follow the logic flow. Now let's refactor using CTEs:

-- The CTE approach (much cleaner)
WITH recent_orders AS (
    SELECT 
        customer_id,
        order_id,
        order_total,
        order_date
    FROM orders 
    WHERE order_date >= CURRENT_DATE - INTERVAL '3 months'
      AND order_total > 0
),

customer_metrics AS (
    SELECT 
        c.customer_id,
        c.customer_name,
        COUNT(ro.order_id) as total_orders,
        SUM(ro.order_total) as total_revenue,
        AVG(ro.order_total) as avg_order_value
    FROM customers c
    INNER JOIN recent_orders ro ON c.customer_id = ro.customer_id
    GROUP BY c.customer_id, c.customer_name
    HAVING COUNT(ro.order_id) >= 2
)

SELECT 
    customer_id,
    customer_name,
    total_orders,
    total_revenue,
    avg_order_value
FROM customer_metrics
WHERE total_revenue > 1000
ORDER BY total_revenue DESC;

The CTE version tells a clear story: first we define what "recent orders" means, then we calculate customer metrics based on those orders, and finally we filter for high-value customers. Each step has a meaningful name and serves a single purpose.

Here's what the output might look like:

customer_id customer_name total_orders total_revenue avg_order_value
1247 TechCorp Inc 8 12450.00 1556.25
3891 DataFlow LLC 5 8750.50 1750.10
2156 CloudBiz Co 12 6890.25 574.19

CTE Syntax and Structure

The basic CTE syntax follows this pattern:

WITH cte_name AS (
    -- Your query here
),
another_cte AS (
    -- This can reference the first CTE
    SELECT * FROM cte_name WHERE condition
)
-- Main query that can reference any CTE
SELECT * FROM another_cte;

Key points about CTE syntax:

  • Start with WITH followed by your CTE name
  • Each CTE definition is wrapped in parentheses
  • Multiple CTEs are separated by commas (no additional WITH keywords)
  • The main query comes after all CTE definitions
  • CTEs can reference previously defined CTEs, but not ones defined later

Let's see this in action with a more complex example involving sales territory analysis:

WITH regional_sales AS (
    -- Step 1: Aggregate sales by region and salesperson
    SELECT 
        s.region,
        s.salesperson_id,
        s.salesperson_name,
        SUM(o.order_total) as total_sales,
        COUNT(DISTINCT o.customer_id) as unique_customers
    FROM salespeople s
    INNER JOIN orders o ON s.salesperson_id = o.salesperson_id
    WHERE o.order_date >= '2024-01-01'
    GROUP BY s.region, s.salesperson_id, s.salesperson_name
),

region_benchmarks AS (
    -- Step 2: Calculate regional averages for comparison
    SELECT 
        region,
        AVG(total_sales) as avg_sales_per_person,
        AVG(unique_customers) as avg_customers_per_person
    FROM regional_sales
    GROUP BY region
),

performance_analysis AS (
    -- Step 3: Compare each salesperson to their regional average
    SELECT 
        rs.region,
        rs.salesperson_name,
        rs.total_sales,
        rs.unique_customers,
        rb.avg_sales_per_person,
        rb.avg_customers_per_person,
        ROUND(
            (rs.total_sales / rb.avg_sales_per_person - 1) * 100, 
            1
        ) as sales_vs_regional_avg_pct,
        ROUND(
            (rs.unique_customers::DECIMAL / rb.avg_customers_per_person - 1) * 100, 
            1
        ) as customers_vs_regional_avg_pct
    FROM regional_sales rs
    INNER JOIN region_benchmarks rb ON rs.region = rb.region
)

-- Final query: Show top and bottom performers
SELECT 
    region,
    salesperson_name,
    total_sales,
    unique_customers,
    sales_vs_regional_avg_pct,
    customers_vs_regional_avg_pct,
    CASE 
        WHEN sales_vs_regional_avg_pct > 20 THEN 'Top Performer'
        WHEN sales_vs_regional_avg_pct < -20 THEN 'Needs Support'
        ELSE 'Average Performer'
    END as performance_category
FROM performance_analysis
ORDER BY region, sales_vs_regional_avg_pct DESC;

This query would produce output like:

region salesperson_name total_sales unique_customers sales_vs_regional_avg_pct performance_category
East Sarah Chen 245000 28 34.2 Top Performer
East Mike Rodriguez 189000 22 3.4 Average Performer
West Jennifer Park 198000 31 12.8 Average Performer

Each CTE serves a specific purpose and builds toward the final analysis, making the complex calculation easy to understand and debug.

Replacing Complex Subqueries

One of the most powerful uses of CTEs is eliminating deeply nested subqueries. Let's work through a real example: finding customers whose purchase patterns have changed significantly between quarters.

Here's the subquery nightmare version:

-- Don't write SQL like this
SELECT 
    customer_id,
    customer_name,
    q1_revenue,
    q2_revenue,
    revenue_change_pct
FROM (
    SELECT 
        c.customer_id,
        c.customer_name,
        q1_stats.revenue as q1_revenue,
        q2_stats.revenue as q2_revenue,
        ROUND(
            ((q2_stats.revenue - q1_stats.revenue) / q1_stats.revenue * 100), 
            1
        ) as revenue_change_pct
    FROM customers c
    INNER JOIN (
        SELECT 
            customer_id,
            SUM(order_total) as revenue
        FROM orders 
        WHERE order_date >= '2024-01-01' 
          AND order_date < '2024-04-01'
        GROUP BY customer_id
        HAVING SUM(order_total) > 500
    ) q1_stats ON c.customer_id = q1_stats.customer_id
    INNER JOIN (
        SELECT 
            customer_id,
            SUM(order_total) as revenue
        FROM orders 
        WHERE order_date >= '2024-04-01' 
          AND order_date < '2024-07-01'
        GROUP BY customer_id
        HAVING SUM(order_total) > 500
    ) q2_stats ON c.customer_id = q2_stats.customer_id
) quarterly_comparison
WHERE ABS(revenue_change_pct) > 25
ORDER BY revenue_change_pct DESC;

Now the CTE version:

-- Much cleaner with CTEs
WITH q1_customer_revenue AS (
    SELECT 
        customer_id,
        SUM(order_total) as revenue
    FROM orders 
    WHERE order_date >= '2024-01-01' 
      AND order_date < '2024-04-01'
    GROUP BY customer_id
    HAVING SUM(order_total) > 500
),

q2_customer_revenue AS (
    SELECT 
        customer_id,
        SUM(order_total) as revenue
    FROM orders 
    WHERE order_date >= '2024-04-01' 
      AND order_date < '2024-07-01'
    GROUP BY customer_id
    HAVING SUM(order_total) > 500
),

quarterly_comparison AS (
    SELECT 
        c.customer_id,
        c.customer_name,
        q1.revenue as q1_revenue,
        q2.revenue as q2_revenue,
        ROUND(
            ((q2.revenue - q1.revenue) / q1.revenue * 100), 
            1
        ) as revenue_change_pct
    FROM customers c
    INNER JOIN q1_customer_revenue q1 ON c.customer_id = q1.customer_id
    INNER JOIN q2_customer_revenue q2 ON c.customer_id = q2.customer_id
)

SELECT 
    customer_id,
    customer_name,
    q1_revenue,
    q2_revenue,
    revenue_change_pct,
    CASE 
        WHEN revenue_change_pct > 25 THEN 'Strong Growth'
        WHEN revenue_change_pct < -25 THEN 'Significant Decline'
    END as trend_category
FROM quarterly_comparison
WHERE ABS(revenue_change_pct) > 25
ORDER BY revenue_change_pct DESC;

The CTE version makes it immediately clear what we're doing:

  1. Calculate Q1 revenue for customers with meaningful spend
  2. Calculate Q2 revenue for the same criteria
  3. Compare the quarters and calculate percentage change
  4. Filter for significant changes and categorize trends

Pro tip: When refactoring complex queries, start by identifying repeated logic or conceptually distinct steps. Each of these becomes a candidate for its own CTE.

Recursive CTEs for Hierarchical Data

Recursive CTEs are perfect for traversing tree-like structures: organizational charts, category hierarchies, or bill-of-materials relationships. The syntax includes a base case and a recursive case that references the CTE itself.

Let's build an employee hierarchy query that shows the complete reporting structure:

-- Sample data structure
CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    employee_name VARCHAR(100),
    manager_id INT,
    department VARCHAR(50),
    salary DECIMAL(10,2)
);

-- Insert some hierarchical data
INSERT INTO employees VALUES
(1, 'Sarah Johnson', NULL, 'Executive', 180000),      -- CEO
(2, 'Mike Chen', 1, 'Engineering', 140000),           -- VP Engineering
(3, 'Lisa Rodriguez', 1, 'Sales', 135000),            -- VP Sales  
(4, 'Tom Wilson', 2, 'Engineering', 110000),          -- Engineering Manager
(5, 'Amy Park', 2, 'Engineering', 105000),            -- Engineering Manager
(6, 'Chris Davis', 3, 'Sales', 95000),                -- Sales Manager
(7, 'Jennifer Liu', 4, 'Engineering', 85000),         -- Senior Developer
(8, 'Mark Thompson', 4, 'Engineering', 80000),        -- Developer
(9, 'Rachel Kim', 5, 'Engineering', 82000),           -- Senior Developer
(10, 'David Brown', 6, 'Sales', 70000);               -- Sales Rep

Now let's use a recursive CTE to show the complete hierarchy with levels:

WITH RECURSIVE employee_hierarchy AS (
    -- Base case: Start with the CEO (no manager)
    SELECT 
        employee_id,
        employee_name,
        manager_id,
        department,
        salary,
        0 as level,
        employee_name as hierarchy_path
    FROM employees 
    WHERE manager_id IS NULL
    
    UNION ALL
    
    -- Recursive case: Find direct reports of employees we've already processed
    SELECT 
        e.employee_id,
        e.employee_name,
        e.manager_id,
        e.department,
        e.salary,
        eh.level + 1,
        eh.hierarchy_path || ' > ' || e.employee_name
    FROM employees e
    INNER JOIN employee_hierarchy eh ON e.manager_id = eh.employee_id
)

SELECT 
    employee_id,
    REPEAT('  ', level) || employee_name as indented_name,
    department,
    salary,
    level,
    hierarchy_path
FROM employee_hierarchy
ORDER BY hierarchy_path;

This produces a clear organizational view:

employee_id indented_name department salary level hierarchy_path
1 Sarah Johnson Executive 180000 0 Sarah Johnson
2 Mike Chen Engineering 140000 1 Sarah Johnson > Mike Chen
4 Tom Wilson Engineering 110000 2 Sarah Johnson > Mike Chen > Tom Wilson
7 Jennifer Liu Engineering 85000 3 Sarah Johnson > Mike Chen > Tom Wilson > Jennifer Liu
8 Mark Thompson Engineering 80000 3 Sarah Johnson > Mike Chen > Tom Wilson > Mark Thompson

The recursive CTE works by:

  1. Base case: Starting with employees who have no manager (the root of the tree)
  2. Recursive case: Finding employees whose manager_id matches an employee_id we've already processed
  3. Termination: The recursion automatically stops when no more matches are found

Here's another practical example—finding all subcategories within a product taxonomy:

WITH RECURSIVE category_tree AS (
    -- Base case: Start with a specific parent category
    SELECT 
        category_id,
        category_name,
        parent_category_id,
        0 as depth,
        category_name as full_path
    FROM product_categories 
    WHERE category_name = 'Electronics'
    
    UNION ALL
    
    -- Recursive case: Find all subcategories
    SELECT 
        pc.category_id,
        pc.category_name,
        pc.parent_category_id,
        ct.depth + 1,
        ct.full_path || ' > ' || pc.category_name
    FROM product_categories pc
    INNER JOIN category_tree ct ON pc.parent_category_id = ct.category_id
),

category_product_counts AS (
    SELECT 
        ct.category_id,
        ct.category_name,
        ct.depth,
        ct.full_path,
        COUNT(p.product_id) as product_count,
        AVG(p.price) as avg_price
    FROM category_tree ct
    LEFT JOIN products p ON ct.category_id = p.category_id
    GROUP BY ct.category_id, ct.category_name, ct.depth, ct.full_path
)

SELECT 
    REPEAT('  ', depth) || category_name as indented_category,
    product_count,
    ROUND(avg_price, 2) as avg_price,
    full_path
FROM category_product_counts
ORDER BY full_path;

Important: Most databases have recursion limits (typically 100-1000 levels) to prevent infinite loops. PostgreSQL defaults to 100, which you can adjust with SET max_recursive_depth = 500; if needed.

Performance Considerations and Optimization

CTEs aren't just about readability—they can significantly impact query performance, both positively and negatively. Understanding when and how to optimize them is crucial for production systems.

Materialization vs. Inlining

Different databases handle CTE execution differently:

PostgreSQL 12+: CTEs are typically inlined (treated like subqueries) unless they're recursive, contain data-modifying statements, or are referenced multiple times. This usually improves performance.

SQL Server: CTEs are always materialized, which can be slower for simple cases but faster when the CTE result is used multiple times.

MySQL 8.0+: Similar to PostgreSQL—inlines when possible.

Let's look at a performance comparison:

-- This CTE will likely be inlined (good performance)
WITH recent_orders AS (
    SELECT customer_id, order_total, order_date
    FROM orders 
    WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
)
SELECT customer_id, COUNT(*) as order_count
FROM recent_orders
GROUP BY customer_id;

-- This CTE will be materialized because it's used twice
WITH customer_totals AS (
    SELECT 
        customer_id, 
        SUM(order_total) as total_spent
    FROM orders 
    GROUP BY customer_id
)
SELECT 
    ct1.customer_id,
    ct1.total_spent,
    ct2.total_spent as comparison_value
FROM customer_totals ct1
CROSS JOIN (SELECT AVG(total_spent) as total_spent FROM customer_totals) ct2;

Indexing Strategy for CTEs

CTEs benefit from the same indexing strategies as regular queries. The key is understanding what the CTE is actually doing:

-- This CTE will benefit from an index on (order_date, customer_id)
WITH recent_high_value_orders AS (
    SELECT 
        customer_id,
        order_id,
        order_total,
        order_date
    FROM orders 
    WHERE order_date >= '2024-01-01'
      AND order_total > 1000
)
-- Query execution continues...
-- Create the supporting index
CREATE INDEX idx_orders_date_customer_total 
ON orders (order_date, customer_id) 
INCLUDE (order_total, order_id);

When to Use MATERIALIZED CTEs

In PostgreSQL, you can force materialization when it's beneficial:

-- Force materialization for expensive calculations used multiple times
WITH MATERIALIZED monthly_aggregates AS (
    SELECT 
        DATE_TRUNC('month', order_date) as month,
        customer_id,
        COUNT(*) as order_count,
        SUM(order_total) as revenue,
        AVG(order_total) as avg_order_value,
        STDDEV(order_total) as revenue_volatility
    FROM orders 
    WHERE order_date >= '2023-01-01'
    GROUP BY DATE_TRUNC('month', order_date), customer_id
)
-- Now use this expensive CTE multiple times without recalculation
SELECT ...;

Performance Anti-Patterns to Avoid

-- DON'T: Chain CTEs unnecessarily when a single query would work
WITH step1 AS (SELECT customer_id FROM orders WHERE order_date > '2024-01-01'),
     step2 AS (SELECT customer_id FROM step1),
     step3 AS (SELECT customer_id FROM step2)
SELECT * FROM step3;

-- DO: Keep it simple when possible
SELECT DISTINCT customer_id 
FROM orders 
WHERE order_date > '2024-01-01';

-- DON'T: Use CTEs for simple filtering
WITH filtered_orders AS (
    SELECT * FROM orders WHERE order_total > 100
)
SELECT customer_id FROM filtered_orders;

-- DO: Filter directly in the main query
SELECT customer_id 
FROM orders 
WHERE order_total > 100;

Real-World Data Pipeline Pattern

Let's build a comprehensive example that demonstrates CTEs in a realistic data transformation pipeline. We're creating a customer segmentation report that requires multiple complex calculations:

WITH order_metrics AS (
    -- Step 1: Calculate basic order statistics per customer
    SELECT 
        customer_id,
        COUNT(DISTINCT order_id) as total_orders,
        SUM(order_total) as total_revenue,
        AVG(order_total) as avg_order_value,
        MIN(order_date) as first_order_date,
        MAX(order_date) as last_order_date,
        STDDEV(order_total) as order_value_volatility
    FROM orders 
    WHERE order_date >= '2023-01-01'
    GROUP BY customer_id
),

customer_lifecycle AS (
    -- Step 2: Add lifecycle metrics
    SELECT 
        om.*,
        CURRENT_DATE - om.last_order_date as days_since_last_order,
        om.last_order_date - om.first_order_date as customer_lifespan_days,
        CASE 
            WHEN CURRENT_DATE - om.last_order_date <= 30 THEN 'Active'
            WHEN CURRENT_DATE - om.last_order_date <= 90 THEN 'At Risk'
            ELSE 'Churned'
        END as lifecycle_status
    FROM order_metrics om
),

purchase_frequency AS (
    -- Step 3: Calculate purchase frequency metrics
    SELECT 
        cl.*,
        CASE 
            WHEN cl.customer_lifespan_days > 0 THEN
                cl.total_orders::DECIMAL / (cl.customer_lifespan_days / 30.0)
            ELSE 0 
        END as orders_per_month,
        CASE 
            WHEN cl.total_orders > 1 THEN
                cl.customer_lifespan_days::DECIMAL / (cl.total_orders - 1)
            ELSE NULL
        END as avg_days_between_orders
    FROM customer_lifecycle cl
),

customer_segments AS (
    -- Step 4: Assign customer segments using RFM-like analysis
    SELECT 
        pf.*,
        c.customer_name,
        c.customer_email,
        -- Recency score (1-5, 5 is most recent)
        CASE 
            WHEN pf.days_since_last_order <= 15 THEN 5
            WHEN pf.days_since_last_order <= 30 THEN 4
            WHEN pf.days_since_last_order <= 60 THEN 3
            WHEN pf.days_since_last_order <= 90 THEN 2
            ELSE 1
        END as recency_score,
        -- Frequency score (1-5, 5 is highest frequency)
        CASE 
            WHEN pf.orders_per_month >= 4 THEN 5
            WHEN pf.orders_per_month >= 2 THEN 4
            WHEN pf.orders_per_month >= 1 THEN 3
            WHEN pf.orders_per_month >= 0.5 THEN 2
            ELSE 1
        END as frequency_score,
        -- Monetary score (1-5, 5 is highest value)
        NTILE(5) OVER (ORDER BY pf.total_revenue) as monetary_score
    FROM purchase_frequency pf
    INNER JOIN customers c ON pf.customer_id = c.customer_id
),

final_segmentation AS (
    -- Step 5: Create final segment labels
    SELECT 
        *,
        CASE 
            WHEN recency_score >= 4 AND frequency_score >= 4 AND monetary_score >= 4 THEN 'Champions'
            WHEN recency_score >= 3 AND frequency_score >= 3 AND monetary_score >= 3 THEN 'Loyal Customers'
            WHEN recency_score >= 4 AND frequency_score <= 2 THEN 'New Customers'
            WHEN recency_score <= 2 AND frequency_score >= 3 AND monetary_score >= 3 THEN 'At Risk'
            WHEN recency_score <= 2 AND frequency_score <= 2 THEN 'Churned'
            WHEN monetary_score >= 4 THEN 'Big Spenders'
            ELSE 'Casual Customers'
        END as customer_segment,
        recency_score + frequency_score + monetary_score as rfm_total_score
    FROM customer_segments
)

-- Final output with segment summary
SELECT 
    customer_segment,
    COUNT(*) as customer_count,
    ROUND(AVG(total_revenue), 2) as avg_revenue_per_customer,
    ROUND(AVG(total_orders), 1) as avg_orders_per_customer,
    ROUND(AVG(avg_order_value), 2) as avg_order_value,
    ROUND(AVG(rfm_total_score), 1) as avg_rfm_score
FROM final_segmentation
GROUP BY customer_segment
ORDER BY avg_revenue_per_customer DESC;

This query produces a comprehensive customer segmentation report:

customer_segment customer_count avg_revenue_per_customer avg_orders_per_customer avg_order_value avg_rfm_score
Champions 127 3247.85 8.3 391.29 13.2
Big Spenders 89 2156.42 4.1 525.83 11.8
Loyal Customers 234 1543.67 6.2 248.98 10.5
At Risk 156 987.34 5.7 173.22 8.3

The beauty of this CTE chain is that each step is:

  • Testable: You can run just the first few CTEs to verify intermediate results
  • Debuggable: If something looks wrong, you know exactly which step to investigate
  • Maintainable: Need to change the segmentation logic? It's all in the final CTE
  • Reusable: Other queries can reference any of these intermediate CTEs

Hands-On Exercise

Now let's put your CTE skills to work with a realistic business scenario. You're analyzing the performance of an e-commerce platform's product recommendation engine.

Scenario: Your company tracks when customers view products and when they actually purchase them. You need to create a report showing:

  1. Which products are frequently viewed but rarely purchased (high bounce rate)
  2. Which products convert viewers to buyers efficiently
  3. Revenue impact of the recommendation engine

Sample Data Setup:

-- You can create these tables to follow along
CREATE TABLE product_views (
    view_id SERIAL PRIMARY KEY,
    customer_id INT,
    product_id INT,
    view_date TIMESTAMP,
    came_from_recommendation BOOLEAN
);

CREATE TABLE products (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(200),
    category VARCHAR(100),
    price DECIMAL(10,2)
);

CREATE TABLE purchases (
    purchase_id SERIAL PRIMARY KEY,
    customer_id INT,
    product_id INT,
    purchase_date TIMESTAMP,
    quantity INT,
    total_amount DECIMAL(10,2)
);

Your Task: Write a CTE-based query that produces a report with these columns:

  • product_name
  • category
  • total_views
  • total_purchases
  • conversion_rate (purchases/views as percentage)
  • recommended_views (views that came from recommendations)
  • organic_views (views that didn't come from recommendations)
  • recommended_conversion_rate
  • organic_conversion_rate
  • total_revenue_generated

Requirements:

  1. Only include products viewed in the last 60 days
  2. Only count purchases that happened within 7 days of a product view
  3. Exclude products with fewer than 10 total views
  4. Show the top 20 products by total revenue

Solution:

WITH recent_views AS (
    -- Step 1: Get all views from the last 60 days
    SELECT 
        pv.customer_id,
        pv.product_id,
        pv.view_date,
        pv.came_from_recommendation,
        p.product_name,
        p.category,
        p.price
    FROM product_views pv
    INNER JOIN products p ON pv.product_id = p.product_id
    WHERE pv.view_date >= CURRENT_DATE - INTERVAL '60 days'
),

relevant_purchases AS (
    -- Step 2: Get purchases that happened within 7 days of a view
    SELECT DISTINCT
        rv.customer_id,
        rv.product_id,
        rv.view_date,
        rv.came_from_recommendation,
        pu.purchase_date,
        pu.quantity,
        pu.total_amount
    FROM recent_views rv
    INNER JOIN purchases pu ON rv.customer_id = pu.customer_id 
                           AND rv.product_id = pu.product_id
    WHERE pu.purchase_date BETWEEN rv.view_date 
                              AND rv.view_date + INTERVAL '7 days'
),

view_metrics AS (
    -- Step 3: Calculate view statistics by product
    SELECT 
        product_id,
        product_name,
        category,
        price,
        COUNT(*) as total_views,
        COUNT(CASE WHEN came_from_recommendation THEN 1 END) as recommended_views,
        COUNT(CASE WHEN NOT came_from_recommendation THEN 1 END) as organic_views
    FROM recent_views
    GROUP BY product_id, product_name, category, price
    HAVING COUNT(*) >= 10  -- Minimum 10 views requirement
),

purchase_metrics AS (
    -- Step 4: Calculate purchase statistics by product
    SELECT 
        product_id,
        COUNT(DISTINCT customer_id || '|' || view_date) as total_purchases,
        COUNT(CASE WHEN came_from_recommendation 
              THEN customer_id || '|' || view_date END) as recommended_purchases,
        COUNT(CASE WHEN NOT came_from_recommendation 
              THEN customer_id || '|' || view_date END) as organic_purchases,
        SUM(total_amount) as total_revenue_generated
    FROM relevant_purchases
    GROUP BY product_id
),

conversion_analysis AS (
    -- Step 5: Combine metrics and calculate conversion rates
    SELECT 
        vm.product_id,
        vm.product_name,
        vm.category,
        vm.total_views,
        COALESCE(pm.total_purchases, 0) as total_purchases,
        vm.recommended_views,
        vm.organic_views,
        COALESCE(pm.recommended_purchases, 0) as recommended_purchases,
        COALESCE(pm.organic_purchases, 0) as organic_purchases,
        COALESCE(pm.total_revenue_generated, 0) as total_revenue_generated,
        -- Calculate conversion rates
        ROUND(
            COALESCE(pm.total_purchases, 0)::DECIMAL / vm.total_views * 100, 
            2
        ) as conversion_rate,
        CASE 
            WHEN vm.recommended_views > 0 THEN
                ROUND(
                    COALESCE(pm.recommended_purchases, 0)::DECIMAL / vm.recommended_views * 100, 
                    2
                )
            ELSE 0 
        END as recommended_conversion_rate,
        CASE 
            WHEN vm.organic_views > 0 THEN
                ROUND(
                    COALESCE(pm.organic_purchases, 0)::DECIMAL / vm.organic_views * 100, 
                    2
                )
            ELSE 0 
        END as organic_conversion_rate
    FROM view_metrics vm
    LEFT JOIN purchase_metrics pm ON vm.product_id = pm.product_id
)

SELECT 
    product_name,
    category,
    total_views,
    total_purchases,
    conversion_rate,
    recommended_views,
    organic_views,
    recommended_conversion_rate,
    organic_conversion_rate,
    total_revenue_generated,
    -- Add insights
    CASE 
        WHEN recommended_conversion_rate > organic_conversion_rate + 5 
        THEN 'Recommendation Engine Helps'
        WHEN organic_conversion_rate > recommended_conversion_rate + 5 
        THEN 'Organic Discovery Better'
        ELSE 'Similar Performance'
    END as recommendation_impact
FROM conversion_analysis
ORDER BY total_revenue_generated DESC
LIMIT 20;

This solution demonstrates several key CTE patterns:

  • Filtering: recent_views applies the 60-day filter
  • Complex Joins: relevant_purchases joins with time-based conditions
  • Aggregation: Multiple CTEs calculate different metrics
  • Conditional Logic: Extensive use of CASE statements for business rules
  • Final Assembly: The last CTE combines everything with calculated insights

Common Mistakes & Troubleshooting

1. Referencing CTEs in the Wrong Order

The Mistake:

-- This will fail!
WITH second_cte AS (
    SELECT * FROM first_cte  -- Error: first_cte doesn't exist yet
),
first_cte AS (
    SELECT customer_id FROM customers
)
SELECT * FROM second_cte;

Why It Fails: CTEs can only reference previously defined CTEs, not ones that come later in the WITH clause.

The Fix:

WITH first_cte AS (
    SELECT customer_id FROM customers
),
second_cte AS (
    SELECT * FROM first_cte  -- Now this works
)
SELECT * FROM second_cte;

2. Forgetting Column Aliases in Recursive CTEs

The Mistake:

WITH RECURSIVE hierarchy AS (
    SELECT employee_id, manager_id FROM employees WHERE manager_id IS NULL
    UNION ALL
    SELECT e.employee_id, e.manager_id, eh.some_column  -- Wrong number of columns!
    FROM employees e
    JOIN hierarchy eh ON e.manager_id = eh.employee_id
)
SELECT * FROM hierarchy;

Why It Fails: The base case and recursive case must return the exact same number of columns with compatible types.

The Fix:

WITH RECURSIVE hierarchy AS (
    SELECT employee_id, manager_id, 0 as level FROM employees WHERE manager_id IS NULL
    UNION ALL
    SELECT e.employee_id, e.manager_id, eh.level + 1
    FROM employees e
    JOIN hierarchy eh ON e.manager_id = eh.employee_id
)
SELECT * FROM hierarchy;

3. Performance Issues with Large CTEs

The Mistake:

-- This CTE processes millions of rows but is only used for a small filter
WITH all_order_details AS (
    SELECT 
        o.*,
        c.customer_name,
        p.product_name,
        -- Lots of expensive calculations
        COMPLEX_FUNCTION(o.order_total) as processed_value
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    JOIN products p ON o.product_id = p.product_id
)
SELECT customer_name 
FROM all_order_details 
WHERE customer_id = 12345;  -- Only need one customer!

Why It's Slow: The CTE processes all data before the filter is applied.

The Fix:

-- Filter early, process less data
WITH target_customer_orders AS (
    SELECT o.*
    FROM orders o
    WHERE o.customer_id = 12345  -- Filter first
),
enriched_orders AS (
    SELECT 
        tco.*,
        c.customer_name,
        p.product_name,
        COMPLEX_FUNCTION(tco.order_total) as processed_value
    FROM target_customer_orders tco
    JOIN customers c ON tco.customer_id = c.customer_id
    JOIN products p ON tco.product_id = p.product_id
)
SELECT customer_name FROM enriched_orders;

4. Infinite Recursion in Recursive CTEs

The Mistake:

-- This creates a cycle: A reports to B, B reports to A
WITH RECURSIVE bad_hierarchy AS (
    SELECT employee_id, manager_id, 1 as level
    FROM employees WHERE employee_id = 1
    UNION ALL
    SELECT e.employee_id, e.manager_id, bh.level + 1
    FROM employees e
    JOIN bad_hierarchy bh ON e.manager_id = bh.employee_id
    -- No termination condition for cycles!
)
SELECT * FROM bad_hierarchy;

Why It Fails: If your data has cycles, the recursion will continue indefinitely until it hits the database's recursion limit.

The Fix:

WITH RECURSIVE safe_hierarchy AS (
    SELECT 
        employee_id, 
        manager_id, 
        1 as level,
        ARRAY[employee_id] as path  -- Track the path to detect cycles
    FROM employees WHERE employee_id = 1
    UNION ALL
    SELECT 
        e.employee_id, 
        e.manager_id, 
        sh.level + 1,
        sh.path || e.employee_id
    FROM employees e
    JOIN safe_hierarchy sh ON e.manager_id = sh.employee_id
    WHERE e.employee_id != ALL(sh.path)  -- Prevent cycles
      AND sh.level < 10  -- Also add a reasonable depth limit
)
SELECT * FROM safe_hierarchy;

5. Column Name Conflicts in Multiple CTEs

The Mistake:

WITH sales_data AS (
    SELECT region, SUM(amount) as total FROM sales GROUP BY region
),
target_data AS (
    SELECT region, target_amount as total FROM targets  -- Same column name!
)
SELECT * FROM sales_data 
JOIN target_data USING (region, total);  -- Ambiguous!

Why It Fails: When joining CTEs with identical column names, the database can't determine which column you mean.

The Fix:

WITH sales_data AS (
    SELECT region, SUM(amount) as sales_total FROM sales GROUP BY region
),
target_data AS (
    SELECT region, target_amount as target_total FROM targets
)
SELECT 
    sd.region,
    sd.sales_total,
    td.target_total,
    ROUND((sd.sales_total / td.target_total * 100), 1) as achievement_pct
FROM sales_data sd
JOIN target_data td ON sd.region = td.region;

Debugging Tip: When a CTE query isn't working, test each CTE individually by selecting from it directly: WITH my_cte AS (...) SELECT * FROM my_cte LIMIT 10; This helps isolate where the problem occurs.

Summary & Next Steps

Common Table Expressions transform SQL from a write-once, debug-forever nightmare into readable, maintainable code. You've learned how to:

  • Replace nested subquery chaos with clear, named intermediate steps that tell a story
  • Use recursive CTEs to traverse hierarchical data like organizational charts and category trees
  • Optimize CTE performance by understanding materialization, indexing needs, and when to filter early
  • Build complex analytical pipelines using multiple CTEs that each serve a single, well-defined purpose
  • Debug and troubleshoot the most common CTE pitfalls that trip up even experienced developers

The key insight is that CTEs aren't just about syntax—they're about thinking differently about SQL structure. Instead of cramming complex logic into monolithic queries, you can break problems into logical steps that mirror how you'd explain the solution to a colleague.

Ready to take your SQL skills further? Here are three natural next steps:

Window Functions with CTEs: Learn how to combine CTEs with advanced window functions like LEAD/LAG, FIRST_VALUE, and custom window frames. This combination is incredibly powerful for time-series analysis and complex ranking problems.

SQL Query Optimization: Dive deeper into execution plans, index strategies, and performance tuning. Understanding how the query planner works with CTEs will help you write faster, more efficient code.

Advanced Recursive Patterns: Explore sophisticated recursive CTE applications like graph traversal, finding shortest paths, and handling complex many-to-many hierarchies. These patterns show up frequently in real-world data modeling scenarios.

Learning Path: Advanced SQL Queries

Previous

Window Functions: RANK, ROW_NUMBER, and LAG

Related Articles

SQL🔥 Expert

Advanced SQL Filtering and Sorting: WHERE, AND, OR, IN, BETWEEN, ORDER BY Mastery

17 min
SQL⚡ Practitioner

Advanced Data Filtering and Sorting: Mastering WHERE Clauses and ORDER BY in SQL

13 min
SQL🌱 Foundation

SQL Filtering and Sorting: Master WHERE, AND, OR, IN, BETWEEN, and ORDER BY

13 min

On this page

  • Prerequisites
  • Understanding CTEs: Your First Refactor
  • CTE Syntax and Structure
  • Replacing Complex Subqueries
  • Recursive CTEs for Hierarchical Data
  • Performance Considerations and Optimization
  • Materialization vs. Inlining
  • Indexing Strategy for CTEs
  • When to Use MATERIALIZED CTEs
  • Performance Anti-Patterns to Avoid
  • Real-World Data Pipeline Pattern
  • 1. Referencing CTEs in the Wrong Order
  • 2. Forgetting Column Aliases in Recursive CTEs
  • 3. Performance Issues with Large CTEs
  • 4. Infinite Recursion in Recursive CTEs
  • 5. Column Name Conflicts in Multiple CTEs
  • Summary & Next Steps
  • Hands-On Exercise
  • Common Mistakes & Troubleshooting
  • 1. Referencing CTEs in the Wrong Order
  • 2. Forgetting Column Aliases in Recursive CTEs
  • 3. Performance Issues with Large CTEs
  • 4. Infinite Recursion in Recursive CTEs
  • 5. Column Name Conflicts in Multiple CTEs
  • Summary & Next Steps