
You're staring at a complex business question: "Which customers made their largest purchase in Q4 2023, and how does that purchase compare to their historical average?" Your first instinct might be to export data to Excel or write multiple queries and manually piece together the results. But experienced SQL practitioners know there's a more elegant path—one that keeps all the logic in SQL and makes your analysis both readable and maintainable.
This is where subqueries and Common Table Expressions (CTEs) transform from mere syntax features into powerful analytical weapons. They let you break down complex problems into digestible chunks, create reusable intermediate results, and build sophisticated analyses that would otherwise require multiple round trips to the database or complex application logic.
By mastering these techniques, you'll write queries that colleagues can actually understand, debug analyses faster, and tackle business questions that seemed impossibly complex. More importantly, you'll think differently about data problems—seeing them as a series of logical steps rather than monolithic challenges.
Prerequisites:
You should be comfortable with intermediate SQL concepts including joins, window functions, and aggregate operations. Experience with query execution plans and basic performance tuning will help you understand the optimization discussions, though we'll explain the key concepts as we go.
Before diving into syntax, let's establish how subqueries and CTEs fit into the SQL execution model. This foundation will guide our decisions throughout the lesson.
A subquery is a query nested inside another query. It can appear in SELECT, FROM, WHERE, or HAVING clauses, and the database engine decides how to execute it—sometimes materializing results, sometimes transforming it into joins, sometimes executing it repeatedly.
A CTE is a named temporary result set that exists for the duration of a single query. Think of it as a way to define a view that only lives for one statement. CTEs make complex queries readable by letting you build up results step by step.
Here's a concrete example that illustrates the difference. Suppose we're analyzing customer purchase patterns in an e-commerce database:
-- Schema context
-- customers: customer_id, name, registration_date, tier
-- orders: order_id, customer_id, order_date, total_amount
-- order_items: order_id, product_id, quantity, unit_price
Using a subquery approach:
SELECT
c.name,
c.tier,
(SELECT COUNT(*)
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.order_date >= '2023-01-01') as orders_2023,
(SELECT AVG(total_amount)
FROM orders o
WHERE o.customer_id = c.customer_id) as avg_order_value
FROM customers c
WHERE c.tier = 'premium'
AND EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.total_amount > 1000
);
Using a CTE approach:
WITH customer_stats AS (
SELECT
customer_id,
COUNT(CASE WHEN order_date >= '2023-01-01' THEN 1 END) as orders_2023,
AVG(total_amount) as avg_order_value,
MAX(total_amount) as max_order_value
FROM orders
GROUP BY customer_id
)
SELECT
c.name,
c.tier,
cs.orders_2023,
cs.avg_order_value
FROM customers c
JOIN customer_stats cs ON c.customer_id = cs.customer_id
WHERE c.tier = 'premium'
AND cs.max_order_value > 1000;
The CTE version has several advantages: it calculates statistics once per customer rather than multiple times, it's easier to modify and debug, and it makes the main query's logic clearer. However, the subquery version might perform better in some database systems if only a few customers meet the criteria, since it can short-circuit calculations.
Subqueries come in several flavors, each with distinct performance characteristics and use cases. Let's explore them systematically.
Scalar subqueries return a single value and can appear anywhere an expression is valid. They're particularly useful for calculations that depend on aggregate values from other tables.
-- Find customers whose latest order was above their personal average
SELECT
c.name,
latest_order.order_date,
latest_order.total_amount,
(SELECT AVG(total_amount)
FROM orders o2
WHERE o2.customer_id = c.customer_id) as personal_avg
FROM customers c
JOIN (
SELECT DISTINCT customer_id,
FIRST_VALUE(order_date) OVER (PARTITION BY customer_id ORDER BY order_date DESC) as order_date,
FIRST_VALUE(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date DESC) as total_amount
FROM orders
) latest_order ON c.customer_id = latest_order.customer_id
WHERE latest_order.total_amount > (
SELECT AVG(total_amount)
FROM orders o3
WHERE o3.customer_id = c.customer_id
);
Notice we're calculating the personal average twice—once in SELECT and once in WHERE. This is inefficient, but sometimes necessary when the same calculation appears in different clauses. A CTE would solve this duplication.
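To make the fix concrete, here's a minimal sketch of the CTE rewrite, run against a toy SQLite database via Python's sqlite3 module. The table and sample data are invented, and the sketch assumes each customer's most recent order date is unique:

```python
import sqlite3

# Invented stand-in for the lesson's orders table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer_id INTEGER, order_date TEXT, total_amount REAL);
INSERT INTO orders VALUES
  (1, '2023-01-05', 100), (1, '2023-06-01', 300),   -- latest 300 vs avg 200
  (2, '2023-02-10', 500), (2, '2023-07-15', 100);   -- latest 100 vs avg 300
""")

# The CTE computes each customer's personal average and latest order date
# exactly once, so the aggregate is no longer repeated in SELECT and WHERE.
rows = conn.execute("""
WITH per_customer AS (
    SELECT customer_id,
           AVG(total_amount) AS personal_avg,
           MAX(order_date)   AS latest_date   -- assumes one order per date
    FROM orders
    GROUP BY customer_id
)
SELECT o.customer_id, o.total_amount, pc.personal_avg
FROM orders o
JOIN per_customer pc
  ON o.customer_id = pc.customer_id AND o.order_date = pc.latest_date
WHERE o.total_amount > pc.personal_avg
""").fetchall()
print(rows)  # only customer 1's latest order (300) beats their average (200)
```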
Correlated subqueries reference columns from the outer query. They're powerful but potentially expensive since they execute once for each row in the outer query.
-- Find products that are above average price in their category
SELECT
p.product_name,
p.category,
p.price,
(SELECT AVG(price) FROM products p2 WHERE p2.category = p.category) as category_avg
FROM products p
WHERE p.price > (
SELECT AVG(price)
FROM products p2
WHERE p2.category = p.category
);
Logically, this query runs the category-average calculation twice per product—once in SELECT and once in WHERE. Modern optimizers often transform correlated subqueries like these into window functions or joins, but understanding the logical execution helps you write better queries.
Here's the equivalent using window functions:
WITH product_stats AS (
SELECT
product_name,
category,
price,
AVG(price) OVER (PARTITION BY category) as category_avg
FROM products
)
SELECT
product_name,
category,
price,
category_avg
FROM product_stats
WHERE price > category_avg;
The window function version calculates each category average once, making it more efficient for large datasets.
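Both forms return the same rows. A quick way to convince yourself is to run them side by side on a toy SQLite table via Python's sqlite3 (sample data invented; window functions require SQLite 3.25 or newer):

```python
import sqlite3

# Invented stand-in for the lesson's products table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_name TEXT, category TEXT, price REAL);
INSERT INTO products VALUES
  ('A', 'toys', 10), ('B', 'toys', 30),
  ('C', 'books', 5), ('D', 'books', 15), ('E', 'books', 40);
""")

# Correlated subquery: logically one average per outer row.
correlated = conn.execute("""
SELECT product_name FROM products p
WHERE p.price > (SELECT AVG(price) FROM products p2 WHERE p2.category = p.category)
ORDER BY product_name
""").fetchall()

# Window-function version: one average per category partition.
windowed = conn.execute("""
WITH product_stats AS (
    SELECT product_name, price,
           AVG(price) OVER (PARTITION BY category) AS category_avg
    FROM products
)
SELECT product_name FROM product_stats
WHERE price > category_avg
ORDER BY product_name
""").fetchall()

print(correlated, windowed)  # both list products priced above their category average
```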
EXISTS subqueries are optimized for boolean logic—they stop executing as soon as they find a matching row. This makes them excellent for filtering.
-- Customers who have never made a purchase over $500
SELECT c.name, c.registration_date
FROM customers c
WHERE NOT EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.total_amount > 500
);
-- Customers who made purchases in every quarter of 2023
SELECT c.name
FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id AND EXTRACT(quarter FROM o.order_date) = 1 AND EXTRACT(year FROM o.order_date) = 2023)
AND EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id AND EXTRACT(quarter FROM o.order_date) = 2 AND EXTRACT(year FROM o.order_date) = 2023)
AND EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id AND EXTRACT(quarter FROM o.order_date) = 3 AND EXTRACT(year FROM o.order_date) = 2023)
AND EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id AND EXTRACT(quarter FROM o.order_date) = 4 AND EXTRACT(year FROM o.order_date) = 2023);
The multiple EXISTS approach works but is verbose. A more compact alternative uses a single scalar subquery that counts the distinct quarters in which each customer ordered:
-- Compact version: one scalar subquery counting distinct quarters
SELECT c.name
FROM customers c
WHERE (
SELECT COUNT(DISTINCT EXTRACT(quarter FROM o.order_date))
FROM orders o
WHERE o.customer_id = c.customer_id
AND EXTRACT(year FROM o.order_date) = 2023
) = 4;
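The same distinct-quarter counting idea can be run end to end in SQLite through Python's sqlite3 module. SQLite lacks EXTRACT, so this sketch derives the quarter from strftime('%m') instead (names and data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, name TEXT);
CREATE TABLE orders (customer_id INTEGER, order_date TEXT);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO orders VALUES                      -- Ada buys in all four quarters
  (1, '2023-02-01'), (1, '2023-05-01'), (1, '2023-08-01'), (1, '2023-11-01'),
  (2, '2023-03-01'), (2, '2023-12-01');        -- Grace only in Q1 and Q4
""")

# Same scalar-subquery shape as above: count distinct quarters, compare to 4.
# (month + 2) / 3 maps months 1-12 onto quarters 1-4 via integer division.
rows = conn.execute("""
SELECT c.name
FROM customers c
WHERE (
    SELECT COUNT(DISTINCT (CAST(strftime('%m', o.order_date) AS INTEGER) + 2) / 3)
    FROM orders o
    WHERE o.customer_id = c.customer_id
      AND strftime('%Y', o.order_date) = '2023'
) = 4
""").fetchall()
print(rows)  # only Ada purchased in every quarter of 2023
```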
Subqueries in FROM clauses create derived tables—temporary result sets you can join with other tables. These are often the most readable way to handle multi-step calculations.
-- Monthly customer acquisition and their first purchase patterns
SELECT
acq.registration_month,
acq.new_customers,
fp.avg_days_to_first_purchase,
fp.avg_first_purchase_amount
FROM (
SELECT
DATE_TRUNC('month', registration_date) as registration_month,
COUNT(*) as new_customers
FROM customers
GROUP BY DATE_TRUNC('month', registration_date)
) acq
LEFT JOIN (
SELECT
DATE_TRUNC('month', c.registration_date) as registration_month,
AVG(o.order_date - c.registration_date) as avg_days_to_first_purchase,
AVG(o.total_amount) as avg_first_purchase_amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN (
SELECT customer_id, MIN(order_date) as first_order_date
FROM orders
GROUP BY customer_id
) first_orders ON o.customer_id = first_orders.customer_id
AND o.order_date = first_orders.first_order_date
GROUP BY DATE_TRUNC('month', c.registration_date)
) fp ON acq.registration_month = fp.registration_month
ORDER BY acq.registration_month;
This query has three levels of nesting and multiple derived tables. It works, but it's getting hard to follow. This is where CTEs shine.
CTEs transform the previous query into something much more maintainable:
WITH monthly_acquisitions AS (
SELECT
DATE_TRUNC('month', registration_date) as registration_month,
COUNT(*) as new_customers
FROM customers
GROUP BY DATE_TRUNC('month', registration_date)
),
first_orders AS (
SELECT
customer_id,
MIN(order_date) as first_order_date
FROM orders
GROUP BY customer_id
),
first_purchase_stats AS (
SELECT
DATE_TRUNC('month', c.registration_date) as registration_month,
AVG(o.order_date - c.registration_date) as avg_days_to_first_purchase,
AVG(o.total_amount) as avg_first_purchase_amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN first_orders fo ON o.customer_id = fo.customer_id
AND o.order_date = fo.first_order_date
GROUP BY DATE_TRUNC('month', c.registration_date)
)
SELECT
ma.registration_month,
ma.new_customers,
fps.avg_days_to_first_purchase,
fps.avg_first_purchase_amount
FROM monthly_acquisitions ma
LEFT JOIN first_purchase_stats fps ON ma.registration_month = fps.registration_month
ORDER BY ma.registration_month;
Now each logical step is clearly defined and reusable. You can easily add more analysis by creating additional CTEs that build on these foundations.
CTEs can reference earlier CTEs in the same statement, letting you build complex analyses incrementally:
WITH customer_segments AS (
SELECT
customer_id,
CASE
WHEN COUNT(*) >= 10 THEN 'frequent'
WHEN COUNT(*) >= 5 THEN 'regular'
ELSE 'occasional'
END as segment,
SUM(total_amount) as lifetime_value,
AVG(total_amount) as avg_order_value,
MIN(order_date) as first_order,
MAX(order_date) as last_order
FROM orders
GROUP BY customer_id
),
segment_benchmarks AS (
SELECT
segment,
AVG(lifetime_value) as avg_lifetime_value,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY lifetime_value) as median_lifetime_value,
PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY lifetime_value) as p90_lifetime_value
FROM customer_segments
GROUP BY segment
),
customer_analysis AS (
SELECT
cs.*,
sb.avg_lifetime_value as segment_avg_ltv,
sb.median_lifetime_value as segment_median_ltv,
CASE
WHEN cs.lifetime_value >= sb.p90_lifetime_value THEN 'top_10_percent'
WHEN cs.lifetime_value >= sb.median_lifetime_value THEN 'above_median'
ELSE 'below_median'
END as segment_performance,
(CURRENT_DATE - cs.last_order) as days_since_last_order
FROM customer_segments cs
JOIN segment_benchmarks sb ON cs.segment = sb.segment
)
SELECT
segment,
segment_performance,
COUNT(*) as customer_count,
AVG(lifetime_value) as avg_ltv,
AVG(days_since_last_order) as avg_days_inactive,
COUNT(CASE WHEN days_since_last_order > 90 THEN 1 END) as at_risk_customers
FROM customer_analysis
GROUP BY segment, segment_performance
ORDER BY segment, segment_performance;
This analysis creates customer segments, calculates benchmarks for each segment, classifies customers within their segments, then summarizes the results. Each step builds on the previous ones, making the logic easy to follow and modify.
Performance Tip: Many database systems materialize CTE results, meaning they're calculated once and stored temporarily. This can be more efficient than repeated subqueries, but it also means a materialized CTE is fully computed even if only part of its result is used. Behavior varies by engine and version, so check the execution plan rather than assuming.
Recursive CTEs handle hierarchical or graph-like data by repeatedly applying the same logic until no new rows are generated. They're essential for organizational charts, bill-of-materials, social networks, and similar structures.
-- Build organizational hierarchy from employee table
-- employees: employee_id, name, manager_id, department, salary
WITH RECURSIVE org_hierarchy AS (
-- Base case: top-level managers (no manager_id)
SELECT
employee_id,
name,
manager_id,
department,
salary,
0 as level,
CAST(name AS VARCHAR(1000)) as hierarchy_path
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive case: employees with managers
SELECT
e.employee_id,
e.name,
e.manager_id,
e.department,
e.salary,
oh.level + 1,
CAST(oh.hierarchy_path || ' > ' || e.name AS VARCHAR(1000))
FROM employees e
JOIN org_hierarchy oh ON e.manager_id = oh.employee_id
)
SELECT
level,
REPEAT(' ', level) || name as indented_name,
hierarchy_path,
department,
salary,
COUNT(*) OVER (PARTITION BY level) as peers_at_level
FROM org_hierarchy
ORDER BY hierarchy_path;
This query starts with top-level managers, then repeatedly finds their direct reports, building the complete organizational tree. The hierarchy_path column shows the chain of command, and level indicates how many management layers down each employee sits.
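SQLite also supports WITH RECURSIVE, so the traversal can be exercised from Python's sqlite3 module. This condensed sketch uses an invented four-person org chart:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (employee_id INTEGER, name TEXT, manager_id INTEGER);
INSERT INTO employees VALUES
  (1, 'CEO', NULL),
  (2, 'VP Sales', 1), (3, 'VP Eng', 1),
  (4, 'Engineer', 3);
""")

# Base case seeds the top-level manager; the recursive member repeatedly
# attaches direct reports, extending the hierarchy path one level at a time.
rows = conn.execute("""
WITH RECURSIVE org_hierarchy AS (
    SELECT employee_id, name, 0 AS level, name AS path
    FROM employees WHERE manager_id IS NULL
    UNION ALL
    SELECT e.employee_id, e.name, oh.level + 1, oh.path || ' > ' || e.name
    FROM employees e
    JOIN org_hierarchy oh ON e.manager_id = oh.employee_id
)
SELECT level, path FROM org_hierarchy ORDER BY path
""").fetchall()
for level, path in rows:
    print('  ' * level + path)  # indented chain of command
```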
Recursive CTEs can also solve more complex problems like finding shortest paths or calculating running totals with dependencies:
-- Calculate cumulative impact of product recommendations
-- recommendations: from_product_id, to_product_id, influence_score
-- sales: product_id, sales_amount
WITH RECURSIVE recommendation_impact AS (
-- Base case: direct product sales
SELECT
s.product_id,
s.product_id as source_product_id,
s.sales_amount as direct_impact,
s.sales_amount as total_impact,
0 as depth
FROM sales s
UNION ALL
-- Recursive case: indirect impact through recommendations
SELECT
ri.product_id,
s.product_id as source_product_id,
s.sales_amount * r.influence_score as direct_impact,
ri.total_impact + (s.sales_amount * r.influence_score) as total_impact,
ri.depth + 1
FROM recommendation_impact ri
JOIN recommendations r ON ri.source_product_id = r.from_product_id
JOIN sales s ON r.to_product_id = s.product_id
WHERE ri.depth < 3 -- Prevent infinite recursion
)
SELECT
product_id,
SUM(direct_impact) as total_direct_sales,
SUM(total_impact) as total_influence_impact,
COUNT(DISTINCT source_product_id) as influenced_products
FROM recommendation_impact
GROUP BY product_id
ORDER BY total_influence_impact DESC;
Warning: Always include a termination condition in recursive CTEs. Most databases have recursion limits, but runaway recursion can consume significant resources before hitting those limits.
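The role of the termination condition is easy to see with a tiny counter: the recursive member keeps producing rows only while the WHERE clause holds. This sketch uses SQLite via Python's sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# WHERE n < 5 is the termination condition: once it fails, the recursive
# step yields no new rows and the recursion stops. Without it, this CTE
# would generate rows forever (or until the engine's recursion limit).
rows = conn.execute("""
WITH RECURSIVE counter(n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM counter WHERE n < 5
)
SELECT n FROM counter
""").fetchall()
print(rows)  # [(1,), (2,), (3,), (4,), (5,)]
```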
CTEs work beautifully with window functions, letting you create sophisticated analytical queries that would be nearly impossible with subqueries alone.
-- Customer lifecycle analysis with cohort retention
WITH customer_cohorts AS (
SELECT
customer_id,
DATE_TRUNC('month', MIN(order_date)) as cohort_month
FROM orders
GROUP BY customer_id
),
monthly_activity AS (
SELECT
o.customer_id,
cc.cohort_month,
DATE_TRUNC('month', o.order_date) as activity_month,
SUM(o.total_amount) as monthly_revenue
FROM orders o
JOIN customer_cohorts cc ON o.customer_id = cc.customer_id
GROUP BY o.customer_id, cc.cohort_month, DATE_TRUNC('month', o.order_date)
),
cohort_sizes AS (
SELECT
cohort_month,
COUNT(DISTINCT customer_id) as cohort_size
FROM customer_cohorts
GROUP BY cohort_month
),
retention_data AS (
SELECT
ma.cohort_month,
ma.activity_month,
EXTRACT(year FROM age(ma.activity_month, ma.cohort_month)) * 12 + EXTRACT(month FROM age(ma.activity_month, ma.cohort_month)) as months_since_first,
COUNT(DISTINCT ma.customer_id) as active_customers,
SUM(ma.monthly_revenue) as cohort_revenue
FROM monthly_activity ma
GROUP BY ma.cohort_month, ma.activity_month
)
SELECT
rd.cohort_month,
rd.months_since_first,
cs.cohort_size,
rd.active_customers,
ROUND(100.0 * rd.active_customers / cs.cohort_size, 2) as retention_rate,
rd.cohort_revenue,
SUM(rd.cohort_revenue) OVER (
PARTITION BY rd.cohort_month
ORDER BY rd.months_since_first
) as cumulative_revenue_per_cohort
FROM retention_data rd
JOIN cohort_sizes cs ON rd.cohort_month = cs.cohort_month
ORDER BY rd.cohort_month, rd.months_since_first;
This analysis tracks how customers from each monthly cohort behave over time, showing retention rates and cumulative revenue generation. Each CTE handles one logical step, making the complex analysis manageable.
CTEs can create dynamic pivot tables by building the query structure programmatically:
-- Product performance across different time periods
WITH date_periods AS (
SELECT
'last_30_days' as period,
CURRENT_DATE - INTERVAL '30 days' as start_date,
CURRENT_DATE as end_date
UNION ALL
SELECT
'last_90_days',
CURRENT_DATE - INTERVAL '90 days',
CURRENT_DATE
UNION ALL
SELECT
'last_365_days',
CURRENT_DATE - INTERVAL '365 days',
CURRENT_DATE
),
product_performance AS (
SELECT
p.product_id,
p.product_name,
p.category,
dp.period,
COUNT(DISTINCT o.order_id) as total_orders,
SUM(CASE WHEN o.order_id IS NOT NULL THEN oi.quantity ELSE 0 END) as total_quantity,
SUM(CASE WHEN o.order_id IS NOT NULL THEN oi.quantity * oi.unit_price ELSE 0 END) as total_revenue
FROM products p
CROSS JOIN date_periods dp
LEFT JOIN order_items oi ON p.product_id = oi.product_id
LEFT JOIN orders o ON oi.order_id = o.order_id
AND o.order_date >= dp.start_date
AND o.order_date < dp.end_date
GROUP BY p.product_id, p.product_name, p.category, dp.period
)
SELECT
product_name,
category,
SUM(CASE WHEN period = 'last_30_days' THEN total_revenue ELSE 0 END) as revenue_30d,
SUM(CASE WHEN period = 'last_90_days' THEN total_revenue ELSE 0 END) as revenue_90d,
SUM(CASE WHEN period = 'last_365_days' THEN total_revenue ELSE 0 END) as revenue_365d,
SUM(CASE WHEN period = 'last_30_days' THEN total_orders ELSE 0 END) as orders_30d,
SUM(CASE WHEN period = 'last_90_days' THEN total_orders ELSE 0 END) as orders_90d,
SUM(CASE WHEN period = 'last_365_days' THEN total_orders ELSE 0 END) as orders_365d
FROM product_performance
GROUP BY product_id, product_name, category
ORDER BY revenue_90d DESC;
This pattern lets you analyze the same metrics across different time windows in a single query, making it easy to spot trends and seasonal patterns.
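The conditional-aggregation pivot at the heart of that query can be seen in miniature below, run in SQLite via Python's sqlite3 (table and figures invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (product TEXT, period TEXT, revenue REAL);
INSERT INTO sales VALUES
  ('widget', '30d', 100), ('widget', '90d', 250),
  ('gadget', '30d', 40),  ('gadget', '90d', 90);
""")

# Each CASE expression picks out one period's rows; SUM collapses them into
# a column, turning long-format rows into one wide row per product.
rows = conn.execute("""
SELECT product,
       SUM(CASE WHEN period = '30d' THEN revenue ELSE 0 END) AS revenue_30d,
       SUM(CASE WHEN period = '90d' THEN revenue ELSE 0 END) AS revenue_90d
FROM sales
GROUP BY product
ORDER BY product
""").fetchall()
print(rows)  # one row per product, one column per time window
```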
Understanding when CTEs and subqueries perform well is crucial for production systems. Let's examine the key factors and optimization strategies.
Different databases handle CTEs and subqueries differently. PostgreSQL materialized every CTE before version 12; since version 12 it inlines CTEs that are referenced only once, unless you write AS MATERIALIZED. SQL Server and Oracle are similarly flexible, often inlining simple CTEs as subqueries.
-- This CTE will likely be materialized
WITH expensive_calculation AS (
SELECT
customer_id,
COUNT(*) as order_count,
AVG(total_amount) as avg_amount,
STDDEV(total_amount) as amount_stddev,
-- Complex calculation that benefits from being done once
EXP(SUM(LN(CASE WHEN total_amount > 0 THEN total_amount ELSE 1 END))) as geometric_mean
FROM orders
WHERE order_date >= '2023-01-01'
GROUP BY customer_id
HAVING COUNT(*) >= 5
)
SELECT
c.name,
ec.order_count,
ec.avg_amount,
ec.amount_stddev
FROM customers c
JOIN expensive_calculation ec ON c.customer_id = ec.customer_id
WHERE c.tier IN ('premium', 'gold')
AND ec.avg_amount > 100;
If only 10% of customers are premium/gold, this CTE does unnecessary work. A subquery might perform better:
-- Subquery version that can be optimized with predicate pushdown
SELECT
c.name,
stats.order_count,
stats.avg_amount,
stats.amount_stddev
FROM customers c
JOIN (
SELECT
customer_id,
COUNT(*) as order_count,
AVG(total_amount) as avg_amount,
STDDEV(total_amount) as amount_stddev
FROM orders
WHERE order_date >= '2023-01-01'
GROUP BY customer_id
HAVING COUNT(*) >= 5
) stats ON c.customer_id = stats.customer_id
WHERE c.tier IN ('premium', 'gold')
AND stats.avg_amount > 100;
Modern optimizers might push the customer tier filter down to reduce the work done in the subquery.
CTEs that return large result sets can consume significant memory or spill to disk. Monitor your queries' memory usage, especially in environments with limited resources.
-- This CTE might create a very large intermediate result
WITH daily_product_stats AS (
SELECT
product_id,
order_date,
COUNT(*) as daily_orders,
SUM(quantity) as daily_quantity,
AVG(unit_price) as daily_avg_price,
-- Adding more aggregations increases memory usage
STDDEV(unit_price) as price_stddev,
MIN(unit_price) as min_price,
MAX(unit_price) as max_price
FROM order_items oi
JOIN orders o ON oi.order_id = o.order_id
WHERE o.order_date >= '2020-01-01' -- Large date range
GROUP BY product_id, order_date
)
-- If only using a small subset, consider filtering earlier
SELECT product_id, AVG(daily_orders) as avg_daily_orders
FROM daily_product_stats
WHERE product_id IN (101, 102, 103) -- Only 3 products needed
GROUP BY product_id;
For better performance, filter early:
WITH daily_product_stats AS (
SELECT
product_id,
order_date,
COUNT(*) as daily_orders,
SUM(quantity) as daily_quantity
FROM order_items oi
JOIN orders o ON oi.order_id = o.order_id
WHERE o.order_date >= '2020-01-01'
AND oi.product_id IN (101, 102, 103) -- Filter moved here
GROUP BY product_id, order_date
)
SELECT product_id, AVG(daily_orders) as avg_daily_orders
FROM daily_product_stats
GROUP BY product_id;
Correlated subqueries put heavy pressure on indexes. Make sure the columns referenced in the correlation condition are well-indexed.
-- This query needs indexes on orders(customer_id, order_date)
-- and customers(tier, registration_date)
SELECT
c.customer_id,
c.name,
(SELECT COUNT(*)
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.order_date >= '2023-01-01') as recent_orders
FROM customers c
WHERE c.tier = 'premium'
AND c.registration_date >= '2022-01-01';
Without proper indexes, this query executes a full table scan of orders for each premium customer—potentially millions of operations.
Index Tip: For correlated subqueries, create covering indexes that include all referenced columns. An index on orders(customer_id, order_date, total_amount) would serve this query better than separate indexes on each column.
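Here's a small SQLite sketch of that tip: build a covering index, run the correlated lookup, and ask the planner how it executed. The data is invented, and the exact EXPLAIN QUERY PLAN wording varies by SQLite version:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer_id INTEGER, order_date TEXT, total_amount REAL);
INSERT INTO orders VALUES
  (1, '2023-03-01', 50), (1, '2022-01-01', 20), (2, '2023-04-01', 75);
-- Covering index: correlation column first, then every other column the
-- query touches, so the lookup never has to visit the base table.
CREATE INDEX idx_orders_covering ON orders (customer_id, order_date, total_amount);
""")

count, = conn.execute(
    "SELECT COUNT(*) FROM orders "
    "WHERE customer_id = ? AND order_date >= '2023-01-01'", (1,)
).fetchone()
print(count)  # 1

# Inspect the plan; with the index in place SQLite typically reports a
# search "USING COVERING INDEX" (wording differs across versions).
for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM orders "
        "WHERE customer_id = 1 AND order_date >= '2023-01-01'"):
    print(row[-1])
```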
Let's build a comprehensive customer segmentation analysis using both subqueries and CTEs. We'll analyze a fictional e-commerce database to identify customer segments and their characteristics.
Scenario: You're analyzing customer behavior to design a targeted marketing campaign. You need to segment customers based on their purchasing patterns and identify the most valuable segments.
Database Schema:
-- Create sample tables for the exercise
CREATE TABLE customers (
customer_id INTEGER PRIMARY KEY,
name VARCHAR(100),
email VARCHAR(100),
registration_date DATE,
tier VARCHAR(20)
);
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER,
order_date DATE,
total_amount DECIMAL(10,2),
FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
CREATE TABLE order_items (
order_id INTEGER,
product_id INTEGER,
quantity INTEGER,
unit_price DECIMAL(10,2),
FOREIGN KEY (order_id) REFERENCES orders(order_id)
);
CREATE TABLE products (
product_id INTEGER PRIMARY KEY,
product_name VARCHAR(100),
category VARCHAR(50),
price DECIMAL(10,2)
);
Exercise Tasks:
1. Customer Lifetime Value Analysis: Create a query using CTEs that calculates each customer's lifetime value, average order value, and purchase frequency. Segment customers into quintiles based on lifetime value.
2. Product Affinity Analysis: Use correlated subqueries to find products that are frequently bought together. For each product, find the top 3 products most commonly purchased in the same order.
3. Churn Prediction Features: Build a comprehensive analysis using multiple CTEs that calculates features for churn prediction: days since last order, purchase frequency trend, seasonal purchasing patterns, and average order value trend.
Solution Approach:
-- Task 1: Customer Lifetime Value Analysis
WITH customer_metrics AS (
SELECT
c.customer_id,
c.name,
c.registration_date,
c.tier,
COUNT(DISTINCT o.order_id) as total_orders,
SUM(o.total_amount) as lifetime_value,
AVG(o.total_amount) as avg_order_value,
MIN(o.order_date) as first_order_date,
MAX(o.order_date) as last_order_date,
(MAX(o.order_date) - MIN(o.order_date)) as customer_lifetime_days
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name, c.registration_date, c.tier
),
ltv_quintiles AS (
SELECT
*,
NTILE(5) OVER (ORDER BY lifetime_value) as ltv_quintile,
CASE
WHEN total_orders = 0 THEN 0
ELSE lifetime_value / NULLIF(customer_lifetime_days, 0) * 365
END as annualized_value
FROM customer_metrics
)
SELECT
ltv_quintile,
COUNT(*) as customer_count,
AVG(lifetime_value) as avg_lifetime_value,
AVG(total_orders) as avg_total_orders,
AVG(avg_order_value) as avg_order_value,
AVG(annualized_value) as avg_annualized_value,
MIN(lifetime_value) as min_ltv_in_quintile,
MAX(lifetime_value) as max_ltv_in_quintile
FROM ltv_quintiles
WHERE total_orders > 0 -- Exclude customers who never purchased
GROUP BY ltv_quintile
ORDER BY ltv_quintile;
-- Task 2: Product Affinity Analysis
SELECT
p1.product_name as product,
p1.category,
affinity_products.companion_product,
affinity_products.companion_category,
affinity_products.co_occurrence_count,
affinity_products.affinity_strength
FROM products p1
-- LATERAL join (PostgreSQL); SQL Server would use CROSS APPLY with TOP 3
CROSS JOIN LATERAL (
SELECT
p2.product_name as companion_product,
p2.category as companion_category,
COUNT(*) as co_occurrence_count,
ROUND(
COUNT(*) * 100.0 / (
SELECT COUNT(DISTINCT oi1.order_id)
FROM order_items oi1
WHERE oi1.product_id = p1.product_id
), 2
) as affinity_strength
FROM order_items oi1
JOIN order_items oi2 ON oi1.order_id = oi2.order_id
AND oi1.product_id != oi2.product_id
JOIN products p2 ON oi2.product_id = p2.product_id
WHERE oi1.product_id = p1.product_id
GROUP BY p2.product_id, p2.product_name, p2.category
ORDER BY COUNT(*) DESC
LIMIT 3
) affinity_products
WHERE EXISTS (
SELECT 1 FROM order_items oi WHERE oi.product_id = p1.product_id
)
ORDER BY p1.category, p1.product_name;
-- Task 3: Churn Prediction Features
WITH customer_order_history AS (
SELECT
customer_id,
order_date,
total_amount,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) as order_sequence,
LAG(order_date) OVER (PARTITION BY customer_id ORDER BY order_date) as prev_order_date,
EXTRACT(month FROM order_date) as order_month,
EXTRACT(year FROM order_date) as order_year
FROM orders
),
customer_patterns AS (
SELECT
customer_id,
COUNT(*) as total_orders,
AVG(total_amount) as avg_order_value,
STDDEV(total_amount) as order_value_stddev,
MAX(order_date) as last_order_date,
AVG(order_date - prev_order_date) as avg_days_between_orders,
-- Seasonal pattern: percentage of orders in each quarter
100.0 * COUNT(CASE WHEN EXTRACT(month FROM order_date) IN (1,2,3) THEN 1 END) / COUNT(*) as q1_percentage,
100.0 * COUNT(CASE WHEN EXTRACT(month FROM order_date) IN (4,5,6) THEN 1 END) / COUNT(*) as q2_percentage,
100.0 * COUNT(CASE WHEN EXTRACT(month FROM order_date) IN (7,8,9) THEN 1 END) / COUNT(*) as q3_percentage,
100.0 * COUNT(CASE WHEN EXTRACT(month FROM order_date) IN (10,11,12) THEN 1 END) / COUNT(*) as q4_percentage
FROM customer_order_history
WHERE prev_order_date IS NOT NULL OR order_sequence = 1
GROUP BY customer_id
),
recent_trends AS (
SELECT
customer_id,
-- Recent vs historical average order value
AVG(CASE WHEN order_date >= CURRENT_DATE - INTERVAL '90 days'
THEN total_amount END) as recent_avg_order_value,
AVG(CASE WHEN order_date < CURRENT_DATE - INTERVAL '90 days'
THEN total_amount END) as historical_avg_order_value,
-- Recent order frequency
COUNT(CASE WHEN order_date >= CURRENT_DATE - INTERVAL '90 days'
THEN 1 END) as recent_order_count
FROM orders
GROUP BY customer_id
)
SELECT
c.customer_id,
c.name,
c.tier,
cp.total_orders,
cp.avg_order_value,
cp.order_value_stddev,
(CURRENT_DATE - cp.last_order_date) as days_since_last_order,
cp.avg_days_between_orders,
rt.recent_avg_order_value,
rt.historical_avg_order_value,
CASE
WHEN rt.historical_avg_order_value > 0
THEN (rt.recent_avg_order_value - rt.historical_avg_order_value) / rt.historical_avg_order_value * 100
ELSE 0
END as order_value_trend_pct,
rt.recent_order_count,
cp.q1_percentage,
cp.q2_percentage,
cp.q3_percentage,
cp.q4_percentage,
-- Churn risk score (simple heuristic)
CASE
WHEN (CURRENT_DATE - cp.last_order_date) > 180 THEN 'high'
WHEN (CURRENT_DATE - cp.last_order_date) > 90 THEN 'medium'
ELSE 'low'
END as churn_risk
FROM customers c
LEFT JOIN customer_patterns cp ON c.customer_id = cp.customer_id
LEFT JOIN recent_trends rt ON c.customer_id = rt.customer_id
ORDER BY days_since_last_order DESC;
Try running these queries on your own dataset, modifying the time windows and metrics to match your specific business needs.
Even experienced developers encounter pitfalls when working with complex subqueries and CTEs. Let's examine the most common issues and their solutions.
Mistake: Writing correlated subqueries that execute unnecessarily often.
-- PROBLEMATIC: This subquery executes once per row
SELECT
product_id,
product_name,
price,
(SELECT AVG(price) FROM products p2 WHERE p2.category = p1.category) as category_avg,
(SELECT COUNT(*) FROM products p3 WHERE p3.category = p1.category) as category_count
FROM products p1
ORDER BY category, price;
Solution: Use window functions or CTEs to calculate once per category.
-- BETTER: Window functions calculate once per category
SELECT
product_id,
product_name,
price,
AVG(price) OVER (PARTITION BY category) as category_avg,
COUNT(*) OVER (PARTITION BY category) as category_count
FROM products
ORDER BY category, price;
Mistake: Trying to reference CTEs out of order or incorrectly.
-- WRONG: second_cte tries to use third_cte before it's defined
WITH second_cte AS (
SELECT customer_id, total_value
FROM third_cte -- Error: third_cte not yet defined
WHERE total_value > 1000
),
first_cte AS (
SELECT customer_id, SUM(total_amount) as total_value
FROM orders
GROUP BY customer_id
),
third_cte AS (
SELECT customer_id, total_value
FROM first_cte
)
SELECT * FROM second_cte;
Solution: Define CTEs in dependency order and understand scoping rules.
-- CORRECT: CTEs defined in proper order
WITH first_cte AS (
SELECT customer_id, SUM(total_amount) as total_value
FROM orders
GROUP BY customer_id
),
second_cte AS (
SELECT customer_id, total_value
FROM first_cte
WHERE total_value > 1000
)
SELECT * FROM second_cte;
Mistake: Forgetting termination conditions in recursive CTEs.
-- DANGEROUS: No termination condition
WITH RECURSIVE employee_hierarchy AS (
SELECT employee_id, manager_id, name, 0 as level
FROM employees
WHERE manager_id IS NULL
UNION ALL
SELECT e.employee_id, e.manager_id, e.name, eh.level + 1
FROM employees e
JOIN employee_hierarchy eh ON e.manager_id = eh.employee_id
-- Missing: WHERE eh.level < some_limit
)
SELECT * FROM employee_hierarchy;
Solution: Always include explicit termination conditions.
-- SAFE: Multiple termination conditions
WITH RECURSIVE employee_hierarchy AS (
SELECT employee_id, manager_id, name, 0 as level
FROM employees
WHERE manager_id IS NULL
UNION ALL
SELECT e.employee_id, e.manager_id, e.name, eh.level + 1
FROM employees e
JOIN employee_hierarchy eh ON e.manager_id = eh.employee_id
WHERE eh.level < 10 -- Prevent runaway recursion
AND e.employee_id != eh.employee_id -- Prevent self-references
)
SELECT * FROM employee_hierarchy;
Mistake: Forgetting how NULLs affect subquery results and aggregations.
-- PROBLEMATIC: What happens when a customer has no orders?
SELECT
c.name,
(SELECT AVG(total_amount) FROM orders o WHERE o.customer_id = c.customer_id) as avg_order
FROM customers c
WHERE (
SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id
) > 5;
If a customer has no orders, the AVG() returns NULL and the COUNT() returns 0. Here the WHERE condition filters those customers out, but in a query without such a filter, the SELECT would silently show NULL for avg_order.
Solution: Use explicit NULL handling and consider edge cases.
-- BETTER: Explicit NULL handling
SELECT
c.name,
COALESCE(
(SELECT AVG(total_amount) FROM orders o WHERE o.customer_id = c.customer_id),
0
) as avg_order,
COALESCE(
(SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id),
0
) as order_count
FROM customers c
WHERE (
SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id
) > 5;
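The underlying NULL behavior is easy to verify in SQLite from Python's sqlite3 module: an aggregate over zero rows yields NULL (surfacing as None), and COALESCE substitutes the default (data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer_id INTEGER, total_amount REAL);
INSERT INTO orders VALUES (1, 100), (1, 200);   -- customer 2 has no orders
""")

# A scalar aggregate over zero matching rows yields NULL, not 0.
raw, = conn.execute(
    "SELECT (SELECT AVG(total_amount) FROM orders WHERE customer_id = 2)"
).fetchone()
print(raw)  # None

# COALESCE replaces the missing value with an explicit default.
fixed, = conn.execute(
    "SELECT COALESCE((SELECT AVG(total_amount) FROM orders WHERE customer_id = 2), 0)"
).fetchone()
print(fixed)  # 0
```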
Mistake: Mixing incompatible data types in recursive or UNION queries.
-- ERROR: Different data types in UNION
WITH customer_events AS (
SELECT customer_id, 'registration' as event_type, registration_date as event_date
FROM customers
UNION ALL
SELECT customer_id, 'order', order_id -- order_id is INTEGER, not DATE
FROM orders
)
SELECT * FROM customer_events;
Solution: Ensure compatible data types and explicit casting when necessary.
-- CORRECT: Compatible data types
WITH customer_events AS (
SELECT customer_id, 'registration' as event_type, registration_date as event_date
FROM customers
UNION ALL
SELECT customer_id, 'order', order_date
FROM orders
)
SELECT * FROM customer_events
ORDER BY customer_id, event_date;
When CTE queries produce unexpected results, debug systematically:
-- Add debug information to troubleshoot
WITH customer_stats AS (
SELECT
customer_id,
COUNT(*) as order_count,
SUM(total_amount) as total_spent,
-- Debug: show individual components
COUNT(*) as debug_order_count,
SUM(CASE WHEN total_amount IS NULL THEN 1 ELSE 0 END) as null_amounts
FROM orders
GROUP BY customer_id
)
SELECT
customer_id,
order_count,
total_spent,
-- Debug columns help verify logic
debug_order_count,
null_amounts,
CASE WHEN order_count != debug_order_count THEN 'MISMATCH' ELSE 'OK' END as count_check
FROM customer_stats
WHERE customer_id IN (101, 102, 103) -- Limit to specific customers for testing
ORDER BY customer_id;
You've now mastered the fundamental patterns for subqueries and CTEs, along with their performance characteristics and common pitfalls. These tools transform SQL from a simple data retrieval language into a powerful analytical platform capable of sophisticated multi-step reasoning.
Key takeaways:
- Subqueries nest one query inside another; CTEs name intermediate results that live for a single statement.
- Correlated subqueries logically run once per outer row; window functions or CTEs often do the same work once.
- EXISTS short-circuits on the first match, making it well suited to boolean filtering.
- CTE materialization behavior varies by database and version, so verify with execution plans rather than assuming.
- Filter early, index correlation columns, and always give recursive CTEs an explicit termination condition.
Advanced patterns you've learned: chained CTEs that build analyses step by step, recursive CTEs for hierarchies and graphs, cohort retention analysis, and multi-period pivots via conditional aggregation.
Next steps for deeper expertise: run EXPLAIN (or your database's equivalent) on the queries from this lesson, compare CTE and subquery plans on your own data volumes, and revisit the hands-on exercise against your production schema.
The patterns you've learned here form the foundation for advanced SQL analytics. Whether you're building business intelligence dashboards, performing ad-hoc analysis, or designing data pipelines, these techniques will help you write queries that are both powerful and maintainable.
Remember: the best SQL query is one that solves the business problem correctly, performs well with your data volumes, and can be understood by the next person who needs to modify it. Subqueries and CTEs are tools in service of those goals, not ends in themselves.