Subqueries and CTEs in SQL: Advanced Query Techniques for Data Professionals

You're analyzing sales performance across multiple regions, and your manager asks a seemingly simple question: "Which sales reps exceeded their regional average by more than 20%?" Your first instinct might be to write multiple queries, export to Excel, and manually calculate the comparisons. But there's a better way—one that keeps all the logic in SQL and produces results you can trust and reproduce.

Subqueries and Common Table Expressions (CTEs) are the tools that separate casual SQL users from true data professionals. They let you break complex problems into logical steps, reuse intermediate calculations, and write queries that read like the business logic they represent. By the end of this lesson, you'll be writing sophisticated multi-step analyses that would have seemed impossible with basic SELECT statements.

What you'll learn:

How to use correlated and non-correlated subqueries to solve complex filtering and calculation problems
When to choose subqueries versus JOINs for performance and readability
How to structure CTEs to break down complex business logic into manageable steps
Advanced CTE techniques including recursive queries and window functions
Performance considerations and optimization strategies for nested queries

Prerequisites

You should be comfortable with basic SELECT statements, JOINs, aggregate functions, and window functions. If you're unsure about any of these, review those fundamentals before diving into subqueries and CTEs.

Understanding Subqueries: Queries Within Queries

A subquery is simply a query nested inside another query. Think of it as asking a question that depends on the answer to another question. Let's start with our sales scenario using a realistic dataset.

-- Sample data structure we'll work with
CREATE TABLE sales_reps (
    rep_id INT PRIMARY KEY,
    rep_name VARCHAR(100),
    region VARCHAR(50),
    hire_date DATE
);

CREATE TABLE sales_transactions (
    transaction_id INT PRIMARY KEY,
    rep_id INT,
    customer_id INT,
    sale_amount DECIMAL(10,2),
    sale_date DATE,
    product_category VARCHAR(50)
);

Non-Correlated Subqueries

Non-correlated subqueries execute independently of the outer query. They run once and return a result that the outer query uses. Here's how to find sales reps who generated more revenue than the company average:

SELECT 
    sr.rep_name,
    sr.region,
    SUM(st.sale_amount) as total_sales
FROM sales_reps sr
JOIN sales_transactions st ON sr.rep_id = st.rep_id
GROUP BY sr.rep_id, sr.rep_name, sr.region
HAVING SUM(st.sale_amount) > (
    SELECT AVG(rep_total) 
    FROM (
        SELECT SUM(sale_amount) as rep_total
        FROM sales_transactions
        GROUP BY rep_id
    ) rep_totals
);

Notice how the subquery calculates the average total sales across all reps, then the outer query uses that single value to filter results. The subquery runs once, regardless of how many rows are in the outer query.

Correlated Subqueries

Correlated subqueries reference columns from the outer query, creating a dependent relationship. They execute once for each row processed by the outer query. Here's how to find reps who exceed their regional average:

SELECT 
    sr.rep_name,
    sr.region,
    SUM(st.sale_amount) as rep_sales,
    (
        SELECT AVG(regional_sales.rep_total)
        FROM (
            SELECT SUM(st2.sale_amount) as rep_total
            FROM sales_reps sr2
            JOIN sales_transactions st2 ON sr2.rep_id = st2.rep_id
            WHERE sr2.region = sr.region  -- This creates the correlation
            GROUP BY sr2.rep_id
        ) regional_sales
    ) as regional_average
FROM sales_reps sr
JOIN sales_transactions st ON sr.rep_id = st.rep_id
GROUP BY sr.rep_id, sr.rep_name, sr.region
HAVING SUM(st.sale_amount) > (
    SELECT AVG(regional_sales.rep_total)
    FROM (
        SELECT SUM(st2.sale_amount) as rep_total
        FROM sales_reps sr2
        JOIN sales_transactions st2 ON sr2.rep_id = st2.rep_id
        WHERE sr2.region = sr.region
        GROUP BY sr2.rep_id
    ) regional_sales
);

This query is more complex because for each rep in the outer query, the subquery recalculates the average for that specific region. The WHERE sr2.region = sr.region clause creates the correlation.

Performance Warning: Correlated subqueries can be expensive because they execute multiple times. For the query above, if you have 100 sales reps, the subquery runs 100 times. Always consider whether a JOIN might be more efficient.

Subqueries in Different Clauses

Subqueries aren't limited to WHERE and HAVING clauses. Here are practical examples for each location:

SELECT clause (calculated columns):

SELECT 
    sr.rep_name,
    sr.region,
    (
        SELECT COUNT(*)
        FROM sales_transactions st
        WHERE st.rep_id = sr.rep_id
        AND st.sale_date >= '2024-01-01'
    ) as ytd_transaction_count,
    (
        SELECT SUM(sale_amount)
        FROM sales_transactions st
        WHERE st.rep_id = sr.rep_id
        AND st.sale_date >= '2024-01-01'
    ) as ytd_sales
FROM sales_reps sr;

FROM clause (derived tables):

SELECT 
    monthly_sales.sale_month,
    AVG(monthly_sales.monthly_total) as avg_monthly_sales,
    MAX(monthly_sales.monthly_total) as peak_monthly_sales
FROM (
    SELECT 
        DATE_TRUNC('month', sale_date) as sale_month,
        SUM(sale_amount) as monthly_total
    FROM sales_transactions
    GROUP BY DATE_TRUNC('month', sale_date)
) monthly_sales
GROUP BY monthly_sales.sale_month
ORDER BY sale_month;

Common Table Expressions: Readable Multi-Step Logic

CTEs solve the readability problem that complex subqueries create. They let you name intermediate result sets and reference them multiple times within a single query. Think of CTEs as temporary views that exist only for the duration of your query.

Basic CTE Syntax

Here's the basic structure:

WITH cte_name AS (
    -- Your query here
)
SELECT * FROM cte_name;

Let's rewrite our regional average comparison using a CTE:

WITH rep_totals AS (
    SELECT 
        sr.rep_id,
        sr.rep_name,
        sr.region,
        SUM(st.sale_amount) as total_sales
    FROM sales_reps sr
    JOIN sales_transactions st ON sr.rep_id = st.rep_id
    GROUP BY sr.rep_id, sr.rep_name, sr.region
),
regional_averages AS (
    SELECT 
        region,
        AVG(total_sales) as avg_regional_sales
    FROM rep_totals
    GROUP BY region
)
SELECT 
    rt.rep_name,
    rt.region,
    rt.total_sales,
    ra.avg_regional_sales,
    rt.total_sales - ra.avg_regional_sales as difference,
    ROUND(
        ((rt.total_sales - ra.avg_regional_sales) / ra.avg_regional_sales) * 100, 
        2
    ) as percent_above_average
FROM rep_totals rt
JOIN regional_averages ra ON rt.region = ra.region
WHERE rt.total_sales > ra.avg_regional_sales * 1.2  -- 20% above average
ORDER BY percent_above_average DESC;

This CTE approach is much more readable. We can see the logical flow: calculate individual totals, then regional averages, then compare them. Each CTE has a clear purpose and can be understood independently.

Multiple CTEs and Complex Analysis

CTEs really shine when you need to perform multi-step analysis. Let's build a comprehensive sales performance dashboard:

WITH monthly_sales AS (
    SELECT 
        sr.rep_id,
        sr.rep_name,
        sr.region,
        DATE_TRUNC('month', st.sale_date) as sale_month,
        SUM(st.sale_amount) as monthly_total,
        COUNT(st.transaction_id) as transaction_count
    FROM sales_reps sr
    JOIN sales_transactions st ON sr.rep_id = st.rep_id
    WHERE st.sale_date >= '2024-01-01'
    GROUP BY sr.rep_id, sr.rep_name, sr.region, DATE_TRUNC('month', st.sale_date)
),
rep_performance AS (
    SELECT 
        rep_id,
        rep_name,
        region,
        SUM(monthly_total) as ytd_sales,
        AVG(monthly_total) as avg_monthly_sales,
        COUNT(sale_month) as active_months,
        SUM(transaction_count) as total_transactions
    FROM monthly_sales
    GROUP BY rep_id, rep_name, region
),
regional_benchmarks AS (
    SELECT 
        region,
        AVG(ytd_sales) as avg_regional_ytd,
        AVG(avg_monthly_sales) as avg_regional_monthly,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY ytd_sales) as top_quartile_ytd
    FROM rep_performance
    GROUP BY region
),
performance_rankings AS (
    SELECT 
        rp.*,
        rb.avg_regional_ytd,
        rb.avg_regional_monthly,
        rb.top_quartile_ytd,
        RANK() OVER (PARTITION BY rp.region ORDER BY rp.ytd_sales DESC) as regional_rank,
        RANK() OVER (ORDER BY rp.ytd_sales DESC) as company_rank
    FROM rep_performance rp
    JOIN regional_benchmarks rb ON rp.region = rb.region
)
SELECT 
    rep_name,
    region,
    ytd_sales,
    avg_monthly_sales,
    total_transactions,
    regional_rank,
    company_rank,
    CASE 
        WHEN ytd_sales >= top_quartile_ytd THEN 'Top Performer'
        WHEN ytd_sales >= avg_regional_ytd THEN 'Above Average'
        ELSE 'Needs Improvement'
    END as performance_category,
    ROUND(
        ((ytd_sales - avg_regional_ytd) / avg_regional_ytd) * 100, 
        2
    ) as vs_regional_average_pct
FROM performance_rankings
ORDER BY company_rank;

This query demonstrates the power of CTEs for complex analysis. Each step builds on the previous one, creating a clear narrative from raw transactions to final performance rankings.

Recursive CTEs: Handling Hierarchical Data

Recursive CTEs handle hierarchical or tree-like data structures. They're essential for organizational charts, product categories, or any scenario where data has parent-child relationships.

-- Sample hierarchical data
CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    employee_name VARCHAR(100),
    manager_id INT,
    department VARCHAR(50),
    salary DECIMAL(10,2)
);

-- Find all employees in a management chain
WITH RECURSIVE management_chain AS (
    -- Base case: start with a specific manager
    SELECT 
        employee_id,
        employee_name,
        manager_id,
        department,
        salary,
        0 as level,
        employee_name as chain_path
    FROM employees
    WHERE employee_id = 101  -- Starting manager ID
    
    UNION ALL
    
    -- Recursive case: find direct reports
    SELECT 
        e.employee_id,
        e.employee_name,
        e.manager_id,
        e.department,
        e.salary,
        mc.level + 1,
        mc.chain_path || ' -> ' || e.employee_name
    FROM employees e
    JOIN management_chain mc ON e.manager_id = mc.employee_id
)
SELECT 
    employee_id,
    employee_name,
    department,
    salary,
    level,
    chain_path,
    CASE level
        WHEN 0 THEN 'Senior Manager'
        WHEN 1 THEN 'Manager'
        WHEN 2 THEN 'Senior Associate'
        ELSE 'Associate'
    END as role_level
FROM management_chain
ORDER BY level, employee_name;

Recursion Tip: Always include a termination condition to prevent infinite loops. Most databases have built-in limits, but it's better to be explicit about when recursion should stop.

Performance Considerations and Optimization

Understanding when to use subqueries versus JOINs is crucial for performance. Here are the key decision factors:

When to Choose Subqueries

Use subqueries when:

You need a single value for comparison (non-correlated scalar subqueries)
The logic is clearer with a nested approach
You're filtering based on aggregate calculations from related tables

-- Good use of subquery: filtering by aggregate
SELECT customer_id, customer_name
FROM customers
WHERE customer_id IN (
    SELECT customer_id
    FROM orders
    WHERE order_date >= '2024-01-01'
    GROUP BY customer_id
    HAVING SUM(order_amount) > 10000
);

When to Choose JOINs

Use JOINs when:

You need columns from multiple tables in the result set
Performance is critical and you're dealing with large datasets
You can achieve the same result with either approach

-- Often more efficient than correlated subqueries
SELECT 
    sr.rep_name,
    sr.region,
    SUM(st.sale_amount) as rep_sales,
    regional_avg.avg_sales
FROM sales_reps sr
JOIN sales_transactions st ON sr.rep_id = st.rep_id
JOIN (
    SELECT 
        sr2.region,
        AVG(rep_totals.total_sales) as avg_sales
    FROM sales_reps sr2
    JOIN (
        SELECT rep_id, SUM(sale_amount) as total_sales
        FROM sales_transactions
        GROUP BY rep_id
    ) rep_totals ON sr2.rep_id = rep_totals.rep_id
    GROUP BY sr2.region
) regional_avg ON sr.region = regional_avg.region
GROUP BY sr.rep_id, sr.rep_name, sr.region, regional_avg.avg_sales
HAVING SUM(st.sale_amount) > regional_avg.avg_sales * 1.2;

CTE Performance Considerations

CTEs in most databases are materialized once and can be referenced multiple times efficiently. However, some databases treat CTEs as views and re-execute them for each reference. Know your database's behavior:

-- This CTE will be referenced twice - efficient in most databases
WITH expensive_calculation AS (
    SELECT 
        product_id,
        AVG(sale_amount) as avg_price,
        COUNT(*) as sale_count,
        -- Expensive window function
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sale_amount) as median_price
    FROM sales_transactions
    WHERE sale_date >= '2024-01-01'
    GROUP BY product_id
)
SELECT 
    ec1.product_id,
    ec1.avg_price,
    ec1.sale_count,
    ec1.median_price,
    -- Second reference to the same CTE
    ec2.avg_price as category_avg
FROM expensive_calculation ec1
JOIN expensive_calculation ec2 ON ec1.product_id = ec2.product_id;

Hands-On Exercise

Let's build a comprehensive customer segmentation analysis using both subqueries and CTEs. You'll analyze an e-commerce dataset to identify high-value customers and their purchasing patterns.

-- Setup: Create sample e-commerce data
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    customer_name VARCHAR(100),
    email VARCHAR(100),
    registration_date DATE,
    country VARCHAR(50)
);

CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    order_date DATE,
    order_amount DECIMAL(10,2),
    shipping_cost DECIMAL(10,2)
);

CREATE TABLE order_items (
    item_id INT PRIMARY KEY,
    order_id INT,
    product_category VARCHAR(50),
    quantity INT,
    unit_price DECIMAL(10,2)
);

-- Your task: Write a query that identifies "VIP customers" based on these criteria:
-- 1. Total lifetime value > 75th percentile of all customers
-- 2. Average order value > company average
-- 3. Has made purchases in at least 3 different product categories
-- 4. Has been active (made a purchase) in the last 90 days

-- Solution using CTEs:
WITH customer_metrics AS (
    SELECT 
        c.customer_id,
        c.customer_name,
        c.email,
        c.country,
        c.registration_date,
        COUNT(o.order_id) as total_orders,
        SUM(o.order_amount) as lifetime_value,
        AVG(o.order_amount) as avg_order_value,
        MAX(o.order_date) as last_order_date,
        COUNT(DISTINCT oi.product_category) as categories_purchased
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    LEFT JOIN order_items oi ON o.order_id = oi.order_id
    WHERE o.order_date IS NOT NULL  -- Only customers who have made purchases
    GROUP BY c.customer_id, c.customer_name, c.email, c.country, c.registration_date
),
company_benchmarks AS (
    SELECT 
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY lifetime_value) as ltv_75th_percentile,
        AVG(avg_order_value) as company_avg_order_value
    FROM customer_metrics
),
vip_candidates AS (
    SELECT 
        cm.*,
        cb.ltv_75th_percentile,
        cb.company_avg_order_value,
        CASE 
            WHEN cm.lifetime_value > cb.ltv_75th_percentile
                AND cm.avg_order_value > cb.company_avg_order_value
                AND cm.categories_purchased >= 3
                AND cm.last_order_date >= CURRENT_DATE - INTERVAL '90 days'
            THEN TRUE
            ELSE FALSE
        END as is_vip
    FROM customer_metrics cm
    CROSS JOIN company_benchmarks cb
)
SELECT 
    customer_name,
    email,
    country,
    total_orders,
    lifetime_value,
    avg_order_value,
    categories_purchased,
    last_order_date,
    ROUND(
        ((lifetime_value - ltv_75th_percentile) / ltv_75th_percentile) * 100, 
        2
    ) as ltv_vs_75th_percentile,
    CASE 
        WHEN is_vip THEN 'VIP Customer'
        WHEN lifetime_value > ltv_75th_percentile THEN 'High Value'
        WHEN avg_order_value > company_avg_order_value THEN 'High Spender'
        ELSE 'Standard'
    END as customer_segment
FROM vip_candidates
WHERE is_vip = TRUE  -- Only show VIP customers
ORDER BY lifetime_value DESC;

This exercise demonstrates several advanced concepts:

Multiple CTEs building on each other
Cross joins for benchmark comparisons
Complex boolean logic for segmentation
Practical business metrics calculation

Common Mistakes & Troubleshooting

Mistake 1: Correlated Subquery Performance Issues

Problem: Using correlated subqueries when a JOIN would be more efficient.

-- Inefficient: This subquery runs once per row
SELECT 
    customer_name,
    (
        SELECT AVG(order_amount)
        FROM orders o
        WHERE o.customer_id = c.customer_id
    ) as avg_order_amount
FROM customers c;

Solution: Use window functions or JOINs instead:

-- More efficient: Single scan with window function
SELECT DISTINCT
    c.customer_name,
    AVG(o.order_amount) OVER (PARTITION BY c.customer_id) as avg_order_amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id;

Mistake 2: CTE Column Reference Errors

Problem: Referencing columns that don't exist in the CTE.

-- Error: total_sales isn't selected in the CTE
WITH sales_summary AS (
    SELECT rep_id, SUM(sale_amount) as sales_total
    FROM sales_transactions
    GROUP BY rep_id
)
SELECT rep_id, total_sales  -- Column doesn't exist!
FROM sales_summary;

Solution: Ensure column names match between CTE definition and usage.

Mistake 3: Recursive CTE Infinite Loops

Problem: Missing or incorrect termination conditions.

-- Dangerous: Could loop infinitely
WITH RECURSIVE bad_recursion AS (
    SELECT employee_id, manager_id, 1 as level
    FROM employees
    WHERE manager_id IS NULL
    
    UNION ALL
    
    SELECT e.employee_id, e.manager_id, br.level + 1
    FROM employees e
    JOIN bad_recursion br ON e.manager_id = br.employee_id
    -- No termination condition!
)
SELECT * FROM bad_recursion;

Solution: Add explicit termination conditions:

WITH RECURSIVE safe_recursion AS (
    SELECT employee_id, manager_id, 1 as level
    FROM employees
    WHERE manager_id IS NULL
    
    UNION ALL
    
    SELECT e.employee_id, e.manager_id, sr.level + 1
    FROM employees e
    JOIN safe_recursion sr ON e.manager_id = sr.employee_id
    WHERE sr.level < 10  -- Prevent runaway recursion
)
SELECT * FROM safe_recursion;

Troubleshooting Tips

Test CTEs independently: Run each CTE as a separate query to verify the results before combining them.
Use EXPLAIN PLAN: Understand how your database executes complex queries to identify performance bottlenecks.
Start simple: Begin with basic subqueries and gradually add complexity.
Validate assumptions: Check that your subqueries return the expected number of rows and data types.

Summary & Next Steps

You've learned how to use subqueries and CTEs to solve complex analytical problems that would be difficult or impossible with basic SQL. Subqueries let you embed one query inside another for dynamic filtering and calculations, while CTEs provide a readable way to break complex logic into manageable steps.

Key takeaways:

Use non-correlated subqueries for single-value comparisons and filtering
Be cautious with correlated subqueries—they can impact performance significantly
CTEs improve readability and maintainability of complex queries
Recursive CTEs handle hierarchical data elegantly
Consider performance implications when choosing between subqueries, CTEs, and JOINs

Next steps in your SQL journey:

Explore advanced window functions for analytical queries
Learn about query optimization and execution plans
Study database-specific features like materialized views and table functions
Practice with real-world datasets to build confidence with complex scenarios

The techniques you've learned here form the foundation for advanced analytical SQL. With practice, you'll find yourself naturally thinking in terms of multi-step logical processes that CTEs and subqueries can elegantly express.