Master Subqueries and CTEs in SQL: From Simple Filters to Complex Analytics

You're staring at a spreadsheet with thousands of rows of sales data, trying to answer what should be a simple question: "Which products sold above our average price last month?" In a spreadsheet, you'd need multiple steps—first calculate the average, then filter based on that result. But in SQL, you can answer complex questions like this in a single, elegant query using subqueries and Common Table Expressions (CTEs).

These powerful SQL features let you break down complicated data problems into manageable pieces, creating queries within queries that build upon each other logically. Think of them as a way to create temporary "scratch work" that helps you solve multi-step analytical problems without creating permanent tables or running multiple separate queries.

By the end of this lesson, you'll transform from someone who struggles with complex data questions into someone who can tackle sophisticated analytical challenges with confidence and clarity.

What you'll learn:

How to write subqueries that answer questions requiring multiple steps of logic
When to use subqueries in SELECT, WHERE, and FROM clauses for different analytical purposes
How to create and use Common Table Expressions (CTEs) to make complex queries readable and maintainable
How to chain multiple CTEs together to solve advanced analytical problems
Best practices for choosing between subqueries and CTEs based on your specific needs

Prerequisites

You should be comfortable with basic SQL SELECT statements, including filtering with WHERE clauses, sorting with ORDER BY, and using aggregate functions like COUNT, SUM, and AVG. If you can write queries that join tables together, you're ready for this lesson.

Understanding Subqueries: Queries Within Queries

A subquery is simply a SQL query nested inside another SQL query. It's like asking a question that depends on the answer to another question. Let's start with a concrete example using a realistic e-commerce dataset.

Imagine you're analyzing an online store's data with these tables:

-- Products table
CREATE TABLE products (
    product_id INT,
    product_name VARCHAR(100),
    category VARCHAR(50),
    price DECIMAL(10,2)
);

-- Orders table  
CREATE TABLE orders (
    order_id INT,
    product_id INT,
    customer_id INT,
    quantity INT,
    order_date DATE,
    total_amount DECIMAL(10,2)
);

Now, suppose you want to find all products that cost more than the average product price. Without subqueries, you'd need to:

First, calculate the average price
Remember that number
Write another query using that specific value

With a subquery, you can do this in one step:

SELECT product_name, price
FROM products
WHERE price > (
    SELECT AVG(price)
    FROM products
);

The query inside the parentheses (SELECT AVG(price) FROM products) runs first, calculating the average price. Then the outer query uses that result to filter products.

Key Insight: Subqueries always execute from the inside out. The innermost query runs first, then its results feed into the outer query.

Subqueries in Different Positions

Subqueries can appear in three main places in a SQL statement, each serving different purposes.

Subqueries in the WHERE Clause

WHERE clause subqueries are perfect for filtering based on calculated values or conditions from other tables. Here's a practical example: finding customers who placed orders above the average order value.

SELECT DISTINCT customer_id
FROM orders
WHERE total_amount > (
    SELECT AVG(total_amount)
    FROM orders
);

You can also use subqueries with operators like IN, NOT IN, EXISTS, and NOT EXISTS:

-- Find products that have never been ordered
SELECT product_name
FROM products
WHERE product_id NOT IN (
    SELECT DISTINCT product_id
    FROM orders
    WHERE product_id IS NOT NULL
);

Important: Always include WHERE product_id IS NOT NULL when using NOT IN with subqueries. If the subquery returns any NULL values, NOT IN will return no results at all—a common source of confusion.

Subqueries in the SELECT Clause

SELECT clause subqueries let you add calculated columns based on related data. This is useful for adding context or comparisons to your main results:

SELECT 
    product_name,
    price,
    (SELECT AVG(price) FROM products) AS avg_price,
    price - (SELECT AVG(price) FROM products) AS price_difference
FROM products
ORDER BY price_difference DESC;

This query shows each product's price alongside the overall average and the difference from that average.

Subqueries in the FROM Clause

FROM clause subqueries create temporary result sets that you can query like tables. This is powerful for multi-step analysis:

SELECT 
    category,
    AVG(price) AS category_avg_price
FROM (
    SELECT category, price
    FROM products
    WHERE price > 50
) AS expensive_products
GROUP BY category;

Here, the subquery first filters to expensive products, then the outer query calculates averages by category for just those expensive items.

Introduction to Common Table Expressions (CTEs)

While subqueries work well for simple cases, they can become hard to read and maintain as your logic gets more complex. Common Table Expressions (CTEs) solve this problem by letting you define temporary named result sets that exist for the duration of your query.

Think of a CTE as giving a name to a subquery and defining it at the top of your statement, making your overall query much more readable.

Here's the basic syntax:

WITH cte_name AS (
    -- Your subquery here
    SELECT column1, column2
    FROM some_table
    WHERE some_condition
)
SELECT *
FROM cte_name;

Let's rewrite our expensive products analysis using a CTE:

WITH expensive_products AS (
    SELECT category, price
    FROM products
    WHERE price > 50
)
SELECT 
    category,
    AVG(price) AS category_avg_price
FROM expensive_products
GROUP BY category;

The logic is identical, but notice how much clearer the intent becomes. You can immediately see that we're working with "expensive products" without having to parse through nested parentheses.

Practical CTE Examples

Let's work through some realistic scenarios where CTEs shine.

Monthly Sales Analysis

Suppose you want to analyze month-over-month sales growth. This requires multiple steps:

WITH monthly_sales AS (
    SELECT 
        YEAR(order_date) AS year,
        MONTH(order_date) AS month,
        SUM(total_amount) AS monthly_total
    FROM orders
    GROUP BY YEAR(order_date), MONTH(order_date)
),
sales_with_previous AS (
    SELECT 
        year,
        month,
        monthly_total,
        LAG(monthly_total) OVER (ORDER BY year, month) AS previous_month_total
    FROM monthly_sales
)
SELECT 
    year,
    month,
    monthly_total,
    previous_month_total,
    CASE 
        WHEN previous_month_total IS NOT NULL 
        THEN ((monthly_total - previous_month_total) / previous_month_total) * 100 
        ELSE NULL 
    END AS growth_rate_percent
FROM sales_with_previous
ORDER BY year, month;

This query uses two CTEs:

monthly_sales aggregates orders by month
sales_with_previous adds the previous month's total using a window function
The final SELECT calculates growth rates

Customer Segmentation

Here's a more complex example that segments customers based on their purchasing behavior:

WITH customer_stats AS (
    SELECT 
        customer_id,
        COUNT(*) AS total_orders,
        SUM(total_amount) AS total_spent,
        AVG(total_amount) AS avg_order_value,
        MAX(order_date) AS last_order_date,
        MIN(order_date) AS first_order_date
    FROM orders
    GROUP BY customer_id
),
customer_segments AS (
    SELECT 
        customer_id,
        total_orders,
        total_spent,
        avg_order_value,
        DATEDIFF(CURRENT_DATE, last_order_date) AS days_since_last_order,
        CASE 
            WHEN total_spent > 1000 AND total_orders > 10 THEN 'VIP'
            WHEN total_spent > 500 AND total_orders > 5 THEN 'High Value'
            WHEN days_since_last_order < 30 THEN 'Active'
            WHEN days_since_last_order < 90 THEN 'At Risk'
            ELSE 'Inactive'
        END AS segment
    FROM customer_stats
)
SELECT 
    segment,
    COUNT(*) AS customer_count,
    AVG(total_spent) AS avg_customer_value,
    AVG(total_orders) AS avg_orders_per_customer
FROM customer_segments
GROUP BY segment
ORDER BY customer_count DESC;

This analysis would be nearly impossible to read as nested subqueries, but with CTEs, each step is clear and logical.

Recursive CTEs: Advanced Pattern

Some database systems (like PostgreSQL, SQL Server, and newer versions of MySQL) support recursive CTEs, which can call themselves. This is useful for hierarchical data like organizational charts or category trees.

Here's a simple example with an employee hierarchy:

WITH RECURSIVE employee_hierarchy AS (
    -- Base case: top-level managers
    SELECT employee_id, employee_name, manager_id, 1 as level
    FROM employees
    WHERE manager_id IS NULL
    
    UNION ALL
    
    -- Recursive case: employees reporting to someone in the hierarchy
    SELECT e.employee_id, e.employee_name, e.manager_id, h.level + 1
    FROM employees e
    INNER JOIN employee_hierarchy h ON e.manager_id = h.employee_id
)
SELECT employee_name, level
FROM employee_hierarchy
ORDER BY level, employee_name;

Note: Recursive CTEs are an advanced feature not supported in all database systems. Check your database documentation before using them.

When to Use Subqueries vs CTEs

Choosing between subqueries and CTEs depends on several factors:

Use subqueries when:

You have a simple, one-step calculation
The subquery is used only once
You're working with a database that doesn't support CTEs (rare with modern systems)

Use CTEs when:

Your logic involves multiple steps
You need to reference the same subquery multiple times
You want to make your query more readable and maintainable
You're building complex analytical queries that others need to understand

Here's the same analysis written both ways to illustrate the difference:

With subqueries (harder to read):

SELECT p.product_name, p.price
FROM products p
WHERE p.price > (
    SELECT AVG(o.total_amount / o.quantity)
    FROM orders o
    WHERE o.product_id IN (
        SELECT product_id 
        FROM products 
        WHERE category = p.category
    )
);

With CTEs (much clearer):

WITH category_avg_prices AS (
    SELECT 
        p.category,
        AVG(o.total_amount / o.quantity) AS avg_selling_price
    FROM products p
    JOIN orders o ON p.product_id = o.product_id
    GROUP BY p.category
)
SELECT p.product_name, p.price
FROM products p
JOIN category_avg_prices cap ON p.category = cap.category
WHERE p.price > cap.avg_selling_price;

Hands-On Exercise

Let's practice with a realistic scenario. You're analyzing a subscription business with these tables:

-- Create sample data
CREATE TABLE subscriptions (
    subscription_id INT,
    customer_id INT,
    plan_type VARCHAR(50),
    monthly_fee DECIMAL(10,2),
    start_date DATE,
    end_date DATE
);

CREATE TABLE payments (
    payment_id INT,
    subscription_id INT,
    payment_date DATE,
    amount DECIMAL(10,2),
    status VARCHAR(20)
);

Challenge: Write a query using CTEs to find:

Customers who have paid more than the average customer lifetime value
Show their total payments, number of payments, and average payment amount
Only include customers with successful payments

Try writing this yourself first, then compare with this solution:

WITH successful_payments AS (
    SELECT 
        s.customer_id,
        p.amount,
        p.payment_date
    FROM subscriptions s
    JOIN payments p ON s.subscription_id = p.subscription_id
    WHERE p.status = 'successful'
),
customer_totals AS (
    SELECT 
        customer_id,
        SUM(amount) AS total_paid,
        COUNT(*) AS payment_count,
        AVG(amount) AS avg_payment_amount
    FROM successful_payments
    GROUP BY customer_id
),
overall_avg AS (
    SELECT AVG(total_paid) AS avg_customer_value
    FROM customer_totals
)
SELECT 
    ct.customer_id,
    ct.total_paid,
    ct.payment_count,
    ct.avg_payment_amount
FROM customer_totals ct
CROSS JOIN overall_avg oa
WHERE ct.total_paid > oa.avg_customer_value
ORDER BY ct.total_paid DESC;

Common Mistakes & Troubleshooting

Mistake 1: Forgetting About NULLs with NOT IN

-- This might return no results if any product_id in orders is NULL
SELECT product_name
FROM products
WHERE product_id NOT IN (SELECT product_id FROM orders);

-- Fix: Handle NULLs explicitly
SELECT product_name
FROM products
WHERE product_id NOT IN (
    SELECT product_id 
    FROM orders 
    WHERE product_id IS NOT NULL
);

Mistake 2: Using Correlated Subqueries When CTEs Would Be Better

-- Slow and hard to read
SELECT 
    product_name,
    (SELECT COUNT(*) FROM orders WHERE product_id = p.product_id) AS order_count
FROM products p
WHERE (SELECT COUNT(*) FROM orders WHERE product_id = p.product_id) > 5;

-- Better with CTE
WITH product_order_counts AS (
    SELECT 
        product_id,
        COUNT(*) AS order_count
    FROM orders
    GROUP BY product_id
)
SELECT p.product_name, poc.order_count
FROM products p
JOIN product_order_counts poc ON p.product_id = poc.product_id
WHERE poc.order_count > 5;

Mistake 3: Not Aliasing CTE Columns Properly

-- This will cause errors if column names conflict
WITH sales_data AS (
    SELECT customer_id, SUM(amount)  -- Missing alias!
    FROM orders
    GROUP BY customer_id
)
SELECT * FROM sales_data;  -- Which column is the sum?

-- Fix: Always alias calculated columns
WITH sales_data AS (
    SELECT 
        customer_id, 
        SUM(amount) AS total_amount
    FROM orders
    GROUP BY customer_id
)
SELECT * FROM sales_data;

Troubleshooting Performance

If your CTEs or subqueries are running slowly:

Check if your database optimizes CTEs: Some databases treat CTEs as materialized (stored temporarily), while others optimize them like views. This affects performance differently.
Consider indexes: Make sure columns used in WHERE clauses and JOINs are indexed.
Break down complex CTEs: If a CTE is doing too much work, split it into multiple steps.
Use EXPLAIN: Most databases have an EXPLAIN command that shows how your query will execute:
```
EXPLAIN WITH my_cte AS (...) SELECT ...
```

Summary & Next Steps

You've now mastered two of SQL's most powerful features for handling complex analytical questions. Subqueries let you embed queries within queries for straightforward multi-step logic, while CTEs provide a clean, readable way to break down complex problems into manageable pieces.

Key takeaways:

Use subqueries for simple, single-use calculations
Choose CTEs when you need multiple steps or want readable, maintainable code
Always handle NULLs properly when using NOT IN with subqueries
Alias all calculated columns in CTEs to avoid confusion
Consider performance implications and use EXPLAIN to understand query execution

What to explore next:

Window functions, which often complement CTEs beautifully for analytical queries
Advanced JOIN techniques for combining data from multiple sources
Query optimization techniques to make your complex queries run faster
Stored procedures and functions for reusing complex CTE-based logic

Practice these concepts with your own datasets. Start with simple subqueries, then graduate to multi-step CTEs as you encounter more complex analytical challenges. The combination of clear thinking about your data problems and these powerful SQL tools will make you significantly more effective at extracting insights from data.