Master SQL Subqueries and CTEs: Write Complex Queries That Actually Make Sense

Imagine you're analyzing sales data for a retail company. You need to find all customers who made purchases above the average order value, but only for the top-performing products. This requires you to first calculate averages, then filter based on those calculations, then join that back to customer data. You could try to cram all of this logic into one massive, unreadable query—or you could use subqueries and Common Table Expressions (CTEs) to break it down into manageable, logical pieces.

Subqueries and CTEs are your tools for writing SQL that thinks in layers. Instead of trying to solve complex problems in one giant leap, you can build solutions step by step, making your code more readable, maintainable, and less prone to errors. Think of them as your ability to create "mini-queries" within larger queries, each solving one piece of the puzzle.

By the end of this lesson, you'll understand how to leverage these powerful SQL features to write cleaner, more organized queries that solve real business problems.

What you'll learn:

How to write and use subqueries in SELECT, FROM, and WHERE clauses
When to choose subqueries vs JOINs for better performance
How to create and structure Common Table Expressions (CTEs)
How to chain multiple CTEs together for complex analysis
Best practices for readable, maintainable SQL code

Prerequisites

You should be comfortable with basic SQL operations including SELECT statements, WHERE clauses, and basic JOINs. We'll build on these fundamentals to create more sophisticated queries.

Understanding Subqueries: Queries Within Queries

A subquery is simply a query nested inside another query. Think of it like asking a question within a question: "Show me all products where the price is higher than [what's the average price of all products?]" The part in brackets is your subquery—it runs first, returns a result, and that result gets used by the outer query.

Let's start with a practical example using an e-commerce database. Suppose we have these tables:

-- Products table
CREATE TABLE products (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(100),
    category VARCHAR(50),
    price DECIMAL(10,2)
);

-- Orders table  
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    product_id INT,
    quantity INT,
    order_date DATE,
    total_amount DECIMAL(10,2)
);

Subqueries in the WHERE Clause

The most common use of subqueries is in WHERE clauses for filtering. Let's find all products that are priced above the average:

SELECT product_name, price
FROM products
WHERE price > (
    SELECT AVG(price)
    FROM products
);

Here's what happens step by step:

The inner query SELECT AVG(price) FROM products runs first, calculating the average price
That average (let's say it's $45.67) gets substituted into the outer query
The outer query becomes WHERE price > 45.67 and returns matching products

This is much cleaner than trying to calculate the average in your application code or running multiple separate queries.

Subqueries with IN and EXISTS

You can also use subqueries with IN to check against multiple values. Let's find all customers who have ordered products from the 'Electronics' category:

SELECT DISTINCT customer_id
FROM orders
WHERE product_id IN (
    SELECT product_id
    FROM products
    WHERE category = 'Electronics'
);

The EXISTS operator is even more powerful—it checks whether a subquery returns any rows at all:

SELECT customer_id
FROM orders o1
WHERE EXISTS (
    SELECT 1
    FROM orders o2
    WHERE o2.customer_id = o1.customer_id
    AND o2.order_date > '2024-01-01'
);

This query finds customers who have made at least one order since January 1st, 2024. Notice how the subquery references the outer query's table (o1) - this is called a correlated subquery.

Tip: EXISTS is often more efficient than IN when dealing with large datasets, especially when the subquery might return many rows.

Subqueries in the SELECT Clause

You can use subqueries in your SELECT clause to add calculated columns:

SELECT 
    product_name,
    price,
    (SELECT AVG(price) FROM products) as avg_price,
    price - (SELECT AVG(price) FROM products) as price_difference
FROM products
ORDER BY price_difference DESC;

This shows each product with its price, the overall average price, and how much above or below average each product is priced.

Correlated vs Non-Correlated Subqueries

Understanding the difference between these two types is crucial for writing efficient SQL.

Non-correlated subqueries are independent—they can run completely on their own and always return the same result:

-- This subquery doesn't depend on the outer query
SELECT product_name, price
FROM products
WHERE price > (SELECT AVG(price) FROM products);

Correlated subqueries reference columns from the outer query and must be evaluated for each row:

-- This subquery uses o1.customer_id from the outer query
SELECT customer_id, order_date, total_amount
FROM orders o1
WHERE total_amount > (
    SELECT AVG(total_amount)
    FROM orders o2
    WHERE o2.customer_id = o1.customer_id
);

This finds orders where each customer spent more than their own personal average. The subquery runs once for each customer, using their specific customer_id.

Performance Warning: Correlated subqueries can be slow on large datasets because they execute once per row in the outer query. Sometimes a JOIN or CTE is more efficient.

Introduction to Common Table Expressions (CTEs)

Common Table Expressions are like creating temporary named result sets that exist only for the duration of your query. Think of them as giving names to subqueries so you can reference them multiple times and make your SQL more readable.

The basic syntax uses the WITH clause:

WITH cte_name AS (
    -- Your query here
    SELECT column1, column2
    FROM some_table
)
SELECT *
FROM cte_name;

Let's rewrite our earlier average price example using a CTE:

WITH avg_pricing AS (
    SELECT AVG(price) as average_price
    FROM products
)
SELECT 
    p.product_name,
    p.price,
    ap.average_price,
    p.price - ap.average_price as price_difference
FROM products p
CROSS JOIN avg_pricing ap
ORDER BY price_difference DESC;

This is much more readable than repeating the subquery multiple times, and it's more efficient too—the average is calculated only once.

Building Complex Analysis with Multiple CTEs

The real power of CTEs becomes apparent when you chain them together. Let's build a complex analysis step by step to find our top customers by order value, but only for high-value products.

WITH high_value_products AS (
    -- Step 1: Identify products in the top 25% by price
    SELECT product_id, product_name, price
    FROM products
    WHERE price >= (
        SELECT PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY price)
        FROM products
    )
),
customer_high_value_orders AS (
    -- Step 2: Get orders for only these high-value products
    SELECT 
        o.customer_id,
        o.order_id,
        o.total_amount,
        hvp.product_name
    FROM orders o
    JOIN high_value_products hvp ON o.product_id = hvp.product_id
),
customer_totals AS (
    -- Step 3: Calculate total spending per customer on high-value items
    SELECT 
        customer_id,
        COUNT(*) as high_value_orders,
        SUM(total_amount) as total_high_value_spending,
        AVG(total_amount) as avg_order_value
    FROM customer_high_value_orders
    GROUP BY customer_id
)
-- Step 4: Final results with ranking
SELECT 
    customer_id,
    high_value_orders,
    total_high_value_spending,
    avg_order_value,
    RANK() OVER (ORDER BY total_high_value_spending DESC) as spending_rank
FROM customer_totals
WHERE high_value_orders >= 3  -- At least 3 high-value orders
ORDER BY total_high_value_spending DESC;

Each CTE builds on the previous one, creating a clear logical flow:

Define what "high-value" products are
Find orders for those products
Aggregate by customer
Rank and filter the results

This approach is far more readable and maintainable than trying to write this as nested subqueries or one massive query.

Recursive CTEs: When You Need to Go Deeper

Recursive CTEs are a special type that can reference themselves, useful for hierarchical data like organizational charts or category trees. Here's the syntax:

WITH RECURSIVE hierarchy_name AS (
    -- Base case: starting point
    SELECT employee_id, manager_id, employee_name, 1 as level
    FROM employees
    WHERE manager_id IS NULL  -- Top-level managers
    
    UNION ALL
    
    -- Recursive case: find next level
    SELECT e.employee_id, e.manager_id, e.employee_name, h.level + 1
    FROM employees e
    JOIN hierarchy_name h ON e.manager_id = h.employee_id
)
SELECT * FROM hierarchy_name
ORDER BY level, employee_name;

This builds a complete organizational hierarchy, showing each employee's level in the company structure.

When to Use Subqueries vs CTEs vs JOINs

Choosing the right approach depends on your specific situation:

Use subqueries when:

You need a single value (like an average) for comparison
The logic is simple and won't be reused
You're filtering with EXISTS or IN

Use CTEs when:

You need to reference the same calculation multiple times
You're building complex logic in steps
You want to improve code readability
You're working with hierarchical data (recursive CTEs)

Use JOINs when:

You need to combine data from multiple tables
Performance is critical (JOINs are often faster than correlated subqueries)
You're doing straightforward table relationships

Let's see a performance comparison. These queries do the same thing, but one is much faster:

-- Slower: Correlated subquery
SELECT customer_id, order_date, total_amount
FROM orders o1
WHERE total_amount > (
    SELECT AVG(total_amount)
    FROM orders o2
    WHERE o2.customer_id = o1.customer_id
);

-- Faster: CTE with window function
WITH customer_averages AS (
    SELECT 
        customer_id,
        order_date,
        total_amount,
        AVG(total_amount) OVER (PARTITION BY customer_id) as customer_avg
    FROM orders
)
SELECT customer_id, order_date, total_amount
FROM customer_averages
WHERE total_amount > customer_avg;

The CTE version calculates each customer's average once using a window function, while the correlated subquery recalculates it for every row.

Hands-On Exercise

Let's put your new skills to work with a realistic scenario. You're analyzing an online bookstore database with these tables:

-- Books table
CREATE TABLE books (
    book_id INT PRIMARY KEY,
    title VARCHAR(200),
    author VARCHAR(100),
    genre VARCHAR(50),
    price DECIMAL(8,2),
    publication_year INT
);

-- Sales table
CREATE TABLE sales (
    sale_id INT PRIMARY KEY,
    book_id INT,
    customer_id INT,
    sale_date DATE,
    quantity INT,
    total_amount DECIMAL(10,2)
);

-- Sample data
INSERT INTO books VALUES
(1, 'The Data Revolution', 'Sarah Chen', 'Technology', 29.99, 2023),
(2, 'SQL Mastery', 'Mike Rodriguez', 'Technology', 34.99, 2022),
(3, 'Mystery in the Database', 'Jane Smith', 'Mystery', 19.99, 2023),
(4, 'Python for Everyone', 'Alex Johnson', 'Technology', 39.99, 2021),
(5, 'The Analytics Mind', 'Dr. Lisa Park', 'Business', 27.99, 2023);

INSERT INTO sales VALUES
(1, 1, 101, '2024-01-15', 2, 59.98),
(2, 2, 102, '2024-01-16', 1, 34.99),
(3, 1, 103, '2024-01-17', 1, 29.99),
(4, 3, 101, '2024-01-18', 3, 59.97),
(5, 4, 104, '2024-01-19', 1, 39.99);

Your Challenge: Write a query that finds:

All books that have sold above the average quantity per sale
Include the total revenue for each of these books
Show how each book's average sale price compares to the overall average
Only include books published after 2021

Solution Approach:

Try to solve this yourself first, then compare with this solution:

WITH sales_metrics AS (
    -- Calculate overall averages
    SELECT 
        AVG(quantity) as overall_avg_quantity,
        AVG(total_amount) as overall_avg_sale_amount
    FROM sales
),
book_performance AS (
    -- Get performance metrics per book
    SELECT 
        b.book_id,
        b.title,
        b.author,
        b.genre,
        b.price,
        b.publication_year,
        SUM(s.quantity) as total_quantity_sold,
        COUNT(s.sale_id) as number_of_sales,
        SUM(s.total_amount) as total_revenue,
        AVG(s.quantity) as avg_quantity_per_sale,
        AVG(s.total_amount) as avg_sale_amount
    FROM books b
    JOIN sales s ON b.book_id = s.book_id
    WHERE b.publication_year > 2021
    GROUP BY b.book_id, b.title, b.author, b.genre, b.price, b.publication_year
)
SELECT 
    bp.title,
    bp.author,
    bp.genre,
    bp.total_quantity_sold,
    bp.total_revenue,
    bp.avg_quantity_per_sale,
    sm.overall_avg_quantity,
    bp.avg_sale_amount,
    sm.overall_avg_sale_amount,
    bp.avg_sale_amount - sm.overall_avg_sale_amount as price_difference
FROM book_performance bp
CROSS JOIN sales_metrics sm
WHERE bp.avg_quantity_per_sale > sm.overall_avg_quantity
ORDER BY bp.total_revenue DESC;

This solution demonstrates several key concepts:

Multiple CTEs working together
Aggregation and grouping
Cross joining to make overall metrics available
Clear, readable structure that's easy to modify

Common Mistakes & Troubleshooting

Mistake 1: Using subqueries when CTEs would be clearer

Bad:

SELECT *
FROM (
    SELECT customer_id, AVG(total_amount) as avg_amount
    FROM orders
    GROUP BY customer_id
) customer_avg
WHERE avg_amount > (
    SELECT AVG(total_amount)
    FROM orders
);

Better:

WITH customer_averages AS (
    SELECT customer_id, AVG(total_amount) as avg_amount
    FROM orders
    GROUP BY customer_id
),
overall_average AS (
    SELECT AVG(total_amount) as overall_avg
    FROM orders
)
SELECT ca.customer_id, ca.avg_amount
FROM customer_averages ca
CROSS JOIN overall_average oa
WHERE ca.avg_amount > oa.overall_avg;

Mistake 2: Forgetting that subqueries must return single values in comparisons

This will cause an error if the subquery returns multiple rows:

SELECT product_name
FROM products
WHERE price = (
    SELECT price
    FROM products
    WHERE category = 'Electronics'  -- Multiple products!
);

Use IN instead:

SELECT product_name
FROM products
WHERE price IN (
    SELECT price
    FROM products
    WHERE category = 'Electronics'
);

Mistake 3: Performance issues with correlated subqueries

If your query is running slowly, check if you're using correlated subqueries that could be rewritten as JOINs or CTEs with window functions.

Debugging Tip: Test your CTEs independently by running just the CTE portion with a simple SELECT * to verify the data before building your final query.

Mistake 4: Not considering NULL values

Subqueries with IN don't behave as expected when NULL values are involved:

-- This might not return what you expect if any price is NULL
SELECT product_name
FROM products
WHERE price NOT IN (
    SELECT price
    FROM discontinued_products
);

Use EXISTS for safer NULL handling:

SELECT product_name
FROM products p
WHERE NOT EXISTS (
    SELECT 1
    FROM discontinued_products dp
    WHERE dp.price = p.price
);

Summary & Next Steps

You've now learned how to use subqueries and CTEs to break complex SQL problems into manageable pieces. Subqueries let you embed queries within queries for filtering and calculations, while CTEs provide a clean way to build multi-step analyses with readable, reusable components.

Key takeaways:

Use subqueries for simple filtering and single-value calculations
Choose CTEs when you need to reference the same logic multiple times or build complex step-by-step analysis
Consider performance implications—correlated subqueries can be slow on large datasets
Always test your components independently before combining them
Recursive CTEs open up possibilities for hierarchical data analysis

Your next steps:

Practice converting complex single queries into multi-CTE solutions for better readability
Learn about window functions, which often pair well with CTEs for advanced analytics
Explore query optimization techniques to make your subqueries and CTEs run faster
Study advanced CTE patterns like running totals, data pivoting, and time-series analysis

The ability to think in layers—solving one piece at a time—will transform how you approach complex data problems. You're no longer limited to what you can express in a single query; you can now build sophisticated analyses that are both powerful and maintainable.