Imagine you're analyzing sales data for a retail company. You need to find all customers who made purchases above the average order value, but only for the top-performing products. This requires you to first calculate averages, then filter based on those calculations, then join that back to customer data. You could try to cram all of this logic into one massive, unreadable query—or you could use subqueries and Common Table Expressions (CTEs) to break it down into manageable, logical pieces.
Subqueries and CTEs are your tools for writing SQL that thinks in layers. Instead of trying to solve complex problems in one giant leap, you can build solutions step by step, making your code more readable, maintainable, and less prone to errors. Think of them as your ability to create "mini-queries" within larger queries, each solving one piece of the puzzle.
By the end of this lesson, you'll understand how to leverage these powerful SQL features to write cleaner, more organized queries that solve real business problems.
What you'll learn:
You should be comfortable with basic SQL operations including SELECT statements, WHERE clauses, and basic JOINs. We'll build on these fundamentals to create more sophisticated queries.
A subquery is simply a query nested inside another query. Think of it like asking a question within a question: "Show me all products where the price is higher than [what's the average price of all products?]" The part in brackets is your subquery—it runs first, returns a result, and that result gets used by the outer query.
Let's start with a practical example using an e-commerce database. Suppose we have these tables:
-- Products table
CREATE TABLE products (
product_id INT PRIMARY KEY,
product_name VARCHAR(100),
category VARCHAR(50),
price DECIMAL(10,2)
);
-- Orders table
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
product_id INT,
quantity INT,
order_date DATE,
total_amount DECIMAL(10,2)
);
The most common use of subqueries is in WHERE clauses for filtering. Let's find all products that are priced above the average:
SELECT product_name, price
FROM products
WHERE price > (
SELECT AVG(price)
FROM products
);
Here's what happens step by step:
SELECT AVG(price) FROM products runs first, calculating the average priceWHERE price > 45.67 and returns matching productsThis is much cleaner than trying to calculate the average in your application code or running multiple separate queries.
You can also use subqueries with IN to check against multiple values. Let's find all customers who have ordered products from the 'Electronics' category:
SELECT DISTINCT customer_id
FROM orders
WHERE product_id IN (
SELECT product_id
FROM products
WHERE category = 'Electronics'
);
The EXISTS operator is even more powerful—it checks whether a subquery returns any rows at all:
SELECT customer_id
FROM orders o1
WHERE EXISTS (
SELECT 1
FROM orders o2
WHERE o2.customer_id = o1.customer_id
AND o2.order_date > '2024-01-01'
);
This query finds customers who have made at least one order since January 1st, 2024. Notice how the subquery references the outer query's table (o1) - this is called a correlated subquery.
Tip: EXISTS is often more efficient than IN when dealing with large datasets, especially when the subquery might return many rows.
You can use subqueries in your SELECT clause to add calculated columns:
SELECT
product_name,
price,
(SELECT AVG(price) FROM products) as avg_price,
price - (SELECT AVG(price) FROM products) as price_difference
FROM products
ORDER BY price_difference DESC;
This shows each product with its price, the overall average price, and how much above or below average each product is priced.
Understanding the difference between these two types is crucial for writing efficient SQL.
Non-correlated subqueries are independent—they can run completely on their own and always return the same result:
-- This subquery doesn't depend on the outer query
SELECT product_name, price
FROM products
WHERE price > (SELECT AVG(price) FROM products);
Correlated subqueries reference columns from the outer query and must be evaluated for each row:
-- This subquery uses o1.customer_id from the outer query
SELECT customer_id, order_date, total_amount
FROM orders o1
WHERE total_amount > (
SELECT AVG(total_amount)
FROM orders o2
WHERE o2.customer_id = o1.customer_id
);
This finds orders where each customer spent more than their own personal average. The subquery runs once for each customer, using their specific customer_id.
Performance Warning: Correlated subqueries can be slow on large datasets because they execute once per row in the outer query. Sometimes a JOIN or CTE is more efficient.
Common Table Expressions are like creating temporary named result sets that exist only for the duration of your query. Think of them as giving names to subqueries so you can reference them multiple times and make your SQL more readable.
The basic syntax uses the WITH clause:
WITH cte_name AS (
-- Your query here
SELECT column1, column2
FROM some_table
)
SELECT *
FROM cte_name;
Let's rewrite our earlier average price example using a CTE:
WITH avg_pricing AS (
SELECT AVG(price) as average_price
FROM products
)
SELECT
p.product_name,
p.price,
ap.average_price,
p.price - ap.average_price as price_difference
FROM products p
CROSS JOIN avg_pricing ap
ORDER BY price_difference DESC;
This is much more readable than repeating the subquery multiple times, and it's more efficient too—the average is calculated only once.
The real power of CTEs becomes apparent when you chain them together. Let's build a complex analysis step by step to find our top customers by order value, but only for high-value products.
WITH high_value_products AS (
-- Step 1: Identify products in the top 25% by price
SELECT product_id, product_name, price
FROM products
WHERE price >= (
SELECT PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY price)
FROM products
)
),
customer_high_value_orders AS (
-- Step 2: Get orders for only these high-value products
SELECT
o.customer_id,
o.order_id,
o.total_amount,
hvp.product_name
FROM orders o
JOIN high_value_products hvp ON o.product_id = hvp.product_id
),
customer_totals AS (
-- Step 3: Calculate total spending per customer on high-value items
SELECT
customer_id,
COUNT(*) as high_value_orders,
SUM(total_amount) as total_high_value_spending,
AVG(total_amount) as avg_order_value
FROM customer_high_value_orders
GROUP BY customer_id
)
-- Step 4: Final results with ranking
SELECT
customer_id,
high_value_orders,
total_high_value_spending,
avg_order_value,
RANK() OVER (ORDER BY total_high_value_spending DESC) as spending_rank
FROM customer_totals
WHERE high_value_orders >= 3 -- At least 3 high-value orders
ORDER BY total_high_value_spending DESC;
Each CTE builds on the previous one, creating a clear logical flow:
This approach is far more readable and maintainable than trying to write this as nested subqueries or one massive query.
Recursive CTEs are a special type that can reference themselves, useful for hierarchical data like organizational charts or category trees. Here's the syntax:
WITH RECURSIVE hierarchy_name AS (
-- Base case: starting point
SELECT employee_id, manager_id, employee_name, 1 as level
FROM employees
WHERE manager_id IS NULL -- Top-level managers
UNION ALL
-- Recursive case: find next level
SELECT e.employee_id, e.manager_id, e.employee_name, h.level + 1
FROM employees e
JOIN hierarchy_name h ON e.manager_id = h.employee_id
)
SELECT * FROM hierarchy_name
ORDER BY level, employee_name;
This builds a complete organizational hierarchy, showing each employee's level in the company structure.
Choosing the right approach depends on your specific situation:
Use subqueries when:
Use CTEs when:
Use JOINs when:
Let's see a performance comparison. These queries do the same thing, but one is much faster:
-- Slower: Correlated subquery
SELECT customer_id, order_date, total_amount
FROM orders o1
WHERE total_amount > (
SELECT AVG(total_amount)
FROM orders o2
WHERE o2.customer_id = o1.customer_id
);
-- Faster: CTE with window function
WITH customer_averages AS (
SELECT
customer_id,
order_date,
total_amount,
AVG(total_amount) OVER (PARTITION BY customer_id) as customer_avg
FROM orders
)
SELECT customer_id, order_date, total_amount
FROM customer_averages
WHERE total_amount > customer_avg;
The CTE version calculates each customer's average once using a window function, while the correlated subquery recalculates it for every row.
Let's put your new skills to work with a realistic scenario. You're analyzing an online bookstore database with these tables:
-- Books table
CREATE TABLE books (
book_id INT PRIMARY KEY,
title VARCHAR(200),
author VARCHAR(100),
genre VARCHAR(50),
price DECIMAL(8,2),
publication_year INT
);
-- Sales table
CREATE TABLE sales (
sale_id INT PRIMARY KEY,
book_id INT,
customer_id INT,
sale_date DATE,
quantity INT,
total_amount DECIMAL(10,2)
);
-- Sample data
INSERT INTO books VALUES
(1, 'The Data Revolution', 'Sarah Chen', 'Technology', 29.99, 2023),
(2, 'SQL Mastery', 'Mike Rodriguez', 'Technology', 34.99, 2022),
(3, 'Mystery in the Database', 'Jane Smith', 'Mystery', 19.99, 2023),
(4, 'Python for Everyone', 'Alex Johnson', 'Technology', 39.99, 2021),
(5, 'The Analytics Mind', 'Dr. Lisa Park', 'Business', 27.99, 2023);
INSERT INTO sales VALUES
(1, 1, 101, '2024-01-15', 2, 59.98),
(2, 2, 102, '2024-01-16', 1, 34.99),
(3, 1, 103, '2024-01-17', 1, 29.99),
(4, 3, 101, '2024-01-18', 3, 59.97),
(5, 4, 104, '2024-01-19', 1, 39.99);
Your Challenge: Write a query that finds:
Solution Approach:
Try to solve this yourself first, then compare with this solution:
WITH sales_metrics AS (
-- Calculate overall averages
SELECT
AVG(quantity) as overall_avg_quantity,
AVG(total_amount) as overall_avg_sale_amount
FROM sales
),
book_performance AS (
-- Get performance metrics per book
SELECT
b.book_id,
b.title,
b.author,
b.genre,
b.price,
b.publication_year,
SUM(s.quantity) as total_quantity_sold,
COUNT(s.sale_id) as number_of_sales,
SUM(s.total_amount) as total_revenue,
AVG(s.quantity) as avg_quantity_per_sale,
AVG(s.total_amount) as avg_sale_amount
FROM books b
JOIN sales s ON b.book_id = s.book_id
WHERE b.publication_year > 2021
GROUP BY b.book_id, b.title, b.author, b.genre, b.price, b.publication_year
)
SELECT
bp.title,
bp.author,
bp.genre,
bp.total_quantity_sold,
bp.total_revenue,
bp.avg_quantity_per_sale,
sm.overall_avg_quantity,
bp.avg_sale_amount,
sm.overall_avg_sale_amount,
bp.avg_sale_amount - sm.overall_avg_sale_amount as price_difference
FROM book_performance bp
CROSS JOIN sales_metrics sm
WHERE bp.avg_quantity_per_sale > sm.overall_avg_quantity
ORDER BY bp.total_revenue DESC;
This solution demonstrates several key concepts:
Mistake 1: Using subqueries when CTEs would be clearer
Bad:
SELECT *
FROM (
SELECT customer_id, AVG(total_amount) as avg_amount
FROM orders
GROUP BY customer_id
) customer_avg
WHERE avg_amount > (
SELECT AVG(total_amount)
FROM orders
);
Better:
WITH customer_averages AS (
SELECT customer_id, AVG(total_amount) as avg_amount
FROM orders
GROUP BY customer_id
),
overall_average AS (
SELECT AVG(total_amount) as overall_avg
FROM orders
)
SELECT ca.customer_id, ca.avg_amount
FROM customer_averages ca
CROSS JOIN overall_average oa
WHERE ca.avg_amount > oa.overall_avg;
Mistake 2: Forgetting that subqueries must return single values in comparisons
This will cause an error if the subquery returns multiple rows:
SELECT product_name
FROM products
WHERE price = (
SELECT price
FROM products
WHERE category = 'Electronics' -- Multiple products!
);
Use IN instead:
SELECT product_name
FROM products
WHERE price IN (
SELECT price
FROM products
WHERE category = 'Electronics'
);
Mistake 3: Performance issues with correlated subqueries
If your query is running slowly, check if you're using correlated subqueries that could be rewritten as JOINs or CTEs with window functions.
Debugging Tip: Test your CTEs independently by running just the CTE portion with a simple SELECT * to verify the data before building your final query.
Mistake 4: Not considering NULL values
Subqueries with IN don't behave as expected when NULL values are involved:
-- This might not return what you expect if any price is NULL
SELECT product_name
FROM products
WHERE price NOT IN (
SELECT price
FROM discontinued_products
);
Use EXISTS for safer NULL handling:
SELECT product_name
FROM products p
WHERE NOT EXISTS (
SELECT 1
FROM discontinued_products dp
WHERE dp.price = p.price
);
You've now learned how to use subqueries and CTEs to break complex SQL problems into manageable pieces. Subqueries let you embed queries within queries for filtering and calculations, while CTEs provide a clean way to build multi-step analyses with readable, reusable components.
Key takeaways:
Your next steps:
The ability to think in layers—solving one piece at a time—will transform how you approach complex data problems. You're no longer limited to what you can express in a single query; you can now build sophisticated analyses that are both powerful and maintainable.
Learning Path: SQL Fundamentals