
You're staring at a complex data problem that makes your brain hurt. Your sales manager wants a report showing customers who spent more than the average order amount, but only for customers in the top-performing regions. Your first instinct might be to create multiple separate queries, export results to Excel, and manually piece everything together. But there's a much more elegant way.
Subqueries and Common Table Expressions (CTEs) are SQL's power tools for breaking down complex problems into manageable pieces—all within a single query. Instead of juggling multiple result sets, you'll learn to build sophisticated queries that read like well-organized thoughts, making your analysis both more powerful and easier to understand.
By mastering these techniques, you'll transform how you approach data analysis. Complex multi-step problems that once required multiple queries or spreadsheet gymnastics become clean, readable SQL that you can write, debug, and maintain with confidence.
What you'll learn:
You should be comfortable with basic SQL operations including SELECT statements, WHERE clauses, JOINs, and aggregate functions like COUNT, SUM, and AVG. If you can write a query that joins two tables and groups results, you're ready for this lesson.
A subquery is simply a query inside another query. Think of it like a function in programming—it performs a specific calculation or retrieval that feeds into your main query. The key insight is that any place you can put a value in SQL, you can instead put a query that returns that value.
Let's start with a concrete scenario. Imagine you manage an e-commerce platform with three tables:
-- Customers table
CREATE TABLE customers (
customer_id INT,
customer_name VARCHAR(100),
region VARCHAR(50),
signup_date DATE
);
-- Orders table
CREATE TABLE orders (
order_id INT,
customer_id INT,
order_date DATE,
total_amount DECIMAL(10,2)
);
-- Order_items table
CREATE TABLE order_items (
order_id INT,
product_id INT,
quantity INT,
unit_price DECIMAL(10,2)
);
Your first business question: "Show me all customers who have placed orders worth more than $500." Without subqueries, you might think you need two separate steps. But watch how a subquery handles this elegantly:
SELECT customer_name, region
FROM customers
WHERE customer_id IN (
SELECT customer_id
FROM orders
WHERE total_amount > 500
);
The subquery (SELECT customer_id FROM orders WHERE total_amount > 500) runs first, returning a list of customer IDs. Then the main query uses that list to filter customers. The IN operator checks if each customer's ID appears anywhere in that subquery result.
Subqueries aren't limited to WHERE clauses. They can appear in three main contexts, each serving different purposes:
A scalar subquery returns exactly one value, which you can use like any other column:
SELECT
customer_name,
region,
(SELECT COUNT(*)
FROM orders
WHERE orders.customer_id = customers.customer_id) AS total_orders,
(SELECT AVG(total_amount)
FROM orders
WHERE orders.customer_id = customers.customer_id) AS avg_order_value
FROM customers;
This query shows each customer alongside their order statistics. Notice how the subqueries reference customers.customer_id from the outer query—these are called correlated subqueries because they depend on the current row being processed.
You can use a subquery as if it were a table:
SELECT
avg_order_by_region.region,
avg_order_by_region.avg_amount,
COUNT(*) as customer_count
FROM customers
JOIN (
SELECT
c.region,
AVG(o.total_amount) as avg_amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.region
) AS avg_order_by_region ON customers.region = avg_order_by_region.region
GROUP BY avg_order_by_region.region, avg_order_by_region.avg_amount;
The subquery creates a temporary result set showing average order amounts by region, which we then join back to the customers table.
This is the most common use case. Beyond the IN operator we saw earlier, you can use subqueries with comparison operators:
-- Customers with above-average order totals
SELECT customer_name
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.total_amount > (
SELECT AVG(total_amount)
FROM orders
);
Understanding the difference between these two types is crucial for writing efficient queries.
Non-correlated subqueries run once and return a result that the main query uses. They're independent of the outer query:
-- Find customers in regions with above-average sales
SELECT customer_name, region
FROM customers
WHERE region IN (
SELECT region
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY region
HAVING AVG(o.total_amount) > 1000
);
The subquery calculates which regions have high average sales, then the main query filters customers based on that list.
Correlated subqueries reference columns from the outer query and execute once for each row:
-- Find customers whose most recent order was above their personal average
SELECT customer_name
FROM customers c
WHERE (
SELECT MAX(total_amount)
FROM orders o1
WHERE o1.customer_id = c.customer_id
) > (
SELECT AVG(total_amount)
FROM orders o2
WHERE o2.customer_id = c.customer_id
);
For each customer, this query runs both subqueries to compare their highest order amount against their personal average.
Performance Tip: Non-correlated subqueries are generally faster because they run only once. Correlated subqueries can be expensive on large datasets since they execute for every row.
As queries become more complex, nested subqueries can become difficult to read and debug. Common Table Expressions offer a cleaner approach—think of them as named, temporary result sets that exist only for the duration of your query.
Here's the basic syntax:
WITH cte_name AS (
SELECT column1, column2
FROM some_table
WHERE some_condition
)
SELECT *
FROM cte_name
WHERE another_condition;
Let's refactor our earlier complex query using a CTE:
WITH regional_averages AS (
SELECT
c.region,
AVG(o.total_amount) as avg_amount,
COUNT(o.order_id) as total_orders
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.region
),
high_performing_regions AS (
SELECT region
FROM regional_averages
WHERE avg_amount > 1000 AND total_orders > 50
)
SELECT
c.customer_name,
c.region,
COUNT(o.order_id) as customer_orders
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.region IN (SELECT region FROM high_performing_regions)
GROUP BY c.customer_name, c.region
ORDER BY customer_orders DESC;
This query breaks down a complex problem into logical steps:
You can chain multiple CTEs together, with each one building on the previous results. This approach makes complex business logic much more readable:
WITH monthly_sales AS (
SELECT
DATE_TRUNC('month', order_date) as month,
SUM(total_amount) as monthly_revenue
FROM orders
WHERE order_date >= '2023-01-01'
GROUP BY DATE_TRUNC('month', order_date)
),
sales_with_growth AS (
SELECT
month,
monthly_revenue,
LAG(monthly_revenue) OVER (ORDER BY month) as prev_month_revenue,
monthly_revenue - LAG(monthly_revenue) OVER (ORDER BY month) as growth_amount
FROM monthly_sales
),
growth_categories AS (
SELECT
month,
monthly_revenue,
growth_amount,
CASE
WHEN growth_amount > 10000 THEN 'High Growth'
WHEN growth_amount > 0 THEN 'Moderate Growth'
WHEN growth_amount < -5000 THEN 'Significant Decline'
ELSE 'Stable'
END as growth_category
FROM sales_with_growth
WHERE prev_month_revenue IS NOT NULL
)
SELECT
growth_category,
COUNT(*) as months_count,
AVG(monthly_revenue) as avg_monthly_revenue,
AVG(growth_amount) as avg_growth_amount
FROM growth_categories
GROUP BY growth_category
ORDER BY avg_growth_amount DESC;
Each CTE has a single, clear purpose:
monthly_sales: Aggregate orders by monthsales_with_growth: Calculate month-over-month changesgrowth_categories: Classify growth patternsSome business problems involve hierarchical data—organization charts, category trees, or geographic regions. Recursive CTEs handle these elegantly:
-- Assume we have an employee hierarchy table
WITH RECURSIVE employee_hierarchy AS (
-- Base case: find the CEO (employee with no manager)
SELECT
employee_id,
employee_name,
manager_id,
1 as level,
CAST(employee_name AS VARCHAR(1000)) as hierarchy_path
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive case: find direct reports
SELECT
e.employee_id,
e.employee_name,
e.manager_id,
eh.level + 1,
CONCAT(eh.hierarchy_path, ' -> ', e.employee_name)
FROM employees e
JOIN employee_hierarchy eh ON e.manager_id = eh.employee_id
)
SELECT
employee_name,
level,
hierarchy_path
FROM employee_hierarchy
ORDER BY level, employee_name;
The recursive CTE starts with employees who have no manager (the base case), then repeatedly finds their direct reports until no more employees remain.
Both tools solve similar problems, but they have different strengths:
Use subqueries when:
Use CTEs when:
Consider this comparison for finding customers with above-average spending:
-- Subquery approach
SELECT customer_name
FROM customers c
WHERE (
SELECT AVG(total_amount)
FROM orders o
WHERE o.customer_id = c.customer_id
) > (
SELECT AVG(total_amount)
FROM orders
);
-- CTE approach
WITH customer_averages AS (
SELECT
c.customer_id,
c.customer_name,
AVG(o.total_amount) as customer_avg
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name
),
overall_average AS (
SELECT AVG(total_amount) as overall_avg
FROM orders
)
SELECT customer_name
FROM customer_averages ca
CROSS JOIN overall_average oa
WHERE ca.customer_avg > oa.overall_avg;
The CTE version is longer but much clearer about what's being calculated.
Let's work through a realistic business scenario. You're analyzing a subscription service with these tables:
-- Subscriptions table
CREATE TABLE subscriptions (
subscription_id INT,
customer_id INT,
plan_type VARCHAR(20),
start_date DATE,
end_date DATE,
monthly_fee DECIMAL(8,2)
);
-- Payments table
CREATE TABLE payments (
payment_id INT,
subscription_id INT,
payment_date DATE,
amount DECIMAL(8,2),
payment_status VARCHAR(20)
);
Challenge: Create a query that identifies customers who meet all these criteria:
Try building this step by step with CTEs:
WITH active_premium_customers AS (
SELECT
customer_id,
subscription_id,
monthly_fee
FROM subscriptions
WHERE plan_type = 'Premium'
AND end_date IS NULL
),
recent_successful_payments AS (
SELECT
p.subscription_id,
p.customer_id,
COUNT(*) as payment_count,
AVG(p.amount) as avg_payment_amount
FROM payments p
JOIN active_premium_customers apc ON p.subscription_id = apc.subscription_id
WHERE p.payment_status = 'Successful'
AND p.payment_date >= CURRENT_DATE - INTERVAL '6 months'
GROUP BY p.subscription_id, p.customer_id
HAVING COUNT(*) >= 3
),
premium_payment_average AS (
SELECT AVG(p.amount) as overall_premium_avg
FROM payments p
JOIN subscriptions s ON p.subscription_id = s.subscription_id
WHERE s.plan_type = 'Premium'
AND p.payment_status = 'Successful'
)
SELECT
rsp.customer_id,
rsp.payment_count,
rsp.avg_payment_amount,
ppa.overall_premium_avg
FROM recent_successful_payments rsp
CROSS JOIN premium_payment_average ppa
WHERE rsp.avg_payment_amount > ppa.overall_premium_avg
ORDER BY rsp.avg_payment_amount DESC;
Work through each CTE to understand how it builds toward the final result. Notice how breaking the problem into logical steps makes both the writing and debugging process much more manageable.
Mistake 1: Subquery returns multiple values when one is expected
-- This will fail if multiple customers have orders over $1000
SELECT customer_name
FROM customers
WHERE customer_id = (
SELECT customer_id
FROM orders
WHERE total_amount > 1000
);
Fix: Use IN instead of =, or add LIMIT 1 if you truly want just one result.
Mistake 2: Correlated subqueries causing performance issues
-- This runs the subquery for every customer row
SELECT customer_name,
(SELECT COUNT(*) FROM orders WHERE customer_id = c.customer_id)
FROM customers c;
Fix: Convert to a JOIN or CTE when possible:
WITH order_counts AS (
SELECT customer_id, COUNT(*) as order_count
FROM orders
GROUP BY customer_id
)
SELECT c.customer_name, COALESCE(oc.order_count, 0)
FROM customers c
LEFT JOIN order_counts oc ON c.customer_id = oc.customer_id;
Mistake 3: Forgetting to handle NULL values in subqueries
-- If a customer has no orders, this comparison might not work as expected
SELECT customer_name
FROM customers c
WHERE (SELECT AVG(total_amount) FROM orders WHERE customer_id = c.customer_id) > 500;
Fix: Handle NULLs explicitly:
SELECT customer_name
FROM customers c
WHERE COALESCE((SELECT AVG(total_amount) FROM orders WHERE customer_id = c.customer_id), 0) > 500;
Mistake 4: CTE naming conflicts
-- 'customers' CTE shadows the actual customers table
WITH customers AS (
SELECT customer_id FROM orders GROUP BY customer_id
)
SELECT * FROM customers -- Which customers table?
JOIN orders ON customers.customer_id = orders.customer_id;
Fix: Use descriptive, unique names for your CTEs.
Debugging Tip: When debugging complex queries with multiple CTEs, run each CTE individually as a standalone query to verify it returns what you expect.
You've learned to break down complex data problems into manageable pieces using subqueries and CTEs. Subqueries excel at simple, focused calculations that feed into larger queries, while CTEs shine when you need to build complex logic step-by-step with readable, maintainable code.
The key insight is recognizing when your analysis involves multiple logical steps. Instead of trying to cram everything into a single, complex query or splitting work across multiple tools, you can now structure your thinking within SQL itself.
Key techniques you've mastered:
Next steps in your SQL journey:
The combination of subqueries and CTEs transforms SQL from a simple data retrieval language into a powerful analytical toolkit. With these techniques, you're ready to tackle the complex, multi-step analyses that drive real business decisions.
Learning Path: SQL Fundamentals