You're analyzing customer behavior for an e-commerce platform when your manager drops this question: "Which products purchased in the last quarter had above-average profit margins, and what percentage of our top 20% customers bought them?" Your first instinct might be to export data to Excel or write multiple queries. But there's a better way.
This challenge perfectly illustrates why subqueries and Common Table Expressions (CTEs) are essential SQL skills. They let you break complex business questions into logical steps, create reusable query components, and write SQL that actually makes sense to your future self and teammates.
By the end of this lesson, you'll transform unwieldy, nested queries into clear, maintainable SQL that tackles multi-step analytics problems with confidence.
What you'll learn:
You should be comfortable with basic SQL operations (SELECT, WHERE, JOIN, GROUP BY) and understand aggregate functions. Familiarity with window functions will help but isn't required.
A subquery is simply a query nested inside another query. Think of it as asking SQL to answer a question, then using that answer to filter or enrich your main query. Let's start with a realistic scenario.
Imagine you're working with this customer orders dataset:
-- Sample data structure we'll work with
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
customer_name VARCHAR(100),
registration_date DATE,
tier VARCHAR(20) -- 'bronze', 'silver', 'gold'
);
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
order_date DATE,
total_amount DECIMAL(10,2),
product_category VARCHAR(50)
);
CREATE TABLE order_items (
item_id INT PRIMARY KEY,
order_id INT,
product_id INT,
quantity INT,
unit_price DECIMAL(10,2),
cost_price DECIMAL(10,2)
);
The most common use of subqueries is in WHERE clauses to filter based on calculated values. Let's find customers who spent more than the average order amount:
SELECT customer_id, customer_name, tier
FROM customers
WHERE customer_id IN (
SELECT customer_id
FROM orders
WHERE total_amount > (
SELECT AVG(total_amount)
FROM orders
)
);
This query works inside-out. SQL first calculates the average order amount, then finds orders exceeding that average, then returns customers who placed those orders. But there's a more efficient way to write this:
-- More efficient version using EXISTS
SELECT c.customer_id, c.customer_name, c.tier
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.total_amount > (SELECT AVG(total_amount) FROM orders)
);
Performance Tip: Use EXISTS instead of IN when checking for relationships. EXISTS stops searching once it finds one matching record, while IN must retrieve all matches.
Understanding this distinction is crucial for performance and logic:
Uncorrelated subqueries run once and return the same result regardless of the outer query:
-- Find products with above-average profit margins
SELECT product_id, (unit_price - cost_price) as profit
FROM order_items
WHERE (unit_price - cost_price) > (
SELECT AVG(unit_price - cost_price)
FROM order_items
);
Correlated subqueries reference the outer query and run once for each row:
-- Find each customer's orders that exceed their personal average
SELECT o.order_id, o.customer_id, o.total_amount
FROM orders o
WHERE o.total_amount > (
SELECT AVG(o2.total_amount)
FROM orders o2
WHERE o2.customer_id = o.customer_id
);
The correlated version is powerful but potentially expensive—it runs the subquery for every row in the outer query.
You can use subqueries to add calculated columns:
SELECT
c.customer_name,
c.tier,
(SELECT COUNT(*)
FROM orders o
WHERE o.customer_id = c.customer_id) as total_orders,
(SELECT AVG(o.total_amount)
FROM orders o
WHERE o.customer_id = c.customer_id) as avg_order_value,
(SELECT MAX(o.order_date)
FROM orders o
WHERE o.customer_id = c.customer_id) as last_order_date
FROM customers c
WHERE c.tier = 'gold';
While this works, it's inefficient—we're running three subqueries for each customer. This is exactly where CTEs shine.
CTEs are like creating temporary, named result sets that exist only for the duration of your query. Think of them as variables you can reference multiple times. They make complex queries readable and often more performant.
Here's the basic pattern:
WITH cte_name AS (
-- Your query here
SELECT column1, column2
FROM table_name
WHERE condition
)
SELECT *
FROM cte_name
WHERE another_condition;
Let's rewrite our customer analysis using a CTE:
WITH customer_metrics AS (
SELECT
c.customer_id,
c.customer_name,
c.tier,
COUNT(o.order_id) as total_orders,
AVG(o.total_amount) as avg_order_value,
MAX(o.order_date) as last_order_date,
SUM(o.total_amount) as total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name, c.tier
)
SELECT *
FROM customer_metrics
WHERE tier = 'gold'
AND total_orders >= 5;
This is cleaner, more performant (single scan of the orders table), and easier to debug.
For really complex analysis, you can chain multiple CTEs:
WITH quarterly_sales AS (
SELECT
DATE_TRUNC('quarter', order_date) as quarter,
product_category,
SUM(total_amount) as category_revenue
FROM orders
WHERE order_date >= '2023-01-01'
GROUP BY DATE_TRUNC('quarter', order_date), product_category
),
category_averages AS (
SELECT
quarter,
AVG(category_revenue) as avg_category_revenue
FROM quarterly_sales
GROUP BY quarter
),
top_categories AS (
SELECT
qs.quarter,
qs.product_category,
qs.category_revenue,
ca.avg_category_revenue,
(qs.category_revenue - ca.avg_category_revenue) as revenue_vs_average
FROM quarterly_sales qs
JOIN category_averages ca ON qs.quarter = ca.quarter
WHERE qs.category_revenue > ca.avg_category_revenue
)
SELECT
quarter,
product_category,
category_revenue,
ROUND(revenue_vs_average, 2) as above_average_by,
RANK() OVER (PARTITION BY quarter ORDER BY category_revenue DESC) as category_rank
FROM top_categories
ORDER BY quarter, category_rank;
This query answers: "Which product categories performed above average each quarter, and how did they rank?" Each CTE handles one logical step, making the overall logic crystal clear.
CTEs pair beautifully with window functions for sophisticated analytics:
WITH customer_order_analysis AS (
SELECT
c.customer_id,
c.customer_name,
o.order_id,
o.order_date,
o.total_amount,
ROW_NUMBER() OVER (PARTITION BY c.customer_id ORDER BY o.order_date) as order_sequence,
LAG(o.order_date) OVER (PARTITION BY c.customer_id ORDER BY o.order_date) as previous_order_date,
AVG(o.total_amount) OVER (PARTITION BY c.customer_id) as customer_avg_order
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
),
customer_behavior AS (
SELECT
customer_id,
customer_name,
order_id,
order_date,
total_amount,
order_sequence,
CASE
WHEN previous_order_date IS NULL THEN NULL
ELSE order_date - previous_order_date
END as days_between_orders,
CASE
WHEN total_amount > customer_avg_order THEN 'Above Average'
ELSE 'Below Average'
END as order_size_category
FROM customer_order_analysis
)
SELECT
customer_name,
COUNT(*) as total_orders,
AVG(days_between_orders) as avg_days_between_orders,
COUNT(CASE WHEN order_size_category = 'Above Average' THEN 1 END) as large_orders,
MAX(order_sequence) as order_frequency_rank
FROM customer_behavior
GROUP BY customer_id, customer_name
HAVING COUNT(*) >= 3
ORDER BY avg_days_between_orders;
This analysis reveals customer purchasing patterns: frequency, consistency, and order size tendencies.
Recursive CTEs solve problems involving hierarchical or graph-like data. Here's a practical example with an employee hierarchy:
-- Employee hierarchy table
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
employee_name VARCHAR(100),
manager_id INT,
department VARCHAR(50),
salary DECIMAL(10,2)
);
-- Find all reports under a specific manager (direct and indirect)
WITH RECURSIVE employee_hierarchy AS (
-- Anchor query: start with the top manager
SELECT
employee_id,
employee_name,
manager_id,
department,
salary,
0 as level,
CAST(employee_name AS VARCHAR(500)) as hierarchy_path
FROM employees
WHERE manager_id IS NULL OR employee_id = 100 -- Start with CEO or specific manager
UNION ALL
-- Recursive query: find direct reports
SELECT
e.employee_id,
e.employee_name,
e.manager_id,
e.department,
e.salary,
eh.level + 1,
CAST(eh.hierarchy_path || ' -> ' || e.employee_name AS VARCHAR(500))
FROM employees e
JOIN employee_hierarchy eh ON e.manager_id = eh.employee_id
)
SELECT
employee_id,
REPEAT(' ', level) || employee_name as indented_name,
department,
salary,
level,
hierarchy_path
FROM employee_hierarchy
ORDER BY hierarchy_path;
Here's a sophisticated analysis combining multiple analytical patterns:
WITH monthly_cohorts AS (
-- Group customers by their registration month
SELECT
customer_id,
DATE_TRUNC('month', registration_date) as cohort_month
FROM customers
),
customer_monthly_orders AS (
-- Get each customer's monthly order activity
SELECT
o.customer_id,
mc.cohort_month,
DATE_TRUNC('month', o.order_date) as order_month,
COUNT(o.order_id) as orders_count,
SUM(o.total_amount) as monthly_revenue
FROM orders o
JOIN monthly_cohorts mc ON o.customer_id = mc.customer_id
GROUP BY o.customer_id, mc.cohort_month, DATE_TRUNC('month', o.order_date)
),
cohort_analysis AS (
-- Calculate months since registration for each order month
SELECT
cohort_month,
customer_id,
order_month,
EXTRACT(YEAR FROM AGE(order_month, cohort_month)) * 12 +
EXTRACT(MONTH FROM AGE(order_month, cohort_month)) as months_since_registration,
orders_count,
monthly_revenue
FROM customer_monthly_orders
),
cohort_retention AS (
-- Summarize retention and revenue by cohort and month
SELECT
cohort_month,
months_since_registration,
COUNT(DISTINCT customer_id) as active_customers,
SUM(monthly_revenue) as cohort_revenue,
AVG(monthly_revenue) as avg_customer_revenue
FROM cohort_analysis
GROUP BY cohort_month, months_since_registration
),
cohort_sizes AS (
-- Get initial cohort sizes
SELECT
cohort_month,
COUNT(*) as cohort_size
FROM monthly_cohorts
GROUP BY cohort_month
)
SELECT
cr.cohort_month,
cr.months_since_registration,
cr.active_customers,
cs.cohort_size,
ROUND(100.0 * cr.active_customers / cs.cohort_size, 2) as retention_rate,
cr.cohort_revenue,
ROUND(cr.avg_customer_revenue, 2) as avg_customer_revenue
FROM cohort_retention cr
JOIN cohort_sizes cs ON cr.cohort_month = cs.cohort_month
WHERE cr.cohort_month >= '2023-01-01'
ORDER BY cr.cohort_month, cr.months_since_registration;
This cohort analysis shows how customer retention and spending patterns evolve over time—critical for subscription businesses and customer lifetime value calculations.
Use subqueries when:
-- Good use of subquery: simple filter
SELECT product_id, unit_price
FROM order_items
WHERE unit_price > (SELECT AVG(unit_price) FROM order_items);
Use JOINs when:
-- Better as JOIN than subquery
SELECT c.customer_name, o.order_date, o.total_amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.tier = 'gold';
Use CTEs when:
CTEs aren't always materialized (stored temporarily). Some databases optimize them as views. For large datasets, consider these patterns:
-- Instead of this expensive pattern:
WITH all_customer_data AS (
SELECT c.*, o.*, oi.* -- Selecting everything
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
)
SELECT customer_name, SUM(quantity)
FROM all_customer_data
GROUP BY customer_name;
-- Do this:
WITH customer_quantities AS (
SELECT c.customer_name, oi.quantity -- Only what you need
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
)
SELECT customer_name, SUM(quantity)
FROM customer_quantities
GROUP BY customer_name;
Performance Rule: Always select only the columns you actually need in CTEs. This reduces memory usage and can significantly improve query performance.
Let's put everything together with a real-world scenario. You're building an analytics dashboard for an e-commerce platform that needs to answer these questions:
Here's your solution:
WITH quarterly_product_performance AS (
-- Calculate product performance metrics for current quarter
SELECT
oi.product_id,
COUNT(DISTINCT o.order_id) as total_orders,
SUM(oi.quantity) as units_sold,
SUM(oi.quantity * oi.unit_price) as gross_revenue,
SUM(oi.quantity * oi.cost_price) as total_cost,
SUM(oi.quantity * (oi.unit_price - oi.cost_price)) as total_profit,
ROUND(
AVG(100.0 * (oi.unit_price - oi.cost_price) / oi.unit_price),
2
) as avg_margin_percent,
o.product_category
FROM order_items oi
JOIN orders o ON oi.order_id = o.order_id
WHERE o.order_date >= DATE_TRUNC('quarter', CURRENT_DATE)
GROUP BY oi.product_id, o.product_category
),
high_value_customers AS (
-- Identify top 20% of customers by total spend
SELECT
customer_id,
SUM(total_amount) as total_spent,
NTILE(5) OVER (ORDER BY SUM(total_amount) DESC) as spending_quintile
FROM orders
WHERE order_date >= DATE_TRUNC('year', CURRENT_DATE)
GROUP BY customer_id
HAVING NTILE(5) OVER (ORDER BY SUM(total_amount) DESC) = 1
),
category_benchmarks AS (
-- Calculate category average margins for comparison
SELECT
product_category,
AVG(avg_margin_percent) as category_avg_margin,
AVG(total_profit) as category_avg_profit
FROM quarterly_product_performance
GROUP BY product_category
),
top_profit_products AS (
-- Find products with above-category-average margins
SELECT
qpp.*,
cb.category_avg_margin,
CASE
WHEN qpp.avg_margin_percent > cb.category_avg_margin THEN 'Above Category Avg'
ELSE 'Below Category Avg'
END as margin_vs_category,
RANK() OVER (ORDER BY qpp.total_profit DESC) as profit_rank
FROM quarterly_product_performance qpp
JOIN category_benchmarks cb ON qpp.product_category = cb.product_category
WHERE qpp.avg_margin_percent > cb.category_avg_margin
),
high_value_customer_purchases AS (
-- See which high-value customers bought our top products
SELECT
tpp.product_id,
COUNT(DISTINCT hvc.customer_id) as high_value_buyers,
COUNT(DISTINCT o.customer_id) as total_buyers
FROM top_profit_products tpp
JOIN orders o ON tpp.product_id IN (
SELECT oi.product_id
FROM order_items oi
WHERE oi.order_id = o.order_id
)
LEFT JOIN high_value_customers hvc ON o.customer_id = hvc.customer_id
WHERE o.order_date >= DATE_TRUNC('quarter', CURRENT_DATE)
GROUP BY tpp.product_id
)
-- Final dashboard query
SELECT
tpp.product_id,
tpp.product_category,
tpp.units_sold,
tpp.gross_revenue,
tpp.total_profit,
tpp.avg_margin_percent,
tpp.category_avg_margin,
tpp.margin_vs_category,
tpp.profit_rank,
hvcp.high_value_buyers,
hvcp.total_buyers,
ROUND(
100.0 * COALESCE(hvcp.high_value_buyers, 0) / hvcp.total_buyers,
2
) as high_value_customer_penetration
FROM top_profit_products tpp
LEFT JOIN high_value_customer_purchases hvcp ON tpp.product_id = hvcp.product_id
WHERE tpp.profit_rank <= 20
ORDER BY tpp.profit_rank;
Try modifying this query to:
-- Slow: runs subquery for each row
SELECT customer_name,
(SELECT COUNT(*) FROM orders WHERE customer_id = c.customer_id) as order_count
FROM customers c;
-- Fast: use JOINs or CTEs instead
WITH customer_order_counts AS (
SELECT customer_id, COUNT(*) as order_count
FROM orders
GROUP BY customer_id
)
SELECT c.customer_name, COALESCE(coc.order_count, 0) as order_count
FROM customers c
LEFT JOIN customer_order_counts coc ON c.customer_id = coc.customer_id;
-- Problem: returns no results if subquery returns NULL
SELECT * FROM orders WHERE total_amount > (
SELECT AVG(total_amount) FROM orders WHERE customer_id = 999 -- Non-existent customer
);
-- Solution: always consider NULL cases
SELECT * FROM orders WHERE total_amount > (
SELECT COALESCE(AVG(total_amount), 0) FROM orders WHERE customer_id = 999
);
-- Confusing: reusing table names
WITH orders AS ( -- Don't name CTE same as existing table
SELECT * FROM orders WHERE order_date >= '2023-01-01'
)
SELECT * FROM orders; -- Which orders table?
-- Clear: descriptive CTE names
WITH recent_orders AS (
SELECT * FROM orders WHERE order_date >= '2023-01-01'
)
SELECT * FROM recent_orders;
When your multi-CTE query isn't working:
-- Debug by testing intermediate steps
WITH step1 AS (
SELECT customer_id, COUNT(*) as order_count
FROM orders
GROUP BY customer_id
)
-- SELECT * FROM step1 LIMIT 10; -- Uncomment to debug
, step2 AS (
SELECT customer_id,
CASE WHEN order_count >= 5 THEN 'frequent' ELSE 'occasional' END as customer_type
FROM step1
)
-- SELECT customer_type, COUNT(*) FROM step2 GROUP BY customer_type; -- Debug counts
SELECT * FROM step2;
You've now mastered the essential tools for complex SQL analysis. Subqueries give you the power to filter and enrich data based on calculated conditions, while CTEs provide the structure to break down complex business questions into logical, readable steps.
Key takeaways:
Practice opportunities:
Next learning steps:
The patterns you've learned here scale from simple analytics queries to complex data warehouse transformations. Practice building queries that tell complete business stories, not just answer individual questions.
Learning Path: SQL Fundamentals