You're analyzing sales performance across multiple regions, and your manager asks a seemingly simple question: "Which sales reps exceeded their regional average by more than 20%?" Your first instinct might be to write multiple queries, export to Excel, and manually calculate the comparisons. But there's a better way—one that keeps all the logic in SQL and produces results you can trust and reproduce.
Subqueries and Common Table Expressions (CTEs) are the tools that separate casual SQL users from true data professionals. They let you break complex problems into logical steps, reuse intermediate calculations, and write queries that read like the business logic they represent. By the end of this lesson, you'll be writing sophisticated multi-step analyses that would have seemed impossible with basic SELECT statements.
What you'll learn:
You should be comfortable with basic SELECT statements, JOINs, aggregate functions, and window functions. If you're unsure about any of these, review those fundamentals before diving into subqueries and CTEs.
A subquery is simply a query nested inside another query. Think of it as asking a question that depends on the answer to another question. Let's start with our sales scenario using a realistic dataset.
-- Sample data structure we'll work with
CREATE TABLE sales_reps (
rep_id INT PRIMARY KEY,
rep_name VARCHAR(100),
region VARCHAR(50),
hire_date DATE
);
CREATE TABLE sales_transactions (
transaction_id INT PRIMARY KEY,
rep_id INT,
customer_id INT,
sale_amount DECIMAL(10,2),
sale_date DATE,
product_category VARCHAR(50)
);
Non-correlated subqueries execute independently of the outer query. They run once and return a result that the outer query uses. Here's how to find sales reps who generated more revenue than the company average:
SELECT
sr.rep_name,
sr.region,
SUM(st.sale_amount) as total_sales
FROM sales_reps sr
JOIN sales_transactions st ON sr.rep_id = st.rep_id
GROUP BY sr.rep_id, sr.rep_name, sr.region
HAVING SUM(st.sale_amount) > (
SELECT AVG(rep_total)
FROM (
SELECT SUM(sale_amount) as rep_total
FROM sales_transactions
GROUP BY rep_id
) rep_totals
);
Notice how the subquery calculates the average total sales across all reps, then the outer query uses that single value to filter results. The subquery runs once, regardless of how many rows are in the outer query.
Correlated subqueries reference columns from the outer query, creating a dependent relationship. They execute once for each row processed by the outer query. Here's how to find reps who exceed their regional average:
SELECT
sr.rep_name,
sr.region,
SUM(st.sale_amount) as rep_sales,
(
SELECT AVG(regional_sales.rep_total)
FROM (
SELECT SUM(st2.sale_amount) as rep_total
FROM sales_reps sr2
JOIN sales_transactions st2 ON sr2.rep_id = st2.rep_id
WHERE sr2.region = sr.region -- This creates the correlation
GROUP BY sr2.rep_id
) regional_sales
) as regional_average
FROM sales_reps sr
JOIN sales_transactions st ON sr.rep_id = st.rep_id
GROUP BY sr.rep_id, sr.rep_name, sr.region
HAVING SUM(st.sale_amount) > (
SELECT AVG(regional_sales.rep_total)
FROM (
SELECT SUM(st2.sale_amount) as rep_total
FROM sales_reps sr2
JOIN sales_transactions st2 ON sr2.rep_id = st2.rep_id
WHERE sr2.region = sr.region
GROUP BY sr2.rep_id
) regional_sales
);
This query is more complex because for each rep in the outer query, the subquery recalculates the average for that specific region. The WHERE sr2.region = sr.region clause creates the correlation.
Performance Warning: Correlated subqueries can be expensive because they execute multiple times. For the query above, if you have 100 sales reps, the subquery runs 100 times. Always consider whether a JOIN might be more efficient.
Subqueries aren't limited to WHERE and HAVING clauses. Here are practical examples for each location:
SELECT clause (calculated columns):
SELECT
sr.rep_name,
sr.region,
(
SELECT COUNT(*)
FROM sales_transactions st
WHERE st.rep_id = sr.rep_id
AND st.sale_date >= '2024-01-01'
) as ytd_transaction_count,
(
SELECT SUM(sale_amount)
FROM sales_transactions st
WHERE st.rep_id = sr.rep_id
AND st.sale_date >= '2024-01-01'
) as ytd_sales
FROM sales_reps sr;
FROM clause (derived tables):
SELECT
monthly_sales.sale_month,
AVG(monthly_sales.monthly_total) as avg_monthly_sales,
MAX(monthly_sales.monthly_total) as peak_monthly_sales
FROM (
SELECT
DATE_TRUNC('month', sale_date) as sale_month,
SUM(sale_amount) as monthly_total
FROM sales_transactions
GROUP BY DATE_TRUNC('month', sale_date)
) monthly_sales
GROUP BY monthly_sales.sale_month
ORDER BY sale_month;
CTEs solve the readability problem that complex subqueries create. They let you name intermediate result sets and reference them multiple times within a single query. Think of CTEs as temporary views that exist only for the duration of your query.
Here's the basic structure:
WITH cte_name AS (
-- Your query here
)
SELECT * FROM cte_name;
Let's rewrite our regional average comparison using a CTE:
WITH rep_totals AS (
SELECT
sr.rep_id,
sr.rep_name,
sr.region,
SUM(st.sale_amount) as total_sales
FROM sales_reps sr
JOIN sales_transactions st ON sr.rep_id = st.rep_id
GROUP BY sr.rep_id, sr.rep_name, sr.region
),
regional_averages AS (
SELECT
region,
AVG(total_sales) as avg_regional_sales
FROM rep_totals
GROUP BY region
)
SELECT
rt.rep_name,
rt.region,
rt.total_sales,
ra.avg_regional_sales,
rt.total_sales - ra.avg_regional_sales as difference,
ROUND(
((rt.total_sales - ra.avg_regional_sales) / ra.avg_regional_sales) * 100,
2
) as percent_above_average
FROM rep_totals rt
JOIN regional_averages ra ON rt.region = ra.region
WHERE rt.total_sales > ra.avg_regional_sales * 1.2 -- 20% above average
ORDER BY percent_above_average DESC;
This CTE approach is much more readable. We can see the logical flow: calculate individual totals, then regional averages, then compare them. Each CTE has a clear purpose and can be understood independently.
CTEs really shine when you need to perform multi-step analysis. Let's build a comprehensive sales performance dashboard:
WITH monthly_sales AS (
SELECT
sr.rep_id,
sr.rep_name,
sr.region,
DATE_TRUNC('month', st.sale_date) as sale_month,
SUM(st.sale_amount) as monthly_total,
COUNT(st.transaction_id) as transaction_count
FROM sales_reps sr
JOIN sales_transactions st ON sr.rep_id = st.rep_id
WHERE st.sale_date >= '2024-01-01'
GROUP BY sr.rep_id, sr.rep_name, sr.region, DATE_TRUNC('month', st.sale_date)
),
rep_performance AS (
SELECT
rep_id,
rep_name,
region,
SUM(monthly_total) as ytd_sales,
AVG(monthly_total) as avg_monthly_sales,
COUNT(sale_month) as active_months,
SUM(transaction_count) as total_transactions
FROM monthly_sales
GROUP BY rep_id, rep_name, region
),
regional_benchmarks AS (
SELECT
region,
AVG(ytd_sales) as avg_regional_ytd,
AVG(avg_monthly_sales) as avg_regional_monthly,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY ytd_sales) as top_quartile_ytd
FROM rep_performance
GROUP BY region
),
performance_rankings AS (
SELECT
rp.*,
rb.avg_regional_ytd,
rb.avg_regional_monthly,
rb.top_quartile_ytd,
RANK() OVER (PARTITION BY rp.region ORDER BY rp.ytd_sales DESC) as regional_rank,
RANK() OVER (ORDER BY rp.ytd_sales DESC) as company_rank
FROM rep_performance rp
JOIN regional_benchmarks rb ON rp.region = rb.region
)
SELECT
rep_name,
region,
ytd_sales,
avg_monthly_sales,
total_transactions,
regional_rank,
company_rank,
CASE
WHEN ytd_sales >= top_quartile_ytd THEN 'Top Performer'
WHEN ytd_sales >= avg_regional_ytd THEN 'Above Average'
ELSE 'Needs Improvement'
END as performance_category,
ROUND(
((ytd_sales - avg_regional_ytd) / avg_regional_ytd) * 100,
2
) as vs_regional_average_pct
FROM performance_rankings
ORDER BY company_rank;
This query demonstrates the power of CTEs for complex analysis. Each step builds on the previous one, creating a clear narrative from raw transactions to final performance rankings.
Recursive CTEs handle hierarchical or tree-like data structures. They're essential for organizational charts, product categories, or any scenario where data has parent-child relationships.
-- Sample hierarchical data
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
employee_name VARCHAR(100),
manager_id INT,
department VARCHAR(50),
salary DECIMAL(10,2)
);
-- Find all employees in a management chain
WITH RECURSIVE management_chain AS (
-- Base case: start with a specific manager
SELECT
employee_id,
employee_name,
manager_id,
department,
salary,
0 as level,
employee_name as chain_path
FROM employees
WHERE employee_id = 101 -- Starting manager ID
UNION ALL
-- Recursive case: find direct reports
SELECT
e.employee_id,
e.employee_name,
e.manager_id,
e.department,
e.salary,
mc.level + 1,
mc.chain_path || ' -> ' || e.employee_name
FROM employees e
JOIN management_chain mc ON e.manager_id = mc.employee_id
)
SELECT
employee_id,
employee_name,
department,
salary,
level,
chain_path,
CASE level
WHEN 0 THEN 'Senior Manager'
WHEN 1 THEN 'Manager'
WHEN 2 THEN 'Senior Associate'
ELSE 'Associate'
END as role_level
FROM management_chain
ORDER BY level, employee_name;
Recursion Tip: Always include a termination condition to prevent infinite loops. Most databases have built-in limits, but it's better to be explicit about when recursion should stop.
Understanding when to use subqueries versus JOINs is crucial for performance. Here are the key decision factors:
Use subqueries when:
-- Good use of subquery: filtering by aggregate
SELECT customer_id, customer_name
FROM customers
WHERE customer_id IN (
SELECT customer_id
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id
HAVING SUM(order_amount) > 10000
);
Use JOINs when:
-- Often more efficient than correlated subqueries
SELECT
sr.rep_name,
sr.region,
SUM(st.sale_amount) as rep_sales,
regional_avg.avg_sales
FROM sales_reps sr
JOIN sales_transactions st ON sr.rep_id = st.rep_id
JOIN (
SELECT
sr2.region,
AVG(rep_totals.total_sales) as avg_sales
FROM sales_reps sr2
JOIN (
SELECT rep_id, SUM(sale_amount) as total_sales
FROM sales_transactions
GROUP BY rep_id
) rep_totals ON sr2.rep_id = rep_totals.rep_id
GROUP BY sr2.region
) regional_avg ON sr.region = regional_avg.region
GROUP BY sr.rep_id, sr.rep_name, sr.region, regional_avg.avg_sales
HAVING SUM(st.sale_amount) > regional_avg.avg_sales * 1.2;
CTEs in most databases are materialized once and can be referenced multiple times efficiently. However, some databases treat CTEs as views and re-execute them for each reference. Know your database's behavior:
-- This CTE will be referenced twice - efficient in most databases
WITH expensive_calculation AS (
SELECT
product_id,
AVG(sale_amount) as avg_price,
COUNT(*) as sale_count,
-- Expensive window function
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sale_amount) as median_price
FROM sales_transactions
WHERE sale_date >= '2024-01-01'
GROUP BY product_id
)
SELECT
ec1.product_id,
ec1.avg_price,
ec1.sale_count,
ec1.median_price,
-- Second reference to the same CTE
ec2.avg_price as category_avg
FROM expensive_calculation ec1
JOIN expensive_calculation ec2 ON ec1.product_id = ec2.product_id;
Let's build a comprehensive customer segmentation analysis using both subqueries and CTEs. You'll analyze an e-commerce dataset to identify high-value customers and their purchasing patterns.
-- Setup: Create sample e-commerce data
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
customer_name VARCHAR(100),
email VARCHAR(100),
registration_date DATE,
country VARCHAR(50)
);
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
order_date DATE,
order_amount DECIMAL(10,2),
shipping_cost DECIMAL(10,2)
);
CREATE TABLE order_items (
item_id INT PRIMARY KEY,
order_id INT,
product_category VARCHAR(50),
quantity INT,
unit_price DECIMAL(10,2)
);
-- Your task: Write a query that identifies "VIP customers" based on these criteria:
-- 1. Total lifetime value > 75th percentile of all customers
-- 2. Average order value > company average
-- 3. Has made purchases in at least 3 different product categories
-- 4. Has been active (made a purchase) in the last 90 days
-- Solution using CTEs:
WITH customer_metrics AS (
SELECT
c.customer_id,
c.customer_name,
c.email,
c.country,
c.registration_date,
COUNT(o.order_id) as total_orders,
SUM(o.order_amount) as lifetime_value,
AVG(o.order_amount) as avg_order_value,
MAX(o.order_date) as last_order_date,
COUNT(DISTINCT oi.product_category) as categories_purchased
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
LEFT JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.order_date IS NOT NULL -- Only customers who have made purchases
GROUP BY c.customer_id, c.customer_name, c.email, c.country, c.registration_date
),
company_benchmarks AS (
SELECT
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY lifetime_value) as ltv_75th_percentile,
AVG(avg_order_value) as company_avg_order_value
FROM customer_metrics
),
vip_candidates AS (
SELECT
cm.*,
cb.ltv_75th_percentile,
cb.company_avg_order_value,
CASE
WHEN cm.lifetime_value > cb.ltv_75th_percentile
AND cm.avg_order_value > cb.company_avg_order_value
AND cm.categories_purchased >= 3
AND cm.last_order_date >= CURRENT_DATE - INTERVAL '90 days'
THEN TRUE
ELSE FALSE
END as is_vip
FROM customer_metrics cm
CROSS JOIN company_benchmarks cb
)
SELECT
customer_name,
email,
country,
total_orders,
lifetime_value,
avg_order_value,
categories_purchased,
last_order_date,
ROUND(
((lifetime_value - ltv_75th_percentile) / ltv_75th_percentile) * 100,
2
) as ltv_vs_75th_percentile,
CASE
WHEN is_vip THEN 'VIP Customer'
WHEN lifetime_value > ltv_75th_percentile THEN 'High Value'
WHEN avg_order_value > company_avg_order_value THEN 'High Spender'
ELSE 'Standard'
END as customer_segment
FROM vip_candidates
WHERE is_vip = TRUE -- Only show VIP customers
ORDER BY lifetime_value DESC;
This exercise demonstrates several advanced concepts:
Problem: Using correlated subqueries when a JOIN would be more efficient.
-- Inefficient: This subquery runs once per row
SELECT
customer_name,
(
SELECT AVG(order_amount)
FROM orders o
WHERE o.customer_id = c.customer_id
) as avg_order_amount
FROM customers c;
Solution: Use window functions or JOINs instead:
-- More efficient: Single scan with window function
SELECT DISTINCT
c.customer_name,
AVG(o.order_amount) OVER (PARTITION BY c.customer_id) as avg_order_amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id;
Problem: Referencing columns that don't exist in the CTE.
-- Error: total_sales isn't selected in the CTE
WITH sales_summary AS (
SELECT rep_id, SUM(sale_amount) as sales_total
FROM sales_transactions
GROUP BY rep_id
)
SELECT rep_id, total_sales -- Column doesn't exist!
FROM sales_summary;
Solution: Ensure column names match between CTE definition and usage.
Problem: Missing or incorrect termination conditions.
-- Dangerous: Could loop infinitely
WITH RECURSIVE bad_recursion AS (
SELECT employee_id, manager_id, 1 as level
FROM employees
WHERE manager_id IS NULL
UNION ALL
SELECT e.employee_id, e.manager_id, br.level + 1
FROM employees e
JOIN bad_recursion br ON e.manager_id = br.employee_id
-- No termination condition!
)
SELECT * FROM bad_recursion;
Solution: Add explicit termination conditions:
WITH RECURSIVE safe_recursion AS (
SELECT employee_id, manager_id, 1 as level
FROM employees
WHERE manager_id IS NULL
UNION ALL
SELECT e.employee_id, e.manager_id, sr.level + 1
FROM employees e
JOIN safe_recursion sr ON e.manager_id = sr.employee_id
WHERE sr.level < 10 -- Prevent runaway recursion
)
SELECT * FROM safe_recursion;
Test CTEs independently: Run each CTE as a separate query to verify the results before combining them.
Use EXPLAIN PLAN: Understand how your database executes complex queries to identify performance bottlenecks.
Start simple: Begin with basic subqueries and gradually add complexity.
Validate assumptions: Check that your subqueries return the expected number of rows and data types.
You've learned how to use subqueries and CTEs to solve complex analytical problems that would be difficult or impossible with basic SQL. Subqueries let you embed one query inside another for dynamic filtering and calculations, while CTEs provide a readable way to break complex logic into manageable steps.
Key takeaways:
Next steps in your SQL journey:
The techniques you've learned here form the foundation for advanced analytical SQL. With practice, you'll find yourself naturally thinking in terms of multi-step logical processes that CTEs and subqueries can elegantly express.
Learning Path: SQL Fundamentals