
Picture this: You're analyzing customer purchase patterns for an e-commerce platform, and your manager asks a seemingly simple question: "Which customers spent more than our average order value last month, and what did they buy?"
Your first instinct might be to calculate the average separately, then filter customers—but that's two queries when you need one cohesive analysis. Or perhaps you start building a massive JOIN statement that becomes increasingly unreadable as you add more conditions. This is exactly where subqueries and Common Table Expressions (CTEs) become indispensable tools in your SQL arsenal.
Both techniques allow you to break complex problems into manageable pieces, but they serve different purposes and have distinct performance characteristics. By the end of this lesson, you'll understand when to use each approach and how to write maintainable, efficient queries that solve real business problems.
What you'll learn:
You should be comfortable with basic SQL operations (SELECT, WHERE, GROUP BY), joins between tables, and aggregate functions. We'll assume you understand primary/foreign key relationships and can read intermediate SQL queries without difficulty.
A subquery is a complete SELECT statement nested inside another SQL statement. Think of it as asking a question to answer another question. Let's start with a realistic dataset—imagine you're working with an e-commerce database with these core tables:
-- Sample data structure
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
customer_name VARCHAR(100),
registration_date DATE,
customer_tier VARCHAR(20)
);
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
order_date DATE,
order_total DECIMAL(10,2),
order_status VARCHAR(20)
);
CREATE TABLE order_items (
order_item_id INT PRIMARY KEY,
order_id INT,
product_id INT,
quantity INT,
unit_price DECIMAL(8,2)
);
CREATE TABLE products (
product_id INT PRIMARY KEY,
product_name VARCHAR(200),
category VARCHAR(50),
list_price DECIMAL(8,2)
);
The simplest subquery returns exactly one value. Let's find customers whose lifetime spending exceeds the average:
SELECT
customer_name,
customer_tier,
(SELECT SUM(order_total)
FROM orders o
WHERE o.customer_id = c.customer_id) as lifetime_spend
FROM customers c
WHERE (SELECT SUM(order_total)
FROM orders o
WHERE o.customer_id = c.customer_id) >
(SELECT AVG(order_total) FROM orders);
This query contains both correlated and non-correlated subqueries. The AVG(order_total) subquery runs once for the entire query (non-correlated), while the SUM(order_total) subqueries run once per customer row (correlated).
Performance Note: Correlated subqueries can be expensive because they execute once for each row in the outer query. For large datasets, consider whether a JOIN might be more efficient.
You can treat subquery results as temporary tables. This is particularly useful for complex aggregations:
SELECT
monthly_stats.order_month,
monthly_stats.total_orders,
monthly_stats.avg_order_value,
CASE
WHEN monthly_stats.avg_order_value > yearly_avg.overall_avg
THEN 'Above Average'
ELSE 'Below Average'
END as performance
FROM (
-- Monthly aggregation subquery
SELECT
DATE_FORMAT(order_date, '%Y-%m') as order_month,
COUNT(*) as total_orders,
AVG(order_total) as avg_order_value
FROM orders
WHERE order_status = 'completed'
GROUP BY DATE_FORMAT(order_date, '%Y-%m')
) monthly_stats
CROSS JOIN (
-- Yearly average subquery
SELECT AVG(order_total) as overall_avg
FROM orders
WHERE order_status = 'completed'
) yearly_avg
ORDER BY monthly_stats.order_month;
The EXISTS operator is perfect for filtering based on the presence of related records. Let's find customers who've never purchased from the electronics category:
SELECT
customer_name,
registration_date
FROM customers c
WHERE NOT EXISTS (
SELECT 1
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE o.customer_id = c.customer_id
AND p.category = 'Electronics'
);
The SELECT 1 is a common pattern in EXISTS clauses—we only care whether rows exist, not what they contain.
CTEs provide a way to define temporary named result sets that exist only for the duration of a single query. They make complex queries far more readable and maintainable.
Here's the same customer analysis from earlier, rewritten with a CTE:
WITH customer_spending AS (
SELECT
c.customer_id,
c.customer_name,
c.customer_tier,
COALESCE(SUM(o.order_total), 0) as lifetime_spend
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name, c.customer_tier
),
spending_stats AS (
SELECT AVG(lifetime_spend) as avg_spend
FROM customer_spending
)
SELECT
cs.customer_name,
cs.customer_tier,
cs.lifetime_spend,
CASE
WHEN cs.lifetime_spend > ss.avg_spend THEN 'High Value'
WHEN cs.lifetime_spend > ss.avg_spend * 0.5 THEN 'Medium Value'
ELSE 'Low Value'
END as customer_segment
FROM customer_spending cs
CROSS JOIN spending_stats ss
WHERE cs.lifetime_spend > 0
ORDER BY cs.lifetime_spend DESC;
This approach is much more readable than nested subqueries. Each CTE has a clear purpose and can be easily tested independently.
CTEs really shine when you need to build complex analysis step by step. Let's analyze product performance across different customer segments:
WITH monthly_orders AS (
-- Step 1: Aggregate orders by month
SELECT
DATE_FORMAT(order_date, '%Y-%m') as order_month,
customer_id,
SUM(order_total) as monthly_spend
FROM orders
WHERE order_status = 'completed'
AND order_date >= DATE_SUB(CURDATE(), INTERVAL 12 MONTH)
GROUP BY DATE_FORMAT(order_date, '%Y-%m'), customer_id
),
customer_segments AS (
-- Step 2: Classify customers by spending patterns
SELECT
customer_id,
AVG(monthly_spend) as avg_monthly_spend,
COUNT(*) as active_months,
CASE
WHEN AVG(monthly_spend) >= 1000 THEN 'Premium'
WHEN AVG(monthly_spend) >= 500 THEN 'Standard'
ELSE 'Basic'
END as spending_tier
FROM monthly_orders
GROUP BY customer_id
),
product_preferences AS (
-- Step 3: Analyze product categories by segment
SELECT
cs.spending_tier,
p.category,
COUNT(DISTINCT oi.order_id) as orders_count,
SUM(oi.quantity * oi.unit_price) as category_revenue,
AVG(oi.unit_price) as avg_unit_price
FROM customer_segments cs
JOIN orders o ON cs.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE o.order_status = 'completed'
GROUP BY cs.spending_tier, p.category
)
SELECT
spending_tier,
category,
orders_count,
category_revenue,
ROUND(category_revenue / SUM(category_revenue) OVER (PARTITION BY spending_tier) * 100, 2) as revenue_percentage,
avg_unit_price
FROM product_preferences
WHERE orders_count >= 5 -- Filter out categories with too few orders
ORDER BY spending_tier, category_revenue DESC;
Recursive CTEs are powerful for working with tree-like data structures. Let's say you have an organizational hierarchy and need to find all employees under a specific manager:
-- Sample employee hierarchy table
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
employee_name VARCHAR(100),
manager_id INT,
department VARCHAR(50),
salary DECIMAL(10,2)
);
-- Recursive CTE to find organizational hierarchy
WITH RECURSIVE org_chart AS (
-- Base case: Start with top-level manager
SELECT
employee_id,
employee_name,
manager_id,
department,
salary,
0 as level,
CAST(employee_name AS VARCHAR(500)) as hierarchy_path
FROM employees
WHERE manager_id IS NULL -- Top level managers
UNION ALL
-- Recursive case: Find direct reports
SELECT
e.employee_id,
e.employee_name,
e.manager_id,
e.department,
e.salary,
oc.level + 1,
CONCAT(oc.hierarchy_path, ' > ', e.employee_name)
FROM employees e
INNER JOIN org_chart oc ON e.manager_id = oc.employee_id
)
SELECT
CONCAT(REPEAT(' ', level), employee_name) as org_structure,
department,
salary,
hierarchy_path
FROM org_chart
ORDER BY hierarchy_path;
Warning: Always include a termination condition in recursive CTEs to prevent infinite loops. Most databases have built-in limits, but it's better to design your logic carefully.
The choice between these approaches often comes down to performance and readability:
Use subqueries when:
Use CTEs when:
Use JOINs when:
Let's compare performance approaches for finding customers with above-average spending:
-- Approach 1: Correlated subquery (can be slow)
SELECT customer_name, customer_tier
FROM customers c
WHERE (SELECT SUM(order_total) FROM orders o WHERE o.customer_id = c.customer_id) >
(SELECT AVG(order_total) FROM orders);
-- Approach 2: CTE (more readable, similar performance to JOIN)
WITH customer_totals AS (
SELECT
c.customer_id,
c.customer_name,
c.customer_tier,
COALESCE(SUM(o.order_total), 0) as total_spend
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name, c.customer_tier
),
avg_spend AS (
SELECT AVG(order_total) as avg_order_value
FROM orders
)
SELECT customer_name, customer_tier
FROM customer_totals ct, avg_spend av
WHERE ct.total_spend > av.avg_order_value;
-- Approach 3: Pure JOIN (often fastest for large datasets)
SELECT DISTINCT c.customer_name, c.customer_tier
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN (
SELECT customer_id, SUM(order_total) as total_spend
FROM orders
GROUP BY customer_id
) customer_totals ON c.customer_id = customer_totals.customer_id
CROSS JOIN (
SELECT AVG(order_total) as avg_order_value
FROM orders
) avg_spend
WHERE customer_totals.total_spend > avg_spend.avg_order_value;
Index your subquery conditions: Ensure columns used in WHERE clauses of subqueries have appropriate indexes.
Use EXISTS instead of IN for better performance:
-- Slower with large subquery results
SELECT customer_name
FROM customers
WHERE customer_id IN (SELECT customer_id FROM high_value_customers);
-- Faster
SELECT customer_name
FROM customers c
WHERE EXISTS (SELECT 1 FROM high_value_customers h WHERE h.customer_id = c.customer_id);
Let's build a comprehensive customer segmentation analysis that combines multiple techniques. You'll analyze an e-commerce dataset to create a customer scoring system.
Your task is to create a query that:
-- Start with this framework and fill in the logic:
WITH customer_rfm AS (
-- Calculate Recency (days since last order),
-- Frequency (total orders),
-- and Monetary (total spent) for each customer
SELECT
c.customer_id,
c.customer_name,
c.customer_tier,
-- Your RFM calculations here
DATEDIFF(CURDATE(), MAX(o.order_date)) as recency_days,
COUNT(DISTINCT o.order_id) as frequency_orders,
SUM(o.order_total) as monetary_total
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_status = 'completed' OR o.order_status IS NULL
GROUP BY c.customer_id, c.customer_name, c.customer_tier
),
rfm_scores AS (
-- Convert RFM values to scores (1-5 scale)
-- Hint: Use NTILE() or CASE statements
SELECT *,
CASE
WHEN recency_days <= 30 THEN 5
WHEN recency_days <= 60 THEN 4
WHEN recency_days <= 120 THEN 3
WHEN recency_days <= 365 THEN 2
ELSE 1
END as recency_score,
-- Add frequency and monetary scoring logic
NTILE(5) OVER (ORDER BY frequency_orders) as frequency_score,
NTILE(5) OVER (ORDER BY monetary_total) as monetary_score
FROM customer_rfm
),
customer_segments AS (
-- Create customer segments based on RFM scores
SELECT *,
CASE
WHEN recency_score >= 4 AND frequency_score >= 4 AND monetary_score >= 4 THEN 'Champions'
WHEN recency_score >= 3 AND frequency_score >= 3 AND monetary_score >= 3 THEN 'Loyal Customers'
WHEN recency_score >= 4 AND frequency_score <= 2 THEN 'New Customers'
WHEN recency_score <= 2 AND frequency_score >= 3 AND monetary_score >= 3 THEN 'At Risk'
WHEN recency_score <= 2 AND frequency_score <= 2 THEN 'Lost Customers'
ELSE 'Potential Loyalists'
END as customer_segment
FROM rfm_scores
)
-- Final query: Show segment summary with top products per segment
SELECT
customer_segment,
COUNT(*) as customer_count,
AVG(monetary_total) as avg_customer_value,
AVG(frequency_orders) as avg_order_frequency,
AVG(recency_days) as avg_recency_days
FROM customer_segments
GROUP BY customer_segment
ORDER BY avg_customer_value DESC;
Challenge Extension: Modify the query to also show the top 3 product categories purchased by each customer segment.
-- This will be slow without an index on orders(customer_id)
SELECT customer_name
FROM customers c
WHERE (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id) > 5;
-- Always ensure proper indexing:
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
-- Redundant but harmless
SELECT customer_name
FROM customers c
WHERE EXISTS (SELECT DISTINCT customer_id FROM orders o WHERE o.customer_id = c.customer_id);
-- Correct - EXISTS only checks for existence
SELECT customer_name
FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id);
-- This will cause an error - 'customers' conflicts with table name
WITH customers AS (
SELECT customer_id, customer_name FROM customers WHERE customer_tier = 'Premium'
)
SELECT * FROM customers; -- Ambiguous reference
-- Use descriptive, unique names
WITH premium_customers AS (
SELECT customer_id, customer_name FROM customers WHERE customer_tier = 'Premium'
)
SELECT * FROM premium_customers;
-- Dangerous - no termination condition
WITH RECURSIVE bad_recursion AS (
SELECT 1 as n
UNION ALL
SELECT n + 1 FROM bad_recursion -- Will run forever!
)
SELECT * FROM bad_recursion;
-- Safe - includes termination condition
WITH RECURSIVE safe_recursion AS (
SELECT 1 as n
UNION ALL
SELECT n + 1 FROM safe_recursion WHERE n < 10 -- Stops at 10
)
SELECT * FROM safe_recursion;
When your subqueries or CTEs aren't returning expected results:
Test each CTE independently: Comment out the main query and test each CTE with SELECT * FROM cte_name LIMIT 10
Add debugging columns: Include row counts and intermediate calculations:
WITH debug_cte AS (
SELECT
customer_id,
COUNT(*) as order_count, -- Add this for debugging
SUM(order_total) as total_spend
FROM orders
GROUP BY customer_id
)
SELECT customer_id, order_count, total_spend FROM debug_cte;
You've now mastered the essential techniques for writing complex, readable SQL queries using subqueries and CTEs. Here's what you accomplished:
Immediate next steps:
Advanced topics to explore next:
The techniques you've learned here form the foundation for advanced SQL analytics. Whether you're building data pipelines, creating business intelligence reports, or performing ad-hoc analysis, subqueries and CTEs will help you write maintainable, efficient queries that solve real business problems.
Learning Path: SQL Fundamentals