You're analyzing customer purchase patterns for an e-commerce company, and your manager asks a seemingly simple question: "Which products had above-average sales last month, and what was the performance of their top customers?" Your first instinct might be to write multiple queries, export results to spreadsheets, and manually piece together the analysis. But there's a better way.
This is exactly the kind of complex data question that subqueries and Common Table Expressions (CTEs) were designed to solve. These SQL constructs let you break down intricate business logic into readable, maintainable queries that answer multi-layered questions in a single statement.
By the end of this lesson, you'll confidently use subqueries and CTEs to tackle complex analytical challenges that would otherwise require multiple queries or cumbersome post-processing.
Prerequisites:
You should be comfortable with basic SQL operations including SELECT statements, JOINs, GROUP BY clauses, and aggregate functions. We'll build on these concepts rather than explaining them from scratch.
A subquery is a SQL query nested inside another query. Think of it as asking a question to answer another question. Let's start with a realistic scenario using an e-commerce database.
-- Find customers who spent more than the average order value
SELECT
    customer_id,
    customer_name,
    total_spent
FROM (
    SELECT
        c.customer_id,
        c.customer_name,
        SUM(o.order_total) as total_spent
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.order_date >= '2024-01-01'
    GROUP BY c.customer_id, c.customer_name
) customer_totals
WHERE total_spent > (
    SELECT AVG(order_total)
    FROM orders
    WHERE order_date >= '2024-01-01'
);
This query demonstrates two types of subqueries working together. The inner subquery in the WHERE clause calculates the average order value, while the FROM subquery (also called a derived table) pre-aggregates customer spending data.
Scalar Subqueries return a single value and are perfect for dynamic comparisons:
-- Products priced above the category average
SELECT
    product_name,
    price,
    category_id
FROM products p1
WHERE price > (
    SELECT AVG(price)
    FROM products p2
    WHERE p2.category_id = p1.category_id
);
Correlated Subqueries reference columns from the outer query, executing once per outer row:
-- Find each customer's most recent order
SELECT
    customer_id,
    order_date,
    order_total
FROM orders o1
WHERE order_date = (
    SELECT MAX(order_date)
    FROM orders o2
    WHERE o2.customer_id = o1.customer_id
);
Performance Note: Correlated subqueries can be slow on large datasets because they execute repeatedly. Consider window functions or JOINs for better performance when possible.
Subqueries with IN/EXISTS are excellent for filtering based on relationships:
-- Customers who have never returned a product
SELECT customer_id, customer_name
FROM customers c
WHERE NOT EXISTS (
    SELECT 1
    FROM returns r
    JOIN orders o ON r.order_id = o.order_id
    WHERE o.customer_id = c.customer_id
);
CTEs (Common Table Expressions) provide a cleaner way to structure complex queries. They're like temporary named result sets that exist only for the duration of your query. Here's the same customer analysis from earlier, rewritten with CTEs:
WITH customer_spending AS (
    SELECT
        c.customer_id,
        c.customer_name,
        SUM(o.order_total) as total_spent,
        COUNT(o.order_id) as order_count
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.order_date >= '2024-01-01'
    GROUP BY c.customer_id, c.customer_name
),
average_metrics AS (
    -- Reuse the first CTE instead of re-scanning orders
    SELECT AVG(total_spent) as avg_customer_spent
    FROM customer_spending
)
SELECT
    cs.customer_id,
    cs.customer_name,
    cs.total_spent,
    cs.order_count,
    ROUND(cs.total_spent / cs.order_count, 2) as avg_order_per_customer,
    CASE
        WHEN cs.total_spent > am.avg_customer_spent THEN 'Above Average'
        ELSE 'Below Average'
    END as spending_tier
FROM customer_spending cs
CROSS JOIN average_metrics am
ORDER BY cs.total_spent DESC;
CTEs really shine when you need to perform multi-step analysis. Let's solve a complex business problem: identifying seasonal purchasing patterns and their impact on inventory planning.
WITH monthly_sales AS (
    -- Step 1: Aggregate sales by month and product
    SELECT
        DATE_TRUNC('month', order_date) as sale_month,
        p.product_id,
        p.product_name,
        p.category_id,
        SUM(oi.quantity) as units_sold,
        SUM(oi.quantity * oi.unit_price) as revenue
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    JOIN products p ON oi.product_id = p.product_id
    WHERE o.order_date >= '2023-01-01'
    GROUP BY DATE_TRUNC('month', order_date), p.product_id, p.product_name, p.category_id
),
seasonal_patterns AS (
    -- Step 2: Calculate seasonal trends
    SELECT
        product_id,
        product_name,
        EXTRACT(MONTH FROM sale_month) as month_num,
        AVG(units_sold) as avg_monthly_units,
        STDDEV(units_sold) as units_volatility,
        SUM(revenue) as total_revenue
    FROM monthly_sales
    GROUP BY product_id, product_name, EXTRACT(MONTH FROM sale_month)
),
product_rankings AS (
    -- Step 3: Rank products by seasonal performance
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY month_num
            ORDER BY avg_monthly_units DESC
        ) as monthly_rank,
        CASE
            WHEN units_volatility > avg_monthly_units * 0.5 THEN 'High Volatility'
            WHEN units_volatility > avg_monthly_units * 0.2 THEN 'Medium Volatility'
            ELSE 'Stable'
        END as demand_pattern
    FROM seasonal_patterns
)
-- Step 4: Final analysis for inventory planning
SELECT
    month_num,
    TO_CHAR(TO_DATE(month_num::text, 'MM'), 'Month') as month_name,
    product_name,
    monthly_rank,
    ROUND(avg_monthly_units, 0) as projected_demand,
    demand_pattern,
    ROUND(total_revenue, 2) as seasonal_revenue
FROM product_rankings
WHERE monthly_rank <= 10 -- Top 10 products per month
ORDER BY month_num, monthly_rank;
This multi-CTE query transforms raw transactional data into actionable inventory insights, showing seasonal demand patterns and volatility for planning purposes.
The choice between subqueries and CTEs often comes down to readability and reusability:
Use subqueries when:
- The logic is a simple, one-off filter or comparison (for example, a single scalar threshold)
- The nested result is needed in only one place
- The query stays short enough to read at a glance

Use CTEs when:
- You reference the same intermediate result more than once
- The analysis involves multiple sequential steps
- You want named, self-documenting building blocks that teammates can follow
Here's a practical comparison. This subquery approach works but becomes hard to follow:
-- Subquery approach - harder to read and maintain
SELECT
    customer_id,
    customer_name,
    order_frequency,
    avg_order_value
FROM (
    SELECT
        c.customer_id,
        c.customer_name,
        COUNT(o.order_id) as order_frequency,
        AVG(o.order_total) as avg_order_value
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.order_date >= (SELECT DATE_SUB(CURDATE(), INTERVAL 12 MONTH))
    GROUP BY c.customer_id, c.customer_name
) customer_metrics
WHERE order_frequency > (
    SELECT AVG(order_frequency)
    FROM (
        SELECT COUNT(order_id) as order_frequency
        FROM orders
        WHERE order_date >= (SELECT DATE_SUB(CURDATE(), INTERVAL 12 MONTH))
        GROUP BY customer_id
    ) freq_calc
);
The same logic with CTEs is much clearer:
-- CTE approach - clear and maintainable
WITH analysis_period AS (
    SELECT DATE_SUB(CURDATE(), INTERVAL 12 MONTH) as start_date
),
customer_metrics AS (
    SELECT
        c.customer_id,
        c.customer_name,
        COUNT(o.order_id) as order_frequency,
        AVG(o.order_total) as avg_order_value
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    CROSS JOIN analysis_period ap
    WHERE o.order_date >= ap.start_date
    GROUP BY c.customer_id, c.customer_name
),
frequency_benchmark AS (
    SELECT AVG(order_frequency) as avg_frequency
    FROM customer_metrics
)
SELECT
    cm.customer_id,
    cm.customer_name,
    cm.order_frequency,
    cm.avg_order_value
FROM customer_metrics cm
CROSS JOIN frequency_benchmark fb
WHERE cm.order_frequency > fb.avg_frequency;
Recursive CTEs handle hierarchical data like organizational charts, product categories, or customer referral chains. Here's how to analyze referral networks:
WITH RECURSIVE referral_chain AS (
    -- Anchor: Find customers who weren't referred by anyone
    SELECT
        customer_id,
        customer_name,
        referred_by_customer_id,
        1 as referral_level,
        CAST(customer_name AS VARCHAR(1000)) as chain_path
    FROM customers
    WHERE referred_by_customer_id IS NULL

    UNION ALL

    -- Recursive: Find customers referred by those in the chain
    SELECT
        c.customer_id,
        c.customer_name,
        c.referred_by_customer_id,
        rc.referral_level + 1,
        CONCAT(rc.chain_path, ' -> ', c.customer_name)
    FROM customers c
    JOIN referral_chain rc ON c.referred_by_customer_id = rc.customer_id
    WHERE rc.referral_level < 5 -- Prevent infinite recursion
)
SELECT
    referral_level,
    COUNT(*) as customers_at_level,
    AVG(
        (SELECT SUM(order_total)
         FROM orders o
         WHERE o.customer_id = rc.customer_id)
    ) as avg_revenue_per_customer
FROM referral_chain rc
GROUP BY referral_level
ORDER BY referral_level;
Recursion Warning: Always include a termination condition in recursive CTEs to prevent infinite loops. The WHERE rc.referral_level < 5 clause serves this purpose.
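If your database is PostgreSQL 14 or newer (an assumption — check your version and documentation), a CYCLE clause offers an alternative guard: it marks rows that repeat in the chain instead of capping depth.

```sql
-- Sketch, assuming PostgreSQL 14+: detect cycles rather than cap depth
WITH RECURSIVE referral_chain AS (
    SELECT customer_id, customer_name, referred_by_customer_id
    FROM customers
    WHERE referred_by_customer_id IS NULL
    UNION ALL
    SELECT c.customer_id, c.customer_name, c.referred_by_customer_id
    FROM customers c
    JOIN referral_chain rc ON c.referred_by_customer_id = rc.customer_id
) CYCLE customer_id SET is_cycle USING cycle_path
SELECT * FROM referral_chain WHERE NOT is_cycle;
```

The cycle_path column it maintains can also double as the chain_path built manually above.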
Let's build a comprehensive customer segmentation analysis using both subqueries and CTEs. You'll create a query that segments customers into tiers based on multiple factors: spending, order frequency, and product diversity.
Scenario: Your marketing team needs to create targeted campaigns. They want customers segmented into "VIP", "Regular", and "Occasional" tiers based on:
- Total spending relative to the rest of the customer base
- Order frequency compared to the average customer
- Diversity of product categories purchased
- How recently the customer last ordered
Here's your starting point with sample data structure:
-- Your solution should use this data structure
-- Tables: customers, orders, order_items, products
-- Goal: Create customer segments with explanatory metrics
WITH customer_behavior AS (
    -- Build this CTE: aggregate customer metrics
    -- Include: total_spent, order_count, unique_categories, days_since_last_order
    SELECT
        c.customer_id,
        c.customer_name,
        -- Add your aggregation logic here
        COUNT(DISTINCT o.order_id) as order_count,
        SUM(o.order_total) as total_spent,
        COUNT(DISTINCT p.category_id) as unique_categories,
        DATEDIFF(CURDATE(), MAX(o.order_date)) as days_since_last_order
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
        -- Keep the date filter in the ON clause: a WHERE filter on o
        -- would silently turn this LEFT JOIN into an inner join
        AND o.order_date >= DATE_SUB(CURDATE(), INTERVAL 12 MONTH)
    LEFT JOIN order_items oi ON o.order_id = oi.order_id
    LEFT JOIN products p ON oi.product_id = p.product_id
    GROUP BY c.customer_id, c.customer_name
),
spending_percentiles AS (
    -- Build this CTE: calculate spending thresholds
    -- Use PERCENTILE_CONT or similar function
    SELECT
        PERCENTILE_CONT(0.8) WITHIN GROUP (ORDER BY total_spent) as vip_threshold,
        AVG(order_count) as avg_order_frequency,
        AVG(unique_categories) as avg_category_diversity
    FROM customer_behavior
    WHERE total_spent > 0
)
-- Build final SELECT: assign tiers and include supporting metrics
SELECT
    cb.customer_id,
    cb.customer_name,
    cb.total_spent,
    cb.order_count,
    cb.unique_categories,
    cb.days_since_last_order,
    CASE
        WHEN cb.total_spent >= sp.vip_threshold
             AND cb.order_count > sp.avg_order_frequency
             AND cb.unique_categories >= sp.avg_category_diversity
        THEN 'VIP'
        WHEN cb.total_spent > 0
             AND cb.order_count >= sp.avg_order_frequency
             AND cb.days_since_last_order <= 90
        THEN 'Regular'
        ELSE 'Occasional'
    END as customer_tier
FROM customer_behavior cb
CROSS JOIN spending_percentiles sp
ORDER BY cb.total_spent DESC;
Challenge Extension: Modify your query to include a subquery that identifies customers whose spending has increased month-over-month in the last quarter. Add this as a "trending_up" flag to your segmentation.
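One possible sketch for that extension (an assumption on my part, not the only valid answer): use LAG to compare each customer's monthly totals over the last quarter, following the lesson's customers/orders schema.

```sql
-- Hedged sketch: flag customers whose spending rose every month
-- in the last quarter
WITH monthly_spend AS (
    SELECT
        customer_id,
        DATE_FORMAT(order_date, '%Y-%m') as spend_month,
        SUM(order_total) as month_total
    FROM orders
    WHERE order_date >= DATE_SUB(CURDATE(), INTERVAL 3 MONTH)
    GROUP BY customer_id, DATE_FORMAT(order_date, '%Y-%m')
),
with_prev AS (
    SELECT
        customer_id,
        month_total,
        LAG(month_total) OVER (
            PARTITION BY customer_id
            ORDER BY spend_month
        ) as prev_total
    FROM monthly_spend
)
SELECT
    customer_id,
    CASE WHEN MIN(month_total - prev_total) > 0 THEN 1 ELSE 0 END as trending_up
FROM with_prev
WHERE prev_total IS NOT NULL
GROUP BY customer_id;
```

Joining this result back to the segmentation query by customer_id attaches the trending_up flag to each tier.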
-- Slow: Correlated subquery executes for each row
SELECT product_name, price
FROM products p1
WHERE price > (
    SELECT AVG(price)
    FROM products p2
    WHERE p2.category_id = p1.category_id
);

-- Faster: Use window functions instead
SELECT product_name, price
FROM (
    SELECT
        product_name,
        price,
        AVG(price) OVER (PARTITION BY category_id) as category_avg_price
    FROM products
) p
WHERE price > category_avg_price;
-- Problem: if the subquery returns any NULLs, NOT IN matches no rows,
-- because "customer_id NOT IN (..., NULL)" evaluates to unknown
SELECT customer_name
FROM customers
WHERE customer_id NOT IN (
    SELECT customer_id
    FROM orders
    WHERE shipping_address IS NULL -- customer_id here might be NULL
);

-- Solution: Use NOT EXISTS, which handles NULLs safely
SELECT customer_name
FROM customers c
WHERE NOT EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
        AND o.shipping_address IS NULL
);
-- Wrong: Trying to reference CTE outside its query
WITH recent_orders AS (
    SELECT * FROM orders WHERE order_date >= '2024-01-01'
)
SELECT * FROM recent_orders;

-- This won't work in a separate statement:
-- SELECT * FROM recent_orders; -- Error: CTE doesn't exist here

-- Solution: CTEs only exist within their statement
WITH recent_orders AS (
    SELECT * FROM orders WHERE order_date >= '2024-01-01'
),
order_summary AS (
    SELECT customer_id, COUNT(*) as order_count
    FROM recent_orders -- This works: same statement
    GROUP BY customer_id
)
SELECT * FROM order_summary;
When your nested queries aren't returning expected results:
WITH step1 AS (
    SELECT customer_id, SUM(order_total) as total
    FROM orders
    GROUP BY customer_id
    -- Add this while debugging:
    -- LIMIT 100
),
step1_debug AS (
    SELECT *, COUNT(*) OVER () as total_rows
    FROM step1
)
SELECT * FROM step1_debug;
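Running each CTE's SELECT on its own is another quick check, and the query planner can confirm how the whole statement will execute (PostgreSQL-style syntax shown; most databases offer a similar EXPLAIN command):

```sql
-- Inspect the plan without running the full query
EXPLAIN
WITH step1 AS (
    SELECT customer_id, SUM(order_total) as total
    FROM orders
    GROUP BY customer_id
)
SELECT * FROM step1;
-- EXPLAIN ANALYZE additionally executes the query and reports
-- actual row counts and timings per step
```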
Ensure your subqueries can leverage indexes effectively:
-- Good: Uses index on order_date
SELECT customer_id
FROM customers
WHERE customer_id IN (
    SELECT customer_id
    FROM orders
    WHERE order_date >= '2024-01-01' -- Make sure there's an index here
);

-- Better: Join might be faster with proper indexes
SELECT DISTINCT c.customer_id
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01';
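If that index doesn't exist yet, creating one is a one-line change (the index names below are illustrative, not from the lesson's schema):

```sql
-- Support the date filter above with an index on order_date
CREATE INDEX idx_orders_order_date ON orders (order_date);

-- A composite index can cover the join-plus-filter pattern as well
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);
```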
Some databases materialize CTEs (store results temporarily), others optimize them as subqueries. Know your database behavior:
-- In PostgreSQL 12+, you can force materialization per CTE
WITH customer_stats AS MATERIALIZED (
    SELECT customer_id, AVG(order_total) as avg_order
    FROM orders
    GROUP BY customer_id
)
SELECT * FROM customer_stats WHERE avg_order > 100;
-- AS NOT MATERIALIZED requests the opposite: inline the CTE
Instead of repeating the same subquery:
-- Inefficient: Calculates average twice
SELECT
    product_name,
    price,
    price - (SELECT AVG(price) FROM products) as price_diff,
    CASE
        WHEN price > (SELECT AVG(price) FROM products) THEN 'Above Average'
        ELSE 'Below Average'
    END as price_category
FROM products;

-- Efficient: Calculate once with CTE
WITH avg_price AS (
    SELECT AVG(price) as market_avg FROM products
)
SELECT
    product_name,
    price,
    price - ap.market_avg as price_diff,
    CASE
        WHEN price > ap.market_avg THEN 'Above Average'
        ELSE 'Below Average'
    END as price_category
FROM products
CROSS JOIN avg_price ap;
Putting the pieces together, CTEs combined with window functions compute rolling averages plus month-over-month and year-over-year growth in a single statement:

WITH monthly_revenue AS (
    SELECT
        DATE_TRUNC('month', order_date) as month,
        SUM(order_total) as monthly_total
    FROM orders
    WHERE order_date >= '2023-01-01'
    GROUP BY DATE_TRUNC('month', order_date)
),
rolling_metrics AS (
    SELECT
        month,
        monthly_total,
        AVG(monthly_total) OVER (
            ORDER BY month
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
        ) as three_month_avg,
        LAG(monthly_total, 1) OVER (ORDER BY month) as prev_month,
        LAG(monthly_total, 12) OVER (ORDER BY month) as year_ago
    FROM monthly_revenue
)
SELECT
    month,
    monthly_total,
    three_month_avg,
    ROUND(
        (monthly_total - prev_month) / NULLIF(prev_month, 0) * 100, 2
    ) as month_over_month_pct,
    ROUND(
        (monthly_total - year_ago) / NULLIF(year_ago, 0) * 100, 2
    ) as year_over_year_pct
FROM rolling_metrics
WHERE prev_month IS NOT NULL
ORDER BY month;
A second end-to-end pattern pairs recent sales velocity with current stock levels to flag inventory risks:

WITH product_velocity AS (
    SELECT
        p.product_id,
        p.product_name,
        SUM(oi.quantity) as total_sold_3mo,
        AVG(oi.quantity) as avg_order_qty,
        COUNT(DISTINCT o.order_id) as order_frequency
    FROM products p
    -- Inner joins are intentional: the date filter below would turn
    -- LEFT JOINs into inner joins anyway, and the final query keeps
    -- only products with recent sales
    JOIN order_items oi ON p.product_id = oi.product_id
    JOIN orders o ON oi.order_id = o.order_id
    WHERE o.order_date >= DATE_SUB(CURDATE(), INTERVAL 3 MONTH)
    GROUP BY p.product_id, p.product_name
),
inventory_status AS (
    SELECT
        pv.*,
        i.current_stock,
        CASE
            WHEN pv.total_sold_3mo = 0 THEN 999
            -- Divide by 90.0 to force decimal (not integer) division
            ELSE i.current_stock / (pv.total_sold_3mo / 90.0)
        END as days_of_inventory,
        i.reorder_point,
        i.reorder_quantity
    FROM product_velocity pv
    JOIN inventory i ON pv.product_id = i.product_id
)
SELECT
    product_name,
    current_stock,
    ROUND(days_of_inventory, 1) as days_supply,
    CASE
        WHEN days_of_inventory < 30 THEN 'Urgent Reorder'
        WHEN days_of_inventory < 60 THEN 'Plan Reorder'
        WHEN days_of_inventory > 180 THEN 'Overstock Risk'
        ELSE 'Normal'
    END as inventory_status,
    reorder_quantity as suggested_order
FROM inventory_status
WHERE total_sold_3mo > 0 -- Only active products
ORDER BY days_of_inventory ASC;
You've now mastered the fundamental techniques for building complex, multi-layered SQL queries using subqueries and CTEs. These tools transform unwieldy business questions into elegant, maintainable SQL that your colleagues can understand and modify.
Key takeaways:
- Subqueries nest one question inside another: scalar subqueries enable dynamic comparisons, correlated subqueries apply per-row logic, and IN/EXISTS filter on relationships.
- CTEs name intermediate results, turning multi-step analysis into readable, reusable building blocks within a single statement.
- Recursive CTEs traverse hierarchies like referral chains, but always need a termination condition.
- Watch for common traps: repeatedly executed correlated subqueries, NULL handling with NOT IN, and database-specific CTE materialization behavior.
Practice opportunities:
- Complete the customer segmentation exercise above, including the "trending_up" challenge extension.
- Rewrite one of your own multi-query workflows as a single CTE pipeline.
- Compare execution plans for a correlated subquery and its window-function equivalent on a table you work with.
Next in your SQL journey: With these complex query techniques mastered, you're ready to dive into advanced analytics with window functions, or explore query optimization and execution planning to make your sophisticated queries run efficiently at scale.
The combination of subqueries and CTEs gives you the power to answer virtually any question your data can support—all within the elegance and performance of SQL.