You're staring at a complex data problem: you need to analyze customer purchase patterns, but the data spans multiple tables and requires calculations that build on each other. Your first instinct might be to write a massive, nested query that becomes increasingly unreadable—or worse, create temporary tables that clutter your database. There's a better way.
Subqueries and Common Table Expressions (CTEs) are SQL's answer to breaking down complex analytical problems into manageable, readable pieces. While subqueries let you embed one query inside another, CTEs take this concept further by creating named, temporary result sets that you can reference multiple times within a single statement. Think of CTEs as creating temporary "views" that exist only for the duration of your query.
By the end of this lesson, you'll be writing sophisticated analytical queries that would have seemed impossible before, all while keeping your code clean and maintainable.
What you'll learn:
You should be comfortable with basic SQL operations (SELECT, JOIN, GROUP BY, aggregate functions) and understand fundamental database concepts like primary keys and relationships. If you can write a multi-table JOIN with filtering and grouping, you're ready for this lesson.
A subquery is simply a query nested inside another query. While this sounds straightforward, subqueries unlock powerful analytical capabilities that would otherwise require multiple steps or complex joins.
Let's work with a realistic e-commerce dataset to explore these concepts:
-- Sample tables structure for our examples
-- customers: customer_id, customer_name, registration_date, tier (bronze/silver/gold)
-- orders: order_id, customer_id, order_date, total_amount, status
-- order_items: order_item_id, order_id, product_id, quantity, unit_price
-- products: product_id, product_name, category, supplier_id
The simplest subqueries return a single value, which you can use in WHERE clauses or SELECT statements:
-- Find customers who spent more than the average order amount
SELECT customer_id, customer_name
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.total_amount > (
SELECT AVG(total_amount)
FROM orders
WHERE status = 'completed'
)
);
This query uses two subquery techniques: a scalar subquery to calculate the average order amount, and an EXISTS subquery to filter customers. The scalar subquery (SELECT AVG(total_amount) FROM orders WHERE status = 'completed') runs once and returns a single number.
Performance Tip: Scalar subqueries in WHERE clauses are evaluated once per query execution, making them efficient for filtering operations.
The key distinction in subqueries is whether they reference the outer query:
-- Non-correlated: runs once, independent of outer query
SELECT product_name, category
FROM products
WHERE category IN (
SELECT DISTINCT category
FROM products
WHERE supplier_id = 100
);
-- Correlated: runs once for each row in outer query
SELECT c.customer_name, c.tier,
(SELECT COUNT(*)
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.order_date >= '2023-01-01') as orders_this_year
FROM customers c
WHERE c.tier = 'gold';
The correlated subquery accesses c.customer_id from the outer query, so it must execute once for each customer. This can be expensive but provides row-by-row calculations that would be difficult to achieve otherwise.
Subqueries excel at filtering based on aggregated or calculated conditions:
-- Find products that have above-average sales volume
WITH product_sales AS (
SELECT p.product_id, p.product_name, p.category,
SUM(oi.quantity) as total_sold
FROM products p
JOIN order_items oi ON p.product_id = oi.product_id
JOIN orders o ON oi.order_id = o.order_id
WHERE o.status = 'completed'
GROUP BY p.product_id, p.product_name, p.category
)
SELECT product_name, category, total_sold
FROM product_sales
WHERE total_sold > (
SELECT AVG(total_sold)
FROM product_sales
);
Wait—this example jumped ahead to CTEs! That's because once you understand the concept, you'll find CTEs often provide cleaner solutions than nested subqueries.
CTEs solve the readability problem that complex subqueries create. They let you define named result sets at the beginning of your query, then reference them multiple times:
WITH high_value_customers AS (
SELECT customer_id, customer_name, tier,
SUM(total_amount) as lifetime_value
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.status = 'completed'
GROUP BY customer_id, customer_name, tier
HAVING SUM(total_amount) > 10000
),
recent_orders AS (
SELECT customer_id, COUNT(*) as recent_order_count,
AVG(total_amount) as avg_recent_amount
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '90 days'
AND status = 'completed'
GROUP BY customer_id
)
SELECT hvc.customer_name, hvc.tier, hvc.lifetime_value,
COALESCE(ro.recent_order_count, 0) as recent_orders,
COALESCE(ro.avg_recent_amount, 0) as avg_recent_amount
FROM high_value_customers hvc
LEFT JOIN recent_orders ro ON hvc.customer_id = ro.customer_id
ORDER BY hvc.lifetime_value DESC;
This query defines two CTEs and then joins them together. Try writing this with traditional subqueries and you'll quickly appreciate the clarity CTEs provide.
CTEs can reference earlier CTEs in the same statement, enabling step-by-step analysis:
WITH monthly_sales AS (
SELECT DATE_TRUNC('month', order_date) as month,
SUM(total_amount) as monthly_revenue
FROM orders
WHERE status = 'completed'
GROUP BY DATE_TRUNC('month', order_date)
),
sales_with_growth AS (
SELECT month, monthly_revenue,
LAG(monthly_revenue) OVER (ORDER BY month) as prev_month_revenue
FROM monthly_sales
),
growth_analysis AS (
SELECT month, monthly_revenue, prev_month_revenue,
CASE
WHEN prev_month_revenue IS NULL THEN NULL
ELSE ROUND(((monthly_revenue - prev_month_revenue) / prev_month_revenue * 100), 2)
END as growth_rate
FROM sales_with_growth
)
SELECT month, monthly_revenue, growth_rate,
AVG(growth_rate) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as avg_growth_3_months
FROM growth_analysis
WHERE month >= '2023-01-01'
ORDER BY month;
Each CTE builds on the previous ones: calculating monthly sales, adding previous month data, computing growth rates, and finally calculating rolling averages.
Recursive CTEs handle tree-like data structures, such as organizational charts, category hierarchies, or network relationships:
-- Assuming we have a category hierarchy: categories table with parent_category_id
WITH RECURSIVE category_hierarchy AS (
-- Base case: top-level categories
SELECT category_id, category_name, parent_category_id,
category_name as full_path, 0 as level
FROM categories
WHERE parent_category_id IS NULL
UNION ALL
-- Recursive case: child categories
SELECT c.category_id, c.category_name, c.parent_category_id,
ch.full_path || ' > ' || c.category_name as full_path,
ch.level + 1 as level
FROM categories c
JOIN category_hierarchy ch ON c.parent_category_id = ch.category_id
)
SELECT category_id, category_name, full_path, level
FROM category_hierarchy
ORDER BY full_path;
This recursive CTE starts with top-level categories (base case) and then recursively finds all child categories, building the complete hierarchy path and tracking depth levels.
Warning: Always include a termination condition in recursive CTEs to prevent infinite loops. Most databases also have recursion depth limits as a safety measure.
CTEs work beautifully with window functions for advanced analytics:
WITH customer_order_patterns AS (
SELECT c.customer_id, c.customer_name, c.tier,
o.order_id, o.order_date, o.total_amount,
ROW_NUMBER() OVER (PARTITION BY c.customer_id ORDER BY o.order_date) as order_sequence,
COUNT(*) OVER (PARTITION BY c.customer_id) as total_orders,
SUM(o.total_amount) OVER (PARTITION BY c.customer_id ORDER BY o.order_date
ROWS UNBOUNDED PRECEDING) as running_total
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.status = 'completed'
),
customer_segments AS (
SELECT customer_id, customer_name, tier, total_orders,
MAX(running_total) as lifetime_value,
CASE
WHEN MAX(running_total) > 20000 THEN 'VIP'
WHEN MAX(running_total) > 5000 THEN 'Premium'
WHEN COUNT(*) > 10 THEN 'Frequent'
ELSE 'Standard'
END as customer_segment
FROM customer_order_patterns
GROUP BY customer_id, customer_name, tier, total_orders
)
SELECT customer_segment, COUNT(*) as customer_count,
AVG(lifetime_value) as avg_lifetime_value,
AVG(total_orders) as avg_order_count
FROM customer_segments
GROUP BY customer_segment
ORDER BY avg_lifetime_value DESC;
This example combines CTEs with window functions to analyze customer ordering patterns and create behavioral segments.
Understanding when to use subqueries vs CTEs vs other approaches is crucial for production systems.
Subqueries are often optimal for:
-- Efficient: uses index on order_date and customer_id
SELECT customer_name
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.order_date >= '2023-01-01'
);
CTEs provide advantages for:
Different databases handle CTE materialization differently:
-- This CTE might be materialized (good) or inlined (potentially bad)
WITH expensive_calculation AS (
SELECT product_id,
SUM(quantity * unit_price) as total_revenue,
COUNT(DISTINCT order_id) as order_count,
-- Complex analytical functions here
FROM order_items oi
JOIN orders o ON oi.order_id = o.order_id
WHERE o.status = 'completed'
GROUP BY product_id
)
SELECT * FROM expensive_calculation WHERE total_revenue > 10000
UNION ALL
SELECT * FROM expensive_calculation WHERE order_count > 100;
In PostgreSQL, this CTE is typically materialized once and reused. In other systems, it might be executed twice. Know your database's behavior for production queries.
Performance Tip: For databases that don't automatically materialize CTEs, consider using temporary tables for expensive calculations that are referenced multiple times.
Let's build a comprehensive customer analysis that combines multiple techniques:
Scenario: Create a report showing customer retention analysis with the following requirements:
Here's the solution broken down:
WITH customer_first_purchase AS (
-- Find each customer's first purchase date
SELECT customer_id, MIN(order_date) as first_purchase_date
FROM orders
WHERE status = 'completed'
GROUP BY customer_id
),
monthly_cohorts AS (
-- Group customers into monthly cohorts
SELECT customer_id,
DATE_TRUNC('month', first_purchase_date) as cohort_month,
first_purchase_date
FROM customer_first_purchase
),
customer_monthly_activity AS (
-- Track which months each customer was active
SELECT o.customer_id,
DATE_TRUNC('month', o.order_date) as activity_month,
SUM(o.total_amount) as monthly_spend,
COUNT(*) as monthly_orders
FROM orders o
WHERE o.status = 'completed'
AND o.order_date >= CURRENT_DATE - INTERVAL '12 months'
GROUP BY o.customer_id, DATE_TRUNC('month', o.order_date)
),
cohort_analysis AS (
-- Calculate months since first purchase for each activity
SELECT mc.customer_id, mc.cohort_month,
cma.activity_month, cma.monthly_spend, cma.monthly_orders,
EXTRACT(MONTH FROM AGE(cma.activity_month, mc.cohort_month)) as months_since_first
FROM monthly_cohorts mc
JOIN customer_monthly_activity cma ON mc.customer_id = cma.customer_id
),
retention_rates AS (
-- Calculate retention rates by cohort
SELECT cohort_month, months_since_first,
COUNT(DISTINCT customer_id) as active_customers,
FIRST_VALUE(COUNT(DISTINCT customer_id)) OVER (
PARTITION BY cohort_month ORDER BY months_since_first
) as cohort_size,
ROUND(
COUNT(DISTINCT customer_id) * 100.0 /
FIRST_VALUE(COUNT(DISTINCT customer_id)) OVER (
PARTITION BY cohort_month ORDER BY months_since_first
), 2
) as retention_rate
FROM cohort_analysis
GROUP BY cohort_month, months_since_first
),
churn_risk_customers AS (
-- Identify customers at risk of churning
SELECT c.customer_id, c.customer_name,
mc.cohort_month,
MAX(o.order_date) as last_order_date,
CURRENT_DATE - MAX(o.order_date) as days_since_last_order,
COUNT(o.order_id) as total_orders,
SUM(o.total_amount) as lifetime_value
FROM customers c
JOIN monthly_cohorts mc ON c.customer_id = mc.customer_id
LEFT JOIN orders o ON c.customer_id = o.customer_id AND o.status = 'completed'
GROUP BY c.customer_id, c.customer_name, mc.cohort_month
HAVING MAX(o.order_date) < CURRENT_DATE - INTERVAL '60 days'
AND COUNT(o.order_id) >= 3 -- Only customers who were previously active
)
-- Main query: combine insights
SELECT 'Retention Analysis' as report_section,
cohort_month::text as metric_name,
retention_rate::text as metric_value
FROM retention_rates
WHERE months_since_first <= 6
UNION ALL
SELECT 'Churn Risk Customers' as report_section,
customer_name as metric_name,
(days_since_last_order || ' days since last order')::text as metric_value
FROM churn_risk_customers
ORDER BY report_section, metric_name;
This query demonstrates several advanced concepts:
-- Inefficient: correlated subquery runs for every row
SELECT c.customer_name,
(SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id) as order_count
FROM customers c;
-- Better: use a JOIN or CTE
WITH customer_orders AS (
SELECT customer_id, COUNT(*) as order_count
FROM orders
GROUP BY customer_id
)
SELECT c.customer_name, COALESCE(co.order_count, 0) as order_count
FROM customers c
LEFT JOIN customer_orders co ON c.customer_id = co.customer_id;
-- Problem: subquery returns NULL, causing unexpected results
SELECT product_name
FROM products
WHERE category = (SELECT category FROM products WHERE product_id = 999); -- Might not exist
-- Better: handle potential NULLs
SELECT product_name
FROM products
WHERE category = (
SELECT category
FROM products
WHERE product_id = 999
) AND EXISTS (SELECT 1 FROM products WHERE product_id = 999);
-- Dangerous: no proper termination condition
WITH RECURSIVE bad_example AS (
SELECT 1 as n
UNION ALL
SELECT n + 1 FROM bad_example -- Will run forever!
)
SELECT * FROM bad_example;
-- Safe: include termination condition
WITH RECURSIVE safe_example AS (
SELECT 1 as n
UNION ALL
SELECT n + 1 FROM safe_example WHERE n < 100
)
SELECT * FROM safe_example;
When your CTE-based queries aren't returning expected results:
-- Debug approach: test one CTE at a time
WITH debug_step AS (
SELECT customer_id, COUNT(*) as order_count
FROM orders
WHERE status = 'completed'
GROUP BY customer_id
)
SELECT * FROM debug_step LIMIT 10; -- Examine this result first
You've now mastered the tools for breaking down complex analytical problems into manageable pieces. Subqueries give you the power to embed calculations and filters within larger queries, while CTEs provide a clean, readable way to structure multi-step analysis.
Key takeaways:
What's next: With these techniques mastered, you're ready for advanced topics like query optimization, database-specific features (like PostgreSQL's LATERAL joins or SQL Server's APPLY operators), and more sophisticated analytical functions. You're also prepared to tackle real-world data engineering challenges where complex transformations and analysis are essential.
Practice these patterns with your own datasets, and you'll find that problems that once seemed impossibly complex become straightforward exercises in breaking down requirements into logical steps.
Learning Path: SQL Fundamentals