
You're staring at a complex business question: "Which customers made their largest purchase in Q4 2023, and how does that purchase compare to their historical average?" Your first instinct might be to export data to Excel or write multiple queries and manually piece together the results. But experienced SQL practitioners know there's a more elegant path—one that keeps all the logic in SQL and makes your analysis both readable and maintainable.
This is where subqueries and Common Table Expressions (CTEs) transform from mere syntax features into powerful analytical weapons. They let you break down complex problems into digestible chunks, create reusable intermediate results, and build sophisticated analyses that would otherwise require multiple round trips to the database or complex application logic.
By mastering these techniques, you'll write queries that colleagues can actually understand, debug analyses faster, and tackle business questions that seemed impossibly complex. More importantly, you'll think differently about data problems—seeing them as a series of logical steps rather than monolithic challenges.
Prerequisites:
You should be comfortable with intermediate SQL concepts including joins, window functions, and aggregate operations. Experience with query execution plans and basic performance tuning will help you understand the optimization discussions, though we'll explain the key concepts as we go.
Before diving into syntax, let's establish how subqueries and CTEs fit into the SQL execution model. This foundation will guide our decisions throughout the lesson.
A subquery is a query nested inside another query. It can appear in SELECT, FROM, WHERE, or HAVING clauses, and the database engine decides how to execute it—sometimes materializing results, sometimes transforming it into joins, sometimes executing it repeatedly.
A CTE is a named temporary result set that exists for the duration of a single query. Think of it as a way to define a view that only lives for one statement. CTEs make complex queries readable by letting you build up results step by step.
Here's a concrete example that illustrates the difference. Suppose we're analyzing customer purchase patterns in an e-commerce database:
-- Schema context
-- customers: customer_id, name, registration_date, tier
-- orders: order_id, customer_id, order_date, total_amount
-- order_items: order_id, product_id, quantity, unit_price
Using a subquery approach:
SELECT
c.name,
c.tier,
(SELECT COUNT(*)
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.order_date >= '2023-01-01') as orders_2023,
(SELECT AVG(total_amount)
FROM orders o
WHERE o.customer_id = c.customer_id) as avg_order_value
FROM customers c
WHERE c.tier = 'premium'
AND EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.total_amount > 1000
);
Using a CTE approach:
WITH customer_stats AS (
SELECT
customer_id,
COUNT(CASE WHEN order_date >= '2023-01-01' THEN 1 END) as orders_2023,
AVG(total_amount) as avg_order_value,
MAX(total_amount) as max_order_value
FROM orders
GROUP BY customer_id
)
SELECT
c.name,
c.tier,
cs.orders_2023,
cs.avg_order_value
FROM customers c
JOIN customer_stats cs ON c.customer_id = cs.customer_id
WHERE c.tier = 'premium'
AND cs.max_order_value > 1000;
The CTE version has several advantages: it calculates statistics once per customer rather than multiple times, it's easier to modify and debug, and it makes the main query's logic clearer. However, the subquery version might perform better in some database systems if only a few customers meet the criteria, since it can short-circuit calculations.
Subqueries come in several flavors, each with distinct performance characteristics and use cases. Let's explore them systematically.
Scalar subqueries return a single value and can appear anywhere an expression is valid. They're particularly useful for calculations that depend on aggregate values from other tables.
-- Find customers whose latest order was above their personal average
SELECT
c.name,
latest_order.order_date,
latest_order.total_amount,
(SELECT AVG(total_amount)
FROM orders o2
WHERE o2.customer_id = c.customer_id) as personal_avg
FROM customers c
JOIN (
SELECT DISTINCT customer_id,
FIRST_VALUE(order_date) OVER (PARTITION BY customer_id ORDER BY order_date DESC) as order_date,
FIRST_VALUE(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date DESC) as total_amount
FROM orders
) latest_order ON c.customer_id = latest_order.customer_id
WHERE latest_order.total_amount > (
SELECT AVG(total_amount)
FROM orders o3
WHERE o3.customer_id = c.customer_id
);
Notice we're calculating the personal average twice—once in SELECT and once in WHERE. This is inefficient, but sometimes necessary when the same calculation appears in different clauses. A CTE would solve this duplication.
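To make the fix concrete, here's a minimal sketch of the CTE rewrite, run against a toy SQLite database via Python's sqlite3 module. The table and sample data are invented, and the sketch assumes each customer's most recent order date is unique:

```python
import sqlite3

# Invented stand-in for the lesson's orders table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer_id INTEGER, order_date TEXT, total_amount REAL);
INSERT INTO orders VALUES
  (1, '2023-01-05', 100), (1, '2023-06-01', 300),   -- latest 300 vs avg 200
  (2, '2023-02-10', 500), (2, '2023-07-15', 100);   -- latest 100 vs avg 300
""")

# The CTE computes each customer's personal average and latest order date
# exactly once, so the aggregate is no longer repeated in SELECT and WHERE.
rows = conn.execute("""
WITH per_customer AS (
    SELECT customer_id,
           AVG(total_amount) AS personal_avg,
           MAX(order_date)   AS latest_date   -- assumes one order per date
    FROM orders
    GROUP BY customer_id
)
SELECT o.customer_id, o.total_amount, pc.personal_avg
FROM orders o
JOIN per_customer pc
  ON o.customer_id = pc.customer_id AND o.order_date = pc.latest_date
WHERE o.total_amount > pc.personal_avg
""").fetchall()
print(rows)  # only customer 1's latest order (300) beats their average (200)
```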
Correlated subqueries reference columns from the outer query. They're powerful but potentially expensive since they execute once for each row in the outer query.
-- Find products that are above average price in their category
SELECT
p.product_name,
p.category,
p.price,
(SELECT AVG(price) FROM products p2 WHERE p2.category = p.category) as category_avg
FROM products p
WHERE p.price > (
SELECT AVG(price)
FROM products p2
WHERE p2.category = p.category
);
Logically, this query runs the category-average calculation twice per product—once in SELECT and once in WHERE. Modern optimizers often transform correlated subqueries like these into window functions or joins, but understanding the logical execution helps you write better queries.
Here's the equivalent using window functions:
WITH product_stats AS (
SELECT
product_name,
category,
price,
AVG(price) OVER (PARTITION BY category) as category_avg
FROM products
)
SELECT
product_name,
category,
price,
category_avg
FROM product_stats
WHERE price > category_avg;
The window function version calculates each category average once, making it more efficient for large datasets.
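Both forms return the same rows. A quick way to convince yourself is to run them side by side on a toy SQLite table via Python's sqlite3 (sample data invented; window functions require SQLite 3.25 or newer):

```python
import sqlite3

# Invented stand-in for the lesson's products table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_name TEXT, category TEXT, price REAL);
INSERT INTO products VALUES
  ('A', 'toys', 10), ('B', 'toys', 30),
  ('C', 'books', 5), ('D', 'books', 15), ('E', 'books', 40);
""")

# Correlated subquery: logically one average per outer row.
correlated = conn.execute("""
SELECT product_name FROM products p
WHERE p.price > (SELECT AVG(price) FROM products p2 WHERE p2.category = p.category)
ORDER BY product_name
""").fetchall()

# Window-function version: one average per category partition.
windowed = conn.execute("""
WITH product_stats AS (
    SELECT product_name, price,
           AVG(price) OVER (PARTITION BY category) AS category_avg
    FROM products
)
SELECT product_name FROM product_stats
WHERE price > category_avg
ORDER BY product_name
""").fetchall()

print(correlated, windowed)  # both list products priced above their category average
```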
EXISTS subqueries are optimized for boolean logic—they stop executing as soon as they find a matching row. This makes them excellent for filtering.
-- Customers who have never made a purchase over $500
SELECT c.name, c.registration_date
FROM customers c
WHERE NOT EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.total_amount > 500
);
-- Customers who made purchases in every quarter of 2023
SELECT c.name
FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id AND EXTRACT(quarter FROM o.order_date) = 1 AND EXTRACT(year FROM o.order_date) = 2023)
AND EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id AND EXTRACT(quarter FROM o.order_date) = 2 AND EXTRACT(year FROM o.order_date) = 2023)
AND EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id AND EXTRACT(quarter FROM o.order_date) = 3 AND EXTRACT(year FROM o.order_date) = 2023)
AND EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id AND EXTRACT(quarter FROM o.order_date) = 4 AND EXTRACT(year FROM o.order_date) = 2023);
The multiple EXISTS approach works but is verbose. A more compact alternative uses a single scalar subquery that counts the distinct quarters in which each customer ordered:
-- Compact version: one scalar subquery counting distinct quarters
SELECT c.name
FROM customers c
WHERE (
SELECT COUNT(DISTINCT EXTRACT(quarter FROM o.order_date))
FROM orders o
WHERE o.customer_id = c.customer_id
AND EXTRACT(year FROM o.order_date) = 2023
) = 4;
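The same distinct-quarter counting idea can be run end to end in SQLite through Python's sqlite3 module. SQLite lacks EXTRACT, so this sketch derives the quarter from strftime('%m') instead (names and data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, name TEXT);
CREATE TABLE orders (customer_id INTEGER, order_date TEXT);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO orders VALUES                      -- Ada buys in all four quarters
  (1, '2023-02-01'), (1, '2023-05-01'), (1, '2023-08-01'), (1, '2023-11-01'),
  (2, '2023-03-01'), (2, '2023-12-01');        -- Grace only in Q1 and Q4
""")

# Same scalar-subquery shape as above: count distinct quarters, compare to 4.
# (month + 2) / 3 maps months 1-12 onto quarters 1-4 via integer division.
rows = conn.execute("""
SELECT c.name
FROM customers c
WHERE (
    SELECT COUNT(DISTINCT (CAST(strftime('%m', o.order_date) AS INTEGER) + 2) / 3)
    FROM orders o
    WHERE o.customer_id = c.customer_id
      AND strftime('%Y', o.order_date) = '2023'
) = 4
""").fetchall()
print(rows)  # only Ada purchased in every quarter of 2023
```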
Subqueries in FROM clauses create derived tables—temporary result sets you can join with other tables. These are often the most readable way to handle multi-step calculations.
-- Monthly customer acquisition and their first purchase patterns
SELECT
acq.registration_month,
acq.new_customers,
fp.avg_days_to_first_purchase,
fp.avg_first_purchase_amount
FROM (
SELECT
DATE_TRUNC('month', registration_date) as registration_month,
COUNT(*) as new_customers
FROM customers
GROUP BY DATE_TRUNC('month', registration_date)
) acq
LEFT JOIN (
SELECT
DATE_TRUNC('month', c.registration_date) as registration_month,
AVG(o.order_date - c.registration_date) as avg_days_to_first_purchase,
AVG(o.total_amount) as avg_first_purchase_amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN (
SELECT customer_id, MIN(order_date) as first_order_date
FROM orders
GROUP BY customer_id
) first_orders ON o.customer_id = first_orders.customer_id
AND o.order_date = first_orders.first_order_date
GROUP BY DATE_TRUNC('month', c.registration_date)
) fp ON acq.registration_month = fp.registration_month
ORDER BY acq.registration_month;
This query has three levels of nesting and multiple derived tables. It works, but it's getting hard to follow. This is where CTEs shine.
CTEs transform the previous query into something much more maintainable:
WITH monthly_acquisitions AS (
SELECT
DATE_TRUNC('month', registration_date) as registration_month,
COUNT(*) as new_customers
FROM customers
GROUP BY DATE_TRUNC('month', registration_date)
),
first_orders AS (
SELECT
customer_id,
MIN(order_date) as first_order_date
FROM orders
GROUP BY customer_id
),
first_purchase_stats AS (
SELECT
DATE_TRUNC('month', c.registration_date) as registration_month,
AVG(o.order_date - c.registration_date) as avg_days_to_first_purchase,
AVG(o.total_amount) as avg_first_purchase_amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN first_orders fo ON o.customer_id = fo.customer_id
AND o.order_date = fo.first_order_date
GROUP BY DATE_TRUNC('month', c.registration_date)
)
SELECT
ma.registration_month,
ma.new_customers,
fps.avg_days_to_first_purchase,
fps.avg_first_purchase_amount
FROM monthly_acquisitions ma
LEFT JOIN first_purchase_stats fps ON ma.registration_month = fps.registration_month
ORDER BY ma.registration_month;
Now each logical step is clearly defined and reusable. You can easily add more analysis by creating additional CTEs that build on these foundations.
CTEs can reference earlier CTEs in the same statement, letting you build complex analyses incrementally:
WITH customer_segments AS (
SELECT
customer_id,
CASE
WHEN COUNT(*) >= 10 THEN 'frequent'
WHEN COUNT(*) >= 5 THEN 'regular'
ELSE 'occasional'
END as segment,
SUM(total_amount) as lifetime_value,
AVG(total_amount) as avg_order_value,
MIN(order_date) as first_order,
MAX(order_date) as last_order
FROM orders
GROUP BY customer_id
),
segment_benchmarks AS (
SELECT
segment,
AVG(lifetime_value) as avg_lifetime_value,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY lifetime_value) as median_lifetime_value,
PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY lifetime_value) as p90_lifetime_value
FROM customer_segments
GROUP BY segment
),
customer_analysis AS (
SELECT
cs.*,
sb.avg_lifetime_value as segment_avg_ltv,
sb.median_lifetime_value as segment_median_ltv,
CASE
WHEN cs.lifetime_value >= sb.p90_lifetime_value THEN 'top_10_percent'
WHEN cs.lifetime_value >= sb.median_lifetime_value THEN 'above_median'
ELSE 'below_median'
END as segment_performance,
(CURRENT_DATE - cs.last_order) as days_since_last_order
FROM customer_segments cs
JOIN segment_benchmarks sb ON cs.segment = sb.segment
)
SELECT
segment,
segment_performance,
COUNT(*) as customer_count,
AVG(lifetime_value) as avg_ltv,
AVG(days_since_last_order) as avg_days_inactive,
COUNT(CASE WHEN days_since_last_order > 90 THEN 1 END) as at_risk_customers
FROM customer_analysis
GROUP BY segment, segment_performance
ORDER BY segment, segment_performance;
This analysis creates customer segments, calculates benchmarks for each segment, classifies customers within their segments, then summarizes the results. Each step builds on the previous ones, making the logic easy to follow and modify.
Performance Tip: Many database systems materialize CTE results, meaning they're calculated once and stored temporarily. This can be more efficient than repeated subqueries, but it also means a materialized CTE is fully computed even if only part of its result is used. Behavior varies by engine and version, so check the execution plan rather than assuming.
Recursive CTEs handle hierarchical or graph-like data by repeatedly applying the same logic until no new rows are generated. They're essential for organizational charts, bill-of-materials, social networks, and similar structures.
-- Build organizational hierarchy from employee table
-- employees: employee_id, name, manager_id, department, salary
WITH RECURSIVE org_hierarchy AS (
-- Base case: top-level managers (no manager_id)
SELECT
employee_id,
name,
manager_id,
department,
salary,
0 as level,
CAST(name AS VARCHAR(1000)) as hierarchy_path
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive case: employees with managers
SELECT
e.employee_id,
e.name,
e.manager_id,
e.department,
e.salary,
oh.level + 1,
CAST(oh.hierarchy_path || ' > ' || e.name AS VARCHAR(1000))
FROM employees e
JOIN org_hierarchy oh ON e.manager_id = oh.employee_id
)
SELECT
level,
REPEAT(' ', level) || name as indented_name,
hierarchy_path,
department,
salary,
COUNT(*) OVER (PARTITION BY level) as peers_at_level
FROM org_hierarchy
ORDER BY hierarchy_path;
This query starts with top-level managers, then repeatedly finds their direct reports, building the complete organizational tree. The hierarchy_path column shows the chain of command, and level indicates how many management layers down each employee sits.
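SQLite also supports WITH RECURSIVE, so the traversal can be exercised from Python's sqlite3 module. This condensed sketch uses an invented four-person org chart:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (employee_id INTEGER, name TEXT, manager_id INTEGER);
INSERT INTO employees VALUES
  (1, 'CEO', NULL),
  (2, 'VP Sales', 1), (3, 'VP Eng', 1),
  (4, 'Engineer', 3);
""")

# Base case seeds the top-level manager; the recursive member repeatedly
# attaches direct reports, extending the hierarchy path one level at a time.
rows = conn.execute("""
WITH RECURSIVE org_hierarchy AS (
    SELECT employee_id, name, 0 AS level, name AS path
    FROM employees WHERE manager_id IS NULL
    UNION ALL
    SELECT e.employee_id, e.name, oh.level + 1, oh.path || ' > ' || e.name
    FROM employees e
    JOIN org_hierarchy oh ON e.manager_id = oh.employee_id
)
SELECT level, path FROM org_hierarchy ORDER BY path
""").fetchall()
for level, path in rows:
    print('  ' * level + path)  # indented chain of command
```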
Recursive CTEs can also solve more complex problems like finding shortest paths or calculating running totals with dependencies:
-- Calculate cumulative impact of product recommendations
-- recommendations: from_product_id, to_product_id, influence_score
-- sales: product_id, sales_amount
WITH RECURSIVE recommendation_impact AS (
-- Base case: direct product sales
SELECT
s.product_id,
s.product_id as source_product_id,
s.sales_amount as direct_impact,
s.sales_amount as total_impact,
0 as depth
FROM sales s
UNION ALL
-- Recursive case: indirect impact through recommendations
SELECT
ri.product_id,
s.product_id as source_product_id,
s.sales_amount * r.influence_score as direct_impact,
ri.total_impact + (s.sales_amount * r.influence_score) as total_impact,
ri.depth + 1
FROM recommendation_impact ri
JOIN recommendations r ON ri.source_product_id = r.from_product_id
JOIN sales s ON r.to_product_id = s.product_id
WHERE ri.depth < 3 -- Prevent infinite recursion
)
SELECT
product_id,
SUM(direct_impact) as total_direct_sales,
SUM(total_impact) as total_influence_impact,
COUNT(DISTINCT source_product_id) as influenced_products
FROM recommendation_impact
GROUP BY product_id
ORDER BY total_influence_impact DESC;
Warning: Always include a termination condition in recursive CTEs. Most databases have recursion limits, but runaway recursion can consume significant resources before hitting those limits.
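The role of the termination condition is easy to see with a tiny counter: the recursive member keeps producing rows only while the WHERE clause holds. This sketch uses SQLite via Python's sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# WHERE n < 5 is the termination condition: once it fails, the recursive
# step yields no new rows and the recursion stops. Without it, this CTE
# would generate rows forever (or until the engine's recursion limit).
rows = conn.execute("""
WITH RECURSIVE counter(n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM counter WHERE n < 5
)
SELECT n FROM counter
""").fetchall()
print(rows)  # [(1,), (2,), (3,), (4,), (5,)]
```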
CTEs work beautifully with window functions, letting you create sophisticated analytical queries that would be nearly impossible with subqueries alone.
-- Customer lifecycle analysis with cohort retention
WITH customer_cohorts AS (
SELECT
customer_id,
DATE_TRUNC('month', MIN(order_date)) as cohort_month
FROM orders
GROUP BY customer_id
),
monthly_activity AS (
SELECT
o.customer_id,
cc.cohort_month,
DATE_TRUNC('month', o.order_date) as activity_month,
SUM(o.total_amount) as monthly_revenue
FROM orders o
JOIN customer_cohorts cc ON o.customer_id = cc.customer_id
GROUP BY o.customer_id, cc.cohort_month, DATE_TRUNC('month', o.order_date)
),
cohort_sizes AS (
SELECT
cohort_month,
COUNT(DISTINCT customer_id) as cohort_size
FROM customer_cohorts
GROUP BY cohort_month
),
retention_data AS (
SELECT
ma.cohort_month,
ma.activity_month,
EXTRACT(year FROM age(ma.activity_month, ma.cohort_month)) * 12 + EXTRACT(month FROM age(ma.activity_month, ma.cohort_month)) as months_since_first,
COUNT(DISTINCT ma.customer_id) as active_customers,
SUM(ma.monthly_revenue) as cohort_revenue
FROM monthly_activity ma
GROUP BY ma.cohort_month, ma.activity_month
)
SELECT
rd.cohort_month,
rd.months_since_first,
cs.cohort_size,
rd.active_customers,
ROUND(100.0 * rd.active_customers / cs.cohort_size, 2) as retention_rate,
rd.cohort_revenue,
SUM(rd.cohort_revenue) OVER (
PARTITION BY rd.cohort_month
ORDER BY rd.months_since_first
) as cumulative_revenue_per_cohort
FROM retention_data rd
JOIN cohort_sizes cs ON rd.cohort_month = cs.cohort_month
ORDER BY rd.cohort_month, rd.months_since_first;
This analysis tracks how customers from each monthly cohort behave over time, showing retention rates and cumulative revenue generation. Each CTE handles one logical step, making the complex analysis manageable.
CTEs can create dynamic pivot tables by building the query structure programmatically:
-- Product performance across different time periods
WITH date_periods AS (
SELECT
'last_30_days' as period,
CURRENT_DATE - INTERVAL '30 days' as start_date,
CURRENT_DATE as end_date
UNION ALL
SELECT
'last_90_days',
CURRENT_DATE - INTERVAL '90 days',
CURRENT_DATE
UNION ALL
SELECT
'last_365_days',
CURRENT_DATE - INTERVAL '365 days',
CURRENT_DATE
),
product_performance AS (
SELECT
p.product_id,
p.product_name,
p.category,
dp.period,
COUNT(DISTINCT o.order_id) as total_orders,
SUM(CASE WHEN o.order_id IS NOT NULL THEN oi.quantity ELSE 0 END) as total_quantity,
SUM(CASE WHEN o.order_id IS NOT NULL THEN oi.quantity * oi.unit_price ELSE 0 END) as total_revenue
FROM products p
CROSS JOIN date_periods dp
LEFT JOIN order_items oi ON p.product_id = oi.product_id
LEFT JOIN orders o ON oi.order_id = o.order_id
AND o.order_date >= dp.start_date
AND o.order_date < dp.end_date
GROUP BY p.product_id, p.product_name, p.category, dp.period
)
SELECT
product_name,
category,
SUM(CASE WHEN period = 'last_30_days' THEN total_revenue ELSE 0 END) as revenue_30d,
SUM(CASE WHEN period = 'last_90_days' THEN total_revenue ELSE 0 END) as revenue_90d,
SUM(CASE WHEN period = 'last_365_days' THEN total_revenue ELSE 0 END) as revenue_365d,
SUM(CASE WHEN period = 'last_30_days' THEN total_orders ELSE 0 END) as orders_30d,
SUM(CASE WHEN period = 'last_90_days' THEN total_orders ELSE 0 END) as orders_90d,
SUM(CASE WHEN period = 'last_365_days' THEN total_orders ELSE 0 END) as orders_365d
FROM product_performance
GROUP BY product_id, product_name, category
ORDER BY revenue_90d DESC;
This pattern lets you analyze the same metrics across different time windows in a single query, making it easy to spot trends and seasonal patterns.
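The conditional-aggregation pivot at the heart of that query can be seen in miniature below, run in SQLite via Python's sqlite3 (table and figures invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (product TEXT, period TEXT, revenue REAL);
INSERT INTO sales VALUES
  ('widget', '30d', 100), ('widget', '90d', 250),
  ('gadget', '30d', 40),  ('gadget', '90d', 90);
""")

# Each CASE expression picks out one period's rows; SUM collapses them into
# a column, turning long-format rows into one wide row per product.
rows = conn.execute("""
SELECT product,
       SUM(CASE WHEN period = '30d' THEN revenue ELSE 0 END) AS revenue_30d,
       SUM(CASE WHEN period = '90d' THEN revenue ELSE 0 END) AS revenue_90d
FROM sales
GROUP BY product
ORDER BY product
""").fetchall()
print(rows)  # one row per product, one column per time window
```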
Understanding when CTEs and subqueries perform well is crucial for production systems. Let's examine the key factors and optimization strategies.
Different databases handle CTEs and subqueries differently. PostgreSQL materialized every CTE before version 12; since version 12 it inlines CTEs that are referenced only once, unless you write AS MATERIALIZED. SQL Server and Oracle are similarly flexible, often inlining simple CTEs as subqueries.
-- This CTE will likely be materialized
WITH expensive_calculation AS (
SELECT
customer_id,
COUNT(*) as order_count,
AVG(total_amount) as avg_amount,
STDDEV(total_amount) as amount_stddev,
-- Complex calculation that benefits from being done once
EXP(SUM(LN(CASE WHEN total_amount > 0 THEN total_amount ELSE 1 END))) as geometric_mean
FROM orders
WHERE order_date >= '2023-01-01'
GROUP BY customer_id
HAVING COUNT(*) >= 5
)
SELECT
c.name,
ec.order_count,
ec.avg_amount,
ec.amount_stddev
FROM customers c
JOIN expensive_calculation ec ON c.customer_id = ec.customer_id
WHERE c.tier IN ('premium', 'gold')
AND ec.avg_amount > 100;
If only 10% of customers are premium/gold, this CTE does unnecessary work. A subquery might perform better:
-- Subquery version that can be optimized with predicate pushdown
SELECT
c.name,
stats.order_count,
stats.avg_amount,
stats.amount_stddev
FROM customers c
JOIN (
SELECT
customer_id,
COUNT(*) as order_count,
AVG(total_amount) as avg_amount,
STDDEV(total_amount) as amount_stddev
FROM orders
WHERE order_date >= '2023-01-01'
GROUP BY customer_id
HAVING COUNT(*) >= 5
) stats ON c.customer_id = stats.customer_id
WHERE c.tier IN ('premium', 'gold')
AND stats.avg_amount > 100;
Modern optimizers might push the customer tier filter down to reduce the work done in the subquery.
CTEs that return large result sets can consume significant memory or spill to disk. Monitor your queries' memory usage, especially in environments with limited resources.
-- This CTE might create a very large intermediate result
WITH daily_product_stats AS (
SELECT
product_id,
order_date,
COUNT(*) as daily_orders,
SUM(quantity) as daily_quantity,
AVG(unit_price) as daily_avg_price,
-- Adding more aggregations increases memory usage
STDDEV(unit_price) as price_stddev,
MIN(unit_price) as min_price,
MAX(unit_price) as max_price
FROM order_items oi
JOIN orders o ON oi.order_id = o.order_id
WHERE o.order_date >= '2020-01-01' -- Large date range
GROUP BY product_id, order_date
)
-- If only using a small subset, consider filtering earlier
SELECT product_id, AVG(daily_orders) as avg_daily_orders
FROM daily_product_stats
WHERE product_id IN (101, 102, 103) -- Only 3 products needed
GROUP BY product_id;
For better performance, filter early:
WITH daily_product_stats AS (
SELECT
product_id,
order_date,
COUNT(*) as daily_orders,
SUM(quantity) as daily_quantity
FROM order_items oi
JOIN orders o ON oi.order_id = o.order_id
WHERE o.order_date >= '2020-01-01'
AND oi.product_id IN (101, 102, 103) -- Filter moved here
GROUP BY product_id, order_date
)
SELECT product_id, AVG(daily_orders) as avg_daily_orders
FROM daily_product_stats
GROUP BY product_id;
Correlated subqueries put heavy pressure on indexes. Make sure the columns referenced in the correlation condition are well-indexed.
-- This query needs indexes on orders(customer_id, order_date)
-- and customers(tier, registration_date)
SELECT
c.customer_id,
c.name,
(SELECT COUNT(*)
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.order_date >= '2023-01-01') as recent_orders
FROM customers c
WHERE c.tier = 'premium'
AND c.registration_date >= '2022-01-01';
Without proper indexes, this query executes a full table scan of orders for each premium customer—potentially millions of operations.
Index Tip: For correlated subqueries, create covering indexes that include all referenced columns. An index on orders(customer_id, order_date, total_amount) would serve this query better than separate indexes on each column.
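Here's a small SQLite sketch of that tip: build a covering index, run the correlated lookup, and ask the planner how it executed. The data is invented, and the exact EXPLAIN QUERY PLAN wording varies by SQLite version:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer_id INTEGER, order_date TEXT, total_amount REAL);
INSERT INTO orders VALUES
  (1, '2023-03-01', 50), (1, '2022-01-01', 20), (2, '2023-04-01', 75);
-- Covering index: correlation column first, then every other column the
-- query touches, so the lookup never has to visit the base table.
CREATE INDEX idx_orders_covering ON orders (customer_id, order_date, total_amount);
""")

count, = conn.execute(
    "SELECT COUNT(*) FROM orders "
    "WHERE customer_id = ? AND order_date >= '2023-01-01'", (1,)
).fetchone()
print(count)  # 1

# Inspect the plan; with the index in place SQLite typically reports a
# search "USING COVERING INDEX" (wording differs across versions).
for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM orders "
        "WHERE customer_id = 1 AND order_date >= '2023-01-01'"):
    print(row[-1])
```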
Let's build a comprehensive customer segmentation analysis using both subqueries and CTEs. We'll analyze a fictional e-commerce database to identify customer segments and their characteristics.
Scenario: You're analyzing customer behavior to design a targeted marketing campaign. You need to segment customers based on their purchasing patterns and identify the most valuable segments.
Database Schema:
-- Create sample tables for the exercise
CREATE TABLE customers (
customer_id INTEGER PRIMARY KEY,
name VARCHAR(100),
email VARCHAR(100),
registration_date DATE,
tier VARCHAR(20)
);
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER,
order_date DATE,
total_amount DECIMAL(10,2),
FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
CREATE TABLE order_items (
order_id INTEGER,
product_id INTEGER,
quantity INTEGER,
unit_price DECIMAL(10,2),
FOREIGN KEY (order_id) REFERENCES orders(order_id)
);
CREATE TABLE products (
product_id INTEGER PRIMARY KEY,
product_name VARCHAR(100),
category VARCHAR(50),
price DECIMAL(10,2)
);
Exercise Tasks:
1. Customer Lifetime Value Analysis: Create a query using CTEs that calculates each customer's lifetime value, average order value, and purchase frequency. Segment customers into quintiles based on lifetime value.
2. Product Affinity Analysis: Use correlated subqueries to find products that are frequently bought together. For each product, find the top 3 products most commonly purchased in the same order.
3. Churn Prediction Features: Build a comprehensive analysis using multiple CTEs that calculates features for churn prediction: days since last order, purchase frequency trend, seasonal purchasing patterns, and average order value trend.
Solution Approach:
-- Task 1: Customer Lifetime Value Analysis
WITH customer_metrics AS (
SELECT
c.customer_id,
c.name,
c.registration_date,
c.tier,
COUNT(DISTINCT o.order_id) as total_orders,
SUM(o.total_amount) as lifetime_value,
AVG(o.total_amount) as avg_order_value,
MIN(o.order_date) as first_order_date,
MAX(o.order_date) as last_order_date,
(MAX(o.order_date) - MIN(o.order_date)) as customer_lifetime_days
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name, c.registration_date, c.tier
),
ltv_quintiles AS (
SELECT
*,
NTILE(5) OVER (ORDER BY lifetime_value) as ltv_quintile,
CASE
WHEN total_orders = 0 THEN 0
ELSE lifetime_value / NULLIF(customer_lifetime_days, 0) * 365
END as annualized_value
FROM customer_metrics
)
SELECT
ltv_quintile,
COUNT(*) as customer_count,
AVG(lifetime_value) as avg_lifetime_value,
AVG(total_orders) as avg_total_orders,
AVG(avg_order_value) as avg_order_value,
AVG(annualized_value) as avg_annualized_value,
MIN(lifetime_value) as min_ltv_in_quintile,
MAX(lifetime_value) as max_ltv_in_quintile
FROM ltv_quintiles
WHERE total_orders > 0 -- Exclude customers who never purchased
GROUP BY ltv_quintile
ORDER BY ltv_quintile;
-- Task 2: Product Affinity Analysis
SELECT
p1.product_name as product,
p1.category,
affinity_products.companion_product,
affinity_products.companion_category,
affinity_products.co_occurrence_count,
affinity_products.affinity_strength
FROM products p1
-- LATERAL join (PostgreSQL); SQL Server would use CROSS APPLY with TOP 3
CROSS JOIN LATERAL (
SELECT
p2.product_name as companion_product,
p2.category as companion_category,
COUNT(*) as co_occurrence_count,
ROUND(
COUNT(*) * 100.0 / (
SELECT COUNT(DISTINCT oi1.order_id)
FROM order_items oi1
WHERE oi1.product_id = p1.product_id
), 2
) as affinity_strength
FROM order_items oi1
JOIN order_items oi2 ON oi1.order_id = oi2.order_id
AND oi1.product_id != oi2.product_id
JOIN products p2 ON oi2.product_id = p2.product_id
WHERE oi1.product_id = p1.product_id
GROUP BY p2.product_id, p2.product_name, p2.category
ORDER BY COUNT(*) DESC
LIMIT 3
) affinity_products
WHERE EXISTS (
SELECT 1 FROM order_items oi WHERE oi.product_id = p1.product_id
)
ORDER BY p1.category, p1.product_name;
-- Task 3: Churn Prediction Features
WITH customer_order_history AS (
SELECT
customer_id,
order_date,
total_amount,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) as order_sequence,
LAG(order_date) OVER (PARTITION BY customer_id ORDER BY order_date) as prev_order_date,
EXTRACT(month FROM order_date) as order_month,
EXTRACT(year FROM order_date) as order_year
FROM orders
),
customer_patterns AS (
SELECT
customer_id,
COUNT(*) as total_orders,
AVG(total_amount) as avg_order_value,
STDDEV(total_amount) as order_value_stddev,
MAX(order_date) as last_order_date,
AVG(order_date - prev_order_date) as avg_days_between_orders,
-- Seasonal pattern: percentage of orders in each quarter
100.0 * COUNT(CASE WHEN EXTRACT(month FROM order_date) IN (1,2,3) THEN 1 END) / COUNT(*) as q1_percentage,
100.0 * COUNT(CASE WHEN EXTRACT(month FROM order_date) IN (4,5,6) THEN 1 END) / COUNT(*) as q2_percentage,
100.0 * COUNT(CASE WHEN EXTRACT(month FROM order_date) IN (7,8,9) THEN 1 END) / COUNT(*) as q3_percentage,
100.0 * COUNT(CASE WHEN EXTRACT(month FROM order_date) IN (10,11,12) THEN 1 END) / COUNT(*) as q4_percentage
FROM customer_order_history
WHERE prev_order_date IS NOT NULL OR order_sequence = 1
GROUP BY customer_id
),
recent_trends AS (
SELECT
customer_id,
-- Recent vs historical average order value
AVG(CASE WHEN order_date >= CURRENT_DATE - INTERVAL '90 days'
THEN total_amount END) as recent_avg_order_value,
AVG(CASE WHEN order_date < CURRENT_DATE - INTERVAL '90 days'
THEN total_amount END) as historical_avg_order_value,
-- Recent order frequency
COUNT(CASE WHEN order_date >= CURRENT_DATE - INTERVAL '90 days'
THEN 1 END) as recent_order_count
FROM orders
GROUP BY customer_id
)
SELECT
c.customer_id,
c.name,
c.tier,
cp.total_orders,
cp.avg_order_value,
cp.order_value_stddev,
(CURRENT_DATE - cp.last_order_date) as days_since_last_order,
cp.avg_days_between_orders,
rt.recent_avg_order_value,
rt.historical_avg_order_value,
CASE
WHEN rt.historical_avg_order_value > 0
THEN (rt.recent_avg_order_value - rt.historical_avg_order_value) / rt.historical_avg_order_value * 100
ELSE 0
END as order_value_trend_pct,
rt.recent_order_count,
cp.q1_percentage,
cp.q2_percentage,
cp.q3_percentage,
cp.q4_percentage,
-- Churn risk score (simple heuristic)
CASE
WHEN (CURRENT_DATE - cp.last_order_date) > 180 THEN 'high'
WHEN (CURRENT_DATE - cp.last_order_date) > 90 THEN 'medium'
ELSE 'low'
END as churn_risk
FROM customers c
LEFT JOIN customer_patterns cp ON c.customer_id = cp.customer_id
LEFT JOIN recent_trends rt ON c.customer_id = rt.customer_id
ORDER BY days_since_last_order DESC;
Try running these queries on your own dataset, modifying the time windows and metrics to match your specific business needs.
Even experienced developers encounter pitfalls when working with complex subqueries and CTEs. Let's examine the most common issues and their solutions.
Mistake: Writing correlated subqueries that execute unnecessarily often.
-- PROBLEMATIC: This subquery executes once per row
SELECT
product_id,
product_name,
price,
(SELECT AVG(price) FROM products p2 WHERE p2.category = p1.category) as category_avg,
(SELECT COUNT(*) FROM products p3 WHERE p3.category = p1.category) as category_count
FROM products p1
ORDER BY category, price;
Solution: Use window functions or CTEs to calculate once per category.
-- BETTER: Window functions calculate once per category
SELECT
product_id,
product_name,
price,
AVG(price) OVER (PARTITION BY category) as category_avg,
COUNT(*) OVER (PARTITION BY category) as category_count
FROM products
ORDER BY category, price;
Mistake: Trying to reference CTEs out of order or incorrectly.
-- WRONG: second_cte tries to use third_cte before it's defined
WITH second_cte AS (
SELECT customer_id, total_value
FROM third_cte -- Error: third_cte not yet defined
WHERE total_value > 1000
),
first_cte AS (
SELECT customer_id, SUM(total_amount) as total_value
FROM orders
GROUP BY customer_id
),
third_cte AS (
SELECT customer_id, total_value
FROM first_cte
)
SELECT * FROM second_cte;
Solution: Define CTEs in dependency order and understand scoping rules.
-- CORRECT: CTEs defined in proper order
WITH first_cte AS (
SELECT customer_id, SUM(total_amount) as total_value
FROM orders
GROUP BY customer_id
),
second_cte AS (
SELECT customer_id, total_value
FROM first_cte
WHERE total_value > 1000
)
SELECT * FROM second_cte;
Mistake: Forgetting termination conditions in recursive CTEs.
-- DANGEROUS: No termination condition
WITH RECURSIVE employee_hierarchy AS (
SELECT employee_id, manager_id, name, 0 as level
FROM employees
WHERE manager_id IS NULL
UNION ALL
SELECT e.employee_id, e.manager_id, e.name, eh.level + 1
FROM employees e
JOIN employee_hierarchy eh ON e.manager_id = eh.employee_id
-- Missing: WHERE eh.level < some_limit
)
SELECT * FROM employee_hierarchy;
Solution: Always include explicit termination conditions.
-- SAFE: Multiple termination conditions
WITH RECURSIVE employee_hierarchy AS (
SELECT employee_id, manager_id, name, 0 as level
FROM employees
WHERE manager_id IS NULL
UNION ALL
SELECT e.employee_id, e.manager_id, e.name, eh.level + 1
FROM employees e
JOIN employee_hierarchy eh ON e.manager_id = eh.employee_id
WHERE eh.level < 10 -- Prevent runaway recursion
AND e.employee_id != eh.employee_id -- Prevent self-references
)
SELECT * FROM employee_hierarchy;
Mistake: Forgetting how NULLs affect subquery results and aggregations.
-- PROBLEMATIC: What happens when a customer has no orders?
SELECT
c.name,
(SELECT AVG(total_amount) FROM orders o WHERE o.customer_id = c.customer_id) as avg_order
FROM customers c
WHERE (
SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id
) > 5;
If a customer has no orders, the AVG() returns NULL and the COUNT() returns 0. Here the WHERE condition filters those customers out, but in a query without such a filter, the SELECT would silently show NULL for avg_order.
Solution: Use explicit NULL handling and consider edge cases.
-- BETTER: Explicit NULL handling
SELECT
c.name,
COALESCE(
(SELECT AVG(total_amount) FROM orders o WHERE o.customer_id = c.customer_id),
0
) as avg_order,
COALESCE(
(SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id),
0
) as order_count
FROM customers c
WHERE (
SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id
) > 5;
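The underlying NULL behavior is easy to verify in SQLite from Python's sqlite3 module: an aggregate over zero rows yields NULL (surfacing as None), and COALESCE substitutes the default (data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer_id INTEGER, total_amount REAL);
INSERT INTO orders VALUES (1, 100), (1, 200);   -- customer 2 has no orders
""")

# A scalar aggregate over zero matching rows yields NULL, not 0.
raw, = conn.execute(
    "SELECT (SELECT AVG(total_amount) FROM orders WHERE customer_id = 2)"
).fetchone()
print(raw)  # None

# COALESCE replaces the missing value with an explicit default.
fixed, = conn.execute(
    "SELECT COALESCE((SELECT AVG(total_amount) FROM orders WHERE customer_id = 2), 0)"
).fetchone()
print(fixed)  # 0
```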
Mistake: Mixing incompatible data types in recursive or UNION queries.
-- ERROR: Different data types in UNION
WITH customer_events AS (
SELECT customer_id, 'registration' as event_type, registration_date as event_date
FROM customers
UNION ALL
SELECT customer_id, 'order', order_id -- order_id is INTEGER, not DATE
FROM orders
)
SELECT * FROM customer_events;
Solution: Ensure compatible data types and explicit casting when necessary.
-- CORRECT: Compatible data types
WITH customer_events AS (
SELECT customer_id, 'registration' as event_type, registration_date as event_date
FROM customers
UNION ALL
SELECT customer_id, 'order', order_date
FROM orders
)
SELECT * FROM customer_events
ORDER BY customer_id, event_date;
When CTE queries produce unexpected results, debug systematically:
-- Add debug information to troubleshoot
WITH customer_stats AS (
SELECT
customer_id,
COUNT(*) as order_count,
SUM(total_amount) as total_spent,
-- Debug: show individual components
COUNT(*) as debug_order_count,
SUM(CASE WHEN total_amount IS NULL THEN 1 ELSE 0 END) as null_amounts
FROM orders
GROUP BY customer_id
)
SELECT
customer_id,
order_count,
total_spent,
-- Debug columns help verify logic
debug_order_count,
null_amounts,
CASE WHEN order_count != debug_order_count THEN 'MISMATCH' ELSE 'OK' END as count_check
FROM customer_stats
WHERE customer_id IN (101, 102, 103) -- Limit to specific customers for testing
ORDER BY customer_id;
You've now mastered the fundamental patterns for subqueries and CTEs, along with their performance characteristics and common pitfalls. These tools transform SQL from a simple data retrieval language into a powerful analytical platform capable of sophisticated multi-step reasoning.
Key takeaways:
- Subqueries nest one query inside another; CTEs name intermediate results that live for a single statement.
- Correlated subqueries logically run once per outer row; window functions or CTEs often do the same work once.
- EXISTS short-circuits on the first match, making it well suited to boolean filtering.
- CTE materialization behavior varies by database and version, so verify with execution plans rather than assuming.
- Filter early, index correlation columns, and always give recursive CTEs an explicit termination condition.
Advanced patterns you've learned: chained CTEs that build analyses step by step, recursive CTEs for hierarchies and graphs, cohort retention analysis, and multi-period pivots via conditional aggregation.
Next steps for deeper expertise: run EXPLAIN (or your database's equivalent) on the queries from this lesson, compare CTE and subquery plans on your own data volumes, and revisit the hands-on exercise against your production schema.
The patterns you've learned here form the foundation for advanced SQL analytics. Whether you're building business intelligence dashboards, performing ad-hoc analysis, or designing data pipelines, these techniques will help you write queries that are both powerful and maintainable.
Remember: the best SQL query is one that solves the business problem correctly, performs well with your data volumes, and can be understood by the next person who needs to modify it. Subqueries and CTEs are tools in service of those goals, not ends in themselves.