You're working on a complex business intelligence report that requires analyzing customer behavior across multiple time periods, calculating running totals, and comparing performance metrics against historical averages. As you start building the SQL query, you quickly realize that simple SELECT statements won't cut it. You need to nest queries within queries, reference intermediate results multiple times, and create readable, maintainable code that your team can understand and modify.
This is where subqueries and Common Table Expressions (CTEs) become essential tools in your SQL arsenal. These powerful constructs allow you to break down complex analytical problems into manageable, logical components while maintaining optimal performance and code clarity.
By the end of this lesson, you'll have mastered both subqueries and CTEs, understanding not just their syntax but also their strategic application in real-world data scenarios.
Prerequisites:
You should be comfortable with intermediate SQL concepts including JOIN operations, aggregate functions, window functions, and query execution plans. Experience with business intelligence or analytical workloads will help you appreciate the real-world applications we'll explore.
A subquery is a complete SELECT statement nested within another SQL statement. While this sounds simple, subqueries are the backbone of sophisticated data analysis, enabling you to perform multi-step calculations, dynamic filtering, and complex comparisons that would otherwise require multiple separate queries or application logic.
Let's start with a realistic scenario. You're analyzing an e-commerce platform's sales data and need to identify customers who made purchases above the average order value in each product category during the last quarter.
-- Sample data structure we'll work with
-- orders: order_id, customer_id, order_date, total_amount
-- order_items: order_item_id, order_id, product_id, quantity, unit_price
-- products: product_id, product_name, category_id, supplier_id
-- categories: category_id, category_name
Here's how a subquery-based solution might look:
SELECT DISTINCT
o.customer_id,
o.order_id,
o.total_amount,
c.category_name
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN categories c ON p.category_id = c.category_id
WHERE o.order_date >= DATEADD(month, -3, GETDATE())
AND o.total_amount > (
SELECT AVG(o2.total_amount)
FROM orders o2
JOIN order_items oi2 ON o2.order_id = oi2.order_id
JOIN products p2 ON oi2.product_id = p2.product_id
WHERE p2.category_id = p.category_id
AND o2.order_date >= DATEADD(month, -3, GETDATE())
)
ORDER BY c.category_name, o.total_amount DESC;
This query demonstrates a correlated subquery – the inner query references columns from the outer query (p.category_id). Each row in the outer query triggers a separate execution of the subquery, making this pattern both powerful and potentially expensive.
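To see the correlated execution model concretely, here is a minimal, runnable sketch against an in-memory SQLite database. The schema is a simplified assumption (category_id placed directly on orders), but the per-row re-evaluation of the inner query is the same mechanism:

```python
import sqlite3

# Simplified, invented sample data: category_id lives directly on orders here.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id INTEGER, category_id INTEGER, total_amount REAL);
INSERT INTO orders VALUES
  (1, 1, 100), (2, 1, 300),   -- category 1 average: 200
  (3, 2,  50), (4, 2, 150);   -- category 2 average: 100
""")

# The inner query re-runs for each outer row, using that row's category_id.
rows = con.execute("""
    SELECT order_id, total_amount
    FROM orders o
    WHERE total_amount > (
        SELECT AVG(o2.total_amount)
        FROM orders o2
        WHERE o2.category_id = o.category_id
    )
    ORDER BY order_id
""").fetchall()

print(rows)  # [(2, 300.0), (4, 150.0)] -- each beats its own category's average
```

Orders 2 and 4 survive because each exceeds the average of its own category, not a global average.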
Understanding the four fundamental types of subqueries is crucial for choosing the right approach for each analytical challenge.
Scalar subqueries return exactly one value and are commonly used in SELECT lists, WHERE clauses, and HAVING clauses. They're ideal for dynamic comparisons and calculated metrics.
-- Find products with sales above the company-wide average
SELECT
product_id,
product_name,
(SELECT COUNT(*)
FROM order_items oi
WHERE oi.product_id = p.product_id) as total_orders,
(SELECT AVG(total_amount) FROM orders) as company_avg_order
FROM products p
WHERE (
SELECT SUM(oi.quantity * oi.unit_price)
FROM order_items oi
WHERE oi.product_id = p.product_id
) > (SELECT AVG(total_amount) FROM orders);
Row subqueries return a single row with multiple columns, enabling complex tuple-based comparisons:
-- Find orders whose max quantity and max unit price both come from the
-- same top-revenue line item (row-value comparison; supported in
-- PostgreSQL and MySQL, not SQL Server)
SELECT order_id, customer_id, order_date
FROM orders o
WHERE (
SELECT MAX(quantity), MAX(unit_price)
FROM order_items oi
WHERE oi.order_id = o.order_id
) = (
SELECT quantity, unit_price
FROM order_items oi2
WHERE oi2.order_id = o.order_id
ORDER BY quantity * unit_price DESC
LIMIT 1
);
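SQLite also supports row-value comparisons, so the tuple-matching idea can be sketched with a few invented line items (table contents are assumptions for illustration):

```python
import sqlite3

# Hypothetical line items; order 1's top-revenue item is (qty 5, price 30).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE order_items (order_id INTEGER, quantity INTEGER, unit_price REAL)")
con.executemany("INSERT INTO order_items VALUES (?, ?, ?)",
                [(1, 2, 10.0), (1, 5, 30.0), (2, 1, 99.0)])

# Row-value comparison: match a (quantity, unit_price) pair against a
# single-row, two-column subquery in one tuple comparison.
rows = con.execute("""
    SELECT order_id, quantity, unit_price
    FROM order_items
    WHERE (quantity, unit_price) = (
        SELECT quantity, unit_price
        FROM order_items
        WHERE order_id = 1
        ORDER BY quantity * unit_price DESC
        LIMIT 1
    )
""").fetchall()

print(rows)  # [(1, 5, 30.0)]
```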
Column subqueries return multiple rows with a single column, typically used with IN, ANY, ALL, and EXISTS operators:
-- Find customers who have ordered products from more than 3 categories
SELECT customer_id, COUNT(DISTINCT category_id) as category_count
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE customer_id IN (
SELECT customer_id
FROM orders o2
JOIN order_items oi2 ON o2.order_id = oi2.order_id
JOIN products p2 ON oi2.product_id = p2.product_id
GROUP BY customer_id
HAVING COUNT(DISTINCT p2.category_id) > 3
)
GROUP BY customer_id;
Table subqueries return multiple rows and columns, often used in FROM clauses to create derived tables:
-- Analyze monthly sales trends with year-over-year comparison
SELECT
current_month.month_year,
current_month.monthly_revenue,
current_month.order_count,
COALESCE(previous_year.monthly_revenue, 0) as prev_year_revenue,
CASE
WHEN previous_year.monthly_revenue > 0
THEN ((current_month.monthly_revenue - previous_year.monthly_revenue) / previous_year.monthly_revenue) * 100
ELSE NULL
END as yoy_growth_percent
FROM (
SELECT
FORMAT(order_date, 'yyyy-MM') as month_year,
SUM(total_amount) as monthly_revenue,
COUNT(*) as order_count,
YEAR(order_date) as year,
MONTH(order_date) as month
FROM orders
WHERE order_date >= DATEADD(year, -2, GETDATE())
GROUP BY FORMAT(order_date, 'yyyy-MM'), YEAR(order_date), MONTH(order_date)
) current_month
LEFT JOIN (
SELECT
YEAR(order_date) + 1 as target_year,
MONTH(order_date) as target_month,
SUM(total_amount) as monthly_revenue
FROM orders
WHERE order_date >= DATEADD(year, -2, GETDATE())
GROUP BY YEAR(order_date), MONTH(order_date)
) previous_year ON current_month.year = previous_year.target_year
AND current_month.month = previous_year.target_month
ORDER BY current_month.month_year;
As your analytical requirements become more sophisticated, you'll encounter scenarios where basic subquery patterns need enhancement for performance and maintainability.
The choice between EXISTS and IN can significantly impact both performance and correctness, especially when dealing with NULL values:
-- EXISTS approach - generally faster and NULL-safe
SELECT customer_id, customer_name
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.order_date >= '2024-01-01'
AND o.total_amount > 100
);
-- IN approach - can be slower with large result sets
SELECT customer_id, customer_name
FROM customers c
WHERE customer_id IN (
SELECT customer_id
FROM orders
WHERE order_date >= '2024-01-01'
AND total_amount > 100
AND customer_id IS NOT NULL -- Harmless with IN, but essential with NOT IN: one NULL makes NOT IN match nothing
);
Critical Insight: EXISTS typically performs better because it can stop at the first matching row, while IN must evaluate the entire subquery result set. Additionally, EXISTS and NOT EXISTS handle NULL values predictably, whereas NOT IN returns no rows at all if the subquery yields even a single NULL.
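The NULL hazard is easy to demonstrate end to end. This sketch (invented two-row tables, in-memory SQLite) shows NOT IN silently matching nothing once a NULL appears, while NOT EXISTS behaves as intended:

```python
import sqlite3

# With one NULL in the subquery result, NOT IN matches nothing;
# NOT EXISTS is unaffected.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (customer_id INTEGER, name TEXT);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
CREATE TABLE orders (customer_id INTEGER);
INSERT INTO orders VALUES (1), (NULL);   -- a single NULL poisons NOT IN
""")

not_in = con.execute("""
    SELECT name FROM customers
    WHERE customer_id NOT IN (SELECT customer_id FROM orders)
""").fetchall()

not_exists = con.execute("""
    SELECT name FROM customers c
    WHERE NOT EXISTS (
        SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
    )
""").fetchall()

print(not_in)      # [] -- '2 <> NULL' is unknown, so no row ever qualifies
print(not_exists)  # [('Grace',)]
```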
Understanding when to use correlated versus non-correlated subqueries is essential for optimal performance:
-- Non-correlated subquery - executes once
SELECT product_id, product_name, unit_price
FROM products
WHERE unit_price > (
SELECT AVG(unit_price) * 1.2
FROM products
);
-- Correlated subquery - executes once per outer row
SELECT p1.product_id, p1.product_name, p1.unit_price
FROM products p1
WHERE p1.unit_price > (
SELECT AVG(p2.unit_price) * 1.2
FROM products p2
WHERE p2.category_id = p1.category_id
);
The correlated version provides more granular analysis but at a significant performance cost. For large datasets, consider materializing intermediate results or using window functions instead.
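The window-function rewrite mentioned above can be sketched like this (sample rows are invented): the per-category average is computed in a single pass with PARTITION BY, rather than once per outer row.

```python
import sqlite3

# Invented sample products across two categories.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE products (product_id INTEGER, category_id INTEGER, unit_price REAL);
INSERT INTO products VALUES (1, 1, 10), (2, 1, 30), (3, 2, 5), (4, 2, 6);
""")

rows = con.execute("""
    WITH priced AS (
        SELECT product_id, unit_price,
               AVG(unit_price) OVER (PARTITION BY category_id) AS cat_avg
        FROM products
    )
    SELECT product_id
    FROM priced
    WHERE unit_price > cat_avg * 1.2   -- same 20%-above-average filter
    ORDER BY product_id
""").fetchall()

print(rows)  # [(2,)] -- only product 2 exceeds its category average by 20%
```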
Common Table Expressions (CTEs) revolutionize how we approach complex SQL by providing a way to define temporary, named result sets that exist only for the duration of a single query. Think of CTEs as creating temporary views that you can reference multiple times within your main query, dramatically improving readability and maintainability.
Let's refactor our earlier complex subquery example using CTEs:
WITH quarterly_category_averages AS (
SELECT
c.category_id,
c.category_name,
AVG(o.total_amount) as avg_order_value,
COUNT(*) as total_orders,
SUM(o.total_amount) as total_revenue
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN categories c ON p.category_id = c.category_id
WHERE o.order_date >= DATEADD(month, -3, GETDATE())
GROUP BY c.category_id, c.category_name
),
above_average_orders AS (
SELECT
o.customer_id,
o.order_id,
o.total_amount,
p.category_id
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN quarterly_category_averages qca ON p.category_id = qca.category_id
WHERE o.order_date >= DATEADD(month, -3, GETDATE())
AND o.total_amount > qca.avg_order_value
)
SELECT
aao.customer_id,
aao.order_id,
aao.total_amount,
qca.category_name,
qca.avg_order_value,
ROUND(((aao.total_amount - qca.avg_order_value) / qca.avg_order_value) * 100, 2) as percent_above_average
FROM above_average_orders aao
JOIN quarterly_category_averages qca ON aao.category_id = qca.category_id
ORDER BY qca.category_name, aao.total_amount DESC;
This CTE-based approach offers several advantages: each step has a descriptive name, the logic reads top to bottom, and the quarterly_category_averages CTE is referenced twice without duplicating its definition.
CTEs excel at breaking down complex analytical problems into logical steps:
WITH monthly_sales AS (
SELECT
YEAR(order_date) as year,
MONTH(order_date) as month,
DATEFROMPARTS(YEAR(order_date), MONTH(order_date), 1) as month_start,
SUM(total_amount) as monthly_revenue,
COUNT(DISTINCT customer_id) as unique_customers,
COUNT(*) as total_orders
FROM orders
WHERE order_date >= DATEADD(year, -2, GETDATE())
GROUP BY YEAR(order_date), MONTH(order_date), DATEFROMPARTS(YEAR(order_date), MONTH(order_date), 1)
),
monthly_metrics AS (
SELECT
*,
LAG(monthly_revenue, 1) OVER (ORDER BY year, month) as prev_month_revenue,
LAG(monthly_revenue, 12) OVER (ORDER BY year, month) as same_month_prev_year,
AVG(monthly_revenue) OVER (
ORDER BY year, month
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) as three_month_avg
FROM monthly_sales
),
growth_analysis AS (
SELECT
*,
CASE
WHEN prev_month_revenue > 0
THEN ((monthly_revenue - prev_month_revenue) / prev_month_revenue) * 100
ELSE NULL
END as month_over_month_growth,
CASE
WHEN same_month_prev_year > 0
THEN ((monthly_revenue - same_month_prev_year) / same_month_prev_year) * 100
ELSE NULL
END as year_over_year_growth,
CASE
WHEN three_month_avg > 0
THEN ((monthly_revenue - three_month_avg) / three_month_avg) * 100
ELSE NULL
END as vs_three_month_avg
FROM monthly_metrics
)
SELECT
FORMAT(month_start, 'MMM yyyy') as month_display,
FORMAT(monthly_revenue, 'C0') as revenue,
unique_customers,
total_orders,
ROUND(monthly_revenue / total_orders, 2) as avg_order_value,
ROUND(month_over_month_growth, 1) as mom_growth_pct,
ROUND(year_over_year_growth, 1) as yoy_growth_pct,
ROUND(vs_three_month_avg, 1) as vs_3mo_avg_pct
FROM growth_analysis
WHERE month_start >= DATEADD(year, -1, GETDATE())
ORDER BY year DESC, month DESC;
Recursive CTEs are incredibly powerful for processing hierarchical data structures like organizational charts, product categories, or geographical regions:
-- Process hierarchical category structure with a recursive CTE
-- (SQL Server omits the RECURSIVE keyword; PostgreSQL and MySQL require it)
WITH RECURSIVE category_hierarchy AS (
-- Anchor: Start with root categories (no parent)
SELECT
category_id,
category_name,
parent_category_id,
0 as level,
CAST(category_name AS VARCHAR(1000)) as path,
CAST(category_id AS VARCHAR(100)) as id_path
FROM categories
WHERE parent_category_id IS NULL
UNION ALL
-- Recursive: Add child categories
SELECT
c.category_id,
c.category_name,
c.parent_category_id,
ch.level + 1,
CAST(ch.path + ' > ' + c.category_name AS VARCHAR(1000)),
CAST(ch.id_path + '/' + CAST(c.category_id AS VARCHAR(10)) AS VARCHAR(100))
FROM categories c
JOIN category_hierarchy ch ON c.parent_category_id = ch.category_id
WHERE ch.level < 10 -- Prevent infinite recursion
),
category_sales AS (
SELECT
ch.category_id,
ch.category_name,
ch.level,
ch.path,
ch.id_path,
COALESCE(SUM(oi.quantity * oi.unit_price), 0) as direct_sales,
COUNT(DISTINCT o.order_id) as order_count
FROM category_hierarchy ch
LEFT JOIN products p ON ch.category_id = p.category_id
LEFT JOIN order_items oi ON p.product_id = oi.product_id
LEFT JOIN orders o ON oi.order_id = o.order_id
AND o.order_date >= DATEADD(month, -6, GETDATE())
GROUP BY ch.category_id, ch.category_name, ch.level, ch.path, ch.id_path
),
hierarchical_totals AS (
SELECT
cs.*,
-- Calculate total sales including all descendant categories
(SELECT SUM(cs2.direct_sales)
FROM category_sales cs2
WHERE cs2.id_path = cs.id_path
OR cs2.id_path LIKE cs.id_path + '/%' -- anchor on the separator so id 12 doesn't match id 120
) as total_sales_with_children
FROM category_sales cs
)
SELECT
REPLICATE(' ', level) + category_name as indented_name,
level,
FORMAT(direct_sales, 'C0') as direct_sales,
FORMAT(total_sales_with_children, 'C0') as total_including_children,
order_count,
path
FROM hierarchical_totals
ORDER BY id_path;
This recursive CTE pattern is essential for hierarchical data such as organizational charts, nested product categories, and geographical region rollups.
Performance Warning: Recursive CTEs can become expensive with deep hierarchies or circular references. Always include a level limit and consider materializing results for frequently-accessed hierarchical data.
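Here is the same anchor/recursive-member/guard structure scaled down to a three-level category tree, runnable in SQLite (sample rows are invented):

```python
import sqlite3

# Tiny invented hierarchy: Electronics > Audio > Headphones.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE categories (category_id INTEGER, name TEXT, parent_id INTEGER);
INSERT INTO categories VALUES
  (1, 'Electronics', NULL), (2, 'Audio', 1), (3, 'Headphones', 2);
""")

rows = con.execute("""
    WITH RECURSIVE tree AS (
        SELECT category_id, name, 0 AS level, name AS path
        FROM categories WHERE parent_id IS NULL
        UNION ALL
        SELECT c.category_id, c.name, t.level + 1, t.path || ' > ' || c.name
        FROM categories c
        JOIN tree t ON c.parent_id = t.category_id
        WHERE t.level < 10            -- guard against cycles
    )
    SELECT path, level FROM tree ORDER BY level
""").fetchall()

for path, level in rows:
    print(level, path)
```

Each recursive step extends both the level counter and the readable path string, exactly as in the full query above.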
Understanding how CTEs are executed is crucial for performance optimization. Unlike views or temporary tables, CTEs are generally inlined by the query optimizer, meaning they may be executed multiple times if referenced multiple times.
-- This CTE may be executed twice if the optimizer doesn't materialize it
WITH expensive_calculation AS (
SELECT
customer_id,
COUNT(*) as order_count,
AVG(total_amount) as avg_order,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY total_amount) as median_order,
-- Complex statistical calculations...
STDEV(total_amount) as order_stdev
FROM orders
WHERE order_date >= DATEADD(year, -1, GETDATE())
GROUP BY customer_id
)
SELECT 'High Value' as segment, COUNT(*) as customer_count
FROM expensive_calculation
WHERE avg_order > 200
UNION ALL
SELECT 'Standard' as segment, COUNT(*) as customer_count
FROM expensive_calculation -- Potentially executed again here
WHERE avg_order BETWEEN 50 AND 200
UNION ALL
SELECT 'Low Value' as segment, COUNT(*) as customer_count
FROM expensive_calculation -- And potentially again here
WHERE avg_order < 50;
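The multiple-reference pattern can be reproduced in miniature: one named result set used both as the row source and inside a scalar comparison. Whether the engine computes it once or twice is an optimizer decision, which is exactly the point of the discussion above.

```python
import sqlite3

# Invented orders for two customers.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (customer_id INTEGER, total_amount REAL);
INSERT INTO orders VALUES (1, 100), (1, 300), (2, 50);
""")

rows = con.execute("""
    WITH totals AS (
        SELECT customer_id, SUM(total_amount) AS spent
        FROM orders GROUP BY customer_id
    )
    SELECT customer_id, spent
    FROM totals
    WHERE spent > (SELECT AVG(spent) FROM totals)   -- second reference
""").fetchall()

print(rows)  # [(1, 400.0)]
```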
To force materialization when beneficial:
-- Alternative approach using temporary tables for expensive calculations
CREATE TEMPORARY TABLE customer_metrics AS
SELECT
customer_id,
COUNT(*) as order_count,
AVG(total_amount) as avg_order,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY total_amount) as median_order,
STDEV(total_amount) as order_stdev
FROM orders
WHERE order_date >= DATEADD(year, -1, GETDATE())
GROUP BY customer_id;
-- Now the expensive calculation runs only once
SELECT
CASE
WHEN avg_order > 200 THEN 'High Value'
WHEN avg_order >= 50 THEN 'Standard'
ELSE 'Low Value'
END as segment,
COUNT(*) as customer_count
FROM customer_metrics
GROUP BY CASE
WHEN avg_order > 200 THEN 'High Value'
WHEN avg_order >= 50 THEN 'Standard'
ELSE 'Low Value'
END;
DROP TEMPORARY TABLE customer_metrics;
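The materialize-then-segment idea translates directly to SQLite's `CREATE TEMP TABLE ... AS`: the aggregation runs once, and every subsequent query reads the stored result (sample data invented):

```python
import sqlite3

# Compute the aggregation once into a temp table, then query it repeatedly.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (customer_id INTEGER, total_amount REAL);
INSERT INTO orders VALUES (1, 250), (1, 50), (2, 60), (3, 10);

CREATE TEMP TABLE customer_metrics AS
  SELECT customer_id, AVG(total_amount) AS avg_order
  FROM orders GROUP BY customer_id;
""")

segments = con.execute("""
    SELECT CASE WHEN avg_order > 200 THEN 'High Value'
                WHEN avg_order >= 50 THEN 'Standard'
                ELSE 'Low Value' END AS segment,
           COUNT(*) AS customer_count
    FROM customer_metrics
    GROUP BY segment
    ORDER BY segment
""").fetchall()

print(segments)  # [('Low Value', 1), ('Standard', 2)]
```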
CTEs become even more powerful when combined with other advanced SQL features like window functions, pivot operations, and analytical functions.
WITH daily_sales AS (
SELECT
DATE(order_date) as sale_date,
SUM(total_amount) as daily_revenue,
COUNT(*) as daily_orders,
COUNT(DISTINCT customer_id) as unique_customers
FROM orders
WHERE order_date >= DATEADD(month, -6, GETDATE())
GROUP BY DATE(order_date)
),
sales_with_trends AS (
SELECT
*,
-- Moving averages
AVG(daily_revenue) OVER (
ORDER BY sale_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) as seven_day_avg_revenue,
-- Ranking and percentiles
PERCENT_RANK() OVER (ORDER BY daily_revenue) as revenue_percentile,
DENSE_RANK() OVER (ORDER BY daily_revenue DESC) as revenue_rank,
-- Period-over-period comparisons
LAG(daily_revenue, 7) OVER (ORDER BY sale_date) as same_day_prev_week,
LAG(daily_revenue, 1) OVER (ORDER BY sale_date) as prev_day_revenue,
-- Running totals and counts
SUM(daily_revenue) OVER (
ORDER BY sale_date
ROWS UNBOUNDED PRECEDING
) as running_total_revenue,
-- Identify anomalies using standard deviation
AVG(daily_revenue) OVER () as overall_avg,
STDEV(daily_revenue) OVER () as overall_stdev
FROM daily_sales
),
anomaly_detection AS (
SELECT
*,
CASE
WHEN ABS(daily_revenue - overall_avg) > 2 * overall_stdev
THEN 'Anomaly'
WHEN ABS(daily_revenue - overall_avg) > 1 * overall_stdev
THEN 'Outlier'
ELSE 'Normal'
END as anomaly_flag,
CASE
WHEN same_day_prev_week > 0
THEN ((daily_revenue - same_day_prev_week) / same_day_prev_week) * 100
ELSE NULL
END as week_over_week_growth
FROM sales_with_trends
)
SELECT
sale_date,
FORMAT(daily_revenue, 'C0') as revenue,
daily_orders,
unique_customers,
ROUND(seven_day_avg_revenue, 2) as seven_day_avg,
ROUND(revenue_percentile * 100, 1) as percentile_rank,
revenue_rank,
anomaly_flag,
ROUND(week_over_week_growth, 1) as wow_growth_pct,
ROUND((daily_revenue - overall_avg) / overall_stdev, 2) as z_score
FROM anomaly_detection
WHERE sale_date >= DATEADD(month, -1, GETDATE())
ORDER BY sale_date DESC;
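A reduced version of that trend pipeline, with a two-row moving average standing in for the seven-day frame, shows the CTE-plus-window-frame mechanics in isolation (sample data invented):

```python
import sqlite3

# Four invented days; the spike on the 4th pulls its window average up.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE daily_sales (sale_date TEXT, daily_revenue REAL);
INSERT INTO daily_sales VALUES
  ('2024-01-01', 100), ('2024-01-02', 110),
  ('2024-01-03',  90), ('2024-01-04', 500);
""")

rows = con.execute("""
    WITH trends AS (
        SELECT sale_date, daily_revenue,
               AVG(daily_revenue) OVER (
                   ORDER BY sale_date
                   ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
               ) AS two_day_avg
        FROM daily_sales
    )
    SELECT sale_date, two_day_avg FROM trends ORDER BY sale_date
""").fetchall()

print(rows)
```

The last row's average is (90 + 500) / 2 = 295.0, which is how a moving average dampens, but still reveals, a one-day spike.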
WITH monthly_category_sales AS (
SELECT
FORMAT(o.order_date, 'yyyy-MM') as month_year,
c.category_name,
SUM(oi.quantity * oi.unit_price) as category_sales
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN categories c ON p.category_id = c.category_id
WHERE o.order_date >= DATEADD(month, -12, GETDATE())
GROUP BY FORMAT(o.order_date, 'yyyy-MM'), c.category_name
),
pivot_prep AS (
SELECT
month_year,
SUM(CASE WHEN category_name = 'Electronics' THEN category_sales ELSE 0 END) as Electronics,
SUM(CASE WHEN category_name = 'Clothing' THEN category_sales ELSE 0 END) as Clothing,
SUM(CASE WHEN category_name = 'Home & Garden' THEN category_sales ELSE 0 END) as Home_Garden,
SUM(CASE WHEN category_name = 'Sports' THEN category_sales ELSE 0 END) as Sports,
SUM(category_sales) as Total_Sales
FROM monthly_category_sales
GROUP BY month_year
),
with_growth_rates AS (
SELECT
*,
LAG(Electronics, 1) OVER (ORDER BY month_year) as prev_electronics,
LAG(Clothing, 1) OVER (ORDER BY month_year) as prev_clothing,
LAG(Home_Garden, 1) OVER (ORDER BY month_year) as prev_home_garden,
LAG(Sports, 1) OVER (ORDER BY month_year) as prev_sports,
LAG(Total_Sales, 1) OVER (ORDER BY month_year) as prev_total
FROM pivot_prep
)
SELECT
month_year,
FORMAT(Electronics, 'C0') as Electronics,
FORMAT(Clothing, 'C0') as Clothing,
FORMAT(Home_Garden, 'C0') as Home_Garden,
FORMAT(Sports, 'C0') as Sports,
FORMAT(Total_Sales, 'C0') as Total,
-- Growth rates
CASE
WHEN prev_electronics > 0
THEN ROUND(((Electronics - prev_electronics) / prev_electronics) * 100, 1)
ELSE NULL
END as Electronics_Growth,
CASE
WHEN prev_total > 0
THEN ROUND(((Total_Sales - prev_total) / prev_total) * 100, 1)
ELSE NULL
END as Total_Growth
FROM with_growth_rates
ORDER BY month_year DESC;
Understanding the performance implications of CTEs versus alternatives is crucial for building scalable analytical systems.
Use CTEs when: the logic benefits from named, readable steps; intermediate results are small or referenced only once or twice; you need recursion; or you want a single self-contained statement.
Use temporary tables when: an expensive intermediate result is reused many times; the result set is large enough to benefit from indexes; or several separate statements need to share the same intermediate data.
-- Performance comparison example
-- CTE approach - may be less efficient for large datasets
WITH large_customer_analysis AS (
SELECT
customer_id,
COUNT(*) as total_orders,
SUM(total_amount) as lifetime_value,
AVG(total_amount) as avg_order_value,
MIN(order_date) as first_order,
MAX(order_date) as last_order,
DATEDIFF(day, MIN(order_date), MAX(order_date)) as customer_lifespan_days
FROM orders
WHERE order_date >= '2020-01-01'
GROUP BY customer_id
HAVING COUNT(*) >= 5 -- Only customers with 5+ orders
)
SELECT
segment,
COUNT(*) as customer_count,
AVG(lifetime_value) as avg_lifetime_value
FROM (
SELECT
*,
CASE
WHEN lifetime_value >= 10000 THEN 'VIP'
WHEN lifetime_value >= 5000 THEN 'High Value'
WHEN lifetime_value >= 1000 THEN 'Medium Value'
ELSE 'Low Value'
END as segment
FROM large_customer_analysis
WHERE customer_lifespan_days >= 90
) segmented
GROUP BY segment;
-- Temporary table approach - better for large datasets
CREATE TEMPORARY TABLE customer_analysis_temp AS
SELECT
customer_id,
COUNT(*) as total_orders,
SUM(total_amount) as lifetime_value,
AVG(total_amount) as avg_order_value,
MIN(order_date) as first_order,
MAX(order_date) as last_order,
DATEDIFF(day, MIN(order_date), MAX(order_date)) as customer_lifespan_days
FROM orders
WHERE order_date >= '2020-01-01'
GROUP BY customer_id
HAVING COUNT(*) >= 5;
-- Add indexes for better performance
CREATE INDEX idx_customer_analysis_lifetime ON customer_analysis_temp(lifetime_value);
CREATE INDEX idx_customer_analysis_lifespan ON customer_analysis_temp(customer_lifespan_days);
-- Now the segmentation query runs more efficiently
SELECT
CASE
WHEN lifetime_value >= 10000 THEN 'VIP'
WHEN lifetime_value >= 5000 THEN 'High Value'
WHEN lifetime_value >= 1000 THEN 'Medium Value'
ELSE 'Low Value'
END as segment,
COUNT(*) as customer_count,
AVG(lifetime_value) as avg_lifetime_value,
MIN(customer_lifespan_days) as min_lifespan,
MAX(customer_lifespan_days) as max_lifespan
FROM customer_analysis_temp
WHERE customer_lifespan_days >= 90
GROUP BY CASE
WHEN lifetime_value >= 10000 THEN 'VIP'
WHEN lifetime_value >= 5000 THEN 'High Value'
WHEN lifetime_value >= 1000 THEN 'Medium Value'
ELSE 'Low Value'
END;
DROP TEMPORARY TABLE customer_analysis_temp;
Different database engines handle CTEs differently, impacting your optimization strategies:
SQL Server: always inlines CTEs into the main query; there is no materialization hint, so use temporary tables when a result must be computed exactly once.
PostgreSQL: materialized CTEs by default before version 12; from 12 onward it inlines them unless you specify MATERIALIZED (or force inlining with NOT MATERIALIZED).
MySQL: supports CTEs from version 8.0; the optimizer may merge or materialize a CTE, and recursive CTE depth is capped by cte_max_recursion_depth.
-- PostgreSQL-specific CTE optimization
WITH expensive_aggregation AS MATERIALIZED (
-- Force materialization for reuse
SELECT customer_id, SUM(total_amount) as total_spent
FROM orders
GROUP BY customer_id
),
simple_filter AS NOT MATERIALIZED (
-- Force inlining for simple operations
SELECT * FROM customers WHERE region = 'North America'
)
SELECT sf.customer_name, ea.total_spent
FROM simple_filter sf
JOIN expensive_aggregation ea ON sf.customer_id = ea.customer_id;
In production systems, CTEs and subqueries often integrate with other architectural components including stored procedures, data pipeline tools, and business intelligence platforms.
-- Example: Data quality reporting CTE used in ETL monitoring
CREATE PROCEDURE sp_daily_data_quality_report
AS
BEGIN
WITH source_data_quality AS (
SELECT
'orders' as table_name,
COUNT(*) as total_records,
COUNT(CASE WHEN customer_id IS NULL THEN 1 END) as null_customer_ids,
COUNT(CASE WHEN total_amount <= 0 THEN 1 END) as invalid_amounts,
COUNT(CASE WHEN order_date > GETDATE() THEN 1 END) as future_dates,
MIN(order_date) as earliest_date,
MAX(order_date) as latest_date
FROM orders
WHERE order_date = CAST(GETDATE() AS DATE)
UNION ALL
SELECT
'customers',
COUNT(*),
COUNT(CASE WHEN email IS NULL OR email = '' THEN 1 END),
COUNT(CASE WHEN customer_name IS NULL OR customer_name = '' THEN 1 END),
0,
MIN(created_date),
MAX(created_date)
FROM customers
WHERE created_date = CAST(GETDATE() AS DATE)
),
quality_summary AS (
SELECT
table_name,
total_records,
null_customer_ids + invalid_amounts + future_dates as total_quality_issues,
CASE
WHEN total_records = 0 THEN 0
ELSE ROUND((CAST(null_customer_ids + invalid_amounts + future_dates AS FLOAT) / total_records) * 100, 2)
END as error_percentage,
earliest_date,
latest_date
FROM source_data_quality
)
SELECT
table_name,
total_records,
total_quality_issues,
error_percentage,
CASE
WHEN error_percentage > 5 THEN 'CRITICAL'
WHEN error_percentage > 1 THEN 'WARNING'
ELSE 'GOOD'
END as quality_status,
earliest_date,
latest_date,
GETDATE() as report_timestamp
FROM quality_summary
ORDER BY error_percentage DESC;
END;
-- Comprehensive dashboard query using multiple CTEs
WITH time_periods AS (
SELECT
DATEADD(day, -30, CAST(GETDATE() AS DATE)) as thirty_days_ago,
DATEADD(day, -7, CAST(GETDATE() AS DATE)) as seven_days_ago,
DATEADD(day, -1, CAST(GETDATE() AS DATE)) as yesterday,
CAST(GETDATE() AS DATE) as today
),
kpi_calculations AS (
SELECT
tp.*,
-- Revenue metrics
(SELECT SUM(total_amount) FROM orders
WHERE order_date >= tp.thirty_days_ago) as revenue_30d,
(SELECT SUM(total_amount) FROM orders
WHERE order_date >= tp.seven_days_ago) as revenue_7d,
(SELECT SUM(total_amount) FROM orders
WHERE order_date = tp.yesterday) as revenue_yesterday,
-- Customer metrics
(SELECT COUNT(DISTINCT customer_id) FROM orders
WHERE order_date >= tp.thirty_days_ago) as active_customers_30d,
(SELECT COUNT(DISTINCT customer_id) FROM orders
WHERE order_date >= tp.seven_days_ago) as active_customers_7d,
-- Order metrics
(SELECT COUNT(*) FROM orders
WHERE order_date >= tp.thirty_days_ago) as orders_30d,
(SELECT COUNT(*) FROM orders
WHERE order_date >= tp.seven_days_ago) as orders_7d,
(SELECT COUNT(*) FROM orders
WHERE order_date = tp.yesterday) as orders_yesterday
FROM time_periods tp
),
comparative_metrics AS (
SELECT
*,
-- Calculate averages and growth rates
revenue_30d / 30.0 as avg_daily_revenue_30d,
revenue_7d / 7.0 as avg_daily_revenue_7d,
orders_30d / 30.0 as avg_daily_orders_30d,
orders_7d / 7.0 as avg_daily_orders_7d,
-- Revenue per customer
CASE WHEN active_customers_30d > 0
THEN revenue_30d / active_customers_30d
ELSE 0 END as revenue_per_customer_30d,
-- Average order value
CASE WHEN orders_30d > 0
THEN revenue_30d / orders_30d
ELSE 0 END as avg_order_value_30d
FROM kpi_calculations
),
dashboard_output AS (
SELECT
'Revenue' as metric_category,
'Last 30 Days' as period,
FORMAT(revenue_30d, 'C0') as value,
FORMAT(avg_daily_revenue_30d, 'C0') as daily_average,
NULL as growth_rate
FROM comparative_metrics
UNION ALL
SELECT
'Revenue',
'Last 7 Days',
FORMAT(revenue_7d, 'C0'),
FORMAT(avg_daily_revenue_7d, 'C0'),
CASE
WHEN avg_daily_revenue_30d > 0
THEN FORMAT(((avg_daily_revenue_7d - avg_daily_revenue_30d) / avg_daily_revenue_30d) * 100, 'N1') + '%'
ELSE NULL
END
FROM comparative_metrics
UNION ALL
SELECT
'Orders',
'Last 30 Days',
FORMAT(orders_30d, 'N0'),
FORMAT(avg_daily_orders_30d, 'N1'),
NULL
FROM comparative_metrics
UNION ALL
SELECT
'Customers',
'Last 30 Days',
FORMAT(active_customers_30d, 'N0'),
FORMAT(active_customers_30d / 30.0, 'N1'),
NULL
FROM comparative_metrics
UNION ALL
SELECT
'AOV',
'Last 30 Days',
FORMAT(avg_order_value_30d, 'C2'),
NULL,
NULL
FROM comparative_metrics
)
SELECT * FROM dashboard_output
ORDER BY metric_category, period;
Let's build a comprehensive customer segmentation analysis that combines multiple advanced SQL concepts we've covered. This exercise will help you practice complex CTEs, subqueries, and analytical functions in a realistic business scenario.
Scenario: You're analyzing customer behavior for an e-commerce platform to create dynamic customer segments based on purchase patterns, frequency, and value. The marketing team needs detailed insights to personalize campaigns.
Your task: Create a query that segments customers using RFM analysis (Recency, Frequency, Monetary): compute each customer's base RFM metrics, score each dimension into quintiles (1-5), map the score combinations to named business segments, and add churn-risk and predicted lifetime-value estimates.
Here's the starter code with sample data structure:
-- Sample data setup (you can adapt this to your own dataset)
/*
Tables available:
- orders: order_id, customer_id, order_date, total_amount, status
- customers: customer_id, customer_name, email, registration_date, region
- order_items: order_item_id, order_id, product_id, quantity, unit_price
*/
-- Your solution should produce output similar to:
-- customer_id | rfm_segment | recency_days | frequency | monetary | churn_risk | predicted_ltv | segment_description
WITH analysis_date AS (
-- Pin the analysis date so results are reproducible
SELECT CAST('2024-01-15' AS DATE) as analysis_date
),
-- Your CTE logic goes here
-- Step 1: Calculate base RFM metrics
customer_rfm_raw AS (
-- Calculate recency, frequency, monetary for each customer
-- Your code here
),
-- Step 2: Calculate RFM scores using quintiles
customer_rfm_scores AS (
-- Create 1-5 scores for each RFM dimension
-- Your code here
),
-- Step 3: Create business segments
customer_segments AS (
-- Generate meaningful segment names based on RFM scores
-- Your code here
),
-- Step 4: Add predictive metrics
final_analysis AS (
-- Add churn risk and lifetime value predictions
-- Your code here
)
-- Final SELECT with summary
SELECT * FROM final_analysis
ORDER BY monetary DESC;
Take 30-45 minutes to work through this exercise. The solution involves multiple complex CTEs, window functions, conditional logic, and statistical calculations.
Understanding common pitfalls with subqueries and CTEs can save hours of debugging and prevent performance issues in production systems.
Mistake 1: Using correlated subqueries when joins would be more efficient
-- Inefficient: Correlated subquery executes for each row
SELECT
product_id,
product_name,
(SELECT COUNT(*) FROM order_items oi
WHERE oi.product_id = p.product_id) as times_ordered
FROM products p
WHERE category_id = 'electronics';
-- Better: Use JOIN or window function
SELECT
p.product_id,
p.product_name,
COUNT(oi.order_item_id) as times_ordered
FROM products p
LEFT JOIN order_items oi ON p.product_id = oi.product_id
WHERE p.category_id = 'electronics'
GROUP BY p.product_id, p.product_name;
Mistake 2: Not handling NULL values properly in subqueries
-- Dangerous: Can return unexpected results with NULLs
SELECT customer_id, customer_name
FROM customers
WHERE customer_id NOT IN (
SELECT customer_id FROM orders WHERE total_amount > 100
-- If any customer_id is NULL, this returns no rows!
);
-- Safe: Use EXISTS or handle NULLs explicitly
SELECT customer_id, customer_name
FROM customers c
WHERE NOT EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.customer_id
AND o.total_amount > 100
);
Mistake 3: Overusing scalar subqueries in SELECT lists
-- Inefficient: Multiple scalar subqueries
SELECT
customer_id,
(SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id) as order_count,
(SELECT SUM(total_amount) FROM orders o WHERE o.customer_id = c.customer_id) as total_spent,
(SELECT MAX(order_date) FROM orders o WHERE o.customer_id = c.customer_id) as last_order_date
FROM customers c;
-- Better: Single JOIN with aggregation
SELECT
c.customer_id,
COALESCE(o.order_count, 0) as order_count,
COALESCE(o.total_spent, 0) as total_spent,
o.last_order_date
FROM customers c
LEFT JOIN (
SELECT
customer_id,
COUNT(*) as order_count,
SUM(total_amount) as total_spent,
MAX(order_date) as last_order_date
FROM orders
GROUP BY customer_id
) o ON c.customer_id = o.customer_id;
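A quick check of the refactor's key property, that customers with no orders still appear thanks to the LEFT JOIN and COALESCE, can be run in SQLite (two invented customers, one with no orders):

```python
import sqlite3

# Customer 2 has no orders; the LEFT JOIN must keep it with zeroed metrics.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (customer_id INTEGER);
CREATE TABLE orders (customer_id INTEGER, total_amount REAL);
INSERT INTO customers VALUES (1), (2);
INSERT INTO orders VALUES (1, 100), (1, 50);
""")

rows = con.execute("""
    SELECT c.customer_id,
           COALESCE(o.order_count, 0)  AS order_count,
           COALESCE(o.total_spent, 0)  AS total_spent
    FROM customers c
    LEFT JOIN (
        SELECT customer_id,
               COUNT(*)          AS order_count,
               SUM(total_amount) AS total_spent
        FROM orders GROUP BY customer_id
    ) o ON c.customer_id = o.customer_id
    ORDER BY c.customer_id
""").fetchall()

print(rows)  # [(1, 2, 150.0), (2, 0, 0)]
```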
Mistake 4: Creating overly complex CTE hierarchies
-- Hard to maintain: Too many dependent CTEs
WITH step1 AS (...),
step2 AS (SELECT * FROM step1 WHERE ...),
step3 AS (SELECT * FROM step2 JOIN ...),
step4 AS (SELECT * FROM step3 WHERE ...),
step5 AS (SELECT * FROM step4 GROUP BY ...),
step6 AS (SELECT * FROM step5 HAVING ...),
final_result AS (SELECT * FROM step6 ORDER BY ...)
SELECT * FROM final_result;
-- Better: Combine related logic and use descriptive names
WITH filtered_orders AS (
SELECT o.*, c.region
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= DATEADD(month, -6, GETDATE())
AND o.status = 'completed'
),
regional_aggregates AS (
SELECT
region,
COUNT(*) as order_count,
SUM(total_amount) as total_revenue,
AVG(total_amount) as avg_order_value
FROM filtered_orders
GROUP BY region
HAVING COUNT(*) >= 100 -- Only regions with significant volume
)
SELECT * FROM regional_aggregates
ORDER BY total_revenue DESC;
Mistake 5: Forgetting that CTEs can't be indexed
-- Problem: Large CTE result set used multiple times
WITH large_customer_dataset AS (
SELECT
customer_id,
-- Complex calculations on millions of rows
SUM(total_amount) as lifetime_value,
COUNT(*) as order_count
-- plus more expensive operations...
FROM orders
WHERE order_date >= '2020-01-01'
GROUP BY customer_id
)
SELECT 'High Value', COUNT(*) FROM large_customer_dataset WHERE lifetime_value > 10000
UNION ALL
SELECT 'Medium Value', COUNT(*) FROM large_customer_dataset WHERE lifetime_value BETWEEN 1000 AND 10000
UNION ALL
SELECT 'Low Value', COUNT(*) FROM large_customer_dataset WHERE lifetime_value < 1000;
-- Solution: Use temporary table with indexes
CREATE TEMPORARY TABLE customer_lifetime_values AS
SELECT
customer_id,
SUM(total_amount) as lifetime_value,
COUNT(*) as order_count
FROM orders
WHERE order_date >= '2020-01-01'
GROUP BY customer_id;
CREATE INDEX idx_clv_lifetime_value ON customer_lifetime_values(lifetime_value);
SELECT
CASE
WHEN lifetime_value > 10000 THEN 'High Value'
WHEN lifetime_value >= 1000 THEN 'Medium Value'
ELSE 'Low Value'
END as segment,
COUNT(*) as customer_count
FROM customer_lifetime_values
GROUP BY CASE
WHEN lifetime_value > 10000 THEN 'High Value'
WHEN lifetime_value >= 1000 THEN 'Medium Value'
ELSE 'Low Value'
END;
Strategy 1: Test CTEs individually
-- Instead of running the entire complex query, test each CTE separately
WITH monthly_sales AS (
SELECT
FORMAT(order_date, 'yyyy-MM') as month_year,
SUM(total_amount) as monthly_revenue
FROM orders
WHERE order_date >= DATEADD(month, -12, GETDATE())
GROUP BY FORMAT(order_date, 'yyyy-MM')
)
-- Test this CTE first
SELECT * FROM monthly_sales ORDER BY month_year;
-- Then add the next CTE and test again
Strategy 2: Use execution plans to identify bottlenecks
-- Enable actual execution plan to see where time is spent
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- Your complex query here
WITH expensive_analysis AS (...)
SELECT * FROM expensive_analysis;
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
Strategy 3: Add row count checkpoints
WITH base_data AS (
SELECT customer_id, order_date, total_amount
FROM orders
WHERE order_date >= DATEADD(year, -1, GETDATE())
),
row_count_checkpoint AS (
-- Select from this CTE on its own to verify row counts while debugging;
-- the final query below does not reference it
SELECT *, COUNT(*) OVER() as total_rows_checkpoint1
FROM base_data
),
filtered_data AS (
SELECT customer_id, SUM(total_amount) as total_spent
FROM base_data
GROUP BY customer_id
HAVING SUM(total_amount) > 1000
),
final_checkpoint AS (
SELECT *, COUNT(*) OVER() as total_rows_final
FROM filtered_data
)
SELECT
COUNT(*) as customers_over_1000,
MIN(total_spent) as min_spent,
MAX(total_spent) as max_spent,
AVG(total_spent) as avg_spent
FROM final_checkpoint;
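The checkpoint trick works because `COUNT(*) OVER ()` tags every surviving row with the stage's total row count. A minimal runnable sketch (invented data) makes the mechanism visible:

```python
import sqlite3

# After the HAVING filter, each surviving row carries the stage's row count.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (customer_id INTEGER, total_amount REAL);
INSERT INTO orders VALUES (1, 600), (1, 700), (2, 100);
""")

rows = con.execute("""
    WITH spenders AS (
        SELECT customer_id, SUM(total_amount) AS total_spent
        FROM orders GROUP BY customer_id
        HAVING SUM(total_amount) > 1000
    )
    SELECT customer_id, total_spent,
           COUNT(*) OVER () AS rows_at_this_stage
    FROM spenders
""").fetchall()

print(rows)  # [(1, 1300.0, 1)] -- one customer survived the HAVING filter
```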
You've now mastered both the tactical and strategic aspects of subqueries and CTEs in SQL. These powerful constructs form the foundation of advanced analytical SQL, enabling you to break down complex business problems into manageable, readable, and performant queries.
Key takeaways from this lesson: match the subquery type (scalar, row, column, or table) to the problem at hand; prefer EXISTS and NOT EXISTS over IN and NOT IN when NULLs or large result sets are in play; use CTEs to name intermediate steps and reference them multiple times; reach for recursive CTEs on hierarchical data, always with a depth guard; and materialize expensive, repeatedly used intermediate results in temporary tables.
From here, practice these patterns against your own datasets and read your engine's execution plans to see when it inlines a CTE and when it materializes one.
The analytical patterns you've learned here will serve as building blocks for more advanced topics like machine learning feature engineering in SQL, real-time analytics processing, and enterprise-scale data warehouse query optimization.
Remember: the goal isn't just to write functional SQL, but to create maintainable, performant, and understandable analytical systems that can evolve with your business needs. CTEs and subqueries, when used strategically, are essential tools in achieving that goal.