
Picture this: You're analyzing customer data for an e-commerce company and need to find customers who've made above-average purchases in the last quarter, but only in product categories where the company achieved at least 15% profit margin. Your first instinct might be to create multiple queries and manually combine the results in a spreadsheet—but there's a better way.
This scenario perfectly illustrates why subqueries and Common Table Expressions (CTEs) are essential tools in your SQL arsenal. They allow you to break down complex analytical problems into logical, manageable pieces while keeping everything within a single, powerful query. By the end of this lesson, you'll transform from writing basic SELECT statements to crafting sophisticated analytical queries that would impress any data team.
What you'll learn:
- How to write subqueries in the WHERE, SELECT, and FROM clauses
- The difference between correlated and non-correlated subqueries, and what it means for performance
- How to structure multi-step analyses with Common Table Expressions (CTEs), including recursive CTEs
- When to choose subqueries, CTEs, JOINs, or temporary tables
- Common pitfalls and debugging techniques for complex queries
You should be comfortable with basic SQL operations (SELECT, WHERE, JOIN, GROUP BY) and understand how relational databases organize data into tables. Familiarity with aggregate functions and basic performance concepts will help you get the most from this lesson.
A subquery is simply a query nested inside another query. Think of it as asking a question to help answer a bigger question. Let's start with our e-commerce scenario using realistic sample data.
We'll work with three tables that mirror real-world e-commerce data:
-- Customers table
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
customer_name VARCHAR(100),
registration_date DATE,
country VARCHAR(50)
);
-- Orders table
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
order_date DATE,
total_amount DECIMAL(10,2),
category VARCHAR(50)
);
-- Products table
CREATE TABLE products (
product_id INT PRIMARY KEY,
product_name VARCHAR(100),
category VARCHAR(50),
cost_price DECIMAL(8,2),
selling_price DECIMAL(8,2)
);
The most common use of subqueries is filtering results based on calculations from other tables. Let's find customers who've spent more than the average order amount:
SELECT customer_id, customer_name, country
FROM customers
WHERE customer_id IN (
SELECT customer_id
FROM orders
WHERE total_amount > (
SELECT AVG(total_amount)
FROM orders
)
);
This query contains two subqueries:
- SELECT AVG(total_amount) FROM orders calculates the average order value
- SELECT customer_id FROM orders WHERE total_amount > (...) finds every customer who placed at least one above-average order
Notice how we built this step by step. Each subquery answers a specific question, making the logic easy to follow and debug.
Understanding this distinction is crucial for both performance and functionality. The previous example used non-correlated subqueries—they can run independently of the outer query.
Correlated subqueries reference columns from the outer query, creating a dependency. Here's an example that finds customers with at least one order above their personal average, along with each customer's most recent order date:
SELECT c.customer_id, c.customer_name,
(SELECT MAX(o1.order_date)
FROM orders o1
WHERE o1.customer_id = c.customer_id) as last_order_date
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o2
WHERE o2.customer_id = c.customer_id
AND o2.total_amount > (
SELECT AVG(o3.total_amount)
FROM orders o3
WHERE o3.customer_id = c.customer_id
)
);
The correlated subquery WHERE o3.customer_id = c.customer_id creates a dependency—it must re-execute for each customer in the outer query. This makes correlated subqueries slower but enables more sophisticated logic.
Performance Tip: Non-correlated subqueries execute once; correlated subqueries execute once per outer query row. Use EXISTS instead of IN when possible with correlated subqueries for better performance.
Subqueries in the SELECT clause add calculated columns to your results. This is powerful for creating analytical reports:
SELECT
c.customer_name,
c.country,
(SELECT COUNT(*)
FROM orders o
WHERE o.customer_id = c.customer_id) as total_orders,
(SELECT AVG(o.total_amount)
FROM orders o
WHERE o.customer_id = c.customer_id) as avg_order_value,
(SELECT MAX(o.order_date)
FROM orders o
WHERE o.customer_id = c.customer_id) as last_order_date
FROM customers c
WHERE c.registration_date >= '2023-01-01';
Each subquery in the SELECT clause must return exactly one value (scalar subquery). This creates a rich analytical view combining customer demographics with behavioral metrics.
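Here's a hedged sqlite3 sketch of the same pattern: three correlated scalar subqueries adding per-customer metrics. The sample data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INT PRIMARY KEY, customer_name TEXT);
CREATE TABLE orders (order_id INT PRIMARY KEY, customer_id INT,
                     order_date TEXT, total_amount REAL);
INSERT INTO customers VALUES (1,'Ada'),(2,'Grace');
INSERT INTO orders VALUES
  (10,1,'2023-02-01',100.0),(11,1,'2023-03-01',300.0),(12,2,'2023-04-01',80.0);
""")

# Each parenthesized subquery returns exactly one value per customer row.
rows = conn.execute("""
    SELECT c.customer_name,
           (SELECT COUNT(*)            FROM orders o WHERE o.customer_id = c.customer_id) AS total_orders,
           (SELECT AVG(o.total_amount) FROM orders o WHERE o.customer_id = c.customer_id) AS avg_order_value,
           (SELECT MAX(o.order_date)   FROM orders o WHERE o.customer_id = c.customer_id) AS last_order_date
    FROM customers c
    ORDER BY c.customer_id
""").fetchall()

print(rows)  # [('Ada', 2, 200.0, '2023-03-01'), ('Grace', 1, 80.0, '2023-04-01')]
```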
Using subqueries in the FROM clause treats the subquery result as a temporary table. This is excellent for multi-step calculations:
SELECT
monthly_sales.order_month,
monthly_sales.total_revenue,
monthly_sales.order_count,
ROUND(monthly_sales.total_revenue / monthly_sales.order_count, 2) as avg_order_value
FROM (
SELECT
DATE_FORMAT(order_date, '%Y-%m') as order_month,
SUM(total_amount) as total_revenue,
COUNT(*) as order_count
FROM orders
WHERE order_date >= '2023-01-01'
GROUP BY DATE_FORMAT(order_date, '%Y-%m')
) as monthly_sales
ORDER BY monthly_sales.order_month;
The subquery aggregates monthly data, then the outer query calculates derived metrics. This pattern is invaluable for building complex analytical queries step by step.
While subqueries are powerful, they can become difficult to read and maintain in complex scenarios. Common Table Expressions (CTEs) solve this by letting you define temporary named result sets at the beginning of your query.
Let's rewrite our monthly sales analysis using a CTE:
WITH monthly_sales AS (
SELECT
DATE_FORMAT(order_date, '%Y-%m') as order_month,
SUM(total_amount) as total_revenue,
COUNT(*) as order_count
FROM orders
WHERE order_date >= '2023-01-01'
GROUP BY DATE_FORMAT(order_date, '%Y-%m')
)
SELECT
order_month,
total_revenue,
order_count,
ROUND(total_revenue / order_count, 2) as avg_order_value
FROM monthly_sales
ORDER BY order_month;
The logic is identical to our subquery version, but notice how much more readable this is. The CTE acts like a temporary view that exists only for this query.
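You can verify the equivalence yourself. This sqlite3 sketch runs both versions of the monthly analysis against the same invented orders; strftime('%Y-%m', ...) stands in for MySQL's DATE_FORMAT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INT PRIMARY KEY, order_date TEXT, total_amount REAL);
INSERT INTO orders VALUES
  (1,'2023-01-05',100.0),(2,'2023-01-20',50.0),(3,'2023-02-10',200.0);
""")

# FROM-clause subquery (derived table) version.
derived = conn.execute("""
    SELECT order_month, total_revenue, order_count,
           ROUND(total_revenue / order_count, 2) AS avg_order_value
    FROM (SELECT strftime('%Y-%m', order_date) AS order_month,
                 SUM(total_amount) AS total_revenue,
                 COUNT(*) AS order_count
          FROM orders GROUP BY 1) AS monthly_sales
    ORDER BY order_month
""").fetchall()

# CTE version: same logic, named up front.
cte = conn.execute("""
    WITH monthly_sales AS (
        SELECT strftime('%Y-%m', order_date) AS order_month,
               SUM(total_amount) AS total_revenue,
               COUNT(*) AS order_count
        FROM orders GROUP BY 1)
    SELECT order_month, total_revenue, order_count,
           ROUND(total_revenue / order_count, 2) AS avg_order_value
    FROM monthly_sales ORDER BY order_month
""").fetchall()

print(cte)  # [('2023-01', 150.0, 2, 75.0), ('2023-02', 200.0, 1, 200.0)]
```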
Real analytical work often requires multiple calculation steps. CTEs excel here by letting you chain logical steps together:
WITH profitable_categories AS (
-- Step 1: Find categories with >15% profit margin
SELECT DISTINCT category
FROM products
WHERE (selling_price - cost_price) / selling_price > 0.15
),
customer_totals AS (
-- Step 2: Calculate total spending per customer in profitable categories
SELECT
o.customer_id,
SUM(o.total_amount) as total_spent,
COUNT(*) as order_count
FROM orders o
INNER JOIN profitable_categories pc ON o.category = pc.category
WHERE o.order_date >= DATE_SUB(CURRENT_DATE, INTERVAL 3 MONTH)
GROUP BY o.customer_id
),
spending_stats AS (
-- Step 3: Calculate average spending across all customers
SELECT AVG(total_spent) as avg_customer_spending
FROM customer_totals
)
-- Step 4: Find above-average customers
SELECT
c.customer_name,
c.country,
ct.total_spent,
ct.order_count,
ROUND(ct.total_spent / ct.order_count, 2) as avg_order_value
FROM customers c
INNER JOIN customer_totals ct ON c.customer_id = ct.customer_id
CROSS JOIN spending_stats ss
WHERE ct.total_spent > ss.avg_customer_spending
ORDER BY ct.total_spent DESC;
This query solves our original problem: finding customers with above-average spending in high-profit categories. Each CTE handles one logical step, making the query self-documenting and easy to modify.
Recursive CTEs handle hierarchical or graph-like data structures. Common use cases include organizational charts, category trees, or social networks. Here's an example with a product category hierarchy:
-- Category hierarchy table
CREATE TABLE category_hierarchy (
category_id INT,
category_name VARCHAR(50),
parent_category_id INT
);
-- Recursive CTE to find all subcategories under 'Electronics'
WITH RECURSIVE category_tree AS (
-- Base case: Find the root category
SELECT
category_id,
category_name,
parent_category_id,
0 as level,
CAST(category_name AS CHAR(1000)) as path
FROM category_hierarchy
WHERE category_name = 'Electronics'
UNION ALL
-- Recursive case: Find children of current level
SELECT
ch.category_id,
ch.category_name,
ch.parent_category_id,
ct.level + 1,
CONCAT(ct.path, ' > ', ch.category_name)
FROM category_hierarchy ch
INNER JOIN category_tree ct ON ch.parent_category_id = ct.category_id
)
SELECT
CONCAT(REPEAT(' ', level), category_name) as indented_category,
level,
path
FROM category_tree
ORDER BY path;
Recursive CTEs have two parts: the base case (starting point) and the recursive case (how to find the next level). The database engine automatically handles the iteration until no more rows are found.
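Here's a compact, runnable sketch of the same idea using sqlite3 (which also supports WITH RECURSIVE). The category tree is invented: Electronics contains Computers and Phones, and Computers contains Laptops.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category_hierarchy (category_id INT, category_name TEXT, parent_category_id INT);
INSERT INTO category_hierarchy VALUES
  (1,'Electronics',NULL),(2,'Computers',1),(3,'Phones',1),(4,'Laptops',2);
""")

rows = conn.execute("""
    WITH RECURSIVE category_tree AS (
        -- Base case: the root category
        SELECT category_id, category_name, 0 AS level, category_name AS path
        FROM category_hierarchy WHERE category_name = 'Electronics'
        UNION ALL
        -- Recursive case: children of rows found so far
        SELECT ch.category_id, ch.category_name, ct.level + 1,
               ct.path || ' > ' || ch.category_name
        FROM category_hierarchy ch
        JOIN category_tree ct ON ch.parent_category_id = ct.category_id)
    SELECT path, level FROM category_tree ORDER BY path
""").fetchall()

for path, level in rows:
    print(level, path)
# 0 Electronics
# 1 Electronics > Computers
# 2 Electronics > Computers > Laptops
# 1 Electronics > Phones
```

Sorting by the accumulated path string conveniently groups each branch with its parent.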
CTEs work beautifully with window functions for sophisticated analytical queries. Let's create a customer segmentation analysis:
WITH customer_metrics AS (
SELECT
c.customer_id,
c.customer_name,
c.registration_date,
COUNT(o.order_id) as total_orders,
SUM(o.total_amount) as total_spent,
AVG(o.total_amount) as avg_order_value,
MAX(o.order_date) as last_order_date,
DATEDIFF(CURRENT_DATE, MAX(o.order_date)) as days_since_last_order
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name, c.registration_date
),
customer_percentiles AS (
SELECT
*,
NTILE(4) OVER (ORDER BY total_spent) as spending_quartile,
NTILE(4) OVER (ORDER BY total_orders) as frequency_quartile,
NTILE(4) OVER (ORDER BY days_since_last_order DESC) as recency_quartile
FROM customer_metrics
WHERE total_orders > 0
)
SELECT
customer_name,
total_spent,
total_orders,
days_since_last_order,
spending_quartile,
frequency_quartile,
recency_quartile,
CASE
WHEN spending_quartile = 4 AND frequency_quartile >= 3 AND recency_quartile >= 3
THEN 'Champion'
WHEN spending_quartile >= 3 AND frequency_quartile >= 3 AND recency_quartile >= 2
THEN 'Loyal Customer'
WHEN spending_quartile >= 3 AND recency_quartile >= 3
THEN 'Potential Loyalist'
WHEN recency_quartile >= 3
THEN 'New Customer'
WHEN spending_quartile >= 2 AND frequency_quartile <= 2 AND recency_quartile <= 2
THEN 'At Risk'
ELSE 'Needs Attention'
END as customer_segment
FROM customer_percentiles
ORDER BY total_spent DESC;
This creates an RFM (Recency, Frequency, Monetary) analysis using multiple CTEs to build customer segments—a common real-world analytics pattern.
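To build intuition for NTILE, here's a small sqlite3 sketch (window functions require SQLite 3.25+, bundled with recent Python) that splits eight invented customers into spending quartiles, just like the spending_quartile column above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spend (customer TEXT, total_spent REAL)")
conn.executemany("INSERT INTO spend VALUES (?, ?)",
                 [(f"c{i}", i * 100.0) for i in range(1, 9)])  # 100 .. 800

# NTILE(4) assigns each row a bucket 1-4 of roughly equal size,
# ordered by total_spent (bucket 4 = top spenders).
rows = conn.execute("""
    SELECT customer, total_spent,
           NTILE(4) OVER (ORDER BY total_spent) AS spending_quartile
    FROM spend ORDER BY total_spent
""").fetchall()

print(rows[-1])  # ('c8', 800.0, 4): the top spender lands in quartile 4
```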
CTEs are excellent for data quality checks before analysis:
WITH data_quality_checks AS (
SELECT
'orders' as table_name,
'duplicate_orders' as check_type,
COUNT(*) as issue_count
FROM (
SELECT customer_id, order_date, total_amount
FROM orders
GROUP BY customer_id, order_date, total_amount
HAVING COUNT(*) > 1
) duplicates
UNION ALL
SELECT
'orders',
'negative_amounts',
COUNT(*)
FROM orders
WHERE total_amount < 0
UNION ALL
SELECT
'orders',
'future_dates',
COUNT(*)
FROM orders
WHERE order_date > CURRENT_DATE
UNION ALL
SELECT
'customers',
'missing_names',
COUNT(*)
FROM customers
WHERE customer_name IS NULL OR TRIM(customer_name) = ''
),
summary_stats AS (
SELECT
SUM(issue_count) as total_issues,
COUNT(*) as total_checks
FROM data_quality_checks
)
SELECT
dqc.*,
ROUND(dqc.issue_count * 100.0 /
CASE dqc.table_name
WHEN 'orders' THEN (SELECT COUNT(*) FROM orders)
WHEN 'customers' THEN (SELECT COUNT(*) FROM customers)
END, 2) as percentage_affected
FROM data_quality_checks dqc
CROSS JOIN summary_stats ss
ORDER BY dqc.issue_count DESC;
This pattern systematically checks for common data quality issues and provides metrics for each problem type.
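As a quick demonstration, this sqlite3 sketch runs a trimmed-down version of the UNION ALL check pattern against an invented orders table seeded with one negative amount and one far-future date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INT PRIMARY KEY, order_date TEXT, total_amount REAL);
INSERT INTO orders VALUES
  (1,'2023-01-05',100.0),(2,'2023-02-01',-20.0),(3,'2999-01-01',50.0);
""")

# Each branch of the UNION ALL counts one kind of data-quality issue.
checks = conn.execute("""
    WITH data_quality_checks AS (
        SELECT 'negative_amounts' AS check_type, COUNT(*) AS issue_count
        FROM orders WHERE total_amount < 0
        UNION ALL
        SELECT 'future_dates', COUNT(*)
        FROM orders WHERE order_date > date('now'))
    SELECT check_type, issue_count FROM data_quality_checks
""").fetchall()

print(checks)  # [('negative_amounts', 1), ('future_dates', 1)]
```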
Now let's put everything together in a comprehensive analysis. You'll build a Customer Lifetime Value (CLV) calculation that segments customers and identifies the most valuable segments.
Create a query that:
1. Builds base metrics for every customer (order counts, revenue, recency)
2. Projects a simple Customer Lifetime Value and monthly order frequency
3. Assigns each customer a CLV decile and a lifecycle stage
4. Summarizes the resulting segments by country
WITH customer_base_metrics AS (
SELECT
c.customer_id,
c.customer_name,
c.country,
c.registration_date,
DATE_FORMAT(c.registration_date, '%Y-%m') as registration_month,
DATEDIFF(CURRENT_DATE, c.registration_date) as customer_age_days,
COALESCE(COUNT(o.order_id), 0) as total_orders,
COALESCE(SUM(o.total_amount), 0) as total_revenue,
COALESCE(AVG(o.total_amount), 0) as avg_order_value,
CASE
WHEN COUNT(o.order_id) > 0
THEN DATEDIFF(CURRENT_DATE, MAX(o.order_date))
ELSE NULL
END as days_since_last_order
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name, c.country, c.registration_date
)
SELECT * FROM customer_base_metrics LIMIT 10;
WITH customer_base_metrics AS (
-- [Previous CTE code here]
),
customer_clv AS (
SELECT
*,
-- Simple CLV: total revenue adjusted for customer lifespan
CASE
WHEN customer_age_days > 0 AND total_orders > 0
THEN (total_revenue / customer_age_days) * 365 * 2 -- Project 2 years forward
ELSE total_revenue
END as projected_clv,
-- Order frequency (orders per month)
CASE
WHEN customer_age_days > 30 AND total_orders > 0
THEN (total_orders * 30.0) / customer_age_days
ELSE 0
END as monthly_order_frequency
FROM customer_base_metrics
),
customer_segments AS (
SELECT
*,
NTILE(10) OVER (ORDER BY projected_clv) as clv_decile,
CASE
WHEN days_since_last_order IS NULL THEN 'Never Purchased'
WHEN days_since_last_order <= 30 THEN 'Active'
WHEN days_since_last_order <= 90 THEN 'Lapsing'
ELSE 'Churned'
END as lifecycle_stage
FROM customer_clv
)
SELECT
customer_name,
country,
registration_month,
total_orders,
ROUND(total_revenue, 2) as total_revenue,
ROUND(projected_clv, 2) as projected_clv,
clv_decile,
lifecycle_stage,
ROUND(monthly_order_frequency, 2) as monthly_order_frequency
FROM customer_segments
WHERE clv_decile >= 8 -- Top 30% of customers
ORDER BY projected_clv DESC;
-- [Previous CTEs]
segment_analysis AS (
SELECT
country,
lifecycle_stage,
COUNT(*) as customer_count,
AVG(projected_clv) as avg_clv,
AVG(monthly_order_frequency) as avg_frequency,
SUM(total_revenue) as total_segment_revenue
FROM customer_segments
GROUP BY country, lifecycle_stage
)
SELECT
country,
lifecycle_stage,
customer_count,
ROUND(avg_clv, 2) as avg_clv,
ROUND(avg_frequency, 3) as avg_monthly_frequency,
ROUND(total_segment_revenue, 2) as segment_revenue,
ROUND(total_segment_revenue / SUM(total_segment_revenue) OVER() * 100, 1) as revenue_percentage
FROM segment_analysis
WHERE customer_count >= 5 -- Focus on meaningful segments
ORDER BY avg_clv DESC;
This exercise demonstrates how CTEs let you build complex analytical logic step by step, making sophisticated calculations manageable and readable.
Understanding performance implications helps you choose the right tool for each situation.
-- Subquery approach - potentially slower for large datasets
SELECT customer_name
FROM customers
WHERE customer_id IN (
SELECT customer_id
FROM orders
WHERE total_amount > 1000
);
-- JOIN approach - typically faster
SELECT DISTINCT c.customer_name
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
WHERE o.total_amount > 1000;
Use subqueries when:
- You're comparing rows against an aggregate (e.g., "above the average")
- You only need to test for existence (EXISTS / NOT EXISTS)
- The nested logic reads more naturally as a question within a question
Use JOINs when:
- You need columns from both tables in the result
- You're filtering on a straightforward relationship between tables
- You're working with large datasets where the optimizer can use indexes on the join keys
CTEs exist only for the duration of a single query, while temporary tables persist for the session. For complex multi-step processes, consider:
-- CTE approach - good for single query
WITH complex_calculation AS (
-- Expensive calculation here
)
SELECT * FROM complex_calculation
UNION ALL
SELECT * FROM complex_calculation WHERE condition; -- CTE may be recalculated!
-- Temporary table approach - better for reuse
CREATE TEMPORARY TABLE temp_complex_calculation AS (
-- Expensive calculation here
);
SELECT * FROM temp_complex_calculation
UNION ALL
SELECT * FROM temp_complex_calculation WHERE condition; -- Reuses stored result
DROP TEMPORARY TABLE temp_complex_calculation;
Performance Warning: In many engines, a CTE is recalculated each time it's referenced in a query (others, such as PostgreSQL 12+, may materialize it instead). If you reference a CTE multiple times and it contains expensive operations, consider using a temporary table instead.
-- Wrong - this subquery isn't correlated properly
SELECT customer_name
FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE total_amount > 1000 -- Missing: AND o.customer_id = c.customer_id
);
-- Correct
SELECT customer_name
FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.customer_id AND total_amount > 1000
);
-- Wrong - subquery might return multiple rows
SELECT customer_name,
(SELECT total_amount FROM orders WHERE customer_id = c.customer_id)
FROM customers c;
-- Correct - explicitly handle multiple values
SELECT customer_name,
(SELECT MAX(total_amount) FROM orders WHERE customer_id = c.customer_id)
FROM customers c;
-- Wrong - ambiguous column references
WITH orders AS (
SELECT customer_id, SUM(total_amount) as total
FROM orders -- This creates confusion!
GROUP BY customer_id
)
SELECT * FROM orders;
-- Correct - clear naming
WITH customer_totals AS (
SELECT customer_id, SUM(total_amount) as total_spent
FROM orders
GROUP BY customer_id
)
SELECT * FROM customer_totals;
When debugging multi-CTE queries:
- Inspect each CTE independently by temporarily pointing the final SELECT at it
- Add LIMIT clauses while exploring intermediate results
- Verify row counts at each step before trusting the final numbers
-- Debug technique: Test intermediate results
WITH step1 AS (
SELECT customer_id, COUNT(*) as order_count
FROM orders
GROUP BY customer_id
)
-- Debug tip: temporarily change the final SELECT below to "SELECT * FROM step1 LIMIT 10"
, step2 AS (
SELECT customer_id, order_count,
CASE WHEN order_count > 5 THEN 'Frequent' ELSE 'Occasional' END as segment
FROM step1
)
SELECT * FROM step2;
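The workflow above can be sketched in sqlite3: keep the CTE chain fixed and point the final SELECT at whichever step you want to inspect. The orders data here is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INT PRIMARY KEY, customer_id INT);
INSERT INTO orders VALUES (1,1),(2,1),(3,1),(4,1),(5,1),(6,1),(7,2);
""")

# The shared CTE chain; only the final SELECT changes while debugging.
ctes = """
    WITH step1 AS (
        SELECT customer_id, COUNT(*) AS order_count
        FROM orders GROUP BY customer_id),
    step2 AS (
        SELECT customer_id, order_count,
               CASE WHEN order_count > 5 THEN 'Frequent' ELSE 'Occasional' END AS segment
        FROM step1)
"""

debug = conn.execute(ctes + "SELECT * FROM step1 ORDER BY customer_id").fetchall()
final = conn.execute(ctes + "SELECT * FROM step2 ORDER BY customer_id").fetchall()
print(debug)  # [(1, 6), (2, 1)]
print(final)  # [(1, 6, 'Frequent'), (2, 1, 'Occasional')]
```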
You've now mastered the essential tools for writing complex analytical SQL queries. Subqueries let you embed logic directly within other queries, while CTEs provide a readable way to structure multi-step calculations. Understanding when to use each approach—and when alternatives like JOINs might be better—makes you a more effective data professional.
Key takeaways:
- Subqueries nest one question inside another and work in the WHERE, SELECT, and FROM clauses
- Non-correlated subqueries run once; correlated subqueries run per outer row, so use them deliberately
- Scalar subqueries in the SELECT clause must return exactly one value
- CTEs make multi-step logic readable and self-documenting; recursive CTEs handle hierarchies
- Consider JOINs for performance, and temporary tables when an expensive result is reused
What's next: Now that you can structure complex analytical logic, you're ready to tackle advanced topics like window functions for sophisticated ranking and analytical calculations, or query optimization techniques for handling large datasets efficiently. You might also explore database-specific extensions like PostgreSQL's advanced CTE features or SQL Server's MERGE statements that build on these foundations.
Practice these patterns with your own datasets, and you'll find that problems that once seemed impossible become manageable, step-by-step analytical challenges.