
You're staring at a spreadsheet with thousands of rows of sales data, trying to answer what should be a simple question: "Which products sold above our average price last month?" In a spreadsheet, you'd need multiple steps—first calculate the average, then filter based on that result. But in SQL, you can answer complex questions like this in a single, elegant query using subqueries and Common Table Expressions (CTEs).
These powerful SQL features let you break down complicated data problems into manageable pieces, creating queries within queries that build upon each other logically. Think of them as a way to create temporary "scratch work" that helps you solve multi-step analytical problems without creating permanent tables or running multiple separate queries.
By the end of this lesson, you'll transform from someone who struggles with complex data questions into someone who can tackle sophisticated analytical challenges with confidence and clarity.
What you'll learn:
You should be comfortable with basic SQL SELECT statements, including filtering with WHERE clauses, sorting with ORDER BY, and using aggregate functions like COUNT, SUM, and AVG. If you can write queries that join tables together, you're ready for this lesson.
A subquery is simply a SQL query nested inside another SQL query. It's like asking a question that depends on the answer to another question. Let's start with a concrete example using a realistic e-commerce dataset.
Imagine you're analyzing an online store's data with these tables:
-- Products table
CREATE TABLE products (
product_id INT,
product_name VARCHAR(100),
category VARCHAR(50),
price DECIMAL(10,2)
);
-- Orders table
CREATE TABLE orders (
order_id INT,
product_id INT,
customer_id INT,
quantity INT,
order_date DATE,
total_amount DECIMAL(10,2)
);
Now, suppose you want to find all products that cost more than the average product price. Without subqueries, you'd need to:
With a subquery, you can do this in one step:
SELECT product_name, price
FROM products
WHERE price > (
SELECT AVG(price)
FROM products
);
The query inside the parentheses (SELECT AVG(price) FROM products) runs first, calculating the average price. Then the outer query uses that result to filter products.
Key Insight: Subqueries always execute from the inside out. The innermost query runs first, then its results feed into the outer query.
Subqueries can appear in three main places in a SQL statement, each serving different purposes.
WHERE clause subqueries are perfect for filtering based on calculated values or conditions from other tables. Here's a practical example: finding customers who placed orders above the average order value.
SELECT DISTINCT customer_id
FROM orders
WHERE total_amount > (
SELECT AVG(total_amount)
FROM orders
);
You can also use subqueries with operators like IN, NOT IN, EXISTS, and NOT EXISTS:
-- Find products that have never been ordered
SELECT product_name
FROM products
WHERE product_id NOT IN (
SELECT DISTINCT product_id
FROM orders
WHERE product_id IS NOT NULL
);
Important: Always include
WHERE product_id IS NOT NULLwhen using NOT IN with subqueries. If the subquery returns any NULL values, NOT IN will return no results at all—a common source of confusion.
SELECT clause subqueries let you add calculated columns based on related data. This is useful for adding context or comparisons to your main results:
SELECT
product_name,
price,
(SELECT AVG(price) FROM products) AS avg_price,
price - (SELECT AVG(price) FROM products) AS price_difference
FROM products
ORDER BY price_difference DESC;
This query shows each product's price alongside the overall average and the difference from that average.
FROM clause subqueries create temporary result sets that you can query like tables. This is powerful for multi-step analysis:
SELECT
category,
AVG(price) AS category_avg_price
FROM (
SELECT category, price
FROM products
WHERE price > 50
) AS expensive_products
GROUP BY category;
Here, the subquery first filters to expensive products, then the outer query calculates averages by category for just those expensive items.
While subqueries work well for simple cases, they can become hard to read and maintain as your logic gets more complex. Common Table Expressions (CTEs) solve this problem by letting you define temporary named result sets that exist for the duration of your query.
Think of a CTE as giving a name to a subquery and defining it at the top of your statement, making your overall query much more readable.
Here's the basic syntax:
WITH cte_name AS (
-- Your subquery here
SELECT column1, column2
FROM some_table
WHERE some_condition
)
SELECT *
FROM cte_name;
Let's rewrite our expensive products analysis using a CTE:
WITH expensive_products AS (
SELECT category, price
FROM products
WHERE price > 50
)
SELECT
category,
AVG(price) AS category_avg_price
FROM expensive_products
GROUP BY category;
The logic is identical, but notice how much clearer the intent becomes. You can immediately see that we're working with "expensive products" without having to parse through nested parentheses.
Let's work through some realistic scenarios where CTEs shine.
Suppose you want to analyze month-over-month sales growth. This requires multiple steps:
WITH monthly_sales AS (
SELECT
YEAR(order_date) AS year,
MONTH(order_date) AS month,
SUM(total_amount) AS monthly_total
FROM orders
GROUP BY YEAR(order_date), MONTH(order_date)
),
sales_with_previous AS (
SELECT
year,
month,
monthly_total,
LAG(monthly_total) OVER (ORDER BY year, month) AS previous_month_total
FROM monthly_sales
)
SELECT
year,
month,
monthly_total,
previous_month_total,
CASE
WHEN previous_month_total IS NOT NULL
THEN ((monthly_total - previous_month_total) / previous_month_total) * 100
ELSE NULL
END AS growth_rate_percent
FROM sales_with_previous
ORDER BY year, month;
This query uses two CTEs:
monthly_sales aggregates orders by monthsales_with_previous adds the previous month's total using a window functionHere's a more complex example that segments customers based on their purchasing behavior:
WITH customer_stats AS (
SELECT
customer_id,
COUNT(*) AS total_orders,
SUM(total_amount) AS total_spent,
AVG(total_amount) AS avg_order_value,
MAX(order_date) AS last_order_date,
MIN(order_date) AS first_order_date
FROM orders
GROUP BY customer_id
),
customer_segments AS (
SELECT
customer_id,
total_orders,
total_spent,
avg_order_value,
DATEDIFF(CURRENT_DATE, last_order_date) AS days_since_last_order,
CASE
WHEN total_spent > 1000 AND total_orders > 10 THEN 'VIP'
WHEN total_spent > 500 AND total_orders > 5 THEN 'High Value'
WHEN days_since_last_order < 30 THEN 'Active'
WHEN days_since_last_order < 90 THEN 'At Risk'
ELSE 'Inactive'
END AS segment
FROM customer_stats
)
SELECT
segment,
COUNT(*) AS customer_count,
AVG(total_spent) AS avg_customer_value,
AVG(total_orders) AS avg_orders_per_customer
FROM customer_segments
GROUP BY segment
ORDER BY customer_count DESC;
This analysis would be nearly impossible to read as nested subqueries, but with CTEs, each step is clear and logical.
Some database systems (like PostgreSQL, SQL Server, and newer versions of MySQL) support recursive CTEs, which can call themselves. This is useful for hierarchical data like organizational charts or category trees.
Here's a simple example with an employee hierarchy:
WITH RECURSIVE employee_hierarchy AS (
-- Base case: top-level managers
SELECT employee_id, employee_name, manager_id, 1 as level
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive case: employees reporting to someone in the hierarchy
SELECT e.employee_id, e.employee_name, e.manager_id, h.level + 1
FROM employees e
INNER JOIN employee_hierarchy h ON e.manager_id = h.employee_id
)
SELECT employee_name, level
FROM employee_hierarchy
ORDER BY level, employee_name;
Note: Recursive CTEs are an advanced feature not supported in all database systems. Check your database documentation before using them.
Choosing between subqueries and CTEs depends on several factors:
Use subqueries when:
Use CTEs when:
Here's the same analysis written both ways to illustrate the difference:
With subqueries (harder to read):
SELECT p.product_name, p.price
FROM products p
WHERE p.price > (
SELECT AVG(o.total_amount / o.quantity)
FROM orders o
WHERE o.product_id IN (
SELECT product_id
FROM products
WHERE category = p.category
)
);
With CTEs (much clearer):
WITH category_avg_prices AS (
SELECT
p.category,
AVG(o.total_amount / o.quantity) AS avg_selling_price
FROM products p
JOIN orders o ON p.product_id = o.product_id
GROUP BY p.category
)
SELECT p.product_name, p.price
FROM products p
JOIN category_avg_prices cap ON p.category = cap.category
WHERE p.price > cap.avg_selling_price;
Let's practice with a realistic scenario. You're analyzing a subscription business with these tables:
-- Create sample data
CREATE TABLE subscriptions (
subscription_id INT,
customer_id INT,
plan_type VARCHAR(50),
monthly_fee DECIMAL(10,2),
start_date DATE,
end_date DATE
);
CREATE TABLE payments (
payment_id INT,
subscription_id INT,
payment_date DATE,
amount DECIMAL(10,2),
status VARCHAR(20)
);
Challenge: Write a query using CTEs to find:
Try writing this yourself first, then compare with this solution:
WITH successful_payments AS (
SELECT
s.customer_id,
p.amount,
p.payment_date
FROM subscriptions s
JOIN payments p ON s.subscription_id = p.subscription_id
WHERE p.status = 'successful'
),
customer_totals AS (
SELECT
customer_id,
SUM(amount) AS total_paid,
COUNT(*) AS payment_count,
AVG(amount) AS avg_payment_amount
FROM successful_payments
GROUP BY customer_id
),
overall_avg AS (
SELECT AVG(total_paid) AS avg_customer_value
FROM customer_totals
)
SELECT
ct.customer_id,
ct.total_paid,
ct.payment_count,
ct.avg_payment_amount
FROM customer_totals ct
CROSS JOIN overall_avg oa
WHERE ct.total_paid > oa.avg_customer_value
ORDER BY ct.total_paid DESC;
-- This might return no results if any product_id in orders is NULL
SELECT product_name
FROM products
WHERE product_id NOT IN (SELECT product_id FROM orders);
-- Fix: Handle NULLs explicitly
SELECT product_name
FROM products
WHERE product_id NOT IN (
SELECT product_id
FROM orders
WHERE product_id IS NOT NULL
);
-- Slow and hard to read
SELECT
product_name,
(SELECT COUNT(*) FROM orders WHERE product_id = p.product_id) AS order_count
FROM products p
WHERE (SELECT COUNT(*) FROM orders WHERE product_id = p.product_id) > 5;
-- Better with CTE
WITH product_order_counts AS (
SELECT
product_id,
COUNT(*) AS order_count
FROM orders
GROUP BY product_id
)
SELECT p.product_name, poc.order_count
FROM products p
JOIN product_order_counts poc ON p.product_id = poc.product_id
WHERE poc.order_count > 5;
-- This will cause errors if column names conflict
WITH sales_data AS (
SELECT customer_id, SUM(amount) -- Missing alias!
FROM orders
GROUP BY customer_id
)
SELECT * FROM sales_data; -- Which column is the sum?
-- Fix: Always alias calculated columns
WITH sales_data AS (
SELECT
customer_id,
SUM(amount) AS total_amount
FROM orders
GROUP BY customer_id
)
SELECT * FROM sales_data;
If your CTEs or subqueries are running slowly:
Check if your database optimizes CTEs: Some databases treat CTEs as materialized (stored temporarily), while others optimize them like views. This affects performance differently.
Consider indexes: Make sure columns used in WHERE clauses and JOINs are indexed.
Break down complex CTEs: If a CTE is doing too much work, split it into multiple steps.
Use EXPLAIN: Most databases have an EXPLAIN command that shows how your query will execute:
EXPLAIN WITH my_cte AS (...) SELECT ...
You've now mastered two of SQL's most powerful features for handling complex analytical questions. Subqueries let you embed queries within queries for straightforward multi-step logic, while CTEs provide a clean, readable way to break down complex problems into manageable pieces.
Key takeaways:
What to explore next:
Practice these concepts with your own datasets. Start with simple subqueries, then graduate to multi-step CTEs as you encounter more complex analytical challenges. The combination of clear thinking about your data problems and these powerful SQL tools will make you significantly more effective at extracting insights from data.
Learning Path: SQL Fundamentals