
Picture this: you're analyzing customer behavior for an e-commerce platform with 50 million transactions. Marketing needs to know which product categories generate the most revenue by geographic region, but only for regions where average order values exceed $150. Finance wants monthly cohort analysis showing customer lifetime value trends. Operations needs real-time inventory alerts when stock levels fall below dynamic thresholds based on seasonal sales patterns.
Each of these scenarios requires sophisticated aggregation logic that goes far beyond basic SUM and COUNT operations. While many data professionals can write simple aggregate queries, mastering the nuanced interplay between GROUP BY, HAVING, and various aggregate functions—along with their performance implications at scale—separates competent analysts from true SQL experts.
This lesson will transform your understanding of SQL aggregation from basic summarization to advanced analytical patterns. You'll learn to design queries that handle complex business logic, optimize performance for large datasets, and avoid the subtle pitfalls that plague production systems.
What you'll learn: advanced grouping with ROLLUP, CUBE, and GROUPING SETS; statistical and conditional aggregation; complex HAVING-based filtering; and performance techniques including indexing strategies, materialized views, and incremental aggregation.
This lesson assumes you're comfortable with basic SELECT statements, joins, and the core aggregate functions (COUNT, SUM, AVG, MIN, MAX).
Before diving into advanced patterns, let's establish how database engines actually process aggregate queries. This understanding is crucial for writing efficient code and diagnosing performance issues.
When you execute an aggregate query, the database engine follows a specific logical execution sequence: FROM and JOIN assemble the row set, WHERE filters individual rows, GROUP BY partitions the survivors into groups, aggregate functions are evaluated per group, HAVING filters the resulting groups, SELECT projects the output columns, and ORDER BY sorts the final result.
This sequence has profound implications for query performance and correctness. Consider this poorly optimized query:
-- Inefficient: filtering after aggregation
SELECT
customer_region,
COUNT(*) as total_orders,
AVG(order_amount) as avg_order_value
FROM orders
GROUP BY customer_region
HAVING customer_region IN ('North America', 'Europe');
Versus this optimized version:
-- Efficient: filtering before aggregation
SELECT
customer_region,
COUNT(*) as total_orders,
AVG(order_amount) as avg_order_value
FROM orders
WHERE customer_region IN ('North America', 'Europe')
GROUP BY customer_region;
The second query processes significantly fewer records during the expensive grouping phase, potentially improving performance by orders of magnitude on large datasets.
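The equivalence of the two queries, and the fact that only the amount of work differs, can be verified directly. The following sketch uses Python's sqlite3 module with a hypothetical miniature orders table; table and values are invented for illustration.

```python
import sqlite3

# Hypothetical miniature orders table to show that filtering with WHERE
# before grouping returns the same result as filtering with HAVING after.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_region TEXT, order_amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("North America", 120.0), ("North America", 80.0),
     ("Europe", 200.0), ("Asia", 50.0)],
)

having_version = conn.execute("""
    SELECT customer_region, COUNT(*), AVG(order_amount)
    FROM orders
    GROUP BY customer_region
    HAVING customer_region IN ('North America', 'Europe')
    ORDER BY customer_region
""").fetchall()

where_version = conn.execute("""
    SELECT customer_region, COUNT(*), AVG(order_amount)
    FROM orders
    WHERE customer_region IN ('North America', 'Europe')
    GROUP BY customer_region
    ORDER BY customer_region
""").fetchall()

# Identical output; the WHERE version simply groups fewer rows.
print(where_version == having_version)  # True
```

On a four-row table the difference is invisible, but the WHERE version never builds a group for 'Asia' at all, which is exactly the savings that scales on large tables.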
Standard GROUP BY creates flat groupings, but business analytics often requires hierarchical or multi-dimensional analysis. SQL provides several extensions to handle these scenarios elegantly.
ROLLUP creates hierarchical subtotals, rolling up from the most granular level to increasingly higher levels:
-- Hierarchical sales analysis: product -> category -> total
SELECT
product_category,
product_subcategory,
product_name,
SUM(revenue) as total_revenue,
COUNT(DISTINCT order_id) as unique_orders,
AVG(unit_price) as avg_unit_price
FROM sales_fact sf
JOIN product_dim pd ON sf.product_id = pd.product_id
WHERE order_date >= '2023-01-01'
GROUP BY ROLLUP(product_category, product_subcategory, product_name)
ORDER BY product_category, product_subcategory, product_name;
This single query generates multiple aggregation levels: detail rows for each (category, subcategory, product) combination, subtotals for each (category, subcategory) pair, subtotals for each category, and a grand total row, with NULLs marking the rolled-up columns.
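A database without ROLLUP support can produce the same result set with a UNION ALL of separate GROUP BY levels, which also makes the semantics concrete. This sketch emulates a two-level rollup in SQLite (which lacks ROLLUP); the sales table and values are hypothetical.

```python
import sqlite3

# Emulate ROLLUP(product_category, product_name) with UNION ALL:
# detail rows + per-category subtotals + a grand total.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (product_category TEXT, product_name TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("Books", "Atlas", 30.0), ("Books", "Novel", 20.0), ("Toys", "Kite", 50.0),
])

rows = conn.execute("""
    SELECT product_category, product_name, SUM(revenue) AS total
    FROM sales GROUP BY product_category, product_name
    UNION ALL
    SELECT product_category, NULL, SUM(revenue)
    FROM sales GROUP BY product_category
    UNION ALL
    SELECT NULL, NULL, SUM(revenue) FROM sales
    ORDER BY product_category, product_name
""").fetchall()

for row in rows:
    print(row)  # NULL columns mark the rolled-up levels
```

ROLLUP does the same thing in one pass over the base data, which is why it typically beats the UNION ALL formulation.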
CUBE generates all possible combinations of the specified grouping columns:
-- Multi-dimensional customer analysis
SELECT
customer_segment,
geographic_region,
acquisition_channel,
COUNT(DISTINCT customer_id) as customer_count,
SUM(lifetime_value) as total_ltv,
AVG(lifetime_value) as avg_ltv
FROM customer_analytics
WHERE signup_date >= '2023-01-01'
GROUP BY CUBE(customer_segment, geographic_region, acquisition_channel)
ORDER BY customer_segment, geographic_region, acquisition_channel;
CUBE generates subtotals for every possible combination: by segment only, by region only, by channel only, by segment and region, by segment and channel, by region and channel, by all three dimensions, and a grand total.
When you need specific aggregation combinations without the full CUBE overhead, GROUPING SETS provides precise control:
-- Custom aggregation combinations for executive dashboard
SELECT
COALESCE(sales_rep_name, 'ALL REPS') as sales_rep,
COALESCE(product_category, 'ALL CATEGORIES') as category,
COALESCE(quarter, 'ALL QUARTERS') as quarter,
SUM(revenue) as total_revenue,
COUNT(DISTINCT customer_id) as unique_customers
FROM sales_performance
WHERE year = 2023
GROUP BY GROUPING SETS (
(sales_rep_name), -- By sales rep
(product_category), -- By category
(quarter), -- By quarter
(sales_rep_name, product_category), -- Rep-category combinations
() -- Grand total
)
ORDER BY sales_rep, category, quarter;
Performance Tip: GROUPING SETS is often more efficient than UNION ALL of multiple GROUP BY queries because it processes the base data only once. Note that COALESCE-based labels cannot distinguish a subtotal row from a genuinely NULL grouping value; use the GROUPING() function when that distinction matters.
Beyond basic COUNT, SUM, and AVG, modern SQL databases provide statistical functions crucial for advanced analytics:
-- Comprehensive statistical analysis of customer orders
SELECT
customer_segment,
COUNT(*) as order_count,
SUM(order_amount) as total_revenue,
AVG(order_amount) as mean_order_value,
-- Robust central tendency measures
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY order_amount) as median_order_value,
MODE() WITHIN GROUP (ORDER BY payment_method) as most_common_payment,
-- Variability measures
STDDEV(order_amount) as amount_std_dev,
VAR_POP(order_amount) as amount_variance,
-- Distribution analysis
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY order_amount) as q1_order_value,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY order_amount) as q3_order_value,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY order_amount) as p95_order_value,
-- Range analysis
MIN(order_amount) as min_order,
MAX(order_amount) as max_order,
MAX(order_amount) - MIN(order_amount) as order_range
FROM customer_orders
WHERE order_date >= CURRENT_DATE - INTERVAL '12 months'
GROUP BY customer_segment
HAVING COUNT(*) >= 100 -- Ensure statistical significance
ORDER BY total_revenue DESC;
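Not every engine ships PERCENTILE_CONT, but what it computes, linear interpolation between the sorted values, is easy to pin down. This sketch reproduces the median and quartile logic with Python's statistics module on a hypothetical list of order amounts; the "inclusive" method matches PERCENTILE_CONT's interpolation.

```python
import statistics

# Hypothetical order amounts for one customer segment, already sorted.
order_amounts = [40.0, 55.0, 60.0, 80.0, 100.0, 120.0, 150.0, 400.0]

# Equivalent of PERCENTILE_CONT(0.5 / 0.25 / 0.75) WITHIN GROUP (ORDER BY ...)
median = statistics.median(order_amounts)
q1, _, q3 = statistics.quantiles(order_amounts, n=4, method="inclusive")
mean = statistics.fmean(order_amounts)

# The outlier (400.0) drags the mean far above the median, which is
# why robust measures matter for skewed revenue distributions.
print(median, q1, q3, mean)
```

Here the mean (125.625) sits well above the median (90.0), so reporting only AVG would overstate a typical order.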
Often you need to aggregate different subsets of data within the same result row. Conditional aggregation using CASE statements within aggregate functions provides elegant solutions:
-- Revenue analysis with conditional breakdowns
SELECT
product_category,
-- Total metrics
COUNT(*) as total_orders,
SUM(order_amount) as total_revenue,
-- Channel-specific metrics
COUNT(CASE WHEN channel = 'online' THEN 1 END) as online_orders,
COUNT(CASE WHEN channel = 'retail' THEN 1 END) as retail_orders,
COUNT(CASE WHEN channel = 'mobile' THEN 1 END) as mobile_orders,
SUM(CASE WHEN channel = 'online' THEN order_amount END) as online_revenue,
SUM(CASE WHEN channel = 'retail' THEN order_amount END) as retail_revenue,
SUM(CASE WHEN channel = 'mobile' THEN order_amount END) as mobile_revenue,
-- Performance metrics by channel
AVG(CASE WHEN channel = 'online' THEN order_amount END) as avg_online_order,
AVG(CASE WHEN channel = 'retail' THEN order_amount END) as avg_retail_order,
AVG(CASE WHEN channel = 'mobile' THEN order_amount END) as avg_mobile_order,
-- Conversion metrics
COUNT(CASE WHEN order_amount > 100 THEN 1 END) as high_value_orders,
COUNT(CASE WHEN order_amount > 100 THEN 1 END) * 100.0 / COUNT(*) as high_value_rate
FROM order_analytics
WHERE order_date >= '2023-01-01'
GROUP BY product_category
ORDER BY total_revenue DESC;
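The CASE-inside-aggregate pattern is portable enough to demonstrate in SQLite. A minimal sketch with a hypothetical order_analytics table:

```python
import sqlite3

# Conditional aggregation: CASE inside the aggregate lets one pass over
# the data produce per-channel columns.  Table and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_analytics (channel TEXT, order_amount REAL)")
conn.executemany("INSERT INTO order_analytics VALUES (?, ?)", [
    ("online", 120.0), ("online", 60.0), ("retail", 200.0), ("mobile", 90.0),
])

row = conn.execute("""
    SELECT
        COUNT(*) AS total_orders,
        COUNT(CASE WHEN channel = 'online' THEN 1 END) AS online_orders,
        SUM(CASE WHEN channel = 'online' THEN order_amount END) AS online_revenue,
        COUNT(CASE WHEN order_amount > 100 THEN 1 END) * 100.0 / COUNT(*)
            AS high_value_rate
    FROM order_analytics
""").fetchone()

print(row)  # (4, 2, 180.0, 50.0)
```

Because CASE with no ELSE yields NULL, the COUNT and SUM branches silently skip non-matching rows, which is the whole trick.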
String aggregation functions allow you to concatenate values within groups, useful for creating comma-separated lists or JSON-like structures:
-- Customer preference analysis with aggregated product lists
SELECT
customer_segment,
COUNT(DISTINCT customer_id) as customer_count,
-- PostgreSQL: with DISTINCT, the ORDER BY expression must match the aggregated argument
STRING_AGG(DISTINCT product_name, ', ' ORDER BY product_name) as purchased_products,
-- SQL Server alternative (its STRING_AGG does not support DISTINCT)
-- STRING_AGG(product_name, ', ') WITHIN GROUP (ORDER BY purchase_frequency DESC) as top_products,
ARRAY_AGG(DISTINCT category ORDER BY category) as purchased_categories,
-- JSON aggregation (PostgreSQL)
JSON_AGG(
JSON_BUILD_OBJECT(
'product', product_name,
'frequency', purchase_frequency,
'avg_rating', avg_rating
) ORDER BY purchase_frequency DESC
) as product_details
FROM customer_product_analytics
WHERE last_purchase_date >= CURRENT_DATE - INTERVAL '6 months'
GROUP BY customer_segment
HAVING COUNT(DISTINCT customer_id) >= 50
ORDER BY customer_count DESC;
The HAVING clause operates on grouped results, enabling complex post-aggregation filtering that's impossible with WHERE alone.
-- Identify unusual sales patterns requiring investigation
SELECT
sales_territory,
salesperson_name,
COUNT(*) as deal_count,
SUM(deal_amount) as total_revenue,
AVG(deal_amount) as avg_deal_size,
STDDEV(deal_amount) as deal_size_variation
FROM sales_deals
WHERE deal_date >= CURRENT_DATE - INTERVAL '3 months'
GROUP BY sales_territory, salesperson_name
HAVING
-- Minimum activity threshold
COUNT(*) >= 10
-- Flag territories with highly variable deal sizes
AND STDDEV(deal_amount) > AVG(deal_amount) * 0.5
-- Identify potential outlier performance
AND (
AVG(deal_amount) > (
SELECT AVG(deal_amount) * 1.5
FROM sales_deals
WHERE deal_date >= CURRENT_DATE - INTERVAL '3 months'
)
OR AVG(deal_amount) < (
SELECT AVG(deal_amount) * 0.5
FROM sales_deals
WHERE deal_date >= CURRENT_DATE - INTERVAL '3 months'
)
)
ORDER BY deal_size_variation DESC;
-- Sophisticated customer segmentation based on multiple criteria
SELECT
customer_id,
customer_name,
COUNT(DISTINCT order_id) as total_orders,
SUM(order_amount) as total_spent,
AVG(order_amount) as avg_order_value,
MAX(order_date) as last_order_date,
MIN(order_date) as first_order_date,
-- Customer lifecycle calculation (in PostgreSQL, subtracting two dates yields an integer day count)
(MAX(order_date) - MIN(order_date)) as customer_lifespan_days,
-- Purchase frequency (multiply by 30.0 first to avoid integer division)
COUNT(DISTINCT order_id) * 30.0 / NULLIF(MAX(order_date) - MIN(order_date), 0) as orders_per_month
FROM customer_orders
WHERE order_date >= '2022-01-01'
GROUP BY customer_id, customer_name
HAVING
-- High-value customers
SUM(order_amount) >= 10000
-- Regular purchasers (not one-time buyers)
AND COUNT(DISTINCT order_id) >= 5
-- Recent activity
AND MAX(order_date) >= CURRENT_DATE - INTERVAL '60 days'
-- Consistent purchasing pattern
AND COUNT(DISTINCT order_id) * 30.0 / NULLIF(MAX(order_date) - MIN(order_date), 0) >= 2
-- Diverse purchasing (multiple product categories)
AND COUNT(DISTINCT product_category) >= 3
ORDER BY total_spent DESC, orders_per_month DESC;
Aggregate query performance heavily depends on proper indexing. The optimal index structure varies based on your specific GROUP BY and filtering patterns:
-- For this common pattern:
SELECT
customer_segment,
product_category,
COUNT(*),
SUM(revenue)
FROM sales
WHERE order_date >= '2023-01-01'
AND region = 'North America'
GROUP BY customer_segment, product_category;
-- Optimal index: equality filter first, then the range filter, then GROUP BY and aggregated columns:
CREATE INDEX idx_sales_optimized
ON sales (region, order_date, customer_segment, product_category, revenue);
The index column order matters significantly: place equality predicates (region) before range predicates (order_date) so the matching rows form one contiguous index range; the GROUP BY columns and the aggregated column (revenue) come last, making the index covering so the query never needs to touch the base table.
For frequently-executed aggregate queries on large datasets, pre-computed aggregations can dramatically improve performance:
-- Create materialized view for daily sales summaries
CREATE MATERIALIZED VIEW daily_sales_summary AS
SELECT
order_date,
product_category,
customer_segment,
sales_channel,
COUNT(*) as order_count,
COUNT(DISTINCT customer_id) as unique_customers,
SUM(order_amount) as total_revenue,
AVG(order_amount) as avg_order_value,
SUM(profit_margin) as total_profit
FROM detailed_sales_fact
GROUP BY order_date, product_category, customer_segment, sales_channel;
-- Create indexes on the materialized view
CREATE INDEX idx_daily_summary_date_category
ON daily_sales_summary (order_date, product_category);
-- Now complex queries run much faster:
SELECT
product_category,
SUM(total_revenue) as category_revenue,
AVG(avg_order_value) as weighted_avg_order_value,
SUM(unique_customers) as total_unique_customers
FROM daily_sales_summary
WHERE order_date >= '2023-01-01'
GROUP BY product_category
ORDER BY category_revenue DESC;
For real-time analytics on streaming data, incremental aggregation maintains running totals without recomputing the entire dataset:
-- Upsert pattern for maintaining running aggregates
INSERT INTO customer_lifetime_stats (
customer_id,
total_orders,
total_spent,
avg_order_value,
last_order_date,
last_updated
)
SELECT
customer_id,
COUNT(*),
SUM(order_amount),
AVG(order_amount),
MAX(order_date),
CURRENT_TIMESTAMP
FROM new_orders
WHERE processed_timestamp > (
SELECT COALESCE(MAX(last_updated), '1900-01-01'::timestamp)
FROM customer_lifetime_stats
)
GROUP BY customer_id
ON CONFLICT (customer_id) DO UPDATE SET
total_orders = customer_lifetime_stats.total_orders + EXCLUDED.total_orders,
total_spent = customer_lifetime_stats.total_spent + EXCLUDED.total_spent,
avg_order_value = (customer_lifetime_stats.total_spent + EXCLUDED.total_spent) /
(customer_lifetime_stats.total_orders + EXCLUDED.total_orders),
last_order_date = GREATEST(customer_lifetime_stats.last_order_date, EXCLUDED.last_order_date),
last_updated = EXCLUDED.last_updated;
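SQLite shares PostgreSQL's ON CONFLICT ... DO UPDATE / excluded syntax, so the core of the upsert pattern can be sketched and run locally. The stats table and batch data below are hypothetical, and for brevity this version upserts one order at a time rather than pre-aggregating each batch.

```python
import sqlite3

# Incremental aggregation via upsert: running totals are adjusted per
# arriving order instead of recomputing from the full history.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE stats (
    customer_id INTEGER PRIMARY KEY,
    total_orders INTEGER,
    total_spent REAL,
    avg_order_value REAL)""")

def apply_batch(batch):
    # batch: list of (customer_id, order_amount) from newly arrived orders
    conn.executemany("""
        INSERT INTO stats (customer_id, total_orders, total_spent, avg_order_value)
        VALUES (?, 1, ?, ?)
        ON CONFLICT (customer_id) DO UPDATE SET
            total_orders = stats.total_orders + 1,
            total_spent = stats.total_spent + excluded.total_spent,
            avg_order_value = (stats.total_spent + excluded.total_spent)
                              / (stats.total_orders + 1)
    """, [(cid, amt, amt) for cid, amt in batch])

apply_batch([(1, 100.0), (2, 50.0)])
apply_batch([(1, 200.0)])          # second batch only touches customer 1

row = conn.execute("SELECT * FROM stats WHERE customer_id = 1").fetchone()
print(row)  # (1, 2, 300.0, 150.0)
```

Note how avg_order_value is rederived from the updated totals rather than averaged with the old average, sidestepping the average-of-averages pitfall discussed later.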
Combining aggregate functions with window functions enables sophisticated analytical patterns that answer complex business questions:
-- Advanced customer cohort analysis combining aggregation and windowing
WITH monthly_cohorts AS (
SELECT
customer_id,
DATE_TRUNC('month', first_purchase_date) as cohort_month,
DATE_TRUNC('month', order_date) as order_month,
order_amount
FROM customer_orders co
JOIN (
SELECT
customer_id,
MIN(order_date) as first_purchase_date
FROM customer_orders
GROUP BY customer_id
) first_orders USING (customer_id)
WHERE order_date >= '2022-01-01'
),
cohort_data AS (
SELECT
cohort_month,
order_month,
COUNT(DISTINCT customer_id) as active_customers,
SUM(order_amount) as monthly_revenue,
AVG(order_amount) as avg_order_value
FROM monthly_cohorts
GROUP BY cohort_month, order_month
),
cohort_analysis AS (
SELECT
cohort_month,
order_month,
active_customers,
monthly_revenue,
avg_order_value,
-- Window functions for cohort analysis
FIRST_VALUE(active_customers) OVER (
PARTITION BY cohort_month
ORDER BY order_month
) as initial_cohort_size,
active_customers * 100.0 / FIRST_VALUE(active_customers) OVER (
PARTITION BY cohort_month
ORDER BY order_month
) as retention_rate,
SUM(monthly_revenue) OVER (
PARTITION BY cohort_month
ORDER BY order_month
) as cumulative_revenue,
-- Period-over-period analysis
LAG(active_customers) OVER (
PARTITION BY cohort_month
ORDER BY order_month
) as prev_month_customers,
active_customers - LAG(active_customers) OVER (
PARTITION BY cohort_month
ORDER BY order_month
) as customer_change
FROM cohort_data
)
SELECT
cohort_month,
order_month,
-- AGE's month component wraps at 12, so fold in the year component
EXTRACT(YEAR FROM AGE(order_month, cohort_month)) * 12
+ EXTRACT(MONTH FROM AGE(order_month, cohort_month)) as months_since_acquisition,
initial_cohort_size,
active_customers,
retention_rate,
monthly_revenue,
cumulative_revenue,
avg_order_value,
customer_change
FROM cohort_analysis
WHERE cohort_month >= '2023-01-01'
ORDER BY cohort_month, order_month;
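The retention-rate step at the heart of this query, FIRST_VALUE pinning each cohort's initial size, also runs in SQLite (3.25+ has window functions). A minimal sketch with hypothetical cohort_data rows:

```python
import sqlite3

# FIRST_VALUE over the cohort partition pins the initial cohort size so
# later months can be expressed as a percentage of it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cohort_data "
             "(cohort_month TEXT, order_month TEXT, active_customers INTEGER)")
conn.executemany("INSERT INTO cohort_data VALUES (?, ?, ?)", [
    ("2023-01", "2023-01", 200),
    ("2023-01", "2023-02", 120),
    ("2023-01", "2023-03", 90),
])

rows = conn.execute("""
    SELECT order_month,
           active_customers * 100.0 / FIRST_VALUE(active_customers) OVER (
               PARTITION BY cohort_month ORDER BY order_month
           ) AS retention_rate
    FROM cohort_data
    ORDER BY order_month
""").fetchall()

print(rows)  # first month is 100%, later months decay
```

The cohort's first row is its own FIRST_VALUE, so month zero always reads 100.0 and subsequent months show the decay.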
The FILTER clause (available in PostgreSQL and some other databases) provides cleaner conditional aggregation than CASE statements:
-- Product performance analysis with FILTER clause
SELECT
product_category,
-- Total metrics
COUNT(*) as total_sales,
SUM(sale_amount) as total_revenue,
-- Channel-specific metrics using FILTER
COUNT(*) FILTER (WHERE channel = 'online') as online_sales,
COUNT(*) FILTER (WHERE channel = 'retail') as retail_sales,
COUNT(*) FILTER (WHERE channel = 'mobile') as mobile_sales,
SUM(sale_amount) FILTER (WHERE channel = 'online') as online_revenue,
SUM(sale_amount) FILTER (WHERE channel = 'retail') as retail_revenue,
SUM(sale_amount) FILTER (WHERE channel = 'mobile') as mobile_revenue,
-- Performance metrics
AVG(sale_amount) FILTER (WHERE customer_type = 'premium') as premium_avg_sale,
AVG(sale_amount) FILTER (WHERE customer_type = 'standard') as standard_avg_sale,
-- Conversion metrics
COUNT(*) FILTER (WHERE discount_applied > 0) as discounted_sales,
COUNT(*) FILTER (WHERE sale_amount > 500) as high_value_sales,
-- Quality metrics
COUNT(*) FILTER (WHERE return_flag = true) as returned_items,
COUNT(*) FILTER (WHERE return_flag = true) * 100.0 / COUNT(*) as return_rate
FROM product_sales
WHERE sale_date >= CURRENT_DATE - INTERVAL '6 months'
GROUP BY product_category
HAVING COUNT(*) >= 100 -- Ensure statistical significance
ORDER BY total_revenue DESC;
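SQLite also supports FILTER on aggregates (since 3.30), so the per-channel breakdown can be sketched locally. The product_sales table and values here are hypothetical.

```python
import sqlite3

# FILTER attaches a WHERE condition to a single aggregate, replacing the
# CASE-inside-aggregate idiom with clearer syntax.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_sales (channel TEXT, sale_amount REAL)")
conn.executemany("INSERT INTO product_sales VALUES (?, ?)", [
    ("online", 600.0), ("online", 80.0), ("retail", 300.0),
])

row = conn.execute("""
    SELECT
        COUNT(*) FILTER (WHERE channel = 'online') AS online_sales,
        SUM(sale_amount) FILTER (WHERE channel = 'online') AS online_revenue,
        COUNT(*) FILTER (WHERE sale_amount > 500) AS high_value_sales
    FROM product_sales
""").fetchone()

print(row)  # (2, 680.0, 1)
```

Each FILTER is evaluated independently per aggregate, so one pass over the table still feeds every conditional column.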
Sometimes business requirements demand aggregation at multiple levels within the same query:
-- Territory performance with multiple aggregation levels
WITH sales_hierarchy AS (
SELECT
region,
territory,
salesperson,
customer_id,
sale_amount,
sale_date
FROM sales_data
WHERE sale_date >= CURRENT_DATE - INTERVAL '1 year'
),
-- Individual salesperson performance
salesperson_metrics AS (
SELECT
region,
territory,
salesperson,
COUNT(DISTINCT customer_id) as unique_customers,
COUNT(*) as total_sales,
SUM(sale_amount) as total_revenue,
AVG(sale_amount) as avg_sale_amount
FROM sales_hierarchy
GROUP BY region, territory, salesperson
),
-- Territory-level aggregations
territory_metrics AS (
SELECT
region,
territory,
COUNT(DISTINCT salesperson) as salesperson_count,
SUM(unique_customers) as territory_customers, -- may double-count customers shared across salespeople
SUM(total_sales) as territory_sales,
SUM(total_revenue) as territory_revenue,
AVG(avg_sale_amount) as territory_avg_sale -- unweighted average of averages; weight by sale counts for a true mean
FROM salesperson_metrics
GROUP BY region, territory
),
-- Regional aggregations
region_metrics AS (
SELECT
region,
COUNT(DISTINCT territory) as territory_count,
SUM(salesperson_count) as region_salespeople,
SUM(territory_customers) as region_customers,
SUM(territory_sales) as region_sales,
SUM(territory_revenue) as region_revenue,
AVG(territory_avg_sale) as region_avg_sale
FROM territory_metrics
GROUP BY region
)
-- Final combined analysis
SELECT
sm.region,
sm.territory,
sm.salesperson,
-- Individual performance
sm.unique_customers,
sm.total_sales,
sm.total_revenue,
sm.avg_sale_amount,
-- Territory context
tm.territory_customers,
sm.unique_customers * 100.0 / tm.territory_customers as customer_share_pct,
sm.total_revenue * 100.0 / tm.territory_revenue as revenue_share_pct,
-- Regional context
rm.region_customers,
sm.unique_customers * 100.0 / rm.region_customers as regional_customer_share_pct,
-- Performance rankings
RANK() OVER (PARTITION BY sm.territory ORDER BY sm.total_revenue DESC) as territory_rank,
RANK() OVER (PARTITION BY sm.region ORDER BY sm.total_revenue DESC) as region_rank,
RANK() OVER (ORDER BY sm.total_revenue DESC) as company_rank
FROM salesperson_metrics sm
JOIN territory_metrics tm ON sm.region = tm.region AND sm.territory = tm.territory
JOIN region_metrics rm ON sm.region = rm.region
WHERE sm.total_revenue > 0
ORDER BY sm.region, sm.territory, sm.total_revenue DESC;
Let's apply these concepts to a comprehensive real-world scenario. You're analyzing an e-commerce platform's customer behavior and need to generate insights for multiple stakeholders.
Scenario Setup:
-- Create sample tables (adapt to your database system)
CREATE TABLE customers (
customer_id INTEGER,
customer_name VARCHAR(100),
signup_date DATE,
customer_segment VARCHAR(20),
geographic_region VARCHAR(30),
acquisition_channel VARCHAR(20)
);
CREATE TABLE orders (
order_id INTEGER,
customer_id INTEGER,
order_date DATE,
order_amount DECIMAL(10,2),
product_category VARCHAR(30),
sales_channel VARCHAR(20),
payment_method VARCHAR(20),
discount_amount DECIMAL(8,2)
);
CREATE TABLE order_items (
order_id INTEGER,
product_id INTEGER,
product_name VARCHAR(100),
quantity INTEGER,
unit_price DECIMAL(8,2),
product_category VARCHAR(30)
);
Exercise Challenge: Write a comprehensive analysis query that provides customer segmentation metrics by segment, region, and acquisition channel; lifetime-value statistics (including medians and standard deviations); repeat-purchase behavior; and cross-sell opportunities between product categories.
Try to solve this yourself before looking at the solution:
-- Comprehensive e-commerce analytics solution
WITH customer_metrics AS (
SELECT
c.customer_id,
c.customer_segment,
c.geographic_region,
c.acquisition_channel,
c.signup_date,
-- Order metrics
COUNT(o.order_id) as total_orders,
SUM(o.order_amount) as lifetime_value,
AVG(o.order_amount) as avg_order_value,
MIN(o.order_date) as first_order_date,
MAX(o.order_date) as last_order_date,
-- Product diversity
COUNT(DISTINCT oi.product_category) as categories_purchased,
COUNT(DISTINCT oi.product_id) as unique_products,
-- Behavioral patterns
AVG(o.discount_amount) as avg_discount_used,
COUNT(CASE WHEN o.discount_amount > 0 THEN 1 END) as discounted_orders,
-- Frequency analysis
(MAX(o.order_date) - MIN(o.order_date)) as customer_lifespan_days
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
LEFT JOIN order_items oi ON o.order_id = oi.order_id
WHERE c.signup_date >= '2022-01-01'
GROUP BY c.customer_id, c.customer_segment, c.geographic_region,
c.acquisition_channel, c.signup_date
),
segment_analysis AS (
SELECT
customer_segment,
acquisition_channel,
geographic_region,
-- Customer counts and percentages
COUNT(*) as customer_count,
COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () as segment_percentage,
-- Financial metrics
SUM(lifetime_value) as total_segment_revenue,
AVG(lifetime_value) as avg_customer_ltv,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY lifetime_value) as median_ltv,
-- Engagement metrics
AVG(total_orders) as avg_orders_per_customer,
AVG(categories_purchased) as avg_categories_per_customer,
AVG(unique_products) as avg_products_per_customer,
-- Behavioral insights
AVG(avg_order_value) as segment_avg_order_value,
AVG(CASE WHEN total_orders > 1 THEN customer_lifespan_days END) as avg_customer_lifespan,
COUNT(CASE WHEN total_orders > 1 THEN 1 END) * 100.0 / COUNT(*) as repeat_customer_rate,
-- Statistical measures
STDDEV(lifetime_value) as ltv_std_deviation,
VAR_POP(lifetime_value) as ltv_variance
FROM customer_metrics
GROUP BY customer_segment, acquisition_channel, geographic_region
HAVING COUNT(*) >= 20 -- Ensure statistical significance
),
category_cross_sell AS (
SELECT
primary_category,
secondary_category,
COUNT(DISTINCT customer_id) as customers_buying_both,
AVG(combined_spend) as avg_combined_spend
FROM (
SELECT
c.customer_id,
pc.category as primary_category,
sc.category as secondary_category,
pc.spend + sc.spend as combined_spend
FROM customers c
JOIN (
SELECT
o.customer_id,
oi.product_category as category,
-- Sum line items; order_amount would repeat once per item row after the join,
-- and product_category must be qualified since it exists in both tables
SUM(oi.quantity * oi.unit_price) as spend,
ROW_NUMBER() OVER (PARTITION BY o.customer_id ORDER BY SUM(oi.quantity * oi.unit_price) DESC) as rn
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.customer_id, oi.product_category
) pc ON c.customer_id = pc.customer_id AND pc.rn = 1
JOIN (
SELECT
o.customer_id,
oi.product_category as category,
SUM(oi.quantity * oi.unit_price) as spend,
ROW_NUMBER() OVER (PARTITION BY o.customer_id ORDER BY SUM(oi.quantity * oi.unit_price) DESC) as rn
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.customer_id, oi.product_category
) sc ON c.customer_id = sc.customer_id AND sc.rn = 2
WHERE pc.category != sc.category
) cross_sell_data
GROUP BY primary_category, secondary_category
HAVING COUNT(DISTINCT customer_id) >= 10
)
-- Final consolidated report
SELECT
'SEGMENT_ANALYSIS' as report_type,
customer_segment,
acquisition_channel,
geographic_region,
customer_count,
segment_percentage,
total_segment_revenue,
avg_customer_ltv,
median_ltv,
repeat_customer_rate,
ltv_std_deviation
FROM segment_analysis
WHERE total_segment_revenue > 10000
UNION ALL
SELECT
'CROSS_SELL_OPPORTUNITIES' as report_type,
primary_category,
secondary_category,
NULL as geographic_region,
customers_buying_both,
NULL as segment_percentage,
NULL as total_segment_revenue,
avg_combined_spend,
NULL as median_ltv,
NULL as repeat_customer_rate,
NULL as ltv_std_deviation
FROM category_cross_sell
WHERE customers_buying_both >= 15
ORDER BY report_type, total_segment_revenue DESC NULLS LAST, avg_combined_spend DESC NULLS LAST;
One of the most common mistakes is misunderstanding the difference between COUNT(*) and COUNT(column):
-- Common mistake: expecting these to be equal
SELECT
product_category,
COUNT(*) as total_rows, -- Counts all rows, including NULLs
COUNT(customer_rating) as ratings_count, -- Only counts non-NULL ratings
AVG(customer_rating) as avg_rating -- Averages only non-NULL ratings
FROM product_reviews
GROUP BY product_category;
-- Correct approach: be explicit about NULL handling
SELECT
product_category,
COUNT(*) as total_reviews,
COUNT(customer_rating) as rated_reviews,
COUNT(*) - COUNT(customer_rating) as unrated_reviews,
COUNT(customer_rating) * 100.0 / COUNT(*) as rating_completion_rate,
AVG(customer_rating) as avg_rating,
AVG(COALESCE(customer_rating, 0)) as avg_rating_with_zeros
FROM product_reviews
GROUP BY product_category;
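The NULL behavior is easy to verify hands-on. This sketch builds a hypothetical product_reviews table in SQLite where customer_rating is NULL for unrated reviews:

```python
import sqlite3

# COUNT(*) counts rows; COUNT(column) and AVG(column) skip NULLs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_reviews (customer_rating REAL)")
conn.executemany("INSERT INTO product_reviews VALUES (?)",
                 [(5.0,), (3.0,), (None,), (None,)])

row = conn.execute("""
    SELECT COUNT(*),                -- all rows, NULLs included
           COUNT(customer_rating),  -- non-NULL ratings only
           AVG(customer_rating)     -- NULLs ignored: (5 + 3) / 2
    FROM product_reviews
""").fetchone()

print(row)  # (4, 2, 4.0)
```

Half the rows carry no rating, yet AVG reports 4.0 as if the unrated reviews did not exist, which is exactly the silent behavior the queries above make explicit.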
Another frequent error is trying to aggregate already-aggregated values incorrectly:
-- Wrong: This doesn't give you the overall average
SELECT AVG(avg_order_value)
FROM (
SELECT customer_id, AVG(order_amount) as avg_order_value
FROM orders
GROUP BY customer_id
) customer_averages;
-- Correct: Weight by the number of orders per customer
SELECT
SUM(total_order_value) / SUM(order_count) as true_overall_average
FROM (
SELECT
customer_id,
COUNT(*) as order_count,
SUM(order_amount) as total_order_value
FROM orders
GROUP BY customer_id
) customer_totals;
-- Or simply calculate directly:
SELECT AVG(order_amount) as overall_average
FROM orders;
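The size of the distortion is easy to see on a toy example. This sketch contrasts the naive average-of-averages with the true overall average on a hypothetical orders table:

```python
import sqlite3

# A naive AVG over per-customer averages weights every customer equally,
# while the true overall average weights every order equally.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, order_amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    (1, 10.0), (1, 10.0), (1, 10.0), (1, 10.0),  # frequent small buyer
    (2, 100.0),                                   # one big order
])

naive = conn.execute("""
    SELECT AVG(a) FROM (
        SELECT AVG(order_amount) AS a FROM orders GROUP BY customer_id
    ) t
""").fetchone()[0]

true_avg = conn.execute("SELECT AVG(order_amount) FROM orders").fetchone()[0]

print(naive, true_avg)  # 55.0 vs 28.0
```

Four of the five orders are worth 10, so the true per-order average is 28.0, yet the naive calculation reports 55.0 because each customer counts once regardless of order volume.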
Anti-pattern 1: Filtering after aggregation when you could filter before
-- Inefficient: processes all data then filters
SELECT customer_segment, COUNT(*)
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY customer_segment
HAVING customer_segment = 'Premium';
-- Efficient: filters first
SELECT customer_segment, COUNT(*)
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE customer_segment = 'Premium'
GROUP BY customer_segment;
Anti-pattern 2: Using functions in GROUP BY that prevent index usage
-- Poor: function in GROUP BY prevents index usage
SELECT
UPPER(customer_segment),
COUNT(*)
FROM customers
GROUP BY UPPER(customer_segment);
-- Better: use computed column or functional index
-- Or normalize the data to avoid the function call
SELECT
customer_segment,
COUNT(*)
FROM customers
WHERE customer_segment IS NOT NULL
GROUP BY customer_segment;
Aggregate functions can produce unexpected results when dealing with NULLs and zero values:
-- Robust aggregation with proper NULL and zero handling
SELECT
product_category,
COUNT(*) as total_products,
COUNT(sale_price) as priced_products,
-- Safe division avoiding division by zero
CASE
WHEN COUNT(sale_price) > 0
THEN SUM(sale_price) / COUNT(sale_price)
ELSE NULL
END as avg_price_safe,
-- Using NULLIF for cleaner syntax
SUM(sale_price) / NULLIF(COUNT(sale_price), 0) as avg_price_nullif,
-- Percentage calculations with safe division
COUNT(sale_price) * 100.0 / NULLIF(COUNT(*), 0) as pricing_completion_rate,
-- Handling potential NULL in percentage calculation
COALESCE(
COUNT(CASE WHEN discount > 0 THEN 1 END) * 100.0 / NULLIF(COUNT(*), 0),
0
) as discount_rate
FROM products
GROUP BY product_category;
When working with datasets that have high cardinality GROUP BY columns, you may encounter memory pressure:
-- Problematic: may consume excessive memory with millions of unique customers
SELECT
customer_id,
COUNT(*) as order_count,
STRING_AGG(product_name, ', ') as all_products -- Can create huge strings
FROM customer_orders
GROUP BY customer_id;
-- Better: limit string aggregation and add filtering
SELECT
customer_id,
COUNT(*) as order_count,
LEFT(
STRING_AGG(product_name, ', ' ORDER BY order_date DESC),
500
) as recent_products -- Cap the aggregated string at 500 characters
FROM customer_orders
WHERE order_date >= CURRENT_DATE - INTERVAL '1 year'
GROUP BY customer_id
HAVING COUNT(*) >= 5 -- Filter out low-activity customers
LIMIT 10000; -- Limit result set size
Mastering SQL aggregate functions extends far beyond memorizing syntax—it requires understanding database internals, recognizing performance implications, and designing queries that scale with your data growth. The patterns covered in this lesson form the foundation for advanced analytics, but true expertise comes from applying these concepts to real-world scenarios with messy data and complex business requirements.
Key takeaways from this comprehensive exploration:
Architectural Understanding: The sequence of SQL operations (WHERE → GROUP BY → HAVING → ORDER BY) fundamentally impacts both performance and correctness. Always filter early when possible, and understand when post-aggregation filtering (HAVING) is truly necessary.
Advanced Grouping: ROLLUP, CUBE, and GROUPING SETS enable sophisticated multi-dimensional analysis that would otherwise require complex UNION operations. These tools are essential for building executive dashboards and analytical reports.
Statistical Rigor: Modern analytics demands more than basic averages. Percentiles, standard deviations, and robust statistical measures provide deeper insights and help identify outliers and trends that basic aggregations miss.
Performance Optimization: Proper indexing strategies, materialized views for frequently-accessed aggregations, and incremental aggregation patterns are crucial for production systems. The difference between a 30-second query and a 100-millisecond query often determines whether analytics are useful or ignored.
Integration Patterns: Combining aggregations with window functions, CTEs, and conditional logic enables sophisticated analytical patterns that answer complex business questions in single queries rather than requiring multiple application-level operations.
As your next steps, focus on:
Practice with Real Data: Apply these patterns to your organization's actual datasets. Start with simple aggregations and gradually add complexity as you identify business needs.
Performance Monitoring: Learn to read query execution plans for your database system. Understanding how the optimizer handles your GROUP BY operations will guide your indexing and query design decisions.
Advanced Analytics: Explore your database's specific analytical functions. PostgreSQL's statistical functions, SQL Server's analytical capabilities, and specialized analytical databases like Snowflake or BigQuery offer additional tools beyond standard SQL.
Streaming Aggregation: For real-time analytics, investigate your database's support for streaming aggregation, window aggregations, and incremental materialized view maintenance.
The patterns you've learned here will serve as building blocks for increasingly sophisticated analytical queries. The key is understanding not just what each function does, but when and how to combine them effectively for your specific analytical needs.