Wicked Smart Data
LearnArticlesAbout
Sign InSign Up
LearnArticlesAboutContact
Sign InSign Up
Wicked Smart Data

The go-to platform for professionals who want to master data, automation, and AI — from Excel fundamentals to cutting-edge machine learning.

Platform

  • Learning Paths
  • Articles
  • About
  • Contact

Connect

  • Contact Us
  • RSS Feed

© 2026 Wicked Smart Data. All rights reserved.

Privacy PolicyTerms of Service
All Articles

SQL for Data Analysis: Cohort Analysis, Funnels, and Retention - Complete Guide

SQL⚡ Practitioner17 min readMay 24, 2026Updated May 24, 2026
Table of Contents
  • Prerequisites
  • Understanding the Data Structure
  • Building Your First Cohort Analysis
  • Advanced Retention Analysis
  • Building Conversion Funnels
  • Combining Cohort and Funnel Analysis
  • Hands-On Exercise
  • Common Mistakes & Troubleshooting
  • Summary & Next Steps

SQL for Data Analysis: Cohort Analysis, Funnels, and Retention

You're reviewing your company's user metrics dashboard when your product manager walks over with a familiar look of confusion. "Our monthly active users are up 15%, but revenue per user is trending down. Are we attracting the wrong users, or are our existing users becoming less engaged?" This is exactly the kind of question that basic aggregation queries can't answer—you need to understand user behavior over time, track how different groups of users behave, and identify where users are dropping off in your product experience.

Cohort analysis, funnel analysis, and retention metrics are the powerhouse techniques that transform raw user event data into actionable insights about customer behavior. Instead of looking at snapshots of user activity, these approaches let you track how user behavior evolves, compare different user segments, and pinpoint exactly where your product is succeeding or failing.

By the end of this lesson, you'll be writing sophisticated SQL queries that segment users into meaningful cohorts, calculate retention rates that actually drive product decisions, and build conversion funnels that identify optimization opportunities. These aren't just theoretical concepts—they're the analytical tools that drive growth at data-driven companies.

What you'll learn:

  • Build cohort tables that segment users by acquisition period and track their behavior over time
  • Calculate retention rates using window functions and date arithmetic for actionable insights
  • Design conversion funnels that identify drop-off points in multi-step user journeys
  • Combine cohort and funnel analysis to understand how different user segments convert
  • Optimize these queries for production environments with large datasets

Prerequisites

You should be comfortable with intermediate SQL concepts including JOINs, subqueries, window functions (ROW_NUMBER(), LAG(), LEAD()), and date/time manipulation. We'll build on these foundations to create more sophisticated analytical queries.

Understanding the Data Structure

Before diving into analysis techniques, let's establish the typical data structure you'll encounter. Most cohort and funnel analyses work with event-based data where each row represents a user action:

-- Typical user events table structure
CREATE TABLE user_events (
    user_id INTEGER,
    event_name VARCHAR(50),
    event_timestamp TIMESTAMP,
    session_id VARCHAR(100),
    properties JSON
);

-- Sample data that mirrors real SaaS product usage
INSERT INTO user_events VALUES
(101, 'signup', '2024-01-15 10:30:00', 'sess_001', '{"source": "google", "plan": "free"}'),
(101, 'first_login', '2024-01-15 10:35:00', 'sess_001', '{"device": "desktop"}'),
(101, 'feature_used', '2024-01-15 11:00:00', 'sess_001', '{"feature": "dashboard"}'),
(101, 'feature_used', '2024-01-20 14:20:00', 'sess_002', '{"feature": "reports"}'),
(101, 'upgrade', '2024-02-10 09:15:00', 'sess_003', '{"from_plan": "free", "to_plan": "pro"}'),
(102, 'signup', '2024-01-16 15:45:00', 'sess_004', '{"source": "direct", "plan": "free"}'),
(102, 'first_login', '2024-01-18 08:20:00', 'sess_005', '{"device": "mobile"}'),
(103, 'signup', '2024-02-01 12:00:00', 'sess_006', '{"source": "facebook", "plan": "trial"}');

This structure captures the essential elements: who did what, when they did it, and contextual information about the action. Real production tables often include additional dimensions like geographic location, device type, or A/B test variants.

Building Your First Cohort Analysis

Cohort analysis groups users by a shared characteristic (usually their first interaction with your product) and tracks their behavior over time. This reveals patterns that aggregate metrics hide—like whether users who signed up in January behave differently than those who joined in March.

Let's start with a classic monthly retention cohort:

-- Step 1: Define user cohorts by signup month
WITH user_cohorts AS (
    SELECT 
        user_id,
        DATE_TRUNC('month', MIN(event_timestamp)) AS cohort_month
    FROM user_events 
    WHERE event_name = 'signup'
    GROUP BY user_id
),

-- Step 2: Find all user activity periods
user_periods AS (
    SELECT DISTINCT
        user_id,
        DATE_TRUNC('month', event_timestamp) AS activity_month
    FROM user_events 
    WHERE event_name IN ('login', 'feature_used', 'first_login')
),

-- Step 3: Combine cohort assignments with activity periods
cohort_data AS (
    SELECT 
        c.cohort_month,
        p.activity_month,
        c.user_id,
        EXTRACT(EPOCH FROM (p.activity_month - c.cohort_month)) / 2628000 AS month_number
    FROM user_cohorts c
    LEFT JOIN user_periods p ON c.user_id = p.user_id
)

-- Step 4: Calculate cohort sizes and retention rates
SELECT 
    cohort_month,
    month_number,
    COUNT(DISTINCT user_id) AS users_active,
    FIRST_VALUE(COUNT(DISTINCT user_id)) 
        OVER (PARTITION BY cohort_month ORDER BY month_number) AS cohort_size,
    ROUND(
        100.0 * COUNT(DISTINCT user_id) / 
        FIRST_VALUE(COUNT(DISTINCT user_id)) 
            OVER (PARTITION BY cohort_month ORDER BY month_number), 
        2
    ) AS retention_rate
FROM cohort_data 
WHERE activity_month IS NOT NULL
GROUP BY cohort_month, month_number
ORDER BY cohort_month, month_number;

This query creates a cohort table showing how users from each signup month behave over time. The month_number indicates periods after signup (0 = signup month, 1 = first month after signup, etc.).

Performance tip: For large datasets, consider creating indexes on user_id and event_timestamp. The multiple GROUP BY operations can be expensive, so test with realistic data volumes.

Let's enhance this with more sophisticated cohort analysis that considers user value:

-- Revenue cohort analysis - tracking how different cohorts monetize
WITH user_cohorts AS (
    SELECT 
        user_id,
        DATE_TRUNC('month', MIN(event_timestamp)) AS cohort_month,
        JSON_EXTRACT_PATH_TEXT(properties, 'source') AS acquisition_source
    FROM user_events 
    WHERE event_name = 'signup'
    GROUP BY user_id, JSON_EXTRACT_PATH_TEXT(properties, 'source')
),

monthly_revenue AS (
    SELECT 
        user_id,
        DATE_TRUNC('month', event_timestamp) AS revenue_month,
        CASE 
            WHEN JSON_EXTRACT_PATH_TEXT(properties, 'to_plan') = 'pro' THEN 99
            WHEN JSON_EXTRACT_PATH_TEXT(properties, 'to_plan') = 'enterprise' THEN 299
            ELSE 0
        END AS revenue
    FROM user_events 
    WHERE event_name = 'upgrade'
),

cohort_revenue AS (
    SELECT 
        c.cohort_month,
        c.acquisition_source,
        r.revenue_month,
        EXTRACT(EPOCH FROM (r.revenue_month - c.cohort_month)) / 2628000 AS month_number,
        SUM(r.revenue) AS month_revenue,
        COUNT(DISTINCT c.user_id) AS cohort_size
    FROM user_cohorts c
    LEFT JOIN monthly_revenue r ON c.user_id = r.user_id
    GROUP BY c.cohort_month, c.acquisition_source, r.revenue_month
)

SELECT 
    cohort_month,
    acquisition_source,
    month_number,
    cohort_size,
    month_revenue,
    SUM(month_revenue) OVER (
        PARTITION BY cohort_month, acquisition_source 
        ORDER BY month_number 
        ROWS UNBOUNDED PRECEDING
    ) AS cumulative_revenue,
    ROUND(month_revenue / cohort_size::DECIMAL, 2) AS revenue_per_user
FROM cohort_revenue
WHERE month_number IS NOT NULL
ORDER BY cohort_month, acquisition_source, month_number;

This enhanced version segments cohorts by acquisition source and tracks revenue generation, giving you insight into which marketing channels produce the most valuable users over time.

Advanced Retention Analysis

Basic retention rates only tell part of the story. Advanced retention analysis considers different types of user engagement and identifies patterns that drive long-term value.

Here's a comprehensive retention analysis that goes beyond simple "did they come back":

-- Multi-dimensional retention: engagement levels and feature adoption
WITH user_signup AS (
    SELECT 
        user_id,
        DATE_TRUNC('week', MIN(event_timestamp)) AS cohort_week
    FROM user_events 
    WHERE event_name = 'signup'
    GROUP BY user_id
),

weekly_engagement AS (
    SELECT 
        user_id,
        DATE_TRUNC('week', event_timestamp) AS activity_week,
        COUNT(DISTINCT event_name) AS unique_events,
        COUNT(*) AS total_events,
        COUNT(DISTINCT session_id) AS sessions,
        CASE 
            WHEN COUNT(*) >= 20 THEN 'high'
            WHEN COUNT(*) >= 5 THEN 'medium'
            ELSE 'low'
        END AS engagement_level
    FROM user_events 
    WHERE event_name NOT IN ('signup')
    GROUP BY user_id, DATE_TRUNC('week', event_timestamp)
),

retention_data AS (
    SELECT 
        s.cohort_week,
        e.activity_week,
        s.user_id,
        FLOOR((e.activity_week - s.cohort_week) / INTERVAL '7 days') AS week_number,
        e.engagement_level,
        e.total_events,
        e.sessions
    FROM user_signup s
    LEFT JOIN weekly_engagement e ON s.user_id = e.user_id
),

retention_summary AS (
    SELECT 
        cohort_week,
        week_number,
        COUNT(DISTINCT user_id) AS total_users,
        COUNT(DISTINCT CASE WHEN engagement_level = 'high' THEN user_id END) AS high_engagement_users,
        COUNT(DISTINCT CASE WHEN engagement_level = 'medium' THEN user_id END) AS medium_engagement_users,
        COUNT(DISTINCT CASE WHEN engagement_level = 'low' THEN user_id END) AS low_engagement_users,
        AVG(total_events) AS avg_events_per_user,
        AVG(sessions) AS avg_sessions_per_user
    FROM retention_data
    WHERE activity_week IS NOT NULL
    GROUP BY cohort_week, week_number
)

SELECT 
    cohort_week,
    week_number,
    total_users,
    FIRST_VALUE(total_users) OVER (
        PARTITION BY cohort_week ORDER BY week_number
    ) AS cohort_size,
    ROUND(100.0 * total_users / FIRST_VALUE(total_users) OVER (
        PARTITION BY cohort_week ORDER BY week_number
    ), 2) AS overall_retention_rate,
    ROUND(100.0 * high_engagement_users / total_users, 2) AS high_engagement_rate,
    ROUND(avg_events_per_user, 2) AS avg_events,
    ROUND(avg_sessions_per_user, 2) AS avg_sessions
FROM retention_summary
ORDER BY cohort_week, week_number;

This analysis reveals not just who comes back, but how engaged they are when they return. You might discover that while overall retention drops, the users who do return are becoming more engaged—a pattern that suggests successful user onboarding.

Building Conversion Funnels

Conversion funnels track users through sequential steps to identify where they drop off. Unlike retention analysis that looks at behavior over time, funnels focus on specific user journeys.

Let's build a comprehensive signup-to-paid-user funnel:

-- Complete user journey funnel with time-to-convert tracking
WITH funnel_events AS (
    SELECT 
        user_id,
        event_name,
        event_timestamp,
        session_id,
        ROW_NUMBER() OVER (PARTITION BY user_id, event_name ORDER BY event_timestamp) AS event_rank
    FROM user_events
    WHERE event_name IN ('signup', 'first_login', 'feature_used', 'upgrade')
),

-- Get first occurrence of each funnel step per user
user_funnel_steps AS (
    SELECT 
        user_id,
        MAX(CASE WHEN event_name = 'signup' THEN event_timestamp END) AS signup_time,
        MAX(CASE WHEN event_name = 'first_login' THEN event_timestamp END) AS first_login_time,
        MAX(CASE WHEN event_name = 'feature_used' THEN event_timestamp END) AS first_feature_time,
        MAX(CASE WHEN event_name = 'upgrade' THEN event_timestamp END) AS upgrade_time
    FROM funnel_events
    WHERE event_rank = 1  -- Only first occurrence of each event
    GROUP BY user_id
),

-- Calculate funnel progression with timing
funnel_analysis AS (
    SELECT 
        user_id,
        signup_time,
        CASE WHEN first_login_time IS NOT NULL THEN 1 ELSE 0 END AS completed_login,
        CASE WHEN first_feature_time IS NOT NULL THEN 1 ELSE 0 END AS completed_feature_use,
        CASE WHEN upgrade_time IS NOT NULL THEN 1 ELSE 0 END AS completed_upgrade,
        
        -- Time to complete each step (in hours)
        CASE WHEN first_login_time IS NOT NULL 
            THEN EXTRACT(EPOCH FROM (first_login_time - signup_time)) / 3600 
        END AS hours_to_login,
        
        CASE WHEN first_feature_time IS NOT NULL 
            THEN EXTRACT(EPOCH FROM (first_feature_time - signup_time)) / 3600 
        END AS hours_to_feature,
        
        CASE WHEN upgrade_time IS NOT NULL 
            THEN EXTRACT(EPOCH FROM (upgrade_time - signup_time)) / 3600 
        END AS hours_to_upgrade
    FROM user_funnel_steps
    WHERE signup_time IS NOT NULL
)

-- Funnel conversion rates and timing analysis
SELECT 
    'Overall Funnel' AS segment,
    COUNT(*) AS total_signups,
    SUM(completed_login) AS completed_login,
    SUM(completed_feature_use) AS completed_feature_use,
    SUM(completed_upgrade) AS completed_upgrade,
    
    -- Conversion rates
    ROUND(100.0 * SUM(completed_login) / COUNT(*), 2) AS login_conversion_rate,
    ROUND(100.0 * SUM(completed_feature_use) / COUNT(*), 2) AS feature_conversion_rate,
    ROUND(100.0 * SUM(completed_upgrade) / COUNT(*), 2) AS upgrade_conversion_rate,
    
    -- Step-to-step conversion rates
    ROUND(100.0 * SUM(completed_feature_use) / NULLIF(SUM(completed_login), 0), 2) AS login_to_feature_rate,
    ROUND(100.0 * SUM(completed_upgrade) / NULLIF(SUM(completed_feature_use), 0), 2) AS feature_to_upgrade_rate,
    
    -- Timing analysis
    ROUND(AVG(hours_to_login), 2) AS avg_hours_to_login,
    ROUND(AVG(hours_to_feature), 2) AS avg_hours_to_feature,
    ROUND(AVG(hours_to_upgrade), 2) AS avg_hours_to_upgrade,
    
    -- Time distribution percentiles
    ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY hours_to_upgrade), 2) AS median_hours_to_upgrade,
    ROUND(PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY hours_to_upgrade), 2) AS p90_hours_to_upgrade

FROM funnel_analysis;

This comprehensive funnel analysis provides both conversion rates and timing insights. You can see not just where users drop off, but how quickly successful users move through each step.

Key insight: Pay special attention to step-to-step conversion rates. A 90% signup-to-login rate might look good, but if only 20% of users who log in actually use a feature, that's where you should focus optimization efforts.

Let's extend this to analyze funnels by user segment:

-- Segmented funnel analysis - comparing different user groups
WITH user_segments AS (
    SELECT 
        user_id,
        JSON_EXTRACT_PATH_TEXT(properties, 'source') AS acquisition_source,
        JSON_EXTRACT_PATH_TEXT(properties, 'plan') AS initial_plan,
        event_timestamp AS signup_time
    FROM user_events 
    WHERE event_name = 'signup'
),

funnel_events AS (
    SELECT 
        user_id,
        event_name,
        event_timestamp,
        ROW_NUMBER() OVER (PARTITION BY user_id, event_name ORDER BY event_timestamp) AS event_rank
    FROM user_events
    WHERE event_name IN ('signup', 'first_login', 'feature_used', 'upgrade')
),

user_funnel_steps AS (
    SELECT 
        user_id,
        MAX(CASE WHEN event_name = 'first_login' THEN event_timestamp END) AS first_login_time,
        MAX(CASE WHEN event_name = 'feature_used' THEN event_timestamp END) AS first_feature_time,
        MAX(CASE WHEN event_name = 'upgrade' THEN event_timestamp END) AS upgrade_time
    FROM funnel_events
    WHERE event_rank = 1
    GROUP BY user_id
),

segmented_funnel AS (
    SELECT 
        s.acquisition_source,
        s.initial_plan,
        COUNT(*) AS total_signups,
        SUM(CASE WHEN f.first_login_time IS NOT NULL THEN 1 ELSE 0 END) AS completed_login,
        SUM(CASE WHEN f.first_feature_time IS NOT NULL THEN 1 ELSE 0 END) AS completed_feature_use,
        SUM(CASE WHEN f.upgrade_time IS NOT NULL THEN 1 ELSE 0 END) AS completed_upgrade
    FROM user_segments s
    LEFT JOIN user_funnel_steps f ON s.user_id = f.user_id
    GROUP BY s.acquisition_source, s.initial_plan
)

SELECT 
    acquisition_source,
    initial_plan,
    total_signups,
    ROUND(100.0 * completed_login / total_signups, 2) AS login_conversion_rate,
    ROUND(100.0 * completed_feature_use / total_signups, 2) AS feature_conversion_rate,
    ROUND(100.0 * completed_upgrade / total_signups, 2) AS upgrade_conversion_rate,
    ROUND(100.0 * completed_upgrade / NULLIF(completed_feature_use, 0), 2) AS feature_to_upgrade_rate
FROM segmented_funnel
WHERE total_signups >= 10  -- Only include segments with meaningful sample sizes
ORDER BY upgrade_conversion_rate DESC;

This segmented analysis reveals which acquisition channels and initial plans produce users with the highest conversion rates through your funnel.

Combining Cohort and Funnel Analysis

The real power emerges when you combine cohort and funnel analysis to understand how different user groups convert over time. This hybrid approach answers questions like "Do users who sign up in December convert better than those who sign up in June?"

-- Time-based cohort funnel analysis
WITH signup_cohorts AS (
    SELECT 
        user_id,
        DATE_TRUNC('month', event_timestamp) AS signup_month,
        JSON_EXTRACT_PATH_TEXT(properties, 'source') AS acquisition_source,
        event_timestamp AS signup_time
    FROM user_events 
    WHERE event_name = 'signup'
),

conversion_events AS (
    SELECT 
        user_id,
        event_name,
        event_timestamp,
        ROW_NUMBER() OVER (PARTITION BY user_id, event_name ORDER BY event_timestamp) AS event_rank
    FROM user_events
    WHERE event_name IN ('first_login', 'feature_used', 'upgrade')
),

user_conversions AS (
    SELECT 
        user_id,
        MAX(CASE WHEN event_name = 'first_login' AND event_rank = 1 THEN event_timestamp END) AS first_login_time,
        MAX(CASE WHEN event_name = 'feature_used' AND event_rank = 1 THEN event_timestamp END) AS first_feature_time,
        MAX(CASE WHEN event_name = 'upgrade' AND event_rank = 1 THEN event_timestamp END) AS first_upgrade_time
    FROM conversion_events
    GROUP BY user_id
),

cohort_conversion_analysis AS (
    SELECT 
        c.signup_month,
        c.acquisition_source,
        COUNT(*) AS cohort_size,
        
        -- Basic conversion counts
        SUM(CASE WHEN u.first_login_time IS NOT NULL THEN 1 ELSE 0 END) AS login_converts,
        SUM(CASE WHEN u.first_feature_time IS NOT NULL THEN 1 ELSE 0 END) AS feature_converts,
        SUM(CASE WHEN u.first_upgrade_time IS NOT NULL THEN 1 ELSE 0 END) AS upgrade_converts,
        
        -- Time-based conversions (within 7 days, 30 days)
        SUM(CASE WHEN u.first_upgrade_time <= c.signup_time + INTERVAL '7 days' THEN 1 ELSE 0 END) AS upgrade_7d,
        SUM(CASE WHEN u.first_upgrade_time <= c.signup_time + INTERVAL '30 days' THEN 1 ELSE 0 END) AS upgrade_30d,
        
        -- Average time to convert (in days)
        AVG(CASE WHEN u.first_upgrade_time IS NOT NULL 
            THEN EXTRACT(EPOCH FROM (u.first_upgrade_time - c.signup_time)) / 86400 
        END) AS avg_days_to_upgrade
        
    FROM signup_cohorts c
    LEFT JOIN user_conversions u ON c.user_id = u.user_id
    GROUP BY c.signup_month, c.acquisition_source
)

SELECT 
    signup_month,
    acquisition_source,
    cohort_size,
    
    -- Overall conversion rates
    ROUND(100.0 * upgrade_converts / cohort_size, 2) AS overall_upgrade_rate,
    ROUND(100.0 * upgrade_7d / cohort_size, 2) AS upgrade_rate_7d,
    ROUND(100.0 * upgrade_30d / cohort_size, 2) AS upgrade_rate_30d,
    
    -- Funnel efficiency
    ROUND(100.0 * upgrade_converts / NULLIF(feature_converts, 0), 2) AS feature_to_upgrade_efficiency,
    
    ROUND(avg_days_to_upgrade, 1) AS avg_days_to_upgrade,
    
    -- Cohort performance vs overall average
    ROUND(
        100.0 * (upgrade_converts::DECIMAL / cohort_size) / 
        AVG(upgrade_converts::DECIMAL / cohort_size) OVER (), 
        2
    ) AS performance_vs_average

FROM cohort_conversion_analysis
WHERE cohort_size >= 20  -- Focus on statistically meaningful cohorts
ORDER BY signup_month, upgrade_rate_30d DESC;

This combined analysis shows you which months and acquisition sources produce the most valuable users, and how conversion patterns change over time.

Seasonal insight: You might discover that users who sign up in Q4 convert at higher rates but take longer to upgrade, while Q2 signups convert quickly but at lower rates. This suggests different onboarding strategies for different time periods.

Hands-On Exercise

Let's build a comprehensive retention and funnel dashboard for a SaaS product. You'll create a query that product managers can use to understand user behavior across multiple dimensions.

-- Comprehensive user behavior analysis dashboard
WITH base_events AS (
    -- Normalize your event data structure
    SELECT 
        user_id,
        event_name,
        event_timestamp,
        DATE_TRUNC('week', event_timestamp) AS event_week,
        JSON_EXTRACT_PATH_TEXT(properties, 'source') AS source,
        JSON_EXTRACT_PATH_TEXT(properties, 'plan') AS plan,
        session_id
    FROM user_events
),

user_cohorts AS (
    -- Define user cohorts by signup week and characteristics
    SELECT 
        user_id,
        MIN(event_timestamp) AS signup_time,
        DATE_TRUNC('week', MIN(event_timestamp)) AS cohort_week,
        FIRST_VALUE(source) OVER (PARTITION BY user_id ORDER BY event_timestamp) AS acquisition_source,
        FIRST_VALUE(plan) OVER (PARTITION BY user_id ORDER BY event_timestamp) AS initial_plan
    FROM base_events
    WHERE event_name = 'signup'
    GROUP BY user_id, source, plan
),

weekly_activity AS (
    -- Track user activity by week
    SELECT 
        user_id,
        event_week,
        COUNT(DISTINCT event_name) AS unique_events,
        COUNT(*) AS total_events,
        COUNT(DISTINCT session_id) AS sessions,
        MAX(CASE WHEN event_name = 'upgrade' THEN 1 ELSE 0 END) AS upgraded_this_week
    FROM base_events
    WHERE event_name != 'signup'
    GROUP BY user_id, event_week
),

-- Your task: Complete this analysis by adding:
-- 1. Retention calculation by cohort and acquisition source
-- 2. Funnel analysis from signup to first value event to upgrade
-- 3. Identify power users (top 10% by activity) and their characteristics
-- 4. Calculate weekly cohort performance metrics

retention_analysis AS (
    SELECT 
        c.cohort_week,
        c.acquisition_source,
        a.event_week,
        FLOOR((a.event_week - c.cohort_week) / INTERVAL '7 days') AS week_number,
        COUNT(DISTINCT c.user_id) AS cohort_size,
        COUNT(DISTINCT a.user_id) AS active_users,
        SUM(a.upgraded_this_week) AS upgrades_this_week
    FROM user_cohorts c
    LEFT JOIN weekly_activity a ON c.user_id = a.user_id
    GROUP BY c.cohort_week, c.acquisition_source, a.event_week
),

-- Complete the final summary query
final_dashboard AS (
    -- Your implementation here
    -- Should include retention rates, funnel conversion rates, and cohort comparisons
)

SELECT * FROM final_dashboard;

Your challenge: Complete this query to create a dashboard that shows:

  1. Weekly retention rates by cohort and acquisition source
  2. Conversion funnel performance for each cohort
  3. Identification of high-value user segments
  4. Performance trends over time

Common Mistakes & Troubleshooting

Mistake #1: Ignoring Data Quality Issues

The most common error in cohort analysis is assuming your event data is clean. Always validate your assumptions:

-- Data quality check for cohort analysis
SELECT 
    'Duplicate signup events' AS issue,
    COUNT(*) AS affected_users
FROM (
    SELECT user_id, COUNT(*) as signup_count
    FROM user_events 
    WHERE event_name = 'signup'
    GROUP BY user_id
    HAVING COUNT(*) > 1
) duplicates

UNION ALL

SELECT 
    'Users with events before signup' AS issue,
    COUNT(*) AS affected_users
FROM (
    SELECT DISTINCT e1.user_id
    FROM user_events e1
    JOIN user_events e2 ON e1.user_id = e2.user_id
    WHERE e1.event_name != 'signup' 
    AND e2.event_name = 'signup'
    AND e1.event_timestamp < e2.event_timestamp
) pre_signup_activity;

Mistake #2: Incorrect Time Window Calculations

Date arithmetic in SQL can be tricky, especially across month boundaries:

-- Wrong: This doesn't account for varying month lengths
DATEDIFF(month, signup_date, activity_date) AS month_number

-- Correct: Use proper interval arithmetic
EXTRACT(EPOCH FROM (activity_date - signup_date)) / 2628000 AS month_number

Mistake #3: Survivorship Bias in Retention Analysis

Only analyzing users who are still active introduces bias. Always include all users in your cohort:

-- Wrong: Only shows retained users
SELECT cohort_month, AVG(activity_days)
FROM user_activity 
WHERE last_activity >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY cohort_month;

-- Correct: Shows all users, with NULL for inactive ones
SELECT 
    c.cohort_month,
    COUNT(*) AS total_users,
    COUNT(a.user_id) AS active_users,
    AVG(a.activity_days) AS avg_activity_days
FROM cohorts c
LEFT JOIN user_activity a ON c.user_id = a.user_id 
    AND a.last_activity >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY c.cohort_month;

Performance Troubleshooting:

For large datasets, these queries can become slow. Key optimization strategies:

  1. Partition your tables by date if you have millions of events
  2. Create composite indexes on (user_id, event_timestamp, event_name)
  3. Use materialized views for cohort definitions that don't change often
  4. Limit time ranges in development—analyze 3 months of data before scaling to years
-- Index creation for better performance
CREATE INDEX idx_events_user_time_event 
ON user_events (user_id, event_timestamp, event_name);

CREATE INDEX idx_events_event_time 
ON user_events (event_name, event_timestamp);

Summary & Next Steps

You now have the SQL skills to build sophisticated user behavior analyses that drive product decisions. Cohort analysis reveals how different user segments behave over time, funnel analysis identifies conversion bottlenecks, and retention metrics track long-term user value.

The key insights from this lesson:

  • Cohort analysis segments users by acquisition characteristics and tracks behavior over time
  • Retention analysis goes beyond simple "did they return" to measure engagement quality
  • Funnel analysis identifies specific drop-off points in user journeys
  • Combined approaches reveal how different user segments convert and engage over time

Next steps for your development:

  1. Practice with your own data—apply these techniques to your company's user events
  2. Automate the analysis—turn these queries into scheduled reports or dashboards
  3. Extend the analysis—add more sophisticated segmentation like user personas or behavioral clustering
  4. Learn advanced techniques—explore statistical significance testing for cohort comparisons and predictive retention modeling

Advanced topics to explore:

  • Statistical significance testing for cohort performance differences
  • Predictive analytics for churn risk scoring
  • Real-time retention tracking with streaming data
  • Machine learning approaches to user segmentation and lifetime value prediction

The analytical foundation you've built here scales to enterprise-level user research. Companies like Spotify, Netflix, and Airbnb use these exact techniques (with more sophisticated infrastructure) to understand and optimize user behavior at massive scale.

Learning Path: Advanced SQL Queries

Previous

Working with JSON and Arrays in Modern SQL: Complete Guide

Related Articles

SQL🌱 Foundation

Working with JSON and Arrays in Modern SQL: Complete Guide

14 min
SQL🔥 Expert

Advanced JOIN Patterns: Self Joins, Anti Joins, and Semi Joins

20 min
SQL⚡ Practitioner

SQL Transactions, Isolation Levels, and Locking: A Complete Guide to Concurrent Database Programming

17 min

On this page

  • Prerequisites
  • Understanding the Data Structure
  • Building Your First Cohort Analysis
  • Advanced Retention Analysis
  • Building Conversion Funnels
  • Combining Cohort and Funnel Analysis
  • Hands-On Exercise
  • Common Mistakes & Troubleshooting
  • Summary & Next Steps