Mastering SQL Set Operations: UNION, INTERSECT, and EXCEPT for Complex Data Reconciliation and Deduplication

Introduction

You're three hours into a data migration audit. Two systems — the legacy CRM and the new cloud platform — both claim to be the source of truth for customer records. Your job is to find out which customers exist in only one system, which appear in both but with conflicting data, and which have been duplicated within the same system. You've got access to both databases, a deadline, and a sneaking suspicion that someone's going to be very unhappy with what you find.

This is exactly the kind of problem SQL set operations were designed for. UNION, INTERSECT, and EXCEPT (sometimes called MINUS in Oracle) are the mathematical operators of relational algebra brought to life in SQL. They let you combine, compare, and contrast entire result sets the same way a mathematician would work with sets — finding unions, intersections, and differences between collections of values. Most SQL developers have a passing familiarity with UNION ALL, but the full power of these operators — including their interaction with NULL handling, query optimization, deduplication semantics, and cross-database reconciliation — is rarely taught with the depth it deserves.

By the end of this lesson, you'll have a complete, expert-level understanding of how set operations work under the hood, how to use them to solve real data quality problems, and when they're the right tool versus when they'll burn you. You'll be writing queries that data engineers actually use in production.

What you'll learn:

The relational algebra foundations behind set operations and why column semantics — not names — drive compatibility
The critical behavioral difference between UNION and UNION ALL, including the performance implications of implicit deduplication
How INTERSECT and EXCEPT handle NULL values differently than standard equality operators
Advanced patterns for multi-source data reconciliation, change detection, and symmetric difference queries
Query execution internals and how to optimize set operations for large-scale production workloads

Prerequisites

This lesson assumes you're comfortable with:

Writing multi-table JOIN queries and understanding their execution
Aggregate functions (GROUP BY, HAVING, COUNT, SUM)
Subqueries and CTEs (Common Table Expressions)
A working knowledge of execution plans (understanding terms like hash join, sort, and scan)
Familiarity with at least one major RDBMS: PostgreSQL, SQL Server, or MySQL 8+

Platform note: All examples in this lesson are written in PostgreSQL-compatible SQL. Where SQL Server or MySQL diverge in syntax or behavior, those differences will be explicitly called out.

The Relational Foundation: What Set Operations Actually Do

Before writing a single query, let's establish something that trips up even experienced developers: set operations work on rows as atomic units. When you write SELECT first_name, email FROM customers, the result is a set of rows where each row is the combination (first_name, email). Set operations then apply to those complete row tuples.

This is why the fundamental rule of set operations is structural compatibility, not name compatibility. The columns in each SELECT statement must match in number and have compatible data types. The column names from the first query are what appear in the final result. This behavior surprises people constantly:

-- This works, and the result column is named "customer_email"
SELECT customer_email FROM crm_customers
UNION
SELECT email_address FROM erp_customers;

-- The result column is named "customer_email" — the second column name is ignored

The database engine compares rows from both result sets by their positional values, not by column names. The first column of the first query is compared with the first column of the second query, and so on. If your types aren't compatible, the engine will attempt implicit casting, and when that fails, you'll get an error that can be confusing if you forget this positional rule.

Here's the mental model that makes everything else click: imagine two result sets laid out as mathematical sets:

Set A = {(alice@example.com, active), (bob@example.com, inactive), (carol@example.com, active)}
Set B = {(bob@example.com, inactive), (carol@example.com, suspended), (dave@example.com, active)}

UNION gives you every unique row from both sets
INTERSECT gives you rows that appear in both sets
EXCEPT gives you rows in A that don't appear in B

The word "appear" is doing a lot of work there, especially when NULL values are involved — and we'll dig into that shortly.

UNION vs. UNION ALL: The Deduplication Decision

This is the most commonly misunderstood distinction in set operations, and it has serious performance consequences.

UNION returns all distinct rows from both queries. UNION ALL returns all rows from both queries, including duplicates. The difference sounds simple, but the mechanism matters enormously.

The Hidden Cost of UNION

When you write UNION (without ALL), the database must perform a deduplication step across the entire combined result set. Depending on the query planner, this typically means either:

Sort-based deduplication: Sort the combined rows, then eliminate adjacent duplicates — O(n log n)
Hash-based deduplication: Build a hash table of all rows seen so far, discard rows whose hash exists — O(n) average, but with significant memory pressure

For large result sets, this is expensive. Let's look at a real scenario.

You're combining transaction data from two regional payment processors. Both write to separate tables in your analytics database, and there's an overlap window during which transactions from the cutover period appear in both tables:

-- NAIVE: This will deduplicate, which costs CPU and memory
SELECT transaction_id, amount, customer_id, processed_at
FROM payments_processor_east
UNION
SELECT transaction_id, amount, customer_id, processed_at
FROM payments_processor_west;

-- BETTER: If you know the tables don't overlap, use UNION ALL
SELECT transaction_id, amount, customer_id, processed_at
FROM payments_processor_east
UNION ALL
SELECT transaction_id, amount, customer_id, processed_at
FROM payments_processor_west;

The second query can be executed with a simple sequential scan of both tables and no additional work. The first query requires materializing both result sets and performing deduplication. On tens of millions of rows, this is the difference between a 2-second query and a 45-second query.

Rule of thumb: Start with UNION ALL unless you have a specific, documented reason to need deduplication. If you need deduplication, ask yourself whether you should be investigating why duplicates exist before just silently removing them.

When UNION (Deduplication) Is Actually the Right Answer

There are legitimate cases for UNION with deduplication. The most common is combining data from multiple sources where you genuinely expect and want idempotency:

-- Combine all email addresses that should receive a compliance notice
-- Users may exist in multiple systems; we want exactly one email per address
SELECT email_address FROM crm_customers WHERE region = 'EU'
UNION
SELECT email_address FROM legacy_users WHERE gdpr_consent_required = true
UNION
SELECT email_address FROM trial_accounts WHERE created_at > '2023-01-01';

Here, UNION is correct and intentional — you want exactly one copy of each email address regardless of how many source tables it appears in.

Deduplication Semantics: NULL Is Equal to NULL

This is where things get genuinely subtle and where most documentation falls short.

In standard SQL comparisons, NULL = NULL is FALSE (or more precisely, UNKNOWN). This is why WHERE column = NULL never returns rows — you have to use IS NULL. But set operations use a different equality definition called IS NOT DISTINCT FROM, which treats two NULL values as equal.

-- Create two simple tables to demonstrate
CREATE TEMP TABLE set_a (val INTEGER);
CREATE TEMP TABLE set_b (val INTEGER);

INSERT INTO set_a VALUES (1), (2), (NULL), (NULL);
INSERT INTO set_b VALUES (2), (3), (NULL);

-- UNION deduplication treats NULL as equal to NULL
SELECT val FROM set_a
UNION
SELECT val FROM set_b;

Result:

val
-----
  1
  2
  3
NULL

Only one NULL appears in the result, even though set_a had two NULL values and set_b had one. The deduplication logic correctly identifies all three NULLs as the same value and retains only one. This is almost always what you want, but it's important to understand it's happening.

Compare this to what would happen with a naive JOIN-based approach where you'd use = for comparison — the NULLs would fall through.

INTERSECT: Finding What Belongs to Both Sets

INTERSECT returns rows that appear in both result sets. It's the SQL equivalent of a set intersection, and it's surprisingly powerful for data reconciliation tasks.

The Classic Use Case: Change Detection

Consider you're responsible for validating a data migration. You've migrated customer records from an old schema to a new one, and you need to verify that all customers who existed before the migration also exist after, with the same key attributes.

-- Find customers whose core attributes are identical in both systems
-- (these are successfully migrated with no changes)
SELECT 
    customer_id,
    email,
    first_name,
    last_name,
    date_of_birth
FROM legacy_customers
INTERSECT
SELECT 
    customer_id,
    email,
    first_name,
    last_name,
    date_of_birth
FROM migrated_customers;

This query finds the rows that are perfectly identical in both tables. Every column value must match (using the IS NOT DISTINCT FROM semantics, so NULLs match NULLs).

INTERSECT vs. INNER JOIN: A Critical Distinction

Developers often ask why you'd use INTERSECT when an INNER JOIN does something similar. The distinction matters a lot in practice.

INNER JOIN matches rows based on specified join keys and returns all columns from both tables. INTERSECT compares entire rows and returns only distinct matching rows from one conceptual set. They have fundamentally different cardinality behavior:

-- INNER JOIN version (verbose, different semantics)
SELECT DISTINCT 
    l.customer_id, l.email, l.first_name, l.last_name, l.date_of_birth
FROM legacy_customers l
INNER JOIN migrated_customers m 
    ON l.customer_id = m.customer_id
   AND l.email = m.email
   AND l.first_name = m.first_name
   AND l.last_name = m.last_name
   AND (l.date_of_birth = m.date_of_birth 
        OR (l.date_of_birth IS NULL AND m.date_of_birth IS NULL));

-- INTERSECT version (concise, correct NULL handling built-in)
SELECT customer_id, email, first_name, last_name, date_of_birth
FROM legacy_customers
INTERSECT
SELECT customer_id, email, first_name, last_name, date_of_birth
FROM migrated_customers;

Notice the JOIN version requires explicit IS NULL handling. The INTERSECT version handles it automatically. More importantly, the JOIN version can produce duplicate rows if either table has duplicates on the join keys — the DISTINCT is necessary to match INTERSECT semantics. For wide tables with many columns, INTERSECT is far cleaner and less error-prone.

Performance Considerations for INTERSECT

INTERSECT (like UNION) implies deduplication. Unlike UNION ALL, there is no INTERSECT ALL in most databases (though PostgreSQL does support it). INTERSECT must deduplicate its inputs and then find common rows.

For performance-critical queries, consider pre-filtering each subquery to reduce the rows being compared:

-- Better performance: filter before intersecting
SELECT customer_id, email, status
FROM legacy_customers
WHERE created_at >= '2022-01-01'  -- narrow the scope first
INTERSECT
SELECT customer_id, email, status
FROM migrated_customers
WHERE created_at >= '2022-01-01';

EXCEPT: Finding What's Missing or Unique

EXCEPT (called MINUS in Oracle and older MySQL) returns rows from the first result set that do not appear in the second result set. This is the workhorse of data reconciliation — it's how you find gaps, orphaned records, and missing migrations.

Migration Completeness Verification

Let's continue the migration scenario. After running the migration, you need to know: which legacy customers were not successfully migrated?

-- Customers in legacy system but NOT in migrated system
SELECT customer_id, email, first_name, last_name
FROM legacy_customers
EXCEPT
SELECT customer_id, email, first_name, last_name
FROM migrated_customers;

This is clean and direct. Any row returned represents a customer who either wasn't migrated or whose data changed during migration (since EXCEPT compares all specified columns).

But here's a subtlety that bites people: if a customer's first_name changed during migration, they'll appear in this result even though their customer_id was migrated. You need to be deliberate about which columns you include.

-- Find customer_ids that don't exist in the migrated system at all
-- (regardless of whether other fields changed)
SELECT customer_id FROM legacy_customers
EXCEPT
SELECT customer_id FROM migrated_customers;

-- Find customer_ids that exist in both but have DIFFERENT data
-- (this is the "dirty migration" query)
SELECT customer_id FROM legacy_customers
EXCEPT
SELECT customer_id FROM migrated_customers
-- is NOT what we want here. Use this instead:

SELECT customer_id
FROM legacy_customers
WHERE customer_id IN (
    SELECT customer_id FROM migrated_customers
)
EXCEPT
SELECT customer_id
FROM (
    SELECT customer_id, email, first_name, last_name
    FROM legacy_customers
    INTERSECT
    SELECT customer_id, email, first_name, last_name
    FROM migrated_customers
) clean_matches;

That last query finds customers who made it to the migration but with data discrepancies — the classic "exists in both, but something changed" scenario. We'll build this into a complete reconciliation framework later.

The Symmetric Difference: Finding Records Unique to Either Side

One of the most useful patterns in data reconciliation is the symmetric difference — records that appear in either set A or set B, but not in both. SQL doesn't have a native SYMMETRIC EXCEPT operator, but you can build it easily:

-- Records in legacy but not migrated (missing from migration)
SELECT customer_id, email, 'missing_from_migration' AS status
FROM (
    SELECT customer_id, email FROM legacy_customers
    EXCEPT
    SELECT customer_id, email FROM migrated_customers
) legacy_only

UNION ALL

-- Records in migrated but not legacy (phantom records)
SELECT customer_id, email, 'phantom_in_migration' AS status
FROM (
    SELECT customer_id, email FROM migrated_customers
    EXCEPT
    SELECT customer_id, email FROM legacy_customers
) migrated_only;

This gives you a complete picture: records that should have been migrated but weren't, and records that appear in the migration with no corresponding legacy source.

Warning: EXCEPT is directional. A EXCEPT B gives you rows in A not in B. B EXCEPT A gives you rows in B not in A. These are completely different result sets. Always think explicitly about which direction you need.

Advanced Pattern: Full Data Reconciliation Framework

Let's build a complete, production-quality reconciliation query. This is the kind of query you'd actually put in a data quality monitoring job.

The scenario: You have a financial reporting system where account balances are written to two separate tables — one from the operational system and one from the accounting system. Every month-end, they should agree. Your job is to produce a reconciliation report.

WITH 
-- Normalize both sources to the same structure
ops_balances AS (
    SELECT 
        account_id,
        COALESCE(account_code, 'UNKNOWN') AS account_code,
        ROUND(closing_balance::NUMERIC, 2) AS balance,
        reporting_date
    FROM operational_system.account_balances
    WHERE reporting_date = '2024-01-31'
      AND is_active = true
),

acct_balances AS (
    SELECT 
        account_id,
        account_code,
        ROUND(balance_amount::NUMERIC, 2) AS balance,
        period_end_date AS reporting_date
    FROM accounting_system.period_balances
    WHERE period_end_date = '2024-01-31'
      AND status = 'CLOSED'
),

-- Records identical in both systems (clean)
clean_matches AS (
    SELECT account_id, account_code, balance, reporting_date
    FROM ops_balances
    INTERSECT
    SELECT account_id, account_code, balance, reporting_date
    FROM acct_balances
),

-- In ops but not in accounting (missing from accounting)
missing_from_accounting AS (
    SELECT account_id, account_code, balance, reporting_date
    FROM ops_balances
    EXCEPT
    SELECT account_id, account_code, balance, reporting_date
    FROM acct_balances
),

-- In accounting but not in ops (phantom in accounting)
phantom_in_accounting AS (
    SELECT account_id, account_code, balance, reporting_date
    FROM acct_balances
    EXCEPT
    SELECT account_id, account_code, balance, reporting_date
    FROM ops_balances
),

-- Accounts that exist in both but with balance discrepancies
-- (these appear in both missing sets, having stripped balance from comparison)
discrepancies AS (
    SELECT 
        mfa.account_id,
        mfa.account_code,
        mfa.balance AS ops_balance,
        pia.balance AS acct_balance,
        ROUND(ABS(mfa.balance - pia.balance), 2) AS difference
    FROM missing_from_accounting mfa
    INNER JOIN phantom_in_accounting pia 
        ON mfa.account_id = pia.account_id
)

-- Final reconciliation report
SELECT 
    'CLEAN' AS reconciliation_status,
    account_id,
    account_code,
    balance AS ops_balance,
    balance AS acct_balance,
    0.00 AS difference,
    reporting_date
FROM clean_matches

UNION ALL

SELECT 
    'MISSING_FROM_ACCOUNTING',
    account_id,
    account_code,
    balance,
    NULL,
    balance,
    reporting_date
FROM missing_from_accounting
WHERE account_id NOT IN (SELECT account_id FROM discrepancies)

UNION ALL

SELECT 
    'PHANTOM_IN_ACCOUNTING',
    account_id,
    account_code,
    NULL,
    balance,
    balance,
    reporting_date
FROM phantom_in_accounting
WHERE account_id NOT IN (SELECT account_id FROM discrepancies)

UNION ALL

SELECT 
    'BALANCE_DISCREPANCY',
    account_id,
    account_code,
    ops_balance,
    acct_balance,
    difference,
    '2024-01-31'::DATE
FROM discrepancies

ORDER BY reconciliation_status, account_id;

This query produces a complete reconciliation report in a single pass. The CTE structure makes the logic auditable, and the final UNION ALL (not UNION, since all four categories are mutually exclusive) assembles the report without any deduplication overhead.

Ordering, Limiting, and Composing Set Operations

ORDER BY with Set Operations

A common gotcha: you can only have one ORDER BY clause in a set operation, and it goes at the very end. It applies to the entire result, not to individual queries.

-- CORRECT
SELECT customer_id, email FROM legacy_customers
UNION ALL
SELECT customer_id, email FROM migrated_customers
ORDER BY customer_id;

-- WRONG — this will error
SELECT customer_id, email FROM legacy_customers ORDER BY customer_id  -- ERROR
UNION ALL
SELECT customer_id, email FROM migrated_customers;

If you need to order intermediate results before combining them, use a subquery or CTE:

-- Order within a subquery, then union
SELECT customer_id, email, row_number
FROM (
    SELECT customer_id, email, ROW_NUMBER() OVER (ORDER BY created_at) AS row_number
    FROM legacy_customers
) ranked_legacy

UNION ALL

SELECT customer_id, email, row_number
FROM (
    SELECT customer_id, email, ROW_NUMBER() OVER (ORDER BY created_at) AS row_number
    FROM migrated_customers
) ranked_migrated;

LIMIT and FETCH FIRST with Set Operations

Similarly, LIMIT (or FETCH FIRST) applies to the entire combined result:

-- Returns the first 100 rows from the combined result
SELECT customer_id, email FROM legacy_customers
UNION ALL
SELECT customer_id, email FROM migrated_customers
LIMIT 100;

If you want to limit individual subqueries, wrap them:

SELECT customer_id, email FROM (
    SELECT customer_id, email FROM legacy_customers LIMIT 1000
) top_legacy

UNION ALL

SELECT customer_id, email FROM (
    SELECT customer_id, email FROM migrated_customers LIMIT 1000
) top_migrated;

Operator Precedence in Chains

When you chain multiple set operations, precedence matters. In standard SQL:

INTERSECT has higher precedence than UNION and EXCEPT
UNION and EXCEPT have equal precedence and evaluate left to right

This means:

-- This evaluates as: A UNION (B INTERSECT C)
SELECT * FROM a
UNION
SELECT * FROM b
INTERSECT
SELECT * FROM c;

-- This evaluates as: (A UNION B) EXCEPT C
SELECT * FROM a
UNION
SELECT * FROM b
EXCEPT
SELECT * FROM c;

Warning: Don't rely on implicit precedence. Parentheses are cheap, bugs are expensive. Always parenthesize complex set operation chains explicitly to make your intent clear and prevent subtle ordering bugs.

-- Explicit and unambiguous
(SELECT * FROM a UNION SELECT * FROM b)
EXCEPT
SELECT * FROM c;

Performance Deep Dive: Execution Internals

Understanding how the query planner executes set operations helps you write faster queries and interpret execution plans correctly.

UNION ALL: The Free Lunch

UNION ALL is implemented as an append operation. The query planner reads from each source in sequence (or in some cases, in parallel) and passes rows directly to the next operation without any intermediate materialization. It's essentially free — you pay only for the cost of reading the underlying data.

In PostgreSQL, you'll see this in the execution plan as an Append node:

EXPLAIN SELECT email FROM legacy_customers
UNION ALL
SELECT email FROM migrated_customers;

-- Typical plan:
-- Append
--   -> Seq Scan on legacy_customers
--   -> Seq Scan on migrated_customers

UNION: The Sorting Cost

UNION requires deduplication. The planner typically uses one of two strategies:

HashAggregate: Build a hash table of all rows, emit only new ones. Fast for data that fits in memory (work_mem in PostgreSQL). Degrades badly when spilling to disk.

Sort + Unique: Sort all rows from both inputs, then eliminate adjacent duplicates. Predictable memory usage, good when data is already partially sorted or when indexes can be used.

EXPLAIN SELECT email FROM legacy_customers
UNION
SELECT email FROM migrated_customers;

-- Common plan:
-- HashAggregate (or Sort + Unique)
--   -> Append
--     -> Seq Scan on legacy_customers
--     -> Seq Scan on migrated_customers

For very large datasets, you can sometimes help the planner by increasing work_mem for the session before running the query:

SET work_mem = '256MB';

SELECT email FROM legacy_customers
UNION
SELECT email FROM migrated_customers;

RESET work_mem;

INTERSECT and EXCEPT: The Set Algebra Cost

Both INTERSECT and EXCEPT typically execute using one of these strategies:

HashSetOp: Hash both inputs and compare — efficient but memory-intensive
Sort-based SetOp: Sort both inputs and walk them in parallel (merge join style) — predictable, good for large datasets

The key insight: INTERSECT and EXCEPT must see all rows from both inputs before they can emit results. They have no equivalent to a streaming UNION ALL. This means they have inherent memory and latency costs that UNION ALL doesn't.

For production workloads with massive tables, consider whether a NOT EXISTS or NOT IN subquery might be more optimizer-friendly and allow better index usage:

-- EXCEPT approach (may not use indexes efficiently)
SELECT customer_id FROM legacy_customers
EXCEPT
SELECT customer_id FROM migrated_customers;

-- NOT EXISTS approach (can leverage indexes on customer_id)
SELECT l.customer_id
FROM legacy_customers l
WHERE NOT EXISTS (
    SELECT 1 FROM migrated_customers m
    WHERE m.customer_id = l.customer_id
);

The NOT EXISTS version allows the planner to use an index on migrated_customers.customer_id for each row lookup, which can be dramatically faster when the result set is small relative to the input. However, note that NOT EXISTS with NULL-able columns behaves differently — it uses regular equality, not the IS NOT DISTINCT FROM semantics that EXCEPT uses.

Rule: For single-key comparisons, NOT EXISTS / NOT IN with proper indexing usually beats EXCEPT. For multi-column row comparisons, EXCEPT is often cleaner and comparably fast. Profile both on your actual data.

Cross-Database and Cross-Schema Set Operations

Set operations become especially powerful when working across databases, schemas, or even database systems via foreign data wrappers.

Within the Same Database, Different Schemas

This is straightforward in PostgreSQL and SQL Server — just qualify your table names:

-- PostgreSQL: compare across schemas
SELECT customer_id, email, status
FROM public.customers
EXCEPT
SELECT customer_id, email, status
FROM archive.customers;

Using Foreign Data Wrappers (PostgreSQL)

PostgreSQL's foreign data wrappers let you query remote tables as if they were local. Combined with set operations, this enables real-time cross-system reconciliation:

-- Assuming a foreign server 'legacy_db' has been configured
CREATE FOREIGN TABLE legacy_customers_fdw (
    customer_id INTEGER,
    email VARCHAR(255),
    status VARCHAR(50)
)
SERVER legacy_db
OPTIONS (schema_name 'public', table_name 'customers');

-- Now reconcile across systems in a single query
SELECT customer_id, email, status
FROM legacy_customers_fdw
EXCEPT
SELECT customer_id, email, status
FROM new_customers;

The query planner will push down predicates to the remote system where possible, but the set operation itself happens locally. For large datasets, this means significant data movement — consider materializing to temp tables first:

-- More efficient for large datasets
CREATE TEMP TABLE legacy_snapshot AS
SELECT customer_id, email, status
FROM legacy_customers_fdw
WHERE updated_at > CURRENT_DATE - INTERVAL '30 days';

CREATE TEMP TABLE new_snapshot AS
SELECT customer_id, email, status
FROM new_customers
WHERE updated_at > CURRENT_DATE - INTERVAL '30 days';

-- Now set operations are local
SELECT customer_id, email, status FROM legacy_snapshot
EXCEPT
SELECT customer_id, email, status FROM new_snapshot;

Database-Specific Variations

Oracle: Uses MINUS instead of EXCEPT. Supports INTERSECT ALL and MINUS ALL (though rarely used). Same deduplication semantics.

-- Oracle syntax
SELECT customer_id FROM legacy_customers
MINUS
SELECT customer_id FROM migrated_customers;

MySQL (8.0+): Added INTERSECT and EXCEPT in version 8.0.31. Pre-8.0 MySQL only supports UNION. If you're on older MySQL, you'll need to use JOIN-based workarounds.

SQL Server: Full support for all three operators, same semantics as PostgreSQL. One notable difference: SQL Server doesn't support INTERSECT ALL or EXCEPT ALL.

BigQuery: Full support. For massive datasets, BigQuery's optimizer typically handles these well, but EXCEPT DISTINCT (their syntax for what PostgreSQL calls just EXCEPT) can be expensive on billion-row tables. Consider using window functions or approximate matching instead.

Deduplication Strategies Beyond Simple UNION

Sometimes you need more nuanced deduplication than "remove exact row duplicates." This is where set operations combine with window functions.

Deduplicating Within a Single Table by Priority

You have a customer table where the same email appears multiple times (once per source system). You want to keep the record from the "best" source:

WITH ranked_customers AS (
    SELECT 
        customer_id,
        email,
        first_name,
        last_name,
        source_system,
        ROW_NUMBER() OVER (
            PARTITION BY email 
            ORDER BY 
                CASE source_system 
                    WHEN 'CRM' THEN 1 
                    WHEN 'ERP' THEN 2 
                    WHEN 'LEGACY' THEN 3 
                    ELSE 4 
                END,
                created_at DESC
        ) AS priority_rank
    FROM all_customers
)
SELECT customer_id, email, first_name, last_name, source_system
FROM ranked_customers
WHERE priority_rank = 1;

Identifying True Duplicates for Investigation

When you need to find not just the duplicates but all records involved in a duplication (so you can route them to a data quality workflow):

-- Find all customers where the email appears more than once
-- Return ALL copies of the duplicated records, not just the extras
WITH duplicate_emails AS (
    SELECT email
    FROM all_customers
    GROUP BY email
    HAVING COUNT(*) > 1
)
SELECT c.*
FROM all_customers c
INNER JOIN duplicate_emails d ON c.email = d.email
ORDER BY c.email, c.created_at;

Combining UNION ALL with Deduplication for ETL

A common ETL pattern is to load incremental data and deduplicate against the full table. Using UNION ALL plus a window function is more efficient than UNION on large tables:

-- Efficient upsert pattern using UNION ALL + window function
INSERT INTO customers_final
SELECT customer_id, email, first_name, last_name, updated_at
FROM (
    SELECT 
        customer_id, email, first_name, last_name, updated_at,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id 
            ORDER BY updated_at DESC
        ) AS rn
    FROM (
        -- Combine existing + new records
        SELECT customer_id, email, first_name, last_name, updated_at
        FROM customers_final
        
        UNION ALL
        
        SELECT customer_id, email, first_name, last_name, updated_at
        FROM customers_staging
    ) combined
) deduped
WHERE rn = 1;

This pattern materializes the UNION ALL (cheap), then uses a window function to pick the most recent version of each customer — effectively implementing an upsert without an actual UPSERT statement.

Hands-On Exercise

This exercise uses a realistic scenario that exercises all three set operations together. Work through it in your own PostgreSQL environment.

Setup

-- Create the scenario: two systems tracking employee records
-- after a company merger

CREATE TABLE company_a_employees (
    employee_id INTEGER,
    email VARCHAR(255),
    department VARCHAR(100),
    salary NUMERIC(10,2),
    hire_date DATE
);

CREATE TABLE company_b_employees (
    employee_id INTEGER,
    email VARCHAR(255), 
    department VARCHAR(100),
    salary NUMERIC(10,2),
    hire_date DATE
);

-- Company A employees
INSERT INTO company_a_employees VALUES
(1001, 'alice.chen@companya.com', 'Engineering', 95000.00, '2019-03-15'),
(1002, 'bob.martinez@companya.com', 'Marketing', 72000.00, '2020-07-01'),
(1003, 'carol.lee@companya.com', 'Engineering', 105000.00, '2018-01-10'),
(1004, 'david.park@companya.com', 'Sales', 68000.00, '2021-04-22'),
(1005, 'eve.johnson@companya.com', 'Engineering', 88000.00, '2022-02-14'),
(1006, 'shared@bothcompanies.com', 'IT', 78000.00, '2017-06-01'); -- contractor in both

-- Company B employees  
INSERT INTO company_b_employees VALUES
(2001, 'frank.wilson@companyb.com', 'Engineering', 98000.00, '2019-11-20'),
(2002, 'grace.taylor@companyb.com', 'Finance', 82000.00, '2020-03-08'),
(1003, 'carol.lee@companya.com', 'Engineering', 115000.00, '2018-01-10'), -- salary changed!
(2003, 'henry.brown@companyb.com', 'Sales', 71000.00, '2021-08-15'),
(1006, 'shared@bothcompanies.com', 'IT', 78000.00, '2017-06-01'), -- same contractor
(2004, 'iris.white@companyb.com', 'Marketing', NULL, '2023-01-05'); -- NULL salary

Exercises

Exercise 1: Find all unique email/department combinations across both companies. (Use UNION)

Exercise 2: Find employees who appear in both systems with identical records. (Use INTERSECT)

Exercise 3: Find employees in Company A who have no matching email in Company B. (Use EXCEPT on just the email column)

Exercise 4: Find employees who appear in both companies but with a different salary. This requires combining EXCEPT and INTERSECT logic — think about it carefully before writing the query.

Exercise 5: Build a full merge report showing: UNIQUE_TO_A, UNIQUE_TO_B, IDENTICAL_IN_BOTH, and CONFLICT (same email, different data). This is the full symmetric difference plus clean matches.

Solutions

-- Exercise 1: All unique email/department combinations
SELECT email, department FROM company_a_employees
UNION
SELECT email, department FROM company_b_employees;

-- Exercise 2: Truly identical records in both systems
SELECT employee_id, email, department, salary, hire_date
FROM company_a_employees
INTERSECT
SELECT employee_id, email, department, salary, hire_date
FROM company_b_employees;
-- Should return only the shared contractor (1006)

-- Exercise 3: Emails in A but not in B
SELECT email FROM company_a_employees
EXCEPT
SELECT email FROM company_b_employees;

-- Exercise 4: Same email, different salary
-- First find emails that appear in both
-- Then exclude the ones where full records match
SELECT a.email, a.salary AS salary_a, b.salary AS salary_b
FROM company_a_employees a
INNER JOIN company_b_employees b ON a.email = b.email
WHERE a.salary IS DISTINCT FROM b.salary;
-- Note: using IS DISTINCT FROM to handle NULLs correctly

-- Exercise 5: Full merge report
WITH 
all_emails_a AS (SELECT email FROM company_a_employees),
all_emails_b AS (SELECT email FROM company_b_employees),
identical_records AS (
    SELECT email FROM company_a_employees
    INTERSECT
    SELECT email FROM 
    (SELECT employee_id, email, department, salary, hire_date FROM company_a_employees
     INTERSECT
     SELECT employee_id, email, department, salary, hire_date FROM company_b_employees) ir
),
emails_in_both AS (
    SELECT email FROM all_emails_a
    INTERSECT
    SELECT email FROM all_emails_b
)

SELECT 'UNIQUE_TO_A' AS status, email FROM (
    SELECT email FROM all_emails_a
    EXCEPT
    SELECT email FROM all_emails_b
) u

UNION ALL

SELECT 'UNIQUE_TO_B', email FROM (
    SELECT email FROM all_emails_b
    EXCEPT
    SELECT email FROM all_emails_a
) u

UNION ALL

SELECT 'IDENTICAL_IN_BOTH', email FROM identical_records

UNION ALL

SELECT 'CONFLICT', email FROM (
    SELECT email FROM emails_in_both
    EXCEPT
    SELECT email FROM identical_records
) conflicts

ORDER BY status, email;

Common Mistakes & Troubleshooting

Mistake 1: Using UNION When You Mean UNION ALL

Already covered, but worth re-emphasizing: this is the most common performance bug in set operation queries. If your sets are mutually exclusive (as in the final merge report above), UNION ALL is always correct and always faster.

Mistake 2: Column Count or Type Mismatch

-- ERROR: different number of columns
SELECT customer_id, email FROM customers
UNION
SELECT customer_id FROM leads;  -- Only 1 column

-- ERROR: incompatible types (in strict databases)
SELECT customer_id, created_at FROM customers  -- created_at is TIMESTAMP
UNION
SELECT customer_id, age FROM leads;  -- age is INTEGER

Fix: Ensure structural compatibility. Cast explicitly when needed:

SELECT customer_id, created_at::TEXT AS info FROM customers
UNION
SELECT customer_id, age::TEXT FROM leads;

Mistake 3: Expecting Column Names from the Second Query

SELECT 'Source A' AS source, customer_id FROM customers_a
UNION
SELECT 'Source B' AS source_system, customer_id FROM customers_b;
-- Result column is named "source", not "source_system"
-- The first query's column names always win

Mistake 4: EXCEPT Returning No Rows When You Expect Some

This almost always comes down to data type mismatch causing implicit casting, or NULL behavior you didn't anticipate.

-- Debugging EXCEPT: if this returns 0 rows unexpectedly...
SELECT customer_id FROM table_a
EXCEPT
SELECT customer_id FROM table_b;

-- Check data types
SELECT pg_typeof(customer_id) FROM table_a LIMIT 1;
SELECT pg_typeof(customer_id) FROM table_b LIMIT 1;
-- One might be INTEGER, other might be BIGINT or VARCHAR

Mistake 5: Using EXCEPT When You Need NOT EXISTS for Partial Matching

EXCEPT compares entire rows. If you want to find records where a customer_id exists in table A but not in table B (regardless of other columns), you need to select only customer_id in both queries, or use NOT EXISTS:

-- WRONG: finds rows where ALL columns don't match
SELECT customer_id, email, status FROM table_a
EXCEPT
SELECT customer_id, email, status FROM table_b;

-- RIGHT: finds customer_ids not in table_b (other columns don't matter)
SELECT customer_id FROM table_a
EXCEPT
SELECT customer_id FROM table_b;

Mistake 6: INTERSECT ALL / EXCEPT ALL Availability

INTERSECT ALL and EXCEPT ALL (which keep duplicates the same way UNION ALL does) are supported in PostgreSQL but not in SQL Server or BigQuery. If you're writing cross-platform SQL, avoid these variants or abstract them.

Summary & Next Steps

Set operations are the relational data professional's scalpel — precise, powerful, and occasionally dangerous when misused. Let's review what you now understand at a deep level:

The mechanics: Set operations work on complete row tuples using positional column matching and IS NOT DISTINCT FROM equality (NULLs equal NULLs). Column names come from the first query. All operations except UNION ALL perform deduplication.

Performance: UNION ALL is essentially free (an append). UNION, INTERSECT, and EXCEPT all require deduplication work — sort-based or hash-based — with real memory and CPU cost. For single-key comparisons on large tables, NOT EXISTS with proper indexes often outperforms EXCEPT.

The reconciliation toolkit: UNION ALL assembles multi-source data. INTERSECT finds clean matches. EXCEPT finds gaps and discrepancies. Together with CTEs, they enable complete, auditable reconciliation reports in a single query.

Production patterns: Symmetric difference, ETL deduplication via window functions, cross-schema and cross-database comparison via FDWs — these are real patterns used in production data engineering.

Next Steps

To continue building on this foundation:

Window Functions for Advanced Deduplication: Explore ROW_NUMBER(), RANK(), and DENSE_RANK() for priority-based deduplication — the patterns that make ETL idempotent.
Incremental Change Detection with CDC: Combine EXCEPT with change data capture patterns to build efficient incremental processing pipelines.
Query Plan Analysis: Spend time in EXPLAIN ANALYZE output for set operations on your actual production datasets. The difference between a HashAggregate and a Sort+Unique plan can be significant at scale.
dbt and Incremental Models: The incremental model pattern in dbt is essentially a productionized version of the UNION ALL + window function deduplication pattern you learned here. Understanding the SQL underneath makes the dbt abstraction much more approachable.
Approximate Set Operations: For truly massive datasets (billions of rows), investigate HyperLogLog-based approximate counting and Bloom filter approaches available in extensions like pg_trgm and postgresql-hll.

The ability to write a complete, correct, performant reconciliation query from scratch — without relying on ORM magic or data pipeline tooling to do it for you — is a genuine differentiator for data professionals. You now have the foundation to do exactly that.

Mastering SQL Set Operations: UNION, INTERSECT, and EXCEPT for Complex Data Reconciliation and Deduplication

Mastering SQL Set Operations: UNION, INTERSECT, and EXCEPT for Complex Data Reconciliation and Deduplication

Introduction

Prerequisites

The Relational Foundation: What Set Operations Actually Do

UNION vs. UNION ALL: The Deduplication Decision

The Hidden Cost of UNION

When UNION (Deduplication) Is Actually the Right Answer

Deduplication Semantics: NULL Is Equal to NULL

INTERSECT: Finding What Belongs to Both Sets

The Classic Use Case: Change Detection

INTERSECT vs. INNER JOIN: A Critical Distinction

Performance Considerations for INTERSECT

EXCEPT: Finding What's Missing or Unique

Migration Completeness Verification

The Symmetric Difference: Finding Records Unique to Either Side

Advanced Pattern: Full Data Reconciliation Framework

Ordering, Limiting, and Composing Set Operations

ORDER BY with Set Operations

LIMIT and FETCH FIRST with Set Operations

Operator Precedence in Chains

Performance Deep Dive: Execution Internals

UNION ALL: The Free Lunch

UNION: The Sorting Cost

INTERSECT and EXCEPT: The Set Algebra Cost

Cross-Database and Cross-Schema Set Operations

Within the Same Database, Different Schemas

Using Foreign Data Wrappers (PostgreSQL)

Database-Specific Variations

Deduplication Strategies Beyond Simple UNION

Deduplicating Within a Single Table by Priority

Identifying True Duplicates for Investigation

Combining UNION ALL with Deduplication for ETL

Hands-On Exercise

Setup

Exercises

Solutions

Common Mistakes & Troubleshooting

Mistake 1: Using UNION When You Mean UNION ALL

Mistake 2: Column Count or Type Mismatch

Mistake 3: Expecting Column Names from the Second Query

Mistake 4: EXCEPT Returning No Rows When You Expect Some

Mistake 5: Using EXCEPT When You Need NOT EXISTS for Partial Matching

Mistake 6: INTERSECT ALL / EXCEPT ALL Availability

Summary & Next Steps

Next Steps

Related Articles

Fuzzy Matching and Deduplication in SQL: Using Similarity Functions and Blocking Strategies to Find Near-Duplicate Records

Filtering and Transforming Data with SQL String Functions: LIKE, REGEXP, SUBSTRING, and REPLACE

Query Optimization with Materialized Views: Caching Complex Aggregations and Refreshing Strategies for High-Performance Analytics