Batch vs. Stream Processing: Choosing the Right Ingestion Pattern for Your Pipeline

Introduction

Imagine you're the data engineer at a mid-sized e-commerce company. Every day, thousands of orders flow through your system — customers browsing, adding items to carts, checking out, and requesting refunds. Your analytics team needs this data to answer questions, but not all questions are equal. The finance team can wait until morning to see yesterday's revenue totals. The fraud detection team absolutely cannot wait eight hours to find out that someone just made 47 suspicious purchases in three minutes.

That gap — between "I need this data eventually" and "I need this data right now" — is the fundamental problem that batch and stream processing exist to solve. And choosing the wrong approach for a given use case isn't just an engineering inconvenience; it can mean fraud goes undetected, inventory goes oversold, or your data infrastructure burns through computing budget on work that didn't need to happen in real-time.

By the end of this lesson, you'll understand exactly how batch and stream processing work under the hood, why each exists, and how to make a confident, defensible decision about which pattern belongs in your pipeline. We'll work through realistic examples, look at actual code, and build the mental model you need to reason about these trade-offs on your own.

What you'll learn:

What batch processing is, how it works, and when it's the right choice
What stream processing is, how it differs from batch, and when to use it
The key trade-offs between the two patterns across latency, cost, and complexity
How to evaluate a real-world use case and choose the right ingestion pattern
How to implement a basic version of each approach in Python

Prerequisites

You should be comfortable with:

Basic Python (reading and writing functions, loops, and simple file I/O)
A conceptual understanding of what a data pipeline does (data moves from source to destination)
Familiarity with the idea of a database or data warehouse (where data is stored for analysis)

No prior experience with distributed systems, messaging queues, or real-time infrastructure is required.

What Is Data Ingestion, and Why Does the Pattern Matter?

Before we compare the two approaches, let's anchor on the problem they both solve.

Data ingestion is the process of moving raw data from wherever it originates — an API, a database, a log file, a sensor — into a system where it can be processed, stored, and analyzed. Every data pipeline starts with ingestion. The question is: when do you move the data, and how much do you move at once?

Think of it like handling mail at a busy office. You have two choices: wait until the end of the day, collect every piece of mail at once, sort it all in a single session, and distribute it the next morning (batch). Or, station someone at the mailroom door who processes and routes each letter the moment it arrives (stream).

Neither approach is universally better. They reflect different assumptions about urgency, volume, and available resources. Your job as a data engineer is to match the approach to the problem.

Batch Processing: Moving Data in Chunks

How Batch Processing Works

Batch processing means collecting data over a period of time and then processing all of it together in a single run. The "batch" is the accumulated chunk of data — it might be one hour's worth of web server logs, one day's worth of sales transactions, or one week's worth of user activity records.

A typical batch pipeline looks like this:

Data accumulates in a source system (a database, a flat file, an API)
At a scheduled time, a job kicks off
The job reads all the accumulated data since the last run
It transforms, cleans, or aggregates the data
It writes the results to a destination (a data warehouse, another database, a report)
The job completes and goes idle until the next scheduled run

The key word is scheduled. Batch jobs run on a clock — hourly, nightly, weekly. Between runs, the pipeline is doing nothing.

A Simple Batch Processing Example

Let's say you're processing daily sales data from a CSV file that your point-of-sale system writes out every night at midnight.

import csv
from datetime import date
from collections import defaultdict

def process_daily_sales(filepath):
    """
    Reads a day's worth of sales transactions and computes
    total revenue and order count per product category.
    """
    category_totals = defaultdict(lambda: {"revenue": 0.0, "orders": 0})

    with open(filepath, "r") as f:
        reader = csv.DictReader(f)
        for row in reader:
            category = row["product_category"]
            amount = float(row["order_total"])
            
            category_totals[category]["revenue"] += amount
            category_totals[category]["orders"] += 1

    return category_totals

def write_summary(summary, output_path):
    """
    Writes the aggregated summary to a new CSV for the analytics team.
    """
    with open(output_path, "w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["category", "total_revenue", "order_count"]
        )
        writer.writeheader()
        for category, totals in summary.items():
            writer.writerow({
                "category": category,
                "total_revenue": round(totals["revenue"], 2),
                "order_count": totals["orders"]
            })

if __name__ == "__main__":
    today = date.today().isoformat()
    input_file = f"sales_{today}.csv"
    output_file = f"sales_summary_{today}.csv"

    summary = process_daily_sales(input_file)
    write_summary(summary, output_file)
    print(f"Processed summary written to {output_file}")

This script does something simple but representative: it opens a file, reads every row, accumulates totals, and writes a summary. You'd schedule this with a cron job or an orchestration tool like Apache Airflow to run every morning at 1:00 AM.

Notice what's happening here: the processing is decoupled from the data arrival. The sales happened throughout yesterday, but we process all of them together, hours later. That delay is called latency — the time between when something happens and when your system knows about it.

When Batch Is the Right Choice

Batch processing shines in a specific set of circumstances:

When freshness requirements are relaxed. Monthly financial reports, weekly marketing attribution analysis, nightly inventory reconciliation — these use cases don't require the data to be current to the minute. A 12-hour lag is perfectly acceptable.

When you're working with very large volumes. Reading and transforming terabytes of log data is often more efficient as a scheduled batch job than as a continuous stream. You can tune your compute resources specifically for the batch window, run the job on a cluster, and shut it down when done.

When the source data isn't continuous. If your data source produces files once a day (a vendor's SFTP drop, a partner's API export), batch processing is the natural fit. There's nothing to stream.

When you need complex aggregations over historical data. Computing 90-day rolling averages, rebuilding a full dimensional model, or running ML training on a historical dataset — these are inherently batch operations that need all the data together before they can produce a result.

Tip: Batch processing is also much easier to debug and re-run. If something goes wrong, you re-run the job with the same input data and get the same result. This idempotency (the same job run twice produces the same output) is a significant operational advantage.

Stream Processing: Reacting to Data as It Arrives

How Stream Processing Works

Stream processing treats data as an infinite, continuous flow of individual events rather than as a bounded collection to process later. Each event — a user click, a sensor reading, a payment transaction — is processed individually (or in tiny micro-batches) as soon as it arrives.

A typical stream pipeline looks like this:

An event occurs in the source system (a user clicks "Buy Now")
The source system immediately publishes that event to a message broker — a system designed to receive, hold briefly, and forward messages (Apache Kafka and Amazon Kinesis are common examples)
A stream processor — a constantly running program — reads that event from the broker within milliseconds
It transforms or enriches the event
It writes the result to a destination in near real-time
The processor stays running, waiting for the next event

The critical difference from batch: the pipeline never goes idle. It's always listening.

Understanding Message Brokers

A message broker is worth pausing on because it's the infrastructure that makes streaming possible. Think of it as a high-speed conveyor belt. The source system (called a producer) places events on the belt. The stream processor (called a consumer) picks them up from the other end. The broker in the middle handles the transport, ordering, and durability of the messages.

This decoupling matters. The producer doesn't need to wait for the consumer to finish processing before publishing the next event. And if the consumer temporarily goes offline, the broker holds the events until it comes back — nothing is lost.

A Simple Stream Processing Example

Let's simulate a stream processor for payment transactions. We'll use Python to illustrate the concept without requiring a full Kafka installation.

import json
import time
import random
from datetime import datetime

def generate_payment_event():
    """
    Simulates a payment event arriving from an event stream.
    In production, this would come from a Kafka consumer or similar.
    """
    return {
        "event_id": random.randint(100000, 999999),
        "user_id": random.randint(1, 5000),
        "amount": round(random.uniform(5.0, 850.0), 2),
        "merchant_id": random.randint(1, 200),
        "timestamp": datetime.utcnow().isoformat(),
        "country_code": random.choice(["US", "US", "US", "US", "NG", "RU", "CN"])
    }

def is_suspicious(event):
    """
    Applies a simple fraud rule: flag transactions over $500
    from high-risk country codes.
    """
    high_risk_countries = {"NG", "RU", "CN"}
    return (
        event["amount"] > 500.0
        and event["country_code"] in high_risk_countries
    )

def process_event(event):
    """
    Processes a single payment event in real time.
    """
    if is_suspicious(event):
        alert = {
            "alert_type": "POTENTIAL_FRAUD",
            "event_id": event["event_id"],
            "user_id": event["user_id"],
            "amount": event["amount"],
            "country_code": event["country_code"],
            "detected_at": datetime.utcnow().isoformat()
        }
        # In production: publish alert to a fraud queue or database
        print(f"[FRAUD ALERT] {json.dumps(alert)}")
    else:
        # In production: write to a real-time analytics store
        print(f"[OK] Processed payment {event['event_id']} "
              f"for ${event['amount']} from {event['country_code']}")

def run_stream_processor():
    """
    Continuously processes incoming payment events.
    This loop represents the always-on nature of stream processing.
    """
    print("Stream processor started. Listening for events...")
    while True:
        event = generate_payment_event()  # Simulates reading from Kafka
        process_event(event)
        time.sleep(0.5)  # Simulate events arriving every 500ms

if __name__ == "__main__":
    run_stream_processor()

Run this script and you'll see it process events one by one, flagging suspicious transactions immediately. Notice that the fraud check happens per event, not after accumulating a thousand transactions. That's the point — a fraudulent transaction detected in 200 milliseconds can be blocked before the charge settles. The same detection running as a nightly batch job is nearly useless for fraud prevention.

Warning: The stream processing example above is a simplified simulation. In a production environment, you would use a proper message broker like Apache Kafka, Amazon Kinesis, or Google Pub/Sub. The architecture is the same — producers publish events, consumers process them — but the broker handles durability, ordering guarantees, and scaling that a while loop cannot.

When Stream Processing Is the Right Choice

Stream processing is the right pattern when:

Low latency is a hard requirement. Fraud detection, live inventory updates during a flash sale, real-time recommendation engines, monitoring dashboards — anything where a delay of more than a few seconds would break the business logic.

You need to react to individual events. If your downstream action is triggered by a specific thing happening (a user's cart total exceeding $1,000 triggers a discount offer, a sensor reading spiking above a threshold triggers an alert), you need to know about each event as it happens.

Your data has time-sensitive context. Location data, user session activity, and financial market data lose their value rapidly. Processing yesterday's GPS trace to show a "nearby deal" isn't useful.

The Core Trade-Offs: A Head-to-Head Comparison

Understanding both approaches in isolation is useful. Understanding how they compare is where you build real decision-making capability.

Latency

Batch processing introduces latency by design. The minimum delay is the length of the batch window — if you run hourly, your data is always at least one hour old, and potentially up to two hours old for data that arrived just after the last run.

Stream processing can achieve sub-second latency. The delay is only the time it takes for an event to travel from producer to consumer and get processed — typically measured in milliseconds to seconds.

Throughput and Cost

This is where batch often wins. Processing a million records at midnight on a large cluster, then shutting the cluster down, is dramatically cheaper than maintaining a cluster that runs 24/7 to process a few events per second. Stream processing infrastructure is always running, which means you're always paying for it.

If you're ingesting and processing 10TB of log data per day, batch is almost certainly more cost-efficient. If you're processing 500 payment events per minute and need real-time fraud detection, the cost of streaming infrastructure is justified.

Complexity

Batch pipelines are fundamentally simpler. They read data, transform it, write it — linear, debuggable, re-runnable. When a batch job fails, you look at the logs, find the problem, fix it, and re-run the same job.

Stream processing is significantly more complex. You need to reason about:

Event ordering — events can arrive out of order; how do you handle a late-arriving event that should have been processed 30 seconds ago?
Exactly-once processing — what happens if your consumer processes an event and then crashes before acknowledging it? Do you process it twice?
Stateful aggregations — if you want a rolling 5-minute count of transactions, you need to maintain state across events, which requires careful management

These are solvable problems — frameworks like Apache Flink, Apache Spark Structured Streaming, and AWS Kinesis Data Analytics handle them well — but they add significant engineering overhead.

Reprocessing and Recovery

Batch wins again here. Because a batch job processes a bounded, defined set of data, you can always re-run it. Made a mistake in your transformation logic? Fix it and re-run last night's batch.

In streaming, the data arrived and was processed. If you need to reprocess it, you need either a mechanism that retained the raw events (Kafka does this with configurable retention policies), or you made a backup. This is why many streaming architectures also store raw events in a data lake — not for real-time processing, but precisely to enable reprocessing.

Lambda and Kappa: When One Pattern Isn't Enough

Some systems need both low latency and comprehensive historical accuracy. For that, two architectural patterns are worth knowing.

The Lambda Architecture runs a batch layer and a streaming layer simultaneously. The stream layer provides low-latency approximate results. The batch layer runs periodically and overwrites the stream results with accurate, fully-processed historical data. The two layers serve results through a unified query interface. The downside: you're maintaining two codebases that need to produce the same results, which is a significant maintenance burden.

The Kappa Architecture simplifies this by using only a streaming system, but with a message broker that retains all events (like Kafka with long retention). For reprocessing, you simply replay the retained events through the stream processor. One codebase, one paradigm. The trade-off is that your stream processing system needs to be powerful enough to handle both real-time and historical reprocessing workloads.

These are advanced architectural patterns, but knowing they exist helps you understand that "batch vs. stream" isn't always an either/or decision.

Hands-On Exercise

This exercise ties together the concepts above. You'll build two versions of the same pipeline — one batch, one stream — and experience the latency difference directly.

Scenario: You have a stream of customer support ticket submissions. The support team wants to see a real-time count of open tickets by priority (High, Medium, Low). Management wants a daily summary report.

Step 1: Create a synthetic data generator. Create a file called ticket_generator.py:

import csv
import random
import time
from datetime import datetime

def generate_ticket():
    return {
        "ticket_id": random.randint(10000, 99999),
        "customer_id": random.randint(1, 10000),
        "priority": random.choice(["High", "High", "Medium", "Medium", "Low"]),
        "submitted_at": datetime.utcnow().isoformat()
    }

def write_batch_file(filepath, num_tickets=200):
    """Generates a day's worth of tickets as a CSV (simulating batch source)."""
    with open(filepath, "w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["ticket_id", "customer_id", "priority", "submitted_at"]
        )
        writer.writeheader()
        for _ in range(num_tickets):
            writer.writerow(generate_ticket())
    print(f"Generated {num_tickets} tickets to {filepath}")

if __name__ == "__main__":
    write_batch_file("tickets_today.csv")

Step 2: Write the batch processor. Create batch_tickets.py:

import csv
from collections import Counter

def summarize_tickets(filepath):
    priority_counts = Counter()
    with open(filepath) as f:
        for row in csv.DictReader(f):
            priority_counts[row["priority"]] += 1
    return priority_counts

summary = summarize_tickets("tickets_today.csv")
print("Daily Ticket Summary:")
for priority, count in sorted(summary.items()):
    print(f"  {priority}: {count} tickets")

Step 3: Write the stream processor. Create stream_tickets.py:

import random
import time
from collections import Counter
from datetime import datetime

priority_counts = Counter()

def process_ticket_event(ticket):
    priority_counts[ticket["priority"]] += 1
    if ticket["priority"] == "High":
        print(f"[{datetime.utcnow().isoformat()}] HIGH PRIORITY ticket "
              f"#{ticket['ticket_id']} received! Current High count: "
              f"{priority_counts['High']}")
    
    # Print dashboard every 10 events
    total = sum(priority_counts.values())
    if total % 10 == 0:
        print(f"\n--- Live Dashboard (Total: {total}) ---")
        for p in ["High", "Medium", "Low"]:
            print(f"  {p}: {priority_counts[p]}")
        print()

def run():
    print("Stream processor running...")
    while True:
        ticket = {
            "ticket_id": random.randint(10000, 99999),
            "priority": random.choice(["High", "High", "Medium", "Medium", "Low"])
        }
        process_ticket_event(ticket)
        time.sleep(0.3)

run()

Run ticket_generator.py first, then batch_tickets.py to see the batch result. Then run stream_tickets.py separately to see how the live counter updates in real-time. The contrast in feedback speed is the experience you're building intuition around.

Common Mistakes & Troubleshooting

Choosing streaming because it sounds more impressive. Stream processing is genuinely harder to build and operate. If your business question is "what was our revenue last month?", you do not need a streaming pipeline. Choosing streaming for a use case that doesn't need it adds complexity without benefit.

Assuming batch means "slow." A well-optimized batch job running on a modern data warehouse like BigQuery or Snowflake can process terabytes of data and return results in minutes. For many analytical workloads, that's more than fast enough.

Not accounting for late-arriving data in stream pipelines. Events don't always arrive in the order they were generated. A payment event might be delayed by a flaky network and arrive 45 seconds after a related refund event. Stream processing frameworks handle this with watermarks — a configurable tolerance for how long to wait for late events before finalizing a window. Ignoring this leads to incorrect aggregations.

Forgetting that stream processors can fail. If your stream consumer crashes mid-processing, what happens to the event it was handling? You need to understand your broker's delivery guarantees (at-most-once, at-least-once, exactly-once) and design your consumer to handle duplicates or missed events accordingly.

Building batch pipelines that accumulate too much data per run. If your nightly batch job starts taking six hours to run, it'll still be running when the next one is supposed to start. Design batch jobs with a clear understanding of data volume growth, and partition your loads when necessary.

Summary & Next Steps

You now have a solid mental model for one of the most foundational decisions in data engineering. Let's anchor the key ideas:

Batch processing collects data over time and processes it together in scheduled runs. It's simpler, cheaper to operate, and ideal when latency requirements are relaxed or data arrives in natural batches.
Stream processing processes each event as it arrives, enabling sub-second latency. It's essential for fraud detection, real-time dashboards, and any use case where delayed data has no value.
The primary trade-offs are latency vs. cost and complexity. Streaming gives you speed; batch gives you simplicity and often lower cost.
The right choice depends on the business requirement, not the technology. Always start by asking: how old can this data be before it becomes useless?

Next steps to deepen your understanding:

Explore Apache Kafka — the most widely used message broker for stream processing. Their official quickstart documentation will walk you through running a local broker and producing/consuming real messages.
Learn Apache Airflow — the standard orchestration tool for scheduling and managing batch pipelines. Understanding how to define DAGs (directed acyclic graphs) will make you productive immediately in most data engineering environments.
Experiment with Apache Spark — Spark handles both large-scale batch processing and Structured Streaming, making it a versatile tool worth understanding early.
Study the next lesson in this path: Schema Design for Scalable Pipelines, where you'll learn how to model the data that flows through the pipelines you're now equipped to build.

Batch vs. Stream Processing: Choosing the Right Ingestion Pattern for Your Pipeline

Batch vs. Stream Processing: Choosing the Right Ingestion Pattern for Your Pipeline

Introduction

Prerequisites

What Is Data Ingestion, and Why Does the Pattern Matter?

Batch Processing: Moving Data in Chunks

How Batch Processing Works

A Simple Batch Processing Example

When Batch Is the Right Choice

Stream Processing: Reacting to Data as It Arrives

How Stream Processing Works

Understanding Message Brokers

A Simple Stream Processing Example

When Stream Processing Is the Right Choice

The Core Trade-Offs: A Head-to-Head Comparison

Latency

Throughput and Cost

Complexity

Reprocessing and Recovery

Lambda and Kappa: When One Pattern Isn't Enough

Hands-On Exercise

Common Mistakes & Troubleshooting

Summary & Next Steps

Related Articles

Multi-Hop Data Lineage Tracking Across the Modern Data Stack: Connecting Ingestion, Transformation, and Consumption with OpenLineage and Marquez

Multi-Tenant Data Pipeline Architecture: Isolating, Routing, and Scaling Pipelines Across Customers and Teams

Monitoring dbt Pipeline Failures in Production: Alerting, Logging, and Root Cause Analysis with Elementary and re_data

Related Articles

Data Engineering🔥 Expert
Multi-Hop Data Lineage Tracking Across the Modern Data Stack: Connecting Ingestion, Transformation, and Consumption with OpenLineage and Marquez
29 min

Data Engineering🔥 Expert
Multi-Tenant Data Pipeline Architecture: Isolating, Routing, and Scaling Pipelines Across Customers and Teams
26 min

Data Engineering⚡ Practitioner
Monitoring dbt Pipeline Failures in Production: Alerting, Logging, and Root Cause Analysis with Elementary and re_data
22 min