
Imagine you're the data engineer at a mid-sized e-commerce company. Every day, thousands of orders flow through your system — customers browsing, adding items to carts, checking out, and requesting refunds. Your analytics team needs this data to answer questions, but not all questions are equal. The finance team can wait until morning to see yesterday's revenue totals. The fraud detection team absolutely cannot wait eight hours to find out that someone just made 47 suspicious purchases in three minutes.
That gap between "I need this data eventually" and "I need this data right now" is the fundamental problem that batch and stream processing exist to solve. Choosing the wrong approach for a given use case isn't just an engineering inconvenience: it can mean fraud goes undetected, inventory gets oversold, or your data infrastructure burns through compute budget on work that never needed to happen in real time.
By the end of this lesson, you'll understand exactly how batch and stream processing work under the hood, why each exists, and how to make a confident, defensible decision about which pattern belongs in your pipeline. We'll work through realistic examples, look at actual code, and build the mental model you need to reason about these trade-offs on your own.
What you'll learn:
You should be comfortable with:
No prior experience with distributed systems, messaging queues, or real-time infrastructure is required.
Before we compare the two approaches, let's anchor on the problem they both solve.
Data ingestion is the process of moving raw data from wherever it originates — an API, a database, a log file, a sensor — into a system where it can be processed, stored, and analyzed. Every data pipeline starts with ingestion. The question is: when do you move the data, and how much do you move at once?
Think of it like handling mail at a busy office. You have two choices: wait until the end of the day, collect every piece of mail at once, sort it all in a single session, and distribute it the next morning (batch). Or, station someone at the mailroom door who processes and routes each letter the moment it arrives (stream).
Neither approach is universally better. They reflect different assumptions about urgency, volume, and available resources. Your job as a data engineer is to match the approach to the problem.
Batch processing means collecting data over a period of time and then processing all of it together in a single run. The "batch" is the accumulated chunk of data — it might be one hour's worth of web server logs, one day's worth of sales transactions, or one week's worth of user activity records.
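A typical batch pipeline looks like this: data accumulates at the source over the batch window, a scheduler triggers the job, the job reads and transforms the whole batch in one run, and the results are written to a destination such as a summary file or table.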
The key word is scheduled. Batch jobs run on a clock — hourly, nightly, weekly. Between runs, the pipeline is doing nothing.
Let's say you're processing daily sales data from a CSV file that your point-of-sale system writes out every night at midnight.

    import csv
    from datetime import date
    from collections import defaultdict


    def process_daily_sales(filepath):
        """
        Reads a day's worth of sales transactions and computes
        total revenue and order count per product category.
        """
        category_totals = defaultdict(lambda: {"revenue": 0.0, "orders": 0})
        with open(filepath, "r") as f:
            reader = csv.DictReader(f)
            for row in reader:
                category = row["product_category"]
                amount = float(row["order_total"])
                category_totals[category]["revenue"] += amount
                category_totals[category]["orders"] += 1
        return category_totals


    def write_summary(summary, output_path):
        """
        Writes the aggregated summary to a new CSV for the analytics team.
        """
        with open(output_path, "w", newline="") as f:
            writer = csv.DictWriter(
                f, fieldnames=["category", "total_revenue", "order_count"]
            )
            writer.writeheader()
            for category, totals in summary.items():
                writer.writerow({
                    "category": category,
                    "total_revenue": round(totals["revenue"], 2),
                    "order_count": totals["orders"]
                })


    if __name__ == "__main__":
        today = date.today().isoformat()
        input_file = f"sales_{today}.csv"
        output_file = f"sales_summary_{today}.csv"
        summary = process_daily_sales(input_file)
        write_summary(summary, output_file)
        print(f"Processed summary written to {output_file}")

This script does something simple but representative: it opens a file, reads every row, accumulates totals, and writes a summary. You'd schedule this with a cron job or an orchestration tool like Apache Airflow to run every morning at 1:00 AM.
Notice what's happening here: the processing is decoupled from the data arrival. The sales happened throughout yesterday, but we process all of them together, hours later. That delay is called latency — the time between when something happens and when your system knows about it.
Batch processing shines in a specific set of circumstances:
When freshness requirements are relaxed. Monthly financial reports, weekly marketing attribution analysis, nightly inventory reconciliation — these use cases don't require the data to be current to the minute. A 12-hour lag is perfectly acceptable.
When you're working with very large volumes. Reading and transforming terabytes of log data is often more efficient as a scheduled batch job than as a continuous stream. You can tune your compute resources specifically for the batch window, run the job on a cluster, and shut it down when done.
When the source data isn't continuous. If your data source produces files once a day (a vendor's SFTP drop, a partner's API export), batch processing is the natural fit. There's nothing to stream.
When you need complex aggregations over historical data. Computing 90-day rolling averages, rebuilding a full dimensional model, or running ML training on a historical dataset — these are inherently batch operations that need all the data together before they can produce a result.
Tip: Batch processing is also much easier to debug and re-run. If something goes wrong, you re-run the job with the same input data and, provided the job is deterministic and overwrites its output rather than appending to it, you get the same result. This idempotency (running the same job twice leaves the same output as running it once) is a significant operational advantage.
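One way to keep a re-run safe is to have each run fully replace its output file instead of appending to it. The sketch below is a minimal illustration rather than part of the lesson's pipeline; the `write_summary_atomically` and `read_text` helpers and the two-column layout are hypothetical.

```python
import csv
import os
import tempfile


def write_summary_atomically(rows, output_path):
    """Write rows to a temp file, then swap it into place.

    Re-running with the same rows leaves exactly one copy of the output,
    and a crash mid-write never leaves a half-written file at output_path.
    """
    directory = os.path.dirname(os.path.abspath(output_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["category", "total_revenue"])
            writer.writeheader()
            # Sort so the output does not depend on the order of the input rows.
            for row in sorted(rows, key=lambda r: r["category"]):
                writer.writerow(row)
        os.replace(tmp_path, output_path)  # replaces any previous output
    except BaseException:
        os.remove(tmp_path)
        raise


def read_text(path):
    """Return the file's contents exactly as written."""
    with open(path, newline="") as f:
        return f.read()
```

Calling `write_summary_atomically` twice with the same rows, in any order, should leave a byte-identical file, which is the property the tip describes.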
Stream processing treats data as an infinite, continuous flow of individual events rather than as a bounded collection to process later. Each event — a user click, a sensor reading, a payment transaction — is processed individually (or in tiny micro-batches) as soon as it arrives.
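A typical stream pipeline looks like this: a source system publishes each event to a message broker as it happens, and a stream processor reads events from the broker, processes each one on arrival, and writes the result to a downstream store or alerting system.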
The critical difference from batch: the pipeline never goes idle. It's always listening.
A message broker is worth pausing on because it's the infrastructure that makes streaming possible. Think of it as a high-speed conveyor belt. The source system (called a producer) places events on the belt. The stream processor (called a consumer) picks them up from the other end. The broker in the middle handles the transport, ordering, and durability of the messages.
This decoupling matters. The producer doesn't need to wait for the consumer to finish processing before publishing the next event. And if the consumer temporarily goes offline, the broker holds the events until it comes back, so nothing is lost as long as the outage is shorter than the broker's retention period.
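The decoupling can be imitated inside a single process, with Python's standard `queue.Queue` standing in for the broker. This is only an analogy under that assumption: a real broker also persists events to disk and serves many consumers, which an in-memory queue does not. The `producer`, `consumer`, and `run_demo` names are illustrative.

```python
import queue
import threading
import time

broker = queue.Queue()   # in-memory stand-in for a message broker
STOP = object()          # demo-only marker meaning "no more events"
processed = []


def producer(num_events):
    """Publish events without waiting for any consumer."""
    for i in range(num_events):
        broker.put({"event_id": i})


def consumer():
    """Read events in arrival order, more slowly than they were produced."""
    while True:
        event = broker.get()
        if event is STOP:
            break
        time.sleep(0.01)  # simulate slow processing
        processed.append(event["event_id"])


def run_demo(num_events=5):
    producer(num_events)          # the consumer is "offline" during publishing
    backlog = broker.qsize()      # events the stand-in broker is holding
    broker.put(STOP)
    worker = threading.Thread(target=consumer)
    worker.start()                # the consumer comes back and drains the backlog
    worker.join()
    return backlog


if __name__ == "__main__":
    print("Events held while the consumer was offline:", run_demo())
    print("Processed in arrival order:", processed)
```

The producer finishes before the consumer even starts, yet every event is still processed, in order, once the consumer comes online.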
Let's simulate a stream processor for payment transactions. We'll use Python to illustrate the concept without requiring a full Kafka installation.

    import json
    import time
    import random
    from datetime import datetime, timezone


    def generate_payment_event():
        """
        Simulates a payment event arriving from an event stream.
        In production, this would come from a Kafka consumer or similar.
        """
        return {
            "event_id": random.randint(100000, 999999),
            "user_id": random.randint(1, 5000),
            "amount": round(random.uniform(5.0, 850.0), 2),
            "merchant_id": random.randint(1, 200),
            # datetime.utcnow() is deprecated as of Python 3.12
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "country_code": random.choice(["US", "US", "US", "US", "NG", "RU", "CN"])
        }


    def is_suspicious(event):
        """
        Applies a simple fraud rule: flag transactions over $500
        from high-risk country codes.
        """
        high_risk_countries = {"NG", "RU", "CN"}
        return (
            event["amount"] > 500.0
            and event["country_code"] in high_risk_countries
        )


    def process_event(event):
        """
        Processes a single payment event in real time.
        """
        if is_suspicious(event):
            alert = {
                "alert_type": "POTENTIAL_FRAUD",
                "event_id": event["event_id"],
                "user_id": event["user_id"],
                "amount": event["amount"],
                "country_code": event["country_code"],
                "detected_at": datetime.now(timezone.utc).isoformat()
            }
            # In production: publish alert to a fraud queue or database
            print(f"[FRAUD ALERT] {json.dumps(alert)}")
        else:
            # In production: write to a real-time analytics store
            print(f"[OK] Processed payment {event['event_id']} "
                  f"for ${event['amount']} from {event['country_code']}")


    def run_stream_processor():
        """
        Continuously processes incoming payment events.
        This loop represents the always-on nature of stream processing.
        """
        print("Stream processor started. Listening for events...")
        while True:
            event = generate_payment_event()  # Simulates reading from Kafka
            process_event(event)
            time.sleep(0.5)  # Simulate events arriving every 500ms


    if __name__ == "__main__":
        run_stream_processor()

Run this script and you'll see it process events one by one, flagging suspicious transactions immediately. Notice that the fraud check happens per event, not after accumulating a thousand transactions. That's the point — a fraudulent transaction detected in 200 milliseconds can be blocked before the charge settles. The same detection running as a nightly batch job is nearly useless for fraud prevention.
Warning: The stream processing example above is a simplified simulation. In a production environment, you would use a proper message broker like Apache Kafka, Amazon Kinesis, or Google Cloud Pub/Sub. The architecture is the same (producers publish events, consumers process them), but the broker handles durability, ordering guarantees, and scaling that a while loop cannot.
Stream processing is the right pattern when:
Low latency is a hard requirement. Fraud detection, live inventory updates during a flash sale, real-time recommendation engines, monitoring dashboards — anything where a delay of more than a few seconds would break the business logic.
You need to react to individual events. If your downstream action is triggered by a specific thing happening (a user's cart total exceeding $1,000 triggers a discount offer, a sensor reading spiking above a threshold triggers an alert), you need to know about each event as it happens.
Your data has time-sensitive context. Location data, user session activity, and financial market data lose their value rapidly. Processing yesterday's GPS trace to show a "nearby deal" isn't useful.
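Reacting to individual events usually means keeping a small amount of state per key and checking it on every event. The sketch below applies this to the cart-total example; the event fields (`user_id`, `item_price`) and the `on_cart_event` function are hypothetical names chosen for illustration.

```python
from collections import defaultdict

DISCOUNT_THRESHOLD = 1000.0

cart_totals = defaultdict(float)   # running cart total per user
offered = set()                    # users who already received the offer


def on_cart_event(event):
    """Update one user's running total and return an action, if any.

    Returns "OFFER_DISCOUNT" the first time the user's total exceeds the
    threshold, otherwise None.
    """
    user = event["user_id"]
    cart_totals[user] += event["item_price"]
    if cart_totals[user] > DISCOUNT_THRESHOLD and user not in offered:
        offered.add(user)
        return "OFFER_DISCOUNT"
    return None


if __name__ == "__main__":
    events = [
        {"user_id": 7, "item_price": 600.0},
        {"user_id": 8, "item_price": 50.0},
        {"user_id": 7, "item_price": 450.0},   # pushes user 7 past 1,000
        {"user_id": 7, "item_price": 20.0},    # no second offer
    ]
    for e in events:
        print(e, "->", on_cart_event(e))
```

The decision is made inside the handler for the event that crosses the threshold, so the offer can go out while the user is still shopping.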
Understanding both approaches in isolation is useful. Understanding how they compare is where you build real decision-making capability.
Batch processing introduces latency by design, and the schedule sets how large it is. If you run hourly, an event that arrives just after a run starts waits almost a full hour for the next one, plus however long the job itself takes, before it shows up in results. Just before each run, the freshest data your consumers can see is about an hour old.
Stream processing can achieve sub-second latency. The delay is only the time it takes for an event to travel from producer to consumer and get processed — typically measured in milliseconds to seconds.
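How long a particular event waits in a batch pipeline depends on where it falls relative to the schedule. As a rough sketch under one simple model (runs fire at fixed multiples of the interval counted from midnight, each run picks up everything that has arrived so far, and the job's own run time is ignored), the wait can be computed as follows. The function names are illustrative, and other models, such as a job that only processes the previous fully closed hour, shift these bounds.

```python
from datetime import datetime, timedelta


def next_run_after(event_time, interval_minutes=60):
    """Return the first scheduled run at or after event_time."""
    midnight = event_time.replace(hour=0, minute=0, second=0, microsecond=0)
    interval = timedelta(minutes=interval_minutes)
    elapsed = event_time - midnight
    runs_needed = -(-elapsed // interval)   # ceiling division on timedeltas
    return midnight + runs_needed * interval


def batch_wait(event_time, interval_minutes=60):
    """Time an event sits unprocessed until the next scheduled run."""
    return next_run_after(event_time, interval_minutes) - event_time


if __name__ == "__main__":
    just_after_run = datetime(2024, 1, 1, 1, 0, 1)
    just_before_run = datetime(2024, 1, 1, 1, 59, 59)
    print("Arrived just after a run, waits: ", batch_wait(just_after_run))
    print("Arrived just before a run, waits:", batch_wait(just_before_run))
```

Under this model the wait ranges from almost nothing to almost the full interval, while a stream processor's delay is set by transport and processing time alone.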
Cost is where batch often wins. Processing a million records at midnight on a large cluster, then shutting the cluster down, is usually far cheaper than keeping a cluster running 24/7 to process a few events per second. Stream processing infrastructure is always running, which means you're always paying for it.
If you're ingesting and processing 10 TB of log data per day, batch is almost certainly more cost-efficient. If you're processing 500 payment events per minute and need real-time fraud detection, the cost of streaming infrastructure is justified.
Batch pipelines are fundamentally simpler. They read data, transform it, write it — linear, debuggable, re-runnable. When a batch job fails, you look at the logs, find the problem, fix it, and re-run the same job.
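Stream processing is significantly more complex. You need to reason about events that arrive late or out of order, about what happens to an in-flight event when a consumer crashes, and about duplicate or missed deliveries.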
These are solvable problems, and frameworks like Apache Flink, Apache Spark Structured Streaming, and Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) handle them well, but they add significant engineering overhead.
Batch wins again on reprocessing. Because a batch job processes a bounded, defined set of data, you can always re-run it. Made a mistake in your transformation logic? Fix it and re-run last night's batch.
In streaming, by the time you spot a mistake, the data has already arrived and been processed. To reprocess it, you need either a broker that retained the raw events (Kafka does this with configurable retention policies) or a separate copy of them. This is why many streaming architectures also store raw events in a data lake: not for real-time processing, but precisely to enable reprocessing.
Some systems need both low latency and comprehensive historical accuracy. For that, two architectural patterns are worth knowing.
The Lambda Architecture runs a batch layer and a streaming layer simultaneously. The stream layer provides low-latency approximate results. The batch layer runs periodically and overwrites the stream results with accurate, fully-processed historical data. The two layers serve results through a unified query interface. The downside: you're maintaining two codebases that need to produce the same results, which is a significant maintenance burden.
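As a toy illustration of how the layers fit together, the sketch below keeps an exact batch view up to a cutoff, an incremental view for events after the cutoff (the streaming layer, often called the speed layer), and a query function that merges the two. All names and the event shape (`ts`, `category`) are assumptions made for this example, not a reference implementation.

```python
from collections import Counter


def batch_view(events, cutoff):
    """Batch layer: exact counts for all events up to the cutoff."""
    return Counter(e["category"] for e in events if e["ts"] <= cutoff)


class SpeedLayer:
    """Streaming layer: incremental counts for events after the last batch run."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event["category"]] += 1

    def reset(self):
        # Called once a new batch view covers these events.
        self.counts.clear()


def query(batch_counts, speed_counts, category):
    """Serving layer: merge both views into one answer."""
    return batch_counts[category] + speed_counts[category]


if __name__ == "__main__":
    history = [
        {"ts": 1, "category": "books"},
        {"ts": 2, "category": "toys"},
        {"ts": 3, "category": "books"},
    ]
    batch = batch_view(history, cutoff=3)
    speed = SpeedLayer()
    speed.on_event({"ts": 4, "category": "books"})   # arrived after the batch run
    print("books so far:", query(batch, speed.counts, "books"))
```

Even in this toy version the counting logic exists twice, once in `batch_view` and once in `SpeedLayer.on_event`, which is the maintenance burden described above.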
The Kappa Architecture simplifies this by using only a streaming system, but with a message broker that retains all events (like Kafka with long retention). For reprocessing, you simply replay the retained events through the stream processor. One codebase, one paradigm. The trade-off is that your stream processing system needs to be powerful enough to handle both real-time and historical reprocessing workloads.
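The replay idea can be sketched with an in-memory list standing in for a broker topic that retains every event. The `RetainedLog` class, the event shape, and the `include_refunds` flag (which stands in for a corrected version of the processing logic) are hypothetical; a real deployment would replay from the broker's retained log instead.

```python
class RetainedLog:
    """In-memory stand-in for a broker topic that retains every event."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1   # offset of the new event

    def read_from(self, offset=0):
        return iter(self._events[offset:])


def total_revenue(events, include_refunds):
    """The single processing function, used for live and replayed events."""
    total = 0.0
    for e in events:
        if e["type"] == "payment":
            total += e["amount"]
        elif e["type"] == "refund" and include_refunds:
            total -= e["amount"]
    return total


if __name__ == "__main__":
    log = RetainedLog()
    log.append({"type": "payment", "amount": 100.0})
    log.append({"type": "payment", "amount": 25.5})
    log.append({"type": "refund", "amount": 25.5})

    # First version of the logic ignored refunds.
    print("before fix:", total_revenue(log.read_from(0), include_refunds=False))
    # After the fix, replay the same retained events from offset 0.
    print("after fix: ", total_revenue(log.read_from(0), include_refunds=True))
```

Reprocessing here is just reading from offset 0 again with the corrected logic, with no separate batch codebase involved.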
These are advanced architectural patterns, but knowing they exist helps you understand that "batch vs. stream" isn't always an either/or decision.
This exercise ties together the concepts above. You'll build two versions of the same pipeline — one batch, one stream — and experience the latency difference directly.
Scenario: You have a stream of customer support ticket submissions. The support team wants to see a real-time count of open tickets by priority (High, Medium, Low). Management wants a daily summary report.
Step 1: Create a synthetic data generator. Create a file called ticket_generator.py:

    import csv
    import random
    from datetime import datetime, timezone


    def generate_ticket():
        return {
            "ticket_id": random.randint(10000, 99999),
            "customer_id": random.randint(1, 10000),
            "priority": random.choice(["High", "High", "Medium", "Medium", "Low"]),
            # datetime.utcnow() is deprecated as of Python 3.12
            "submitted_at": datetime.now(timezone.utc).isoformat()
        }


    def write_batch_file(filepath, num_tickets=200):
        """Generates a day's worth of tickets as a CSV (simulating batch source)."""
        with open(filepath, "w", newline="") as f:
            writer = csv.DictWriter(
                f, fieldnames=["ticket_id", "customer_id", "priority", "submitted_at"]
            )
            writer.writeheader()
            for _ in range(num_tickets):
                writer.writerow(generate_ticket())
        print(f"Generated {num_tickets} tickets to {filepath}")


    if __name__ == "__main__":
        write_batch_file("tickets_today.csv")

Step 2: Write the batch processor. Create batch_tickets.py:

    import csv
    from collections import Counter


    def summarize_tickets(filepath):
        priority_counts = Counter()
        with open(filepath) as f:
            for row in csv.DictReader(f):
                priority_counts[row["priority"]] += 1
        return priority_counts


    if __name__ == "__main__":
        summary = summarize_tickets("tickets_today.csv")
        print("Daily Ticket Summary:")
        for priority, count in sorted(summary.items()):
            print(f"  {priority}: {count} tickets")

Step 3: Write the stream processor. Create stream_tickets.py:

    import random
    import time
    from collections import Counter
    from datetime import datetime, timezone

    priority_counts = Counter()


    def process_ticket_event(ticket):
        priority_counts[ticket["priority"]] += 1
        if ticket["priority"] == "High":
            print(f"[{datetime.now(timezone.utc).isoformat()}] HIGH PRIORITY ticket "
                  f"#{ticket['ticket_id']} received! Current High count: "
                  f"{priority_counts['High']}")
        # Print dashboard every 10 events
        total = sum(priority_counts.values())
        if total % 10 == 0:
            print(f"\n--- Live Dashboard (Total: {total}) ---")
            for p in ["High", "Medium", "Low"]:
                print(f"  {p}: {priority_counts[p]}")
            print()


    def run():
        print("Stream processor running...")
        while True:
            ticket = {
                "ticket_id": random.randint(10000, 99999),
                "priority": random.choice(["High", "High", "Medium", "Medium", "Low"])
            }
            process_ticket_event(ticket)
            time.sleep(0.3)


    if __name__ == "__main__":
        run()

Run ticket_generator.py first, then batch_tickets.py to see the batch result. Then run stream_tickets.py separately (stop it with Ctrl+C) to see how the live counter updates in real time. The contrast in feedback speed is the intuition this exercise is meant to build.
Choosing streaming because it sounds more impressive. Stream processing is genuinely harder to build and operate. If your business question is "what was our revenue last month?", you do not need a streaming pipeline. Choosing streaming for a use case that doesn't need it adds complexity without benefit.
Assuming batch means "slow." A well-optimized batch job running on a modern data warehouse like BigQuery or Snowflake can process terabytes of data and return results in minutes. For many analytical workloads, that's more than fast enough.
Not accounting for late-arriving data in stream pipelines. Events don't always arrive in the order they were generated. A payment event might be delayed by a flaky network and arrive 45 seconds after a related refund event. Stream processing frameworks handle this with watermarks: a running estimate of how far event time has progressed, which tells the framework when a window can be finalized, usually paired with a configurable tolerance for late events. Ignoring this leads to incorrect aggregations.
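To make the watermark idea concrete, here is a deliberately simplified sketch that counts events in fixed-size windows. It treats the watermark as the largest event time seen so far minus a fixed `max_delay`, and finalizes a window once the watermark passes the window's end. The class and its parameters are hypothetical and do not mirror any particular framework's API; real frameworks offer richer options for late events.

```python
class TumblingWindowCounter:
    """Counts events per fixed-size window, closing windows by watermark."""

    def __init__(self, window_size, max_delay):
        self.window_size = window_size
        self.max_delay = max_delay      # how far behind the newest event we wait
        self.max_event_time = 0
        self.open_windows = {}          # window start -> running count
        self.finalized = {}             # window start -> final count
        self.dropped = 0                # events that arrived after their window closed

    def watermark(self):
        return self.max_event_time - self.max_delay

    def on_event(self, event_time):
        start = (event_time // self.window_size) * self.window_size
        if start + self.window_size <= self.watermark():
            self.dropped += 1           # too late: this window is already final
            return
        self.open_windows[start] = self.open_windows.get(start, 0) + 1
        self.max_event_time = max(self.max_event_time, event_time)
        # Finalize every open window whose end the watermark has now passed.
        for s in sorted(self.open_windows):
            if s + self.window_size <= self.watermark():
                self.finalized[s] = self.open_windows.pop(s)


if __name__ == "__main__":
    counter = TumblingWindowCounter(window_size=10, max_delay=5)
    for t in [1, 4, 12, 8, 17, 9, 26]:   # event times, in arrival order
        counter.on_event(t)
    print("finalized:", counter.finalized)
    print("dropped:", counter.dropped)
```

In this run the event with time 8 arrives out of order but within the tolerance, so it is still counted in the first window. The event with time 9 arrives after that window was finalized and is dropped, which is the kind of case a pipeline has to plan for.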
Forgetting that stream processors can fail. If your stream consumer crashes mid-processing, what happens to the event it was handling? You need to understand your broker's delivery guarantees (at-most-once, at-least-once, exactly-once) and design your consumer to handle duplicates or missed events accordingly.
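With at-least-once delivery, the same event can reach the consumer more than once, for example when a consumer crashes after processing but before acknowledging. A common defence is to make the consumer idempotent by remembering which event IDs it has already applied. The sketch below keeps that set in memory for brevity, which is an assumption of the example; in production it would need durable storage.

```python
class IdempotentConsumer:
    """Applies each event's effect once, even if the broker redelivers it."""

    def __init__(self):
        self.seen_ids = set()    # in production this would be durable storage
        self.total = 0.0

    def handle(self, event):
        """Return True if the event was applied, False if it was a duplicate."""
        if event["event_id"] in self.seen_ids:
            return False
        self.total += event["amount"]
        self.seen_ids.add(event["event_id"])
        return True


if __name__ == "__main__":
    consumer = IdempotentConsumer()
    deliveries = [
        {"event_id": "a", "amount": 10.0},
        {"event_id": "b", "amount": 5.0},
        {"event_id": "a", "amount": 10.0},   # redelivered after a simulated crash
    ]
    for event in deliveries:
        print(event["event_id"], "applied:", consumer.handle(event))
    print("total:", consumer.total)
```

Without the `seen_ids` check, the redelivered event would be counted twice and the total would be wrong.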
Building batch pipelines that accumulate too much data per run. If your nightly batch job starts taking six hours to run, it'll still be running when the next one is supposed to start. Design batch jobs with a clear understanding of data volume growth, and partition your loads when necessary.
You now have a solid mental model for one of the most foundational decisions in data engineering. Let's anchor the key ideas:
Next steps to deepen your understanding:
Learning Path: Data Pipeline Fundamentals