Imagine you're a data engineer at a retail company. Every morning, the sales dashboard shows yesterday's revenue numbers, and every morning the VP of Sales refreshes it over her coffee. One Tuesday, the numbers are wrong — the dashboard shows $0 in sales for the entire Northeast region. Your phone starts ringing. You open up your pipeline monitoring tool and stare at a tangle of tasks, jobs, and tables with no clear sense of where the problem actually lives. Did the source database fail to export? Did a transformation step silently drop rows? Did someone rename a column in a table that three other tables depend on?
This is the problem that data pipeline dependencies were invented to solve. When you understand how data flows through a system — which processes must run before others, which outputs feed into which inputs, and where data came from at any given point — you can answer those panicked Tuesday morning questions in minutes instead of hours. You can also prevent the problem in the first place by designing pipelines that are transparent, auditable, and robust.
By the end of this lesson, you'll have a solid mental model of how data pipelines are connected, how to trace problems both forward and backward through a system, and how to implement basic lineage tracking in your own work.
What you'll learn:
This lesson assumes you're comfortable with the idea of a data pipeline — that data moves from a source (like a database or API) through transformations and into a destination (like a data warehouse or dashboard). You don't need experience with any specific tool. Familiarity with basic SQL and the concept of a Python script will help you follow the code examples, but the concepts are universal.
Before we can talk about upstream and downstream, we need to be precise about what a dependency actually means in a data pipeline context.
A dependency is a relationship between two tasks where one task requires the output of another task in order to run correctly. If Task B cannot produce accurate results without Task A having completed first, then Task B depends on Task A.
Think about cooking a meal. You can't plate the pasta before you've boiled it. You can't boil it before you've filled the pot with water. Each step depends on the previous one. In data pipelines, this sequencing is just as real — and just as consequential when it breaks.
Here's a simple concrete example. Suppose you work on an e-commerce platform and you have three jobs:
raw_orders.raw_orders by cleaning nulls and standardizing currency codes, and load the results into clean_orders.clean_orders with a customers table to produce a daily_revenue_by_customer summary table.Job C depends on Job B. Job B depends on Job A. If Job A fails or runs late, the failure cascades forward through B and then C. If you didn't know about these dependencies, you might look at C's error message, assume the problem is in C's SQL, and spend an hour debugging code that was perfectly fine.
The terms upstream and downstream describe the direction of data flow relative to any given point in a pipeline.
Using the cooking analogy: if you're the person making the pasta sauce, the farmer who grew the tomatoes is upstream of you, and the person plating the dish is downstream of you.
In our e-commerce example:
raw_orders are upstream. Job C and daily_revenue_by_customer are downstream.raw_orders, and clean_orders — is upstream. Nothing is downstream of C in this pipeline (C is the final output).This language matters because it gives you directional vocabulary when you're debugging or communicating with teammates. "The issue is upstream of the revenue table" tells someone exactly where to look. "Something broke" does not.
When a downstream table has bad data, you have two directions to investigate:
This two-directional thinking is the foundation of effective incident response. Root cause analysis goes upstream; impact assessment goes downstream.
Most modern data pipeline tools represent dependencies using a structure called a DAG, which stands for Directed Acyclic Graph.
Let's unpack that term:
You don't need to be a graph theorist to work with DAGs. The practical implication is simple: a DAG is a flowchart where arrows show which tasks must complete before others can start, and no task ever circles back to require itself.
Here's a slightly more complex real-world example. Suppose you're building a daily analytics pipeline for a subscription software company:
[extract_subscriptions] [extract_support_tickets]
| |
v v
[clean_subscriptions] [clean_support_tickets]
| |
+----------+---------------+
|
v
[build_customer_health_score]
|
v
[load_to_dashboard]
In this DAG:
build_customer_health_score has two upstream dependencies: both clean_subscriptions and clean_support_tickets must succeed before it can run.load_to_dashboard is the terminal node — nothing is downstream of it in this pipeline.Recognizing independent branches is important for efficiency. If you force everything to run sequentially when some tasks have no dependency relationship, you're leaving performance on the table.
Data lineage is the historical record of where data came from, how it was transformed, and where it ended up. If dependency tracking answers "which tasks must run in what order," lineage tracking answers "for any given piece of data in my system, what is its complete origin story?"
The distinction is subtle but important:
| Dependency Tracking | Data Lineage |
|---|---|
| Operational and structural | Historical and auditable |
| Answers "what runs before what?" | Answers "where did this data come from?" |
| Used for scheduling and orchestration | Used for debugging, compliance, and trust |
| Describes the pipeline architecture | Describes the data's journey |
Think of it this way. Dependency tracking is the pipeline blueprint — the architect's diagram. Data lineage is more like a chain of custody log — a record for any specific batch of data that says: "This row in daily_revenue_by_customer came from a join of clean_orders (which was derived from raw_orders extracted from the production DB at 3:14 AM on Tuesday) and customers (last refreshed Monday at 11 PM)."
Debugging data quality issues. If the revenue numbers are wrong, lineage tells you exactly which source records contributed to that calculation and when they were ingested. You can trace the problem to its origin rather than guessing.
Impact analysis before making changes. If you want to rename a column in clean_orders or change a calculation, lineage tells you every downstream asset that will be affected. This prevents the dreaded "silent breakage" where you change something and don't realize you've broken five dashboards and two ML feature stores.
Regulatory compliance. In industries like finance and healthcare, you may be legally required to demonstrate that specific data points came from approved, audited sources. Lineage tracking is how you prove it.
Building trust in data. When stakeholders ask "where does this number come from?", a well-maintained lineage system lets you answer with precision and confidence rather than vague hand-waving.
Let's make this concrete with code. While production pipelines typically use orchestration tools like Apache Airflow, Prefect, or dbt, you can implement basic dependency logic in plain Python to understand the underlying mechanics.
Here's a simple pipeline that enforces dependencies explicitly:
import pandas as pd
from datetime import date
def extract_raw_orders(db_connection):
"""
Upstream: Production database
Downstream: clean_orders transformation
"""
query = "SELECT order_id, customer_id, amount, currency, created_at FROM orders WHERE DATE(created_at) = CURDATE()"
raw_orders = pd.read_sql(query, db_connection)
raw_orders.to_parquet(f"data/raw_orders_{date.today()}.parquet")
print(f"Extracted {len(raw_orders)} raw orders.")
return raw_orders
def clean_orders(raw_orders: pd.DataFrame):
"""
Upstream: extract_raw_orders
Downstream: build_daily_revenue
"""
if raw_orders is None or raw_orders.empty:
raise ValueError("clean_orders received empty input — upstream extraction may have failed.")
cleaned = raw_orders.dropna(subset=["order_id", "customer_id", "amount"])
cleaned = cleaned[cleaned["amount"] > 0]
# Normalize currencies to USD (simplified)
exchange_rates = {"USD": 1.0, "EUR": 1.08, "GBP": 1.27}
cleaned["amount_usd"] = cleaned.apply(
lambda row: row["amount"] * exchange_rates.get(row["currency"], 1.0),
axis=1
)
return cleaned
def build_daily_revenue(clean_orders_df: pd.DataFrame):
"""
Upstream: clean_orders
Downstream: Dashboard load, revenue reports
"""
if clean_orders_df is None or clean_orders_df.empty:
raise ValueError("build_daily_revenue received empty input — check clean_orders step.")
revenue = (
clean_orders_df
.groupby("customer_id")["amount_usd"]
.sum()
.reset_index()
.rename(columns={"amount_usd": "total_revenue_usd"})
)
return revenue
def run_pipeline(db_connection):
"""
Orchestrates the pipeline in dependency order.
Each step explicitly receives the output of its upstream dependency.
"""
print("Step 1: Extracting raw orders...")
raw = extract_raw_orders(db_connection)
print("Step 2: Cleaning orders...")
cleaned = clean_orders(raw)
print("Step 3: Building daily revenue...")
revenue = build_daily_revenue(cleaned)
print(f"Pipeline complete. Revenue computed for {len(revenue)} customers.")
return revenue
Notice what this code makes explicit that a poorly structured pipeline often obscures:
clean_orders receives empty data, it raises an error with a message that points upstream — that's where you should look first.run_pipeline function controls execution order. The dependency chain is visible just by reading top to bottom.Tip: In real pipelines, you'd use an orchestration tool to manage this execution order rather than a Python function. But the underlying logic — "don't run Step 2 until Step 1 succeeds" — is exactly the same. Understanding it in plain code first makes orchestration tools far easier to learn.
Full lineage tracking in production uses dedicated tools (we'll cover those shortly). But you can implement lightweight lineage tracking manually to understand the concept.
The core idea is: every time you produce a dataset, record metadata about how it was created.
import json
from datetime import datetime
def record_lineage(output_name: str, source_tables: list, transformation: str, row_count: int):
"""
Records basic lineage metadata for an output dataset.
In production, this would write to a metadata store (database, catalog, etc.)
"""
lineage_record = {
"output": output_name,
"created_at": datetime.utcnow().isoformat(),
"source_tables": source_tables,
"transformation_description": transformation,
"row_count": row_count
}
# In production: insert into a metadata database
# For demonstration: write to a local JSON log
with open("lineage_log.json", "a") as f:
f.write(json.dumps(lineage_record) + "\n")
print(f"Lineage recorded for {output_name}")
# Example usage after the clean_orders step:
record_lineage(
output_name="clean_orders_2024_01_15",
source_tables=["raw_orders_2024_01_15"],
transformation="Removed null order_id/customer_id/amount rows, filtered negative amounts, converted to USD",
row_count=len(cleaned)
)
After several pipeline runs, your lineage_log.json becomes a queryable audit trail. You can answer questions like: "How many rows were in clean_orders last Tuesday?" or "When was daily_revenue last rebuilt, and which source tables did it use?"
In professional data environments, you won't manually track lineage in JSON files. Here's a brief landscape of purpose-built tools so you know what you're building toward:
dbt (data build tool) is one of the most widely used tools for managing SQL transformations in a data warehouse. It automatically builds a DAG from your models by parsing SQL ref() statements — when you write FROM {{ ref('clean_orders') }}, dbt knows clean_orders is an upstream dependency of the current model. It generates visual DAGs and exposes lineage through its web UI and metadata API.
Apache Airflow is an orchestration platform where you explicitly define task dependencies in Python using the >> operator (e.g., extract_task >> clean_task >> revenue_task). Airflow's UI shows DAG graphs with real-time status for each task.
OpenLineage and Marquez are open standards and tools specifically for lineage tracking. They capture lineage events from multiple systems and provide a centralized graph of how data flows across your entire data stack — across Airflow, Spark, dbt, and more.
Cloud data catalogs like Google Dataplex, AWS Glue Data Catalog, and Azure Purview automatically capture lineage from jobs running in their respective ecosystems.
Warning: Don't let tool complexity discourage you from thinking about lineage early. The mental model — knowing what feeds what and recording that history — is more important than any specific tool. You can implement meaningful lineage practices with a simple database table and disciplined documentation before you ever deploy Marquez.
Let's put this into practice. You'll map the dependencies of a fictional pipeline and trace a data quality issue.
Scenario: You work at a food delivery company. Your data platform has the following tables, produced by nightly jobs:
raw_deliveries — extracted from the app's production databaseraw_drivers — extracted from the HR systemclean_deliveries — built from raw_deliveries after removing cancelled and test ordersclean_drivers — built from raw_drivers after standardizing city namesdeliveries_with_driver — a joined table combining clean_deliveries and clean_driverscity_delivery_performance — an aggregation of deliveries_with_driver grouped by cityOne morning, city_delivery_performance shows dramatically low delivery counts for Chicago. Your job is to investigate.
Part 1: Draw the dependency DAG
On paper or in a text diagram, map the dependencies between these six tables. Draw arrows from upstream to downstream. Your diagram should make it immediately clear which tables must be built before others.
Part 2: Trace upstream
Starting from city_delivery_performance, identify all possible upstream sources of the Chicago problem. List at least three specific hypotheses about where the data loss could have occurred, naming the specific table or transformation step where each issue might live.
Part 3: Assess downstream impact
Suppose you discover that clean_deliveries is missing all rows where city = 'Chicago' due to a typo in a filter condition (someone wrote city != 'Chicago' instead of city = 'Chicago'). List every downstream table affected by this bug.
Part 4: Write a lineage record
Using the record_lineage function from earlier in this lesson (or your own version), write the call you would make after successfully rebuilding city_delivery_performance. Include accurate source_tables, a meaningful transformation_description, and whatever row count makes sense to you.
When a downstream task fails, the first instinct is often to debug that task's code. In reality, the majority of downstream failures are caused by upstream problems — empty tables, schema changes, or missing data. Always check upstream first.
What to do: Start your investigation at the task that feeds the broken one, not the broken one itself. Ask: "Did my upstream dependency succeed? Did it produce the expected number of rows? Did the schema change?"
Sometimes a job depends on a file or table that isn't formally declared anywhere — maybe a developer knew that Job C needed Job A's output, but only wrote the connection in a comment or relied on timing. These are implicit dependencies, and they're silent landmines.
What to do: Make every dependency explicit. If you're using an orchestration tool, declare it in the DAG. If you're writing scripts, document it in the function's docstring or a pipeline manifest file. If it's not written down, it will eventually break silently.
A job can run to completion without raising an error and still produce garbage — truncated data, wrong row counts, silently dropped records. A green checkmark in your orchestrator means the process completed, not that the data is correct.
What to do: Add data quality checks as explicit steps in your pipeline. After each major transformation, assert expected row counts, check for nulls in key columns, and validate value ranges. These checks should be their own pipeline nodes, with downstream tasks depending on them passing.
Renaming a column, changing a table's grain, or altering a filter condition can silently break many downstream consumers. If you don't know what's downstream of the table you're modifying, you can't know the blast radius of your change.
What to do: Before modifying any table or transformation, use your lineage documentation to identify every downstream dependency. Communicate changes to affected teams. Consider adding a deprecation period where both the old and new versions coexist.
You've now built a solid foundation for thinking about data pipeline structure. Let's consolidate what you've learned:
Where to go from here:
The most important habit you can build right now is simple: every time you write a pipeline task, ask yourself "what does this depend on, and what depends on this?" Write the answer down. That discipline, maintained consistently, is the foundation of a trustworthy data platform.
Learning Path: Data Pipeline Fundamentals