You're three hours into building an automated pipeline that summarizes customer support tickets. Everything works perfectly in testing — your prompts are clean, the model outputs are good, and you're feeling great. Then you push it to production with real data, and half your requests start failing. The error messages are cryptic. Some summaries are mysteriously truncated. A few responses seem to forget context that was clearly in the input. You start wondering if the model is just... broken.
It's not broken. You've hit a context window limit, and you didn't know it was there.
This lesson is about avoiding that moment entirely. Before you write a single production AI workflow, you need to understand how language models actually receive and process text — specifically, how they chop your words into tokens, why there's a hard ceiling on how much text they can handle at once, and what happens when you push against that ceiling. By the end of this lesson, you'll be able to look at any AI API, understand its constraints, and design workflows that stay well inside those limits without sacrificing quality.
What you'll learn:
This lesson assumes you're comfortable reading basic Python and understand the general idea that AI language models take text as input and return text as output. You don't need to have used an AI API before, though it helps. No machine learning knowledge is required.
Before we can talk about limits, we need to talk about what's actually being counted. Language models don't read your text the way you do — they don't see letters, and they don't see whole words. They see tokens.
A token is a chunk of text that the model has learned to treat as a single unit. Think of it like the smallest meaningful piece the model works with. The exact definition of a "chunk" is determined by a process called tokenization, which runs your raw text through an algorithm that breaks it into these pieces before any actual AI processing happens.
Here's the key insight: tokens don't map cleanly to words. Sometimes one word is one token. Sometimes a word is two or three tokens. With some tokenizers, a single token can even span more than one short word.
Let's make this concrete. Take the sentence:
The quarterly revenue report was disappointing.
A tokenizer might split this as:
["The", " quarterly", " revenue", " report", " was", " dis", "appoint", "ing", "."]
That's 9 tokens for 6 words. The word "disappointing" got split into three tokens (dis, appoint, ing) because the tokenizer breaks down less common words into recognizable sub-word pieces, and the final period counts as a token of its own.
Here are the general rules of thumb that hold across most modern tokenizers:
- Common short words are usually a single token (the, data, report)
- Longer or less common words are usually split into several tokens (disappointing, infrastructure, hyperparameter)
- Numbers are unpredictable: 2024 might be 1 token, but 2,024,381.50 could be 6 or more

A useful working approximation: 1 token ≈ 4 characters of English text, or roughly 100 tokens ≈ 75 words. This is a rough heuristic, not a guarantee, but it's good enough for back-of-envelope planning.
Tip: OpenAI provides a free tool called the Tokenizer at platform.openai.com/tokenizer. Paste any text and it shows you exactly how many tokens it uses and how the text gets split. It's the fastest way to build intuition about tokenization. Anthropic's Claude and Google's Gemini use slightly different tokenizers, but the numbers are similar enough that the same heuristics apply.
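The rule of thumb above can be turned into a quick estimator. This is a minimal sketch: the function name `estimate_tokens` is made up for illustration, and the two ratios are just the heuristics quoted above, not the behavior of any real tokenizer.

```python
import math


def estimate_tokens(text: str) -> dict[str, int]:
    """Return two rough token estimates for English text.

    Both ratios are rules of thumb, not a substitute for a real tokenizer.
    """
    by_chars = math.ceil(len(text) / 4)              # ~4 characters per token
    by_words = math.ceil(len(text.split()) / 0.75)   # ~75 words per 100 tokens
    return {"by_chars": by_chars, "by_words": by_words}


sentence = "The quarterly revenue report was disappointing."
print(estimate_tokens(sentence))
# {'by_chars': 12, 'by_words': 8}
```

The two estimates land on either side of the 9-token split shown earlier, which is about as much precision as the heuristic can offer.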
If you're working with structured data — CSVs, database query results, JSON responses — you might assume tokens are someone else's problem. They're not.
Here's why: when you want a language model to do anything with data, you have to convert that data into text and send it as part of your prompt. A table with 100 rows and 10 columns, pasted as plain text, might cost 2,000–4,000 tokens depending on your values. A JSON blob from an API response with nested fields can bloat token counts dramatically.
Let's look at a realistic example. Suppose you're querying a CRM and you get back a list of customer records:
records = [
    {"customer_id": "C00123", "name": "Patricia Chen", "tier": "enterprise",
     "last_contact": "2024-03-15", "open_tickets": 4, "revenue_usd": 142000},
    {"customer_id": "C00124", "name": "James Okafor", "tier": "mid-market",
     "last_contact": "2024-02-28", "open_tickets": 1, "revenue_usd": 38500},
    # ... 98 more records
]
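If you naively dump all 100 records as JSON into a prompt asking "which customers are at risk of churning?", you need to estimate the token cost. Let's do the math. Each record above is roughly 145 characters of JSON, which the 4-characters-per-token heuristic puts at about 36 tokens. JSON's quotes, braces, and repeated keys often tokenize less efficiently than prose, so budget roughly 35–50 tokens per record, or on the order of 3,500–5,000 tokens for 100 records.
That fits fine in most models today. But if you had 1,000 records? You'd be looking at roughly 35,000–50,000 tokens, which exceeds the limits of some models and gets expensive fast on others.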
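A back-of-envelope estimate like this can be scripted. The sketch below reuses the two sample records and the 4-characters-per-token heuristic. The helper names are hypothetical, and the results are estimates, so measure with a real tokenizer before relying on them. It also compares three ways of serializing the same data, since formatting alone changes the cost.

```python
import json
import math

# The two records shown above; the remaining 98 are elided in the example.
records = [
    {"customer_id": "C00123", "name": "Patricia Chen", "tier": "enterprise",
     "last_contact": "2024-03-15", "open_tickets": 4, "revenue_usd": 142000},
    {"customer_id": "C00124", "name": "James Okafor", "tier": "mid-market",
     "last_contact": "2024-02-28", "open_tickets": 1, "revenue_usd": 38500},
]


def estimate_tokens_from_chars(text: str) -> int:
    """Rough estimate using the ~4 characters per token heuristic."""
    return math.ceil(len(text) / 4)


def estimate_per_record(rows: list[dict], **dumps_kwargs) -> float:
    """Average estimated tokens per record for a given JSON serialization."""
    serialized = json.dumps(rows, **dumps_kwargs)
    return estimate_tokens_from_chars(serialized) / len(rows)


pretty = estimate_per_record(records, indent=2)                  # human-readable
default = estimate_per_record(records)                           # json.dumps defaults
compact = estimate_per_record(records, separators=(",", ":"))    # no extra spaces

for label, per_record in [("indented", pretty), ("default", default), ("compact", compact)]:
    print(f"{label:>8}: ~{per_record:.0f} tokens/record, "
          f"~{per_record * 1000:,.0f} tokens for 1,000 records")
```

Pretty-printed JSON is the most expensive of the three, so compact serialization is a cheap first step before reaching for heavier strategies.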
Understanding this math is what separates data professionals who build reliable AI workflows from those who discover limits the hard way in production.
Now we can talk about the context window. This is the single most important constraint you'll work with when building AI pipelines.
A context window is the maximum amount of text — measured in tokens — that a language model can process in a single interaction. Everything counts toward this limit: your instructions, your data, any conversation history, and the model's response.
Think of it like a whiteboard. You and the model are collaborating on a task, and the whiteboard is where you write everything down. The whiteboard has a fixed size. Once it's full, you can't add anything new without erasing something — and whatever falls off the whiteboard is simply gone from the model's awareness.
This isn't a software bug or a lazy design choice. It's a fundamental property of how transformer-based language models work. The attention mechanism that makes these models so powerful, the thing that lets them understand relationships between words across long passages, compares every token with every other token. In the standard design, its compute and memory cost therefore grows roughly quadratically with the length of the input, and that puts a practical limit on how far the window can stretch.
Here's how context windows have evolved across major models (approximate figures that shift with new releases):
| Model | Context Window |
|---|---|
| GPT-3.5 Turbo | 16,000 tokens |
| GPT-4o | 128,000 tokens |
| Claude 3.5 Sonnet | 200,000 tokens |
| Gemini 1.5 Pro | 1,000,000 tokens |
| Llama 3 (8B, local) | 8,000 tokens |
The numbers vary wildly. A local model you're running on your own hardware might have an 8,000-token window. A cloud API might offer 200,000 tokens. Your job as a workflow designer is to know which model you're using and what its ceiling is — before you start architecting the pipeline.
Warning: A large context window does not mean unlimited. Even with 128,000 tokens available, there's a well-documented phenomenon called lost in the middle — models tend to pay less attention to information buried in the middle of very long inputs than to information at the beginning or end. Don't assume that sending more context always means better results.
Here's the mistake most beginners make: they see "128k tokens" and think they have 128,000 tokens of input space. They don't. The context window is shared between everything — input and output combined.
Let's break down how context gets consumed in a real workflow:
Total Context Window
│
├── System Prompt (your instructions to the model)
│   └── e.g., 500 tokens for a detailed summarization prompt
│
├── User Input / Data
│   └── e.g., the actual text or data you want processed
│
├── Few-shot Examples (if you include them)
│   └── e.g., 3 example input/output pairs = ~600 tokens
│
└── Model Output (the response)
    └── e.g., a 300-word summary = ~400 tokens
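So if you're working with GPT-3.5 Turbo (16k context) and your setup reserves 400 tokens for the model output, 600 for few-shot examples, 500 for the system prompt, and a 200-token safety buffer, then your usable input space is: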
16,000 - 400 - 600 - 500 - 200 = 14,300 tokens
That's 14,300 tokens for the actual data you want processed. At ~750 words per 1,000 tokens, that's about 10,700 words — enough for a detailed report or a handful of lengthy documents, but nowhere near "unlimited."
Now run the same math with a local Llama 3 model at 8,000 tokens:
8,000 - 400 - 600 - 500 - 200 = 6,300 tokens
You're working with about 4,700 words of usable input. That's a tight constraint — roughly the length of a short article, or maybe 30–40 customer support tickets of average length.
This calculation is something you should do before designing any workflow, every time.
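One way to make that calculation routine is to wrap it in a small helper. This is a sketch under stated assumptions: the `TokenBudget` class is hypothetical, and the default reservations (500 for the system prompt, 600 for few-shot examples, 400 for output, 200 as a safety buffer) are one reading of the four numbers subtracted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenBudget:
    """Token reservations for one request (default values are assumptions)."""
    context_window: int
    system_prompt: int = 500
    few_shot_examples: int = 600
    expected_output: int = 400
    safety_buffer: int = 200

    def usable_input(self) -> int:
        """Tokens left for data after every reservation is subtracted."""
        reserved = (self.system_prompt + self.few_shot_examples
                    + self.expected_output + self.safety_buffer)
        remaining = self.context_window - reserved
        if remaining <= 0:
            raise ValueError(
                f"Reservations ({reserved}) leave no room for input "
                f"in a {self.context_window}-token window"
            )
        return remaining

    def usable_words(self) -> int:
        """Approximate word capacity at ~750 words per 1,000 tokens."""
        return int(self.usable_input() * 0.75)


for name, window in [("16k model", 16_000), ("8k model", 8_000)]:
    budget = TokenBudget(context_window=window)
    print(f"{name}: {budget.usable_input():,} input tokens, "
          f"~{budget.usable_words():,} words")
# 16k model: 14,300 input tokens, ~10,725 words
# 8k model: 6,300 input tokens, ~4,725 words
```

Keeping the reservations as named fields makes it obvious which number to change when the prompt grows or the expected output gets longer.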
Rather than guessing, you can measure token counts directly using tiktoken, OpenAI's open-source tokenizer library, which implements the same encodings OpenAI's models use. If you're using other model providers, the counts will be slightly different but close enough for planning.
Install it:
pip install tiktoken
Then count tokens like this:
import tiktoken


def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count the number of tokens in a string for a given model."""
    encoder = tiktoken.encoding_for_model(model)
    tokens = encoder.encode(text)
    return len(tokens)


# Example: checking a customer support ticket batch
ticket_batch = """
Ticket #4821 | 2024-03-15 | Priority: High
Customer: Enterprise account, 3 years
Issue: Dashboard not loading after latest update.
Reproducible on Chrome and Firefox. IT has confirmed it's
not a network issue. Affecting 12 users on their team.

Ticket #4822 | 2024-03-15 | Priority: Medium
Customer: Mid-market, 18 months
Issue: CSV export button is greyed out on the reports page.
Has tried refreshing and clearing cache.
"""

token_count = count_tokens(ticket_batch)
print(f"Token count: {token_count}")
# Output will be roughly 110-130 tokens
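`tiktoken.encoding_for_model` raises a `KeyError` for model names it does not recognize, which matters if you point the same pipeline at a local or non-OpenAI model. The sketch below shows one possible fallback chain. The function name is hypothetical, and using `cl100k_base` as the general-purpose encoding is an assumption rather than a recommendation from any provider.

```python
import math


def count_tokens_with_fallback(text: str, model: str = "gpt-4o") -> tuple[int, str]:
    """Return (token_count, method), degrading gracefully when needed."""
    try:
        import tiktoken  # optional dependency

        try:
            encoder = tiktoken.encoding_for_model(model)
        except KeyError:
            # tiktoken does not know this model name; use a general-purpose
            # encoding as an approximation (assumption, see lead-in).
            encoder = tiktoken.get_encoding("cl100k_base")
        return len(encoder.encode(text)), encoder.name
    except Exception:
        # Deliberately broad: tiktoken is not installed, or its encoding
        # files could not be loaded. Fall back to ~4 characters per token.
        return math.ceil(len(text) / 4), "chars/4 heuristic"


count, method = count_tokens_with_fallback(
    "Dashboard not loading after latest update.", model="my-local-model"
)
print(f"{count} tokens ({method})")
```

Returning the method alongside the count makes it visible in logs when a number is only an approximation.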
You can also build a simple check into your pipeline that raises an error before making an expensive API call with an oversized input:
MAX_INPUT_TOKENS = 12000  # Your usable input budget


def safe_prompt_check(data_text: str, system_prompt: str) -> None:
    """Raise an error if combined input exceeds safe limits."""
    data_tokens = count_tokens(data_text)
    prompt_tokens = count_tokens(system_prompt)
    total = data_tokens + prompt_tokens

    if total > MAX_INPUT_TOKENS:
        raise ValueError(
            f"Input too large: {total} tokens "
            f"(limit: {MAX_INPUT_TOKENS}). "
            f"Data: {data_tokens}, Prompt: {prompt_tokens}. "
            f"Consider chunking the data."
        )

    print(f"Input OK: {total} tokens used of {MAX_INPUT_TOKENS} available.")
This kind of defensive check saves you from silent failures in production.
Once you know your limits, you have a few practical strategies for working within them. None of them are perfect — each involves a tradeoff. Your job is to choose the right one for your use case.
Chunking means splitting your input into smaller pieces and processing each one separately. This is the most common approach for large document processing.
def chunk_text(text: str, max_tokens: int, model: str = "gpt-4o") -> list[str]:
    """Split text into chunks that fit within token limits."""
    encoder = tiktoken.encoding_for_model(model)
    tokens = encoder.encode(text)

    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunk_tokens = tokens[i:i + max_tokens]
        # Named `piece` so the local variable does not shadow the function name.
        piece = encoder.decode(chunk_tokens)
        chunks.append(piece)

    return chunks
The tradeoff: chunking breaks context. If a key piece of information is in chunk 3 but the relevant question is answered in chunk 7, the model won't connect them unless you design your workflow to handle cross-chunk reasoning.
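Splitting on raw token positions can also cut a record in half. When the input is a list of discrete items such as tickets, an alternative is to pack whole items into batches. The following is a sketch, not the lesson's method: `pack_into_batches` and `rough_count` are hypothetical helpers, and the counter is the 4-characters-per-token heuristic standing in for a real tokenizer.

```python
from typing import Callable


def pack_into_batches(
    items: list[str],
    max_tokens: int,
    count_fn: Callable[[str], int],
    separator: str = "\n\n",
) -> list[list[str]]:
    """Group whole items into batches whose estimated size stays under max_tokens."""
    sep_cost = count_fn(separator)
    batches: list[list[str]] = []
    current: list[str] = []
    current_tokens = 0

    for item in items:
        cost = count_fn(item)
        if cost > max_tokens:
            raise ValueError(f"Single item needs {cost} tokens, budget is {max_tokens}")
        extra = cost + (sep_cost if current else 0)
        if current and current_tokens + extra > max_tokens:
            # Current batch is full: close it and start a new one with this item.
            batches.append(current)
            current, current_tokens = [], 0
            extra = cost
        current.append(item)
        current_tokens += extra

    if current:
        batches.append(current)
    return batches


def rough_count(text: str) -> int:
    """Stand-in counter (~4 characters per token); swap in a real tokenizer."""
    return -(-len(text) // 4)  # ceiling division


tickets = [f"Ticket #{4821 + i}: CSV export button is greyed out." for i in range(10)]
batches = pack_into_batches(tickets, max_tokens=40, count_fn=rough_count)
print([len(batch) for batch in batches])
```

Summing per-item counts is only approximate for real tokenizers, because tokens can merge differently at the joins, which is one more reason to keep a safety buffer in the budget.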
Instead of sending all your data, send a compressed version. For a dataset of 500 customer records, you might first compute summary statistics and only send those — rather than sending every row.
import pandas as pd


def summarize_dataframe_for_prompt(df: pd.DataFrame) -> str:
    """Convert a DataFrame to a compact text summary for AI input."""
    summary_lines = [
        f"Dataset: {len(df)} rows, {len(df.columns)} columns",
        f"Columns: {', '.join(df.columns.tolist())}",
        "\nNumerical summary:",
        df.describe().to_string(),
        "\nSample rows (first 5):",
        df.head(5).to_string(),
    ]
    return "\n".join(summary_lines)
The tradeoff: you lose row-level detail. The model can reason about patterns but can't flag specific anomalies in records that were excluded.
For processing many documents (like a corpus of support tickets), use a map-reduce pattern:
def map_reduce_summarize(documents: list[str], summarize_fn) -> str:
    """Summarize many documents by first summarizing each individually."""
    # Map step: summarize each document
    individual_summaries = []
    for doc in documents:
        summary = summarize_fn(doc)  # Your API call here
        individual_summaries.append(summary)

    # Reduce step: combine summaries into a final summary
    combined = "\n\n---\n\n".join(individual_summaries)
    final_summary = summarize_fn(combined)  # One final call

    return final_summary
The tradeoff: two layers of summarization means you lose more detail. But it scales to very large document sets, as long as the combined summaries still fit in the context window for the final call.
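With enough documents, the joined summaries in the reduce step can themselves outgrow the context window. One possible extension is to reduce in rounds, merging only as many summaries per call as the budget allows. This is a sketch: `hierarchical_reduce` is a hypothetical helper, it assumes each individual summary already fits within `max_tokens`, and the demo uses a fake summarizer and a word-count "tokenizer" in place of real API calls.

```python
from typing import Callable


def hierarchical_reduce(
    summaries: list[str],
    summarize_fn: Callable[[str], str],
    count_fn: Callable[[str], int],
    max_tokens: int,
    separator: str = "\n\n---\n\n",
) -> str:
    """Combine summaries in rounds so no single call exceeds max_tokens."""
    if not summaries:
        raise ValueError("No summaries to combine")

    while len(summaries) > 1:
        groups: list[list[str]] = [[]]
        for summary in summaries:
            candidate = separator.join(groups[-1] + [summary])
            if groups[-1] and count_fn(candidate) > max_tokens:
                groups.append([summary])      # start a new group
            else:
                groups[-1].append(summary)    # still fits in the current group
        if len(groups) == len(summaries):
            # No two summaries fit together, so another round cannot shrink the list.
            raise ValueError("Summaries too long to combine within max_tokens")
        summaries = [summarize_fn(separator.join(group)) for group in groups]

    return summaries[0]


def fake_summarize(text: str) -> str:
    """Stand-in for an API call: keeps only the first three words."""
    return " ".join(text.split()[:3])


def word_count(text: str) -> int:
    """Stand-in token counter: counts whitespace-separated words."""
    return len(text.split())


parts = [f"summary {i} of ticket batch" for i in range(8)]
result = hierarchical_reduce(parts, fake_summarize, word_count, max_tokens=12)
print(result)
```

Each extra round is another layer of compression, so the detail loss described above compounds. Use the fewest rounds your budget allows.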
Tip: For retrieval-heavy workflows — like asking questions over a large knowledge base — consider a RAG (Retrieval-Augmented Generation) approach instead. You embed your documents, retrieve only the most relevant chunks for each query, and send only those. This is more sophisticated but far more token-efficient for large corpora.
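To make the retrieval idea concrete, here is a toy sketch of the "retrieve only the most relevant chunks" step. It is not a real RAG stack: word-count vectors and cosine similarity stand in for learned embeddings, and every name in it is hypothetical. A production system would typically use an embedding model and a vector index instead.

```python
import math
import re
from collections import Counter


def _vectorize(text: str) -> Counter:
    """Bag-of-words counts; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    query_vec = _vectorize(query)
    ranked = sorted(chunks, key=lambda c: _cosine(query_vec, _vectorize(c)), reverse=True)
    return ranked[:top_k]


knowledge_base = [
    "Dashboard not loading after latest update on Chrome and Firefox.",
    "CSV export button is greyed out on the reports page.",
    "Invoice totals are rounded incorrectly for enterprise accounts.",
]
print(retrieve("Why is the CSV export button disabled?", knowledge_base, top_k=1))
```

Only the retrieved chunks go into the prompt, so the token cost depends on `top_k` and chunk size rather than on the size of the whole knowledge base.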
Let's put this all together with a realistic exercise. You have a CSV file of support tickets that you want to summarize, but you don't know if it fits within your context window.
Setup: Install tiktoken (pip install tiktoken) and create a CSV with at least 20 fake support tickets. Each ticket should have: a ticket ID, date, priority level, customer description, and issue description.
Step 1: Load and inspect your data
Load the CSV with pandas and print the first few rows. Convert the relevant columns to a single string per row (e.g., f"Ticket {id}: {description}").
Step 2: Count tokens
Write a function that takes your list of ticket strings and returns the total token count. Use tiktoken with gpt-4o encoding.
Step 3: Define your budget
Assume you're using a model with a 16,000-token context window. Write out the budget calculation: the context window minus your system prompt, any few-shot examples, the expected output, and a safety buffer.
Step 4: Check if your data fits
Does your ticket data fit in the usable input space? If not, how many tickets can you fit per request? Calculate the number of API calls you'd need to process your full dataset.
Step 5: Build the chunker
Using the chunking function from earlier, split your ticket data into chunks that fit within your usable input space. Print the number of chunks and the token count of each.
By the end of this exercise, you should have a concrete sense of how to size any AI workflow before you write the API-calling code.
Mistake: Assuming tokens equal words
This leads to systematic underestimation, especially with numerical data, code, and non-English text. Always use a tokenizer to measure, not word counts.
Mistake: Forgetting the system prompt in your budget
You might write a rich, detailed system prompt (which is good for quality) without accounting for its token cost. A thorough set of instructions can easily run 500–1,000 tokens. Measure it.
Mistake: Not accounting for output tokens
If you're asking the model to produce a detailed analysis, that output counts against your context window too. Always reserve token budget for the response.
Mistake: Treating the context window as the optimal input size
Just because you can send 100,000 tokens doesn't mean you should. Larger contexts are slower, more expensive, and can degrade model performance due to the "lost in the middle" problem. Send what's relevant.
Mistake: Hard-coding token limits
If you hard-code MAX_TOKENS = 128000 for GPT-4o today, you'll need to change it if you switch models. Build configuration so token limits are defined in one place and applied everywhere.
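A minimal sketch of what "defined in one place" might look like follows. The dictionary keys are illustrative labels rather than guaranteed API model identifiers, and the values are the approximate figures from the table earlier in this lesson, so check your provider's documentation before relying on them.

```python
# Approximate context windows (assumed values, copied from the table above).
MODEL_CONTEXT_WINDOWS: dict[str, int] = {
    "gpt-3.5-turbo": 16_000,
    "gpt-4o": 128_000,
    "claude-3-5-sonnet": 200_000,
    "gemini-1.5-pro": 1_000_000,
    "llama-3-8b": 8_000,
}


def context_window_for(model: str) -> int:
    """Look up a model's context window, failing loudly for unknown models."""
    try:
        return MODEL_CONTEXT_WINDOWS[model]
    except KeyError:
        raise ValueError(f"No context window configured for model {model!r}") from None


def max_input_tokens(model: str, reserved_tokens: int) -> int:
    """Usable input space once prompt, output, and buffer reservations are removed."""
    return context_window_for(model) - reserved_tokens


print(max_input_tokens("gpt-4o", reserved_tokens=1_700))
# 126300
```

Switching models then means changing one argument, and an unknown model name fails immediately instead of silently inheriting another model's limit.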
Troubleshooting: Truncated or incomplete outputs
If your model's response is cut off mid-sentence, you've likely hit the max_tokens parameter for the response, not the context window. These are different. The context window is the total size limit; max_tokens (or max_completion_tokens) is a cap you set on the output length. Increase it.
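Most APIs report why generation stopped, so you can detect truncation programmatically instead of spotting it by eye. The sketch below assumes you have already pulled that field out of the response: OpenAI's Chat Completions API reports it as `finish_reason` with the value `"length"`, and Anthropic's Messages API as `stop_reason` with the value `"max_tokens"`. The helper names are hypothetical, and other providers may use different values.

```python
from typing import Optional

# Values meaning "stopped because the output cap was reached".
TRUNCATION_REASONS = {"length", "max_tokens"}


def was_truncated(stop_reason: Optional[str]) -> bool:
    """True if the response ended because it hit the output token cap."""
    return stop_reason in TRUNCATION_REASONS


def next_output_cap(current_cap: int, stop_reason: Optional[str], ceiling: int) -> int:
    """Double the output cap after a truncated response, never exceeding ceiling."""
    if was_truncated(stop_reason):
        return min(current_cap * 2, ceiling)
    return current_cap


print(was_truncated("length"), was_truncated("stop"))
# True False
print(next_output_cap(400, "length", ceiling=1_000))
# 800
```

Remember that a larger output cap comes out of the same context window, so raising it shrinks the space left for input.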
Troubleshooting: "Context length exceeded" API errors
This error means your combined input + output exceeded the model's context window. Reduce your input using the strategies above — chunking is usually the fastest fix.
Let's bring it together. Tokens are the fundamental unit of text that language models process — not words, not characters. A tokenizer breaks your text into these units before anything else happens, and the counts are often surprising, especially with numbers, code, and non-English content. The context window is the hard ceiling on how many tokens can participate in a single model interaction, covering your instructions, your data, and the model's response all at once.
The core skill this lesson built is the budget calculation: take your model's context window, subtract the tokens your system prompt needs, subtract the expected output size, subtract a safety buffer, and what remains is your true usable input space. Do this calculation before designing any workflow.
When your data is larger than that budget, you have three main strategies: chunk the data and process in batches, summarize and reduce before sending, or use a map-reduce pattern for large document sets. Each tradeoff is real — you always give something up when you compress input — so choose based on what matters most for your specific use case.
Where to go next:
The difference between AI workflows that work once in a notebook and AI workflows that work reliably in production almost always comes down to understanding limits. You now understand them.