Imagine you've built a retrieval system that pulls the right documents every time. Your vector search is solid, your chunking strategy is dialed in, and the relevant passages are landing in your pipeline. Then you ask the model a question — and it ignores half the context, confabulates a detail that isn't in any of your documents, and responds with the breezy confidence of someone who has no idea they're wrong. The retrieval worked. The prompting failed.
This is one of the most common failure modes in production RAG systems, and it's almost entirely a prompt engineering problem. The LLM isn't broken — it just wasn't given clear instructions about how to use the context it received. Writing a system prompt for a RAG application is a fundamentally different task from writing a general-purpose assistant prompt. You're not just setting a tone; you're defining a contract between the model and the retrieved evidence.
By the end of this lesson, you'll understand exactly how to write system prompts that keep LLM responses tightly grounded in retrieved context. You'll know which structural components matter, how to handle edge cases like missing or conflicting information, and how to test whether your prompts are actually working.
You should be comfortable with the basic idea of what RAG (Retrieval-Augmented Generation) is — that is, you understand that a RAG system retrieves relevant text chunks from a knowledge base and passes them to an LLM along with the user's question. You don't need to have built a RAG system from scratch, but you should have read an introductory overview. Basic familiarity with calling an LLM API (like OpenAI's) will help you follow the code examples.
Before we talk about how to write a system prompt, let's be precise about what it is and where it sits.
Most LLM APIs accept messages in a structured format. There are typically three roles: system, user, and assistant. The system message is sent at the beginning of a conversation and functions like standing orders — instructions the model carries with it throughout every exchange. The user message is the human's actual question or input. The assistant message is the model's response.
Here's a minimal API call to illustrate the structure:
import openai

client = openai.OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ]
)
print(response.choices[0].message.content)
In a RAG application, the system message is where you give the model its identity, its constraints, and, critically, instructions for how to treat the retrieved context. In the pattern this lesson uses, the retrieved documents don't appear in the system message itself. They're injected into the user message, formatted alongside the actual question.
Think of it this way: the system prompt is the policy, and the user message (with its embedded context) is the case. The model is the decision-maker applying the policy to the case.
Key insight: The system prompt can't know in advance what documents will be retrieved. It has to give the model durable, general instructions for how to handle any retrieved context it might receive.
A well-designed RAG system prompt has five distinct functional components. They don't have to appear in this exact order, and they don't have to be labeled — but every component should be present, because each one closes a specific failure mode.
Start with the role definition: tell the model who it is and what domain it's operating in. This primes the model's behavior and sets expectations for the vocabulary and reasoning style it should use.
A weak version:
You are a helpful assistant.
A stronger version for, say, an internal HR policy chatbot:
You are an HR policy assistant for Meridian Technologies.
You help employees understand company policies, benefits,
and procedures based on official documentation.
The specificity matters. "Helpful assistant" is so general that the model will freely draw on everything it knows about everything. Specifying a domain and a purpose is the first constraint you're applying.
The grounding instruction is the most critical component in a RAG prompt. It explicitly instructs the model to base its answer on the provided context, not on its general training knowledge.
Here's a baseline grounding instruction:
Answer the user's question using only the information provided
in the CONTEXT section below. Do not use prior knowledge or make
assumptions beyond what is stated in the provided documents.
This sounds simple, but the phrasing matters. Notice the phrase "make assumptions beyond what is stated." LLMs are probabilistic completion engines — their default behavior is to fill in gaps. The grounding instruction is explicitly overriding that default.
Warning: Saying "use the context to answer" is not the same as saying "use only the context." The word "only" does a lot of work here. Without it, many models will blend retrieved content with training knowledge in ways that are difficult to detect.
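One way to avoid losing the restrictive wording during later prompt edits is a small check in your test suite. The sketch below is a keyword heuristic only, and it says nothing about how the model actually behaves. The function name `has_strict_grounding` and the list of marker words are assumptions chosen for illustration.

```python
import re

# Hypothetical marker words that signal a restrictive grounding instruction.
# This list is an assumption for illustration; adjust it to your own prompts.
STRICT_MARKERS = ("only", "exclusively", "solely")


def has_strict_grounding(system_prompt: str) -> bool:
    """Return True if the prompt contains at least one restrictive marker word."""
    words = set(re.findall(r"[a-z]+", system_prompt.lower()))
    return any(marker in words for marker in STRICT_MARKERS)


weak = "Use the context to answer the user's question."
strong = "Answer using only the information provided in the CONTEXT section."

print(has_strict_grounding(weak))    # False
print(has_strict_grounding(strong))  # True
```

A check like this catches accidental regressions in the prompt text. Whether the model actually stays grounded still has to be tested against real responses.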
If you want the model to cite its sources — and in most production RAG systems, you do — you need to be explicit about the format and the expectation. Don't leave this to chance.
When you use information from the context, cite the source
using the document title or identifier provided. Format
citations inline like this: [Source: Document Title].
You can also require per-claim attribution when several documents contribute to an answer:
If your answer draws on multiple documents, cite each one
where relevant. Do not combine information from different
sources without making clear which claim comes from which source.
This prevents a common failure mode where the model synthesizes across multiple chunks and produces a statement that isn't cleanly traceable to any single document.
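If you adopt the `[Source: ...]` format above, you can spot-check responses for claims that carry no citation at all. The following is a rough sketch: it works line by line, which is a simplifying assumption, and the helper name `find_uncited_lines` and the sample response text are made up for illustration.

```python
import re

# Matches the inline citation format from the instruction above: [Source: ...]
CITATION_PATTERN = re.compile(r"\[Source:\s*[^\]]+\]")


def find_uncited_lines(response_text: str) -> list[str]:
    """Return non-empty lines of the response that carry no inline citation."""
    uncited = []
    for line in response_text.splitlines():
        stripped = line.strip()
        if stripped and not CITATION_PATTERN.search(stripped):
            uncited.append(stripped)
    return uncited


# Made-up response text, used only to exercise the function.
sample = (
    "Employees accrue vacation days monthly [Source: PTO Policy].\n"
    "Unused days roll over automatically."
)
print(find_uncited_lines(sample))  # ['Unused days roll over automatically.']
```

Lines flagged this way aren't necessarily wrong, but they are the first place to look when a response contains a claim you can't trace to a document.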
What should the model do when the retrieved context doesn't contain the answer? If you don't specify, it will often improvise — and that improvisation is where hallucinations live.
If the provided context does not contain enough information
to answer the question, say so clearly. Do not speculate,
infer, or draw on outside knowledge to fill the gap.
A response like "The provided documents don't address this
specific question" is better than a guess.
This instruction is uncomfortable for some teams to write because it means the product will sometimes say "I don't know." But that honesty is a feature, not a bug. A RAG system that confabulates confidently is far more dangerous than one that acknowledges its limits.
The final component, formatting guidance, controls the shape of the response. This isn't about aesthetics. It's about preventing the model from padding answers with filler that might smuggle in unsupported claims.
Keep responses concise and directly grounded in the evidence.
Use bullet points when listing multiple distinct items.
Do not add preamble, disclaimers, or commentary beyond
what is necessary to answer the question.
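Formatting rules like these are easy to verify mechanically. Here is a minimal sketch of such a check. The word limit, the list of preamble phrases, and the function name `check_response_format` are all assumptions, so tune them to what you actually see in your responses.

```python
def check_response_format(response_text: str, max_words: int = 300) -> list[str]:
    """Return a list of human-readable formatting problems (empty if none)."""
    problems = []

    # Whitespace-separated tokens are a rough but serviceable word count.
    word_count = len(response_text.split())
    if word_count > max_words:
        problems.append(f"too long: {word_count} words (limit {max_words})")

    # Hypothetical preamble openers; extend this tuple as you find new ones.
    preambles = ("sure", "great question", "certainly", "of course")
    if response_text.strip().lower().startswith(preambles):
        problems.append("starts with preamble")

    return problems


print(check_response_format("Sure! Here is the answer."))
# ['starts with preamble']
```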
Let's build a complete system prompt for a realistic scenario: an internal knowledge base assistant for a software company, where employees can ask questions about engineering runbooks, incident postmortems, and deployment procedures.
You are an internal technical knowledge assistant for Cloudstream Engineering.
Your role is to help engineers quickly find accurate information from
official runbooks, postmortem reports, and deployment documentation.
## How to Use Context
You will be provided with a CONTEXT section containing one or more
retrieved document excerpts, each labeled with a source identifier.
Base your answer exclusively on the information in these excerpts.
Do not use prior knowledge, make inferences beyond what is explicitly
stated, or draw conclusions that aren't supported by the provided text.
## Citations
When your answer is drawn from specific documents, cite the source inline
using this format: [Source: <document_id>]. If multiple documents contribute
to the answer, cite each one at the point where you use it.
## When Context Is Insufficient
If the retrieved context does not contain the information needed to answer
the question, respond with a clear statement that the documentation available
does not address this topic. Suggest that the user refine their search or
consult the relevant team directly. Do not speculate or fill gaps with
assumed knowledge.
## Response Format
- Be direct and specific. Avoid preamble.
- Use numbered steps when describing procedures.
- Use bullet points for lists of items or options.
- Keep responses under 300 words unless the complexity of the question
genuinely requires more.
Notice that this prompt uses markdown headers internally. This is deliberate — it helps the model mentally separate each instruction set, and it mirrors the structured format it's likely to have seen in training on technical documentation.
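If you maintain several RAG prompts, it can help to assemble them from named sections so that no component gets dropped by accident. The sketch below shows one possible approach. The helper `assemble_system_prompt` is hypothetical, and the section bodies are abbreviated versions of the prompt above rather than the full text.

```python
def assemble_system_prompt(role: str, sections: dict[str, str]) -> str:
    """Join a role definition and titled instruction sections into one prompt."""
    parts = [role.strip()]
    for title, body in sections.items():
        # Each section gets a markdown header, mirroring the prompt above.
        parts.append(f"## {title}\n{body.strip()}")
    return "\n\n".join(parts)


# Abbreviated section bodies, for illustration only.
abbreviated_prompt = assemble_system_prompt(
    role="You are an internal technical knowledge assistant for Cloudstream Engineering.",
    sections={
        "How to Use Context": "Base your answer exclusively on the information in the provided excerpts.",
        "Citations": "Cite sources inline using this format: [Source: <document_id>].",
        "When Context Is Insufficient": "State clearly that the available documentation does not address the topic.",
        "Response Format": "Be direct and specific. Avoid preamble.",
    },
)
print(abbreviated_prompt)
```

Keeping the sections in a dictionary also makes it straightforward to revise one component at a time when a test fails.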
The system prompt defines the policy. The user message delivers the context and the question. The format of that user message matters almost as much as the system prompt itself.
Here's a pattern that works well:
def build_user_message(question: str, retrieved_chunks: list[dict]) -> str:
    # Label each chunk with its document ID so the model can cite it.
    context_block = ""
    for chunk in retrieved_chunks:
        context_block += f"[Document ID: {chunk['doc_id']}]\n"
        context_block += f"{chunk['text']}\n\n"
    # Keep the context and the question in clearly separated sections.
    return f"""CONTEXT:
{context_block}
QUESTION:
{question}"""
And here's what a full API call looks like with this structure:
retrieved_chunks = [
    {
        "doc_id": "runbook-k8s-rollback-v3",
        "text": "To roll back a Kubernetes deployment, run: kubectl rollout undo deployment/<deployment-name>. This reverts to the previous ReplicaSet. Verify the rollback completed with kubectl rollout status deployment/<deployment-name>."
    },
    {
        "doc_id": "postmortem-2024-03-db-failover",
        "text": "During the March 2024 database failover incident, the team observed a 4-minute delay in rollback execution due to a misconfigured readiness probe that prevented the old pods from being marked as ready."
    }
]
question = "How do I roll back a Kubernetes deployment, and are there any known issues I should watch for?"
user_message = build_user_message(question, retrieved_chunks)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},  # the full prompt from above
        {"role": "user", "content": user_message}
    ]
)
print(response.choices[0].message.content)
The model now has clearly labeled sources, a structured separation between context and question, and standing instructions for how to handle both. This structure makes it much easier to diagnose problems — if the model cites a document ID that doesn't match any retrieved chunk, you know something went wrong.
Tip: Always label your chunks with identifiers before passing them to the model. Even if you don't surface those IDs to end users, they let you trace exactly which retrieved passage contributed to a given claim during debugging.
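The mismatch check described above can be automated. This is a minimal sketch, assuming the model follows the `[Source: <document_id>]` citation format from the system prompt. The function name `find_unknown_citations` and the sample response text are made up for illustration.

```python
import re


def find_unknown_citations(response_text: str, retrieved_chunks: list[dict]) -> set[str]:
    """Return cited document IDs that do not match any retrieved chunk."""
    known_ids = {chunk["doc_id"] for chunk in retrieved_chunks}
    cited_ids = {
        match.strip()
        for match in re.findall(r"\[Source:\s*([^\]]+)\]", response_text)
    }
    return cited_ids - known_ids


chunks = [
    {"doc_id": "runbook-k8s-rollback-v3", "text": "..."},
    {"doc_id": "postmortem-2024-03-db-failover", "text": "..."},
]
# Made-up model output for illustration; the second ID was never retrieved.
response_text = (
    "Run kubectl rollout undo [Source: runbook-k8s-rollback-v3]. "
    "Check the probe settings first [Source: runbook-probes-v1]."
)
print(find_unknown_citations(response_text, chunks))  # {'runbook-probes-v1'}
```

A non-empty result is a strong signal that the model invented a source, and the response is worth logging for review.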
Writing the prompt is only half the work. You need to test it against the failure modes it was designed to prevent. Here are three test categories that will catch the most common problems.
Test 1: The Out-of-Scope Question. Ask a question that your knowledge base definitely doesn't cover. A well-grounded prompt should produce a clear "I don't have that information" response, not an improvised answer. If the model answers anyway, your grounding instruction is too weak.
Test 2: The Partial Context Question. Retrieve chunks that contain related but incomplete information, enough to tempt the model into extrapolating. The model should answer only what the context supports and flag what's missing.
Test 3: The Conflicting Documents Question. Deliberately retrieve two chunks that contain slightly different information about the same topic (for example, two versions of a runbook with different command syntax). The model should surface the discrepancy and cite both sources rather than silently picking one version or blending the two. If it doesn't, add an explicit instruction for handling conflicting documents to your prompt.
Keep a running log of these test cases. Every time a failure slips through in production, turn it into a test case and adjust the prompt. RAG prompts are not write-once artifacts — they evolve as you discover new edge cases.
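A running log of test cases is easier to maintain if it lives in code. The sketch below is one possible shape for it, not a full evaluation framework. The `PromptTestCase` class, the `run_test_cases` helper, and the phrase-matching checks are all assumptions, and `fake_generate` stands in for a real model call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PromptTestCase:
    name: str
    user_message: str
    must_contain: list[str]      # phrases a good response should include
    must_not_contain: list[str]  # phrases that indicate a failure


def run_test_cases(
    generate: Callable[[str], str], cases: list[PromptTestCase]
) -> dict[str, bool]:
    """Run each case through `generate` and report pass/fail by case name."""
    results = {}
    for case in cases:
        answer = generate(case.user_message).lower()
        has_required = all(p.lower() in answer for p in case.must_contain)
        has_forbidden = any(p.lower() in answer for p in case.must_not_contain)
        results[case.name] = has_required and not has_forbidden
    return results


def fake_generate(user_message: str) -> str:
    # Stand-in for a real model call so the sketch runs without an API key.
    return "The provided documents don't address this specific question."


cases = [
    PromptTestCase(
        name="out_of_scope",
        user_message=(
            "CONTEXT:\n[Document ID: runbook-k8s-rollback-v3]\n...\n\n"
            "QUESTION:\nWhat is our parental leave policy?"
        ),
        must_contain=["don't address"],
        must_not_contain=["[Source:"],
    ),
]
print(run_test_cases(fake_generate, cases))  # {'out_of_scope': True}
```

Phrase matching is crude and will miss paraphrases, so treat failures as prompts for manual review rather than as a definitive verdict.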
Work through this exercise end-to-end to solidify what you've learned.
Scenario: You're building a RAG assistant for a small legal services firm. Lawyers and paralegals will use it to query a database of case summaries, jurisdiction-specific statutes, and internal practice guidelines.
Your task:
1. Write a system prompt tailored to this use case. Make sure it includes all five components: role definition, grounding instruction, citation rules, the "I don't know" protocol, and formatting guidance. Use the Cloudstream Engineering prompt as a reference, but adapt it to the legal context.
2. Write a build_user_message() function that formats retrieved chunks with [Document ID: ...] labels, followed by a QUESTION: section.
3. Create three test cases, one for each category (out-of-scope, partial context, conflicting documents), and note what a good response would look like for each one.
4. If you have access to an OpenAI API key, run your prompt against at least one test case and evaluate whether the model's response meets your expectations. If it doesn't, identify which component of the prompt failed and revise it.
The model ignores the context and answers from training knowledge. This almost always means your grounding instruction doesn't include the word "only," or the instruction appears too late in the prompt. Move the grounding instruction earlier and strengthen the language. Try: "You must answer exclusively from the CONTEXT section. Do not use any information that does not appear in the provided documents, even if you know it to be true."
The model hallucinates document IDs in its citations. This happens when the citation format is under-specified, or when the model has no source labels to reference. Make sure every retrieved chunk is labeled before being inserted into the user message, and make sure your citation instruction refers to those labels explicitly.
The model always says "I don't know" even when the answer is in the context. This is an over-correction — your "I don't know" protocol is too aggressive. Check whether your context is being properly formatted and that the document text isn't being truncated. Also check whether the question and context are semantically aligned; if your retrieval system is returning irrelevant chunks, the model is correctly reporting that it can't answer from those documents.
The model produces very long, padded responses. Your formatting guidance either isn't present or isn't strong enough. Add an explicit word limit and reinforce it: "Do not add preamble, caveats, or filler content. Every sentence in your response should be directly traceable to a claim in the context."
Responses degrade when multiple chunks are provided. With long contexts, models tend to use information near the beginning and end more reliably than information in the middle. This is often called the "lost in the middle" problem. Try reordering chunks so the most relevant passages appear first, and consider reducing the number of chunks passed per query.
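The reordering and trimming suggested for the last problem can be sketched in a few lines. This assumes each chunk carries a relevance value under a hypothetical `score` key, where higher means more relevant. Your retriever may expose this differently, or as a distance where lower is better. The `select_chunks` name and the default of three chunks are also assumptions.

```python
def select_chunks(chunks: list[dict], max_chunks: int = 3) -> list[dict]:
    """Keep the highest-scoring chunks, ordered most relevant first."""
    # Assumes a hypothetical "score" key where higher means more relevant.
    ranked = sorted(chunks, key=lambda chunk: chunk["score"], reverse=True)
    return ranked[:max_chunks]


# Made-up chunks and scores, used only to exercise the function.
chunks = [
    {"doc_id": "a", "text": "...", "score": 0.42},
    {"doc_id": "b", "text": "...", "score": 0.91},
    {"doc_id": "c", "text": "...", "score": 0.67},
]
print([c["doc_id"] for c in select_chunks(chunks, max_chunks=2)])  # ['b', 'c']
```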
A RAG pipeline is only as reliable as the instructions you give the model for using it. The system prompt is where you define those instructions, and every component does a specific job: the role definition scopes the domain, the grounding instruction overrides the model's default behavior of drawing on training knowledge, the citation rules create traceability, the "I don't know" protocol prevents confabulation at the edges of the knowledge base, and the formatting guidance keeps responses tight and evidence-based.
The key mental shift is moving from thinking of the system prompt as a "personality setting" to thinking of it as a protocol document — a precise specification of how the model should reason when it's acting as the interface to your knowledge base.
The prompt is where retrieval meets generation. Getting it right is what turns a technically functional RAG pipeline into a system people can actually trust.