You've probably noticed that asking ChatGPT the same question twice can give you wildly different answers — sometimes brilliantly useful, sometimes frustratingly vague. That's not a bug in the model. It's a signal that how you communicate with a language model matters enormously. The difference between a mediocre AI integration and a reliable, production-ready one often comes down to three things: a well-crafted system prompt, a handful of carefully chosen examples, and a properly tuned temperature setting.
This lesson treats prompt engineering the way it deserves to be treated — as a genuine engineering discipline, not a collection of magic phrases. You'll learn the mechanics behind why prompts work the way they do, then build up a toolkit of techniques that you can apply immediately in your own projects. We're going to use a running scenario throughout: you're building an AI assistant that classifies customer support tickets for a SaaS company and suggests a first-response draft. It's the kind of real task data and AI practitioners encounter constantly, and it gives us something concrete to iterate on.
By the end of this lesson, you'll be able to design prompts that produce consistent, predictable output — the kind you can actually build software around.
What you'll learn:
Before we touch any technique, you need a mental model of what's happening under the hood. An LLM doesn't understand your prompt the way a human reads a sentence. It's a statistical prediction engine: given the text it has seen so far, what token (roughly, what word-chunk) is most likely to come next?
This is the fundamental insight that makes prompt engineering make sense. When you write a prompt, you're not commanding the model — you're setting up a context that makes certain kinds of responses statistically likely. A prompt that begins with "You are a professional legal analyst specializing in contract review..." doesn't magically install legal knowledge. It shifts the probability distribution of the model's next tokens toward formal, precise, legally-flavored language — because in the training data, text that followed phrases like that tended to be formal and legally precise.
Keep this in mind as we go. Every technique we discuss is really just a way of manipulating that probability distribution in your favor.
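Most LLM APIs split your input into two or three distinct "roles":
- System: instructions that set the model's role and rules for the whole conversation.
- User: the message from the person (or program) making the request.
- Assistant: the model's own replies, including any earlier turns you pass back in as context.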
Understanding these roles is the foundation of everything that follows.
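As a rough illustration of how these roles appear in a request, the sketch below assembles the kind of message list that chat-style APIs commonly accept. The helper name `build_messages` is hypothetical; the `role`/`content` dictionary shape is the one used in the OpenAI example later in this lesson.

```python
def build_messages(system_prompt, user_message, history=None):
    """Assemble a chat-style message list: system first, prior turns next, new user turn last."""
    messages = [{"role": "system", "content": system_prompt}]
    # Prior turns (if any) are dicts with role "user" or "assistant".
    messages.extend(history or [])
    messages.append({"role": "user", "content": user_message})
    return messages


messages = build_messages(
    system_prompt="You are a customer support specialist for Meridian.",
    user_message="I was charged twice for my subscription this month.",
)
for message in messages:
    print(f"{message['role']:>9}: {message['content']}")
```

Keeping the system prompt in one place like this makes it easy to iterate on it without touching the code that handles user input.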
A system prompt is a set of instructions sent ahead of the user's message, invisible to the end user but deeply influential on the model's behavior. Think of it like a job description combined with a style guide, handed to a contractor before they start work. Without it, the model defaults to a kind of "helpful generalist" mode, which is useful for casual chat but too unpredictable for a production application.
Good system prompts typically define four things:
Let's write one for our customer support scenario. Here's a weak first attempt:
You are a helpful customer support assistant. Help users with their questions.
This is vague. "Helpful" and "help" are near-meaningless to a model — they're compatible with almost any response. Here's a much better version:
You are a customer support specialist for Meridian, a B2B project management SaaS platform. Your job has two parts:
1. Classify the incoming support ticket into exactly one of these categories:
- Billing
- Bug Report
- Feature Request
- Account Access
- General Inquiry
2. Draft a professional, empathetic first response to the customer. The response should:
- Acknowledge their issue specifically (don't use generic openers like "Thank you for reaching out")
- Set a realistic expectation for next steps
- Be no longer than 150 words
Respond in this exact format:
CATEGORY: [category name]
DRAFT RESPONSE: [your drafted reply]
Do not answer questions outside the scope of Meridian's support function. If a ticket is abusive or off-topic, classify it as General Inquiry and respond professionally that it has been forwarded to the appropriate team.
Notice what changed. The second version specifies exactly what categories are valid (so the model can't invent new ones), describes the output format precisely, sets a word limit, handles edge cases explicitly, and even calls out what not to do. Every sentence closes a door on unwanted behavior.
Tip: Write system prompts defensively. Ask yourself: "What's the weirdest, most off-track response the model could give that technically follows my instructions?" Then add a rule that prevents it.
In OpenAI's Playground, navigate to the "Chat" mode. You'll see a text area labeled "System." Paste your system prompt there. Then in the user turn, type a realistic ticket:
User message: "I was charged twice for my subscription this month. I need a refund immediately and I want to know why this happened. This is unacceptable."
Run it and evaluate: Did it produce exactly one category? Is the format correct? Is the tone right? Does the response acknowledge the specific issue (a double charge) rather than giving a generic reply?
If anything's off, edit the system prompt — not the user message. The user message represents real-world input you can't control. The system prompt is your lever.
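Some of the checks in this section (exactly one valid category, the right format, the word limit) can be automated once you move from the Playground to code. The following is a minimal sketch, assuming the model's reply follows the `CATEGORY:` / `DRAFT RESPONSE:` layout from the system prompt; `check_response` and the constants are hypothetical names, and tone or whether the draft addresses the specific issue still needs a human reviewer.

```python
import re

VALID_CATEGORIES = {"Billing", "Bug Report", "Feature Request", "Account Access", "General Inquiry"}
MAX_DRAFT_WORDS = 150

# Matches "CATEGORY: <name>" on one line, then "DRAFT RESPONSE: <text>" to the end.
RESPONSE_PATTERN = re.compile(
    r"\ACATEGORY:\s*(?P<category>[^\n]+?)\s*\nDRAFT RESPONSE:\s*(?P<draft>.+)\Z",
    re.DOTALL,
)


def check_response(text):
    """Return a list of problems found in a model reply; an empty list means it passed."""
    match = RESPONSE_PATTERN.match(text.strip())
    if match is None:
        return ["output does not match the CATEGORY / DRAFT RESPONSE format"]

    problems = []
    category = match.group("category")
    draft = match.group("draft").strip()
    if category not in VALID_CATEGORIES:
        problems.append(f"unknown category: {category!r}")
    if len(draft.split()) > MAX_DRAFT_WORDS:
        problems.append(f"draft exceeds {MAX_DRAFT_WORDS} words")
    return problems


reply = "CATEGORY: Billing\nDRAFT RESPONSE: I can see the duplicate charge and have requested a refund."
print(check_response(reply) or "all checks passed")
```

Running a check like this over a batch of realistic tickets after every system prompt change gives a quick signal on whether an edit helped or hurt.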
Even a perfect system prompt leaves a lot of room for interpretation. Few-shot prompting fills that gap by showing the model exactly what good output looks like, using examples embedded in the prompt itself.
"Few-shot" means you provide a few example input-output pairs before the actual query. "Zero-shot" means you provide none. "One-shot" means exactly one example. The number matters — but so does the quality and diversity of your examples.
Remember our mental model: the model is predicting what comes next. When you show it three examples of a specific input-output pattern, you're essentially saying "the next thing in this sequence follows this pattern too." The model picks up on the structure, tone, level of formality, output format, and even the length of your examples. This is in-context learning: the model's weights don't change, and the pattern is picked up entirely from the text of the prompt.
Here's the key principle: your examples should cover the cases where your system prompt is ambiguous.
If you already have high confidence the model will handle billing questions correctly from the system prompt alone, don't waste an example slot on billing. Use examples for the edge cases — the tickets that could reasonably fit two categories, or the tone that's hard to describe in words.
Let's build out an example set for our support classifier. In API terms, this is how you'd structure it (Python using the OpenAI library):
```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": """You are a customer support specialist for Meridian, a B2B project management SaaS platform. Your job has two parts:
1. Classify the incoming support ticket into exactly one of these categories:
- Billing
- Bug Report
- Feature Request
- Account Access
- General Inquiry
2. Draft a professional, empathetic first response to the customer. The response should:
- Acknowledge their issue specifically
- Set a realistic expectation for next steps
- Be no longer than 150 words
Respond in this exact format:
CATEGORY: [category name]
DRAFT RESPONSE: [your drafted reply]"""
        },
        # Few-shot example 1: Bug Report
        {
            "role": "user",
            "content": "The Gantt chart view keeps crashing whenever I try to drag a task past the end of the month. This has happened three times today and I have a client demo tomorrow."
        },
        {
            "role": "assistant",
            "content": """CATEGORY: Bug Report
DRAFT RESPONSE: We're sorry to hear the Gantt chart is crashing during task drag operations, especially with a client demo coming up tomorrow. I've flagged this as a high-priority bug for our engineering team, and I've noted the specific trigger (dragging past month-end) to help them reproduce it quickly. In the meantime, try switching to the List view for your demo, which doesn't have this limitation. I'll follow up with you directly once we have a fix confirmed. You can expect an update within one business day."""
        },
        # Few-shot example 2: Account Access (ambiguous: could seem like billing)
        {
            "role": "user",
            "content": "My team members can't log in since we upgraded our plan. It says their seats are at capacity but we paid for 20 seats and only have 12 users."
        },
        {
            "role": "assistant",
            "content": """CATEGORY: Account Access
DRAFT RESPONSE: It sounds like your seat count didn't update correctly after your plan upgrade. This occasionally happens when the billing and provisioning systems don't sync immediately. I've escalated this to our account provisioning team, who will audit your seat allocation against your current subscription. Your 12 team members should have access restored within 2 hours. If the issue persists beyond that, reply to this ticket and we'll prioritize an immediate fix. We apologize for the disruption."""
        },
        # Few-shot example 3: Feature Request with frustrated tone
        {
            "role": "user",
            "content": "Why is there still no way to export reports as Excel files? We've been asking for this for two years. Every other tool has this."
        },
        {
            "role": "assistant",
            "content": """CATEGORY: Feature Request
DRAFT RESPONSE: Your frustration is completely understandable. Excel export is one of our most-requested features, and I appreciate you pushing for it. I've added your vote to the active feature request and noted your specific use case, which helps our product team prioritize. While native Excel export isn't available yet, our CSV export (found under Reports → Export → CSV) opens directly in Excel and preserves formatting for most report types. I'll make sure you're notified as soon as Excel export ships. Thank you for sticking with us."""
        },
        # The actual query
        {
            "role": "user",
            "content": "I was charged twice for my subscription this month. I need a refund immediately."
        }
    ]
)

print(response.choices[0].message.content)
```
A few principles for building a strong example set:
Diversity over quantity. Three well-chosen examples covering different categories, tones, and edge cases outperform ten examples of the same kind. Your examples should collectively span the space of inputs you expect.
Put your most instructive example last. Models tend to give slightly more weight to examples closer to the actual query, so if you have one example that captures the exact tone or format you want, put it last rather than first.
Match example quality to desired output quality. This sounds obvious but gets ignored constantly: if your example responses are mediocre, your real responses will be mediocre. The model is learning the level of quality, not just the format.
Use real data when possible. If you have historical support tickets and actual good responses written by your best support agent, use those. They'll outperform anything you manufacture.
Warning: Don't use examples with personally identifiable information (PII) in production prompts — names, email addresses, account numbers. Scrub or anonymize your example data before embedding it in prompts.
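The principles above can be sketched as a small helper that turns stored examples into alternating user/assistant turns, preserving the order you give it so the most instructive example can go last. This is an illustrative sketch only: `build_few_shot_messages` and `redact_emails` are hypothetical names, and the email regex is a deliberately simple stand-in for real anonymization, which would also need to handle names, account numbers, and similar data.

```python
import re

# Simplified pattern; an assumption for illustration, not a complete PII scrubber.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact_emails(text):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub("[email]", text)


def build_few_shot_messages(system_prompt, examples, query):
    """Turn (ticket, category, draft) examples into alternating user/assistant turns.

    Examples are emitted in the order given, so place the most instructive one last.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for ticket, category, draft in examples:
        messages.append({"role": "user", "content": redact_emails(ticket)})
        messages.append({
            "role": "assistant",
            "content": f"CATEGORY: {category}\nDRAFT RESPONSE: {redact_emails(draft)}",
        })
    messages.append({"role": "user", "content": query})
    return messages


examples = [
    ("Please email the invoice to jane.doe@example.com.", "Billing", "I've sent the invoice to the address on file."),
]
for message in build_few_shot_messages("You are a support specialist.", examples, "I was charged twice."):
    print(message["role"], "->", message["content"])
```

Storing examples as data rather than hard-coding them in the message list also makes it easier to swap them out when you find a better one.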
You now know how to shape what the model does. Temperature controls how deterministic it is when doing it.
Here's the honest explanation: when the model predicts the next token, it produces a score for every possible token in its vocabulary, and those scores are converted into probabilities. Temperature is a parameter that sharpens or flattens that probability distribution before the model samples from it: low values concentrate probability on the top-ranked tokens, while high values spread it across less likely ones.
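To make the sharpening and flattening concrete, here is a small sketch of the standard formulation: divide each raw score by the temperature, then apply softmax. The three tokens and their scores are made up for illustration, and real providers may differ in the details of how they apply temperature.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing each score by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max before exponentiating for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate next tokens.
tokens = ["Billing", "Account", "General"]
logits = [2.0, 1.0, 0.5]

for temperature in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    summary = ", ".join(f"{t}={p:.2f}" for t, p in zip(tokens, probs))
    print(f"T={temperature}: {summary}")
```

At the lowest temperature nearly all of the probability lands on the top-scoring token; as temperature rises, the gap between candidates narrows and the lower-ranked tokens are sampled more often.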
This isn't about creative vs. boring. It's about matching the randomness level to what your task actually needs.
| Task Type | Recommended Temperature | Why |
|---|---|---|
| Data extraction, classification | 0.0–0.2 | You want the same answer every time for the same input |
| Structured summarization | 0.2–0.4 | Consistent but natural-sounding |
| Customer response drafting | 0.4–0.6 | Professional but not robotic |
| Brainstorming, ideation | 0.7–1.0 | Variety is the point |
| Creative writing | 0.8–1.2 | Unexpected combinations are valuable |
For our support ticket classifier, the classification itself (choosing one of five categories) should be at a very low temperature — you do not want randomness there. The draft response can tolerate a bit more, since slightly varied phrasing is actually desirable. In practice, you'd set a single temperature for the whole call, so aim for the middle: around 0.3–0.4 for this use case.
Setting temperature in the API is one line:
```python
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.3,  # Low but not zero: consistent classification, natural prose
    messages=[...]  # your messages here
)
```
Tip: Temperature 0.0 doesn't mean "the model will always give identical output." There's still some non-determinism in how modern LLMs are served at scale. But it gets you as close to deterministic as practically possible. For truly mission-critical consistency, combine low temperature with explicit output format constraints in your system prompt.
Many beginners think high temperature makes the model "smarter" or "more creative." It doesn't add capability — it adds unpredictability. A model at temperature 1.2 isn't accessing deeper knowledge; it's just taking wilder guesses. Use high temperature only when variety itself is the deliverable.
System prompts, few-shot examples, and temperature don't operate independently. They interact, and understanding the interaction makes you a better prompt engineer.
Think of it this way:
A tight system prompt with good examples and low temperature gives you the most predictable, format-consistent output. That's what you want for classification tasks, data extraction, or any pipeline where downstream code is parsing the model's response.
A lighter system prompt with no examples and higher temperature gives you maximum flexibility — useful early in development when you're still figuring out what you want.
A common mistake is using high temperature with a strict output format requirement. You're essentially telling the model "follow this exact format... but feel free to be unpredictable." Those instructions fight each other. If your system prompt says "respond in JSON format," set temperature below 0.4.
Here's a structured exercise to cement everything from this lesson. Set aside 30–45 minutes.
Scenario: You're building a prompt for a financial analyst assistant that categorizes transaction descriptions from bank statements into expense categories (Travel, Software & Subscriptions, Office Supplies, Meals & Entertainment, Professional Services, Other) and flags any that look potentially fraudulent for human review.
Step 1: Write a system prompt that defines the role, lists the valid categories, specifies a JSON output format (so it's easy to parse programmatically), and includes rules for the fraud flag.
Your output format should look like:
```json
{
  "category": "Travel",
  "fraud_flag": false,
  "reasoning": "One sentence explaining the classification"
}
```
Step 2: Create three few-shot examples covering: a clear-cut case, an ambiguous case (e.g., "AMZN*MKTP" could be office supplies or personal), and a fraud-flagged case.
Step 3: Run your prompt on these five transaction descriptions:
Step 4: Run the same prompt at temperature 0.0, 0.5, and 1.0. Compare the results. Does the category change? Does the reasoning change? Does the fraud flag flip? Write down what you observe.
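If you would rather script Steps 3 and 4 than compare outputs by eye, the sketch below shows one way to do it: validate each JSON reply against the format from Step 1, then flag any transaction whose category or fraud flag changes across temperatures. All function names are hypothetical, and `ask_model` is a function you supply that sends one description to the model at a given temperature and returns the raw text reply.

```python
import json

VALID_CATEGORIES = {
    "Travel", "Software & Subscriptions", "Office Supplies",
    "Meals & Entertainment", "Professional Services", "Other",
}


def parse_classification(raw):
    """Parse the model's JSON reply and check it against the expected fields.

    Raises ValueError describing the first problem found.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    category = data.get("category")
    if not isinstance(category, str) or category not in VALID_CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    if not isinstance(data.get("fraud_flag"), bool):
        raise ValueError("fraud_flag must be true or false")
    if not isinstance(data.get("reasoning"), str):
        raise ValueError("reasoning must be a string")
    return data


def sweep_temperatures(ask_model, descriptions, temperatures=(0.0, 0.5, 1.0)):
    """Classify every description at every temperature and collect the parsed results."""
    results = {}
    for description in descriptions:
        results[description] = {
            t: parse_classification(ask_model(description, t)) for t in temperatures
        }
    return results


def find_disagreements(results, field):
    """Return the descriptions whose `field` value differs across temperatures."""
    unstable = []
    for description, by_temperature in results.items():
        values = {run[field] for run in by_temperature.values()}
        if len(values) > 1:
            unstable.append(description)
    return unstable
```

Calling `find_disagreements(results, "category")` and `find_disagreements(results, "fraud_flag")` on the sweep output gives you the list of transactions worth inspecting by hand. The `reasoning` text will usually vary between runs, so it is better read than compared for equality.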
The model ignores your format instructions. Usually caused by format instructions buried in the middle of a long system prompt, or a high temperature fighting your structure. Move format instructions to a clearly labeled section near the end of the system prompt, keep them as explicit as possible ("Respond with only valid JSON: no markdown, no explanation"), and lower your temperature.
Output is technically correct but weirdly robotic. You over-constrained the tone. System prompts that specify every word ("Always begin by saying...") produce stilted output. Instead, describe the desired quality ("Direct and warm, like a knowledgeable colleague — not formal, not casual") and let examples demonstrate the actual tone.
Few-shot examples are making the model ignore edge cases. Your examples are too similar. If all three examples are billing complaints handled with apologies, the model starts treating "apologize" as the universal response. Diversify your examples deliberately.
Classification is inconsistent — same input, different categories on different runs. Temperature is too high for a classification task. Drop it to 0.1–0.2. Also check whether your category definitions overlap — if "Account Access" and "Billing" could both apply to the same ticket, the model will oscillate. Add tie-breaking rules to your system prompt ("If a ticket involves both billing and account access, classify as Account Access unless the primary complaint is about a charge amount").
The model keeps adding unsolicited explanations after the formatted output. Add an explicit instruction: "Do not include any text before or after the formatted output." You can also include that constraint in your few-shot examples by showing responses that are only the formatted output with no preamble.
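For the inconsistent-classification problem in particular, it helps to put a number on how inconsistent the output is before and after a fix. A minimal sketch, assuming you have already collected the category from several runs on the same ticket; `category_agreement` is a hypothetical helper and the run data below is invented for illustration.

```python
from collections import Counter


def category_agreement(categories):
    """Return the most common category and the fraction of runs that produced it."""
    if not categories:
        raise ValueError("need at least one run")
    counts = Counter(categories)
    top_category, top_count = counts.most_common(1)[0]
    return top_category, top_count / len(categories)


# Categories from ten hypothetical runs on the same ticket.
runs = ["Billing"] * 7 + ["Account Access"] * 3
top, agreement = category_agreement(runs)
print(f"{top}: {agreement:.0%} agreement")
```

If agreement stays well below 100% even at low temperature, overlapping category definitions are the more likely culprit, and a tie-breaking rule in the system prompt is the next thing to try.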
Let's recap what you've built up in this lesson:
System prompts are the foundation of any reliable LLM integration. They define the model's role, constrain its behavior, specify output formats, and handle edge cases before the user ever types anything. Write them defensively, test them with adversarial inputs, and iterate.
Few-shot examples teach the model what "correct" looks like in practice. Three to five diverse, high-quality examples will often outperform a longer system prompt trying to describe the same thing in words. Use real examples from your domain when you have them.
Temperature controls the randomness of the model's token selection. Low temperature (0–0.3) for structured, consistent tasks. Medium (0.4–0.7) for drafting and natural language. High (0.8+) only when variety is the point. Match temperature to the strictness of your output requirements.
Together, these three tools let you move from "prompting" — ad-hoc text thrown at a model — to genuine prompt engineering: a systematic, testable, iterable process for building AI behavior you can depend on.
Where to go next:
The best prompt engineers aren't the ones who write the cleverest individual prompts — they're the ones who treat prompts like software: versioned, tested, and designed to handle the unexpected.