
Imagine you're a business analyst who needs to quickly find answers from hundreds of company policy documents, research reports, or customer feedback files. Instead of manually searching through each document, what if you could simply ask questions in plain English and get accurate answers instantly? This is exactly what a document Q&A system does—and it's become one of the most practical applications of AI in the workplace today.
A document Q&A system allows users to upload documents and ask natural language questions about their content. Behind the scenes, the system uses embeddings—mathematical representations of text that capture semantic meaning—to find relevant passages and generate accurate answers. Unlike traditional keyword search, this approach understands context and meaning, making it incredibly powerful for extracting insights from large document collections.
What you'll learn: how embeddings represent meaning, how to chunk documents for processing, how to build a simple vector database for similarity search, and how to combine retrieval with a language model to generate grounded answers.
Prerequisites: You should be comfortable with Python programming and have basic familiarity with APIs. While we'll explain all AI concepts from scratch, some experience with data structures like lists and dictionaries will help you follow the code examples.
Before we build our Q&A system, we need to understand embeddings—the technology that makes semantic search possible.
Think of embeddings as a way to convert text into coordinates in a multi-dimensional space where similar meanings cluster together. Just like GPS coordinates tell you where something is located on Earth, embeddings tell you where text is located in "meaning space."
Here's a simple analogy: imagine you have thousands of books in a library. Traditional search would be like organizing them alphabetically by title—you can find a specific book if you know its exact name, but you can't easily find books about similar topics. Embeddings are like organizing books in a multi-dimensional space where books about similar topics are physically closer together, regardless of their titles.
```python
# Simple example of how embeddings work conceptually
documents = [
    "The company's quarterly revenue increased by 15%",
    "Sales grew significantly in the fourth quarter",
    "Our marketing budget was reduced this year",
    "The weather was sunny today"
]

# After converting to embeddings (simplified representation):
# Revenue doc: [0.8, 0.2, 0.1, 0.0]
# Sales doc:   [0.7, 0.3, 0.1, 0.0]  # Similar to revenue doc
# Budget doc:  [0.1, 0.1, 0.8, 0.0]
# Weather doc: [0.0, 0.0, 0.0, 0.9]  # Completely different

# Query: "How did our sales perform?"
# Query embedding: [0.75, 0.25, 0.05, 0.0]
# This would be closest to the revenue and sales documents
```
In reality, embeddings have hundreds or thousands of dimensions, not just four, which allows them to capture subtle nuances in meaning.
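To make the toy example above concrete, here is a small, self-contained sketch (pure Python, using the made-up four-dimensional vectors from the comments, not real model output) that ranks the documents by cosine similarity to the query vector:

```python
import math

# Toy four-dimensional "embeddings" from the example above (not real model output)
doc_vectors = {
    "revenue": [0.8, 0.2, 0.1, 0.0],
    "sales":   [0.7, 0.3, 0.1, 0.0],
    "budget":  [0.1, 0.1, 0.8, 0.0],
    "weather": [0.0, 0.0, 0.0, 0.9],
}
query_vector = [0.75, 0.25, 0.05, 0.0]  # "How did our sales perform?"

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Sort documents by similarity to the query, most similar first
ranked = sorted(doc_vectors.items(),
                key=lambda kv: cosine_similarity(query_vector, kv[1]),
                reverse=True)
for name, vec in ranked:
    print(f"{name}: {cosine_similarity(query_vector, vec):.3f}")
```

Running this ranks the revenue and sales documents first and the weather document last, matching the intuition above.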
Our document Q&A system consists of four main components working together: a document chunker, an embedding generator, a vector database, and an answer generator.
Here's how they work together:
```python
# High-level system flow
def document_qa_system(documents, question):
    # 1. Process documents into chunks
    chunks = chunk_documents(documents)

    # 2. Generate embeddings for all chunks
    chunk_embeddings = generate_embeddings(chunks)

    # 3. Store in vector database
    vector_db = store_embeddings(chunks, chunk_embeddings)

    # 4. When the user asks a question, embed it and retrieve similar chunks
    question_embedding = generate_embedding(question)
    relevant_chunks = vector_db.search_similar(question_embedding)

    # 5. Generate answer using relevant context
    answer = generate_answer(question, relevant_chunks)
    return answer
```
The first challenge in building our system is handling documents of varying sizes. Large language models have token limits—they can only process a certain amount of text at once. Even if we could process entire documents, doing so would be inefficient and often irrelevant to specific questions.
The solution is chunking: breaking documents into smaller, overlapping segments that maintain context while staying within processing limits.
```python
def chunk_document(text, chunk_size=1000, overlap=200):
    """
    Split a document into overlapping chunks.

    Args:
        text: Full document text
        chunk_size: Maximum characters per chunk
        overlap: Characters to overlap between chunks
    """
    chunks = []
    start = 0

    while start < len(text):
        # Find the end of this chunk
        end = start + chunk_size

        # If this isn't the last chunk, try to break at a sentence boundary
        if end < len(text):
            # Look for the last period within the chunk
            last_period = text.rfind('.', start, end)
            if last_period > start:
                end = last_period + 1

        chunk = text[start:end].strip()
        if chunk:  # Only add non-empty chunks
            chunks.append({
                'text': chunk,
                'start_pos': start,
                'end_pos': end
            })

        if end >= len(text):
            break  # The whole document has been consumed

        # Move start back by the overlap, but always make forward progress
        # (guards against an infinite loop when a sentence break lands
        # inside the overlap region)
        start = max(end - overlap, start + 1)

    return chunks

# Example usage with a policy document
policy_text = """
Employee Handbook - Section 3: Time Off Policies

3.1 Vacation Policy
All full-time employees are eligible for paid vacation time. New employees receive 10 days of vacation per year for their first two years of employment. After two years, employees receive 15 days per year. After five years, employees receive 20 days per year.

Vacation time must be approved by your direct supervisor at least two weeks in advance. Emergency situations may be considered on a case-by-case basis.

3.2 Sick Leave Policy
Employees receive 8 sick days per year. Sick days can be used for personal illness or to care for immediate family members. A doctor's note is required for sick leave longer than three consecutive days.

3.3 Personal Days
All employees receive 3 personal days per year for personal matters that cannot be scheduled outside of work hours.
"""

chunks = chunk_document(policy_text, chunk_size=300, overlap=50)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:")
    print(chunk['text'])
    print("---")
```
Why overlap matters: Overlap ensures that concepts spanning chunk boundaries aren't lost. If a sentence about vacation approval spans two chunks, the overlap keeps both chunks contextually complete.
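To see the overlap at work, here is a stripped-down, fixed-size version of the chunker (no sentence-boundary logic; window sizes are arbitrary demo values, and it assumes `chunk_size > overlap`) that shows the text shared between consecutive chunks:

```python
def sliding_chunks(text, chunk_size=40, overlap=10):
    """Fixed-size sliding windows; each chunk repeats the last `overlap`
    characters of the previous one. Assumes chunk_size > overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sentence = ("Vacation time must be approved by your direct supervisor "
            "at least two weeks in advance.")
demo_chunks = sliding_chunks(sentence, chunk_size=40, overlap=10)

for prev, cur in zip(demo_chunks, demo_chunks[1:]):
    # The tail of one chunk reappears at the head of the next
    print(f"shared: {prev[-10:]!r} == {cur[:10]!r}")
```

Each printed pair is identical: the 10-character tail of one window is the 10-character head of the next, so a phrase cut by one boundary survives intact in the neighboring chunk.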
Now we need to convert our text chunks into embeddings. We'll use OpenAI's embeddings API with the text-embedding-ada-002 model; its newer sibling, text-embedding-3-small, is a drop-in replacement with the same 1536 dimensions.
```python
import openai
import numpy as np
from typing import List, Dict

class EmbeddingGenerator:
    def __init__(self, api_key: str):
        self.client = openai.OpenAI(api_key=api_key)
        self.model = "text-embedding-ada-002"

    def generate_embedding(self, text: str) -> List[float]:
        """Generate an embedding for a single text."""
        response = self.client.embeddings.create(
            input=text,
            model=self.model
        )
        return response.data[0].embedding

    def generate_embeddings_batch(self, texts: List[str]) -> List[List[float]]:
        """Generate embeddings for multiple texts in one API call."""
        response = self.client.embeddings.create(
            input=texts,
            model=self.model
        )
        return [item.embedding for item in response.data]

# Initialize the embedding generator
embedder = EmbeddingGenerator("your-openai-api-key")

# Generate embeddings for our document chunks
chunk_texts = [chunk['text'] for chunk in chunks]
embeddings = embedder.generate_embeddings_batch(chunk_texts)

# Each embedding is a list of 1536 numbers
print(f"Embedding dimensions: {len(embeddings[0])}")
print(f"First 5 values of first embedding: {embeddings[0][:5]}")
```
These embeddings capture the semantic meaning of each chunk. Chunks about similar topics will have similar embedding vectors, even if they use different words.
With our embeddings ready, we need a way to store them and quickly find the most similar ones to a query. This is where vector databases excel—they're optimized for similarity search in high-dimensional spaces.
For our tutorial, we'll build a simple in-memory vector database using cosine similarity:
```python
import numpy as np
from scipy.spatial.distance import cosine
from typing import List, Dict

class SimpleVectorDB:
    def __init__(self):
        self.embeddings = []
        self.chunks = []
        self.metadata = []

    def add_documents(self, chunks: List[Dict], embeddings: List[List[float]]):
        """Add chunks and their embeddings to the database."""
        start_id = len(self.chunks)  # record the id offset before extending
        self.chunks.extend(chunks)
        self.embeddings.extend(embeddings)
        # Store metadata such as a stable chunk id
        self.metadata.extend([{"chunk_id": start_id + i}
                              for i in range(len(chunks))])

    def similarity_search(self, query_embedding: List[float],
                          top_k: int = 5) -> List[Dict]:
        """Find the most similar chunks to a query embedding."""
        if not self.embeddings:
            return []

        # Calculate cosine similarity with all stored embeddings
        similarities = []
        for stored_embedding in self.embeddings:
            # Cosine similarity = 1 - cosine distance
            similarity = 1 - cosine(query_embedding, stored_embedding)
            similarities.append(similarity)

        # Get indices of the top-k most similar embeddings
        top_indices = np.argsort(similarities)[-top_k:][::-1]

        results = []
        for idx in top_indices:
            results.append({
                'chunk': self.chunks[idx],
                'similarity': similarities[idx],
                'metadata': self.metadata[idx]
            })
        return results
```
```python
# Create and populate our vector database
vector_db = SimpleVectorDB()
vector_db.add_documents(chunks, embeddings)

# Test similarity search
query = "How many vacation days do new employees get?"
query_embedding = embedder.generate_embedding(query)
similar_chunks = vector_db.similarity_search(query_embedding, top_k=3)

print("Most relevant chunks for:", query)
for i, result in enumerate(similar_chunks):
    print(f"\n{i+1}. Similarity: {result['similarity']:.3f}")
    print(f"Text: {result['chunk']['text'][:200]}...")
```
Understanding cosine similarity: Cosine similarity measures the angle between two vectors, regardless of their magnitude. A similarity of 1 means the vectors point in the same direction (identical meaning), while 0 means they're perpendicular (unrelated meaning).
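A quick check of the magnitude claim: scaling a vector changes its length but not its direction, so cosine similarity is unchanged. A pure-Python sketch:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

v = [0.8, 0.2, 0.1]
w = [0.7, 0.3, 0.1]

base = cosine_similarity(v, w)
scaled = cosine_similarity([10 * x for x in v], w)  # same direction, 10x magnitude
print(f"original: {base:.6f}, scaled: {scaled:.6f}")  # the two values match

# Identical vectors point the same way; orthogonal vectors share nothing
print(cosine_similarity(v, v))            # 1.0 (up to float rounding)
print(cosine_similarity([1, 0], [0, 1]))  # 0.0
```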
Now comes the magic: using the retrieved relevant chunks to generate natural language answers. We'll send both the user's question and the relevant context to a language model.
```python
class AnswerGenerator:
    def __init__(self, api_key: str):
        self.client = openai.OpenAI(api_key=api_key)

    def generate_answer(self, question: str, relevant_chunks: List[Dict],
                        max_context_length: int = 3000) -> str:
        """Generate an answer using retrieved context."""
        # Combine relevant chunks into context
        context_parts = []
        total_length = 0
        for result in relevant_chunks:
            chunk_text = result['chunk']['text']
            # Stop adding context if we exceed the length limit
            if total_length + len(chunk_text) > max_context_length:
                break
            context_parts.append(chunk_text)
            total_length += len(chunk_text)

        context = "\n\n".join(context_parts)

        # Create the prompt
        system_prompt = """You are a helpful assistant that answers questions based on provided context.
Use only the information given in the context to answer questions.
If the context doesn't contain enough information to answer the question, say so clearly.
Be concise but complete in your answers."""

        user_prompt = f"""Context:
{context}

Question: {question}

Answer:"""

        # Generate the response
        response = self.client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            temperature=0.1,  # Lower temperature for more consistent answers
            max_tokens=500
        )
        return response.choices[0].message.content

# Create the answer generator and test it
answer_gen = AnswerGenerator("your-openai-api-key")

# Get an answer for our vacation question
answer = answer_gen.generate_answer(query, similar_chunks)
print(f"Question: {query}")
print(f"Answer: {answer}")
```
Let's combine all components into a complete document Q&A system:
```python
class DocumentQASystem:
    def __init__(self, openai_api_key: str):
        self.embedder = EmbeddingGenerator(openai_api_key)
        self.vector_db = SimpleVectorDB()
        self.answer_generator = AnswerGenerator(openai_api_key)
        self.processed_documents = []

    def add_document(self, text: str, document_name: str = "Unknown"):
        """Add a document to the system."""
        print(f"Processing document: {document_name}")

        # 1. Chunk the document
        chunks = chunk_document(text)
        print(f"Created {len(chunks)} chunks")

        # 2. Generate embeddings
        chunk_texts = [chunk['text'] for chunk in chunks]
        embeddings = self.embedder.generate_embeddings_batch(chunk_texts)

        # 3. Add metadata
        for chunk in chunks:
            chunk['source_document'] = document_name

        # 4. Store in the vector database
        self.vector_db.add_documents(chunks, embeddings)
        self.processed_documents.append(document_name)
        print(f"Successfully added {document_name} to the system")

    def ask_question(self, question: str, top_k: int = 3) -> Dict:
        """Ask a question and get an answer with source information."""
        if not self.processed_documents:
            return {"error": "No documents have been added to the system"}

        print(f"Processing question: {question}")

        # 1. Generate the query embedding
        query_embedding = self.embedder.generate_embedding(question)

        # 2. Find relevant chunks
        relevant_chunks = self.vector_db.similarity_search(
            query_embedding, top_k=top_k
        )
        if not relevant_chunks:
            return {"error": "No relevant information found"}

        # 3. Generate the answer
        answer = self.answer_generator.generate_answer(question, relevant_chunks)

        # 4. Compile the response with sources
        sources = []
        for result in relevant_chunks:
            sources.append({
                'document': result['chunk']['source_document'],
                'similarity': result['similarity'],
                'text_preview': result['chunk']['text'][:150] + "..."
            })

        return {
            'question': question,
            'answer': answer,
            'sources': sources,
            'confidence': relevant_chunks[0]['similarity']
        }

# Initialize the complete system
qa_system = DocumentQASystem("your-openai-api-key")

# Add our policy document
qa_system.add_document(policy_text, "Employee Handbook")

# Ask questions
questions = [
    "How many vacation days do new employees get?",
    "What's the policy for sick leave?",
    "Do I need supervisor approval for vacation time?",
    "Can I use sick days to care for family members?"
]

for question in questions:
    print("\n" + "=" * 50)
    result = qa_system.ask_question(question)
    if 'error' in result:
        print(f"Error: {result['error']}")
    else:
        print(f"Q: {result['question']}")
        print(f"A: {result['answer']}")
        print(f"Confidence: {result['confidence']:.3f}")
        print("\nSources:")
        for i, source in enumerate(result['sources']):
            print(f"  {i+1}. {source['document']} (similarity: {source['similarity']:.3f})")
```
Now it's time to build your own document Q&A system! Follow these steps:
Step 1: Set up your environment
```python
# Install required packages:
#   pip install openai numpy scipy

import openai
import numpy as np
from scipy.spatial.distance import cosine

# Get your OpenAI API key from https://platform.openai.com/api-keys
API_KEY = "your-api-key-here"
```
Step 2: Create a test document
Create a document about a topic you're familiar with—maybe a project manual, company guidelines, or even a detailed recipe collection. Make it at least 1000 words so you can see chunking in action.
```python
# Example: a comprehensive document about your company's remote work policy
test_document = """
Remote Work Policy - Effective 2024

1. Eligibility and Approval Process
All full-time employees with at least 6 months of tenure are eligible to request remote work arrangements. Part-time employees may be considered on a case-by-case basis...

[Continue with several more sections covering equipment, expectations, communication, etc.]
"""
```
Step 3: Initialize and test your system
```python
# Initialize your Q&A system
qa_system = DocumentQASystem(API_KEY)

# Add your document
qa_system.add_document(test_document, "Remote Work Policy")

# Test with questions
test_questions = [
    "Who is eligible for remote work?",
    "What equipment does the company provide?",
    "How often do remote workers need to come to the office?"
]

for question in test_questions:
    result = qa_system.ask_question(question)
    print(f"Q: {question}")
    print(f"A: {result['answer']}")
    print(f"Confidence: {result['confidence']:.3f}")
    print("---")
```
Step 4: Experiment with different parameters
Try modifying the chunk size, overlap, and number of retrieved chunks (top_k) to see how they affect answer quality:
```python
# Test different chunking strategies
small_chunks = chunk_document(test_document, chunk_size=500, overlap=50)
large_chunks = chunk_document(test_document, chunk_size=1500, overlap=100)
print(f"Small chunks: {len(small_chunks)}")
print(f"Large chunks: {len(large_chunks)}")

# Compare answers with different top_k values
for k in [1, 3, 5]:
    result = qa_system.ask_question("Who is eligible for remote work?", top_k=k)
    print(f"With top_k={k}: {result['answer']}")
```
1. Chunk Size Problems
Mistake: Using chunks that are too small or too large.
Symptoms: with chunks that are too small, answers lose surrounding context and feel fragmented; with chunks that are too large, retrieval pulls in loosely related text that dilutes the answer.
Solution: Start with 800-1200 characters and adjust based on your document structure. Technical documents might need larger chunks, while FAQ-style content works with smaller ones.
```python
# Test different chunk sizes with the same question
def test_chunk_sizes(document, question):
    sizes = [500, 1000, 1500]
    for size in sizes:
        chunks = chunk_document(document, chunk_size=size)
        print(f"Chunk size {size}: {len(chunks)} chunks created")
        # Then test retrieval quality for the question with each size
```
2. Embedding API Errors
Mistake: Sending too much text at once or hitting rate limits.
Symptoms: API errors about token limits or rate limiting
Solution: Batch your requests and add retry logic:
```python
import time
from typing import List

def safe_generate_embeddings(texts: List[str], batch_size: int = 100):
    """Generate embeddings with batching and retry logic."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        for attempt in range(3):  # Retry up to 3 times
            try:
                embeddings = embedder.generate_embeddings_batch(batch)
                all_embeddings.extend(embeddings)
                break
            except Exception as e:
                print(f"Attempt {attempt + 1} failed: {e}")
                time.sleep(2 ** attempt)  # Exponential backoff
    return all_embeddings
```
3. Poor Retrieval Quality
Mistake: Getting irrelevant chunks for questions.
Symptoms: Answers are off-topic or the system says it can't find relevant information
Solutions: rephrase the question closer to the document's wording, tune chunk size, and discard weak matches by enforcing a minimum similarity threshold:
```python
def filtered_similarity_search(self, query_embedding, top_k=5, min_similarity=0.7):
    """SimpleVectorDB method: only return chunks above a similarity threshold."""
    results = self.similarity_search(query_embedding, top_k=10)  # Get more candidates
    filtered_results = [r for r in results if r['similarity'] >= min_similarity]
    return filtered_results[:top_k]  # Return only the top k of the filtered results
```
4. Generic or Unhelpful Answers
Mistake: The language model gives vague responses or admits it doesn't know when relevant information exists.
Solutions: tighten the system prompt so the model quotes the context and flags gaps instead of hedging:
```python
improved_system_prompt = """You are an expert assistant specializing in document analysis.
Your job is to provide accurate, specific answers based solely on the provided context.

Guidelines:
- Quote specific phrases from the context when relevant
- If the context contains partial information, state what you know and what's missing
- Never make up information not present in the context
- Be specific with numbers, dates, and requirements when they appear in the context"""
```
Congratulations! You've built a complete document Q&A system that can understand and answer questions about any text document. Let's recap what you've accomplished:
You learned how embeddings transform text into mathematical representations that capture semantic meaning, enabling computers to understand similarity between different pieces of text. You implemented document chunking strategies that balance context preservation with processing efficiency. You built a vector database that can quickly find relevant information using similarity search, and you integrated a language model to generate natural, accurate answers.
Your system can now: ingest and chunk raw text documents, embed and index those chunks for semantic search, retrieve the passages most relevant to a question, and generate grounded answers complete with source previews and a confidence score.
Immediate next steps to improve your system:
Add document format support: Extend your system to handle PDFs, Word documents, and web pages using libraries like PyPDF2 or BeautifulSoup.
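As a starting point for web pages, here is a minimal sketch using only the standard library's html.parser (a lighter stand-in for BeautifulSoup; PDF extraction would follow the same pattern with a library like PyPDF2):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we're not inside a skipped element
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

page = ("<html><head><style>p{color:red}</style></head>"
        "<body><h1>Policy</h1><p>Remote work is allowed.</p>"
        "<script>alert('x')</script></body></html>")
text = html_to_text(page)
print(text)  # → Policy Remote work is allowed.
```

The extracted text can then be passed straight to `add_document`.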
Implement persistent storage: Replace the in-memory vector database with a persistent solution like Pinecone, Weaviate, or Chroma for production use.
Add conversation memory: Allow follow-up questions by maintaining conversation context and referring back to previous exchanges.
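One possible shape for this is a small helper that keeps recent question-answer turns and replays them as chat messages; the `ConversationMemory` name and structure below are illustrative, not part of the system built above:

```python
from typing import Dict, List

class ConversationMemory:
    """Keep recent Q&A turns and fold them into the chat messages."""

    def __init__(self, max_turns: int = 5):
        self.max_turns = max_turns
        self.turns: List[Dict[str, str]] = []

    def record(self, question: str, answer: str) -> None:
        self.turns.append({"question": question, "answer": answer})
        # Drop the oldest turns once the window is full
        self.turns = self.turns[-self.max_turns:]

    def build_messages(self, system_prompt: str, new_question: str,
                       context: str) -> List[Dict[str, str]]:
        messages = [{"role": "system", "content": system_prompt}]
        # Replay prior exchanges so pronouns like "it"/"that" resolve
        for turn in self.turns:
            messages.append({"role": "user", "content": turn["question"]})
            messages.append({"role": "assistant", "content": turn["answer"]})
        messages.append({"role": "user",
                         "content": f"Context:\n{context}\n\nQuestion: {new_question}"})
        return messages

memory = ConversationMemory(max_turns=5)
memory.record("How many vacation days do new employees get?", "10 days per year.")
msgs = memory.build_messages("Answer from context only.",
                             "Does that increase over time?",
                             "After two years, employees receive 15 days per year.")
print(len(msgs))  # system + one prior Q&A pair + new question = 4
```

The answer generator would then pass this full `messages` list to the chat completion call instead of a single user prompt.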
Advanced improvements to explore: hybrid keyword-plus-semantic search, reranking retrieved chunks before answer generation, and automated evaluation of answer accuracy against a set of known question-answer pairs.
The document Q&A system you've built represents a foundation that scales to handle enterprise-level document collections. Companies use exactly these techniques to build internal knowledge bases, customer support systems, and research tools that save countless hours of manual document review.
Your next learning milestone might be exploring retrieval-augmented generation (RAG) architectures in more depth, or diving into vector database optimization for large-scale systems. The principles you've learned here—embeddings, semantic search, and context-aware answer generation—form the backbone of modern AI-powered information retrieval systems.