
Imagine you're a business analyst who needs to quickly find answers from hundreds of company policy documents, research reports, or customer feedback files. Instead of manually searching through each document, what if you could simply ask questions in plain English and get accurate answers instantly? This is exactly what a document Q&A system does—and it's become one of the most practical applications of AI in the workplace today.
A document Q&A system allows users to upload documents and ask natural language questions about their content. Behind the scenes, the system uses embeddings—mathematical representations of text that capture semantic meaning—to find relevant passages and generate accurate answers. Unlike traditional keyword search, this approach understands context and meaning, making it incredibly powerful for extracting insights from large document collections.
What you'll learn: how embeddings represent meaning, how to chunk documents for processing, how to build a simple vector database for similarity search, and how to combine retrieval with a language model to generate grounded answers.
Prerequisites: You should be comfortable with Python programming and have basic familiarity with APIs. While we'll explain all AI concepts from scratch, some experience with data structures like lists and dictionaries will help you follow the code examples.
Before we build our Q&A system, we need to understand embeddings—the technology that makes semantic search possible.
Think of embeddings as a way to convert text into coordinates in a multi-dimensional space where similar meanings cluster together. Just like GPS coordinates tell you where something is located on Earth, embeddings tell you where text is located in "meaning space."
Here's a simple analogy: imagine you have thousands of books in a library. Traditional search would be like organizing them alphabetically by title—you can find a specific book if you know its exact name, but you can't easily find books about similar topics. Embeddings are like organizing books in a multi-dimensional space where books about similar topics are physically closer together, regardless of their titles.
```python
# Simple example of how embeddings work conceptually
documents = [
    "The company's quarterly revenue increased by 15%",
    "Sales grew significantly in the fourth quarter",
    "Our marketing budget was reduced this year",
    "The weather was sunny today"
]

# After converting to embeddings (simplified representation):
# Revenue doc: [0.8, 0.2, 0.1, 0.0]
# Sales doc:   [0.7, 0.3, 0.1, 0.0]  # Similar to revenue doc
# Budget doc:  [0.1, 0.1, 0.8, 0.0]
# Weather doc: [0.0, 0.0, 0.0, 0.9]  # Completely different

# Query: "How did our sales perform?"
# Query embedding: [0.75, 0.25, 0.05, 0.0]
# This would be closest to the revenue and sales documents
```
In reality, embeddings have hundreds or thousands of dimensions, not just four, which allows them to capture subtle nuances in meaning.
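To make the toy example above concrete, here is a small, self-contained sketch (pure Python, using the made-up four-dimensional vectors from the comments, not real model output) that ranks the documents by cosine similarity to the query vector:

```python
import math

# Toy four-dimensional "embeddings" from the example above (not real model output)
doc_vectors = {
    "revenue": [0.8, 0.2, 0.1, 0.0],
    "sales":   [0.7, 0.3, 0.1, 0.0],
    "budget":  [0.1, 0.1, 0.8, 0.0],
    "weather": [0.0, 0.0, 0.0, 0.9],
}
query_vector = [0.75, 0.25, 0.05, 0.0]  # "How did our sales perform?"

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Sort documents by similarity to the query, most similar first
ranked = sorted(doc_vectors.items(),
                key=lambda kv: cosine_similarity(query_vector, kv[1]),
                reverse=True)
for name, vec in ranked:
    print(f"{name}: {cosine_similarity(query_vector, vec):.3f}")
```

Running this ranks the revenue and sales documents first and the weather document last, matching the intuition above.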
Our document Q&A system consists of four main components working together: a document chunker, an embedding generator, a vector database, and an answer generator.
Here's how they work together:
```python
# High-level system flow
def document_qa_system(documents, question):
    # 1. Process documents into chunks
    chunks = chunk_documents(documents)

    # 2. Generate embeddings for all chunks
    chunk_embeddings = generate_embeddings(chunks)

    # 3. Store in vector database
    vector_db = store_embeddings(chunks, chunk_embeddings)

    # 4. When the user asks a question, embed it and retrieve similar chunks
    question_embedding = generate_embedding(question)
    relevant_chunks = vector_db.search_similar(question_embedding)

    # 5. Generate answer using relevant context
    answer = generate_answer(question, relevant_chunks)
    return answer
```
The first challenge in building our system is handling documents of varying sizes. Large language models have token limits—they can only process a certain amount of text at once. Even if we could process entire documents, doing so would be inefficient and often irrelevant to specific questions.
The solution is chunking: breaking documents into smaller, overlapping segments that maintain context while staying within processing limits.
```python
def chunk_document(text, chunk_size=1000, overlap=200):
    """
    Split a document into overlapping chunks.

    Args:
        text: Full document text
        chunk_size: Maximum characters per chunk
        overlap: Characters to overlap between chunks
    """
    chunks = []
    start = 0

    while start < len(text):
        # Find the end of this chunk
        end = start + chunk_size

        # If this isn't the last chunk, try to break at a sentence boundary
        if end < len(text):
            # Look for the last period within the chunk
            last_period = text.rfind('.', start, end)
            if last_period > start:
                end = last_period + 1

        chunk = text[start:end].strip()
        if chunk:  # Only add non-empty chunks
            chunks.append({
                'text': chunk,
                'start_pos': start,
                'end_pos': end
            })

        if end >= len(text):
            break  # The whole document has been consumed

        # Move start back by the overlap, but always make forward progress
        # (guards against an infinite loop when a sentence break lands
        # inside the overlap region)
        start = max(end - overlap, start + 1)

    return chunks

# Example usage with a policy document
policy_text = """
Employee Handbook - Section 3: Time Off Policies

3.1 Vacation Policy
All full-time employees are eligible for paid vacation time. New employees receive 10 days of vacation per year for their first two years of employment. After two years, employees receive 15 days per year. After five years, employees receive 20 days per year.

Vacation time must be approved by your direct supervisor at least two weeks in advance. Emergency situations may be considered on a case-by-case basis.

3.2 Sick Leave Policy
Employees receive 8 sick days per year. Sick days can be used for personal illness or to care for immediate family members. A doctor's note is required for sick leave longer than three consecutive days.

3.3 Personal Days
All employees receive 3 personal days per year for personal matters that cannot be scheduled outside of work hours.
"""

chunks = chunk_document(policy_text, chunk_size=300, overlap=50)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:")
    print(chunk['text'])
    print("---")
```
Why overlap matters: Overlap ensures that concepts spanning chunk boundaries aren't lost. If a sentence about vacation approval spans two chunks, the overlap keeps both chunks contextually complete.
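To see the overlap at work, here is a stripped-down, fixed-size version of the chunker (no sentence-boundary logic; window sizes are arbitrary demo values, and it assumes `chunk_size > overlap`) that shows the text shared between consecutive chunks:

```python
def sliding_chunks(text, chunk_size=40, overlap=10):
    """Fixed-size sliding windows; each chunk repeats the last `overlap`
    characters of the previous one. Assumes chunk_size > overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sentence = ("Vacation time must be approved by your direct supervisor "
            "at least two weeks in advance.")
demo_chunks = sliding_chunks(sentence, chunk_size=40, overlap=10)

for prev, cur in zip(demo_chunks, demo_chunks[1:]):
    # The tail of one chunk reappears at the head of the next
    print(f"shared: {prev[-10:]!r} == {cur[:10]!r}")
```

Each printed pair is identical: the 10-character tail of one window is the 10-character head of the next, so a phrase cut by one boundary survives intact in the neighboring chunk.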
Now we need to convert our text chunks into embeddings. We'll use OpenAI's embeddings API with the text-embedding-ada-002 model; its newer sibling, text-embedding-3-small, is a drop-in replacement with the same 1536 dimensions.
```python
import openai
import numpy as np
from typing import List, Dict

class EmbeddingGenerator:
    def __init__(self, api_key: str):
        self.client = openai.OpenAI(api_key=api_key)
        self.model = "text-embedding-ada-002"

    def generate_embedding(self, text: str) -> List[float]:
        """Generate an embedding for a single text."""
        response = self.client.embeddings.create(
            input=text,
            model=self.model
        )
        return response.data[0].embedding

    def generate_embeddings_batch(self, texts: List[str]) -> List[List[float]]:
        """Generate embeddings for multiple texts in one API call."""
        response = self.client.embeddings.create(
            input=texts,
            model=self.model
        )
        return [item.embedding for item in response.data]

# Initialize the embedding generator
embedder = EmbeddingGenerator("your-openai-api-key")

# Generate embeddings for our document chunks
chunk_texts = [chunk['text'] for chunk in chunks]
embeddings = embedder.generate_embeddings_batch(chunk_texts)

# Each embedding is a list of 1536 numbers
print(f"Embedding dimensions: {len(embeddings[0])}")
print(f"First 5 values of first embedding: {embeddings[0][:5]}")
```
These embeddings capture the semantic meaning of each chunk. Chunks about similar topics will have similar embedding vectors, even if they use different words.
With our embeddings ready, we need a way to store them and quickly find the most similar ones to a query. This is where vector databases excel—they're optimized for similarity search in high-dimensional spaces.
For our tutorial, we'll build a simple in-memory vector database using cosine similarity:
```python
import numpy as np
from scipy.spatial.distance import cosine
from typing import List, Dict

class SimpleVectorDB:
    def __init__(self):
        self.embeddings = []
        self.chunks = []
        self.metadata = []

    def add_documents(self, chunks: List[Dict], embeddings: List[List[float]]):
        """Add chunks and their embeddings to the database."""
        start_id = len(self.chunks)  # record the id offset before extending
        self.chunks.extend(chunks)
        self.embeddings.extend(embeddings)
        # Store metadata such as a stable chunk id
        self.metadata.extend([{"chunk_id": start_id + i}
                              for i in range(len(chunks))])

    def similarity_search(self, query_embedding: List[float],
                          top_k: int = 5) -> List[Dict]:
        """Find the most similar chunks to a query embedding."""
        if not self.embeddings:
            return []

        # Calculate cosine similarity with all stored embeddings
        similarities = []
        for stored_embedding in self.embeddings:
            # Cosine similarity = 1 - cosine distance
            similarity = 1 - cosine(query_embedding, stored_embedding)
            similarities.append(similarity)

        # Get indices of the top-k most similar embeddings
        top_indices = np.argsort(similarities)[-top_k:][::-1]

        results = []
        for idx in top_indices:
            results.append({
                'chunk': self.chunks[idx],
                'similarity': similarities[idx],
                'metadata': self.metadata[idx]
            })
        return results
```
```python
# Create and populate our vector database
vector_db = SimpleVectorDB()
vector_db.add_documents(chunks, embeddings)

# Test similarity search
query = "How many vacation days do new employees get?"
query_embedding = embedder.generate_embedding(query)
similar_chunks = vector_db.similarity_search(query_embedding, top_k=3)

print("Most relevant chunks for:", query)
for i, result in enumerate(similar_chunks):
    print(f"\n{i+1}. Similarity: {result['similarity']:.3f}")
    print(f"Text: {result['chunk']['text'][:200]}...")
```
Understanding cosine similarity: Cosine similarity measures the angle between two vectors, regardless of their magnitude. A similarity of 1 means the vectors point in the same direction (identical meaning), while 0 means they're perpendicular (unrelated meaning).
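A quick check of the magnitude claim: scaling a vector changes its length but not its direction, so cosine similarity is unchanged. A pure-Python sketch:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

v = [0.8, 0.2, 0.1]
w = [0.7, 0.3, 0.1]

base = cosine_similarity(v, w)
scaled = cosine_similarity([10 * x for x in v], w)  # same direction, 10x magnitude
print(f"original: {base:.6f}, scaled: {scaled:.6f}")  # the two values match

# Identical vectors point the same way; orthogonal vectors share nothing
print(cosine_similarity(v, v))            # 1.0 (up to float rounding)
print(cosine_similarity([1, 0], [0, 1]))  # 0.0
```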
Now comes the magic: using the retrieved relevant chunks to generate natural language answers. We'll send both the user's question and the relevant context to a language model.
```python
class AnswerGenerator:
    def __init__(self, api_key: str):
        self.client = openai.OpenAI(api_key=api_key)

    def generate_answer(self, question: str, relevant_chunks: List[Dict],
                        max_context_length: int = 3000) -> str:
        """Generate an answer using retrieved context."""
        # Combine relevant chunks into context
        context_parts = []
        total_length = 0
        for result in relevant_chunks:
            chunk_text = result['chunk']['text']
            # Stop adding context if we exceed the length limit
            if total_length + len(chunk_text) > max_context_length:
                break
            context_parts.append(chunk_text)
            total_length += len(chunk_text)

        context = "\n\n".join(context_parts)

        # Create the prompt
        system_prompt = """You are a helpful assistant that answers questions based on provided context.
Use only the information given in the context to answer questions.
If the context doesn't contain enough information to answer the question, say so clearly.
Be concise but complete in your answers."""

        user_prompt = f"""Context:
{context}

Question: {question}

Answer:"""

        # Generate the response
        response = self.client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            temperature=0.1,  # Lower temperature for more consistent answers
            max_tokens=500
        )
        return response.choices[0].message.content

# Create the answer generator and test it
answer_gen = AnswerGenerator("your-openai-api-key")

# Get an answer for our vacation question
answer = answer_gen.generate_answer(query, similar_chunks)
print(f"Question: {query}")
print(f"Answer: {answer}")
```
Let's combine all components into a complete document Q&A system:
```python
class DocumentQASystem:
    def __init__(self, openai_api_key: str):
        self.embedder = EmbeddingGenerator(openai_api_key)
        self.vector_db = SimpleVectorDB()
        self.answer_generator = AnswerGenerator(openai_api_key)
        self.processed_documents = []

    def add_document(self, text: str, document_name: str = "Unknown"):
        """Add a document to the system."""
        print(f"Processing document: {document_name}")

        # 1. Chunk the document
        chunks = chunk_document(text)
        print(f"Created {len(chunks)} chunks")

        # 2. Generate embeddings
        chunk_texts = [chunk['text'] for chunk in chunks]
        embeddings = self.embedder.generate_embeddings_batch(chunk_texts)

        # 3. Add metadata
        for chunk in chunks:
            chunk['source_document'] = document_name

        # 4. Store in the vector database
        self.vector_db.add_documents(chunks, embeddings)
        self.processed_documents.append(document_name)
        print(f"Successfully added {document_name} to the system")

    def ask_question(self, question: str, top_k: int = 3) -> Dict:
        """Ask a question and get an answer with source information."""
        if not self.processed_documents:
            return {"error": "No documents have been added to the system"}

        print(f"Processing question: {question}")

        # 1. Generate the query embedding
        query_embedding = self.embedder.generate_embedding(question)

        # 2. Find relevant chunks
        relevant_chunks = self.vector_db.similarity_search(
            query_embedding, top_k=top_k
        )
        if not relevant_chunks:
            return {"error": "No relevant information found"}

        # 3. Generate the answer
        answer = self.answer_generator.generate_answer(question, relevant_chunks)

        # 4. Compile the response with sources
        sources = []
        for result in relevant_chunks:
            sources.append({
                'document': result['chunk']['source_document'],
                'similarity': result['similarity'],
                'text_preview': result['chunk']['text'][:150] + "..."
            })

        return {
            'question': question,
            'answer': answer,
            'sources': sources,
            'confidence': relevant_chunks[0]['similarity']
        }

# Initialize the complete system
qa_system = DocumentQASystem("your-openai-api-key")

# Add our policy document
qa_system.add_document(policy_text, "Employee Handbook")

# Ask questions
questions = [
    "How many vacation days do new employees get?",
    "What's the policy for sick leave?",
    "Do I need supervisor approval for vacation time?",
    "Can I use sick days to care for family members?"
]

for question in questions:
    print("\n" + "=" * 50)
    result = qa_system.ask_question(question)
    if 'error' in result:
        print(f"Error: {result['error']}")
    else:
        print(f"Q: {result['question']}")
        print(f"A: {result['answer']}")
        print(f"Confidence: {result['confidence']:.3f}")
        print("\nSources:")
        for i, source in enumerate(result['sources']):
            print(f"  {i+1}. {source['document']} (similarity: {source['similarity']:.3f})")
```
Now it's time to build your own document Q&A system! Follow these steps:
Step 1: Set up your environment
```python
# Install required packages:
#   pip install openai numpy scipy

import openai
import numpy as np
from scipy.spatial.distance import cosine

# Get your OpenAI API key from https://platform.openai.com/api-keys
API_KEY = "your-api-key-here"
```
Step 2: Create a test document
Create a document about a topic you're familiar with—maybe a project manual, company guidelines, or even a detailed recipe collection. Make it at least 1000 words so you can see chunking in action.
```python
# Example: a comprehensive document about your company's remote work policy
test_document = """
Remote Work Policy - Effective 2024

1. Eligibility and Approval Process
All full-time employees with at least 6 months of tenure are eligible to request remote work arrangements. Part-time employees may be considered on a case-by-case basis...

[Continue with several more sections covering equipment, expectations, communication, etc.]
"""
```
Step 3: Initialize and test your system
```python
# Initialize your Q&A system
qa_system = DocumentQASystem(API_KEY)

# Add your document
qa_system.add_document(test_document, "Remote Work Policy")

# Test with questions
test_questions = [
    "Who is eligible for remote work?",
    "What equipment does the company provide?",
    "How often do remote workers need to come to the office?"
]

for question in test_questions:
    result = qa_system.ask_question(question)
    print(f"Q: {question}")
    print(f"A: {result['answer']}")
    print(f"Confidence: {result['confidence']:.3f}")
    print("---")
```
Step 4: Experiment with different parameters
Try modifying the chunk size, overlap, and number of retrieved chunks (top_k) to see how they affect answer quality:
```python
# Test different chunking strategies
small_chunks = chunk_document(test_document, chunk_size=500, overlap=50)
large_chunks = chunk_document(test_document, chunk_size=1500, overlap=100)
print(f"Small chunks: {len(small_chunks)}")
print(f"Large chunks: {len(large_chunks)}")

# Compare answers with different top_k values
for k in [1, 3, 5]:
    result = qa_system.ask_question("Who is eligible for remote work?", top_k=k)
    print(f"With top_k={k}: {result['answer']}")
```
1. Chunk Size Problems
Mistake: Using chunks that are too small or too large.
Symptoms: with chunks that are too small, answers lose surrounding context and feel fragmented; with chunks that are too large, retrieval pulls in loosely related text that dilutes the answer.
Solution: Start with 800-1200 characters and adjust based on your document structure. Technical documents might need larger chunks, while FAQ-style content works with smaller ones.
```python
# Test different chunk sizes with the same question
def test_chunk_sizes(document, question):
    sizes = [500, 1000, 1500]
    for size in sizes:
        chunks = chunk_document(document, chunk_size=size)
        print(f"Chunk size {size}: {len(chunks)} chunks created")
        # Then test retrieval quality for the question with each size
```
2. Embedding API Errors
Mistake: Sending too much text at once or hitting rate limits.
Symptoms: API errors about token limits or rate limiting
Solution: Batch your requests and add retry logic:
```python
import time
from typing import List

def safe_generate_embeddings(texts: List[str], batch_size: int = 100):
    """Generate embeddings with batching and retry logic."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        for attempt in range(3):  # Retry up to 3 times
            try:
                embeddings = embedder.generate_embeddings_batch(batch)
                all_embeddings.extend(embeddings)
                break
            except Exception as e:
                print(f"Attempt {attempt + 1} failed: {e}")
                time.sleep(2 ** attempt)  # Exponential backoff
    return all_embeddings
```
3. Poor Retrieval Quality
Mistake: Getting irrelevant chunks for questions.
Symptoms: Answers are off-topic or the system says it can't find relevant information
Solutions: rephrase the question closer to the document's wording, tune chunk size, and discard weak matches by enforcing a minimum similarity threshold:
```python
def filtered_similarity_search(self, query_embedding, top_k=5, min_similarity=0.7):
    """SimpleVectorDB method: only return chunks above a similarity threshold."""
    results = self.similarity_search(query_embedding, top_k=10)  # Get more candidates
    filtered_results = [r for r in results if r['similarity'] >= min_similarity]
    return filtered_results[:top_k]  # Return only the top k of the filtered results
```
4. Generic or Unhelpful Answers
Mistake: The language model gives vague responses or admits it doesn't know when relevant information exists.
Solutions: tighten the system prompt so the model quotes the context and flags gaps instead of hedging:
```python
improved_system_prompt = """You are an expert assistant specializing in document analysis.
Your job is to provide accurate, specific answers based solely on the provided context.

Guidelines:
- Quote specific phrases from the context when relevant
- If the context contains partial information, state what you know and what's missing
- Never make up information not present in the context
- Be specific with numbers, dates, and requirements when they appear in the context"""
```
Congratulations! You've built a complete document Q&A system that can understand and answer questions about any text document. Let's recap what you've accomplished:
You learned how embeddings transform text into mathematical representations that capture semantic meaning, enabling computers to understand similarity between different pieces of text. You implemented document chunking strategies that balance context preservation with processing efficiency. You built a vector database that can quickly find relevant information using similarity search, and you integrated a language model to generate natural, accurate answers.
Your system can now: ingest and chunk raw text documents, embed and index those chunks for semantic search, retrieve the passages most relevant to a question, and generate grounded answers complete with source previews and a confidence score.
Immediate next steps to improve your system:
Add document format support: Extend your system to handle PDFs, Word documents, and web pages using libraries like PyPDF2 or BeautifulSoup.
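As a starting point for web pages, here is a minimal sketch using only the standard library's html.parser (a lighter stand-in for BeautifulSoup; PDF extraction would follow the same pattern with a library like PyPDF2):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we're not inside a skipped element
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

page = ("<html><head><style>p{color:red}</style></head>"
        "<body><h1>Policy</h1><p>Remote work is allowed.</p>"
        "<script>alert('x')</script></body></html>")
text = html_to_text(page)
print(text)  # → Policy Remote work is allowed.
```

The extracted text can then be passed straight to `add_document`.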
Implement persistent storage: Replace the in-memory vector database with a persistent solution like Pinecone, Weaviate, or Chroma for production use.
Add conversation memory: Allow follow-up questions by maintaining conversation context and referring back to previous exchanges.
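One possible shape for this is a small helper that keeps recent question-answer turns and replays them as chat messages; the `ConversationMemory` name and structure below are illustrative, not part of the system built above:

```python
from typing import Dict, List

class ConversationMemory:
    """Keep recent Q&A turns and fold them into the chat messages."""

    def __init__(self, max_turns: int = 5):
        self.max_turns = max_turns
        self.turns: List[Dict[str, str]] = []

    def record(self, question: str, answer: str) -> None:
        self.turns.append({"question": question, "answer": answer})
        # Drop the oldest turns once the window is full
        self.turns = self.turns[-self.max_turns:]

    def build_messages(self, system_prompt: str, new_question: str,
                       context: str) -> List[Dict[str, str]]:
        messages = [{"role": "system", "content": system_prompt}]
        # Replay prior exchanges so pronouns like "it"/"that" resolve
        for turn in self.turns:
            messages.append({"role": "user", "content": turn["question"]})
            messages.append({"role": "assistant", "content": turn["answer"]})
        messages.append({"role": "user",
                         "content": f"Context:\n{context}\n\nQuestion: {new_question}"})
        return messages

memory = ConversationMemory(max_turns=5)
memory.record("How many vacation days do new employees get?", "10 days per year.")
msgs = memory.build_messages("Answer from context only.",
                             "Does that increase over time?",
                             "After two years, employees receive 15 days per year.")
print(len(msgs))  # system + one prior Q&A pair + new question = 4
```

The answer generator would then pass this full `messages` list to the chat completion call instead of a single user prompt.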
Advanced improvements to explore: hybrid keyword-plus-semantic search, reranking retrieved chunks before answer generation, and automated evaluation of answer accuracy against a set of known question-answer pairs.
The document Q&A system you've built represents a foundation that scales to handle enterprise-level document collections. Companies use exactly these techniques to build internal knowledge bases, customer support systems, and research tools that save countless hours of manual document review.
Your next learning milestone might be exploring retrieval-augmented generation (RAG) architectures in more depth, or diving into vector database optimization for large-scale systems. The principles you've learned here—embeddings, semantic search, and context-aware answer generation—form the backbone of modern AI-powered information retrieval systems.