Building a Local RAG System with Ollama and Gemma: A Complete Guide
Retrieval-Augmented Generation (RAG) has revolutionized how we interact with large language models by combining the power of information retrieval with text generation. In this comprehensive guide, we’ll walk through creating a complete RAG system that runs entirely on your local machine using Ollama and the Gemma 2B model.
Why Build a Local RAG System?
Before diving into the implementation, let’s understand why building a local RAG system is beneficial:
- Data Privacy: Your sensitive documents never leave your machine
- Cost Efficiency: No API costs or usage limits
- Offline Capability: Works without internet connectivity
- Customization: Full control over the model and parameters
- Scalability: Process large document collections without external constraints
What is RAG?
RAG (Retrieval-Augmented Generation) combines two key components:
- Retrieval System: Searches for relevant information from a knowledge base
- Generation Model: Uses the retrieved information to generate contextually accurate responses
This approach allows language models to access specific, up-to-date information while maintaining their natural language generation capabilities.
RAG System Architecture
Here’s how our RAG system works:
Step-by-Step Process:
1. Document Processing: PDF documents are loaded and split into manageable chunks
2. Embedding Creation: Text chunks are converted into numerical vectors using the embedding model
3. Vector Storage: Embeddings are stored in ChromaDB for efficient similarity search
4. Query Processing: The user's query is embedded and searched against the vector store
5. Context Retrieval: The most relevant chunks are retrieved as context
6. Prompt Construction: The user's query and the retrieved context are combined in a prompt template
7. Response Generation: The LLM generates a response based on the prompt and context
8. Output: The final response is returned to the user
Prerequisites and Setup
System Requirements
- Python 3.8 or higher
- Sufficient RAM (minimum 8 GB recommended for Gemma 2B)
- Available disk space for models and the vector database
Installing Ollama
First, install Ollama on your system by visiting https://ollama.com/download and following the installation instructions for your operating system.
After installation, verify it’s working correctly:
ollama --version
Choosing Your Model
We’ll use Gemma 2B for this tutorial: it’s small enough to run comfortably on 8 GB of RAM and quick enough for interactive use. If you have more resources, larger models such as Llama3 8B can be swapped in for better answer quality (see the Model Selection tips later in this guide).
Downloading the Models
Pull the required models to your local system:
# Download the main language model
ollama pull gemma:2b
# Download the embedding model
ollama pull nomic-embed-text
# Verify downloaded models
ollama list
Note: If Ollama isn’t running, start the server with:
ollama serve
Setting Up the Python Environment
Create an isolated Python environment for your RAG system:
# Create virtual environment
python3 -m venv ragenv
# Activate the environment
source ragenv/bin/activate # On Linux/Mac
# or
ragenv\Scripts\activate # On Windows
# Install required packages
pip install langchain langchain-community langchain-core langchain-ollama chromadb sentence-transformers pypdf python-dotenv unstructured[pdf] tiktoken fastapi uvicorn
Implementation: Step-by-Step Guide
1. Document Loading
The first step is loading your PDF documents. We’ll use LangChain’s UnstructuredPDFLoader, which handles complex layouts:
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import UnstructuredPDFLoader
load_dotenv() # Optional: Load environment variables from .env file
DATA_PATH = "data/"
PDF_FILENAME = "Company_profile.pdf" # Replace with your PDF filename
def load_documents():
    """Loads documents from the specified data path."""
    pdf_path = os.path.join(DATA_PATH, PDF_FILENAME)
    loader = UnstructuredPDFLoader(pdf_path)
    documents = loader.load()
    print(f"Loaded {len(documents)} page(s) from {pdf_path}")
    return documents
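If you have more than one PDF, LangChain’s DirectoryLoader can wrap the same loader class and pick up every file in the data directory. A minimal sketch, reusing the DATA_PATH and UnstructuredPDFLoader from above (the glob pattern is an example and may need adjusting):
from langchain_community.document_loaders import DirectoryLoader, UnstructuredPDFLoader

DATA_PATH = "data/"

def load_all_documents():
    """Loads every PDF found in DATA_PATH using UnstructuredPDFLoader."""
    loader = DirectoryLoader(DATA_PATH, glob="*.pdf", loader_cls=UnstructuredPDFLoader)
    documents = loader.load()
    print(f"Loaded {len(documents)} document(s) from {DATA_PATH}")
    return documents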
2. Document Chunking
Large documents need to be split into smaller, manageable chunks for effective retrieval:
from langchain_text_splitters import RecursiveCharacterTextSplitter
def split_documents(documents):
    """Splits documents into smaller chunks."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
        is_separator_regex=False,
    )
    all_splits = text_splitter.split_documents(documents)
    print(f"Split into {len(all_splits)} chunks")
    return all_splits
Key Parameters Explained:
- chunk_size=1000: Maximum characters per chunk
- chunk_overlap=200: Overlap between chunks to maintain context
- RecursiveCharacterTextSplitter: Attempts semantic splitting (paragraphs, sentences) before falling back to fixed-size splits
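To sanity-check the chunking, you can run the two functions above and inspect the first chunk; a quick sketch (the slice lengths are arbitrary):
chunks = split_documents(load_documents())
print(chunks[0].page_content[:200])  # First 200 characters of the first chunk
print(chunks[0].metadata)            # Source file and other metadata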
3. Embedding Function
Embeddings convert text into numerical vectors that capture semantic meaning:
from langchain_ollama import OllamaEmbeddings
def get_embedding_function(model_name="nomic-embed-text"):
    """Initializes the Ollama embedding function."""
    embeddings = OllamaEmbeddings(model=model_name)
    print(f"Initialized Ollama embeddings with model: {model_name}")
    return embeddings
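If you’re curious what an embedding actually looks like, the returned object exposes an embed_query method; a quick sketch (the sample sentence is arbitrary):
embedding_function = get_embedding_function()
vector = embedding_function.embed_query("What services does the company provide?")
print(len(vector))   # Dimensionality of the embedding vector
print(vector[:5])    # First few components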
4. Vector Store Setup
ChromaDB provides local vector storage and similarity search capabilities:
from langchain_community.vectorstores import Chroma
CHROMA_PATH = "chroma_db" # Directory to store ChromaDB data
def get_vector_store(embedding_function, persist_directory=CHROMA_PATH):
    """Initializes or loads the Chroma vector store."""
    vectorstore = Chroma(
        persist_directory=persist_directory,
        embedding_function=embedding_function
    )
    print(f"Vector store initialized/loaded from: {persist_directory}")
    return vectorstore
5. Document Indexing
Index your document chunks into the vector store:
def index_documents(chunks, embedding_function, persist_directory=CHROMA_PATH):
    """Indexes document chunks into the Chroma vector store."""
    print(f"Indexing {len(chunks)} chunks...")
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embedding_function,
        persist_directory=persist_directory
    )
    vectorstore.persist()  # Ensure data is saved to disk
    print(f"Indexing complete. Data saved to: {persist_directory}")
    return vectorstore
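Before wiring up the LLM, it’s worth confirming that retrieval returns sensible results. A small sketch that reuses the chunks and embedding function from the previous steps and calls Chroma’s similarity_search directly (the query is just an example):
vector_store = index_documents(chunks, embedding_function)
results = vector_store.similarity_search("What services does the company provide?", k=3)
for doc in results:
    print(doc.page_content[:100], "...")  # Preview of each retrieved chunk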
6. Creating the RAG Chain
The RAG chain combines retrieval and generation:
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
def create_rag_chain(vector_store, llm_model_name="gemma:2b", context_window=8192):
    """Creates the RAG chain."""
    # Initialize the LLM
    llm = ChatOllama(
        model=llm_model_name,
        temperature=0.1,        # Lower temperature for more factual responses
        num_ctx=context_window  # Set context window size
    )
    print(f"Initialized ChatOllama with model: {llm_model_name}, context window: {context_window}")

    # Create the retriever
    retriever = vector_store.as_retriever(
        search_type="similarity",
        search_kwargs={'k': 3}  # Retrieve top 3 relevant chunks
    )
    print("Retriever initialized.")

    # Define the prompt template
    template = """You are a helpful and informative bot that answers questions using text from the reference Context included below. Answer the questions as if you were a team member of the company. Don't mention the context or the document. Be sure to respond in a complete sentence, being comprehensive, including all relevant background information. However, you are talking to a non-technical audience, so be sure to break down complicated concepts and strike a friendly and conversational tone. If the Context is irrelevant to the answer, tell them to contact the company to know more.

Question: {question}

Context: {context}
"""
    prompt = ChatPromptTemplate.from_template(template)
    print("Prompt template created.")

    # Define the RAG chain using LCEL
    rag_chain = (
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    print("RAG chain created.")
    return rag_chain
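If you want to try the chain without the API layer, the pieces above can be wired together in a short standalone script; a minimal sketch (the question is only an example):
if __name__ == "__main__":
    docs = load_documents()
    chunks = split_documents(docs)
    embedding_function = get_embedding_function()
    vector_store = index_documents(chunks, embedding_function)
    rag_chain = create_rag_chain(vector_store, llm_model_name="gemma:2b")
    answer = rag_chain.invoke("What services does the company provide?")
    print(answer)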
Building the API
Create a FastAPI application to serve your RAG system:
from fastapi import FastAPI, Request
app = FastAPI()
# Initialize components
docs = load_documents()
chunks = split_documents(docs)
embedding_function = get_embedding_function()
vector_store = get_vector_store(embedding_function)
rag_chain = create_rag_chain(vector_store, llm_model_name="gemma:2b")
@app.post("/rag")
async def rag_endpoint(request: Request):
"""Query the RAG system."""
data = await request.json()
question = data["question"]
answer = rag_chain.invoke(question)
return {"answer": answer}
@app.post("/index_db")
async def index_endpoint(request: Request):
"""Re-index the database."""
vector_store = index_documents(chunks, embedding_function)
return {"message": "Indexing completed successfully"}
Running Your RAG System
Start the API server:
uvicorn your_api_file:app --host 0.0.0.0 --port 8080
Replace your_api_file with your actual Python file name.
Testing Your System
You can test your RAG system using tools like Postman, curl, or any HTTP client:
Query Example:
curl -X POST "http://localhost:8080/rag" \
-H "Content-Type: application/json" \
-d '{"question": "What services does the company provide?"}'
Re-indexing Example:
curl -X POST "http://localhost:8080/index_db" \
-H "Content-Type: application/json"
Performance Optimization Tips
1. Chunk Size Optimization
- Small chunks (200-500 characters): better for precise retrieval, but may lose context
- Large chunks (1000-2000 characters): better context retention, but less precise retrieval
- Experiment with different sizes based on your document type (see the sketch below)
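A quick way to compare settings is to re-run the splitter with a few different chunk sizes and look at the resulting chunk counts; a small sketch (the sizes and overlap ratio are just examples):
from langchain_text_splitters import RecursiveCharacterTextSplitter

docs = load_documents()
for size in (500, 1000, 2000):
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=size // 5)
    print(f"chunk_size={size}: {len(splitter.split_documents(docs))} chunks")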
2. Retrieval Parameters
- Adjust the k value in search_kwargs based on your needs
- Consider using MMR (Maximum Marginal Relevance) for diverse results:

retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={'k': 3, 'fetch_k': 10}
)
3. Model Selection
- Use smaller models (Gemma 2B, TinyLlama) for faster responses
- Use larger models (Llama3 8B) for better quality when speed isn’t critical
Common Issues and Solutions
Issue 1: Ollama Server Not Running
Solution: Start Ollama server before running your application:
ollama serve
Issue 2: Memory Issues
Solution: Use a smaller model or reduce context window size:
llm = ChatOllama(
    model="gemma:2b",
    num_ctx=4096  # Reduced from 8192
)
Issue 3: Poor Retrieval Quality
Solution:
- Experiment with different chunk sizes
- Adjust the number of retrieved chunks
- Use a better embedding model if one is available
Best Practices
- Document Preparation: Clean your PDFs and remove unnecessary content before indexing
- Prompt Engineering: Customize the prompt template for your specific use case
- Monitoring: Log queries and responses to improve system performance
- Regular Updates: Re-index documents when content changes
- Testing: Test with various question types to ensure robust performance
Conclusion
Building a local RAG system with Ollama and Gemma provides a powerful, privacy-focused solution for document-based question answering. This setup offers complete control over your data while maintaining the sophisticated capabilities of modern language models.
The system we’ve built runs end to end and can be extended with additional features like:
- Multiple document format support
- Advanced retrieval strategies
- User authentication
- Query caching
- Performance monitoring
With this foundation, you can customize and scale your RAG system to meet specific requirements while keeping everything running locally and securely.