Building a Local RAG System with Ollama and Gemma: A Complete Guide

Retrieval-Augmented Generation (RAG) has revolutionized how we interact with large language models by combining the power of information retrieval with text generation. In this comprehensive guide, we’ll walk through creating a complete RAG system that runs entirely on your local machine using Ollama and the Gemma 2B model.

Why Build a Local RAG System?

Before diving into the implementation, let’s understand why building a local RAG system is beneficial:

  • Data Privacy: Your sensitive documents never leave your machine

  • Cost Efficiency: No API costs or usage limits

  • Offline Capability: Works without internet connectivity

  • Customization: Full control over the model and parameters

  • Scalability: Process large document collections without external constraints

What is RAG?

RAG (Retrieval-Augmented Generation) combines two key components:

  1. Retrieval System: Searches for relevant information from a knowledge base

  2. Generation Model: Uses the retrieved information to generate contextually accurate responses

This approach allows language models to access specific, up-to-date information while maintaining their natural language generation capabilities.

RAG System Architecture

Here’s how our RAG system works:

Step-by-Step Process:

  1. Document Processing: PDF documents are loaded and split into manageable chunks

  2. Embedding Creation: Text chunks are converted to numerical vectors using the embedding model

  3. Vector Storage: Embeddings are stored in ChromaDB for efficient similarity search

  4. Query Processing: User query is embedded and searched against the vector store

  5. Context Retrieval: Most relevant chunks are retrieved as context

  6. Prompt Construction: User query + retrieved context are combined in a prompt template

  7. Response Generation: LLM generates a response based on the prompt and context

  8. Output: Final response is returned to the user

Prerequisites and Setup

System Requirements

  • Python 3.8 or higher

  • Sufficient RAM (minimum 8GB recommended for Gemma 2B)

  • Available disk space for models and vector databases

Installing Ollama

First, install Ollama on your system by visiting https://ollama.com/download and following the installation instructions for your operating system.

After installation, verify it’s working correctly:

ollama --version

Choosing Your Model

We’ll use Gemma 2B for this tutorial. It is small enough to run comfortably on a typical laptop, while larger models such as Llama 3 8B give higher-quality answers at the cost of more RAM and slower responses.

Downloading the Models

Pull the required models to your local system:

# Download the main language model
ollama pull gemma:2b

# Download the embedding model
ollama pull nomic-embed-text

# Verify downloaded models
ollama list

Note: If Ollama isn’t running, start the server with:

ollama serve

Setting Up the Python Environment

Create an isolated Python environment for your RAG system:

# Create virtual environment
python3 -m venv ragenv

# Activate the environment
source ragenv/bin/activate  # On Linux/Mac
# or
ragenv\Scripts\activate  # On Windows

# Install required packages
pip install langchain langchain-community langchain-core langchain-ollama chromadb sentence-transformers pypdf python-dotenv "unstructured[pdf]" tiktoken fastapi uvicorn

Implementation: Step-by-Step Guide

1. Document Loading

The first step is loading your PDF documents. We’ll use LangChain’s UnstructuredPDFLoader for handling complex layouts:

import os
from dotenv import load_dotenv
from langchain_community.document_loaders import UnstructuredPDFLoader

load_dotenv() # Optional: Load environment variables from .env file

DATA_PATH = "data/"
PDF_FILENAME = "Company_profile.pdf"  # Replace with your PDF filename

def load_documents():
    """Loads documents from the specified data path."""
    pdf_path = os.path.join(DATA_PATH, PDF_FILENAME)
    loader = UnstructuredPDFLoader(pdf_path)
    documents = loader.load()
    print(f"Loaded {len(documents)} document(s) from {pdf_path}")
    return documents
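
UnstructuredPDFLoader copes well with complex layouts, but it relies on the fairly heavy unstructured dependency. For simple PDFs, PyPDFLoader (backed by the pypdf package installed earlier) is a lighter alternative that returns one Document per page. A possible variant, reusing DATA_PATH and PDF_FILENAME from the snippet above:

from langchain_community.document_loaders import PyPDFLoader

def load_documents_simple():
    """Loads a PDF with PyPDFLoader, producing one Document per page."""
    pdf_path = os.path.join(DATA_PATH, PDF_FILENAME)
    loader = PyPDFLoader(pdf_path)
    documents = loader.load()
    print(f"Loaded {len(documents)} page(s) from {pdf_path}")
    return documents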

2. Document Chunking

Large documents need to be split into smaller, manageable chunks for effective retrieval:

from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_documents(documents):
    """Splits documents into smaller chunks."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
        is_separator_regex=False,
    )
    all_splits = text_splitter.split_documents(documents)
    print(f"Split into {len(all_splits)} chunks")
    return all_splits

Key Parameters Explained:

  • chunk_size=1000: Maximum characters per chunk

  • chunk_overlap=200: Overlap between chunks to maintain context

  • RecursiveCharacterTextSplitter: Tries to split on natural boundaries (paragraphs, then sentences) before falling back to fixed-size splits (see the quick check below)
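
To see the effect of these parameters, you can inspect a few chunks after splitting. This is a quick sanity check rather than part of the pipeline; it reuses the load_documents and split_documents functions defined above and prints chunk sizes plus the text shared by two consecutive chunks.

# Quick sanity check of the chunking parameters
docs = load_documents()
chunks = split_documents(docs)

# Print the size of the first few chunks
for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i}: {len(chunk.page_content)} characters")

# The tail of one chunk should largely reappear at the start of the next (up to chunk_overlap characters)
if len(chunks) > 1:
    print("End of chunk 0:", chunks[0].page_content[-100:])
    print("Start of chunk 1:", chunks[1].page_content[:100])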

3. Embedding Function

Embeddings convert text into numerical vectors that capture semantic meaning:

from langchain_ollama import OllamaEmbeddings

def get_embedding_function(model_name="nomic-embed-text"):
    """Initializes the Ollama embedding function."""
    embeddings = OllamaEmbeddings(model=model_name)
    print(f"Initialized Ollama embeddings with model: {model_name}")
    return embeddings
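
To confirm that Ollama is serving the embedding model, you can embed a short test string. embed_query returns a plain list of floats; for nomic-embed-text the vector should have 768 dimensions.

# Quick check that the embedding model is reachable (assumes `ollama serve` is running)
embedding_function = get_embedding_function()
vector = embedding_function.embed_query("What services does the company provide?")
print(f"Embedding length: {len(vector)}")  # 768 for nomic-embed-text
print(vector[:5])  # first few values of the vector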

4. Vector Store Setup

ChromaDB provides local vector storage and similarity search capabilities:

from langchain_community.vectorstores import Chroma

CHROMA_PATH = "chroma_db"  # Directory to store ChromaDB data
def get_vector_store(embedding_function, persist_directory=CHROMA_PATH):
    """Initializes or loads the Chroma vector store."""
    vectorstore = Chroma(
        persist_directory=persist_directory,
        embedding_function=embedding_function,
    )
    print(f"Vector store initialized/loaded from: {persist_directory}")
    return vectorstore

5. Document Indexing

Index your document chunks into the vector store:

def index_documents(chunks, embedding_function, persist_directory=CHROMA_PATH):
    """Indexes document chunks into the Chroma vector store."""
    print(f"Indexing {len(chunks)} chunks...")
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embedding_function,
        persist_directory=persist_directory,
    )
    vectorstore.persist()  # Ensure data is saved
    print(f"Indexing complete. Data saved to: {persist_directory}")
    return vectorstore
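
Indexing is typically a one-off step that you re-run whenever the source documents change. A minimal script that ties the previous steps together, assuming it lives in the same module as the functions above, could look like this:

# One-off indexing script (reuses the functions defined above)
if __name__ == "__main__":
    documents = load_documents()
    chunks = split_documents(documents)
    embedding_function = get_embedding_function()
    index_documents(chunks, embedding_function)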

6. Creating the RAG Chain

The RAG chain combines retrieval and generation:

from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

def create_rag_chain(vector_store, llm_model_name="gemma:2b", context_window=8192):
    """Creates the RAG chain."""

    # Initialize the LLM
    llm = ChatOllama(
        model=llm_model_name,
        temperature=0.1,  # Lower temperature for more factual responses
        num_ctx=context_window,  # Set context window size
    )
    print(f"Initialized ChatOllama with model: {llm_model_name}, context window: {context_window}")

    # Create the retriever
    retriever = vector_store.as_retriever(
        search_type="similarity",
        search_kwargs={'k': 3},  # Retrieve the top 3 relevant chunks
    )
    print("Retriever initialized.")

    # Define the prompt template
    template = """You are a helpful and informative bot that answers questions using text from the reference Context included below. Answer the questions as if you were a team member of the company. Don't mention the context or the document. Be sure to respond in a complete sentence, being comprehensive and including all relevant background information. However, you are talking to a non-technical audience, so be sure to break down complicated concepts and strike a friendly and conversational tone. If the Context is irrelevant to the answer, tell the user to contact the company to learn more.
    Question: {question}
    Context: {context}
"""
    prompt = ChatPromptTemplate.from_template(template)
    print("Prompt template created.")

    # Define the RAG chain using LCEL
    rag_chain = (
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    print("RAG chain created.")
    return rag_chain
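
Before wiring the chain into an API, it is worth invoking it directly to confirm that retrieval and generation work end to end. A quick test, assuming the index has already been built with the functions above:

# Quick end-to-end test of the RAG chain (assumes the Chroma index already exists on disk)
embedding_function = get_embedding_function()
vector_store = get_vector_store(embedding_function)
rag_chain = create_rag_chain(vector_store, llm_model_name="gemma:2b")

answer = rag_chain.invoke("What services does the company provide?")
print(answer)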

Building the API

Create a FastAPI application to serve your RAG system:

from fastapi import FastAPI, Request

app = FastAPI()

# Initialize components
docs = load_documents()
chunks = split_documents(docs)
embedding_function = get_embedding_function()
vector_store = get_vector_store(embedding_function)
rag_chain = create_rag_chain(vector_store, llm_model_name="gemma:2b")

@app.post("/rag")
async  def  rag_endpoint(request: Request):
	"""Query the RAG system."""
	data = await request.json()
	question = data["question"]
	answer = rag_chain.invoke(question)
	return {"answer": answer}

@app.post("/index_db")
async  def  index_endpoint(request: Request):
	"""Re-index the database."""
	vector_store = index_documents(chunks, embedding_function)
	return {"message": "Indexing completed successfully"}

Running Your RAG System

Start the API server:

uvicorn your_api_file:app --host 0.0.0.0 --port 8080

Replace your_api_file with the name of your Python file (without the .py extension).

Testing Your System

You can test your RAG system using tools like Postman, curl, or any HTTP client:

Query Example:

curl -X POST "http://localhost:8080/rag" \
  -H "Content-Type: application/json" \
  -d '{"question": "What services does the company provide?"}'

Re-indexing Example:

curl -X POST "http://localhost:8080/index_db" \
  -H "Content-Type: application/json"

Performance Optimization Tips

1. Chunk Size Optimization

  • Small chunks (200-500 chars): Better for precise retrieval but may lose context

  • Large chunks (1000-2000 chars): Better context retention but less precise retrieval

  • Experiment with different sizes based on your document type

2. Retrieval Parameters

  • Adjust k value in search_kwargs based on your needs

  • Consider using MMR (Maximum Marginal Relevance) for diverse results:

retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={'k': 3, 'fetch_k': 10},
)
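
Here, the retriever first pulls the fetch_k (10) most similar chunks and then applies MMR to select the k (3) that are both relevant to the query and different from each other, which helps avoid feeding the model several near-duplicate passages.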

3. Model Selection

  • Use smaller models (Gemma 2B, TinyLlama) for faster responses

  • Use larger models (Llama3 8B) for better quality when speed isn’t critical

Common Issues and Solutions

Issue 1: Ollama Server Not Running

Solution: Start Ollama server before running your application:

ollama  serve

Issue 2: Memory Issues

Solution: Use a smaller model or reduce context window size:

llm = ChatOllama(
    model="gemma:2b",
    num_ctx=4096,  # Reduce from 8192
)

Issue 3: Poor Retrieval Quality

Solution:

  • Experiment with different chunk sizes

  • Adjust the number of retrieved chunks

  • Use better embedding models if available

Best Practices

  1. Document Preparation: Clean your PDFs and remove unnecessary content before indexing

  2. Prompt Engineering: Customize the prompt template for your specific use case

  3. Monitoring: Log queries and responses so you can spot weak answers and improve the system over time (a minimal logging sketch follows this list)

  4. Regular Updates: Re-index documents when content changes

  5. Testing: Test with various question types to ensure robust performance
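
For the monitoring point above, even lightweight logging helps. The sketch below is illustrative rather than part of the API built earlier: it assumes a rag_chain object like the one created above is in scope, and records each question, answer, and response time to a log file.

import logging
import time

logging.basicConfig(filename="rag_queries.log", level=logging.INFO)

def answer_with_logging(question: str) -> str:
    """Invokes the RAG chain and logs the question, answer, and latency."""
    start = time.time()
    answer = rag_chain.invoke(question)  # assumes rag_chain is defined as in the earlier sections
    elapsed = time.time() - start
    logging.info("question=%r answer=%r elapsed=%.2fs", question, answer, elapsed)
    return answer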

Conclusion

Building a local RAG system with Ollama and Gemma provides a powerful, privacy-focused solution for document-based question answering. This setup offers complete control over your data while maintaining the sophisticated capabilities of modern language models.

The system we’ve built is production-ready and can be extended with additional features like:

  • Multiple document format support

  • Advanced retrieval strategies

  • User authentication

  • Query caching

  • Performance monitoring

With this foundation, you can customize and scale your RAG system to meet specific requirements while keeping everything running locally and securely.
