Building a Local RAG System with Ollama and Gemma: A Complete Guide - Part 3

Deploying Your Local RAG System with Chat Memory to Google Cloud Platform

This is the third installment in our comprehensive series on building and deploying RAG (Retrieval-Augmented Generation) systems. In Part 1, we built a foundational RAG system using Ollama and Gemma. In Part 2, we enhanced it with Redis-based chat memory functionality. Now, we’ll take the next crucial step: deploying our memory-enhanced RAG system to Google Cloud Platform (GCP) for production use.

Moving from local development to cloud deployment opens up new possibilities for your RAG system. You’ll gain better accessibility, scalability, and the ability to serve multiple users simultaneously while maintaining the same powerful local AI capabilities we’ve built.

Why Deploy to Google Cloud Platform?

Before diving into the deployment process, let’s understand why GCP is an excellent choice for hosting your RAG system:

Cost-Effective Scaling: GCP offers flexible pricing models and free tier options that make it accessible for both experimentation and production deployments. You can start small and scale as your needs grow.

High Performance Infrastructure: Google’s robust infrastructure ensures reliable uptime and fast response times for your AI applications, crucial for maintaining good user experience.

Security and Compliance: Enterprise-grade security features protect your data and applications, while compliance certifications meet various regulatory requirements.

Integration Ecosystem: Seamless integration with other Google services and third-party tools provides flexibility for future enhancements and integrations.

Prerequisites

Before starting the deployment process, ensure you have:

  • A Google Cloud Platform account with billing enabled
  • Access to GCP free credits (available for new users)
  • Your RAG system code from Parts 1 and 2 of this series
  • PDF documents you want to use with your RAG system
  • Basic familiarity with command-line operations

Step-by-Step Deployment Guide

1. Setting Up Your GCP Virtual Machine

The first step is creating a virtual machine that will host your RAG system. We’ll use a cost-effective configuration suitable for our AI workload.

Creating the VM Instance:

  1. Navigate to the GCP Console
  2. Go to Compute Engine > VM Instances
  3. Click "Create Instance"
  4. Configure your instance with these settings:
    • Name: rag-system-vm (or your preferred name)
    • Machine type: e2-medium (suitable for small to medium workloads)
    • Boot disk: Debian 12 (stable and well-supported)
    • Firewall: Check “Allow HTTP traffic”
  5. Click Create

💡 Cost Optimization Tip: If you’re using GCP’s free tier, consider starting with an e2-micro instance and upgrading if needed. The e2-medium provides better performance for AI workloads but may exceed free tier limits.

2. Connecting to Your Virtual Machine

Once your VM is running, you’ll need to access it to install and configure your RAG system.

Accessing via SSH:

  1. From the VM instances list, locate your newly created VM
  2. Click the “SSH” button to open a browser-based terminal
  3. This will open a secure shell session directly in your browser

The browser-based SSH is convenient and requires no additional software installation. You’re now ready to begin configuring your server environment.

3. Preparing the Python Environment

Your RAG system requires Python 3 and several dependencies. Let’s set up a clean, isolated environment.

Installing Python and Creating Virtual Environment:

# Update system packages
sudo apt update

# Install Python 3 and essential tools
sudo apt install -y python3 python3-pip python3-venv

# Create a dedicated virtual environment
python3 -m venv ragenv

# Activate the virtual environment
source ragenv/bin/activate

📝 Note: Always use virtual environments for Python projects in production. This prevents dependency conflicts and makes your deployment more maintainable.

4. Installing Required Dependencies

Our RAG system relies on several Python packages for document processing, embeddings, and API functionality.

Installing Core Dependencies:

# LangChain ecosystem packages
pip install langchain langchain-community langchain-core langchain-ollama

# Vector database and embeddings
pip install chromadb sentence-transformers

# Document processing
pip install pypdf "unstructured[pdf]" tiktoken

# API framework and utilities  
pip install fastapi uvicorn python-dotenv

# Redis for chat memory
pip install redis

This installation process may take several minutes as it downloads and compiles various machine learning libraries.

5. Installing and Configuring Ollama

Ollama serves as our local AI model runtime, providing the language model capabilities for our RAG system.

Installing Ollama:

# Download and install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama in the background (on Linux the installer may already
# have started it as a systemd service; skip this step if so)
ollama serve &

Downloading Required Models:

# Pull the Gemma 2B model for text generation
ollama pull gemma:2b

# Pull embedding model for document vectorization
ollama pull nomic-embed-text

Testing Your Installation:

# Test the model interactively
ollama run gemma:2b

Type a simple question to verify the model is working, then exit with /bye.

⚠️ Important: Model downloads can be large (several GB). Ensure you have sufficient disk space and a stable internet connection.
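
You can also verify the setup from Python using the langchain-ollama package installed in step 4. A minimal sketch, assuming the same model names pulled above:

# Quick sanity check that Python can reach both Ollama models
from langchain_ollama import OllamaLLM, OllamaEmbeddings

llm = OllamaLLM(model="gemma:2b")
print(llm.invoke("Say hello in one sentence."))

embeddings = OllamaEmbeddings(model="nomic-embed-text")
print(len(embeddings.embed_query("test sentence")))  # embedding dimension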

6. Setting Up Redis for Chat Memory

Redis provides fast, in-memory storage for our chat history functionality, enabling contextual conversations.

Installing and Configuring Redis:

# Install the Redis server package
sudo apt install -y redis-server

# Enable Redis to start automatically on boot
sudo systemctl enable redis-server

# Start Redis service
sudo systemctl start redis-server

# Verify Redis is running
sudo systemctl status redis-server

Testing Redis Connection:

# Access Redis CLI
redis-cli

# Test basic operations
set testkey "Hello Redis"
get testkey

# Exit Redis CLI
exit

If you see “Hello Redis” returned from the get command, Redis is working correctly.
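
Under the hood, chat memory of the kind we built in Part 2 maps naturally onto Redis lists. The sketch below illustrates the pattern; the key naming and message format here are illustrative rather than the exact code from Part 2:

import json
import redis

# Connect to the local Redis instance started above
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def save_message(user_id: str, role: str, content: str) -> None:
    # Append one message to the user's conversation list
    r.rpush(f"chat:{user_id}", json.dumps({"role": role, "content": content}))

def get_history(user_id: str) -> list:
    # Return the full conversation, oldest message first
    return [json.loads(m) for m in r.lrange(f"chat:{user_id}", 0, -1)]

def clear_history(user_id: str) -> None:
    # Backs the /clear_chat endpoint we test in step 10
    r.delete(f"chat:{user_id}")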

7. Uploading Your Application Code

Now you need to transfer your RAG system code from your local machine to the GCP VM.

Method 1: Using gcloud CLI (from your local machine):

# Upload your Python API file
gcloud compute scp your_api_file.py your-vm-name:~/ --zone=your-vm-zone

# Upload your documents folder
gcloud compute scp --recurse ./data your-vm-name:~/ --zone=your-vm-zone

Method 2: Using Browser SSH Upload:

  1. In your browser SSH session, you’ll see an upload icon (folder with up arrow)
  2. Click it and select your Python files and documents
  3. Files will be uploaded to your home directory

Creating the Data Directory:

# Create a directory for your documents (skip if you already uploaded it with gcloud)
mkdir ~/data

# Verify your files are uploaded correctly
ls -la ~/
ls -la ~/data/
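
For orientation, the API file you upload should expose the two endpoints we test in step 10. The sketch below shows only the expected shape; your actual file from Part 2 contains the full retrieval chain and memory logic:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    user_id: str
    question: str

class ClearRequest(BaseModel):
    user_id: str

@app.post("/rag_chat")
def rag_chat(req: ChatRequest):
    # Part 2 logic: load chat history, retrieve documents, query Gemma
    return {"answer": "..."}  # placeholder for the real RAG chain output

@app.post("/clear_chat")
def clear_chat(req: ClearRequest):
    # Part 2 logic: delete the user's history from Redis
    return {"status": "cleared"}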

8. Configuring Network Access

To access your RAG API from external sources, you need to configure GCP firewall rules.

Setting Up Firewall Rules:

  1. In GCP Console, navigate to VPC Network > Firewall
  2. Click "Create Firewall Rule"
  3. Configure the rule:
    • Name: allow-rag-api
    • Direction: Ingress
    • Action: Allow
    • Targets: All instances in the network
    • Source IP ranges: 0.0.0.0/0 (or restrict to your IP for security)
    • Protocols and ports: Check TCP, specify port 8000
  4. Click Create

🔒 Security Note: For production deployments, consider restricting source IP ranges to specific networks or implementing authentication mechanisms.

9. Running Your RAG System

With all components installed and configured, you can now start your RAG API server.

Starting the API Server:

# Make sure your virtual environment is activated
source ragenv/bin/activate

# Start the FastAPI server
uvicorn your_api_file:app --host 0.0.0.0 --port 8000

For Persistent Operation (survives SSH disconnection):

# Run in background with logging
nohup uvicorn your_api_file:app --host 0.0.0.0 --port 8000 > api.log 2>&1 &

# Check if it's running
ps aux | grep uvicorn

# View logs
tail -f api.log

10. Testing Your Deployed System

Once your server is running, you can test it using the external IP address of your GCP VM.

Finding Your External IP:

  1. In GCP Console, go to Compute Engine > VM Instances
  2. Note the “External IP” column for your VM

Testing API Endpoints:

# Test chat functionality
curl -X POST "http://YOUR_EXTERNAL_IP:8000/rag_chat" \
-H "Content-Type: application/json" \
-d '{
    "user_id": "test_user",
    "question": "What services does the company provide?"
}'

# Test follow-up question (memory functionality)
curl -X POST "http://YOUR_EXTERNAL_IP:8000/rag_chat" \
-H "Content-Type: application/json" \
-d '{
    "user_id": "test_user", 
    "question": "Can you tell me more about the first service?"
}'

# Clear chat history
curl -X POST "http://YOUR_EXTERNAL_IP:8000/clear_chat" \
-H "Content-Type: application/json" \
-d '{"user_id": "test_user"}'
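
If you prefer Python over curl, the same tests can be scripted with the requests library (pip install requests first; replace YOUR_EXTERNAL_IP as above):

import requests

BASE_URL = "http://YOUR_EXTERNAL_IP:8000"

# First question establishes the conversation
print(requests.post(f"{BASE_URL}/rag_chat", json={
    "user_id": "test_user",
    "question": "What services does the company provide?",
}).json())

# The follow-up only makes sense if Redis memory is working
print(requests.post(f"{BASE_URL}/rag_chat", json={
    "user_id": "test_user",
    "question": "Can you tell me more about the first service?",
}).json())

# Reset the conversation when done
requests.post(f"{BASE_URL}/clear_chat", json={"user_id": "test_user"})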

Monitoring and Maintenance

Performance Monitoring

Keep track of your system’s performance and resource usage:

# Monitor system resources
htop

# Check disk usage
df -h

# Monitor API logs
tail -f api.log

# Check Redis memory usage
redis-cli info memory
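
To make monitoring scriptable, one option is a health endpoint that also pings Redis. A possible sketch; in your deployment you would add the route to the existing app in your_api_file.py rather than creating a new one:

import redis
from fastapi import FastAPI

app = FastAPI()  # illustrative; reuse your existing app in practice
r = redis.Redis(host="localhost", port=6379)

@app.get("/health")
def health():
    # Report degraded status if chat memory storage is unreachable
    try:
        r.ping()
        return {"status": "ok", "redis": "up"}
    except redis.exceptions.ConnectionError:
        return {"status": "degraded", "redis": "down"}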

Security Considerations

Access Control

Implement API Authentication:

Consider adding authentication middleware to your FastAPI application:

from fastapi import Depends, HTTPException, status
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials

security = HTTPBearer()

def verify_token(token: HTTPAuthorizationCredentials = Depends(security)):
    # Implement your real token verification logic here
    if token.credentials != "your-secret-token":
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail="Invalid authentication credentials"
        )
    return token
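
You can then attach the dependency to any route you need to protect; clients must send the token in an Authorization: Bearer header. A minimal sketch, reusing the request model from your API file:

# Requests without a valid "Authorization: Bearer your-secret-token"
# header now receive a 401 response
@app.post("/rag_chat", dependencies=[Depends(verify_token)])
def rag_chat(req: ChatRequest):
    ...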

Network Security:

  • Restrict firewall rules to specific IP ranges when possible
  • Consider using HTTPS with SSL certificates for production
  • Regularly update system packages: sudo apt update && sudo apt upgrade

Data Privacy

  • Ensure your documents don’t contain sensitive information
  • Implement proper logging practices that don’t expose user data
  • Consider encrypting data at rest if handling sensitive documents

Scaling Considerations

Vertical Scaling

If you need more performance, you can upgrade your VM:

  1. Stop your VM instance
  2. Click “Edit” in the GCP Console
  3. Change machine type to a larger size (e.g., e2-standard-2)
  4. Start the instance and restart your services

Horizontal Scaling

For high-traffic scenarios:

  • Use GCP Load Balancer to distribute traffic across multiple VM instances
  • Implement Redis clustering for distributed memory management
  • Consider using Google Kubernetes Engine (GKE) for container orchestration

Conclusion

Congratulations! You’ve successfully deployed a production-ready RAG system with chat memory to Google Cloud Platform. Your system now offers:

  • Global Accessibility: Users can access your AI system from anywhere with an internet connection
  • Scalable Infrastructure: GCP’s infrastructure can grow with your needs
  • Persistent Memory: Redis ensures conversation context is maintained across sessions
  • Professional API: RESTful endpoints ready for integration with web and mobile applications
  • Cost-Effective Operation: Optimized for reasonable operational costs while maintaining performance

This deployment represents a significant milestone in your AI journey. You’ve moved from local experimentation to a cloud-based solution that can serve real users with real-world applications.

The architecture you’ve built is robust and extensible. Whether you’re using this for customer support, document analysis, educational tools, or any other application, you now have a solid foundation that can evolve with your needs.

Your RAG system is now ready for production use. The combination of Ollama’s local AI capabilities with Redis memory management and GCP’s infrastructure provides a powerful, scalable solution for intelligent document interaction.


💡 Pro Tip: Keep your GCP free credits in mind and monitor your usage regularly. The system we’ve built is designed to be cost-effective, but always stay aware of your resource consumption to avoid unexpected charges.

Ready to take your RAG system to the next level? Start experimenting with different models, expanding your document collection, and exploring the advanced features that make your AI assistant even more powerful and useful for your specific use case.
