🐳 Ollama on Google Cloud: Complete Guide

❓ Question: "How do I run Ollama on Google Cloud? Does Cloud Run work?"

Quick Answer:
- Cloud Run: ❌ Not ideal (ephemeral, stateless)
- Compute Engine: ✅ A great fit (persistent, cheap)
- Ollama Cloud: ⭐ Best (managed, zero ops)

📊 Comparison: Where to Run Ollama on GCP

Platform	Cold Start	Price/month	Setup	State	GPU	Recommendation
Cloud Run	30-60s	$5-20	Trivial	❌ Ephemeral	❌	❌ No
Compute Engine	<1s	$20-30	30min	✅ Persistent	⚠️	✅ Yes
GKE	<1s	$50-150	2h	✅ Persistent	✅	Overkill
Ollama Cloud	<100ms	$5-15	2min	N/A	✅	⭐ Best
Local (dev)	N/A	$0	5min	✅	Yes	Dev only

❌ Why Cloud Run Is a Poor Fit for Ollama

Problem 1: Stateless Architecture

Cloud Run design:
├─ Starts a container on request
├─ Runs the handler
└─ Stops the container after ~15 minutes idle

But Ollama needs:
├─ The model loaded in memory
├─ A persistent cache between requests
└─ Uninterrupted state

Result:
├─ Request 1: Load model (~30s) + query (~1s) = 31s ⏱️ SLOW
├─ Request 2 (after 15min idle): Reload model (~30s) + query (~1s) = 31s
└─ Request 3 (within 15min): Query (~1s) = 1s ✅

SLA: Unpredictable; users hit a 30s wait at random
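The latency pattern above can be sketched as a small model. This is only an illustration using the guide's rough figures (about 30s to load the model, about 1s per query, a 15-minute idle timeout); the function name `request_latency` and its parameters are hypothetical.

```python
def request_latency(idle_seconds: float,
                    idle_timeout: float = 15 * 60,
                    model_load: float = 30.0,
                    query: float = 1.0) -> float:
    """Return the expected latency in seconds for a single request.

    If the instance has been idle for at least idle_timeout, the container
    is assumed to be gone, so the model has to be loaded again.
    """
    cold = idle_seconds >= idle_timeout
    return (model_load if cold else 0.0) + query


# The very first request is treated as infinitely idle (always cold).
first = request_latency(float("inf"))
after_20_min = request_latency(20 * 60)
after_5_min = request_latency(5 * 60)
print(first, after_20_min, after_5_min)  # 31.0 31.0 1.0
```

Any traffic pattern with gaps longer than the idle timeout keeps paying the model-load cost, which is why the observed latency looks random to users.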

Problem 2: Ephemeral Storage

Cloud Run filesystem:
├─ Writable paths such as /tmp live in memory and count against the instance's memory limit
├─ Everything is lost when the container dies
└─ Ollama models are 270MB-7GB

Workaround: Cloud Storage + mounting
├─ Complexity: High
├─ Latency: 100-500ms to load the model
├─ Cost: $0.020/GB/month storage + egress
└─ Overhead: Every request begins with I/O

Reality:
└─ Slower than simply using Compute Engine

Problem 3: Compute Resources

Cloud Run max: 8 vCPU, 32GB RAM
Ollama runs well on: 2 vCPU, 4GB RAM

The limits are not the issue; instance reuse is:
├─ Cloud Run: Killed after 15min idle → reload
└─ Compute Engine: Runs 24/7 → always ready

Trade-off:
├─ Cloud Run: Saving $10/month costs +30s latency
└─ Compute Engine: +$20/month buys -30s latency (WIN)

✅ Recommended Solution: Compute Engine

Architecture

┌──────────────────────────────────┐
│ Slack in Google Cloud Run        │
│ (ifriend-agents)                 │
└──────────────────┬───────────────┘
                   │
        ┌──────────┴──────────┐
        ▼                     ▼
    ┌────────────┐    ┌─────────────┐
    │ Supabase   │    │  Compute    │
    │ PostgreSQL │    │  Engine VM  │
    │ + pgvector │    │  + Ollama   │
    │  $25/mth   │    │  $25/mth    │
    └────────────┘    └─────────────┘

Total: $50/mth (vs $150+ Vertex AI)
Performance: Sub-200ms (vs Cloud Run 500ms+)
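The cost line above is simple arithmetic, sketched here for clarity. The figures are the guide's own estimates ($25 Supabase, $25 Compute Engine, $150 as the lower bound for Vertex AI); `monthly_savings` is a hypothetical helper name.

```python
def monthly_savings(self_hosted: float, managed: float) -> float:
    """Fraction saved by the self-hosted stack relative to the managed one."""
    return 1 - self_hosted / managed


# Monthly estimates taken from the architecture diagram (USD).
supabase, compute_engine, vertex_ai = 25, 25, 150

total = supabase + compute_engine
saving = monthly_savings(total, vertex_ai)
print(f"${total}/mth, {saving:.0%} cheaper")  # $50/mth, 67% cheaper
```

Because $150 is a lower bound for the managed option, the real saving would be at least this large under the same assumptions.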

Compute Engine Setup

Step 1: Create the VM

# Via the GCP Console or the CLI.
# e2-medium: 2 vCPU, 4GB RAM, about $25/mth
# 50GB boot disk leaves room for models
# (comments must not follow a trailing backslash, or the command breaks)
gcloud compute instances create ollama-server \
  --image-family=debian-12 \
  --image-project=debian-cloud \
  --machine-type=e2-medium \
  --zone=us-central1-a \
  --boot-disk-size=50GB \
  --tags=ollama,http,https \
  --scopes=https://www.googleapis.com/auth/cloud-platform

# Output:
# NAME           ZONE           MACHINE_TYPE  PREEMPTIBLE  INTERNAL_IP    EXTERNAL_IP
# ollama-server  us-central1-a  e2-medium                  10.128.0.2     35.192.xxx.xxx

Step 2: Configure the Firewall

# Allow HTTP from Cloud Run on the Ollama port
# 10.128.0.0/9 is the GCP internal range used by auto-mode subnets
gcloud compute firewall-rules create ollama-internal \
  --allow=tcp:11434 \
  --source-ranges=10.128.0.0/9 \
  --target-tags=ollama

# Allow SSH (for admin); replace YOUR_IP with your IP or a bastion address
gcloud compute firewall-rules create ollama-ssh \
  --allow=tcp:22 \
  --source-ranges=YOUR_IP/32 \
  --target-tags=ollama
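Before applying a rule like this, it can help to check which source addresses the range actually admits. The sketch below uses Python's standard `ipaddress` module; `is_allowed` is a hypothetical helper, and `10.8.0.5` is an assumed example of an address from a separately configured connector range, not a value from this guide.

```python
import ipaddress

# Source range from the firewall rule above.
ALLOWED = ipaddress.ip_network("10.128.0.0/9")


def is_allowed(source_ip: str, allowed=ALLOWED) -> bool:
    """Return True if source_ip falls inside the allowed network."""
    return ipaddress.ip_address(source_ip) in allowed


print(is_allowed("10.128.0.2"))  # the VM's internal IP from Step 1 -> True
print(is_allowed("10.8.0.5"))    # assumed connector address -> False
```

If the traffic from Cloud Run arrives from a range outside `10.128.0.0/9`, that range would need to be added to `--source-ranges` as well.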

Step 3: SSH In and Install Docker

gcloud compute ssh ollama-server --zone=us-central1-a

# Inside VM:
sudo apt update
sudo apt install -y docker.io docker-compose git

# Allow the current user to run docker
sudo usermod -aG docker $USER
# Log out and back in so the new group membership takes effect
exit
gcloud compute ssh ollama-server --zone=us-central1-a

Step 4: Set Up Docker Compose

# Create docker-compose.yml
cat > docker-compose.yml << 'EOF'
version: '3'
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - /mnt/data/ollama:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - OLLAMA_NUM_GPU=0              # CPU-only (cheaper)
    restart: always
    healthcheck:
      test: ["CMD", "ollama", "list"]  # curl is not included in the ollama/ollama image
      interval: 10s
      timeout: 5s
      retries: 3
    networks:
      - ollama_network

networks:
  ollama_network:
    driver: bridge
EOF

Step 5: Create the Data Directory

sudo mkdir -p /mnt/data/ollama
sudo chown $USER:$USER /mnt/data/ollama

Step 6: Start Ollama

docker-compose up -d

# Verify
docker ps
curl http://localhost:11434/api/tags

# Output: {"models":[]}  ← No models yet

Step 7: Pull Models

# Pull embedding model (274MB)
docker exec ollama ollama pull nomic-embed-text

# Pull reasoning model (optional, as a fallback)
# Note: mistral is roughly 4GB, which is tight on an e2-medium with 4GB RAM
docker exec ollama ollama pull mistral

# Verify
curl http://localhost:11434/api/tags
# Output: {"models":[{"name":"nomic-embed-text:latest",...}]}
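The same verification can be scripted. The sketch below reads `/api/tags` with the standard library and reports which required models are still missing; `fetch_tags`, `model_names`, and `missing_models` are hypothetical helper names, and the base URL assumes the script runs on the VM itself.

```python
import json
import urllib.request


def fetch_tags(base_url: str = "http://localhost:11434", timeout: float = 5.0) -> dict:
    """GET /api/tags and return the decoded JSON payload."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names such as 'nomic-embed-text:latest'."""
    return [m["name"] for m in tags_payload.get("models", [])]


def missing_models(tags_payload: dict, required: list[str]) -> list[str]:
    """Return the required models (without tag) that are not installed."""
    installed = {name.split(":")[0] for name in model_names(tags_payload)}
    return [r for r in required if r not in installed]


# Example with a payload shaped like the output shown above:
sample = {"models": [{"name": "nomic-embed-text:latest"}]}
print(missing_models(sample, ["nomic-embed-text", "mistral"]))  # ['mistral']
```

On the VM, `missing_models(fetch_tags(), ["nomic-embed-text"])` returning an empty list would indicate the embedding model is ready.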

Step 8: Test the API

# Embedding test (/api/embeddings reads the text from "prompt")
curl http://localhost:11434/api/embeddings \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text",
    "prompt": "Hello world"
  }'

# Output: {"embedding":[0.123, -0.456, ...]}
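A quick way to sanity-check the returned vectors is to compare two embeddings with cosine similarity. This is a minimal sketch, not part of the original setup: `embed` and `cosine_similarity` are hypothetical helpers, the request sends the text in the `prompt` field of `/api/embeddings`, and the base URL assumes the script runs on the VM.

```python
import json
import math
import urllib.request


def embed(text: str,
          base_url: str = "http://localhost:11434",
          model: str = "nomic-embed-text") -> list[float]:
    """POST to /api/embeddings and return the embedding vector."""
    body = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["embedding"]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Usage on the VM (requires the Ollama container to be running):
#   a = embed("Hello world")
#   b = embed("Hi there, world")
#   print(cosine_similarity(a, b))
```

Related sentences should score noticeably higher than unrelated ones; if every pair scores the same, the model or request body is worth rechecking.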

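🔗 Connecting from Cloud Run

1. Inside the GCP VPC

# Cloud Run must reach the VM's VPC through a Serverless VPC Access connector.
# INTERNAL_IP is the VM's internal address from the VM creation output (e.g. 10.128.0.2).
# A *.svc.cluster.local name is Kubernetes DNS and does not resolve for a Compute Engine VM.
gcloud run deploy ifriend-agents \
  --vpc-connector=projects/PROJECT_ID/locations/us-central1/connectors/default \
  --set-env-vars="OLLAMA_URL=http://INTERNAL_IP:11434"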

2. Via External IP (Less Secure)

# Add authentication in front of Ollama
# docker-compose.yml:
services:
  ollama:
    environment:
      - OLLAMA_AUTH=token:SECRET_TOKEN  # ⚠️ Not a standard Ollama setting; enforce the token in a reverse proxy

# Cloud Run:
export OLLAMA_URL="http://EXTERNAL_IP:11434"

3. Recommended: Via Secret Manager + VPC

# 1. Create the secret (-n avoids storing a trailing newline)
echo -n "http://ollama-server:11434" | \
  gcloud secrets create ollama-url --data-file=-

# 2. Deploy Cloud Run with access to the secret
#    OLLAMA_URL comes from the secret, so it is not also set with --set-env-vars
gcloud run deploy ifriend-agents \
  --vpc-connector=default \
  --service-account=ifriend-sa \
  --update-secrets="OLLAMA_URL=ollama-url:latest"
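On the application side, it is worth validating the injected value before using it, since a secret created with a plain `echo` carries a trailing newline. The sketch below is one way to do that; `resolve_ollama_url` is a hypothetical helper, and the default URL is the one used elsewhere in this guide.

```python
import os
from urllib.parse import urlparse


def resolve_ollama_url(env=None, default: str = "http://ollama-server:11434") -> str:
    """Read OLLAMA_URL, strip stray whitespace, and check it looks like a URL."""
    env = os.environ if env is None else env
    raw = env.get("OLLAMA_URL", default).strip()
    parsed = urlparse(raw)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"Invalid OLLAMA_URL: {raw!r}")
    # Drop a trailing slash so f"{url}/api/..." does not produce "//".
    return raw.rstrip("/")


# A value with a trailing newline, as stored by `echo` without -n:
print(resolve_ollama_url({"OLLAMA_URL": "http://ollama-server:11434\n"}))
```

Failing at startup with a clear error is usually easier to debug than a connection error on the first embedding request.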

💻 Python Code: Using Ollama

Integration with CustomMemoryService

import asyncio
import os
from typing import List

import aiohttp


class OllamaEmbedder:
    def __init__(self, ollama_url: str = "http://ollama-server:11434"):
        self.ollama_url = ollama_url

    async def embed(self, text: str, model: str = "nomic-embed-text") -> List[float]:
        """Generate embedding using Ollama"""
        try:
            async with aiohttp.ClientSession() as session:
                async with session.post(
                    f"{self.ollama_url}/api/embeddings",
                    # /api/embeddings reads the text from "prompt"
                    json={"model": model, "prompt": text},
                    timeout=aiohttp.ClientTimeout(total=30)
                ) as resp:
                    if resp.status != 200:
                        raise Exception(f"Ollama error {resp.status}")

                    data = await resp.json()
                    return data["embedding"]

        except asyncio.TimeoutError:
            raise Exception("Ollama timeout - model may be overloaded")
        except aiohttp.ClientConnectorError as e:
            raise Exception(f"Cannot connect to Ollama: {e}")

# Usage (falls back to the default URL when OLLAMA_URL is unset)
embedder = OllamaEmbedder(os.getenv("OLLAMA_URL", "http://ollama-server:11434"))

embedding = asyncio.run(
    embedder.embed("Customer feedback about product X")
)
print(f"Generated embedding: {len(embedding)} dimensions")
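Timeouts and connection errors like the ones handled above are often transient, for example while a model is still loading after a VM restart. One common mitigation, sketched here as an assumption rather than something the original service does, is to retry with exponential backoff; `backoff_delays` and `with_retries` are hypothetical helpers.

```python
import asyncio


def backoff_delays(retries: int = 3, base: float = 0.5, cap: float = 8.0) -> list[float]:
    """Delays before each retry: base, 2*base, 4*base, ... capped at cap."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


async def with_retries(make_call, retries: int = 3, base: float = 0.5, cap: float = 8.0):
    """Await make_call(); on failure, sleep and retry up to `retries` times.

    make_call is a zero-argument function returning a fresh coroutine, since
    a coroutine object cannot be awaited twice.
    """
    last_error = None
    # One initial attempt plus `retries` retries; None marks the final attempt.
    for delay in backoff_delays(retries, base, cap) + [None]:
        try:
            return await make_call()
        except Exception as exc:
            last_error = exc
            if delay is None:
                break
            await asyncio.sleep(delay)
    raise last_error


print(backoff_delays(4, 1.0, 5.0))  # [1.0, 2.0, 4.0, 5.0]
```

With an embedder object, usage would look like `await with_retries(lambda: embedder.embed(text))`. In a real service it may be preferable to retry only on timeout and connection errors rather than on every exception.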

⭐ Alternative: Ollama Cloud (RECOMMENDED)

Why It's Better

Cloud Run + Compute Engine:
├─ Setup: 1-2 hours
├─ Maintenance: ~1h/week
├─ Cost: $25-50/mth
└─ Performance: Sub-200ms

Ollama Cloud:
├─ Setup: 2 minutes
├─ Maintenance: none
├─ Cost: $5-15/mth
└─ Performance: Sub-100ms

Ollama Cloud Setup

1. Create an Account

# Go to https://ollama.ai/cloud
# Sign in with GitHub
# Create account

2. Generate an API Key

# Dashboard → API Keys → Create
# Copy key: ollama_xxxxxxxxxxxxx

3. Use the API

import asyncio
import os

import aiohttp


class OllamaCloudEmbedder:
    def __init__(self):
        self.api_key = os.getenv("OLLAMA_API_KEY")
        # Base URL as given in this guide; confirm it against the Ollama Cloud docs
        self.base_url = "https://api.ollama.cloud/v1"

    async def embed(self, text: str):
        """Generate embedding using Ollama Cloud API"""
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json"
        }

        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"{self.base_url}/embeddings",
                headers=headers,
                json={
                    "model": "nomic-embed-text",
                    "input": text
                }
            ) as resp:
                # Fail on HTTP errors instead of hitting a KeyError below
                resp.raise_for_status()
                data = await resp.json()
                return data["embedding"]

# Usage (asyncio.run is needed because await is not valid at module level)
embedder = OllamaCloudEmbedder()
embedding = asyncio.run(embedder.embed("Your text here"))

Pricing

Free tier: 1000 requests/day

Pro: $5-15/mth
├─ 1M+ requests/day
├─ Priority support
└─ Custom models

📊 Final Comparison: Where to Run Ollama

Option A: Ollama Cloud ⭐ RECOMMENDED

Advantages:
├─ ✅ Setup: 2 minutes
├─ ✅ No maintenance
├─ ✅ Performance: <100ms
├─ ✅ Auto-scaling
├─ ✅ SLA: 99.9%
└─ ✅ Cost: $5-15/mth

Disadvantages:
├─ ❌ API cost per token (small)
└─ ❌ Vendor dependency

Option B: Compute Engine e2-medium

Advantages:
├─ ✅ Setup: 1-2 hours
├─ ✅ Performance: 1-5ms (local)
├─ ✅ Full control
├─ ✅ No per-token cost
└─ ✅ Fixed cost: $25/mth

Disadvantages:
├─ ❌ Maintenance: ~1h/week
├─ ⚠️ Cold restart latency
└─ ⚠️ Manual scaling

Option C: Local + CLI (Development)

Advantages:
├─ ✅ Cost: $0
├─ ✅ Performance: Best
└─ ✅ Easy debugging

Disadvantages:
├─ ❌ Not scalable
└─ ❌ Dev only

🎯 Final Recommendation

MVP Phase (Now):

Ollama Cloud + Supabase
├─ Cost: $30-40/mth
├─ Setup: 2-3 hours
├─ Maintenance: none
└─ Time to market: Fast

Scale Phase (Once You Have Traffic):

Compute Engine + Supabase
├─ Cost: $50-60/mth
├─ Setup: 1-2 hours
├─ Maintenance: ~1h/week
└─ Savings: 50-70% vs cloud APIs

✅ Checklist: Deploying Ollama on GCP

Ollama Cloud Route

[ ] Create an account at ollama.ai/cloud
[ ] Generate an API key
[ ] Test the API endpoint
[ ] Add the key to .env
[ ] Update CustomMemoryService with OllamaCloudEmbedder

Compute Engine Route

[ ] Create an e2-medium VM
[ ] Configure the firewall
[ ] Install Docker
[ ] Deploy docker-compose
[ ] Pull nomic-embed-text
[ ] Test http://IP:11434/api/embeddings
[ ] Connect from Cloud Run
[ ] Monitor disk space
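For the disk-space item, a small check can run from cron on the VM. This is a minimal sketch using the standard library's `shutil.disk_usage`; the helper names and the 80% threshold are assumptions, and `/mnt/data/ollama` is the model directory created in Step 5.

```python
import shutil


def usage_fraction(total: int, used: int) -> float:
    """Fraction of the disk in use; 0.0 if total is zero."""
    return used / total if total else 0.0


def disk_alert(path: str = "/mnt/data/ollama", threshold: float = 0.8) -> bool:
    """Return True when the filesystem holding `path` is at or above threshold."""
    usage = shutil.disk_usage(path)
    return usage_fraction(usage.total, usage.used) >= threshold


# Usage on the VM:
#   if disk_alert():
#       print("Model disk is over 80% full; remove unused models or grow the disk")
```

Each additional model can add several gigabytes, so checking after every `ollama pull` is a reasonable habit.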

Recommendation: Start with Ollama Cloud (fast), then migrate to Compute Engine later