Retrieval-Augmented Generation (RAG) combines the power of Large Language Models (LLMs) with your own custom data. This guide walks through building a complete RAG pipeline on Chutes using ChromaDB for vector storage, vLLM for embeddings, and SGLang/vLLM for generation.
Architecture
A standard RAG pipeline on Chutes consists of three components:
Embedding Service: Converts text into vector representations.
Vector Database (Chroma): Stores vectors and performs similarity search.
LLM (Generation): Takes the query + retrieved context and generates an answer.
You can deploy these as separate chutes for scalability, or combine them for simplicity. Here, we'll deploy them as modular components.
Step 1: Deploy Embedding Service
Use the embedding template to deploy a high-performance embedding model like bge-large-en-v1.5.
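If you want to see what such a service looks like under the hood, here is a minimal hand-rolled sketch built from the same Chute primitives used in the rest of this guide. The chute and image names, the /embed path, and the use of sentence-transformers (rather than the vLLM-backed template) are illustrative assumptions, not the official template.

# embedding_chute.py -- illustrative sketch, not the official embedding template
from typing import List

from pydantic import BaseModel

from chutes.chute import Chute, NodeSelector
from chutes.image import Image

image = (
    Image(username="myuser", name="bge-embeddings", tag="0.1")
    .from_base("parachutes/base-python:3.12.7")
    .run_command("pip install sentence-transformers")
)

chute = Chute(
    username="myuser",
    name="rag-embeddings",
    image=image,
    node_selector=NodeSelector(gpu_count=1),
)


class EmbedRequest(BaseModel):
    texts: List[str]


@chute.on_startup()
async def load_model(self):
    # Load the embedding model once per instance.
    from sentence_transformers import SentenceTransformer

    self.model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")


@chute.cord(public_api_path="/embed", method="POST")
async def embed(self, req: EmbedRequest) -> List[List[float]]:
    # Return one vector per input text.
    vectors = self.model.encode(req.texts, normalize_embeddings=True)
    return vectors.tolist()

Whichever route you take, the contract the rest of the pipeline relies on is simple: an endpoint that accepts a list of texts and returns one embedding vector per text.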
Step 2: Deploy the Vector Database (ChromaDB)
We'll create a custom chute that runs ChromaDB. Chroma only keeps data across restarts if you back it with persistent storage, so use a Job or a persistent storage pattern when your index needs to survive. For this example, we'll set up an ephemeral vector DB that ingests data on startup, which works well for read-only knowledge bases.
# deploy_chroma.py
from chutes.image import Image
from chutes.chute import Chute, NodeSelector
from pydantic import BaseModel, Field
from typing import List

image = (
    Image(username="myuser", name="chroma-db", tag="0.1")
    .from_base("parachutes/base-python:3.12.7")
    .run_command("pip install chromadb")
)

chute = Chute(
    username="myuser",
    name="rag-vector-db",
    image=image,
    node_selector=NodeSelector(gpu_count=0, min_cpu_count=2, min_memory_gb=8),
)


class Query(BaseModel):
    query_embeddings: List[List[float]]
    n_results: int = 5


@chute.on_startup()
async def setup_db(self):
    import chromadb

    # Ephemeral, in-memory Chroma client; the index is rebuilt on every cold start.
    self.client = chromadb.Client()
    self.collection = self.client.create_collection("knowledge_base")

    # INGESTION: in a real app, you might fetch this from S3 or a database.
    documents = [
        "Chutes is a serverless GPU platform.",
        "You can deploy LLMs, diffusion models, and custom code on Chutes.",
        "Chutes uses a decentralized network of GPUs.",
    ]
    ids = [f"doc_{i}" for i in range(len(documents))]

    # Note: in a real setup, you'd generate embeddings for these docs first.
    # For simplicity, we assume you send pre-computed embeddings or compute them here:
    # self.collection.add(documents=documents, ids=ids, embeddings=...)
    print("ChromaDB initialized!")


@chute.cord(public_api_path="/query", method="POST")
async def query(self, q: Query):
    results = self.collection.query(
        query_embeddings=q.query_embeddings,
        n_results=q.n_results,
    )
    return results
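The ingestion step is intentionally left open above. One way to fill it in is to have setup_db call the embedding service from Step 1 and then add the vectors to the collection. A minimal sketch, assuming the embedding chute is reachable at the hypothetical URL below, that your API key is in the CHUTES_API_KEY environment variable, and that aiohttp is added to the image's pip install:

# Helper for setup_db: compute embeddings via the Step 1 chute, then add them to Chroma.
# The embedding URL and auth header are placeholders.
import os

import aiohttp

EMBED_URL = "https://myuser-rag-embeddings.chutes.ai/embed"  # hypothetical embedding chute URL


async def embed_and_add(collection, documents: list[str], ids: list[str]) -> None:
    headers = {"Authorization": f"Bearer {os.environ['CHUTES_API_KEY']}"}
    async with aiohttp.ClientSession() as session:
        async with session.post(EMBED_URL, json={"texts": documents}, headers=headers) as resp:
            resp.raise_for_status()
            embeddings = await resp.json()  # one vector per document, in order
    collection.add(documents=documents, ids=ids, embeddings=embeddings)

Calling await embed_and_add(self.collection, documents, ids) at the end of setup_db would replace the commented-out add call above.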
Step 3: The RAG Controller (Client-Side or Chute)
You can orchestrate the RAG flow from your client application, or deploy a "Controller Chute" that talks to the other services. Here is a Python client example that ties it all together.
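A minimal sketch of that client, assuming the three chutes are reachable at the hypothetical URLs below and that the generation chute exposes an OpenAI-compatible chat completions endpoint (as vLLM and SGLang deployments do). The model name is whatever model your LLM chute actually serves.

# rag_client.py -- minimal end-to-end RAG flow; URLs, model name, and API key are placeholders.
import os

import requests

HEADERS = {"Authorization": f"Bearer {os.environ['CHUTES_API_KEY']}"}

EMBED_URL = "https://myuser-rag-embeddings.chutes.ai/embed"       # hypothetical
VECTOR_URL = "https://myuser-rag-vector-db.chutes.ai/query"       # hypothetical
LLM_URL = "https://myuser-rag-llm.chutes.ai/v1/chat/completions"  # OpenAI-compatible endpoint


def answer(question: str) -> str:
    # 1. Embed the question.
    query_embedding = requests.post(
        EMBED_URL, json={"texts": [question]}, headers=HEADERS
    ).json()[0]

    # 2. Retrieve the most similar documents from the ChromaDB chute.
    results = requests.post(
        VECTOR_URL,
        json={"query_embeddings": [query_embedding], "n_results": 3},
        headers=HEADERS,
    ).json()
    # Assumes documents were stored alongside embeddings during ingestion.
    context = "\n".join(results["documents"][0])

    # 3. Generate an answer grounded in the retrieved context.
    completion = requests.post(
        LLM_URL,
        json={
            "model": "meta-llama/Llama-3.1-8B-Instruct",  # whichever model your LLM chute serves
            "messages": [
                {"role": "system", "content": "Answer using only the provided context."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
            ],
        },
        headers=HEADERS,
    ).json()
    return completion["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(answer("What is Chutes?"))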
You can also use ComfyUI on Chutes to build visual RAG pipelines. The chroma.py example in the Chutes examples directory demonstrates how to wrap a ComfyUI workflow (which can include RAG nodes) inside a Chute API. The basic flow:
Build a ComfyUI workflow that includes text loading, embedding, and LLM query nodes.
Export the workflow as JSON API format.
Use the chroma.py pattern to load this workflow into a Chute, exposing inputs (like "prompt") as API parameters.
This allows you to drag-and-drop your RAG logic and deploy it as a scalable API instantly.
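For a sense of what the wrapping step involves, here is a rough sketch of patching an exported workflow and submitting it to a local ComfyUI server. The workflow file name, the node id, and the assumption that the prompt lives in that node's "text" input are placeholders, and the real chroma.py example handles server startup, queuing, and output polling more completely.

# comfy_rag_sketch.py -- patch an exported ComfyUI workflow (API format) and queue it locally.
import json

import requests  # assumes requests is installed and a ComfyUI server is running in the chute

WORKFLOW_PATH = "rag_workflow_api.json"  # exported via "Save (API Format)" in ComfyUI
PROMPT_NODE_ID = "6"                     # id of the text-input node in your workflow (assumption)
COMFY_URL = "http://127.0.0.1:8188"      # default ComfyUI port


def run_workflow(prompt: str) -> dict:
    # Load the exported workflow and inject the caller's prompt into the text-input node.
    with open(WORKFLOW_PATH) as f:
        workflow = json.load(f)
    workflow[PROMPT_NODE_ID]["inputs"]["text"] = prompt

    # Queue the workflow on the local ComfyUI server; the response includes a prompt_id
    # that can be polled via /history to collect the outputs.
    resp = requests.post(f"{COMFY_URL}/prompt", json={"prompt": workflow})
    resp.raise_for_status()
    return resp.json()

You would expose run_workflow through a @chute.cord endpoint, just like the /query cord in Step 2, so callers only see a simple prompt parameter.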