The VLLM template is the most popular way to deploy large language models on Chutes. It provides a high-performance, OpenAI-compatible API server powered by vLLM, optimized for fast inference and high throughput.
What is VLLM?
VLLM is a fast and memory-efficient inference engine for large language models that provides:
📈 High throughput serving with PagedAttention
🧠 Memory efficiency with optimized attention algorithms
🔄 Continuous batching for better GPU utilization
🌐 OpenAI-compatible API for easy integration
⚡ Multi-GPU support for large models
Quick Start
from chutes.chute import NodeSelector
from chutes.chute.template.vllm import build_vllm_chute
chute = build_vllm_chute(
    username="myuser",
    model_name="microsoft/DialoGPT-medium",
    revision="main",  # Required: locks model to specific version
    node_selector=NodeSelector(
        gpu_count=1,
        min_vram_gb_per_gpu=16
    )
)
That's it! This creates a complete VLLM deployment.
Engine Configuration
The engine_args parameter lets you configure VLLM's behavior:
Memory and Performance
engine_args = {
    # Memory utilization (0.0-1.0)
    "gpu_memory_utilization": 0.95,
    # Maximum sequence length
    "max_model_len": 4096,
    # Maximum number of sequences to process in parallel
    "max_num_seqs": 256,
    # Enable chunked prefill for long sequences
    "enable_chunked_prefill": True,
    # Maximum number of tokens in a single chunk
    "max_num_batched_tokens": 8192,
}
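To see how gpu_memory_utilization and max_model_len interact, a rough back-of-envelope KV-cache estimate helps. The numbers below describe a hypothetical 7B-class architecture (real values come from the model's config, not from VLLM):

```python
# Rough KV-cache sizing: 2 tensors (K and V) per layer, per token.
# Hypothetical 7B-class architecture; adjust to your model's config.
num_layers = 32
num_kv_heads = 32
head_dim = 128
dtype_bytes = 2  # float16 / bfloat16

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
max_model_len = 4096

# Worst-case KV cache for one full-length sequence, in GiB.
per_seq_gib = bytes_per_token * max_model_len / 1024**3

print(f"{bytes_per_token} bytes/token, {per_seq_gib:.1f} GiB per sequence")
# → 524288 bytes/token, 2.0 GiB per sequence
```

Halving max_model_len roughly halves the worst-case KV cache per sequence, which is why lowering it frees memory for more concurrent requests.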
Model Loading
engine_args = {
    # Tensor parallelism (automatically set based on GPU count)
    "tensor_parallel_size": 2,
    # Pipeline parallelism
    "pipeline_parallel_size": 1,
    # Data type for model weights
    "dtype": "auto",  # or "float16", "bfloat16", "float32"
    # Quantization method
    "quantization": "awq",  # or "gptq", "squeezellm", etc.
    # Trust remote code (for custom models)
    "trust_remote_code": True,
}
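tensor_parallel_size normally equals the GPU count and must divide the model's attention-head count evenly, since each GPU takes an equal slice of the heads. A quick sanity check (toy logic, not VLLM's internal validation):

```python
def check_tensor_parallel(num_attention_heads: int, tensor_parallel_size: int) -> bool:
    # Each GPU gets an equal slice of the attention heads, so the
    # head count must be divisible by the parallel degree.
    return num_attention_heads % tensor_parallel_size == 0

print(check_tensor_parallel(32, 2))  # 32 heads across 2 GPUs → True
print(check_tensor_parallel(32, 3))  # 3 does not divide 32 → False
```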
GPU Selection
Balanced:
include=["a100", "l40", "a6000"]  # Good performance/cost ratio
Budget:
exclude=["h100"]  # Exclude the most expensive GPUs
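include and exclude act as allow/deny filters over the candidate GPU types. A toy sketch of that semantics (not the actual Chutes scheduler):

```python
def filter_gpus(candidates, include=None, exclude=None):
    # include: keep only the listed types; exclude: drop the listed types.
    selected = [g for g in candidates if include is None or g in include]
    return [g for g in selected if exclude is None or g not in exclude]

candidates = ["a100", "h100", "l40", "a6000", "a10"]
print(filter_gpus(candidates, include=["a100", "l40", "a6000"]))
# → ['a100', 'l40', 'a6000']
print(filter_gpus(candidates, exclude=["h100"]))
# → ['a100', 'l40', 'a6000', 'a10']
```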
API Endpoints
The VLLM template provides OpenAI-compatible endpoints:
Chat Completions
POST /v1/chat/completions
import asyncio

import aiohttp

async def chat_completion():
    url = "https://myuser-mychute.chutes.ai/v1/chat/completions"
    payload = {
        "model": "microsoft/DialoGPT-medium",
        "messages": [
            {"role": "user", "content": "Hello! How are you?"}
        ],
        "max_tokens": 100,
        "temperature": 0.7,
        "stream": False
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=payload) as response:
            result = await response.json()
            print(result["choices"][0]["message"]["content"])

asyncio.run(chat_completion())
Streaming Chat
import asyncio
import json

import aiohttp

async def streaming_chat():
    url = "https://myuser-mychute.chutes.ai/v1/chat/completions"
    payload = {
        "model": "microsoft/DialoGPT-medium",
        "messages": [
            {"role": "user", "content": "Tell me a story"}
        ],
        "max_tokens": 200,
        "temperature": 0.8,
        "stream": True
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=payload) as response:
            async for line in response.content:
                if line.startswith(b"data: "):
                    chunk = line[len(b"data: "):].strip()
                    if chunk == b"[DONE]":
                        break
                    data = json.loads(chunk)
                    if data.get("choices"):
                        delta = data["choices"][0]["delta"]
                        if "content" in delta:
                            print(delta["content"], end="", flush=True)

asyncio.run(streaming_chat())
Text Completions
POST /v1/completions
payload = {
    "model": "microsoft/DialoGPT-medium",
    "prompt": "The future of AI is",
    "max_tokens": 50,
    "temperature": 0.7
}
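The response follows the OpenAI completions schema, with the generated text under choices[0]["text"]. A sample shape (field values are illustrative, not real output):

```python
# Illustrative /v1/completions response: values are made up,
# but the field layout follows the OpenAI completions schema.
response = {
    "id": "cmpl-abc123",
    "object": "text_completion",
    "model": "microsoft/DialoGPT-medium",
    "choices": [
        {"index": 0, "text": " bright, if we build it carefully.", "finish_reason": "length"}
    ],
    "usage": {"prompt_tokens": 5, "completion_tokens": 8, "total_tokens": 13},
}

completion = response["choices"][0]["text"]
print(completion)
```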
from chutes.chute import NodeSelector
from chutes.chute.template.vllm import build_vllm_chute
chute = build_vllm_chute(
    username="myuser",
    model_name="microsoft/DialoGPT-medium",
    revision="main",
    node_selector=NodeSelector(
        gpu_count=1,
        min_vram_gb_per_gpu=16
    ),
    tagline="Conversational AI chatbot",
    readme="""
# My Chat Bot

A conversational AI powered by DialoGPT.

## Usage

Send POST requests to `/v1/chat/completions` with your messages.
""",
    concurrency=16
)
engine_args = {
    # Use maximum available memory
    "gpu_memory_utilization": 0.95,
    # Enable memory-efficient attention
    "enable_chunked_prefill": True,
    # Optimize for your typical sequence length
    "max_model_len": 2048,  # Adjust based on your use case
}
engine_args = {
    "guided_decoding_backend": "outlines",
}

# Then include a schema in your requests:
{
    "guided_json": {"type": "object", "properties": {"name": {"type": "string"}}}
}
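Putting it together, the schema rides alongside the normal chat fields in the request body (the guided_json field is assumed to be supported as configured above):

```python
import json

# Full request body combining standard chat fields with a guided_json schema.
payload = {
    "model": "microsoft/DialoGPT-medium",
    "messages": [{"role": "user", "content": "Extract the name: my name is Ada."}],
    "max_tokens": 50,
    "guided_json": {
        "type": "object",
        "properties": {"name": {"type": "string"}},
        "required": ["name"],
    },
}

body = json.dumps(payload)  # serialized exactly like any other request
print(json.loads(body)["guided_json"]["required"])
# → ['name']
```

The server then constrains decoding so the completion is valid JSON matching the schema.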
Migration from Other Platforms
From OpenAI
Replace the base URL and use your model name:
# Before (OpenAI)
from openai import OpenAI

client = OpenAI(api_key="sk-...")

# After (Chutes)
client = OpenAI(
    api_key="dummy",  # Not needed for Chutes
    base_url="https://myuser-mychute.chutes.ai/v1"
)
From Hugging Face Transformers
VLLM is much faster than transformers for serving:
# Before (Transformers)
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("model-name")

# After (Chutes VLLM)
chute = build_vllm_chute(
    username="myuser",
    model_name="model-name",
    # ... configuration
)