VLLM Template
The VLLM template is the most popular way to deploy large language models on Chutes. It provides a high-performance, OpenAI-compatible API server powered by vLLM, optimized for fast inference and high throughput.
What is VLLM?
VLLM is a fast and memory-efficient inference engine for large language models that provides:
- 📈 High throughput serving with PagedAttention
- 🧠 Memory efficiency with optimized attention algorithms
- 🔄 Continuous batching for better GPU utilization
- 🌐 OpenAI-compatible API for easy integration
- ⚡ Multi-GPU support for large models
Quick Start
from chutes.chute import NodeSelector
from chutes.chute.template.vllm import build_vllm_chute
chute = build_vllm_chute(
username="myuser",
model_name="microsoft/DialoGPT-medium",
revision="main", # Required: locks model to specific version
node_selector=NodeSelector(
gpu_count=1,
min_vram_gb_per_gpu=16
)
)That's it! This creates a complete VLLM deployment with:
- ✅ Automatic model downloading and caching
- ✅ OpenAI-compatible endpoint
- ✅ Built-in streaming support
- ✅ Optimized inference settings
- ✅ Auto-scaling based on demand
Function Reference
def build_vllm_chute(
username: str,
model_name: str,
node_selector: NodeSelector,
revision: str,
image: str | Image = VLLM,
tagline: str = "",
readme: str = "",
concurrency: int = 32,
engine_args: Dict[str, Any] = {}) -> VLLMChuteRequired Parameters
Your Chutes username.
HuggingFace model identifier (e.g., ).
Hardware requirements specification.
Required. Git revision/commit hash to lock the model version. Use the current branch commit for reproducible deployments.
# Get current revision from HuggingFace
revision = "cb765b56fbc11c61ac2a82ec777e3036964b975c"Optional Parameters
Docker image to use. Defaults to the official Chutes VLLM image.
Short description for your chute.
Markdown documentation for your chute.
Maximum concurrent requests per instance.
VLLM engine configuration options. See Engine Arguments.
Engine Arguments
The parameter allows you to configure VLLM's behavior:
Memory and Performance
engine_args = {
# Memory utilization (0.0-1.0)
"gpu_memory_utilization": 0.95,
# Maximum sequence length
"max_model_len": 4096,
# Maximum number of sequences to process in parallel
"max_num_seqs": 256,
# Enable chunked prefill for long sequences
"enable_chunked_prefill": True,
# Maximum number of tokens in a single chunk
"max_num_batched_tokens": 8192,
}Model Loading
engine_args = {
# Tensor parallelism (automatically set based on GPU count)
"tensor_parallel_size": 2,
# Pipeline parallelism
"pipeline_parallel_size": 1,
# Data type for model weights
"dtype": "auto", # or "float16", "bfloat16", "float32"
# Quantization method
"quantization": "awq", # or "gptq", "squeezellm", etc.
# Trust remote code (for custom models)
"trust_remote_code": True,
}Advanced Features
engine_args = {
# Enable prefix caching
"enable_prefix_caching": True,
# Speculative decoding
"speculative_model": "microsoft/DialoGPT-small",
"num_speculative_tokens": 5,
# Guided generation
"guided_decoding_backend": "outlines",
# Disable logging for better performance
"disable_log_stats": True,
"disable_log_requests": True,
}Hardware Configuration
GPU Requirements
Choose hardware based on your model size:
Small Models (< 7B parameters)
node_selector = NodeSelector(
gpu_count=1,
min_vram_gb_per_gpu=16,
include=["l40", "a6000", "a100"]
)Medium Models (7B - 13B parameters)
node_selector = NodeSelector(
gpu_count=1,
min_vram_gb_per_gpu=24,
include=["a100", "h100"]
)Large Models (13B - 70B parameters)
node_selector = NodeSelector(
gpu_count=2,
min_vram_gb_per_gpu=40,
include=["a100", "h100"]
)Huge Models (70B+ parameters)
node_selector = NodeSelector(
gpu_count=4,
min_vram_gb_per_gpu=80,
include=["h100"]
)GPU Type Selection
High Performance:
include=["h100", "a100"] # Latest, fastest GPUsBalanced:
include=["a100", "l40", "a6000"] # Good performance/cost ratioBudget:
exclude=["h100"] # Exclude most expensive GPUsAPI Endpoints
The VLLM template provides OpenAI-compatible endpoints:
Chat Completions
POST
import aiohttp
async def chat_completion():
url = "https://myuser-mychute.chutes.ai/v1/chat/completions"
payload = {
"model": "microsoft/DialoGPT-medium",
"messages": [
{"role": "user", "content": "Hello! How are you?"}
],
"max_tokens": 100,
"temperature": 0.7,
"stream": False
}
async with aiohttp.ClientSession() as session:
async with session.post(url, json=payload) as response:
result = await response.json()
print(result["choices"][0]["message"]["content"])Streaming Chat
async def streaming_chat():
url = "https://myuser-mychute.chutes.ai/v1/chat/completions"
payload = {
"model": "microsoft/DialoGPT-medium",
"messages": [
{"role": "user", "content": "Tell me a story"}
],
"max_tokens": 200,
"temperature": 0.8,
"stream": True
}
async with aiohttp.ClientSession() as session:
async with session.post(url, json=payload) as response:
async for line in response.content:
if line.startswith(b"data: "):
data = json.loads(line[6:])
if data.get("choices"):
delta = data["choices"][0]["delta"]
if "content" in delta:
print(delta["content"], end="")Text Completions
POST
payload = {
"model": "microsoft/DialoGPT-medium",
"prompt": "The future of AI is",
"max_tokens": 50,
"temperature": 0.7
}Tokenization
POST
payload = {
"model": "microsoft/DialoGPT-medium",
"text": "Hello, world!"
}
# Returns: {"tokens": [1, 2, 3, ...]}POST
payload = {
"model": "microsoft/DialoGPT-medium",
"tokens": [1, 2, 3]
}
# Returns: {"text": "Hello, world!"}Complete Examples
Basic Chat Model
from chutes.chute import NodeSelector
from chutes.chute.template.vllm import build_vllm_chute
chute = build_vllm_chute(
username="myuser",
model_name="microsoft/DialoGPT-medium",
revision="main",
node_selector=NodeSelector(
gpu_count=1,
min_vram_gb_per_gpu=16
),
tagline="Conversational AI chatbot",
readme="""
# My Chat Bot
A conversational AI powered by DialoGPT.
## Usage
Send POST requests to `/v1/chat/completions` with your messages.
""",
concurrency=16
)High-Performance Large Model
chute = build_vllm_chute(
username="myuser",
model_name="meta-llama/Llama-2-70b-chat-hf",
revision="latest-commit-hash",
node_selector=NodeSelector(
gpu_count=4,
min_vram_gb_per_gpu=80,
include=["h100", "a100"]
),
engine_args={
"gpu_memory_utilization": 0.95,
"max_model_len": 4096,
"max_num_seqs": 128,
"enable_chunked_prefill": True,
"trust_remote_code": True,
},
concurrency=64
)Code Generation Model
chute = build_vllm_chute(
username="myuser",
model_name="Phind/Phind-CodeLlama-34B-v2",
revision="main",
node_selector=NodeSelector(
gpu_count=2,
min_vram_gb_per_gpu=40
),
engine_args={
"max_model_len": 8192, # Longer context for code
"temperature": 0.1, # More deterministic for code
},
tagline="Advanced code generation AI"
)Quantized Model for Efficiency
chute = build_vllm_chute(
username="myuser",
model_name="TheBloke/Llama-2-13B-chat-AWQ",
revision="main",
node_selector=NodeSelector(
gpu_count=1,
min_vram_gb_per_gpu=16 # Much less VRAM needed
),
engine_args={
"quantization": "awq",
"gpu_memory_utilization": 0.9,
}
)Testing Your Deployment
Local Testing
Before deploying, test your configuration:
# Add to your chute file
if __name__ == "__main__":
import asyncio
async def test():
response = await chute.chat({
"model": "your-model-name",
"messages": [
{"role": "user", "content": "Hello!"}
]
})
print(response)
asyncio.run(test())Run locally:
chutes run my_vllm_chute:chute --devProduction Testing
After deployment:
curl -X POST https://myuser-mychute.chutes.ai/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "microsoft/DialoGPT-medium",
"messages": [{"role": "user", "content": "Test message"}],
"max_tokens": 50
}'Performance Optimization
Memory Optimization
engine_args = {
# Use maximum available memory
"gpu_memory_utilization": 0.95,
# Enable memory-efficient attention
"enable_chunked_prefill": True,
# Optimize for your typical sequence length
"max_model_len": 2048, # Adjust based on your use case
}Throughput Optimization
engine_args = {
# Increase parallel sequences
"max_num_seqs": 512,
# Larger batch sizes
"max_num_batched_tokens": 16384,
# Disable logging in production
"disable_log_stats": True,
"disable_log_requests": True,
}Latency Optimization
engine_args = {
# Smaller batch sizes for lower latency
"max_num_seqs": 32,
# Enable prefix caching
"enable_prefix_caching": True,
# Use speculative decoding for faster generation
"speculative_model": "smaller-model-name",
"num_speculative_tokens": 5,
}Troubleshooting
Common Issues
Out of Memory Errors
# Reduce memory usage
engine_args = {
"gpu_memory_utilization": 0.8, # Lower from 0.95
"max_model_len": 2048, # Reduce max length
"max_num_seqs": 64, # Fewer parallel sequences
}Slow Model Loading
# The model downloads on first startup
# Check logs: chutes chutes get your-chute
# Subsequent starts are fast due to cachingModel Not Found
# Ensure model exists and is public
# Check: https://huggingface.co/microsoft/DialoGPT-medium
# Use exact model name from HuggingFaceDeployment Fails
# Check image build status
chutes images list --name your-image
# Verify configuration
python -c "from my_chute import chute; print(chute.node_selector)"Performance Issues
Low Throughput
- Increase and
- Use more GPUs with
- Enable
High Latency
- Reduce for lower batching
- Enable
- Use faster GPU types (H100 > A100 > L40)
Memory Issues
- Lower
- Reduce
- Consider quantized models (AWQ, GPTQ)
Best Practices
1. Model Selection
- Use quantized models (AWQ/GPTQ) for better efficiency
- Choose the smallest model that meets your quality requirements
- Test with different model variants
2. Hardware Sizing
- Start with minimum requirements and scale up
- Monitor GPU utilization in the dashboard
- Use / filters for cost optimization
3. Performance Tuning
- Set to lock model versions
- Tune for your specific use case
- Enable logging initially, disable in production
4. Monitoring
- Check the Chutes dashboard for metrics
- Monitor request latency and throughput
- Set up alerts for failures
Advanced Features
Custom Chat Templates
engine_args = {
"chat_template": """
{%- for message in messages %}
{%- if message['role'] == 'user' %}
Human: {{ message['content'] }}
{%- elif message['role'] == 'assistant' %}
Assistant: {{ message['content'] }}
{%- endif %}
{%- endfor %}
Assistant:
"""
}Tool Calling
engine_args = {
"tool_call_parser": "mistral",
"enable_auto_tool_choice": True,
}Guided Generation
engine_args = {
"guided_decoding_backend": "outlines",
}
# Then in your requests:
{
"guided_json": {"type": "object", "properties": {"name": {"type": "string"}}}
}Migration from Other Platforms
From OpenAI
Replace the base URL and use your model name:
# Before (OpenAI)
client = OpenAI(api_key="sk-...")
# After (Chutes)
client = OpenAI(
api_key="dummy", # Not needed for Chutes
base_url="https://myuser-mychute.chutes.ai/v1"
)From Hugging Face Transformers
VLLM is much faster than transformers for serving:
# Before (Transformers)
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("model-name")
# After (Chutes VLLM)
chute = build_vllm_chute(
username="myuser",
model_name="model-name",
# ... configuration
)Next Steps
- SGLang Template - Alternative high-performance LLM serving
- Custom Images - Build your own VLLM images
- Streaming Guide - Advanced streaming patterns
- Examples - Complete application examples