The VLLM template is the most popular way to deploy large language models on Chutes. It provides a high-performance, OpenAI-compatible API server powered by vLLM, optimized for fast inference and high throughput.
## What is vLLM?

vLLM is a fast, memory-efficient inference engine for large language models that provides:

- 📈 High-throughput serving with PagedAttention
- 🧠 Memory efficiency with optimized attention algorithms
- 🔄 Continuous batching for better GPU utilization
- 🌐 OpenAI-compatible API for easy integration
- ⚡ Multi-GPU support for large models
## Quick Start

```python
from chutes.chute import NodeSelector
from chutes.chute.template.vllm import build_vllm_chute

chute = build_vllm_chute(
    username="myuser",
    model_name="microsoft/DialoGPT-medium",
    revision="main",  # Required: pins the model to a specific version
    node_selector=NodeSelector(
        gpu_count=1,
        min_vram_gb_per_gpu=16
    )
)
```
That's it! This creates a complete vLLM deployment with a high-performance, OpenAI-compatible API server.
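Once deployed, the chute can be queried like any OpenAI-compatible endpoint. A minimal sketch, assuming the `https://<username>-<chute-name>.chutes.ai` hostname pattern shown later in this guide:

```python
import requests

# Hypothetical deployment URL; substitute your own username and chute name
BASE_URL = "https://myuser-mychute.chutes.ai/v1"

response = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "microsoft/DialoGPT-medium",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```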
## Engine Arguments

The `engine_args` parameter allows you to configure vLLM's behavior:
### Memory and Performance

```python
engine_args = {
    # Memory utilization (0.0-1.0)
    "gpu_memory_utilization": 0.95,
    # Maximum sequence length
    "max_model_len": 4096,
    # Maximum number of sequences to process in parallel
    "max_num_seqs": 256,
    # Enable chunked prefill for long sequences
    "enable_chunked_prefill": True,
    # Maximum number of tokens in a single chunk
    "max_num_batched_tokens": 8192,
}
```
### Model Loading

```python
engine_args = {
    # Tensor parallelism (automatically set based on GPU count)
    "tensor_parallel_size": 2,
    # Pipeline parallelism
    "pipeline_parallel_size": 1,
    # Data type for model weights
    "dtype": "auto",  # or "float16", "bfloat16", "float32"
    # Quantization method
    "quantization": "awq",  # or "gptq", "squeezellm", etc.
    # Trust remote code (for custom models)
    "trust_remote_code": True,
}
```
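These dictionaries are passed to `build_vllm_chute` via its `engine_args` parameter. A minimal sketch combining the two (the specific values are illustrative):

```python
from chutes.chute import NodeSelector
from chutes.chute.template.vllm import build_vllm_chute

chute = build_vllm_chute(
    username="myuser",
    model_name="microsoft/DialoGPT-medium",
    revision="main",
    node_selector=NodeSelector(gpu_count=2, min_vram_gb_per_gpu=24),
    engine_args={
        "gpu_memory_utilization": 0.95,
        "max_model_len": 4096,
        "tensor_parallel_size": 2,  # matches gpu_count
    },
)
```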
## Customizing Your Deployment

`build_vllm_chute` also accepts presentation and scaling options, such as a tagline, a README, and a concurrency limit:

```python
from chutes.chute import NodeSelector
from chutes.chute.template.vllm import build_vllm_chute

chute = build_vllm_chute(
    username="myuser",
    model_name="microsoft/DialoGPT-medium",
    revision="main",
    node_selector=NodeSelector(
        gpu_count=1,
        min_vram_gb_per_gpu=16
    ),
    tagline="Conversational AI chatbot",
    readme="""
# My Chat Bot

A conversational AI powered by DialoGPT.

## Usage

Send POST requests to `/v1/chat/completions` with your messages.
""",
    concurrency=16
)
```
## Memory Optimization

For memory-constrained deployments, these engine arguments are common starting points:

```python
engine_args = {
    # Use maximum available memory
    "gpu_memory_utilization": 0.95,
    # Enable memory-efficient chunked prefill
    "enable_chunked_prefill": True,
    # Optimize for your typical sequence length
    "max_model_len": 2048,  # Adjust based on your use case
}
```
## Structured Output with Guided Decoding

vLLM supports guided decoding for structured output. Select the backend via `engine_args`:

```python
engine_args = {
    "guided_decoding_backend": "outlines",
}
```

Then constrain generation per request, for example with a JSON schema:

```json
{
    "guided_json": {"type": "object", "properties": {"name": {"type": "string"}}}
}
```
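A sketch of a full request, assuming the OpenAI Python client and vLLM's `extra_body` passthrough for non-standard parameters:

```python
from openai import OpenAI

client = OpenAI(api_key="dummy", base_url="https://myuser-mychute.chutes.ai/v1")

response = client.chat.completions.create(
    model="microsoft/DialoGPT-medium",
    messages=[{"role": "user", "content": "Extract the name: 'Hi, I'm Ada.'"}],
    # Parameters outside the OpenAI spec are forwarded to vLLM via extra_body
    extra_body={
        "guided_json": {"type": "object", "properties": {"name": {"type": "string"}}}
    },
)
print(response.choices[0].message.content)
```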
## Migration from Other Platforms

### From OpenAI

Replace the base URL and use your model name:
```python
from openai import OpenAI

# Before (OpenAI)
client = OpenAI(api_key="sk-...")

# After (Chutes)
client = OpenAI(
    api_key="dummy",  # Not needed for Chutes
    base_url="https://myuser-mychute.chutes.ai/v1"
)
```
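Existing completion calls then work unchanged; for example (the model name is illustrative):

```python
response = client.chat.completions.create(
    model="microsoft/DialoGPT-medium",  # your deployed model
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```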
### From Hugging Face Transformers

vLLM is much faster than Transformers for serving:
```python
# Before (Transformers)
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("model-name")

# After (Chutes VLLM)
chute = build_vllm_chute(
    username="myuser",
    model_name="model-name",
    # ... configuration
)
```