Using Pre-built Templates
This guide shows how to use Chutes' pre-built templates to deploy AI applications quickly, with minimal configuration and plenty of room for customization.
Overview
Pre-built templates provide:
- Rapid Deployment: Get AI models running in minutes
- Best Practices: Optimized configurations and performance tuning
- Proven Architectures: Battle-tested model serving patterns
- Easy Customization: Modify templates to fit your needs
- Production Ready: Built-in scaling, monitoring, and error handling
Available Templates
VLLM Template
High-performance large language model serving with OpenAI compatibility.
from chutes.chute import NodeSelector
from chutes.chute.template.vllm import build_vllm_chute
# Basic VLLM deployment
chute = build_vllm_chute(
username="myuser",
readme="microsoft/DialoGPT-medium for conversational AI",
model_name="microsoft/DialoGPT-medium",
node_selector=NodeSelector(
gpu_count=1,
min_vram_gb_per_gpu=24
),
concurrency=4
)
Key Features:
- OpenAI-compatible API endpoints (see the client example after this list)
- Automatic batching and CUDA graph optimization
- Support for a wide range of open-source LLMs
- Built-in streaming and function calling
- Multi-GPU distributed inference
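Once deployed, the chute can be exercised with any OpenAI-compatible client. A minimal sketch, assuming the deployment URL follows the https://<username>-<chute-name>.chutes.ai pattern used elsewhere in this guide; the URL and API key below are placeholders:
from openai import OpenAI

# Point the standard OpenAI client at the deployed chute.
# The base_url is hypothetical; substitute your chute's actual URL.
client = OpenAI(
    base_url="https://myuser-dialogpt-medium.chutes.ai/v1",
    api_key="YOUR_CHUTES_API_KEY",
)

response = client.chat.completions.create(
    model="microsoft/DialoGPT-medium",
    messages=[{"role": "user", "content": "Hello there!"}],
    max_tokens=128,
)
print(response.choices[0].message.content)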
SGLang Template
Programmable text generation with support for structured outputs and custom constraints.
from chutes.chute import NodeSelector
from chutes.chute.template.sglang import build_sglang_chute
chute = build_sglang_chute(
username="myuser",
readme="Qwen2.5-7B-Instruct with SGLang",
model_name="Qwen/Qwen2.5-7B-Instruct",
node_selector=NodeSelector(
gpu_count=1,
min_vram_gb_per_gpu=16
),
concurrency=8
)
Key Features:
- Advanced structured generation
- Custom sampling and constraints (see the request sketch after this list)
- Batch processing optimizations
- Memory-efficient serving
- Real-time streaming responses
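As a quick smoke test, you can exercise custom sampling through SGLang's native /generate route (the chute also serves OpenAI-compatible /v1 endpoints). A minimal sketch; the URL below is a placeholder:
import requests

# Call SGLang's native generation endpoint with explicit sampling params.
# The URL is hypothetical; substitute your chute's actual URL.
response = requests.post(
    "https://myuser-qwen2-5-7b-instruct.chutes.ai/generate",
    json={
        "text": "Explain retrieval-augmented generation in one sentence.",
        "sampling_params": {"temperature": 0.7, "max_new_tokens": 128},
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["text"])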
TEI Template (Text Embeddings Inference)
High-performance text embedding generation for similarity search and RAG.
from chutes.chute import NodeSelector
from chutes.chute.template.tei import build_tei_chute
chute = build_tei_chute(
username="myuser",
readme="sentence-transformers/all-MiniLM-L6-v2 embeddings",
model_name="sentence-transformers/all-MiniLM-L6-v2",
node_selector=NodeSelector(
gpu_count=1,
min_vram_gb_per_gpu=8
),
concurrency=16
)
Key Features:
- Optimized embedding generation (see the request example after this list)
- Batch processing for efficiency
- Multiple pooling strategies
- Built-in similarity computation
- Support for various embedding models
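A minimal client sketch, assuming the standard TEI /embed route (the URL is a placeholder): embed two texts, then compare them with cosine similarity on the client side:
import requests
import numpy as np

# Embed two texts via TEI's /embed endpoint (URL is hypothetical).
response = requests.post(
    "https://myuser-all-minilm-l6-v2.chutes.ai/embed",
    json={"inputs": ["How do I reset my password?", "Steps to reset a password"]},
    timeout=30,
)
response.raise_for_status()
a, b = (np.array(v) for v in response.json())

# Cosine similarity between the two embedding vectors
similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"Cosine similarity: {similarity:.3f}")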
Diffusion Template
Image generation using state-of-the-art diffusion models.
from chutes.chute import NodeSelector
from chutes.chute.template.diffusion import build_diffusion_chute
chute = build_diffusion_chute(
username="myuser",
readme="Stable Diffusion XL for image generation",
model_name="stabilityai/stable-diffusion-xl-base-1.0",
node_selector=NodeSelector(
gpu_count=1,
min_vram_gb_per_gpu=24
),
concurrency=2
)
Key Features:
- Support for various diffusion architectures
- Text-to-image and image-to-image generation (see the client sketch after this list)
- Optimized memory usage and inference
- Built-in image processing and validation
- Support for ControlNet and LoRA
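A minimal client sketch. Both the /generate route and the payload shape are assumptions here; check your deployed chute's generated API docs for the exact schema:
import requests

# Request an image from the deployed diffusion chute.
# Endpoint and payload are assumed; verify against the chute's API docs.
response = requests.post(
    "https://myuser-sdxl.chutes.ai/generate",  # hypothetical URL
    json={"prompt": "a lighthouse at dusk, oil painting", "num_inference_steps": 30},
    timeout=120,
)
response.raise_for_status()
with open("lighthouse.png", "wb") as f:
    f.write(response.content)  # assumes the endpoint returns raw image bytes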
Template Customization
Basic Parameter Tuning
All templates support common parameters for customization:
from chutes.chute import NodeSelector
from chutes.chute.template.vllm import build_vllm_chute
# Customized VLLM deployment
chute = build_vllm_chute(
username="myuser",
readme="Customized Llama 2 deployment",
model_name="meta-llama/Llama-2-7b-chat-hf",
# Hardware configuration
node_selector=NodeSelector(
gpu_count=2, # Multi-GPU setup
min_vram_gb_per_gpu=40, # High memory requirement
include=["h100", "a100"], # Prefer specific GPU types
exclude=["k80", "v100"] # Exclude older GPUs
),
# Performance settings
concurrency=8, # Handle 8 concurrent requests
# Model-specific arguments
engine_args=dict(
gpu_memory_utilization=0.95, # Use 95% of GPU memory
max_model_len=4096, # Context length
max_num_seqs=16, # Batch size
        # Sampling options (temperature, top_p) are per-request API
        # parameters, not engine arguments, so they are not set here
trust_remote_code=True, # Enable custom models
quantization="awq", # Use AWQ quantization
tensor_parallel_size=2, # Use both GPUs
),
# Custom image (optional)
image="chutes/vllm:0.8.0",
    # Revision pinning for reproducibility; use a specific commit SHA
    # instead of "main" to actually pin the model version
    revision="main"
)
Advanced Engine Configuration
VLLM Advanced Settings
# Production VLLM configuration
chute = build_vllm_chute(
username="myuser",
model_name="microsoft/WizardLM-2-8x22B",
node_selector=NodeSelector(
gpu_count=8,
min_vram_gb_per_gpu=80,
include=["h100", "h200"]
),
engine_args=dict(
# Memory optimization
gpu_memory_utilization=0.97,
cpu_offload_gb=0,
# Performance tuning
max_model_len=32768,
max_num_seqs=32,
# Advanced features
enable_prefix_caching=True,
enable_chunked_prefill=True,
# Model loading
load_format="auto",
dtype="auto",
quantization="fp8",
# Distributed settings
tensor_parallel_size=8,
pipeline_parallel_size=1,
# API compatibility
served_model_name="wizardlm-2-8x22b",
        # chat_template: supply a Jinja template string or file here if the
        # model does not ship its own chat template
# Logging and monitoring
disable_log_requests=False,
max_log_len=2048),
concurrency=16
)
SGLang Optimization
# Optimized SGLang configuration
chute = build_sglang_chute(
username="myuser",
model_name="mistralai/Mistral-7B-Instruct-v0.2",
engine_args=(
"--host 0.0.0.0 "
"--port 30000 "
"--model-path mistralai/Mistral-7B-Instruct-v0.2 "
"--tokenizer-path mistralai/Mistral-7B-Instruct-v0.2 "
"--context-length 32768 "
"--mem-fraction-static 0.9 "
"--tp-size 1 "
"--stream-interval 1 "
"--disable-flashinfer " # For compatibility
"--trust-remote-code"
),
node_selector=NodeSelector(
gpu_count=1,
min_vram_gb_per_gpu=16
)
)
Custom Images with Templates
You can combine templates with custom images for additional dependencies:
from chutes.image import Image
from chutes.chute.template.vllm import build_vllm_chute
# Build custom image with additional packages
custom_image = (
Image(username="myuser", name="custom-vllm", tag="1.0")
.from_base("chutes/vllm:0.8.0")
.run_command("pip install langchain openai tiktoken")
.run_command("pip install numpy pandas matplotlib")
.with_env("CUSTOM_CONFIG", "production")
)
# Use custom image with template
chute = build_vllm_chute(
username="myuser",
model_name="meta-llama/Llama-2-7b-chat-hf",
image=custom_image, # Use our custom image
node_selector=NodeSelector(
gpu_count=1,
min_vram_gb_per_gpu=24
)
)
Template Patterns
Multi-Model Deployment
Deploy multiple models using templates:
# Deploy different models for different use cases
from chutes.chute import NodeSelector
from chutes.chute.template.vllm import build_vllm_chute
from chutes.chute.template.tei import build_tei_chute
# Chat model
chat_chute = build_vllm_chute(
username="myuser",
name="chat-service",
model_name="microsoft/DialoGPT-medium",
node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=16)
)
# Code model
code_chute = build_vllm_chute(
username="myuser",
name="code-service",
model_name="codellama/CodeLlama-7b-Python-hf",
node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=16)
)
# Embedding model
embedding_chute = build_tei_chute(
username="myuser",
name="embedding-service",
model_name="sentence-transformers/all-mpnet-base-v2",
node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=8)
)
Template Inheritance and Extension
Create your own template patterns based on existing ones:
from chutes.chute.template.vllm import build_vllm_chute
from chutes.chute import NodeSelector
from chutes.image import Image
def build_chat_template(
username: str,
model_name: str,
system_prompt: str = "You are a helpful assistant.",
**kwargs
):
"""Custom template for chat applications."""
# Custom image with chat-specific tools
image = (
Image(username=username, name="chat-optimized", tag="1.0")
.from_base("chutes/vllm:latest")
.run_command("pip install tiktoken langchain")
.with_env("SYSTEM_PROMPT", system_prompt)
.with_env("CHAT_MODE", "true")
)
    # Default engine settings optimized for chat. Sampling options
    # (temperature, top_p, max_tokens, stream) are request-time parameters
    # in the OpenAI-compatible API, not vLLM engine arguments.
    default_engine_args = {
        "max_model_len": 8192,
        "enable_prefix_caching": True
    }
# Merge with user-provided args
engine_args = kwargs.pop("engine_args", {})
engine_args = {**default_engine_args, **engine_args}
return build_vllm_chute(
username=username,
model_name=model_name,
image=image,
engine_args=engine_args,
**kwargs
)
# Use custom template
chat_chute = build_chat_template(
username="myuser",
model_name="microsoft/DialoGPT-medium",
system_prompt="You are a friendly customer service assistant.",
node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=16)
)
Template-Based Microservices
Build a complete AI system using multiple templates:
# microservices_deployment.py
from chutes.chute import NodeSelector
from chutes.chute.template.vllm import build_vllm_chute
from chutes.chute.template.tei import build_tei_chute
from chutes.chute.template.diffusion import build_diffusion_chute
class AIServiceSuite:
"""Complete AI service suite using templates."""
def __init__(self, username: str):
self.username = username
self.services = {}
def deploy_text_services(self):
"""Deploy text processing services."""
# Main chat model
self.services["chat"] = build_vllm_chute(
username=self.username,
name="chat-llm",
model_name="microsoft/DialoGPT-medium",
node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=24),
concurrency=8
)
# Specialized reasoning model
self.services["reasoning"] = build_vllm_chute(
username=self.username,
name="reasoning-llm",
model_name="deepseek-ai/deepseek-llm-7b-chat",
node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=16),
concurrency=4
)
# Embeddings for RAG
self.services["embeddings"] = build_tei_chute(
username=self.username,
name="text-embeddings",
model_name="sentence-transformers/all-mpnet-base-v2",
node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=8),
concurrency=16
)
def deploy_multimodal_services(self):
"""Deploy multimodal AI services."""
# Image generation
self.services["image_gen"] = build_diffusion_chute(
username=self.username,
name="image-generator",
model_name="stabilityai/stable-diffusion-xl-base-1.0",
node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=24),
concurrency=2
)
# Vision-language model
self.services["vision"] = build_vllm_chute(
username=self.username,
name="vision-llm",
model_name="llava-hf/llava-1.5-7b-hf",
node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=16),
concurrency=4
)
def get_deployment_script(self):
"""Generate deployment script for all services."""
script_lines = ["#!/bin/bash", "set -e", ""]
for service_name, chute in self.services.items():
script_lines.extend([
f"echo 'Deploying {service_name}...'",
f"chutes deploy {chute.name}:chute --wait",
f"echo '{service_name} deployed successfully'",
""
])
return "\n".join(script_lines)
# Usage
suite = AIServiceSuite("myuser")
suite.deploy_text_services()
suite.deploy_multimodal_services()
# Generate deployment script
deployment_script = suite.get_deployment_script()
with open("deploy_ai_suite.sh", "w") as f:
f.write(deployment_script)
Template Configuration Best Practices
1. Hardware Selection
Choose appropriate hardware for each template:
from chutes.chute import NodeSelector

# Memory requirements by model size
hardware_configs = {
"small_models": { # <7B parameters
"node_selector": NodeSelector(
gpu_count=1,
min_vram_gb_per_gpu=16,
include=["rtx4090", "a40", "l40"]
),
"concurrency": 8
},
"medium_models": { # 7B-30B parameters
"node_selector": NodeSelector(
gpu_count=1,
min_vram_gb_per_gpu=48,
include=["a100", "h100"]
),
"concurrency": 4
},
"large_models": { # 30B+ parameters
"node_selector": NodeSelector(
gpu_count=2,
min_vram_gb_per_gpu=80,
include=["h100", "h200"]
),
"concurrency": 2
}
}
def select_hardware(model_name: str):
"""Select hardware configuration based on model."""
# Simple heuristic based on model name
if "7b" in model_name.lower():
return hardware_configs["small_models"]
elif any(size in model_name.lower() for size in ["13b", "30b"]):
return hardware_configs["medium_models"]
else:
return hardware_configs["large_models"]
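The helper plugs straight into a template builder:
from chutes.chute.template.vllm import build_vllm_chute

# Usage: choose hardware from the model name, then unpack into the builder
model_name = "meta-llama/Llama-2-13b-chat-hf"
config = select_hardware(model_name)

chute = build_vllm_chute(
    username="myuser",
    model_name=model_name,
    node_selector=config["node_selector"],
    concurrency=config["concurrency"],
)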
2. Environment-Specific Configurations
import os

from chutes.chute import NodeSelector
from chutes.chute.template.vllm import build_vllm_chute
def get_config_for_environment(env: str = "production"):
"""Get configuration based on deployment environment."""
configs = {
"development": {
"concurrency": 2,
"engine_args": {
"gpu_memory_utilization": 0.8,
"max_model_len": 2048,
"disable_log_requests": False
}
},
"staging": {
"concurrency": 4,
"engine_args": {
"gpu_memory_utilization": 0.9,
"max_model_len": 4096,
"disable_log_requests": False
}
},
"production": {
"concurrency": 8,
"engine_args": {
"gpu_memory_utilization": 0.95,
"max_model_len": 8192,
"disable_log_requests": True,
"enable_prefix_caching": True
}
}
}
return configs.get(env, configs["production"])
# Usage
env = os.getenv("DEPLOYMENT_ENV", "production")
config = get_config_for_environment(env)
chute = build_vllm_chute(
    username="myuser",
    model_name="meta-llama/Llama-2-7b-chat-hf",
    node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=24),
    **config
)
3. Model-Specific Optimizations
from chutes.chute import NodeSelector
from chutes.chute.template.vllm import build_vllm_chute

def get_model_optimizations(model_name: str):
"""Get model-specific optimizations."""
optimizations = {
# Llama models
"llama": {
"engine_args": {
"quantization": "awq",
"enable_prefix_caching": True,
"use_v2_block_manager": True
}
},
# Mistral models
"mistral": {
"engine_args": {
"tokenizer_mode": "mistral",
"config_format": "mistral",
"trust_remote_code": True
}
},
        # CodeLlama models
        "code": {
            "engine_args": {
                "max_model_len": 16384,  # Longer context for code
                "enable_prefix_caching": True
                # Request a low temperature (e.g. 0.1) per call for code
            }
        },
        # Chat models: sampling options (temperature, top_p, max_tokens,
        # stream) are request-time parameters, so only engine-level
        # settings are set here
        "chat": {
            "engine_args": {
                "max_model_len": 8192,
                "enable_prefix_caching": True
            }
        }
}
# Detect model type from name
model_lower = model_name.lower()
if "llama" in model_lower:
return optimizations["llama"]
elif "mistral" in model_lower:
return optimizations["mistral"]
elif "code" in model_lower:
return optimizations["code"]
elif any(term in model_lower for term in ["chat", "instruct", "dialog"]):
return optimizations["chat"]
else:
return {"engine_args": {}}
# Usage
model_name = "codellama/CodeLlama-7b-Python-hf"
optimizations = get_model_optimizations(model_name)
chute = build_vllm_chute(
    username="myuser",
    model_name=model_name,
    node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=16),
    **optimizations
)
Monitoring and Debugging Templates
Template Health Checks
import requests

def check_template_health(chute_url: str, template_type: str):
"""Check health of deployed template."""
health_checks = {
"vllm": {
"endpoint": "/v1/models",
"expected_status": 200
},
"sglang": {
"endpoint": "/health",
"expected_status": 200
},
"tei": {
"endpoint": "/health",
"expected_status": 200
},
"diffusion": {
"endpoint": "/health",
"expected_status": 200
}
}
if template_type not in health_checks:
return {"status": "unknown", "error": "Unknown template type"}
check_config = health_checks[template_type]
try:
response = requests.get(
f"{chute_url}{check_config['endpoint']}",
timeout=10
)
if response.status_code == check_config["expected_status"]:
return {"status": "healthy", "response_time": response.elapsed.total_seconds()}
else:
return {"status": "unhealthy", "status_code": response.status_code}
except Exception as e:
return {"status": "error", "error": str(e)}
# Usage
health = check_template_health(
"https://myuser-my-model.chutes.ai",
"vllm"
)
print(f"Service health: {health}")
Performance Monitoring
def monitor_template_performance(chute_name: str, duration_minutes: int = 60):
"""Monitor template performance over time."""
import subprocess
import json
# Collect metrics
metrics_cmd = f"chutes chutes metrics {chute_name} --duration {duration_minutes}m --format json"
result = subprocess.run(metrics_cmd, shell=True, capture_output=True, text=True)
if result.returncode == 0:
metrics = json.loads(result.stdout)
# Analyze metrics
analysis = {
"avg_response_time": metrics.get("avg_response_time", 0),
"request_count": metrics.get("request_count", 0),
"error_rate": metrics.get("error_rate", 0),
"gpu_utilization": metrics.get("gpu_utilization", 0),
"memory_usage": metrics.get("memory_usage", 0)
}
# Performance recommendations
recommendations = []
if analysis["avg_response_time"] > 5:
recommendations.append("Consider increasing concurrency or using faster GPUs")
if analysis["gpu_utilization"] < 50:
recommendations.append("GPU underutilized - consider reducing instance size")
if analysis["error_rate"] > 5:
recommendations.append("High error rate - check logs and model configuration")
return {
"metrics": analysis,
"recommendations": recommendations
}
else:
return {"error": "Failed to collect metrics", "details": result.stderr}
Template Migration and Updates
Upgrading Template Versions
from chutes.chute.template.vllm import build_vllm_chute

def upgrade_template_safely(
current_chute_name: str,
new_template_version: str,
model_name: str,
username: str
):
"""Safely upgrade a template to a new version."""
# Create new chute with updated template
staging_name = f"{current_chute_name}-staging"
new_chute = build_vllm_chute(
username=username,
name=staging_name,
model_name=model_name,
image=f"chutes/vllm:{new_template_version}",
        # Copy the current configuration; these helper functions are
        # placeholders to implement against your own deployment records
        node_selector=get_current_node_selector(current_chute_name),
        engine_args=get_current_engine_args(current_chute_name)
)
# Deployment script
upgrade_script = f"""
# Deploy staging version
chutes deploy {staging_name}:chute --wait
# Test staging deployment
python test_template.py --target {staging_name}
# If tests pass, switch traffic
if [ $? -eq 0 ]; then
echo "Tests passed, deploying to production"
chutes deploy {current_chute_name}:chute --wait
chutes chutes delete {staging_name}
else
echo "Tests failed, keeping current version"
chutes chutes delete {staging_name}
fi
"""
return upgrade_script
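Usage: generate the script, then review it before running (the version tag below is a placeholder):
script = upgrade_template_safely(
    current_chute_name="chat-llm",
    new_template_version="0.8.1",  # hypothetical version tag
    model_name="meta-llama/Llama-2-7b-chat-hf",
    username="myuser",
)
with open("upgrade_chat_llm.sh", "w") as f:
    f.write(script)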
Troubleshooting Templates
Common Issues and Solutions
import subprocess

def diagnose_template_issues(chute_name: str, template_type: str):
"""Diagnose common template deployment issues."""
issues = []
# Check deployment status
status_cmd = f"chutes chutes get {chute_name}"
status_result = subprocess.run(status_cmd, shell=True, capture_output=True, text=True)
if "Failed" in status_result.stdout:
issues.append({
"issue": "Deployment failed",
"solution": "Check logs with: chutes chutes logs " + chute_name
})
# Check resource usage
metrics_cmd = f"chutes chutes metrics {chute_name}"
metrics_result = subprocess.run(metrics_cmd, shell=True, capture_output=True, text=True)
if "OutOfMemory" in metrics_result.stdout:
issues.append({
"issue": "GPU out of memory",
"solution": "Reduce gpu_memory_utilization or increase GPU size"
})
# Template-specific checks
if template_type == "vllm":
# Check for VLLM-specific issues
if "CUDA_ERROR_OUT_OF_MEMORY" in metrics_result.stdout:
issues.append({
"issue": "VLLM CUDA memory error",
"solution": "Reduce max_model_len or batch size (max_num_seqs)"
})
elif template_type == "sglang":
# Check for SGLang-specific issues
if "RuntimeError" in metrics_result.stdout:
issues.append({
"issue": "SGLang runtime error",
"solution": "Check model compatibility and reduce memory usage"
})
return issues
# Quick diagnostics
issues = diagnose_template_issues("my-llm-service", "vllm")
for issue in issues:
print(f"Issue: {issue['issue']}")
print(f"Solution: {issue['solution']}\n")
Next Steps
- Custom Templates: Build your own reusable templates
- Production Scaling: Monitor and optimize template performance
- Advanced Patterns: Combine templates for complex architectures
- CI/CD Integration: Automate template deployments
For more advanced topics, see: