Templates
Templates in Chutes are pre-built, optimized configurations for common AI workloads. They provide production-ready setups in just a few lines of code, handling details such as Docker images, model loading, API endpoints, and hardware requirements.
What are Templates?
Templates are factory functions that create complete Chute configurations for specific use cases:
- 🚀 One-line deployment of complex AI systems
- 🔧 Pre-optimized configurations for performance and cost
- 📦 Batteries-included with all necessary dependencies
- 🎯 Best practices built-in by default
- 🔄 Customizable for specific needs
Available Templates
Language Model Templates
VLLM Template
High-performance language model serving with an OpenAI-compatible API.

from chutes.chute import NodeSelector
from chutes.chute.template.vllm import build_vllm_chute

chute = build_vllm_chute(
    username="myuser",
    model_name="microsoft/DialoGPT-medium",
    revision="main",
    node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=16),
)
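Because the VLLM template exposes an OpenAI-compatible API, any OpenAI-style client can talk to the deployed chute. A minimal sketch; the base URL and API key below are placeholders, so substitute the values from your actual deployment:

import openai

# Placeholder deployment URL and key -- use the values from your deployment.
client = openai.OpenAI(
    base_url="https://myuser-my-llm.chutes.ai/v1",
    api_key="YOUR_API_KEY",
)
response = client.chat.completions.create(
    model="microsoft/DialoGPT-medium",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)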
SGLang Template
Structured generation for complex prompting and reasoning.
from chutes.chute.template.sglang import build_sglang_chute

chute = build_sglang_chute(
    username="myuser",
    model_name="microsoft/DialoGPT-medium",
    node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=16),
)
Embedding Templates
Text Embeddings Inference (TEI)
Optimized text embedding generation.
from chutes.chute.template.tei import build_tei_chute

chute = build_tei_chute(
    username="myuser",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=8),
)
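Once deployed, the service can be queried over HTTP. A sketch assuming the chute exposes TEI's standard `/embed` route, with a placeholder deployment URL:

import requests

# Placeholder URL; TEI's /embed route accepts a string or a list of strings.
resp = requests.post(
    "https://myuser-my-embedder.chutes.ai/embed",
    json={"inputs": ["The quick brown fox", "jumps over the lazy dog"]},
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
embeddings = resp.json()  # one embedding vector per input text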
Image Generation Templates
Diffusion Template
Serving for Stable Diffusion and other diffusion models.

from chutes.chute.template.diffusion import build_diffusion_chute

chute = build_diffusion_chute(
    username="myuser",
    model_name="stabilityai/stable-diffusion-xl-base-1.0",
    node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=12),
)
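Calling the deployed image service looks similar; the `/generate` path and payload below are illustrative assumptions, so check your chute's generated API docs for the exact schema:

import requests

# Assumed endpoint and payload shape -- verify against your chute's docs.
resp = requests.post(
    "https://myuser-my-diffuser.chutes.ai/generate",
    json={"prompt": "a watercolor fox in a forest", "num_inference_steps": 30},
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
with open("output.png", "wb") as f:
    f.write(resp.content)  # assumes the response body is the raw image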
Template Categories
🗣️ Language Models
Use Cases: Text generation, chat, completion, code generation
- VLLM: Production-scale LLM serving
- SGLang: Complex reasoning and structured generation
- Transformers: Custom model implementations
🔤 Text Processing
Use Cases: Embeddings, classification, named entity recognition
- TEI: Fast embedding generation
- Sentence Transformers: Semantic similarity
- BERT: Classification and encoding
🎨 Image Generation
Use Cases: Image synthesis, editing, style transfer
- Diffusion: Stable Diffusion variants
- GAN: Generative adversarial networks
- ControlNet: Controlled image generation
🎵 Audio Processing
Use Cases: Speech recognition, text-to-speech, music generation
- Whisper: Speech-to-text
- TTS: Text-to-speech synthesis
- MusicGen: Music generation
🎬 Video Processing
Use Cases: Video analysis, generation, editing
- Video Analysis: Object detection, tracking
- Video Generation: Text-to-video models
- Video Enhancement: Upscaling, stabilization
Template Benefits
1. Instant Deployment
# Without templates (complex setup)
image = (
    Image(username="myuser", name="vllm-app", tag="1.0")
    .from_base("nvidia/cuda:12.1.0-devel-ubuntu22.04")
    .with_python("3.11")
    .run_command("pip install vllm==0.2.0")
    .run_command("pip install transformers torch")
    # ... 50+ more lines of configuration
)

chute = Chute(
    username="myuser",
    name="llm-service",
    image=image,
    node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=16),
)

@chute.on_startup()
async def load_model(self):
    ...  # complex model loading logic

@chute.cord(public_api_path="/v1/chat/completions")
async def chat_completions(self, request: ChatRequest):
    ...  # OpenAI API compatibility logic

# With templates (one call)
chute = build_vllm_chute(
    username="myuser",
    model_name="microsoft/DialoGPT-medium",
    node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=16),
)
2. Production-Ready Defaults
Templates include:
- ✅ Optimized Docker images
- ✅ Proper error handling
- ✅ Logging and monitoring
- ✅ Health checks
- ✅ Resource optimization
- ✅ Security best practices
3. Hardware Optimization
Templates automatically optimize for:
- GPU memory usage
- CPU utilization
- Network throughput
- Storage requirements
Template Customization
Basic Customization
from chutes.chute import NodeSelector
from chutes.chute.template.vllm import build_vllm_chute

# Customize standard parameters
chute = build_vllm_chute(
    username="myuser",
    model_name="microsoft/DialoGPT-medium",
    revision="main",
    node_selector=NodeSelector(
        gpu_count=2,
        min_vram_gb_per_gpu=24,
    ),
    concurrency=8,
    tagline="Custom LLM API",
    readme="# My Custom LLM\nPowered by VLLM",
)
Advanced Customization
# Custom engine arguments (engine-level settings only)
chute = build_vllm_chute(
    username="myuser",
    model_name="microsoft/DialoGPT-medium",
    engine_args={
        "max_model_len": 4096,
        "gpu_memory_utilization": 0.9,
        "max_num_seqs": 32,
    },
)
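Note that sampling parameters such as temperature are not engine arguments; they are supplied per request. For example, through the OpenAI-compatible API (using the client shown earlier):

# temperature is a per-request sampling parameter, not an engine argument
response = client.chat.completions.create(
    model="microsoft/DialoGPT-medium",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
)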
# Custom Docker image
custom_image = (
    Image(username="myuser", name="custom-vllm", tag="1.0")
    .from_base("nvidia/cuda:12.1.0-devel-ubuntu22.04")
    .with_python("3.11")
    .run_command("pip install vllm==0.2.0")
    .run_command("pip install my-custom-package")
)

chute = build_vllm_chute(
    username="myuser",
    model_name="microsoft/DialoGPT-medium",
    image=custom_image,
)
Template Extension
# Extend a template with custom functionality
base_chute = build_vllm_chute(
    username="myuser",
    model_name="microsoft/DialoGPT-medium",
    node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=16),
)

# Add custom endpoints
@base_chute.cord(public_api_path="/custom/analyze")
async def analyze_text(self, text: str) -> dict:
    # Custom text analysis logic
    return {"analysis": "custom_result"}

# Add custom startup logic
@base_chute.on_startup()
async def custom_initialization(self):
    # Additional setup (CustomProcessor is your own helper class)
    self.custom_processor = CustomProcessor()
Template Parameters
Common Parameters
All templates support these standard parameters:
def build_template_chute(
    username: str,                       # Required: your Chutes username
    model_name: str,                     # Required: HuggingFace model name
    revision: str = "main",              # Git revision/branch
    node_selector: NodeSelector = None,  # Hardware requirements
    image: str | Image = None,           # Custom Docker image
    tagline: str = "",                   # Short description
    readme: str = "",                    # Markdown documentation
    concurrency: int = 1,                # Concurrent requests per instance
    **kwargs,                            # Template-specific options
) -> Chute:
    ...
Template-Specific Parameters
VLLM Template
build_vllm_chute(
    # Standard parameters...
    engine_args: dict = None,              # VLLM engine configuration
    trust_remote_code: bool = False,       # Allow remote code execution
    max_model_len: int = None,             # Maximum sequence length
    gpu_memory_utilization: float = 0.85,  # Fraction of GPU memory to use
    max_num_seqs: int = 128,               # Maximum concurrent sequences
)
Diffusion Template
build_diffusion_chute(
    # Standard parameters...
    pipeline_type: str = "text2img",  # Pipeline type
    scheduler: str = "euler",         # Diffusion scheduler
    safety_checker: bool = True,      # Content safety filtering
    guidance_scale: float = 7.5,      # CFG scale
    num_inference_steps: int = 50,    # Generation steps
)
TEI Template
build_tei_chute(
    # Standard parameters...
    pooling: str = "mean",   # Pooling strategy
    normalize: bool = True,  # Normalize embeddings
    batch_size: int = 32,    # Inference batch size
    max_length: int = 512,   # Maximum input length
)
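For example, combining standard and TEI-specific parameters from the lists above:

chute = build_tei_chute(
    username="myuser",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=8),
    pooling="cls",   # use the [CLS] token instead of mean pooling
    max_length=256,  # truncate inputs earlier than the default 512
)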
Template Comparison
Language Model Templates
| Template | Best For | Performance | Memory | API |
|---|---|---|---|---|
| VLLM | Production LLM serving | Highest | Optimized | OpenAI-compatible |
| SGLang | Complex reasoning | High | Standard | Custom structured |
| Transformers | Custom implementations | Medium | High | Flexible |
Image Templates
| Template | Best For | Speed | Quality | Customization |
|---|---|---|---|---|
| Diffusion | General image generation | Fast | High | Extensive |
| Stable Diffusion XL | High-resolution images | Medium | Highest | Good |
| ControlNet | Controlled generation | Medium | High | Specialized |
Creating Custom Templates
Simple Template Function
from chutes.chute import Chute, NodeSelector
from chutes.image import Image

def build_custom_nlp_chute(
    username: str,
    model_name: str,
    node_selector: NodeSelector,
    task_type: str = "classification",
) -> Chute:
    """Custom NLP template for classification and NER."""
    # Create a custom image
    image = (
        Image(username=username, name="custom-nlp", tag="1.0")
        .from_base("nvidia/cuda:12.1.0-runtime-ubuntu22.04")
        .with_python("3.11")
        .run_command("pip install transformers torch scikit-learn")
    )

    # Create the chute
    chute = Chute(
        username=username,
        name=f"nlp-{task_type}",
        image=image,
        node_selector=node_selector,
        tagline=f"Custom {task_type} service",
    )

    # Add model loading
    @chute.on_startup()
    async def load_model(self):
        from transformers import AutoTokenizer, AutoModelForSequenceClassification
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)

    # Add an API endpoint
    @chute.cord(public_api_path=f"/{task_type}")
    async def classify(self, text: str) -> dict:
        inputs = self.tokenizer(text, return_tensors="pt")
        outputs = self.model(**inputs)
        predictions = outputs.logits.softmax(dim=-1)
        return {"predictions": predictions.tolist()}

    return chute

# Use the custom template
custom_chute = build_custom_nlp_chute(
    username="myuser",
    model_name="distilbert-base-uncased-finetuned-sst-2-english",
    node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=8),
    task_type="sentiment",
)
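After deployment, the cord above is reachable over HTTP. A sketch assuming the standard chute URL pattern (placeholder below) and a JSON body matching the cord's `text` parameter:

import requests

# Placeholder URL; use the one shown for your deployment.
resp = requests.post(
    "https://myuser-nlp-sentiment.chutes.ai/sentiment",
    json={"text": "I love this product!"},
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
print(resp.json())  # {"predictions": [[...]]}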
Advanced Template with Configuration
from dataclasses import dataclass
from typing import Optional

from chutes.chute import Chute, NodeSelector
from chutes.image import Image

@dataclass
class CustomNLPConfig:
    batch_size: int = 32
    max_length: int = 512
    use_gpu: bool = True
    cache_size: int = 1000

def build_advanced_nlp_chute(
    username: str,
    model_name: str,
    node_selector: NodeSelector,
    config: Optional[CustomNLPConfig] = None,
) -> Chute:
    """Advanced NLP template with configuration."""
    if config is None:
        config = CustomNLPConfig()

    # Build the image with config-specific optimizations
    image = (
        Image(username=username, name="advanced-nlp", tag="1.0")
        .from_base("nvidia/cuda:12.1.0-runtime-ubuntu22.04")
        .with_python("3.11")
        .run_command("pip install transformers torch accelerate")
    )
    if config.use_gpu:
        image = image.with_env("CUDA_VISIBLE_DEVICES", "0")

    chute = Chute(
        username=username,
        name="advanced-nlp",
        image=image,
        node_selector=node_selector,
    )

    @chute.on_startup()
    async def setup(self):
        # Initialize with the configuration
        self.config = config
        self.cache = {}  # simple in-memory embedding cache

        # Load the model
        from transformers import AutoTokenizer, AutoModel
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        if config.use_gpu:
            self.model = self.model.cuda()

    @chute.cord(public_api_path="/process")
    async def process_text(self, texts: list[str]) -> dict:
        import torch

        # Batch processing with the configured batch size
        results = []
        for i in range(0, len(texts), self.config.batch_size):
            batch = texts[i:i + self.config.batch_size]

            # Only embed texts that are not cached yet
            new_texts = [text for text in batch if text not in self.cache]
            computed = {}
            if new_texts:
                inputs = self.tokenizer(
                    new_texts,
                    return_tensors="pt",
                    padding=True,
                    truncation=True,
                    max_length=self.config.max_length,
                )
                if self.config.use_gpu:
                    inputs = {k: v.cuda() for k, v in inputs.items()}
                with torch.no_grad():
                    outputs = self.model(**inputs)

                # Mean-pool token embeddings; cache until the size limit is reached
                for text, output in zip(new_texts, outputs.last_hidden_state):
                    embedding = output.mean(dim=0).cpu().tolist()
                    computed[text] = embedding
                    if len(self.cache) < self.config.cache_size:
                        self.cache[text] = embedding

            # Assemble results in the original input order
            results.extend(
                computed[text] if text in computed else self.cache[text]
                for text in batch
            )

        return {"embeddings": results, "count": len(results)}

    return chute
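Using the advanced template mirrors the simple one, now with an explicit configuration (the model name here is illustrative):

config = CustomNLPConfig(batch_size=16, max_length=256, cache_size=5000)
advanced_chute = build_advanced_nlp_chute(
    username="myuser",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=8),
    config=config,
)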
Template Best Practices
1. Use Appropriate Templates
# For LLM inference
vllm_chute = build_vllm_chute(...)
# For embedding generation
tei_chute = build_tei_chute(...)
# For image generation
diffusion_chute = build_diffusion_chute(...)
2. Customize Hardware Requirements
# Small models
small_selector = NodeSelector(
    gpu_count=1,
    min_vram_gb_per_gpu=8,
)

# Large models
large_selector = NodeSelector(
    gpu_count=2,
    min_vram_gb_per_gpu=40,
)
3. Version Control Your Models
# Always specify a revision
chute = build_vllm_chute(
    username="myuser",
    model_name="microsoft/DialoGPT-medium",
    revision="main",  # better: pin a specific commit hash for reproducibility
)
4. Document Your Deployments
chute = build_vllm_chute(
    username="myuser",
    model_name="microsoft/DialoGPT-medium",
    tagline="Customer service chatbot",
    readme="""
# Customer Service Bot

This chute provides automated customer service responses
using DialoGPT-medium.

## Usage

Send POST requests to `/v1/chat/completions`.
""",
)
Next Steps
- VLLM Template - Detailed VLLM documentation
- Diffusion Template - Image generation guide
- TEI Template - Text embeddings guide
- Custom Templates Guide - Build your own templates