Templates API Reference

Chutes provides pre-built templates for common AI/ML frameworks and use cases. Each template is a factory function that returns a pre-configured Chute with settings optimized for a specific framework.

Overview

Templates provide:

  • Quick Setup: Instant deployment of popular AI models
  • Best Practices: Pre-configured optimization settings
  • Standard APIs: OpenAI-compatible endpoints for LLMs
  • Customization: Override any parameter as needed

Available Templates

Template     Use Case                   Import
vLLM         LLM serving with vLLM      from chutes.chute.template import build_vllm_chute
SGLang       LLM serving with SGLang    from chutes.chute.template.sglang import build_sglang_chute
Diffusion    Image generation           from chutes.chute.template.diffusion import build_diffusion_chute
Embedding    Text embeddings            from chutes.chute.template.embedding import build_embedding_chute

vLLM Template

Create a chute optimized for vLLM (high-performance LLM serving) with OpenAI-compatible API endpoints.

Import:

from chutes.chute.template import build_vllm_chute

Signature:

def build_vllm_chute(
    username: str,
    model_name: str,
    node_selector: NodeSelector,
    image: str | Image = VLLM,
    tagline: str = "",
    readme: str = "",
    concurrency: int = 64,
    engine_args: Dict[str, Any] = {},
    revision: str = None,
    max_instances: int = 1,
    scaling_threshold: float = 0.75,
    shutdown_after_seconds: int = 300,
    allow_external_egress: bool = False
) -> Chute

Parameters:

  • username - Your Chutes username (required)
  • model_name - HuggingFace model identifier (required)
  • node_selector - Hardware requirements (required)
  • image - Custom vLLM image (defaults to built-in)
  • tagline - Brief description
  • readme - Detailed documentation
  • concurrency - Max concurrent requests (default: 64)
  • engine_args - vLLM engine configuration
  • revision - Model revision
  • max_instances - Max scaling instances (default: 1)
  • scaling_threshold - Scaling trigger threshold (default: 0.75)
  • shutdown_after_seconds - Idle shutdown time (default: 300)
  • allow_external_egress - Allow external network access (default: False)

Basic Example:

from chutes.chute.template import build_vllm_chute
from chutes.chute import NodeSelector

chute = build_vllm_chute(
    username="myuser",
    model_name="mistralai/Mistral-7B-Instruct-v0.3",
    node_selector=NodeSelector(
        gpu_count=1,
        min_vram_gb_per_gpu=24
    )
)

Advanced Example:

from chutes.chute.template import build_vllm_chute
from chutes.chute import NodeSelector

chute = build_vllm_chute(
    username="myuser",
    model_name="meta-llama/Llama-2-70b-chat-hf",
    node_selector=NodeSelector(
        gpu_count=8,
        min_vram_gb_per_gpu=48,
        exclude=["l40", "a6000"]
    ),
    engine_args={
        "gpu_memory_utilization": 0.97,
        "max_model_len": 4096,
        "max_num_seqs": 8,
        "trust_remote_code": True,
        "tensor_parallel_size": 8
    },
    concurrency=8,
    max_instances=3
)

Common vLLM Engine Arguments:

engine_args = {
    # Memory management
    "gpu_memory_utilization": 0.95,   # Use 95% of GPU memory
    "swap_space": 4,                   # GB of CPU swap space

    # Model configuration
    "max_model_len": 4096,             # Maximum sequence length
    "max_num_seqs": 256,               # Maximum concurrent sequences
    "trust_remote_code": False,        # Allow custom model code

    # Performance optimization
    "enable_prefix_caching": True,     # Cache prefixes for efficiency
    "use_v2_block_manager": True,      # Improved block manager

    # Quantization
    "quantization": None,              # e.g., "awq", "gptq", "fp8"
    "dtype": "auto",                   # Model data type

    # Distributed inference
    "tensor_parallel_size": 1,         # GPUs for tensor parallelism

    # Tokenizer
    "tokenizer_mode": "auto",          # Tokenizer mode
    
    # Mistral-specific
    "config_format": "mistral",        # For Mistral models
    "load_format": "mistral",
    "tool_call_parser": "mistral",
    "enable_auto_tool_choice": True
}
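
Several of these options are typically combined for a quantized checkpoint. The following sketch is illustrative only; the model name and argument values are assumptions, not recommendations:

from chutes.chute.template import build_vllm_chute
from chutes.chute import NodeSelector

# Illustrative sketch: an AWQ-quantized 7B model on a single 24 GB GPU.
# Model name and engine argument values are assumptions, not recommendations.
chute = build_vllm_chute(
    username="myuser",
    model_name="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=24),
    engine_args={
        "quantization": "awq",            # match the checkpoint's quantization format
        "dtype": "auto",
        "max_model_len": 8192,
        "gpu_memory_utilization": 0.9,
        "enable_prefix_caching": True
    }
)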

Provided Endpoints:

The vLLM template provides OpenAI-compatible endpoints:

  • POST /v1/chat/completions - Chat completions
  • POST /v1/completions - Text completions
  • POST /tokenize - Tokenization
  • POST /detokenize - Detokenization
  • GET /v1/models - List available models
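
Because these endpoints follow the OpenAI API, any OpenAI-compatible client can call a deployed chute. A minimal sketch using the official openai Python client (the base URL and API key below are placeholders; substitute your actual deployment's values):

from openai import OpenAI

# Placeholder values: replace base_url and api_key with your deployment's.
client = OpenAI(
    base_url="https://myuser-mistral-7b.chutes.ai/v1",  # assumed URL format
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)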

SGLang Template

Create a chute optimized for SGLang (structured generation language serving).

Import:

from chutes.chute.template.sglang import build_sglang_chute

Signature:

def build_sglang_chute(
    username: str,
    model_name: str,
    node_selector: NodeSelector,
    image: str | Image = SGLANG,
    tagline: str = "",
    readme: str = "",
    concurrency: int = 64,
    engine_args: Dict[str, Any] = {},
    revision: str = None,
    max_instances: int = 1,
    scaling_threshold: float = 0.75,
    shutdown_after_seconds: int = 300,
    allow_external_egress: bool = False
) -> Chute

Example:

from chutes.chute.template.sglang import build_sglang_chute
from chutes.chute import NodeSelector

chute = build_sglang_chute(
    username="myuser",
    model_name="deepseek-ai/DeepSeek-R1",
    node_selector=NodeSelector(
        gpu_count=8,
        include=["h200"],
        min_vram_gb_per_gpu=141
    ),
    engine_args={
        "host": "0.0.0.0",
        "port": 30000,
        "tp_size": 8,
        "trust_remote_code": True,
        "context_length": 65536,
        "mem_fraction_static": 0.8
    },
    concurrency=4
)

Common SGLang Engine Arguments:

engine_args = {
    # Server configuration
    "host": "0.0.0.0",
    "port": 30000,

    # Model configuration
    "context_length": 4096,
    "trust_remote_code": True,

    # Performance
    "tp_size": 1,                    # Tensor parallelism
    "mem_fraction_static": 0.9,      # Static memory fraction
    "chunked_prefill_size": 512,

    # Features
    "enable_flashinfer": True
}

Diffusion Template

Create a chute optimized for diffusion model inference (image generation).

Import:

from chutes.chute.template.diffusion import build_diffusion_chute

Example:

from chutes.chute.template.diffusion import build_diffusion_chute
from chutes.chute import NodeSelector

chute = build_diffusion_chute(
    username="myuser",
    model_name="black-forest-labs/FLUX.1-dev",
    node_selector=NodeSelector(
        gpu_count=1,
        min_vram_gb_per_gpu=48,
        include=["l40", "a100"]
    ),
    engine_args={
        "torch_dtype": "bfloat16",
        "guidance_scale": 3.5,
        "num_inference_steps": 28
    },
    concurrency=1  # Image generation is typically 1 concurrent request
)

Generation Input Schema:

from typing import Optional

from pydantic import BaseModel, Field

class GenerationInput(BaseModel):
    prompt: str
    negative_prompt: str = ""
    height: int = Field(default=1024, ge=128, le=2048)
    width: int = Field(default=1024, ge=128, le=2048)
    num_inference_steps: int = Field(default=25, ge=1, le=50)
    guidance_scale: float = Field(default=7.5, ge=1.0, le=20.0)
    seed: Optional[int] = Field(default=None, ge=0, le=2**32 - 1)

Provided Endpoints:

  • Generate image from prompt
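
As a usage sketch, the generation endpoint can be called with a JSON payload matching the schema above. The URL, endpoint path, and response handling below are assumptions; check your deployed chute for the actual path and response format:

import requests

# Assumed URL and endpoint path; verify against your deployed chute.
url = "https://myuser-flux-1-dev.chutes.ai/generate"
payload = {
    "prompt": "a watercolor painting of a lighthouse at dusk",
    "height": 1024,
    "width": 1024,
    "num_inference_steps": 28,
    "guidance_scale": 3.5,
    "seed": 42,
}

resp = requests.post(url, json=payload, headers={"Authorization": "Bearer YOUR_API_KEY"})
resp.raise_for_status()
with open("output.png", "wb") as f:
    f.write(resp.content)  # assumes the endpoint returns raw image bytes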

Embedding Template

Create a chute optimized for text embeddings using vLLM.

Import:

from chutes.chute.template.embedding import build_embedding_chute

Signature:

def build_embedding_chute(
    username: str,
    model_name: str,
    node_selector: NodeSelector,
    image: str | Image = VLLM,
    tagline: str = "",
    readme: str = "",
    concurrency: int = 32,
    engine_args: Dict[str, Any] = {},
    revision: str = None,
    max_instances: int = 1,
    scaling_threshold: float = 0.75,
    shutdown_after_seconds: int = 300,
    pooling_type: str = "auto",
    max_embed_len: int = 3072000,
    enable_chunked_processing: bool = True,
    allow_external_egress: bool = False
) -> Chute

Example:

from chutes.chute.template.embedding import build_embedding_chute
from chutes.chute import NodeSelector

chute = build_embedding_chute(
    username="myuser",
    model_name="BAAI/bge-large-en-v1.5",
    node_selector=NodeSelector(
        gpu_count=1,
        min_vram_gb_per_gpu=16
    ),
    pooling_type="auto",  # Auto-detect optimal pooling
    concurrency=32
)

Pooling Types:

  • auto - Auto-detect based on model name
  • mean - Mean pooling (E5, Jina models)
  • cls - CLS token pooling (BGE models)
  • last - Last token pooling (GTE, Qwen models)

Provided Endpoints:

  • POST /v1/embeddings - OpenAI-compatible embeddings endpoint
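
Since the endpoint is OpenAI-compatible, the openai client works here as well. A minimal sketch (base URL and API key are placeholders):

from openai import OpenAI

# Placeholder values: replace base_url and api_key with your deployment's.
client = OpenAI(
    base_url="https://myuser-bge-large.chutes.ai/v1",  # assumed URL format
    api_key="YOUR_API_KEY",
)

result = client.embeddings.create(
    model="BAAI/bge-large-en-v1.5",
    input=["Chutes makes GPU deployment simple.", "Embeddings map text to vectors."],
)
print(len(result.data), len(result.data[0].embedding))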

Extending Templates

Templates can be extended with custom functionality:

from chutes.chute.template import build_vllm_chute
from chutes.chute import NodeSelector

# Create base chute from template
chute = build_vllm_chute(
    username="myuser",
    model_name="mistralai/Mistral-7B-Instruct-v0.3",
    node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=24)
)

# Add custom endpoint
@chute.cord(public_api_path="/summarize", public_api_method="POST")
async def summarize(self, text: str) -> dict:
    """Summarize text using the loaded model."""
    prompt = f"Summarize the following text:\n\n{text}\n\nSummary:"

    # Use the template's built-in generation
    result = await self.generate(prompt=prompt, max_tokens=200)

    return {"summary": result}

# Add custom startup logic
@chute.on_startup(priority=90)  # Run after template initialization
async def custom_setup(self):
    """Custom initialization after model loads."""
    print("Custom setup complete!")

Model-Specific Configurations

Mistral Models

from chutes.chute.template import build_vllm_chute
from chutes.chute import NodeSelector

chute = build_vllm_chute(
    username="myuser",
    model_name="mistralai/Mistral-7B-Instruct-v0.3",
    node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=24),
    engine_args={
        "tokenizer_mode": "mistral",
        "config_format": "mistral",
        "load_format": "mistral",
        "tool_call_parser": "mistral",
        "enable_auto_tool_choice": True
    }
)

Llama Models

from chutes.chute.template import build_vllm_chute
from chutes.chute import NodeSelector

chute = build_vllm_chute(
    username="myuser",
    model_name="meta-llama/Llama-2-70b-chat-hf",
    node_selector=NodeSelector(
        gpu_count=4,
        min_vram_gb_per_gpu=48
    ),
    engine_args={
        "max_model_len": 4096,
        "gpu_memory_utilization": 0.95,
        "tensor_parallel_size": 4
    }
)

DeepSeek Models

from chutes.chute.template.sglang import build_sglang_chute
from chutes.chute import NodeSelector

chute = build_sglang_chute(
    username="myuser",
    model_name="deepseek-ai/DeepSeek-R1",
    node_selector=NodeSelector(
        gpu_count=8,
        min_vram_gb_per_gpu=141,
        include=["h200"]
    ),
    engine_args={
        "tp_size": 8,
        "trust_remote_code": True,
        "context_length": 65536
    }
)

FLUX Image Generation

from chutes.chute.template.diffusion import build_diffusion_chute
from chutes.chute import NodeSelector

chute = build_diffusion_chute(
    username="myuser",
    model_name="black-forest-labs/FLUX.1-dev",
    node_selector=NodeSelector(
        gpu_count=1,
        min_vram_gb_per_gpu=48
    ),
    engine_args={
        "torch_dtype": "bfloat16",
        "guidance_scale": 3.5,
        "num_inference_steps": 28
    }
)

Best Practices

1. Choose the Right Template

# For OpenAI-compatible LLM API
vllm_chute = build_vllm_chute(...)

# For structured generation and reasoning
sglang_chute = build_sglang_chute(...)

# For text embeddings
embedding_chute = build_embedding_chute(...)

# For image generation
diffusion_chute = build_diffusion_chute(...)

2. Match Hardware to Model

# 7B model - single GPU
node_selector = NodeSelector(gpu_count=1, min_vram_gb_per_gpu=24)

# 70B model - multiple GPUs with tensor parallelism
node_selector = NodeSelector(gpu_count=4, min_vram_gb_per_gpu=48)
engine_args = {"tensor_parallel_size": 4}

3. Set Appropriate Concurrency

# vLLM/SGLang with continuous batching - high concurrency
chute = build_vllm_chute(..., concurrency=64)

# Image generation - low concurrency
chute = build_diffusion_chute(..., concurrency=1)

# Embeddings - medium-high concurrency
chute = build_embedding_chute(..., concurrency=32)

4. Use Auto-Scaling for Production

chute = build_vllm_chute(
    ...,
    max_instances=10,
    scaling_threshold=0.75,
    shutdown_after_seconds=300
)

See Also