LLM Chat Applications

This guide shows how to build powerful chat applications using Large Language Models (LLMs) with Chutes. We'll cover both high-performance VLLM serving and flexible SGLang implementations.

Overview

Chutes provides pre-built templates for popular LLM serving frameworks:

  • VLLM: High-performance serving with OpenAI-compatible APIs
  • SGLang: Advanced serving with structured generation capabilities

Both frameworks support:

  • Multi-GPU scaling for large models
  • OpenAI-compatible endpoints
  • Streaming responses
  • Custom model configurations

Quick Start: VLLM Chat Service

Basic VLLM Setup

from chutes.chute import NodeSelector
from chutes.chute.template.vllm import build_vllm_chute

# Create a high-performance chat service
chute = build_vllm_chute(
    username="myuser",
    model_name="microsoft/DialoGPT-medium",
    revision="main",  # Required parameter
    node_selector=NodeSelector(
        gpu_count=1,
        min_vram_gb_per_gpu=24
    ),
    concurrency=4
)

Production VLLM Configuration

For production workloads with larger models:

from chutes.chute import NodeSelector
from chutes.chute.template.vllm import build_vllm_chute

# Production-ready Mistral deployment
chute = build_vllm_chute(
    username="myuser",
    model_name="chutesai/Mistral-Small-3.1-24B-Instruct-2503",
    revision="cb765b56fbc11c61ac2a82ec777e3036964b975c",  # Required parameter moved to top level
    image="chutes/vllm:0.9.2.dev0",
    readme="Mistral-Small-3.1-24B-Instruct-2503 - High-performance chat model",
    node_selector=NodeSelector(
        gpu_count=8,
        min_vram_gb_per_gpu=48,
        exclude=["l40", "a6000", "b200", "mi300x"],  # Exclude slower GPUs
    ),
    engine_args=dict(
        gpu_memory_utilization=0.97,
        max_model_len=96000,
        limit_mm_per_prompt={"image": 8},
        max_num_seqs=8,
        trust_remote_code=True,
        tokenizer_mode="mistral",
        config_format="mistral",
        load_format="mistral",
        tool_call_parser="mistral",
        enable_auto_tool_choice=True),
    concurrency=8)

Advanced: SGLang with Custom Image

For more control and advanced features, use SGLang with a custom image:

import os
from chutes.chute import NodeSelector
from chutes.chute.template.sglang import build_sglang_chute
from chutes.image import Image

# Optimize networking for multi-GPU setups
os.environ["NO_PROXY"] = "localhost,127.0.0.1"
for key in ["NCCL_P2P_DISABLE", "NCCL_IB_DISABLE", "NCCL_NET_GDR_LEVEL"]:
    if key in os.environ:
        del os.environ[key]

# Build custom SGLang image with optimizations
image = (
    Image(
        username="myuser",
        name="sglang-optimized",
        tag="0.4.9.dev1",
        readme="SGLang with performance optimizations for large models")
    .from_base("parachutes/python:3.12")
    .run_command("pip install --upgrade pip")
    .run_command("pip install --upgrade 'sglang[all]'")
    .run_command(
        "git clone https://github.com/sgl-project/sglang sglang_src && "
        "cd sglang_src && pip install -e 'python[all]'"
    )
    .run_command(
        "pip install torch torchvision torchaudio "
        "--index-url https://download.pytorch.org/whl/cu128 --upgrade"
    )
    .run_command("pip install datasets blobfile accelerate tiktoken")
    .run_command("pip install nvidia-nccl-cu12==2.27.6 --force-reinstall --no-deps")
    .with_env("SGL_ENABLE_JIT_DEEPGEMM", "1")
)

# Deploy Kimi K2 Instruct model
chute = build_sglang_chute(
    username="myuser",
    readme="Moonshot AI Kimi K2 Instruct - Advanced reasoning model",
    model_name="moonshotai/Kimi-K2-Instruct",
    image=image,
    concurrency=3,
    node_selector=NodeSelector(
        gpu_count=8,
        include=["h200"],  # Use latest H200 GPUs
    ),
    engine_args=(
        "--trust-remote-code "
        "--cuda-graph-max-bs 3 "
        "--mem-fraction-static 0.97 "
        "--context-length 65536 "
        "--revision d1e2b193ddeae7776463443e7a9aa3c3cdc51003 "
    ))

Reasoning Models: DeepSeek R1

For advanced reasoning capabilities:

from chutes.chute import NodeSelector
from chutes.chute.template.sglang import build_sglang_chute

# Deploy DeepSeek R1 reasoning model
chute = build_sglang_chute(
    username="myuser",
    readme="DeepSeek R1 - Advanced reasoning and problem-solving model",
    model_name="deepseek-ai/DeepSeek-R1",
    image="chutes/sglang:0.4.6.post5b",
    concurrency=24,
    node_selector=NodeSelector(
        gpu_count=8,
        min_vram_gb_per_gpu=140,  # Large memory requirement
        include=["h200"]),
    engine_args=(
        "--trust-remote-code "
        "--revision f7361cd9ff99396dbf6bd644ad846015e59ed4fc"
    ))

Using Your Chat Service

Deploy the Service

# Build and deploy your chat service
chutes deploy my_chat:chute

# Monitor deployment
chutes chutes get my-chat

OpenAI-Compatible API

Both VLLM and SGLang provide OpenAI-compatible endpoints:

# Chat completions endpoint
curl -X POST "https://myuser-my-chat.chutes.ai/v1/chat/completions" \
  -H "Authorization: Bearer $CHUTES_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [
      {"role": "user", "content": "Hello! How are you?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'

Streaming Responses

Enable real-time streaming for better user experience:

curl -X POST "https://myuser-my-chat.chutes.ai/v1/chat/completions" \
  -H "Authorization: Bearer $CHUTES_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [
      {"role": "user", "content": "Write a short story about AI"}
    ],
    "stream": true,
    "max_tokens": 500
  }'

Python Client Example

import openai

# Configure client to use your Chutes deployment
client = openai.OpenAI(
    base_url="https://myuser-my-chat.chutes.ai/v1",
    api_key="your-api-key"  # Or use environment variable
)

# Chat completion
response = client.chat.completions.create(
    model="microsoft/DialoGPT-medium",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    max_tokens=200,
    temperature=0.7
)

print(response.choices[0].message.content)

# Streaming chat
stream = client.chat.completions.create(
    model="microsoft/DialoGPT-medium",
    messages=[
        {"role": "user", "content": "Tell me a joke"}
    ],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

Performance Optimization

GPU Selection

Choose appropriate hardware for your model size:

from chutes.chute import NodeSelector

# For smaller models (7B-13B parameters)
node_selector = NodeSelector(
    gpu_count=1,
    min_vram_gb_per_gpu=24
)

# For medium models (30B-70B parameters)
node_selector = NodeSelector(
    gpu_count=4,
    min_vram_gb_per_gpu=80
)

# For large models (100B+ parameters)
node_selector = NodeSelector(
    gpu_count=8,
    min_vram_gb_per_gpu=140,
    include=["h200"]  # Use latest hardware
)
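
As a rough rule of thumb, FP16/BF16 weights take about 2 GB per billion parameters, plus headroom for the KV cache, activations, and framework buffers. The sketch below is a back-of-the-envelope estimate, not a Chutes utility; the 20% overhead factor and the example model sizes are assumptions, so treat the output as a starting point for gpu_count and min_vram_gb_per_gpu rather than a guarantee.

# Back-of-the-envelope VRAM estimate for dense FP16/BF16 models (illustrative only)
def estimate_min_vram_gb(params_billion: float, gpu_count: int, overhead: float = 1.2) -> float:
    """Approximate per-GPU VRAM in GB: ~2 GB per billion params, times an assumed overhead."""
    weight_gb = params_billion * 2  # weights only, 2 bytes per parameter
    return weight_gb * overhead / gpu_count

print(estimate_min_vram_gb(13, 1))   # ~31 GB         -> 40 GB+ GPU
print(estimate_min_vram_gb(70, 4))   # ~42 GB per GPU -> 48-80 GB GPUs
print(estimate_min_vram_gb(405, 8))  # ~122 GB per GPU -> H200-class hardware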

Engine Optimization

Tune engine parameters for best performance:

# VLLM optimizations
engine_args = dict(
    gpu_memory_utilization=0.97,  # Use most GPU memory
    max_model_len=32768,          # Context length
    max_num_seqs=16,              # Batch size
    trust_remote_code=True,       # Enable custom models
    enforce_eager=False,          # Use CUDA graphs
    disable_log_requests=True,    # Reduce logging overhead
)

# SGLang optimizations
engine_args = (
    "--trust-remote-code "
    "--cuda-graph-max-bs 8 "      # CUDA graph batch size
    "--mem-fraction-static 0.95 " # Memory allocation
    "--context-length 32768 "     # Context window
)

Concurrency Settings

Balance throughput and resource usage:

# High throughput setup
chute = build_vllm_chute(
    # ... other parameters
    concurrency=16,  # Handle many concurrent requests
    engine_args=dict(
        max_num_seqs=32,         # Large batch size
        gpu_memory_utilization=0.90)
)

# Low latency setup
chute = build_vllm_chute(
    # ... other parameters
    concurrency=4,   # Fewer concurrent requests
    engine_args=dict(
        max_num_seqs=8,          # Smaller batch size
        gpu_memory_utilization=0.95)
)

Monitoring and Troubleshooting

Check Service Status

# View service health
chutes chutes get my-chat

# View recent logs
chutes chutes logs my-chat

# Monitor resource usage
chutes chutes metrics my-chat

Common Issues

Out of Memory (OOM)

# Reduce memory usage
engine_args = dict(
    gpu_memory_utilization=0.85,  # Lower memory usage
    max_model_len=16384,          # Shorter context
    max_num_seqs=4,               # Smaller batch
)

Slow Response Times

# Optimize for speed
engine_args = dict(
    enforce_eager=False,          # Enable CUDA graphs
    disable_log_requests=True,    # Reduce logging
    quantization="awq",           # Use quantization
)

Connection Timeouts

# Increase capacity so requests spend less time queued
chute = build_vllm_chute(
    # ... other parameters
    concurrency=8,  # Increase concurrent capacity
    engine_args=dict(
        max_num_seqs=16,  # Larger batches
    )
)

Best Practices

1. Model Selection

  • For general chat: Mistral, Llama, or Qwen models
  • For reasoning: DeepSeek R1 and similar reasoning-tuned models
  • For coding: CodeLlama, DeepSeek Coder
  • For multilingual: Qwen, multilingual Mistral variants

2. Resource Planning

  • Start with smaller configurations and scale up
  • Monitor GPU utilization and adjust concurrency (see the load-test sketch after this list)
  • Use appropriate GPU types for your model size
  • Consider cost vs. performance trade-offs
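
A simple way to tune concurrency is to send small bursts of parallel requests at increasing levels and compare latency. The sketch below uses the openai client against the endpoint from the earlier examples; the URL, model name, and CHUTES_API_KEY environment variable are placeholders carried over from this guide, not fixed values.

import os
import time
from concurrent.futures import ThreadPoolExecutor

import openai

client = openai.OpenAI(
    base_url="https://myuser-my-chat.chutes.ai/v1",
    api_key=os.environ["CHUTES_API_KEY"],  # placeholder env var name
)

def one_request() -> float:
    """Send one short chat completion and return its latency in seconds."""
    start = time.perf_counter()
    client.chat.completions.create(
        model="microsoft/DialoGPT-medium",
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
        max_tokens=32,
    )
    return time.perf_counter() - start

# Try a few concurrency levels and compare average latency
for level in (1, 4, 8, 16):
    with ThreadPoolExecutor(max_workers=level) as pool:
        latencies = list(pool.map(lambda _: one_request(), range(level)))
    print(f"concurrency={level}: avg latency {sum(latencies) / len(latencies):.2f}s")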

3. Development Workflow

# 1. Test locally with small model
chutes deploy test_chat:chute --wait

# 2. Validate API endpoints
curl https://myuser-test-chat.chutes.ai/v1/models

# 3. Load test with production model
chutes deploy prod_chat:chute --wait

# 4. Monitor and optimize
chutes chutes metrics prod-chat

4. Security Considerations

  • Use API keys for authentication, kept in environment variables (see the sketch below)
  • Implement rate limiting if needed
  • Monitor usage and costs
  • Keep model revisions pinned for reproducibility
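
As a minimal illustration of the first two points, the sketch below reads the API key from an environment variable and adds a small client-side throttle. The endpoint, model, and CHUTES_API_KEY name are placeholders carried over from earlier examples, and a production deployment would still want server-side rate limiting.

import os
import threading
import time

import openai

client = openai.OpenAI(
    base_url="https://myuser-my-chat.chutes.ai/v1",
    api_key=os.environ["CHUTES_API_KEY"],  # never hard-code the key
)

class SimpleRateLimiter:
    """Allow at most `max_calls` requests per `period` seconds (client side)."""

    def __init__(self, max_calls: int, period: float):
        self.max_calls = max_calls
        self.period = period
        self.calls: list[float] = []
        self.lock = threading.Lock()

    def acquire(self) -> None:
        with self.lock:
            now = time.monotonic()
            self.calls = [t for t in self.calls if now - t < self.period]
            if len(self.calls) >= self.max_calls:
                # Sleep until the oldest call falls outside the window
                time.sleep(self.period - (now - self.calls[0]))
            self.calls.append(time.monotonic())

limiter = SimpleRateLimiter(max_calls=10, period=60.0)

def ask(prompt: str) -> str:
    limiter.acquire()
    response = client.chat.completions.create(
        model="microsoft/DialoGPT-medium",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
    )
    return response.choices[0].message.content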

Next Steps

  • Advanced Features: Explore function calling and tool use
  • Custom Templates: Build specialized chat applications
  • Integration: Connect with web frontends and mobile apps
  • Scaling: Implement load balancing across multiple deployments

For more examples, see: