LLM Chat Applications

This guide shows how to build powerful chat applications using Large Language Models (LLMs) with Chutes. We'll cover both high-performance VLLM serving and flexible SGLang implementations.

Overview

Chutes provides pre-built templates for popular LLM serving frameworks:

  • VLLM: High-performance serving with OpenAI-compatible APIs
  • SGLang: Advanced serving with structured generation capabilities

Both frameworks support:

  • Multi-GPU scaling for large models
  • OpenAI-compatible endpoints
  • Streaming responses
  • Custom model configurations

Quick Start: VLLM Chat Service

Basic VLLM Setup

from chutes.chute import NodeSelector
from chutes.chute.template.vllm import build_vllm_chute

# Create a high-performance chat service
chute = build_vllm_chute(
    username="myuser",
    model_name="microsoft/DialoGPT-medium",
    revision="main",  # Required parameter
    node_selector=NodeSelector(
        gpu_count=1,
        min_vram_gb_per_gpu=24
    ),
    concurrency=4
)

Production VLLM Configuration

For production workloads with larger models:

from chutes.chute import NodeSelector
from chutes.chute.template.vllm import build_vllm_chute

# Production-ready Mistral deployment
chute = build_vllm_chute(
    username="myuser",
    model_name="chutesai/Mistral-Small-3.1-24B-Instruct-2503",
    revision="cb765b56fbc11c61ac2a82ec777e3036964b975c",  # Required parameter moved to top level
    image="chutes/vllm:0.9.2.dev0",
    readme="Mistral-Small-3.1-24B-Instruct-2503 - High-performance chat model",
    node_selector=NodeSelector(
        gpu_count=8,
        min_vram_gb_per_gpu=48,
        exclude=["l40", "a6000", "b200", "mi300x"],  # Exclude slower GPUs
    ),
    engine_args=dict(
        gpu_memory_utilization=0.97,
        max_model_len=96000,
        limit_mm_per_prompt={"image": 8},
        max_num_seqs=8,
        trust_remote_code=True,
        tokenizer_mode="mistral",
        config_format="mistral",
        load_format="mistral",
        tool_call_parser="mistral",
        enable_auto_tool_choice=True),
    concurrency=8)

Advanced: SGLang with Custom Image

For more control and advanced features, use SGLang with a custom image:

import os
from chutes.chute import NodeSelector
from chutes.chute.template.sglang import build_sglang_chute
from chutes.image import Image

# Optimize networking for multi-GPU setups
os.environ["NO_PROXY"] = "localhost,127.0.0.1"
for key in ["NCCL_P2P_DISABLE", "NCCL_IB_DISABLE", "NCCL_NET_GDR_LEVEL"]:
    if key in os.environ:
        del os.environ[key]

# Build custom SGLang image with optimizations
image = (
    Image(
        username="myuser",
        name="sglang-optimized",
        tag="0.4.9.dev1",
        readme="SGLang with performance optimizations for large models")
    .from_base("parachutes/python:3.12")
    .run_command("pip install --upgrade pip")
    .run_command("pip install --upgrade 'sglang[all]'")
    .run_command(
        "git clone https://github.com/sgl-project/sglang sglang_src && "
        "cd sglang_src && pip install -e 'python[all]'"
    )
    .run_command(
        "pip install torch torchvision torchaudio "
        "--index-url https://download.pytorch.org/whl/cu128 --upgrade"
    )
    .run_command("pip install datasets blobfile accelerate tiktoken")
    .run_command("pip install nvidia-nccl-cu12==2.27.6 --force-reinstall --no-deps")
    .with_env("SGL_ENABLE_JIT_DEEPGEMM", "1")
)

# Deploy Kimi K2 Instruct model
chute = build_sglang_chute(
    username="myuser",
    readme="Moonshot AI Kimi K2 Instruct - Advanced reasoning model",
    model_name="moonshotai/Kimi-K2-Instruct",
    image=image,
    concurrency=3,
    node_selector=NodeSelector(
        gpu_count=8,
        include=["h200"],  # Use latest H200 GPUs
    ),
    engine_args=(
        "--trust-remote-code "
        "--cuda-graph-max-bs 3 "
        "--mem-fraction-static 0.97 "
        "--context-length 65536 "
        "--revision d1e2b193ddeae7776463443e7a9aa3c3cdc51003 "
    ))

Reasoning Models: DeepSeek R1

For advanced reasoning capabilities:

from chutes.chute import NodeSelector
from chutes.chute.template.sglang import build_sglang_chute

# Deploy DeepSeek R1 reasoning model
chute = build_sglang_chute(
    username="myuser",
    readme="DeepSeek R1 - Advanced reasoning and problem-solving model",
    model_name="deepseek-ai/DeepSeek-R1",
    image="chutes/sglang:0.4.6.post5b",
    concurrency=24,
    node_selector=NodeSelector(
        gpu_count=8,
        min_vram_gb_per_gpu=140,  # Large memory requirement
        include=["h200"]),
    engine_args=(
        "--trust-remote-code "
        "--revision f7361cd9ff99396dbf6bd644ad846015e59ed4fc"
    ))

Using Your Chat Service

Deploy the Service

# Build and deploy your chat service
chutes deploy my_chat:chute

# Monitor deployment
chutes chutes get my-chat

OpenAI-Compatible API

Both VLLM and SGLang provide OpenAI-compatible endpoints:

# Chat completions endpoint
curl -X POST "https://myuser-my-chat.chutes.ai/v1/chat/completions" \
  -H "Authorization: Bearer $CHUTES_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [
      {"role": "user", "content": "Hello! How are you?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'

Streaming Responses

Enable real-time streaming for better user experience:

curl -X POST "https://myuser-my-chat.chutes.ai/v1/chat/completions" \
  -H "Authorization: Bearer $CHUTES_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [
      {"role": "user", "content": "Write a short story about AI"}
    ],
    "stream": true,
    "max_tokens": 500
  }'

Python Client Example

import openai

# Configure client to use your Chutes deployment
client = openai.OpenAI(
    base_url="https://myuser-my-chat.chutes.ai/v1",
    api_key="your-api-key"  # Or use environment variable
)

# Chat completion
response = client.chat.completions.create(
    model="microsoft/DialoGPT-medium",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    max_tokens=200,
    temperature=0.7
)

print(response.choices[0].message.content)

# Streaming chat
stream = client.chat.completions.create(
    model="microsoft/DialoGPT-medium",
    messages=[
        {"role": "user", "content": "Tell me a joke"}
    ],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

Performance Optimization

GPU Selection

Choose appropriate hardware for your model size:

from chutes.chute import NodeSelector

# For smaller models (7B-13B parameters)
node_selector = NodeSelector(
    gpu_count=1,
    min_vram_gb_per_gpu=24
)

# For medium models (30B-70B parameters)
node_selector = NodeSelector(
    gpu_count=4,
    min_vram_gb_per_gpu=80
)

# For large models (100B+ parameters)
node_selector = NodeSelector(
    gpu_count=8,
    min_vram_gb_per_gpu=140,
    include=["h200"]  # Use latest hardware
)
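
As a rough rule of thumb, FP16/BF16 weights take about 2 GB per billion parameters, plus headroom for the KV cache, activations, and framework buffers. The sketch below is a back-of-the-envelope estimate, not a Chutes utility; the 20% overhead factor and the example model sizes are assumptions, so treat the output as a starting point for gpu_count and min_vram_gb_per_gpu rather than a guarantee.

# Back-of-the-envelope VRAM estimate for dense FP16/BF16 models (illustrative only)
def estimate_min_vram_gb(params_billion: float, gpu_count: int, overhead: float = 1.2) -> float:
    """Approximate per-GPU VRAM in GB: ~2 GB per billion params, times an assumed overhead."""
    weight_gb = params_billion * 2  # weights only, 2 bytes per parameter
    return weight_gb * overhead / gpu_count

print(estimate_min_vram_gb(13, 1))   # ~31 GB         -> 40 GB+ GPU
print(estimate_min_vram_gb(70, 4))   # ~42 GB per GPU -> 48-80 GB GPUs
print(estimate_min_vram_gb(405, 8))  # ~122 GB per GPU -> H200-class hardware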

Engine Optimization

Tune engine parameters for best performance:

# VLLM optimizations
engine_args = dict(
    gpu_memory_utilization=0.97,  # Use most GPU memory
    max_model_len=32768,          # Context length
    max_num_seqs=16,              # Batch size
    trust_remote_code=True,       # Enable custom models
    enforce_eager=False,          # Use CUDA graphs
    disable_log_requests=True,    # Reduce logging overhead
)

# SGLang optimizations
engine_args = (
    "--trust-remote-code "
    "--cuda-graph-max-bs 8 "      # CUDA graph batch size
    "--mem-fraction-static 0.95 " # Memory allocation
    "--context-length 32768 "     # Context window
)

Concurrency Settings

Balance throughput and resource usage:

# High throughput setup
chute = build_vllm_chute(
    # ... other parameters
    concurrency=16,  # Handle many concurrent requests
    engine_args=dict(
        max_num_seqs=32,         # Large batch size
        gpu_memory_utilization=0.90)
)

# Low latency setup
chute = build_vllm_chute(
    # ... other parameters
    concurrency=4,   # Fewer concurrent requests
    engine_args=dict(
        max_num_seqs=8,          # Smaller batch size
        gpu_memory_utilization=0.95)
)

Monitoring and Troubleshooting

Check Service Status

# View service health
chutes chutes get my-chat

# View recent logs
chutes chutes logs my-chat

# Monitor resource usage
chutes chutes metrics my-chat

Common Issues

Out of Memory (OOM)

# Reduce memory usage
engine_args = dict(
    gpu_memory_utilization=0.85,  # Lower memory usage
    max_model_len=16384,          # Shorter context
    max_num_seqs=4,               # Smaller batch
)

Slow Response Times

# Optimize for speed
engine_args = dict(
    enforce_eager=False,          # Enable CUDA graphs
    disable_log_requests=True,    # Reduce logging
    quantization="awq",           # Use quantization
)

Connection Timeouts

# Increase capacity so requests spend less time queued
chute = build_vllm_chute(
    # ... other parameters
    concurrency=8,  # Increase concurrent capacity
    engine_args=dict(
        max_num_seqs=16,  # Larger batches
    )
)

Best Practices

1. Model Selection

  • For general chat: Mistral, Llama, or Qwen models
  • For reasoning: DeepSeek R1 and similar reasoning-tuned models
  • For coding: CodeLlama, DeepSeek Coder
  • For multilingual: Qwen, multilingual Mistral variants

2. Resource Planning

  • Start with smaller configurations and scale up
  • Monitor GPU utilization and adjust concurrency (see the load-test sketch after this list)
  • Use appropriate GPU types for your model size
  • Consider cost vs. performance trade-offs
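
A simple way to tune concurrency is to send small bursts of parallel requests at increasing levels and compare latency. The sketch below uses the openai client against the endpoint from the earlier examples; the URL, model name, and CHUTES_API_KEY environment variable are placeholders carried over from this guide, not fixed values.

import os
import time
from concurrent.futures import ThreadPoolExecutor

import openai

client = openai.OpenAI(
    base_url="https://myuser-my-chat.chutes.ai/v1",
    api_key=os.environ["CHUTES_API_KEY"],  # placeholder env var name
)

def one_request() -> float:
    """Send one short chat completion and return its latency in seconds."""
    start = time.perf_counter()
    client.chat.completions.create(
        model="microsoft/DialoGPT-medium",
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
        max_tokens=32,
    )
    return time.perf_counter() - start

# Try a few concurrency levels and compare average latency
for level in (1, 4, 8, 16):
    with ThreadPoolExecutor(max_workers=level) as pool:
        latencies = list(pool.map(lambda _: one_request(), range(level)))
    print(f"concurrency={level}: avg latency {sum(latencies) / len(latencies):.2f}s")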

3. Development Workflow

# 1. Test locally with small model
chutes deploy test_chat:chute --wait

# 2. Validate API endpoints
curl https://myuser-test-chat.chutes.ai/v1/models

# 3. Load test with production model
chutes deploy prod_chat:chute --wait

# 4. Monitor and optimize
chutes chutes metrics prod-chat

4. Security Considerations

  • Use API keys for authentication, kept in environment variables (see the sketch below)
  • Implement rate limiting if needed
  • Monitor usage and costs
  • Keep model revisions pinned for reproducibility
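
As a minimal illustration of the first two points, the sketch below reads the API key from an environment variable and adds a small client-side throttle. The endpoint, model, and CHUTES_API_KEY name are placeholders carried over from earlier examples, and a production deployment would still want server-side rate limiting.

import os
import threading
import time

import openai

client = openai.OpenAI(
    base_url="https://myuser-my-chat.chutes.ai/v1",
    api_key=os.environ["CHUTES_API_KEY"],  # never hard-code the key
)

class SimpleRateLimiter:
    """Allow at most `max_calls` requests per `period` seconds (client side)."""

    def __init__(self, max_calls: int, period: float):
        self.max_calls = max_calls
        self.period = period
        self.calls: list[float] = []
        self.lock = threading.Lock()

    def acquire(self) -> None:
        with self.lock:
            now = time.monotonic()
            self.calls = [t for t in self.calls if now - t < self.period]
            if len(self.calls) >= self.max_calls:
                # Sleep until the oldest call falls outside the window
                time.sleep(self.period - (now - self.calls[0]))
            self.calls.append(time.monotonic())

limiter = SimpleRateLimiter(max_calls=10, period=60.0)

def ask(prompt: str) -> str:
    limiter.acquire()
    response = client.chat.completions.create(
        model="microsoft/DialoGPT-medium",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
    )
    return response.choices[0].message.content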

Next Steps

  • Advanced Features: Explore function calling and tool use
  • Custom Templates: Build specialized chat applications
  • Integration: Connect with web frontends and mobile apps
  • Scaling: Implement load balancing across multiple deployments

For more examples, see: