# Chutes Full Documentation Reference

This file contains the complete, unabridged documentation for Chutes. For a summary and API reference, see https://chutes.ai/llms.txt

---

## SOURCE: https://chutes.ai/docs/getting-started/installation

# Installation & Setup

This guide will walk you through installing the Chutes SDK and setting up your development environment.

## Prerequisites

Before installing Chutes, ensure you have:

- **Python 3.10+** (Python 3.11 or 3.12 recommended)
- **pip** package manager
- A **Bittensor wallet** (required for authentication)

## Installing the Chutes SDK

### Option 1: Install from PyPI (Recommended)

```bash
pip install chutes
```

### Option 2: Install from Source

If you want the latest development features:

```bash
git clone https://github.com/chutesai/chutes.git
cd chutes
pip install -e .
```

### Verify Installation

Check that Chutes was installed correctly:

```bash
chutes --help
```

You should see the Chutes CLI help menu.

## Setting Up Authentication

Chutes uses **Bittensor** for secure authentication. You'll need a Bittensor wallet with a hotkey.

### Creating a Bittensor Wallet

If you don't already have a Bittensor wallet:

#### Option 1: Automatic Setup (Recommended)

Visit [chutes.ai](https://chutes.ai) and create an account. The platform will automatically create and manage your wallet for you.

#### Option 2: Manual Setup

If you prefer to manage your own wallet:

1. Install Bittensor (an older version is recommended for compatibility and ease of install):

```bash
pip install 'bittensor<8'
```

2. Create a coldkey and hotkey:

```bash
# Create a coldkey (your main wallet)
btcli wallet new_coldkey --n_words 24 --wallet.name my-chutes-wallet

# Create a hotkey (for signing transactions)
btcli wallet new_hotkey --wallet.name my-chutes-wallet --n_words 24 --wallet.hotkey my-hotkey
```

### Registering with Chutes

Once you have a Bittensor wallet, register with the Chutes platform:

```bash
chutes register
```

Follow the interactive prompts to:

1. Enter your desired username
2. Select your Bittensor wallet
3. Choose your hotkey
4. Complete the registration process

After successful registration, you'll find your configuration at `~/.chutes/config.ini`.

## Configuration

Your Chutes configuration is stored in `~/.chutes/config.ini`:

```ini
[auth]
user_id = your-user-id
username = your-username
hotkey_seed = your-hotkey-seed
hotkey_name = your-hotkey-name
hotkey_ss58address = your-hotkey-address

[api]
base_url = https://api.chutes.ai
```

### Environment Variables

You can override configuration with environment variables:

```bash
export CHUTES_CONFIG_PATH=/custom/path/to/config.ini
export CHUTES_API_URL=https://api.chutes.ai
export CHUTES_DEV_URL=http://localhost:8000  # For local development
```

## Creating API Keys

For programmatic access, create API keys:

### Full Admin Access

```bash
chutes keys create --name admin-key --admin
```

### Limited Access

```bash
# Access to specific chutes (requires action parameter)
chutes keys create --name my-app-key --chute-ids <chute-id> --action read

# Access to images only (requires action parameter)
chutes keys create --name image-key --images --action write
```

### Using API Keys

Use your API keys in HTTP requests:

```bash
curl -H "Authorization: Bearer cpk_your_api_key" \
  https://api.chutes.ai/chutes/
```

Or in Python:

```python
import aiohttp

headers = {"Authorization": "Bearer cpk_your_api_key"}
async with aiohttp.ClientSession() as session:
    async with session.get("https://api.chutes.ai/chutes/", headers=headers) as resp:
        data = await resp.json()
```

## IDE Setup

### VS Code

For the best development experience with VS Code:

1. Install the **Python extension**
2. Set up your Python interpreter to use the environment where you installed Chutes
3. Add this to your `.vscode/settings.json`:

```json
{
    "python.linting.enabled": true,
    "python.linting.pylintEnabled": true,
    "python.formatting.provider": "black",
    "python.analysis.typeCheckingMode": "basic"
}
```

### PyCharm

For PyCharm users:

1. Configure your Python interpreter
2. Add Chutes to your project dependencies
3. Enable type checking for better IntelliSense

## Troubleshooting

### Common Issues

**"Command not found: chutes"**
- Make sure your Python `Scripts` directory is in your `PATH`
- Try `python -m chutes` instead

**"Invalid hotkey" during registration**
- Ensure your Bittensor wallet is properly created
- Check that you're using the correct wallet and hotkey names

**"Permission denied" errors**
- You might need to use `sudo` on some systems
- Consider using a virtual environment

**"API connection failed"**
- Check your internet connection
- Verify the API URL in your config
- Ensure you have the latest version of Chutes

### Getting Help

If you encounter issues:

1. Check the [FAQ](../help/faq)
2. Search existing [GitHub issues](https://github.com/chutesai/chutes/issues)
3. Join our [Discord community](https://discord.gg/wHrXwWkCRz)
4. Email `support@chutes.ai`

## Next Steps

Now that you have Chutes installed and configured:

1. **[Quick Start Guide](quickstart)** - Deploy your first chute in minutes
2. **[Your First Chute](first-chute)** - Build a complete application from scratch
3. **[Core Concepts](../core-concepts/chutes)** - Understand the fundamentals

---

Ready to build something amazing? Let's move on to the [Quick Start Guide](quickstart)!

---

## SOURCE: https://chutes.ai/docs/getting-started/quickstart

# Quick Start Guide

Get your first chute deployed in under 10 minutes! This guide will walk you through creating, building, and deploying a simple AI application.

## Prerequisites

Make sure you've completed the [Installation & Setup](installation) guide first.
## Step 1: Create Your First Chute

Let's build a simple text generation chute using a pre-built template. Create a new file called `my_first_chute.py`:

```python
from chutes.chute import NodeSelector
from chutes.chute.template.vllm import build_vllm_chute

# Build a chute using the VLLM template
chute = build_vllm_chute(
    username="your-username",  # Replace with your Chutes username
    model_name="unsloth/Llama-3.2-1B-Instruct",
    node_selector=NodeSelector(
        gpu_count=1,
    ),
    concurrency=4,
    readme="""
# My First Chute

A simple conversational AI powered by Llama 3.2.

## Usage

Send a POST request to `/v1/chat/completions` with your message.
""",
)
```

That's it! You've just defined a complete AI application with:

- ✅ A pre-configured VLLM server
- ✅ Automatic model downloading
- ✅ OpenAI-compatible API endpoints
- ✅ GPU resource requirements
- ✅ Auto-scaling configuration

## Step 2: Build Your Image

Build the Docker image for your chute:

```bash
chutes build my_first_chute:chute --wait
```

This will:

- 📦 Create a Docker image with all dependencies
- 🔧 Install VLLM and required libraries
- ⬇️ Pre-download your model
- ✅ Validate the configuration

The `--wait` flag streams the build logs to your terminal so you can monitor progress.

## Step 3: Deploy Your Chute

Deploy your chute to the Chutes platform:

```bash
chutes deploy my_first_chute:chute
```

After deployment, you'll see output like:

```
✅ Chute deployed successfully!
🌐 Public API: https://your-username-my-first-chute.chutes.ai
📋 Chute ID: 12345678-1234-5678-9abc-123456789012
```

## Step 4: Test Your Chute

Your chute is now live! Test it with a simple chat completion:

### Option 1: Using curl

```bash
curl -X POST https://your-username-my-first-chute.chutes.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/Llama-3.2-1B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello! How are you today?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```

### Option 2: Using Python

```python
import asyncio
import aiohttp
import json

async def chat_with_chute():
    url = "https://your-username-my-first-chute.chutes.ai/v1/chat/completions"
    payload = {
        "model": "unsloth/Llama-3.2-1B-Instruct",
        "messages": [
            {"role": "user", "content": "Hello! How are you today?"}
        ],
        "max_tokens": 100,
        "temperature": 0.7
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=payload) as response:
            result = await response.json()
            print(json.dumps(result, indent=2))

# Run the test
asyncio.run(chat_with_chute())
```

### Option 3: Test Locally

You can also test your chute locally before deploying using the CLI:

```bash
# Run your chute locally
chutes run my_first_chute:chute --dev

# Then in another terminal, test with curl
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/Llama-3.2-1B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello! How are you today?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```

## Step 5: Monitor and Manage

### View Your Chutes

```bash
chutes chutes list
```

### Get Detailed Information

```bash
chutes chutes get my-first-chute
```

### Check Logs

Visit the [Chutes Dashboard](https://chutes.ai) to view real-time logs and metrics.

### Deleting Resources

When you're done with a chute, it's good practice to clean up your resources.

> **Note:** You must remove a chute before you can delete its image. Images tied to running chutes cannot be deleted.

```bash
# 1. Delete the chute
chutes chutes delete <chute-name>

# 2. Delete the image (after the chute is removed)
chutes images delete <image-name>
```

## What Just Happened?

Congratulations! You just:

1. 🎯 **Defined** an AI application with just a few lines of Python
2. 🏗️ **Built** a production-ready Docker image
3. 🚀 **Deployed** to GPU-accelerated infrastructure
4. 🌐 **Exposed** OpenAI-compatible API endpoints
5. 💰 **Pay-per-use** - only charged when your chute receives requests

## Next Steps

Now that you have a working chute, explore more advanced features:

### 🎨 Try Different Models

Replace `unsloth/Llama-3.2-1B-Instruct` with:

- `unsloth/Llama-3.1-8B-Instruct` (requires more VRAM)
- `deepseek-ai/DeepSeek-R1-Distill-Llama-8B`
- `Qwen/Qwen2.5-7B-Instruct`

### 🔧 Customize Hardware

Adjust your `NodeSelector`:

```python
NodeSelector(
    gpu_count=1,               # Use 1 GPU
    min_vram_gb_per_gpu=24,    # Require 24GB VRAM per GPU
    include=["a100", "h100"],  # Prefer specific GPU types
    exclude=["k80"]            # Avoid older GPUs
)
```

### 🎛️ Tune Performance

Modify engine arguments:

```python
chute = build_vllm_chute(
    # ... other parameters ...
    engine_args={
        "max_model_len": 4096,
        "gpu_memory_utilization": 0.9,
        "max_num_seqs": 32
    }
)
```

### 📚 Learn Core Concepts

- **[Understanding Chutes](../core-concepts/chutes)** - Deep dive into the Chute class
- **[Security Architecture](../core-concepts/security-architecture)** - Learn about our TEE and hardware attestation security
- **[Cords (API Endpoints)](../core-concepts/cords)** - Custom API endpoints
- **[Custom Images](../core-concepts/images)** - Build your own Docker images

### 🏗️ Build Custom Applications

- **[Your First Custom Chute](first-chute)** - Build from scratch
- **[Custom Image Building](../guides/custom-images)** - Advanced Docker setups
- **[Input/Output Schemas](../guides/schemas)** - Type-safe APIs

### 🔗 Integrations

- **[Vercel AI SDK](../integrations/vercel-ai-sdk)** - Use Chutes with the Vercel AI SDK for streaming, tool calling, and more

## Common Questions

**Q: How much does this cost?**
A: You only pay for GPU time when your chute is processing requests. Idle time is free!

**Q: Can I use my own models?**
A: Yes! Upload models to HuggingFace or use the custom image building features.

**Q: What about scaling?**
A: Chutes automatically scales based on demand. Configure `concurrency` to control how many requests each instance handles.
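The JSON body returned by the OpenAI-compatible `/v1/chat/completions` endpoint can be unpacked in a few lines; a minimal sketch using a canned response (the field names follow the standard OpenAI chat-completion schema, and the values here are purely illustrative):

```python
import json

# A canned response in the OpenAI-compatible shape returned by
# /v1/chat/completions (values are illustrative, not real output).
SAMPLE_RESPONSE = json.dumps({
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "model": "unsloth/Llama-3.2-1B-Instruct",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello! I'm doing well."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 8, "total_tokens": 20},
})

def extract_reply(raw: str) -> str:
    """Pull the assistant's message out of a chat-completion response body."""
    data = json.loads(raw)
    return data["choices"][0]["message"]["content"]

print(extract_reply(SAMPLE_RESPONSE))  # Hello! I'm doing well.
```

The same `extract_reply` helper works on the `result` dict from the aiohttp example above once you serialize or index it the same way.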
**Q: How do I debug issues?**
A: Check the logs in the [Chutes Dashboard](https://chutes.ai) or use the CLI: `chutes chutes get my-chute`

## Troubleshooting

**Build failed?**
- Check that your model name is correct
- Try with a smaller model first

**Deployment failed?**
- Verify your image built successfully
- Check that your username and chute name are valid
- Ensure you have proper permissions

**Can't access your chute?**
- Wait a few minutes for DNS propagation
- Check the exact URL from `chutes chutes get`
- Verify the chute is in "running" status

## Get Help

- 📖 **Detailed Guides**: Continue with [Your First Custom Chute](first-chute)
- 💬 **Community**: [Join our Discord](https://discord.gg/wHrXwWkCRz)
- 🐛 **Issues**: [GitHub Issues](https://github.com/chutesai/chutes/issues)
- 📧 **Support**: `support@chutes.ai`

---

Ready to build something more advanced? Check out [Your First Custom Chute](first-chute) to learn how to build completely custom applications!

---

## SOURCE: https://chutes.ai/docs/getting-started/authentication

# Authentication & Account Setup

This guide covers setting up authentication for the Chutes platform using Bittensor wallets and managing API keys.

## Overview

Chutes uses **Bittensor** for secure, decentralized authentication. This provides:

- 🔐 **Cryptographic Security**: Wallet-based authentication
- 🌐 **Decentralized Identity**: No central password database
- 🔑 **API Key Management**: Granular access control
- 💰 **Integrated Billing**: Seamless payment integration

## Bittensor Wallet Setup

### Option 1: Automatic Setup (Recommended)

The easiest way to get started:

1. **Visit [chutes.ai](https://chutes.ai)**
2. **Click "Create Account"**
3. **Follow the guided setup**

The platform will automatically:
- Create your Bittensor wallet
- Generate secure keys
- Set up your account
- Provide you with wallet credentials

### Option 2: Manual Wallet Creation

If you prefer to manage your own wallet:

#### Install Bittensor

```bash
# Install an older version (required for easy wallet creation)
pip install 'bittensor<8'
```

> **Note**: We use an older Bittensor version because newer versions require Rust compilation, which can be complex to set up.

#### Create Wallet and Hotkey

```bash
# Create a coldkey (your main wallet)
btcli wallet new_coldkey \
  --n_words 24 \
  --wallet.name chutes-wallet

# Create a hotkey (for signing transactions)
btcli wallet new_hotkey \
  --wallet.name chutes-wallet \
  --wallet.hotkey default \
  --n_words 24
```

#### Secure Your Keys

```bash
# Your wallets are stored in:
ls ~/.bittensor/wallets/

# Back up your coldkey and hotkey files
# Store them securely - they cannot be recovered if lost!
```

## Account Registration

Once you have a Bittensor wallet, register with Chutes:

```bash
chutes register
```

### Interactive Registration Process

The registration wizard will ask for:

1. **Username**: Your desired Chutes username
2. **Wallet Selection**: Choose from available wallets
3. **Hotkey Selection**: Choose from available hotkeys
4. **Confirmation**: Verify your selections

### Example Registration Session

```bash
$ chutes register
Enter desired username: myawesomeai
Found wallets: ['chutes-wallet', 'other-wallet']
Select wallet (chutes-wallet): chutes-wallet
Found hotkeys: ['default', 'backup']
Select hotkey (default): default
✅ Registration successful!
```

## Configuration File

After registration, you'll find your config at `~/.chutes/config.ini`:

```ini
[auth]
user_id = usr_1234567890abcdef
username = myawesomeai
hotkey_seed = your-encrypted-hotkey-seed
hotkey_name = default
hotkey_ss58address = 5GrwvaEF5zXb26Fz9rcQpDWS57CtERHpNehXCPcNoHGKutQY

[api]
base_url = https://api.chutes.ai
```

### Environment Variable Overrides

You can override configuration with environment variables:

```bash
# Custom config location
export CHUTES_CONFIG_PATH=/custom/path/config.ini

# Custom API endpoint
export CHUTES_API_URL=https://api.chutes.ai

# Development mode
export CHUTES_DEV_URL=http://localhost:8000
```

## API Key Management

For programmatic access and CI/CD, create API keys:

### Creating API Keys

#### Full Administrative Access

```bash
chutes keys create --name admin-key --admin
```

#### Scoped Access Examples

```bash
# Access to specific chutes only
chutes keys create \
  --name my-app-key \
  --chute-ids 12345678-1234-5678-9abc-123456789012 \
  --action invoke

# Read-only access to images
chutes keys create \
  --name readonly-images \
  --images \
  --action read

# Multiple chute access
chutes keys create \
  --name multi-chute-key \
  --chute-ids 12345678-1234-5678-9abc-123456789012,87654321-4321-8765-cba9-210987654321 \
  --action invoke
```

#### Advanced Scoping

```bash
# JSON-based scoping for complex permissions
chutes keys create \
  --name complex-key \
  --json-input '{
    "scopes": [
      {"object_type": "chutes", "action": "invoke"},
      {"object_type": "images", "action": "read", "object_id": "specific-image-id"}
    ]
  }'
```

### Using API Keys

#### HTTP Requests

```bash
curl -H "Authorization: Bearer cpk_your_api_key_here" \
  https://api.chutes.ai/chutes/
```

#### Python SDK

```python
import aiohttp

async def call_chutes_api():
    headers = {"Authorization": "Bearer cpk_your_api_key_here"}
    async with aiohttp.ClientSession() as session:
        async with session.get(
            "https://api.chutes.ai/chutes/",
            headers=headers
        ) as response:
            return await response.json()
```

#### Environment Variables

```bash
# Set the API key as an environment variable
export CHUTES_API_KEY=cpk_your_api_key_here

# Use in scripts
curl -H "Authorization: Bearer $CHUTES_API_KEY" \
  https://api.chutes.ai/chutes/
```

### Managing API Keys

#### List Your Keys

```bash
chutes keys list
```

#### View Key Details

```bash
chutes keys get my-app-key
```

#### Delete Keys

```bash
chutes keys delete old-key-name
```

## Developer Deposit

To create and deploy chutes, you need a refundable developer deposit:

### Check Required Deposit

```bash
curl -s https://api.chutes.ai/developer_deposit | jq .
```

### Get Your Deposit Address

```bash
curl -s https://api.chutes.ai/users/me \
  -H "Authorization: Bearer cpk_your_api_key" | jq .deposit_address
```

### Making the Deposit

1. **Get your deposit address** from the API call above
2. **Transfer TAO** to that address using your preferred wallet
3. **Wait for confirmation** (usually 1-2 blocks)
4. **Verify deposit** status in your account

### Returning Your Deposit

After at least 7 days:

```bash
curl -X POST https://api.chutes.ai/return_developer_deposit \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer cpk_your_api_key" \
  -d '{"address": "5EcZsewZSTxUaX8gwyHzkKsqT3NwLP1n2faZPyjttCeaPdYe"}'
```

## Free Developer Access

### Validator/Subnet Owner Benefits

If you own a validator or subnet on Bittensor, you can get free developer access:

```bash
chutes link
```

This will:
- Link your validator hotkey to your account
- Grant free access to Chutes features
- Bypass the developer deposit requirement

### Eligibility Requirements

- Must own an active validator on Bittensor
- Or be a subnet owner
- Hotkey must be currently registered and active

## Security Best Practices

### Wallet Security

1. **Backup Your Keys**

   ```bash
   # Create secure backups
   cp -r ~/.bittensor/wallets/ /secure/backup/location/
   ```

2. **Use Separate Hotkeys**

   ```bash
   # Create dedicated hotkeys for different purposes
   btcli wallet new_hotkey --wallet.name chutes-wallet --wallet.hotkey production
   btcli wallet new_hotkey --wallet.name chutes-wallet --wallet.hotkey development
   ```

3. **Secure Storage**
   - Store coldkey offline when possible
   - Use hardware wallets for large amounts
   - Never share your seed phrases

### API Key Security

1. **Principle of Least Privilege**

   ```bash
   # Create keys with minimal required permissions
   chutes keys create --name limited-key --chute-ids specific-id --action read
   ```

2. **Regular Rotation**

   ```bash
   # Rotate keys regularly
   chutes keys delete old-key
   chutes keys create --name new-key --admin
   ```

3. **Environment Management**

   ```bash
   # Use environment variables, never hardcode keys
   export CHUTES_API_KEY=cpk_your_key_here
   # Add to .env files, not source code
   ```

## Troubleshooting

### Common Authentication Issues

#### "Invalid hotkey" Error

```bash
# Check wallet status
btcli wallet list

# Verify hotkey registration
btcli wallet overview --wallet.name your-wallet
```

#### "Config not found" Error

```bash
# Check config location
echo $CHUTES_CONFIG_PATH
ls -la ~/.chutes/

# Re-register if needed
chutes register
```

#### "API key invalid" Error

```bash
# Verify the key exists
chutes keys list

# Check key permissions
chutes keys get your-key-name

# Test the key
curl -H "Authorization: Bearer cpk_your_key" \
  https://api.chutes.ai/users/me
```

### Network Issues

#### API Connection Problems

```bash
# Test API connectivity
curl -v https://api.chutes.ai/ping

# Check DNS resolution
nslookup api.chutes.ai

# Try alternative endpoints
export CHUTES_API_URL=https://backup.api.chutes.ai
```

### Wallet Issues

#### Bittensor Installation Problems

```bash
# Install a specific version
pip install bittensor==7.3.0

# Clear the cache if needed
pip cache purge
pip install --no-cache-dir 'bittensor<8'
```

#### Permission Errors

```bash
# Fix wallet permissions
chmod 600 ~/.bittensor/wallets/*/coldkey
chmod 600 ~/.bittensor/wallets/*/hotkeys/*
```

## Next Steps

Now that authentication is set up:

1. **[Quick Start Guide](quickstart)** - Deploy your first chute
2. **[Your First Custom Chute](first-chute)** - Build from scratch
3. **[API Key Management](../cli/account)** - Advanced key management
4. **[Security Best Practices](../guides/best-practices)** - Production security

## Getting Help

- 📖 **Documentation**: [Installation Guide](installation)
- 💬 **Discord**: [Community Support](https://discord.gg/wHrXwWkCRz)
- 🐛 **Issues**: [GitHub Issues](https://github.com/chutesai/chutes/issues)
- 📧 **Support**: `support@chutes.ai`

---

**Authentication set up?** Great! Now head to the [Quick Start Guide](quickstart) to deploy your first chute.

---

## SOURCE: https://chutes.ai/docs/getting-started/first-chute

# Your First Custom Chute

This guide walks you through building your first completely custom chute from scratch. Unlike templates, you'll learn to build every component yourself, giving you full control and understanding of the platform.
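One quick aside before building: the wallet-file permission fix from the authentication guide above (`chmod 600` on coldkey and hotkey files) can also be scripted in Python, which is handy in setup automation. A stdlib-only sketch, assuming the `~/.bittensor/wallets/` layout shown earlier:

```python
import os
import stat
from pathlib import Path

def lock_down_keys(wallet_root: Path) -> list[Path]:
    """Restrict coldkey/hotkey files to owner read/write (mode 600)."""
    secured = []
    # Patterns mirror: chmod 600 ~/.bittensor/wallets/*/coldkey
    #                  chmod 600 ~/.bittensor/wallets/*/hotkeys/*
    for pattern in ("*/coldkey", "*/hotkeys/*"):
        for key_file in wallet_root.glob(pattern):
            if key_file.is_file():
                os.chmod(key_file, stat.S_IRUSR | stat.S_IWUSR)  # 0o600
                secured.append(key_file)
    return secured

if __name__ == "__main__":
    for path in lock_down_keys(Path.home() / ".bittensor" / "wallets"):
        print(f"secured {path}")
```

Note that `os.chmod` only affects the read-only flag on Windows; on Linux/macOS it applies the full mode.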
## What We'll Build

We'll create a **sentiment analysis API** that:

- 🧠 **Loads a custom model** (a RoBERTa model fine-tuned for sentiment analysis)
- 🔍 **Validates inputs** with Pydantic schemas
- 🌐 **Provides REST endpoints** for single and batch processing
- 📊 **Returns structured results** with confidence scores
- 🏗️ **Uses a custom Docker image** with optimized dependencies

## Prerequisites

Make sure you've completed:

- ✅ [Installation & Setup](installation)
- ✅ [Quick Start Guide](quickstart) (recommended)
- ✅ [Authentication](authentication)

## Step 1: Plan Your Chute

Before coding, let's plan what we need:

### API Endpoints

- `POST /analyze` - Analyze a single text
- `POST /batch` - Analyze multiple texts
- `GET /health` - Health check

### Input/Output

- **Input**: Text string or array of strings
- **Output**: Sentiment label (POSITIVE/NEGATIVE/NEUTRAL) + confidence

### Resources

- **Model**: `cardiffnlp/twitter-roberta-base-sentiment-latest`
- **GPU**: 1x GPU with 8GB VRAM
- **Dependencies**: PyTorch, Transformers, FastAPI, Pydantic

## Step 2: Create Project Structure

Create a new directory for your project:

```bash
mkdir my-first-chute
cd my-first-chute
```

Create the main chute file:

```bash
touch sentiment_chute.py
```

## Step 3: Define Input/Output Schemas

Start by defining your data models with Pydantic:

```python
# sentiment_chute.py
from pydantic import BaseModel, Field, validator
from typing import List
from enum import Enum

class SentimentLabel(str, Enum):
    POSITIVE = "POSITIVE"
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"

class TextInput(BaseModel):
    text: str = Field(..., min_length=1, max_length=5000, description="Text to analyze")

    @validator('text')
    def text_must_not_be_empty(cls, v):
        if not v.strip():
            raise ValueError('Text cannot be empty or only whitespace')
        return v.strip()

class BatchTextInput(BaseModel):
    texts: List[str] = Field(..., min_items=1, max_items=50, description="List of texts to analyze")

    @validator('texts')
    def validate_texts(cls, v):
        cleaned_texts = []
        for i, text in enumerate(v):
            if not text or not text.strip():
                raise ValueError(f'Text at index {i} cannot be empty')
            if len(text) > 5000:
                raise ValueError(f'Text at index {i} is too long (max 5000 characters)')
            cleaned_texts.append(text.strip())
        return cleaned_texts

class SentimentResult(BaseModel):
    text: str
    sentiment: SentimentLabel
    confidence: float = Field(..., ge=0.0, le=1.0)
    processing_time: float

class BatchSentimentResult(BaseModel):
    results: List[SentimentResult]
    total_texts: int
    total_processing_time: float
    average_confidence: float
```

## Step 4: Build a Custom Docker Image

Define a custom Docker image with all necessary dependencies:

```python
# Add to sentiment_chute.py
from chutes.image import Image

# Create an optimized image for sentiment analysis
image = (
    Image(username="myuser", name="sentiment-chute", tag="1.0")
    # Start with CUDA-enabled Ubuntu
    .from_base("nvidia/cuda:12.2.0-runtime-ubuntu22.04")
    # Install Python 3.11
    .with_python("3.11")
    # Install system dependencies
    .run_command("""
        apt-get update && apt-get install -y \\
            git curl wget \\
            && rm -rf /var/lib/apt/lists/*
    """)
    # Install PyTorch with CUDA support
    .run_command("""
        pip install torch torchvision torchaudio \\
            --index-url https://download.pytorch.org/whl/cu121
    """)
    # Install transformers and other ML dependencies
    # (version specifiers are quoted so the shell doesn't treat '>' as a redirect)
    .run_command("""
        pip install \\
            'transformers>=4.30.0' \\
            'accelerate>=0.20.0' \\
            'tokenizers>=0.13.0' \\
            'numpy>=1.24.0' \\
            'scikit-learn>=1.3.0'
    """)
    # Set up the model cache directory
    .with_env("TRANSFORMERS_CACHE", "/app/models")
    .with_env("HF_HOME", "/app/models")
    .run_command("mkdir -p /app/models")
    # Set the working directory
    .set_workdir("/app")
)
```

## Step 5: Create the Chute

Now create the main chute with proper initialization:

```python
# Add to sentiment_chute.py
from chutes.chute import Chute, NodeSelector
from fastapi import HTTPException
import time
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np

# Define the chute
chute = 
Chute(
    username="myuser",  # Replace with your username
    name="sentiment-chute",
    image=image,
    tagline="Advanced sentiment analysis with confidence scoring",
    readme="""
# Sentiment Analysis Chute

A production-ready sentiment analysis service using RoBERTa.

## Features
- High-accuracy sentiment classification
- Confidence scoring for each prediction
- Batch processing support
- GPU acceleration
- Input validation and error handling

## Usage

### Single Text Analysis
```bash
curl -X POST https://myuser-sentiment-chute.chutes.ai/analyze \\
  -H "Content-Type: application/json" \\
  -d '{"text": "I love this new AI service!"}'
```

### Batch Analysis
```bash
curl -X POST https://myuser-sentiment-chute.chutes.ai/batch \\
  -H "Content-Type: application/json" \\
  -d '{
    "texts": [
      "This is amazing!",
      "Not very good...",
      "It works okay I guess"
    ]
  }'
```

## Response Format
```json
{
  "text": "I love this new AI service!",
  "sentiment": "POSITIVE",
  "confidence": 0.9847,
  "processing_time": 0.045
}
```
""",
    node_selector=NodeSelector(
        gpu_count=1,
        min_vram_gb_per_gpu=8,
        include=["rtx4090", "rtx3090", "a100"]  # Prefer these GPUs
    ),
    concurrency=4  # Handle up to 4 concurrent requests
)
```

## Step 6: Add Model Loading

Implement the startup function to load your model:

```python
# Add to sentiment_chute.py

@chute.on_startup()
async def load_model(self):
    """Load the sentiment analysis model and tokenizer."""
    print("🚀 Starting sentiment analysis chute...")

    # Model configuration
    model_name = "cardiffnlp/twitter-roberta-base-sentiment-latest"
    print(f"📥 Loading model: {model_name}")

    try:
        # Load tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        print("✅ Tokenizer loaded successfully")

        # Load model
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        print("✅ Model loaded successfully")

        # Set up device
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"🖥️ Using device: {self.device}")

        # Move model to device
        self.model.to(self.device)
        self.model.eval()  # Set to evaluation mode

        # Label mapping (specific to this model)
        self.label_mapping = {
            "LABEL_0": "NEGATIVE",
            "LABEL_1": "NEUTRAL",
            "LABEL_2": "POSITIVE"
        }

        # Warm up the model with a dummy input
        print("🔥 Warming up model...")
        dummy_text = "This is a test."
        await self._predict_sentiment(dummy_text)
        print("✅ Model loaded and ready!")

    except Exception as e:
        print(f"❌ Error loading model: {str(e)}")
        raise e

async def _predict_sentiment(self, text: str) -> tuple[str, float, float]:
    """
    Internal method to predict sentiment.

    Returns: (sentiment_label, confidence, processing_time)
    """
    start_time = time.time()

    try:
        # Tokenize input
        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            truncation=True,
            padding=True,
            max_length=512
        ).to(self.device)

        # Run inference
        with torch.no_grad():
            outputs = self.model(**inputs)
            predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

        # Get predicted class and confidence
        predicted_class_id = predictions.argmax().item()
        confidence = predictions[0][predicted_class_id].item()

        # Map to human-readable label
        model_label = self.model.config.id2label[predicted_class_id]
        sentiment_label = self.label_mapping.get(model_label, model_label)

        processing_time = time.time() - start_time
        return sentiment_label, confidence, processing_time

    except Exception as e:
        processing_time = time.time() - start_time
        raise HTTPException(
            status_code=500,
            detail=f"Sentiment prediction failed: {str(e)}"
        )
```

## Step 7: Implement API Endpoints

Add your API endpoints using the `@chute.cord` decorator:

```python
# Add to sentiment_chute.py

@chute.cord(
    public_api_path="/analyze",
    method="POST",
    input_schema=TextInput,
    output_content_type="application/json"
)
async def analyze_sentiment(self, data: TextInput) -> SentimentResult:
    """Analyze sentiment of a single text."""
    sentiment, confidence, processing_time = await self._predict_sentiment(data.text)

    return SentimentResult(
        text=data.text,
        sentiment=SentimentLabel(sentiment),
        confidence=confidence,
        processing_time=processing_time
    )

@chute.cord(
    public_api_path="/batch",
    method="POST",
    input_schema=BatchTextInput,
    output_content_type="application/json"
)
async def analyze_batch(self, data: BatchTextInput) -> BatchSentimentResult:
    """Analyze sentiment of multiple texts."""
    start_time = time.time()

    results = []
    confidences = []
    for text in data.texts:
        sentiment, confidence, proc_time = await self._predict_sentiment(text)
        results.append(SentimentResult(
            text=text,
            sentiment=SentimentLabel(sentiment),
            confidence=confidence,
            processing_time=proc_time
        ))
        confidences.append(confidence)

    total_processing_time = time.time() - start_time
    average_confidence = np.mean(confidences) if confidences else 0.0

    return BatchSentimentResult(
        results=results,
        total_texts=len(data.texts),
        total_processing_time=total_processing_time,
        average_confidence=average_confidence
    )

@chute.cord(
    public_api_path="/health",
    method="GET",
    output_content_type="application/json"
)
async def health_check(self) -> dict:
    """Health check endpoint."""
    model_loaded = hasattr(self, 'model') and hasattr(self, 'tokenizer')

    # Quick performance test
    if model_loaded:
        try:
            _, _, test_time = await self._predict_sentiment("Test message")
            performance_ok = test_time < 1.0  # Should be under 1 second
        except Exception:
            performance_ok = False
    else:
        performance_ok = False

    return {
        "status": "healthy" if model_loaded and performance_ok else "unhealthy",
        "model_loaded": model_loaded,
        "device": getattr(self, 'device', 'unknown'),
        "performance_ok": performance_ok,
        "gpu_available": torch.cuda.is_available(),
        "gpu_memory_total": torch.cuda.get_device_properties(0).total_memory / 1024**3 if torch.cuda.is_available() else None
    }
```

## Step 8: Add Local Testing

Add a local testing function to verify everything works:

```python
# Add to sentiment_chute.py

if __name__ == "__main__":
    import asyncio

    async def test_locally():
        """Test the chute locally before deploying."""
        print("🧪 Testing chute locally...")

        # Simulate the startup process
        await load_model(chute)

        # Test single analysis
        print("\n📝 Testing single text analysis...")
        test_input = TextInput(text="I absolutely love this new technology!")
        result = await analyze_sentiment(chute, test_input)
        print(f"Input: {result.text}")
        print(f"Sentiment: {result.sentiment}")
        print(f"Confidence: {result.confidence:.4f}")
        print(f"Processing time: {result.processing_time:.4f}s")

        # Test batch analysis
        print("\n📝 Testing batch analysis...")
        batch_input = BatchTextInput(texts=[
            "This is amazing!",
            "I hate this so much.",
            "It's okay, nothing special.",
            "Absolutely fantastic experience!"
        ])
        batch_result = await analyze_batch(chute, batch_input)
        print(f"Processed {batch_result.total_texts} texts")
        print(f"Average confidence: {batch_result.average_confidence:.4f}")
        print(f"Total time: {batch_result.total_processing_time:.4f}s")
        for i, res in enumerate(batch_result.results):
            print(f"  {i+1}. '{res.text}' -> {res.sentiment} ({res.confidence:.3f})")

        # Test health check
        print("\n🏥 Testing health check...")
        health = await health_check(chute)
        print(f"Status: {health['status']}")
        print(f"Device: {health['device']}")

        print("\n✅ All tests passed! Ready to deploy.")

    # Run local tests
    asyncio.run(test_locally())
```

## Step 9: Complete File

(Refer to the full file assembled across Steps 3-8.)

## Step 10: Test Locally

Before deploying, test your chute locally:

```bash
python sentiment_chute.py
```

## Step 11: Build and Deploy

### Build the Image

```bash
chutes build sentiment_chute:chute --wait
```

This will:
- 📦 Create your custom Docker image
- 🔧 Install all dependencies
- ⬇️ Download the model
- ✅ Validate the configuration

### Deploy the Chute

```bash
chutes deploy sentiment_chute:chute
```

After successful deployment:

```
✅ Chute deployed successfully!
🌐 Public API: https://myuser-sentiment-chute.chutes.ai
📋 Chute ID: 12345678-1234-5678-9abc-123456789012
```

## Step 12: Test Your Live API

Test your deployed chute:

### Single Text Analysis

```bash
curl -X POST https://myuser-sentiment-chute.chutes.ai/analyze \
  -H "Content-Type: application/json" \
  -d '{"text": "I absolutely love this new AI service!"}'
```

### Batch Analysis

```bash
curl -X POST https://myuser-sentiment-chute.chutes.ai/batch \
  -H "Content-Type: application/json" \
  -d '{
    "texts": [
      "This is amazing technology!",
      "I hate waiting in long lines.",
      "The weather is okay today."
    ]
  }'
```

### Health Check

```bash
curl https://myuser-sentiment-chute.chutes.ai/health
```

## Next Steps

Now that you understand the fundamentals, explore more advanced topics:

### Immediate Next Steps

- **[Streaming Responses](../examples/streaming-responses)** - Add real-time processing
- **[Batch Processing](../examples/batch-processing)** - Optimize for high throughput
- **[Input/Output Schemas](../guides/schemas)** - Advanced validation patterns

### Advanced Topics

- **[Custom Images Guide](../guides/custom-images)** - Advanced Docker configurations
- **[Performance Optimization](../guides/performance)** - Speed up your chutes
- **[Error Handling](../guides/error-handling)** - Robust error management
- **[Best Practices](../guides/best-practices)** - Production deployment patterns

## Troubleshooting

**Build fails with dependency errors?**
- Check Python package versions
- Ensure CUDA compatibility
- Verify base image availability

**Model loading takes too long?**
- The model downloads on first run (this is normal)
- Consider pre-downloading it in the Docker image
- Check your internet connection during the build

**GPU not detected?**
- Verify the CUDA installation in the image
- Check NodeSelector GPU requirements
- Ensure PyTorch CUDA support

## Getting Help

- 📖 **Documentation**: Continue with the advanced guides
- 💬 **Discord**: [Join our community](https://discord.gg/wHrXwWkCRz)
- 🐛 **Issues**: [GitHub Issues](https://github.com/chutesai/chutes/issues)
- 📧 **Support**: `support@chutes.ai`

---

🎉 **Congratulations!** You've built your first custom chute from scratch. You now have the foundation to create any AI application you can imagine with Chutes!

---

## SOURCE: https://chutes.ai/docs/getting-started/running-a-chute

# Running a Chute

This guide demonstrates how to call and run chutes in your applications using various programming languages. We'll cover examples for Python, TypeScript, Go, and Rust.

## Overview

Chutes can be invoked via simple HTTP POST requests to the endpoint:

```
POST https://{username}-{chute-name}.chutes.ai/{path}
```

Or using the API endpoint:

```
POST https://api.chutes.ai/chutes/{chute-id}/{path}
```

## Authentication

All requests require authentication using either:

- An API key in the `X-API-Key` header
- A bearer token in the `Authorization` header

## Python Example (using aiohttp)

### Basic LLM Invocation

```python
import aiohttp
import asyncio
import json

async def call_llm_chute():
    url = "https://myuser-my-llm.chutes.ai/v1/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "X-API-Key": "your-api-key-here"
    }
    payload = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!
How are you?"} ], "max_tokens": 100, "temperature": 0.7 } async with aiohttp.ClientSession() as session: async with session.post(url, headers=headers, json=payload) as response: result = await response.json() print(result["choices"][0]["message"]["content"]) # Run the async function asyncio.run(call_llm_chute()) ``` ### Streaming Response ```python import aiohttp import asyncio import json async def stream_llm_response(): url = "https://myuser-my-llm.chutes.ai/v1/chat/completions" headers = { "Content-Type": "application/json", "X-API-Key": "your-api-key-here" } payload = { "model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [ {"role": "user", "content": "Write a short story about AI"} ], "stream": True, "max_tokens": 500 } async with aiohttp.ClientSession() as session: async with session.post(url, headers=headers, json=payload) as response: async for line in response.content: if line: line_str = line.decode('utf-8').strip() if line_str.startswith("data: "): data = line_str[6:] if data != "[DONE]": try: chunk = json.loads(data) content = chunk["choices"][0]["delta"].get("content", "") print(content, end="", flush=True) except json.JSONDecodeError: pass asyncio.run(stream_llm_response()) ``` ### Image Generation ```python import aiohttp import asyncio import base64 async def generate_image(): url = "https://myuser-my-diffusion.chutes.ai/v1/images/generations" headers = { "Content-Type": "application/json", "X-API-Key": "your-api-key-here" } payload = { "prompt": "A beautiful sunset over mountains, oil painting style", "n": 1, "size": "1024x1024", "response_format": "b64_json" } async with aiohttp.ClientSession() as session: async with session.post(url, headers=headers, json=payload) as response: result = await response.json() # Save the image image_data = base64.b64decode(result["data"][0]["b64_json"]) with open("generated_image.png", "wb") as f: f.write(image_data) print("Image saved as generated_image.png") asyncio.run(generate_image()) ``` ## TypeScript 
Example > **Tip:** For TypeScript projects, consider using the [Vercel AI SDK Integration](/docs/integrations/vercel-ai-sdk) for a more streamlined developer experience with built-in streaming, tool calling, and type safety. ### Basic LLM Invocation ```typescript async function callLLMChute() { const url = "https://myuser-my-llm.chutes.ai/v1/chat/completions"; const response = await fetch(url, { method: "POST", headers: { "Content-Type": "application/json", "X-API-Key": "your-api-key-here" }, body: JSON.stringify({ model: "meta-llama/Llama-3.1-8B-Instruct", messages: [ { role: "system", content: "You are a helpful assistant." }, { role: "user", content: "Hello! How are you?" } ], max_tokens: 100, temperature: 0.7 }) }); const result = await response.json(); console.log(result.choices[0].message.content); } callLLMChute(); ``` ### Streaming Response ```typescript async function streamLLMResponse() { const url = "https://myuser-my-llm.chutes.ai/v1/chat/completions"; const response = await fetch(url, { method: "POST", headers: { "Content-Type": "application/json", "X-API-Key": "your-api-key-here" }, body: JSON.stringify({ model: "meta-llama/Llama-3.1-8B-Instruct", messages: [ { role: "user", content: "Write a short story about AI" } ], stream: true, max_tokens: 500 }) }); const reader = response.body!.getReader(); const decoder = new TextDecoder(); while (true) { const { done, value } = await reader.read(); if (done) break; const chunk = decoder.decode(value); const lines = chunk.split('\n'); for (const line of lines) { if (line.startsWith('data: ')) { const data = line.slice(6); if (data !== '[DONE]') { try { const parsed = JSON.parse(data); const content = parsed.choices[0].delta?.content || ''; process.stdout.write(content); } catch (e) { // Skip invalid JSON } } } } } } streamLLMResponse(); ``` ### Image Generation ```typescript import * as fs from 'fs'; async function generateImage() { const url = "https://myuser-my-diffusion.chutes.ai/v1/images/generations"; 
const response = await fetch(url, { method: "POST", headers: { "Content-Type": "application/json", "X-API-Key": "your-api-key-here" }, body: JSON.stringify({ prompt: "A beautiful sunset over mountains, oil painting style", n: 1, size: "1024x1024", response_format: "b64_json" }) }); const result = await response.json(); // Save the image const imageData = Buffer.from(result.data[0].b64_json, 'base64'); fs.writeFileSync('generated_image.png', imageData); console.log('Image saved as generated_image.png'); } generateImage(); ``` ## Go Example ### Basic LLM Invocation ```go package main import ( "bytes" "encoding/json" "fmt" "io" "net/http" ) type Message struct { Role string `json:"role"` Content string `json:"content"` } type ChatRequest struct { Model string `json:"model"` Messages []Message `json:"messages"` MaxTokens int `json:"max_tokens"` Temperature float64 `json:"temperature"` } type ChatResponse struct { Choices []struct { Message struct { Content string `json:"content"` } `json:"message"` } `json:"choices"` } func callLLMChute() error { url := "https://myuser-my-llm.chutes.ai/v1/chat/completions" request := ChatRequest{ Model: "meta-llama/Llama-3.1-8B-Instruct", Messages: []Message{ {Role: "system", Content: "You are a helpful assistant."}, {Role: "user", Content: "Hello! 
How are you?"}, }, MaxTokens: 100, Temperature: 0.7, } jsonData, err := json.Marshal(request) if err != nil { return err } req, err := http.NewRequest("POST", url, bytes.NewBuffer(jsonData)) if err != nil { return err } req.Header.Set("Content-Type", "application/json") req.Header.Set("X-API-Key", "your-api-key-here") client := &http.Client{} resp, err := client.Do(req) if err != nil { return err } defer resp.Body.Close() body, err := io.ReadAll(resp.Body) if err != nil { return err } var response ChatResponse err = json.Unmarshal(body, &response) if err != nil { return err } fmt.Println(response.Choices[0].Message.Content) return nil } func main() { if err := callLLMChute(); err != nil { fmt.Printf("Error: %v\n", err) } } ``` ## Rust Example ### Basic LLM Invocation ```rust use reqwest; use serde::{Deserialize, Serialize}; use tokio; #[derive(Serialize)] struct Message { role: String, content: String, } #[derive(Serialize)] struct ChatRequest { model: String, messages: Vec, max_tokens: i32, temperature: f32, } #[derive(Deserialize)] struct ChatResponse { choices: Vec, } #[derive(Deserialize)] struct Choice { message: MessageResponse, } #[derive(Deserialize)] struct MessageResponse { content: String, } #[tokio::main] async fn main() -> Result<(), Box> { let url = "https://myuser-my-llm.chutes.ai/v1/chat/completions"; let request = ChatRequest { model: "meta-llama/Llama-3.1-8B-Instruct".to_string(), messages: vec![ Message { role: "system".to_string(), content: "You are a helpful assistant.".to_string(), }, Message { role: "user".to_string(), content: "Hello! How are you?".to_string(), }, ], max_tokens: 100, temperature: 0.7, }; let client = reqwest::Client::new(); let response = client .post(url) .header("Content-Type", "application/json") .header("X-API-Key", "your-api-key-here") .json(&request) .send() .await? 
.json::() .await?; if let Some(choice) = response.choices.first() { println!("{}", choice.message.content); } Ok(()) } ``` ## Error Handling All examples should include proper error handling. Common error codes: - `401`: Invalid or missing API key - `403`: Access denied to the chute - `404`: Chute not found - `429`: Rate limit exceeded - `500`: Internal server error - `503`: Service temporarily unavailable Example error handling in Python: ```python async def call_with_error_handling(): try: async with aiohttp.ClientSession() as session: async with session.post(url, headers=headers, json=payload) as response: if response.status == 200: result = await response.json() return result else: error = await response.text() print(f"Error {response.status}: {error}") return None except aiohttp.ClientError as e: print(f"Request failed: {e}") return None ``` ## Best Practices 1. **Use Environment Variables**: Store API keys in environment variables rather than hardcoding them 2. **Implement Retries**: Add retry logic for transient failures 3. **Handle Rate Limits**: Respect rate limits and implement backoff strategies 4. **Stream Large Responses**: Use streaming for long-form content generation 5. **Set Timeouts**: Configure appropriate timeouts for your use case 6. **Monitor Usage**: Track API usage to manage costs effectively ## Next Steps - Learn about [Authentication](authentication) - Explore [Templates](../templates) for specific use cases - Check the [API Reference](../api-reference/overview) for detailed endpoint documentation - See [Examples](../examples) for more complex implementations --- ## SOURCE: https://chutes.ai/docs/core-concepts/chutes # Understanding Chutes A **Chute** is the fundamental building block of the Chutes platform. Think of it as a complete AI application that can be deployed to GPU-accelerated infrastructure with just a few lines of code. ## What is a Chute? A Chute is essentially a **FastAPI application** with superpowers for AI workloads. 
It provides:

- 🚀 **Serverless deployment** to GPU clusters
- 🔌 **Simple decorator-based API** definition
- 🏗️ **Custom Docker image** building
- ⚡ **Hardware resource** specification
- 📊 **Automatic scaling** based on demand
- 💰 **Pay-per-use** billing

## Basic Chute Structure

```python
from chutes.chute import Chute, NodeSelector
from chutes.image import Image

# Define your custom image (optional)
image = (
    Image(username="myuser", name="my-ai-app", tag="1.0")
    .from_base("nvidia/cuda:12.2-devel-ubuntu22.04")
    .with_python("3.11")
    .run_command("pip install torch transformers")
)

# Create your chute
chute = Chute(
    username="myuser",
    name="my-ai-app",
    image=image,  # or use a string like "my-custom-image:latest"
    tagline="My awesome AI application",
    readme="# My AI App\nThis app does amazing things!",
    node_selector=NodeSelector(
        gpu_count=1,
        min_vram_gb_per_gpu=16
    ),
    concurrency=4
)

# Add startup initialization
@chute.on_startup()
async def initialize_model(self):
    import torch
    from transformers import AutoModel, AutoTokenizer

    self.device = "cuda" if torch.cuda.is_available() else "cpu"
    self.model = AutoModel.from_pretrained("bert-base-uncased")
    self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    # Move model to GPU
    self.model.to(self.device)

# Define API endpoints
@chute.cord(public_api_path="/predict")
async def predict(self, text: str) -> dict:
    import torch  # imported here since the startup import is function-scoped

    inputs = self.tokenizer(text, return_tensors="pt").to(self.device)
    with torch.no_grad():
        outputs = self.model(**inputs)
    return {"prediction": outputs.last_hidden_state.mean().item()}
```

## Chute Constructor Parameters

### Required Parameters

#### `username: str`

Your Chutes platform username. This is used for:
- Image naming and organization
- URL generation (`username-chute-name.chutes.ai`)
- Access control and billing

```python
chute = Chute(username="myuser", ...)  # Required
```

#### `name: str`

The name of your chute.
Must be:
- Alphanumeric with hyphens/underscores
- Unique within your account
- Used in the public URL

```python
chute = Chute(name="my-awesome-app", ...)  # Required
```

#### `image: str | Image`

The Docker image to use. Can be:
- A string reference to an existing image: `"nvidia/cuda:12.2-runtime-ubuntu22.04"`
- A custom `Image` object with build instructions
- A pre-built template image: `"chutes/vllm:latest"`

```python
# Using a string reference
chute = Chute(image="nvidia/cuda:12.2-runtime-ubuntu22.04", ...)

# Using a custom Image object
from chutes.image import Image
custom_image = Image(username="myuser", name="my-image", tag="1.0")
chute = Chute(image=custom_image, ...)
```

### Optional Parameters

#### `tagline: str = ""`

A short description displayed in the Chutes dashboard and API listings.

```python
chute = Chute(tagline="Fast text generation with custom models", ...)
```

#### `readme: str = ""`

Markdown documentation for your chute. Supports full markdown syntax.

````python
chute = Chute(
    readme="""
# My AI Application

This chute provides text generation capabilities using a fine-tuned model.

## Usage

```bash
curl -X POST https://myuser-myapp.chutes.ai/generate \\
  -d '{"prompt": "Hello world"}'
```

## Features
- Fast inference
- Streaming support
- Custom fine-tuning
""",
    ...
)
````

#### `node_selector: NodeSelector = None`

Hardware requirements for your chute. If not specified, uses default settings.

```python
from chutes.chute import NodeSelector

chute = Chute(
    node_selector=NodeSelector(
        gpu_count=2,
        min_vram_gb_per_gpu=24,
        include=["a100", "h100"],  # Preferred GPU types
        exclude=["k80", "p100"]    # Avoid older GPUs
    ),
    ...
)
```

#### `concurrency: int = 1`

Maximum number of simultaneous requests each instance can handle.

```python
# Handle up to 8 requests simultaneously
chute = Chute(concurrency=8, ...)
```

#### `revision: str = None`

Version control for your chute deployment.

```python
chute = Chute(revision="v1.2.0", ...)
```

#### `standard_template: str = None`

Used internally by template builders. Generally not set manually.

## Chute Methods

### Lifecycle Methods

#### `@chute.on_startup()`

Decorator for functions that run when your chute starts up. Use this for:
- Model loading and initialization
- Database connections
- Preprocessing setup

```python
@chute.on_startup()
async def load_model(self):
    # This runs once when the chute starts
    self.model = load_my_model()
    self.preprocessor = setup_preprocessing()
```

#### `@chute.on_shutdown()`

Decorator for cleanup functions that run when your chute shuts down.

```python
@chute.on_shutdown()
async def cleanup(self):
    # This runs when the chute is shutting down
    if hasattr(self, 'database'):
        await self.database.close()
```

### API Definition Methods

#### `@chute.cord(...)`

Define HTTP API endpoints. See [Cords Documentation](/docs/core-concepts/cords) for details.

```python
@chute.cord(
    public_api_path="/predict",
    method="POST",
    input_schema=MyInputSchema,
    output_content_type="application/json"
)
async def predict(self, data: MyInputSchema) -> dict:
    return {"result": "prediction"}
```

#### `@chute.job(...)`

Define background jobs or long-running tasks. See [Jobs Documentation](/docs/core-concepts/jobs) for details.
```python
@chute.job(timeout=3600, upload=True)
async def train_model(self, training_data: dict):
    # Long-running training job
    pass
```

## Chute Properties

### Read-Only Properties

```python
# Access chute metadata
print(chute.name)           # Chute name
print(chute.uid)            # Unique identifier
print(chute.username)       # Owner username
print(chute.tagline)        # Short description
print(chute.readme)         # Documentation
print(chute.node_selector)  # Hardware requirements
print(chute.image)          # Docker image reference
print(chute.cords)          # List of API endpoints
print(chute.jobs)           # List of background jobs
```

## Advanced Usage

### Custom Context Management

You can store data in the chute instance that persists across requests:

```python
@chute.on_startup()
async def setup(self):
    # This data persists for the lifetime of the chute instance
    self.cache = {}
    self.request_count = 0

@chute.cord(public_api_path="/cached-predict")
async def cached_predict(self, text: str) -> dict:
    # Access persistent data
    self.request_count += 1

    if text in self.cache:
        return self.cache[text]

    result = await expensive_computation(text)
    self.cache[text] = result
    return result
```

### Integration with FastAPI Features

Since Chute extends FastAPI, you can use FastAPI features directly:

```python
from fastapi import HTTPException, Depends

@chute.cord(public_api_path="/secure-endpoint")
async def secure_endpoint(self, data: str, api_key: str = Depends(validate_api_key)):
    if not api_key:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return {"secure_data": process_data(data)}
```

### Environment Variables

Access environment variables in your chute:

```python
import os

@chute.on_startup()
async def configure(self):
    self.debug_mode = os.getenv("DEBUG", "false").lower() == "true"
    self.model_path = os.getenv("MODEL_PATH", "/app/models/default")
```

## Best Practices

### 1. Resource Management

```python
@chute.on_startup()
async def initialize(self):
    # Pre-load models and resources
    self.model = load_model()  # Do this once, not per request

@chute.on_shutdown()
async def cleanup(self):
    # Clean up resources
    if hasattr(self, 'model'):
        del self.model
```

### 2. Error Handling

```python
@chute.cord(public_api_path="/predict")
async def predict(self, text: str) -> dict:
    try:
        result = await self.model.predict(text)
        return {"result": result}
    except Exception as e:
        # Log the error and return a user-friendly message
        logger.error(f"Prediction failed: {e}")
        raise HTTPException(status_code=500, detail="Prediction failed")
```

### 3. Input Validation

```python
from pydantic import BaseModel, Field

class PredictionInput(BaseModel):
    text: str = Field(..., min_length=1, max_length=1000)
    temperature: float = Field(0.7, ge=0.0, le=2.0)

@chute.cord(input_schema=PredictionInput)
async def predict(self, data: PredictionInput) -> dict:
    # Input is automatically validated
    return await self.model.generate(data.text, temperature=data.temperature)
```

### 4. Performance Optimization

```python
@chute.on_startup()
async def optimize(self):
    import torch

    # Optimize for inference
    torch.set_num_threads(1)
    torch.backends.cudnn.benchmark = True

    # Pre-compile models if possible
    self.model = torch.jit.script(self.model)
```

## Common Patterns

### Model Loading

```python
@chute.on_startup()
async def load_models(self):
    from transformers import AutoModel, AutoTokenizer
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model_name = "bert-base-uncased"

    self.tokenizer = AutoTokenizer.from_pretrained(model_name)
    self.model = AutoModel.from_pretrained(model_name).to(device)
    self.device = device
```

### Batched Processing

```python
@chute.cord(public_api_path="/batch-predict")
async def batch_predict(self, texts: list[str]) -> list[dict]:
    # Process multiple inputs efficiently
    inputs = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    inputs = {k: v.to(self.device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = self.model(**inputs)

    return [{"result": output.tolist()} for output in outputs.last_hidden_state]
```

### Streaming Responses

```python
@chute.cord(public_api_path="/stream", stream=True)
async def stream_generate(self, prompt: str):
    for token in self.model.generate_stream(prompt):
        yield {"token": token}
```

## Next Steps

- **[Cords (API Endpoints)](/docs/core-concepts/cords)** - Learn how to define custom API endpoints
- **[Jobs (Background Tasks)](/docs/core-concepts/jobs)** - Understand background job processing
- **[Images (Docker Containers)](/docs/core-concepts/images)** - Build custom Docker environments
- **[Node Selection](/docs/core-concepts/node-selection)** - Optimize hardware allocation
- **[Your First Custom Chute](/docs/getting-started/first-chute)** - Complete example walkthrough

---

## SOURCE: https://chutes.ai/docs/core-concepts/cords

# Cords (API Endpoints)

**Cords** are the way you define HTTP API endpoints in your Chutes.
Think of them as FastAPI routes, but with additional features for AI workloads like streaming, input validation, and automatic scaling.

## What is a Cord?

A Cord is a decorated function that becomes an HTTP API endpoint. The name comes from "parachute cord" - the connection between your chute and the outside world.

```python
@chute.cord(public_api_path="/predict")
async def predict(self, text: str) -> dict:
    result = await self.model.predict(text)
    return {"prediction": result}
```

This creates an endpoint accessible at `https://your-username-your-chute.chutes.ai/predict`.

## Basic Cord Definition

### Simple Cord

```python
from chutes.chute import Chute

chute = Chute(username="myuser", name="my-chute", image="my-image")

@chute.cord(public_api_path="/hello")
async def say_hello(self, name: str) -> dict:
    return {"message": f"Hello, {name}!"}
```

### With Input Validation

```python
from pydantic import BaseModel, Field

class GreetingInput(BaseModel):
    name: str = Field(..., min_length=1, max_length=100)
    language: str = Field("en", regex="^(en|es|fr|de)$")

@chute.cord(
    public_api_path="/greet",
    input_schema=GreetingInput
)
async def greet(self, data: GreetingInput) -> dict:
    greetings = {
        "en": f"Hello, {data.name}!",
        "es": f"¡Hola, {data.name}!",
        "fr": f"Bonjour, {data.name}!",
        "de": f"Hallo, {data.name}!"
    }
    return {"greeting": greetings[data.language]}
```

## Cord Parameters

### Required Parameters

#### `public_api_path: str`

The URL path where your endpoint will be accessible.

```python
@chute.cord(public_api_path="/predict")         # https://user-chute.chutes.ai/predict
@chute.cord(public_api_path="/api/v1/generate") # https://user-chute.chutes.ai/api/v1/generate
```

### Optional Parameters

#### `method: str = "POST"`

HTTP method for the endpoint.
```python
@chute.cord(public_api_path="/status", method="GET")
async def get_status(self) -> dict:
    return {"status": "healthy"}

@chute.cord(public_api_path="/update", method="PUT")
async def update_config(self, config: dict) -> dict:
    return {"updated": True}
```

#### `input_schema: BaseModel = None`

Pydantic model for automatic input validation and API documentation.

```python
from pydantic import BaseModel, Field

class PredictionInput(BaseModel):
    text: str = Field(..., description="Input text to analyze")
    max_length: int = Field(100, ge=1, le=1000, description="Maximum output length")
    temperature: float = Field(0.7, ge=0.0, le=2.0, description="Sampling temperature")

@chute.cord(
    public_api_path="/predict",
    input_schema=PredictionInput
)
async def predict(self, data: PredictionInput) -> dict:
    # Automatic validation and type conversion
    return await self.model.generate(
        data.text,
        max_length=data.max_length,
        temperature=data.temperature
    )
```

#### `minimal_input_schema: BaseModel = None`

Simplified input schema for easier testing and basic usage.

```python
class FullInput(BaseModel):
    text: str
    max_length: int = Field(100, ge=1, le=1000)
    temperature: float = Field(0.7, ge=0.0, le=2.0)
    top_p: float = Field(0.9, ge=0.0, le=1.0)
    frequency_penalty: float = Field(0.0, ge=-2.0, le=2.0)

class SimpleInput(BaseModel):
    text: str  # Only required field

@chute.cord(
    public_api_path="/generate",
    input_schema=FullInput,
    minimal_input_schema=SimpleInput  # For simpler API calls
)
async def generate(self, data: FullInput) -> dict:
    return await self.model.generate(data.text, **data.dict(exclude={'text'}))
```

#### `output_content_type: str = None`

Specify the content type of the response.
```python
@chute.cord(
    public_api_path="/generate-image",
    output_content_type="image/jpeg"
)
async def generate_image(self, prompt: str) -> Response:
    image_bytes = await self.model.generate_image(prompt)
    return Response(content=image_bytes, media_type="image/jpeg")

@chute.cord(
    public_api_path="/generate-audio",
    output_content_type="audio/wav"
)
async def generate_audio(self, text: str) -> Response:
    audio_bytes = await self.tts_model.synthesize(text)
    return Response(content=audio_bytes, media_type="audio/wav")
```

#### `stream: bool = False`

Enable streaming responses for real-time output.

```python
@chute.cord(
    public_api_path="/stream-generate",
    stream=True
)
async def stream_generate(self, prompt: str):
    # Yield tokens as they're generated
    async for token in self.model.generate_stream(prompt):
        yield {"token": token, "done": False}
    yield {"token": "", "done": True}
```

#### `passthrough: bool = False`

Proxy requests to another service running in the same container.

```python
@chute.cord(
    public_api_path="/v1/chat/completions",
    passthrough=True,
    passthrough_path="/v1/chat/completions",
    passthrough_port=8000
)
async def chat_completions(self, data):
    # Automatically forwards to localhost:8000/v1/chat/completions
    return data
```

## Function Signatures

### Self Parameter

All cord functions must take `self` as the first parameter, which provides access to the chute instance.
```python @chute.cord(public_api_path="/predict") async def predict(self, text: str) -> dict: # Access chute instance data result = await self.model.predict(text) self.request_count += 1 return {"result": result, "count": self.request_count} ``` ### Input Parameters #### Direct Parameters ```python @chute.cord(public_api_path="/simple") async def simple_endpoint(self, text: str, temperature: float = 0.7) -> dict: return {"text": text, "temperature": temperature} ``` #### Pydantic Model Input ```python @chute.cord(public_api_path="/validated", input_schema=MyInput) async def validated_endpoint(self, data: MyInput) -> dict: return {"processed": data.text} ``` ### Return Types #### JSON Response (Default) ```python @chute.cord(public_api_path="/json") async def json_response(self, text: str) -> dict: return {"result": "processed"} # Automatically serialized to JSON ``` #### Custom Response Objects ```python from fastapi import Response @chute.cord(public_api_path="/custom") async def custom_response(self, data: str) -> Response: return Response( content="Custom content", media_type="text/plain", headers={"X-Custom-Header": "value"} ) ``` #### Streaming Responses ```python @chute.cord(public_api_path="/stream", stream=True) async def streaming_response(self, prompt: str): for i in range(10): yield {"chunk": i, "data": f"Generated text {i}"} ``` ## Advanced Features ### Error Handling ```python from fastapi import HTTPException @chute.cord(public_api_path="/predict") async def predict(self, text: str) -> dict: if not text.strip(): raise HTTPException(status_code=400, detail="Text cannot be empty") try: result = await self.model.predict(text) return {"prediction": result} except Exception as e: # Log the error logger.error(f"Prediction failed: {e}") raise HTTPException(status_code=500, detail="Prediction failed") ``` ### Request Context ```python from fastapi import Request @chute.cord(public_api_path="/context") async def with_context(self, request: Request, text: str) 
-> dict: # Access request metadata client_ip = request.client.host user_agent = request.headers.get("user-agent") return { "result": await self.model.predict(text), "metadata": { "client_ip": client_ip, "user_agent": user_agent } } ``` ### File Uploads ```python from fastapi import UploadFile, File @chute.cord(public_api_path="/upload") async def upload_file(self, file: UploadFile = File(...)) -> dict: contents = await file.read() # Process the uploaded file result = await self.process_file(contents, file.content_type) return { "filename": file.filename, "size": len(contents), "result": result } ``` ### Response Headers ```python from fastapi import Response @chute.cord(public_api_path="/with-headers") async def with_headers(self, text: str) -> dict: result = await self.model.predict(text) # Add custom headers (if returning Response object) response = Response( content=json.dumps({"result": result}), media_type="application/json" ) response.headers["X-Processing-Time"] = "123ms" response.headers["X-Model-Version"] = self.model_version return response ``` ## Streaming in Detail ### Text Streaming ```python @chute.cord(public_api_path="/stream-text", stream=True) async def stream_text(self, prompt: str): async for token in self.model.generate_stream(prompt): yield { "choices": [{ "delta": {"content": token}, "index": 0, "finish_reason": None }] } # Signal completion yield { "choices": [{ "delta": {}, "index": 0, "finish_reason": "stop" }] } ``` ### Binary Streaming ```python @chute.cord( public_api_path="/stream-audio", stream=True, output_content_type="audio/wav" ) async def stream_audio(self, text: str): async for audio_chunk in self.tts_model.synthesize_stream(text): yield audio_chunk ``` ### Server-Sent Events ```python @chute.cord( public_api_path="/events", stream=True, output_content_type="text/event-stream" ) async def server_sent_events(self, prompt: str): async for event in self.model.generate_events(prompt): yield f"data: {json.dumps(event)}\n\n" ``` ## 
Best Practices ### 1. Input Validation ```python from pydantic import BaseModel, Field, validator class TextInput(BaseModel): text: str = Field(..., min_length=1, max_length=10000) language: str = Field("en", regex="^[a-z]{2}$") @validator('text') def text_must_not_be_empty(cls, v): if not v.strip(): raise ValueError('Text cannot be empty or whitespace only') return v.strip() @chute.cord(input_schema=TextInput) async def process_text(self, data: TextInput) -> dict: # Input is guaranteed to be valid return await self.model.process(data.text, data.language) ``` ### 2. Error Handling ```python @chute.cord(public_api_path="/robust") async def robust_endpoint(self, text: str) -> dict: try: # Validate input if not text or len(text.strip()) == 0: raise HTTPException(status_code=400, detail="Text is required") if len(text) > 10000: raise HTTPException(status_code=413, detail="Text too long") # Process request result = await self.model.predict(text) return {"result": result, "status": "success"} except HTTPException: # Re-raise HTTP exceptions raise except Exception as e: # Log unexpected errors logger.exception(f"Unexpected error in robust_endpoint: {e}") raise HTTPException(status_code=500, detail="Internal server error") ``` ### 3. Performance Optimization ```python @chute.cord(public_api_path="/optimized") async def optimized_endpoint(self, texts: list[str]) -> dict: # Batch processing for efficiency if len(texts) > 100: raise HTTPException(status_code=413, detail="Too many texts") # Process in batches results = [] batch_size = 32 for i in range(0, len(texts), batch_size): batch = texts[i:i + batch_size] batch_results = await self.model.predict_batch(batch) results.extend(batch_results) return {"results": results} ``` ### 4. 
Resource Management

```python
import os

@chute.cord(public_api_path="/resource-managed")
async def resource_managed_endpoint(self, file_data: bytes) -> dict:
    temp_file = None
    try:
        # Create temporary resources
        temp_file = await self.create_temp_file(file_data)

        # Process
        result = await self.model.process_file(temp_file)
        return {"result": result}
    finally:
        # Always clean up
        if temp_file and os.path.exists(temp_file):
            os.remove(temp_file)
```

## Common Patterns

### Authentication

```python
from fastapi import Depends, Header, HTTPException
import jwt

async def verify_token(authorization: str = Header(None)):
    if not authorization or not authorization.startswith("Bearer "):
        raise HTTPException(status_code=401, detail="Missing or invalid token")

    token = authorization.split(" ")[1]
    try:
        payload = jwt.decode(token, "secret", algorithms=["HS256"])
        return payload
    except jwt.InvalidTokenError:
        raise HTTPException(status_code=401, detail="Invalid token")

@chute.cord(public_api_path="/secure")
async def secure_endpoint(self, text: str, user=Depends(verify_token)) -> dict:
    return {
        "result": await self.model.predict(text),
        "user": user["username"]
    }
```

### Rate Limiting

```python
import time
from collections import defaultdict

from fastapi import HTTPException, Request

# Simple in-memory rate limiter
request_counts = defaultdict(list)

@chute.cord(public_api_path="/rate-limited")
async def rate_limited_endpoint(self, request: Request, text: str) -> dict:
    client_ip = request.client.host
    current_time = time.time()

    # Clean old requests (older than 1 minute)
    request_counts[client_ip] = [
        req_time for req_time in request_counts[client_ip]
        if current_time - req_time < 60
    ]

    # Check rate limit (max 10 requests per minute)
    if len(request_counts[client_ip]) >= 10:
        raise HTTPException(status_code=429, detail="Rate limit exceeded")

    # Record this request
    request_counts[client_ip].append(current_time)

    return {"result": await self.model.predict(text)}
```

### Caching

```python
import hashlib
import json

@chute.on_startup()
async def setup_cache(self):
    self.cache = {}
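# NOTE (an addition, not part of the original example): self.cache above grows
# without bound. A minimal size cap using only the standard library -- call
# evict_if_full(self.cache) before inserting a new entry:
def evict_if_full(cache: dict, max_entries: int = 1024) -> None:
    # Dicts preserve insertion order (Python 3.7+), so popping the first
    # key drops the oldest entry until there is room for one more.
    while len(cache) >= max_entries:
        cache.pop(next(iter(cache)))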
@chute.cord(public_api_path="/cached")
async def cached_endpoint(self, text: str, temperature: float = 0.7) -> dict:
    # Create cache key
    cache_key = hashlib.md5(
        json.dumps({"text": text, "temperature": temperature}).encode()
    ).hexdigest()

    # Check cache
    if cache_key in self.cache:
        return {"result": self.cache[cache_key], "cached": True}

    # Compute result
    result = await self.model.predict(text, temperature=temperature)

    # Store in cache
    self.cache[cache_key] = result

    return {"result": result, "cached": False}
```

## Testing Cords

### Unit Testing

```python
import pytest
from httpx import AsyncClient

@pytest.mark.asyncio
async def test_predict_endpoint():
    async with AsyncClient(app=chute, base_url="http://test") as client:
        response = await client.post(
            "/predict",
            json={"text": "Hello world"}
        )
        assert response.status_code == 200
        data = response.json()
        assert "result" in data
```

### Local Testing

```python
if __name__ == "__main__":
    # Test locally before deployment
    import uvicorn
    uvicorn.run(chute, host="0.0.0.0", port=8000)
```

## Next Steps

- **[Jobs (Background Tasks)](/docs/core-concepts/jobs)** - Learn about long-running tasks
- **[Input/Output Schemas](/docs/guides/schemas)** - Deep dive into validation
- **[Streaming Responses](/docs/guides/streaming)** - Advanced streaming patterns
- **[Error Handling](/docs/guides/error-handling)** - Robust error management

---

## SOURCE: https://chutes.ai/docs/core-concepts/images

# Images (Docker Containers)

**Images** in Chutes define the Docker environment where your AI applications run. You can use pre-built images or create custom ones with a fluent Python API that generates optimized Dockerfiles.

## What is an Image?

An Image is a Docker container definition that includes:

- 🐧 **Base operating system** (usually Ubuntu with CUDA)
- 🐍 **Python environment** and packages
- 🧠 **AI frameworks** (PyTorch, TensorFlow, etc.)
- 📦 **System dependencies** and tools - ⚙️ **Environment variables** and configuration - 👤 **User setup** and permissions ## Using Pre-built Images ### Popular Base Images ```python # NVIDIA CUDA images "nvidia/cuda:12.2-devel-ubuntu22.04" "nvidia/cuda:11.8-runtime-ubuntu20.04" # Chutes optimized images "chutes/cuda-python:12.2-py311" "chutes/pytorch:2.1-cuda12.2" "chutes/tensorflow:2.13-cuda11.8" # Specialized AI framework images "pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel" "tensorflow/tensorflow:2.13.0-gpu" ``` ### Using String References ```python from chutes.chute import Chute chute = Chute( username="myuser", name="my-chute", image="nvidia/cuda:12.2-devel-ubuntu22.04" # Simple string reference ) ``` ## Building Custom Images ### Basic Custom Image ```python from chutes.image import Image image = ( Image(username="myuser", name="text-analyzer", tag="1.0") .from_base("nvidia/cuda:12.2-devel-ubuntu22.04") .with_python("3.11") .run_command("pip install torch transformers accelerate") .with_env("MODEL_CACHE", "/app/models") ) ``` ### Image Constructor Parameters #### Required Parameters ```python Image( username="myuser", # Your Chutes username name="my-image", # Image name (alphanumeric + hyphens) tag="1.0" # Version tag ) ``` #### Full Example ```python image = Image( username="myuser", name="advanced-nlp", tag="2.1.3", readme="Advanced NLP processing with multiple models" ) ``` ## Image Building Methods ### Base Image Selection #### `.from_base(base_image: str)` Set the base Docker image: ```python # CUDA development environment .from_base("nvidia/cuda:12.2-devel-ubuntu22.04") # Lightweight runtime .from_base("nvidia/cuda:12.2-runtime-ubuntu22.04") # Pre-built PyTorch .from_base("pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel") ``` ### Python Environment #### `.with_python(version: str)` Install a specific Python version: ```python .with_python("3.11") # Python 3.11 (recommended) .with_python("3.10") # Python 3.10 .with_python("3.9") # Python 3.9 ``` #### 
Installing Python Packages Use `run_command()` to install Python packages: ```python # Individual packages .run_command("pip install torch transformers numpy") # With versions .run_command("pip install torch==2.1.0 transformers>=4.21.0") # From requirements file .run_command("pip install -r requirements.txt") ``` #### Installing Conda Packages Use `run_command()` to install packages via conda: ```python .run_command("conda install pytorch torchvision torchaudio") .run_command("conda install cudatoolkit=11.8 numpy scipy") ``` ### System Commands #### `.run_command(command: str)` Execute arbitrary shell commands: ```python # Install system packages .run_command("apt-get update && apt-get install -y git curl wget") # Download models .run_command("wget https://example.com/model.bin -O /app/model.bin") # Set up directories .run_command("mkdir -p /app/models /app/data /app/logs") # Compile native extensions .run_command("cd /app && python setup.py build_ext --inplace") ``` ### Environment Variables #### `.with_env(key: str, value: str)` Set environment variables: ```python .with_env("CUDA_VISIBLE_DEVICES", "0") .with_env("TRANSFORMERS_CACHE", "/app/cache") .with_env("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512") .with_env("MODEL_PATH", "/app/models/my-model") ``` ### File Operations #### `.add(*args, **kwargs)` Add files to the image: ```python # Add files to the image .add("config.json", "/app/config.json") # Add directories .add("models/", "/app/models/") # Add requirements file .add("requirements.txt", "/app/requirements.txt") ``` ### User Management #### `.set_user(user: str)` Set the user for the container: ```python # Set user .set_user("appuser") # Set user for chutes .set_user("chutes") ``` #### `.set_workdir(directory: str)` Set the working directory: ```python .set_workdir("/app") .set_workdir("/workspace/myproject") ``` ## Complete Example ```python from chutes.image import Image # Build a comprehensive NLP processing image image = ( Image( 
        username="myuser",
        name="nlp-suite",
        tag="1.2.0",
        readme="Complete NLP processing suite with multiple models"
    )

    # Start with CUDA base
    .from_base("nvidia/cuda:12.2-devel-ubuntu22.04")

    # Install system dependencies
    .run_command("""
        apt-get update && apt-get install -y \\
            git curl wget unzip \\
            build-essential \\
            ffmpeg \\
        && rm -rf /var/lib/apt/lists/*
    """)

    # Set up Python
    .with_python("3.11")

    # Install core ML packages
    .run_command("""
        pip install \\
            torch==2.1.0 \\
            torchvision==0.16.0 \\
            torchaudio==2.1.0 \\
            transformers>=4.30.0 \\
            accelerate>=0.20.0 \\
            datasets>=2.12.0 \\
            tokenizers>=0.13.0
    """)

    # Install additional NLP tools
    .run_command("""
        pip install \\
            spacy>=3.6.0 \\
            nltk>=3.8 \\
            scikit-learn>=1.3.0 \\
            pandas>=2.0.0 \\
            numpy>=1.24.0
    """)

    # Set up directories
    .run_command("mkdir -p /app/models /app/data /app/cache /app/logs")

    # Add application files
    .add("requirements.txt", "/app/requirements.txt")
    .add("src/", "/app/src/")
    .add("config/", "/app/config/")

    # Set environment variables
    .with_env("TRANSFORMERS_CACHE", "/app/cache")
    .with_env("HF_HOME", "/app/cache")
    .with_env("TORCH_HOME", "/app/cache/torch")
    .with_env("PYTHONPATH", "/app/src")

    # Download spaCy models
    .run_command("python -m spacy download en_core_web_sm")
    .run_command("python -m spacy download en_core_web_lg")

    # Download NLTK data
    .run_command("""
        python -c "
        import nltk
        nltk.download('punkt')
        nltk.download('stopwords')
        nltk.download('wordnet')
        "
    """)

    # Set working directory and user
    .set_workdir("/app")
    .set_user("appuser")
)
```

## Advanced Features

### Multi-stage Builds

```python
# Build stage for compiling
build_image = (
    Image(username="myuser", name="builder", tag="temp")
    .from_base("nvidia/cuda:12.2-devel-ubuntu22.04")
    .with_python("3.11")
    .run_command("pip install cython numpy")
    .add("src/", "/build/src/")
    .run_command("cd /build && python setup.py build_ext")
)

# Production stage with compiled artifacts
production_image = (
    Image(username="myuser", name="production",
tag="1.0") .from_base("nvidia/cuda:12.2-runtime-ubuntu22.04") .with_python("3.11") .add("/build/dist/", "/app/") .run_command("pip install torch transformers") ) ``` ### Conditional Building ```python def build_image_for_gpu(gpu_type: str) -> Image: image = ( Image(username="myuser", name=f"model-{gpu_type}", tag="1.0") .from_base("nvidia/cuda:12.2-devel-ubuntu22.04") .with_python("3.11") ) if gpu_type == "a100": # Optimize for A100 image = image.with_env("TORCH_CUDA_ARCH_LIST", "8.0") elif gpu_type == "v100": # Optimize for V100 image = image.with_env("TORCH_CUDA_ARCH_LIST", "7.0") return image.run_command("pip install torch transformers") ``` ### Template Images ```python def create_pytorch_image(username: str, name: str, pytorch_version: str = "2.1.0") -> Image: """Template for PyTorch-based images""" return ( Image(username=username, name=name, tag=pytorch_version) .from_base("nvidia/cuda:12.2-devel-ubuntu22.04") .with_python("3.11") .run_command(f"pip install torch=={pytorch_version}") .run_command("pip install torchvision torchaudio") .with_env("TORCH_CUDA_ARCH_LIST", "7.0;8.0;8.6") .set_workdir("/app") ) # Use the template my_image = create_pytorch_image("myuser", "my-pytorch-app") ``` ## Image Building Process ### Local Building ```bash # Build image locally chutes build my_chute:chute --wait # Build with custom tag chutes build my_chute:chute --tag custom-v1.0 # Build without cache chutes build my_chute:chute --no-cache ``` ### Remote Building Images are built on Chutes infrastructure with: - 🚀 **Fast build times** with optimized caching - 🔒 **Secure environment** with isolated builds - 📦 **Automatic registry** management - 🏗️ **Multi-architecture** support ### Build Optimization ```python # Layer caching - put stable operations first image = ( Image(username="myuser", name="optimized", tag="1.0") .from_base("nvidia/cuda:12.2-devel-ubuntu22.04") # System packages (rarely change) .run_command("apt-get update && apt-get install -y git curl") # Python 
installation (stable) .with_python("3.11") # Core dependencies (change less frequently) .run_command("pip install torch==2.1.0 transformers==4.30.0") # Application-specific packages (change more frequently) .run_command("pip install -r requirements.txt") # Application code (changes most frequently) .add("src/", "/app/src/") ) ``` ## Best Practices ### 1. Layer Optimization ```python # Good: Group related operations .run_command(""" apt-get update && \\ apt-get install -y git curl wget && \\ rm -rf /var/lib/apt/lists/* """) # Bad: Separate operations create more layers .run_command("apt-get update") .run_command("apt-get install -y git") .run_command("apt-get install -y curl") ``` ### 2. Security ```python # Use specific versions .run_command("pip install torch==2.1.0 transformers==4.30.0") # Create non-root user .set_user("appuser") # Clean up package caches .run_command("apt-get clean && rm -rf /var/lib/apt/lists/*") ``` ### 3. Size Optimization ```python # Combine operations to reduce layers .run_command(""" pip install torch transformers && \\ pip cache purge && \\ rm -rf ~/.cache/pip """) # Add only what you need .add("src/", "/app/src/") # Only add what you need ``` ### 4. 
Environment Consistency ```python # Pin all versions .with_python("3.11.5") .run_command("pip install torch==2.1.0+cu121 transformers==4.30.2") # Set explicit environment .with_env("PYTHONPATH", "/app/src") .with_env("CUDA_VISIBLE_DEVICES", "0") ``` ## Common Patterns ### AI Framework Setup ```python # PyTorch with CUDA pytorch_image = ( Image(username="myuser", name="pytorch-app", tag="1.0") .from_base("pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel") .run_command("pip install transformers accelerate datasets") .with_env("TORCH_CUDA_ARCH_LIST", "7.0;8.0;8.6") ) # TensorFlow with CUDA tensorflow_image = ( Image(username="myuser", name="tensorflow-app", tag="1.0") .from_base("tensorflow/tensorflow:2.13.0-gpu") .run_command("pip install tensorflow-datasets tensorflow-hub") .with_env("TF_FORCE_GPU_ALLOW_GROWTH", "true") ) ``` ### Model Downloading ```python model_image = ( Image(username="myuser", name="model-app", tag="1.0") .from_base("nvidia/cuda:12.2-runtime-ubuntu22.04") .with_python("3.11") .run_command("pip install transformers torch") # Pre-download models during build .run_command(""" python -c " from transformers import AutoModel, AutoTokenizer AutoModel.from_pretrained('bert-base-uncased') AutoTokenizer.from_pretrained('bert-base-uncased') " """) .with_env("TRANSFORMERS_CACHE", "/app/cache") ) ``` ## Next Steps - **[Chutes](/docs/core-concepts/chutes)** - Learn how to use images in Chutes - **[Node Selection](/docs/core-concepts/node-selection)** - Hardware requirements - **[Custom Image Building Guide](/docs/guides/custom-images)** - Advanced image building - **[Template Images](/docs/core-concepts/templates)** - Pre-built image templates --- ## SOURCE: https://chutes.ai/docs/core-concepts/jobs # Jobs (Background Tasks) **Jobs** are background tasks in Chutes that handle long-running operations, file uploads, and asynchronous processing. Unlike Cords (API endpoints), Jobs don't need to respond immediately and can run for extended periods. ## What is a Job? 
A Job is a decorated function that can: - 🕐 **Run for extended periods** (hours or days) - 📁 **Handle file uploads** and downloads - 🔄 **Process data asynchronously** - 💾 **Store results** in persistent storage - 📊 **Track progress** and status - 🔄 **Retry on failure** automatically ## Basic Job Definition ```python from chutes.chute import Chute chute = Chute(username="myuser", name="my-chute", image="my-image") @chute.job(timeout=3600) # 1 hour timeout async def process_data(self, data: dict) -> dict: # Long-running processing logic result = await expensive_computation(data) return {"status": "completed", "result": result} ``` ## Job Decorator Parameters ### `timeout: int = 300` Maximum time the job can run (in seconds). ```python @chute.job(timeout=7200) # 2 hours async def long_training_job(self, config: dict): # Training logic that might take hours pass ``` ### `upload: bool = False` Whether the job accepts file uploads. ```python @chute.job(upload=True, timeout=1800) async def process_video(self, video_file: bytes) -> dict: # Process uploaded video file return {"processed": True} ``` ### `retry: int = 0` Number of automatic retries on failure. 
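Because a failed job is retried from the beginning, any handler used with `retry > 0` should be safe to re-run. One common guard is a completion marker checked before doing the work. This is a generic sketch, not SDK behavior; `marker_path` and `work` are illustrative names:

```python
import os

def run_once(marker_path: str, work) -> str:
    """Run `work()` unless a previous attempt already finished."""
    if os.path.exists(marker_path):
        # An earlier (possibly retried) run completed; skip the side effects.
        return "already_completed"
    result = work()
    # Write the marker only after the work succeeds, so a crash part-way
    # through still triggers a clean re-run on the next retry.
    with open(marker_path, "w") as f:
        f.write("done")
    return result
```
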
```python @chute.job(retry=3, timeout=600) async def unreliable_task(self, data: dict): # Will retry up to 3 times if it fails pass ``` ## Input Types ### Simple Data ```python @chute.job() async def analyze_text(self, text: str, language: str = "en") -> dict: analysis = await perform_analysis(text, language) return {"sentiment": analysis.sentiment, "topics": analysis.topics} ``` ### Structured Input with Pydantic ```python from pydantic import BaseModel class TrainingConfig(BaseModel): model_type: str learning_rate: float epochs: int batch_size: int @chute.job(timeout=14400) # 4 hours async def train_model(self, config: TrainingConfig) -> dict: model = create_model(config.model_type) results = await train(model, config) return {"accuracy": results.accuracy, "loss": results.final_loss} ``` ### File Uploads ```python @chute.job(upload=True, timeout=3600) async def process_dataset(self, dataset_file: bytes) -> dict: # Save uploaded file with open("/tmp/dataset.csv", "wb") as f: f.write(dataset_file) # Process the dataset df = pd.read_csv("/tmp/dataset.csv") results = analyze_dataset(df) return {"rows": len(df), "analysis": results} ``` ## Progress Tracking For long-running jobs, you can track and report progress: ```python @chute.job(timeout=7200) async def batch_process(self, items: list) -> dict: results = [] total = len(items) for i, item in enumerate(items): # Process each item result = await process_item(item) results.append(result) # Report progress (this is logged) progress = (i + 1) / total * 100 print(f"Progress: {progress:.1f}% ({i+1}/{total})") return {"processed": len(results), "results": results} ``` ## Error Handling ```python @chute.job(retry=2, timeout=1800) async def resilient_job(self, data: dict) -> dict: try: result = await risky_operation(data) return {"success": True, "result": result} except TemporaryError as e: # This will trigger a retry raise e except PermanentError as e: # Return error instead of raising to avoid retries return {"success": 
False, "error": str(e)} ``` ## Working with Files ### Processing Uploaded Files ```python import tempfile import os @chute.job(upload=True, timeout=1800) async def process_image(self, image_file: bytes) -> dict: # Create temporary file with tempfile.NamedTemporaryFile(delete=False, suffix=".jpg") as tmp: tmp.write(image_file) tmp_path = tmp.name try: # Process the image processed = await image_processing_function(tmp_path) return {"processed": True, "features": processed} finally: # Clean up os.unlink(tmp_path) ``` ### Generating Files for Download ```python @chute.job(timeout=3600) async def generate_report(self, report_config: dict) -> dict: # Generate report report_data = await create_report(report_config) # Save to file (this could be uploaded to cloud storage) report_path = f"/tmp/report_{report_config['id']}.pdf" save_report_as_pdf(report_data, report_path) return { "report_generated": True, "report_path": report_path, "pages": len(report_data) } ``` ## State Management Jobs can maintain state throughout their execution: ```python @chute.job(timeout=7200) async def training_job(self, config: dict) -> dict: # Initialize training state self.training_state = { "epoch": 0, "best_accuracy": 0.0, "model_checkpoints": [] } for epoch in range(config["epochs"]): self.training_state["epoch"] = epoch # Train for one epoch accuracy = await train_epoch(epoch) if accuracy > self.training_state["best_accuracy"]: self.training_state["best_accuracy"] = accuracy # Save checkpoint checkpoint_path = f"/tmp/checkpoint_epoch_{epoch}.pt" save_checkpoint(checkpoint_path) self.training_state["model_checkpoints"].append(checkpoint_path) return self.training_state ``` ## Job Lifecycle 1. **Queued**: Job is submitted and waiting to run 2. **Running**: Job is executing 3. **Completed**: Job finished successfully 4. **Failed**: Job encountered an error 5. **Retrying**: Job failed but will retry (if retry > 0) 6. 
**Timeout**: Job exceeded timeout limit ## Running Jobs ### Programmatically ```python # Submit a job job_id = await chute.submit_job("process_data", {"input": "data"}) # Check job status status = await chute.get_job_status(job_id) # Get job results (when completed) results = await chute.get_job_results(job_id) ``` ### Via HTTP API ```bash # Submit a job curl -X POST https://your-username-your-chute.chutes.ai/jobs/process_data \ -H "Content-Type: application/json" \ -d '{"input": "data"}' # Check status curl https://your-username-your-chute.chutes.ai/jobs/{job_id}/status # Get results curl https://your-username-your-chute.chutes.ai/jobs/{job_id}/results ``` ## Best Practices ### 1. Set Appropriate Timeouts ```python # Short tasks @chute.job(timeout=300) # 5 minutes # Medium tasks @chute.job(timeout=1800) # 30 minutes # Long training jobs @chute.job(timeout=14400) # 4 hours ``` ### 2. Handle Failures Gracefully ```python @chute.job(retry=2) async def robust_job(self, data: dict) -> dict: try: return await process_data(data) except Exception as e: # Log the error logger.error(f"Job failed: {e}") # Return error info instead of raising return {"success": False, "error": str(e)} ``` ### 3. Use Progress Tracking ```python @chute.job(timeout=3600) async def batch_job(self, items: list) -> dict: for i, item in enumerate(items): # Process item await process_item(item) # Log progress every 10 items if i % 10 == 0: print(f"Processed {i}/{len(items)} items") ``` ### 4. 
Clean Up Resources ```python @chute.job(timeout=1800) async def file_processing_job(self, data: dict) -> dict: temp_files = [] try: # Create temporary files for file_data in data["files"]: tmp_file = create_temp_file(file_data) temp_files.append(tmp_file) # Process files results = await process_files(temp_files) return results finally: # Always clean up for tmp_file in temp_files: os.unlink(tmp_file) ``` ## Common Use Cases ### Model Training ```python @chute.job(timeout=14400, retry=1) async def train_custom_model(self, training_data: dict) -> dict: # Load training data dataset = load_dataset(training_data["dataset_path"]) # Initialize model model = create_model(training_data["model_config"]) # Train model for epoch in range(training_data["epochs"]): loss = await train_epoch(model, dataset) print(f"Epoch {epoch}: Loss = {loss}") # Save trained model model_path = f"/tmp/trained_model_{int(time.time())}.pt" save_model(model, model_path) return {"model_path": model_path, "final_loss": loss} ``` ### Data Processing Pipeline ```python @chute.job(upload=True, timeout=7200) async def process_pipeline(self, raw_data: bytes) -> dict: # Stage 1: Parse data parsed_data = parse_raw_data(raw_data) print(f"Parsed {len(parsed_data)} records") # Stage 2: Clean data cleaned_data = clean_data(parsed_data) print(f"Cleaned data, {len(cleaned_data)} records remaining") # Stage 3: Transform data transformed_data = transform_data(cleaned_data) print(f"Transformed data complete") # Stage 4: Generate insights insights = generate_insights(transformed_data) return { "records_processed": len(parsed_data), "records_final": len(transformed_data), "insights": insights } ``` ### Batch Image Processing ```python @chute.job(timeout=3600) async def batch_image_process(self, image_urls: list) -> dict: results = [] for i, url in enumerate(image_urls): try: # Download and process image image = await download_image(url) processed = await process_image(image) results.append({"url": url, "success": True, 
"result": processed}) except Exception as e: results.append({"url": url, "success": False, "error": str(e)}) # Progress update if i % 10 == 0: print(f"Processed {i}/{len(image_urls)} images") success_count = sum(1 for r in results if r["success"]) return { "total": len(image_urls), "successful": success_count, "failed": len(image_urls) - success_count, "results": results } ``` ## Next Steps - **[Chutes](/docs/core-concepts/chutes)** - Learn about the main Chute class - **[Cords](/docs/core-concepts/cords)** - Understand API endpoints - **[Images](/docs/core-concepts/images)** - Build custom Docker environments - **[Your First Custom Chute](/docs/getting-started/first-chute)** - Complete example walkthrough --- ## SOURCE: https://chutes.ai/docs/core-concepts/node-selection # Node Selection (Hardware) **Node Selection** in Chutes allows you to specify exactly what hardware your application needs. This ensures optimal performance while controlling costs by only using the GPU resources you actually need. ## What is Node Selection? Node Selection defines the hardware requirements for your chute: - 🖥️ **GPU type and count** (A100, H100, V100, etc.) - 💾 **VRAM requirements** per GPU - 🔧 **CPU and memory** specifications - 🎯 **Hardware preferences** (include/exclude specific types) - 🌍 **Geographic regions** for deployment ## Basic Node Selection ```python from chutes.chute import NodeSelector, Chute # Simple GPU requirement node_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16 ) chute = Chute( username="myuser", name="my-chute", image="my-image", node_selector=node_selector ) ``` ## NodeSelector Parameters ### GPU Requirements #### `gpu_count: int` Number of GPUs your application needs. ```python # Single GPU for small models NodeSelector(gpu_count=1) # Multi-GPU for large models NodeSelector(gpu_count=4) # Maximum parallelization NodeSelector(gpu_count=8) ``` #### `min_vram_gb_per_gpu: int` Minimum VRAM (video memory) required per GPU. 
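A quick way to sanity-check `min_vram_gb_per_gpu` is the rule of thumb that model weights alone take roughly 2 bytes per parameter at fp16, with activations and KV cache adding overhead on top. A small helper illustrating the arithmetic (a sketch, not part of the SDK):

```python
def estimate_weights_vram_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate VRAM (GiB) needed just to hold the model weights.

    bytes_per_param: 2 for fp16/bf16, 4 for fp32, 1 for int8.
    """
    return params_billions * 1e9 * bytes_per_param / 1024**3
```

For example, a 7B-parameter model at fp16 needs about 13 GiB for weights alone, which is why the configurations below pair 7B-class models with 16 GB GPUs.
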
```python # Small models (e.g., BERT, small LLMs) NodeSelector(min_vram_gb_per_gpu=8) # Medium models (e.g., 7B parameter models) NodeSelector(min_vram_gb_per_gpu=16) # Large models (e.g., 13B+ parameter models) NodeSelector(min_vram_gb_per_gpu=24) # Very large models (e.g., 70B+ parameter models) NodeSelector(min_vram_gb_per_gpu=80) ``` ### Hardware Preferences #### `include: list[str] = None` Prefer specific GPU types or categories. ```python # Prefer latest generation GPUs NodeSelector(include=["a100", "h100"]) # Prefer high-memory GPUs NodeSelector(include=["a100_80gb", "h100_80gb"]) # Include budget-friendly options NodeSelector(include=["rtx4090", "rtx3090"]) ``` #### `exclude: list[str] = None` Avoid specific GPU types or categories. ```python # Avoid older generation GPUs NodeSelector(exclude=["k80", "p100", "v100"]) # Avoid specific models NodeSelector(exclude=["rtx3080", "rtx2080"]) # Avoid low-memory variants NodeSelector(exclude=["a100_40gb"]) ``` ### CPU and Memory #### `min_cpu_count: int = None` Minimum CPU cores required. ```python # CPU-intensive preprocessing NodeSelector(min_cpu_count=16) # Heavy data loading NodeSelector(min_cpu_count=32) ``` #### `min_memory_gb: int = None` Minimum system RAM required. ```python # Large dataset in memory NodeSelector(min_memory_gb=64) # Very large preprocessing NodeSelector(min_memory_gb=256) ``` ### Geographic Preferences #### `regions: list[str] = None` Preferred deployment regions. 
```python # US regions only NodeSelector(regions=["us-east", "us-west"]) # Europe regions NodeSelector(regions=["eu-west", "eu-central"]) # Global deployment NodeSelector(regions=["us-east", "eu-west", "asia-pacific"]) ``` ## Common Hardware Configurations ### Small Language Models (< 1B parameters) ```python # BERT, DistilBERT, small T5 models small_model_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=8 ) ``` ### Medium Language Models (1B - 7B parameters) ```python # GPT-2, small LLaMA models, Flan-T5 medium_model_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16, include=["rtx4090", "a100", "h100"] ) ``` ### Large Language Models (7B - 30B parameters) ```python # LLaMA 7B-13B, GPT-3 variants large_model_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24, include=["a100", "h100"], exclude=["rtx3080", "rtx4080"] # Not enough VRAM ) ``` ### Very Large Language Models (30B+ parameters) ```python # LLaMA 30B+, GPT-4 class models xl_model_selector = NodeSelector( gpu_count=2, min_vram_gb_per_gpu=80, include=["a100_80gb", "h100_80gb"] ) ``` ### Massive Models (100B+ parameters) ```python # Very large models requiring model parallelism massive_model_selector = NodeSelector( gpu_count=8, min_vram_gb_per_gpu=80, include=["a100_80gb", "h100_80gb"], min_cpu_count=64, min_memory_gb=512 ) ``` ## GPU Types and Specifications ### NVIDIA A100 ```python # A100 40GB - excellent for most workloads NodeSelector( gpu_count=1, min_vram_gb_per_gpu=40, include=["a100_40gb"] ) # A100 80GB - for very large models NodeSelector( gpu_count=1, min_vram_gb_per_gpu=80, include=["a100_80gb"] ) ``` ### NVIDIA H100 ```python # Latest generation, highest performance NodeSelector( gpu_count=1, min_vram_gb_per_gpu=80, include=["h100"] ) ``` ### RTX Series (Cost-Effective) ```python # RTX 4090 - excellent price/performance NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24, include=["rtx4090"] ) # RTX 3090 - budget option NodeSelector( gpu_count=1, 
min_vram_gb_per_gpu=24, include=["rtx3090"] ) ``` ### V100 (Legacy but Stable) ```python # V100 for proven workloads NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16, include=["v100"] ) ``` ## Advanced Selection Strategies ### Cost Optimization ```python # Prefer cost-effective GPUs cost_optimized = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16, include=["rtx4090", "rtx3090", "v100"], exclude=["a100", "h100"] # More expensive ) ``` ### Performance Optimization ```python # Prefer highest performance performance_optimized = NodeSelector( gpu_count=2, min_vram_gb_per_gpu=80, include=["h100", "a100_80gb"], exclude=["rtx", "v100"] # Lower performance ) ``` ### Availability Optimization ```python # Prefer widely available hardware availability_optimized = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16, include=["rtx4090", "a100", "v100"], regions=["us-east", "us-west", "eu-west"] ) ``` ### Multi-Region Deployment ```python # Global deployment with failover global_deployment = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24, include=["a100", "h100"], regions=["us-east", "us-west", "eu-west", "asia-pacific"] ) ``` ## Memory Requirements by Use Case ### Text Generation ```python # Small models (up to 7B parameters) text_gen_small = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16 ) # Large models (7B-30B parameters) text_gen_large = NodeSelector( gpu_count=2, min_vram_gb_per_gpu=40 ) ``` ### Image Generation ```python # Stable Diffusion variants image_gen = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=12, # SD 1.5/2.1 include=["rtx4090", "a100"] ) # High-resolution image generation image_gen_hires = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24, # SDXL, custom models include=["rtx4090", "a100"] ) ``` ### Video Processing ```python # Video analysis and generation video_processing = NodeSelector( gpu_count=2, min_vram_gb_per_gpu=24, min_cpu_count=16, min_memory_gb=64 ) ``` ### Training Workloads ```python # Model fine-tuning training_workload = NodeSelector( 
gpu_count=4, min_vram_gb_per_gpu=40, min_cpu_count=32, min_memory_gb=128, include=["a100", "h100"] ) ``` ## Template-Specific Recommendations ### VLLM Template ```python from chutes.chute.template.vllm import build_vllm_chute # Optimized for VLLM inference vllm_chute = build_vllm_chute( username="myuser", model_name="microsoft/DialoGPT-medium", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16, include=["a100", "h100", "rtx4090"] # VLLM optimized ) ) ``` ### Diffusion Template ```python from chutes.chute.template.diffusion import build_diffusion_chute # Optimized for image generation diffusion_chute = build_diffusion_chute( username="myuser", model_name="stabilityai/stable-diffusion-xl-base-1.0", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=12, include=["rtx4090", "a100"] # Good for image gen ) ) ``` ## Best Practices ### 1. Start Conservative ```python # Begin with minimum requirements conservative_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16 ) # Scale up if needed ``` ### 2. Test Different Configurations ```python # Development configuration dev_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=8, include=["rtx3090", "rtx4090"] ) # Production configuration prod_selector = NodeSelector( gpu_count=2, min_vram_gb_per_gpu=40, include=["a100", "h100"] ) ``` ### 3. Consider Cost vs Performance ```python # Budget-conscious budget_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16, include=["rtx4090", "v100"], exclude=["a100", "h100"] ) # Performance-critical performance_selector = NodeSelector( gpu_count=2, min_vram_gb_per_gpu=80, include=["h100", "a100_80gb"] ) ``` ### 4. 
Plan for Scaling ```python # Single instance single_instance = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24 ) # Multi-instance ready multi_instance = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24, regions=["us-east", "us-west", "eu-west"] ) ``` ## Monitoring and Optimization ### Resource Utilization Monitor your chute's actual resource usage: ```python # Over-provisioned (waste of money) over_provisioned = NodeSelector( gpu_count=4, # Using only 1 min_vram_gb_per_gpu=80 # Using only 20GB ) # Right-sized (cost-effective) right_sized = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24 ) ``` ### Performance Tuning ```python # CPU-bound preprocessing cpu_intensive = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16, min_cpu_count=16, # Extra CPU for preprocessing min_memory_gb=64 ) # GPU-bound inference gpu_intensive = NodeSelector( gpu_count=2, # More GPU power min_vram_gb_per_gpu=40, min_cpu_count=8 # Less CPU needed ) ``` ## Troubleshooting ### Common Issues #### "No available nodes" ```python # Too restrictive problematic = NodeSelector( gpu_count=8, min_vram_gb_per_gpu=80, include=["h100"], regions=["specific-rare-region"] ) # More flexible flexible = NodeSelector( gpu_count=4, # Reduced requirement min_vram_gb_per_gpu=40, include=["h100", "a100_80gb"], # More options regions=["us-east", "us-west"] # More regions ) ``` #### "High costs" ```python # Expensive configuration expensive = NodeSelector( gpu_count=8, min_vram_gb_per_gpu=80, include=["h100"] ) # Cost-optimized alternative cost_optimized = NodeSelector( gpu_count=2, min_vram_gb_per_gpu=40, include=["a100", "rtx4090"] ) ``` #### "Poor performance" ```python # Underpowered underpowered = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=8, include=["rtx3080"] ) # Better performance better_performance = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24, include=["rtx4090", "a100"] ) ``` ## Next Steps - **[Chutes](/docs/core-concepts/chutes)** - Learn how to use NodeSelector in Chutes - 
**[Templates](/docs/core-concepts/templates)** - Pre-configured hardware for common use cases - **[Best Practices Guide](/docs/guides/best-practices)** - Optimization strategies - **[Cost Management](/docs/guides/cost-optimization)** - Control and optimize costs --- ## SOURCE: https://chutes.ai/docs/core-concepts/security-architecture # Chutes Security/Integrity # 1. Introduction and Guiding Principles This document provides a comprehensive overview of the security measures implemented within the Chutes serverless compute platform. Our security model is built on a defense-in-depth strategy, with multiple layers of verification and protection to ensure the integrity of the compute environment and the privacy of user data. ## Guiding Principles The Chutes network is designed for an adversarial environment where miners are anonymous and permissionless. Our security posture is therefore built on the principle of "don't trust, verify." We employ a multi-faceted approach to security, including: - **End-to-End Encryption:** All communication between the user, the validator, and the miner is encrypted. - **Code and Filesystem Integrity:** We continuously verify that the code running on miners' machines has not been tampered with. - **Environment Attestation:** We collect and verify detailed information about the miner's hardware and software environment. - **Containment:** We strictly limit the capabilities of the running code, including network access and access to the host system. - **Trusted Execution Environments (TEE):** For the highest level of assurance, we leverage Intel TDX and NVIDIA GPUs to create a fully isolated and verifiable compute environment. ## Security Layers The following sections detail the different layers of our security model, from the base-level protections applied to all chutes to the advanced TEE-based measures. # 2. 
Standard Security Measures (Non-TEE) These security measures are applied to all chutes running on the network, regardless of whether they are in a TEE or not. They form the baseline of trust and verification for the entire platform. ## Private Security Components (High-Level Overview) The Chutes platform utilizes several closed-source security components to protect against various attack vectors. While the source code for these components is not public, their functionality is described below. - **`cfsv` (Chutes Secure Filesystem Validation):** Responsible for ensuring the integrity of the container's filesystem. It works by building an index of all files and generating secure cryptographic digests based on random challenge seeds provided by the validator. This prevents unauthorized modifications to the filesystem. The source-of-truth for these digests is generated during the image build process. - **`cllmv` (Chutes Large Language Model Verification):** This component integrates with the SGLang inference engine to provide per-token verification hashes. Crucially, the specified Hugging Face model name and exact revision hash are cryptographically bound into the per-token proofs. This allows for cryptographic verification that every single token of output was generated by the *exact* model and revision claimed by the miner, making it impossible to spoof results from a cheaper or different model. - **`envdump` (Environment Dump):** Securely collects a comprehensive snapshot of the miner's environment. This includes environment variables, filesystem information, kernel details, and loaded Python modules. This data is sent to the validator to ensure the miner's environment conforms to the expected configuration. - **`inspecto` (Python Code Inspection):** This tool performs static analysis of Python bytecode for all loaded modules. 
It detects and prevents attempts by a miner to override standard library paths or insert malicious "logic bombs" that a simple file hash might miss. It generates a secure hash of the bytecode, which is compared against a source-of-truth hash generated at image build time. - **`chutes-net-nanny` (Network and Process Nanny):** A critical component for runtime security and containment. Its responsibilities include: - **Network Access Control:** Limits outbound network connections to a predefined set of hosts. - **Filesystem Encryption:** Encrypts the main "chute" source file to protect intellectual property. - **Integrity Verification:** Uses self-referencing hashes to ensure its own integrity. - **DNS Verification:** Prevents DNS spoofing attacks. - **Pod Access Prevention:** Intentionally causes a segmentation fault if any attempt is made to `exec` into the pod, run a sidecar container, or connect to a local service not in the process tree. This defeats a huge class of common container-based attacks. - **`graval-priv` (GPU Attestation):** This component provides "Proof of Consecutive VRAM Work" to cryptographically attest to the physical properties of the GPU. It uses OpenCL and the clBLAS library for broad compatibility with GPUs from different manufacturers, including NVIDIA and AMD. The process involves performing a series of consecutive matrix multiplications on the GPU. To create a verifiable yet efficient benchmark, it takes diagonal memory slices from the matrices, drastically reducing data transfer overhead while retaining a cryptographic proof that the full multiplication occurred. The time taken to complete these operations, combined with the memory access patterns, provides a hardware-level signature of the GPU's processing speed and available VRAM. This prevents miners from fraudulently claiming to have a more powerful GPU than they actually possess. 
This attestation process also enables the creation of a unique AES-256 encryption key based on the specific GPU's UUID and a random challenge, tying the secure communication channel to the verified physical hardware. ## Public Security Components (Detailed Description) The following open-source components are key to the Chutes security model. - **`chutes` (Miner-side Library):** The core library that is injected into every chute container. It orchestrates the entire startup and validation process from the miner's perspective. The main logic is in [`chutes/chutes/entrypoint/run.py`](https://github.com/chutesai/chutes/blob/main/chutes/chutes/entrypoint/run.py), which executes a multi-stage security handshake to ensure the integrity of the environment before any user code is run. For specific applications like SGLang LLMs, the Chutes library wrapper implements additional hardening: it launches the SGLang process with a password and strictly binds it only to the loopback interface (`127.0.0.1`). This ensures that nothing can directly access the inference server on the miner node except authenticated, validated, and signed calls originating from the validator, which are securely proxied *through the Chutes library wrapper itself*. - **`chutes-api` (Validator and API):** The central validator and API server for the Chutes network. It is responsible for creating the trusted environment that miners must adhere to, validating miners against that baseline, and securely relaying requests. Its key security functions are distributed across several components: - [**`api/image/forge.py`](https://github.com/chutesai/chutes-api/blob/main/api/image/forge.py): The Source of Truth** This is arguably the most critical security component on the validator side. The `forge` is responsible for building all chute images that run on the network. It establishes the "source of truth" that all miners are subsequently validated against. 
It performs controlled, multi-stage builds, generates filesystem and bytecode baselines, scans for vulnerabilities, and cryptographically signs the final image with `cosign`. - [**`api/graval_worker.py`](https://github.com/chutesai/chutes-api/blob/main/api/graval_worker.py) and [`api/instance/router.py`](https://github.com/chutesai/chutes-api/blob/main/api/instance/router.py): Miner Validation and Activation** These components handle the other side of the conversation with the miner's `entrypoint/run.py`, verifying the initial handshake, performing hardware attestation, and issuing the symmetric encryption key only upon successful validation of all proofs. - [**`watchtower.py`](https://github.com/chutesai/chutes-api/blob/main/watchtower.py): Continuous Monitoring and Active Defense** The `watchtower` is an active defense system that continuously monitors the health and integrity of all active miners on the network. It goes beyond simple liveness checks and performs deep, randomized validation: 1. **Software Integrity Checks:** It can issue random challenges to miners at any time, instructing them to perform on-demand `cfsv`, `inspecto`, or `envdump` checks and return the results. 2. **Model Weight Verification:** To ensure the correct model is loaded and to defeat "bait-and-switch" attacks (where a miner loads the correct model at startup but swaps it for a cheaper one later), the `watchtower` can command a chute to read its model files at a random offset and return a SHA256 hash of that data slice. The validator compares this against the correct hash for the specified model, making it computationally infeasible for a miner to use a different or modified set of model weights. If a miner fails any of these checks or does not respond, it is immediately removed from the network. - **`chutes-miner` (Miner Management):** This repository contains the tools for miners to manage their chute deployments. 
It acts as the local enforcement layer, translating the validator's desired state into actual running pods on the miner's Kubernetes cluster, using a JWT-based authorization flow to ensure no chute can launch without explicit permission. # 3. TEE Security Measures: The `sek8s` Environment While the standard security measures provide a robust defense-in-depth strategy, for users who require the highest possible level of assurance and data confidentiality, Chutes offers deployment in a Trusted Execution Environment (TEE). This is powered by our custom, security-hardened Kubernetes distribution, **`sek8s`**. The `sek8s` environment, located in the public [`sek8s`](https://github.com/chutesai/sek8s) repository, is designed from the ground up to run workloads within Intel® Trust Domain Extensions (TDX) confidential virtual machines. When a chute runs in a `sek8s` environment, it is not just protected by our standard validation mechanisms; it is further isolated by hardware-level security guarantees. This provides a verifiable and impenetrable black box for your data and code. Here are the key security features of `sek8s`, which work in concert with all the previously mentioned security layers: ## Intel® TDX Deep Dive: Creating the Confidential VM Intel® TDX is the cornerstone of our TEE offering. It allows us to create a special type of virtual machine called a Trust Domain (TD) that is isolated from almost everything else on the system. - **Secure Arbitration Mode (SEAM):** TDX introduces a new CPU mode called SEAM. This is a hardware-enforced layer that sits alongside the standard VMX modes used by hypervisors. A special, Intel-signed and hardware-resident module called the "TDX module" operates within SEAM. This module is responsible for creating, managing, and tearing down Trust Domains. 
The key is that the host's hypervisor (or Virtual Machine Monitor, VMM) is no longer fully in control; it must make requests to the TDX module to interact with a TD, and the TDX module will refuse any request that would violate the TD's confidentiality or integrity. - **Memory Encryption and Integrity:** The primary guarantee of TDX is that the memory used by a TD is encrypted using a key known only to the CPU. If the hypervisor, or an attacker with root access on the host, tries to read the memory of a running chute, they will only see ciphertext. Furthermore, TDX provides memory integrity protection, which prevents attackers from replaying or tampering with the encrypted memory pages. - **Data Isolation:** Because of SEAM and memory encryption, the VMM/hypervisor is removed from the trust boundary. It is treated as untrusted. It can no longer inspect the CPU registers or memory of the TD. This means the host operator, and any malware on the host, is physically prevented by the CPU from seeing a user's data-in-use inside the chute. ## NVIDIA Confidential Computing with Protected PCIe (PPCIE) Modern AI workloads are not confined to the CPU. To provide a true end-to-end TEE, the trust boundary must be extended to the GPU. - **The Problem:** The PCIe bus, which connects the CPU and GPU, is traditionally unencrypted. An attacker with physical access or sufficient host compromise could potentially snoop this bus to intercept data as it travels to and from the GPU. - **The Solution:** We use NVIDIA GPUs (such as the H100) that support Confidential Computing mode with Protected PCIe. In this mode, the GPU and CPU establish a secure, encrypted channel over the PCIe bus. All data and code sent to the GPU for processing are encrypted, protecting them from bus snooping attacks. This ensures that your data remains confidential even as it's being used for high-speed training or inference on the GPU. 
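The `watchtower`'s model-weight verification described in the previous section (commanding a chute to hash a random slice of the weight files) can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the actual `watchtower` code; the function name `weight_slice_hash`, the 4 KiB slice size, and the stand-in weight file are all hypothetical:

```python
import hashlib
import os
import secrets
import tempfile

def weight_slice_hash(path: str, offset: int, length: int) -> str:
    """SHA256 of `length` bytes of a weight file, starting at `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return hashlib.sha256(f.read(length)).hexdigest()

# Stand-in "model weight file" for the demo (1 MiB of random bytes).
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(os.urandom(1 << 20))

# The validator picks a random offset it can also hash against its own copy,
# so a miner cannot precompute answers or serve different weights.
offset = secrets.randbelow((1 << 20) - 4096)
expected = weight_slice_hash(path, offset, 4096)  # validator's own copy
reported = weight_slice_hash(path, offset, 4096)  # miner's response
assert reported == expected  # untampered weights pass the challenge
```

Because the offset is chosen at challenge time, passing consistently requires the miner to keep the full, correct weight files on disk, which is the property the real check enforces.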
## Full System Attestation: Proving Trust Before Execution

Before a TEE-enabled chute is even started, the validator performs a full remote attestation of the `sek8s` environment to prove that it is genuine and untampered.

- **The Measurement (RTMR):** During the boot process of the Trust Domain, the TDX module performs a series of cryptographic measurements. It measures the firmware, the bootloader, the kernel, and other critical software components. These measurements are stored in special CPU registers called **Runtime Measurement Registers (RTMRs)**. Any change to the software, no matter how small, will result in a different RTMR value.
- **The Quote:** The `sek8s` node can request that the TDX module generate a "TD Quote." This is a data structure that is cryptographically signed by a private key fused into the CPU itself. The Quote contains the RTMR values, a nonce provided by the validator (to prevent replay attacks), and other important metadata.
- **The Verification:** The attestation process is as follows:
  1. The validator generates a random nonce and sends it to the miner's `sek8s` node.
  2. The `sek8s` node requests a TD Quote from the CPU, including the nonce. It also gathers an attestation report from the NVIDIA GPU.
  3. The node sends both the CPU's TD Quote and the NVIDIA attestation report to the validator.
  4. The validator first checks the cryptographic signature on the TD Quote using Intel's public keys to confirm it came from a genuine Intel CPU with TDX enabled. It then checks the NVIDIA report.
  5. Finally, it compares the RTMR measurements inside the Quote with a known-good "golden" configuration for `sek8s`.
  6. **Only if every single measurement matches does attestation pass.** This proves, with cryptographic certainty, that the hardware and software stack on the miner's machine is exactly what it is supposed to be.
- **Encrypted and Measured Root Filesystem:** This attestation is tied directly to the filesystem's accessibility.

The root filesystem of the `sek8s` guest environment is encrypted with LUKS. The decryption key is only released by a secure service after a successful attestation. This means the node cannot even boot into a usable state if its underlying software has been modified in any way. Any change to the filesystem would alter the measurements, cause attestation to fail, and prevent the decryption key from being released, rendering the node inoperable. ## `cosign` Image Admission Controller The final link in the chain of trust is ensuring that only authorized code runs within the attested, confidential environment. The `sek8s` Kubernetes API server is configured with a strict admission controller that intercepts all pod creation requests. This controller will only allow a pod to be scheduled if its container image has been cryptographically signed by Chutes' `cosign` key. This connects back to the `chutes-api` `forge`, which signs every image it builds. It makes it impossible to run a malicious or tampered image inside the `sek8s` TEE. ## Hardened Environment & No Backdoors The `sek8s` environment is stripped down to the bare essentials. There are no SSH daemons, remote access tools, or unnecessary services running. Deployment and management are handled exclusively through the locked-down Kubernetes API, which itself is subject to strict authentication and authorization controls. ## The TEE Guarantee When you run a chute in TEE mode, you are not just trusting our software validation stack; you are relying on hardware-enforced cryptographic guarantees from Intel and NVIDIA. The combination of remote attestation, encrypted memory, and a locked-down, measured environment means you can be confident that: 1. **Your code is running on genuine, untampered hardware.** 2. **The software environment is exactly what Chutes has defined, with no modifications.** 3. 
**No one, not even the machine's owner, can access or view your data while it is being processed.** This provides the strongest possible protection against data exfiltration and intellectual property theft, making Chutes a uniquely secure platform for sensitive AI workloads. # 4. Verifiability and Trust The previous sections detailed the "how" of Chutes' security model. This section details the "why," explaining how these features combine to create a platform that is not just secure, but transparently and verifiably so. ## Model and Configuration Transparency A cornerstone of the Chutes platform is eliminating the ambiguity common in other compute networks. When you use a Chutes model, you know *exactly* what you are getting and how it's running. For any public chute, you can visit its page on the `chutes.ai` website and click the "Source" tab to inspect its complete, reproducible configuration. This includes: * **Full Source Code:** The exact Python code for the chute is visible. * **Inference Engine Arguments:** The precise `engine_args` used to launch the inference server (e.g., SGLang) are listed, showing every flag and setting. * **GPU Requirements:** The specific GPU models the chute is designed and validated to run on. * **Hugging Face Model & Revision:** The exact `model_name` and, most importantly, the locked `revision` (commit hash) from Hugging Face are clearly defined. We virtually never use quantized models; if we did, the quantization configuration would also be explicitly defined here. * **Open Source SGLang Fork:** The version of our SGLang fork used is open source and can be inspected on GitHub, and is generally kept in sync with the main upstream `sglang` project. This transparency means there is no "black box" when it comes to the model itself. You can verify the exact, non-quantized, revision-locked model you are paying for before you ever make an API call. 
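To make the revision binding concrete, here is a toy illustration of how a chained per-token proof can cryptographically commit to the exact model name and revision, so that output from a swapped or modified model is detectable from the proofs alone. The real `cllmv` scheme is closed-source; the HMAC-chain construction, the key, the model/revision strings, and the token IDs below are all assumptions for illustration only:

```python
import hashlib
import hmac

def token_proof(key: bytes, model: str, revision: str,
                token_id: int, prev: bytes) -> bytes:
    """Chain a per-token proof that commits to the exact model + revision."""
    msg = f"{model}:{revision}:{token_id}".encode() + prev
    return hmac.new(key, msg, hashlib.sha256).digest()

key = b"session-key-from-attestation-handshake"  # hypothetical shared key
genesis = b"\x00" * 32

# Proof chain for three output tokens from the claimed model/revision.
proof = genesis
for tok in [101, 7592, 2088]:
    proof = token_proof(key, "org/model", "abc123def", tok, proof)

# The same tokens attributed to a different revision yield a different chain,
# so a cheaper or substituted model cannot reproduce the expected proofs.
other = genesis
for tok in [101, 7592, 2088]:
    other = token_proof(key, "org/model", "ffff0000", tok, other)
assert proof != other
```

Because each proof folds in the previous one, changing the model or revision at any point alters every subsequent proof in the stream, which is the property that makes spoofing results from a different model infeasible.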
## The Chutes Difference: A Comparison with Opaque AI Platforms

The verifiability of the Chutes platform stands in stark contrast to the "trust me, bro" model of typical closed, centralized AI platforms. When considering security and integrity, the difference is fundamental.

| Question a Skeptic Would Ask | Typical Opaque Platform (e.g., "ACME LLM, Inc.") | The Chutes Verifiable Answer |
| --- | --- | --- |
| **Which model am I *really* using?** | You are told you're using `ACME-Chat-v3-Turbo`, but you have no way to verify if it's the latest version or an older, cheaper one. | You can see the exact Hugging Face `model_name` and `revision` hash on `chutes.ai` for the specific chute you are using. |
| **Is the model quantized or modified?** | You don't know. They might be using a heavily quantized (e.g., 4-bit) or "lobotomized" version of the model to save on costs, delivering lower quality results. | You can see the exact `engine_args` and source code. Chutes almost never uses quantized models, and if so, it would be explicitly declared. The `watchtower`'s random hash checks of the model files ensure the weights on disk are the ones you expect. |
| **What code is processing my prompt?** | It's a proprietary secret, running in their data center. You are trusting that their internal code has no bugs, no malicious logic, and does what the privacy policy says. | The code for the chute, the `chutes` library, and the `SGLang` fork are all open source. `inspecto` verifies the bytecode at runtime. |
| **How is my data protected while in use?** | You have to trust their internal security practices and their privacy policy. A single rogue employee or host-level vulnerability could expose your data. | **Verifiable hardware isolation.** With `sek8s`, your data is protected by Intel TDX memory encryption and NVIDIA PPCIE. Not even the owner of the machine can see your data in memory. This is a physical guarantee, not a policy promise. |
| **Is my prompt being logged or used for training?** | Their privacy policy says no, but you have no way to prove it. Malicious or accidental logging is a significant risk. | The code is open and auditable. More importantly, `chutes-net-nanny` blocks all outbound network traffic by default, so even if the code *tried* to exfiltrate your data, it would be blocked by a lower-level security layer. |
| **How do I know the environment is secure?** | You don't. You are trusting their infrastructure security, which is completely opaque to you. | **You can verify it yourself, in real time.** You can fetch the hardware attestation quote (TD Quote) and the full software manifest (IMA report) for the node running your workload at any time. |
| **What is the basis of trust?** | Trust in a brand, its marketing, and its legal documents (privacy policy). | **Cryptographic proof.** The entire system is built on the principle of "don't trust, verify," from the hardware up to the application code. |

## Why TEEs Alone Are Not Enough: Chutes' Holistic Security Philosophy

While Trusted Execution Environments (TEEs) provide groundbreaking hardware-level isolation, it is crucial to understand that they are not a silver bullet. Relying solely on TEEs can create a false sense of security, as several attack vectors remain unaddressed. Chutes' approach is built on the understanding that true security requires a holistic, multi-layered strategy that integrates hardware TEEs with robust software validation, continuous monitoring, and radical transparency. Here's why TEEs alone are insufficient and how Chutes addresses these gaps:

- **The Insider Threat: What Good is a Black Box if the Code Inside is Malicious?** A TEE's primary function is to protect a workload from a compromised host. It creates a "black box" where the CPU prevents the host OS from snooping on the code's memory. However, the TEE itself does not know if the code *it is executing* is malicious.
For example, a malicious operator could create a chute that perfectly mimics a legitimate LLM service, but adds one extra line of code: `log_file.write(user_prompt)`. The TEE will dutifully run this code and protect it from the host, but it will also faithfully execute the instruction to log the user's private data. Without a mechanism to verify the integrity of the code *inside* the TEE, the user has no guarantee against this kind of insider attack. - **Chutes' Mitigation:** This is precisely why our software validation stack is not just an add-on, but an essential component of TEE security. A TEE's job is to protect data in use; Chutes' job is to verify the code that uses it. 1. **Verified Code:** Our rigorous `forge` process (`inspecto`, `cfsv`, `trivy`) and `cosign` image signing guarantee that the code running inside the TEE is the exact, untampered code the user expects. The malicious prompt-logging chute would never be deployed because its `inspecto` hash would not match the source-of-truth, and its image would not have a valid signature. 2. **Continuous Checks:** Even if an attacker found a novel way to modify the code *after* launch (a hypothetical scenario, as this is blocked by multiple layers), the `watchtower`'s continuous and random `inspecto` and `cfsv` challenges would immediately detect the modification. 3. **Configurable Egress Control:** As a final defense, the `chutes-net-nanny`, while optional, is typically enabled to block all outbound network traffic, preventing a malicious chute from "phoning home" with stolen data. - **TEEs Are Not Immune to Vulnerabilities:** Hardware is not perfect, and TEE implementations have historically had, and will likely continue to have, their own vulnerabilities and zero-days. Exploiting a TEE vulnerability could potentially allow an attacker to break isolation or extract keys. 
- **Chutes' Mitigation:** Our multi-layered approach means that even if a TEE vulnerability were to be discovered, the attacker would still face significant hurdles. The external network lockdown by `chutes-net-nanny` and `sek8s` network policies would prevent command-and-control communication or data exfiltration. The continuous `cfsv` and `inspecto` checks would detect tampering. The IMA manifests provide a real-time audit trail. These redundant layers reduce the blast radius of any single point of failure. - **Lack of Visibility and Trust:** While TEEs provide a "black box," this can ironically lead to a lack of verifiable trust for external observers. How can a user be sure that the code inside the black box is indeed what it claims to be, or that the attestation process itself isn't being spoofed? - **Chutes' Mitigation:** Our commitment to "Radical Verifiability" addresses this head-on. By providing real-time, public access to hardware attestation reports (TD Quotes, NVIDIA attestations) and full software manifests (IMA), Chutes enables any third-party observer to independently verify the integrity of the environment and the running code. This transparency transforms the "black box" into a cryptographically transparent, auditable compute environment. In summary, while Intel TDX and NVIDIA PPCIE provide essential hardware roots of trust, Chutes understands that a truly secure confidential computing platform must go further. By combining these advanced hardware technologies with a comprehensive, open-source-auditable software stack and a commitment to radical verifiability, Chutes delivers a level of integrity and confidence that far exceeds what TEEs alone can offer. ## Openness and Radical Verifiability A core tenet of the Chutes security model is that you should not have to trust us blindly. 
We believe that **verifiability means nothing unless you have something to verify against.** Cryptographic reports are only meaningful if you can compare them to a known-good, publicly auditable baseline. - **Open Source as the Foundation of Trust:** The core logic for the validator (`chutes-api`), the miner deployment engine (`chutes-miner`), the client library (`chutes`), and the entire TEE environment (`sek8s`) are publicly available on GitHub. This is not just a philosophical choice; it is a security necessity. Our open-source repositories define the "golden state"—the exact configuration, software components, and measurements that our attestation reports should reflect. Without this public baseline, our claims of verifiability would be empty. - **Real-time, Public Attestation:** We are building on this foundation to provide radical transparency. For any chute running on the network, at any time, anyone will be able to query: 1. **The Full Attestation Report:** You can request the latest TD Quote and NVIDIA attestation report directly from the node the chute is running on. You can then independently verify the hardware signatures and, most importantly, compare the software measurements (RTMRs) against the configuration defined in the open-source `sek8s` repository. 2. **The Full Software Manifest:** We use the **Integrity Measurement Architecture (IMA)** of the Linux kernel to generate a signed manifest of every single file, library, and package on the filesystem. This manifest is included in the attestation report's measurements. You can fetch this manifest and compare it against the public `sek8s` build to prove that not a single file has been added, removed, or altered. This ability for any third party to independently and cryptographically verify the integrity of any node on the network against a public, open-source codebase is the ultimate expression of our "don't trust, verify" principle. 
It provides a level of provable security that is unparalleled in public compute platforms.

# 5. Attack Vectors and Mitigations

To make the security guarantees of the Chutes platform more concrete, this section enumerates common attack vectors and details how they are mitigated by the platform's security layers.

| Attack Vector | Description | Standard Mitigation (All Chutes) | TEE (`sek8s`) Mitigation (Enhanced Protection) |
| --- | --- | --- | --- |
| **Code Tampering** | A malicious miner modifies the chute's source code to steal data, alter results, or introduce a backdoor. | **`inspecto`:** At startup, generates a hash of all Python bytecode, which is validated against a source-of-truth hash from the image build. Any modification is immediately detected. | **`cosign` Admission Controller:** The Kubernetes API server flatly refuses to run any image that does not have a valid cryptographic signature from the Chutes build system (`forge`).<br><br>**Immutable Filesystem:** The container's root filesystem is read-only. |
| **Filesystem Tampering** | The miner modifies system libraries, Python packages, or other files within the container to compromise the environment. | **`cfsv`:** At startup and on demand, performs a challenge-response protocol to verify the integrity of the entire filesystem against a source-of-truth index created at build time. | **Measured & Encrypted Root FS:** The entire host filesystem for the confidential VM is measured at boot and encrypted. Attestation will fail if a single byte is changed, and the disk decryption key will not be released, rendering the node inert. |
| **Model Substitution / Weight Tampering** | A miner uses a cheaper, quantized, or "lobotomized" model while claiming to run the full-precision version specified by the user. | **`watchtower`:** Can issue a random challenge at any time, requiring the miner to hash a specific slice of the model weight files on disk. This defeats "bait-and-switch" attacks.<br><br>**`cllmv`:** Cryptographically binds the model name and revision hash to the per-token output proofs. | These software-level checks are the primary defense and are augmented by the TEE's general isolation, which prevents an attacker from tampering with the validation tools themselves. |
| **Data-in-Transit Interception** | An attacker on the same network as the miner or validator attempts to read or modify API requests. | **End-to-End Encryption:** All communication between the validator and the miner is encrypted using a symmetric AES-256 key negotiated during the `graval-priv` hardware attestation handshake. | **Hardware-Enforced Isolation:** The TLS session for communication is terminated *inside* the confidential VM. An attacker on the host cannot intercept the unencrypted traffic, as the host OS has no access to the TD's memory or network stack. |
| **Data-in-Use / Memory Snooping** | The miner (or an attacker who has compromised the host OS) attempts to read the memory of the running chute to steal user data, prompts, or model weights. | **Process Isolation:** Standard OS-level process isolation is used. This does not protect against a root-level attacker on the host. | **Intel® TDX Memory Encryption:** The entire memory space of the confidential VM is encrypted by the CPU. The host OS and hypervisor see only ciphertext. It is physically impossible for the host to read the chute's memory. |
| **GPU Bus Snooping** | An attacker with physical access or a high-level host compromise uses specialized tools to read data as it travels over the PCIe bus between the CPU and the GPU. | (No specific mitigation for this advanced attack.) | **NVIDIA Protected PCIe (PPCIE):** The link between the CPU and GPU is fully encrypted. All data and models sent to GPU VRAM are protected from snooping attacks on the PCIe bus. |
| **Pod Breakout / Host Compromise** | A process inside the chute container attempts to escape its container and gain access to the host operating system. | **`chutes-net-nanny`:** Intercepts system calls and intentionally segfaults any process that attempts to `exec` into the pod, attach a debugger, or otherwise interact with processes outside its own tree.<br><br>**Restrictive K8s Config:** Pods run with a restrictive `securityContext`, as non-root users, and with privilege escalation disabled. | **Hypervisor Isolation:** The chute runs inside a completely separate, hardware-isolated confidential VM (the Trust Domain). A pod breakout would only grant access to the inside of the TD, which has no access to the host system or other TDs. |
| **Malicious Network Activity / Data Exfiltration** | A compromised or malicious chute attempts to send user data to an attacker-controlled server on the internet. | **`chutes-net-nanny`:** By default, all outbound network traffic is blocked, except to the Chutes validator proxy. Egress can only be enabled on a per-chute basis by the chute's owner. | **`sek8s` Network Policies:** In addition to `net-nanny`, `sek8s` enforces strict, default-deny Kubernetes network policies at the infrastructure level, providing a second layer of egress control. |
| **Attestation Forgery / Impersonation** | A malicious miner tries to fake its hardware or software environment to trick the validator into accepting it. | **`graval-priv`:** Uses a GPU-specific, hardware-based challenge-response mechanism that is difficult to fake without access to the specific GPU hardware.<br><br>**Continuous Monitoring:** The `watchtower` performs random, on-demand checks. | **Hardware-Signed Quotes:** Attestation is not a software proof; it is a cryptographic report (TD Quote) signed by a private key fused into the CPU hardware. This signature is verifiable and cannot be forged, and a nonce from the validator prevents replay attacks. |
| **GPU Fraud / Misrepresentation** | A miner claims to have a powerful, expensive GPU (e.g., an H100) to attract high-value workloads but actually runs the computation on a cheaper, slower GPU (e.g., a T4). | **`graval-priv`:** This non-TEE attestation serves as a hardware benchmark. The "Proof of Consecutive VRAM Work" (consecutive matrix multiplications) cryptographically proves the GPU's actual processing speed and VRAM capacity. The time it takes to return the proof is a key part of the validation, making it impossible for a slow GPU to fake the performance of a fast one. | **NVIDIA Hardware Attestation:** In a TEE, this is augmented by a signed attestation report from the GPU itself. This report, verifiable by the validator, contains the true identity of the GPU (e.g., "NVIDIA H100"), providing a second, hardware-rooted proof that prevents misrepresentation. |
| **Rollback Attacks** | An attacker tries to force a miner to run an old, known-vulnerable version of a chute image. | **Validator State:** The validator (`chutes-api`) is the source of truth for which chute versions are valid. `gepetto` will refuse to run any version not explicitly approved by the validator. | **`cosign` Verification:** The admission controller verifies the image signature against the latest trusted keys. An older image might still be signed, but it can be blocked by other policy if vulnerabilities are discovered. |

# 6.
Case Study: End-to-End SGLang LLM Request in a TEE

To demonstrate how these layers work together, let's walk through the entire lifecycle of a request to a Large Language Model (LLM) running in an SGLang chute, deployed inside a `sek8s` TEE.

### Stage 0: Pre-Flight Verification

A skeptical user, before spending any money, visits the chute's public page on `chutes.ai`. In the "Source" tab, they verify the *exact* configuration: the Hugging Face model (`meta-llama/Llama-2-70b-chat-hf`) and revision (`a1b2c3d...`), the precise `engine_args` used to launch SGLang, the lack of any quantization flags, and the open-source chute code itself. This provides a verifiable baseline for what to expect.

### Stage 1: The Build - Creating Verifiable Truth

1. **Image Creation:** The `chutes-api` `forge` service picks up the chute definition. It builds a new container image.
2. **Baseline Hashes:** During the build, `cfsv` and `inspecto` are run inside the container to generate the "source-of-truth" hashes for the filesystem and Python bytecode.
3. **Signing:** The final image is pushed to the registry and cryptographically signed with `cosign`.

### Stage 2: The Deployment - Attestation Before Execution

1. **Deployment Request:** The user decides to run the chute. `gepetto` identifies a TEE-capable server running `sek8s`.
2. **Hardware Attestation:** Before deploying, the `chutes-api` validator initiates remote attestation. The `sek8s` node returns a TD Quote signed by the CPU's hardware key and an NVIDIA GPU report.
3. **Verification:** The validator verifies the signatures and compares the measurements in the quotes to the "golden" `sek8s` configuration. Attestation passes, proving the node is genuine and untampered.
4. **Launch Authorization:** `gepetto` receives a single-use JWT launch token from the validator.
5. **Kubernetes Deployment:** `gepetto` creates the Kubernetes `Job` object. The `sek8s` admission controller verifies the `cosign` signature on the image and allows the pod to be scheduled.

### Stage 3: The Launch - A Chained Sequence of Checks

1. **Secure Startup:** The pod starts, and the `chutes/entrypoint/run.py` script executes.
2. **Validation Handshake:** The entrypoint uses its JWT to open a dialogue with the validator, sending its `cfsv` and `inspecto` hashes. The validator confirms they match the build-time hashes.
3. **Symmetric Key:** With all checks passed, the validator sends the ephemeral AES symmetric key to the chute. The `GraValMiddleware` is now active.
4. **SGLang Initialization:** The chute's `@chute.on_startup()` function is called. The script downloads the specific, revision-locked Llama-2-70B model and starts the `sglang.launch_server` process. Importantly, the `sglang` server is launched with a password and strictly binds *only* to the loopback interface (`127.0.0.1`). This means no external process can directly connect to the SGLang server; all communication must be securely routed through the Chutes library's proxy.
5. **Activation & Lockdown:** The SGLang server is ready. The entrypoint calls the `activation_url`, and `netnanny` permanently disables external network access (if configured).

### Stage 4: The Inference Request & Continuous Verification

1. **User Request:** A user sends a prompt: `POST /v1/chat/completions`.
2. **Encrypted Forward & Decryption:** The request is encrypted, sent to the miner, and decrypted *inside* the Intel TDX Trust Domain. The host OS sees only ciphertext.
3. **Secure Inference:** The prompt is processed by the LLM on the GPU. The data is protected by TDX on the CPU and by NVIDIA PPCIE on the PCIe bus.
4. **Runtime Check (Optional):** At this very moment, the `watchtower` could issue a random challenge, demanding the chute hash a slice of the Llama-2 model weights on disk to prove they haven't been swapped post-launch.
5. **Verified Output:** As the LLM generates tokens, `cllmv` generates verification hashes for the output, cryptographically binding the response to the `meta-llama/Llama-2-70b-chat-hf` model and revision `a1b2c3d...` that the user originally inspected.
6. **Encrypted Response:** The final response is encrypted by the `GraValMiddleware` *inside the TD* and sent back to the user.

From start to finish, the user's data has been protected by multiple, overlapping layers of hardware and software security. At the most critical stage—when the data is in use—it is inside a hardware-enforced black box, invisible even to the owner of the machine it's running on.

---

## SOURCE: https://chutes.ai/docs/core-concepts/templates

# Templates

**Templates** in Chutes are pre-built, optimized configurations for common AI workloads. They provide production-ready setups with just a few lines of code, handling complex configurations like Docker images, model loading, API endpoints, and hardware requirements.

## What are Templates?

Templates are factory functions that create complete Chute configurations for specific use cases:

- 🚀 **One-line deployment** of complex AI systems
- 🔧 **Pre-optimized configurations** for performance and cost
- 📦 **Batteries-included** with all necessary dependencies
- 🎯 **Best practices** built-in by default
- 🔄 **Customizable** for specific needs

## Available Templates

### Language Model Templates

#### VLLM Template

High-performance language model serving with OpenAI-compatible API.

```python
from chutes.chute.template.vllm import build_vllm_chute

chute = build_vllm_chute(
    username="myuser",
    model_name="microsoft/DialoGPT-medium",
    revision="main",
    node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=16),
)
```

#### SGLang Template

Structured generation for complex prompting and reasoning.
```python
from chutes.chute.template.sglang import build_sglang_chute

chute = build_sglang_chute(
    username="myuser",
    model_name="microsoft/DialoGPT-medium",
    node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=16),
)
```

### Embedding Templates

#### Text Embeddings Inference (TEI)

Optimized text embedding generation.

```python
from chutes.chute.template.tei import build_tei_chute

chute = build_tei_chute(
    username="myuser",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=8),
)
```

### Image Generation Templates

#### Diffusion Template

Stable Diffusion and other diffusion model serving.

```python
from chutes.chute.template.diffusion import build_diffusion_chute

chute = build_diffusion_chute(
    username="myuser",
    model_name="stabilityai/stable-diffusion-xl-base-1.0",
    node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=12),
)
```

## Template Categories

### 🗣️ Language Models

**Use Cases**: Text generation, chat, completion, code generation

- **VLLM**: Production-scale LLM serving
- **SGLang**: Complex reasoning and structured generation
- **Transformers**: Custom model implementations

### 🔤 Text Processing

**Use Cases**: Embeddings, classification, named entity recognition

- **TEI**: Fast embedding generation
- **Sentence Transformers**: Semantic similarity
- **BERT**: Classification and encoding

### 🎨 Image Generation

**Use Cases**: Image synthesis, editing, style transfer

- **Diffusion**: Stable Diffusion variants
- **GAN**: Generative adversarial networks
- **ControlNet**: Controlled image generation

### 🎵 Audio Processing

**Use Cases**: Speech recognition, text-to-speech, music generation

- **Whisper**: Speech-to-text
- **TTS**: Text-to-speech synthesis
- **MusicGen**: Music generation

### 🎬 Video Processing

**Use Cases**: Video analysis, generation, editing

- **Video Analysis**: Object detection, tracking
- **Video Generation**: Text-to-video models
- **Video Enhancement**: Upscaling, stabilization

## Template Benefits

### 1. **Instant Deployment**

```python
# Without templates (complex setup)
image = (
    Image(username="myuser", name="vllm-app", tag="1.0")
    .from_base("nvidia/cuda:12.1-devel-ubuntu22.04")
    .with_python("3.11")
    .run_command("pip install vllm==0.2.0")
    .run_command("pip install transformers torch")
    # ... 50+ more lines of configuration
)

chute = Chute(
    username="myuser",
    name="llm-service",
    image=image,
    node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=16),
)

@chute.on_startup()
async def load_model(self):
    ...  # complex model loading logic

@chute.cord(public_api_path="/v1/chat/completions")
async def chat_completions(self, request: ChatRequest):
    ...  # OpenAI API compatibility logic

# With templates (one line)
chute = build_vllm_chute(
    username="myuser",
    model_name="microsoft/DialoGPT-medium",
    node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=16),
)
```

### 2. **Production-Ready Defaults**

```python
# Templates include:
# ✅ Optimized Docker images
# ✅ Proper error handling
# ✅ Logging and monitoring
# ✅ Health checks
# ✅ Resource optimization
# ✅ Security best practices
```

### 3. **Hardware Optimization**

```python
# Templates automatically optimize for:
# - GPU memory usage
# - CPU utilization
# - Network throughput
# - Storage requirements
```

## Template Customization

### Basic Customization

```python
from chutes.chute.template.vllm import build_vllm_chute

# Customize standard parameters
chute = build_vllm_chute(
    username="myuser",
    model_name="microsoft/DialoGPT-medium",
    revision="main",
    node_selector=NodeSelector(
        gpu_count=2,
        min_vram_gb_per_gpu=24,
    ),
    concurrency=8,
    tagline="Custom LLM API",
    readme="# My Custom LLM\nPowered by VLLM",
)
```

### Advanced Customization

```python
# Custom engine arguments
chute = build_vllm_chute(
    username="myuser",
    model_name="microsoft/DialoGPT-medium",
    engine_args={
        "max_model_len": 4096,
        "gpu_memory_utilization": 0.9,
        "max_num_seqs": 32,
        "temperature": 0.7,
    },
)

# Custom Docker image
custom_image = (
    Image(username="myuser", name="custom-vllm", tag="1.0")
    .from_base("nvidia/cuda:12.1-devel-ubuntu22.04")
    .with_python("3.11")
    .run_command("pip install vllm==0.2.0")
    .run_command("pip install my-custom-package")
)

chute = build_vllm_chute(
    username="myuser",
    model_name="microsoft/DialoGPT-medium",
    image=custom_image,
)
```

### Template Extension

```python
# Extend a template with custom functionality
base_chute = build_vllm_chute(
    username="myuser",
    model_name="microsoft/DialoGPT-medium",
    node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=16),
)

# Add custom endpoints
@base_chute.cord(public_api_path="/custom/analyze")
async def analyze_text(self, text: str) -> dict:
    # Custom text analysis logic
    return {"analysis": "custom_result"}

# Add custom startup logic
@base_chute.on_startup()
async def custom_initialization(self):
    # Additional setup
    self.custom_processor = CustomProcessor()
```

## Template Parameters

### Common Parameters

All templates support these standard parameters:

```python
def build_template_chute(
    username: str,                       # Required: Your Chutes username
    model_name: str,                     # Required: HuggingFace model name
    revision: str = "main",              # Git revision/branch
    node_selector: NodeSelector = None,  # Hardware requirements
    image: str | Image = None,           # Custom Docker image
    tagline: str = "",                   # Short description
    readme: str = "",                    # Markdown documentation
    concurrency: int = 1,                # Concurrent requests per instance
    **kwargs                             # Template-specific options
)
```

### Template-Specific Parameters

#### VLLM Template

```python
build_vllm_chute(
    # Standard parameters...
    engine_args: dict = None,              # VLLM engine configuration
    trust_remote_code: bool = False,       # Allow remote code execution
    max_model_len: int = None,             # Maximum sequence length
    gpu_memory_utilization: float = 0.85,  # GPU memory usage
    max_num_seqs: int = 128                # Maximum concurrent sequences
)
```

#### Diffusion Template

```python
build_diffusion_chute(
    # Standard parameters...
    pipeline_type: str = "text2img",  # Pipeline type
    scheduler: str = "euler",         # Diffusion scheduler
    safety_checker: bool = True,      # Content safety
    guidance_scale: float = 7.5,      # CFG scale
    num_inference_steps: int = 50     # Generation steps
)
```

#### TEI Template

```python
build_tei_chute(
    # Standard parameters...
    pooling: str = "mean",   # Pooling strategy
    normalize: bool = True,  # Normalize embeddings
    batch_size: int = 32,    # Inference batch size
    max_length: int = 512    # Maximum input length
)
```

## Template Comparison

### Language Model Templates

| Template | Best For | Performance | Memory | API |
| --- | --- | --- | --- | --- |
| VLLM | Production LLM serving | Highest | Optimized | OpenAI-compatible |
| SGLang | Complex reasoning | High | Standard | Custom structured |
| Transformers | Custom implementations | Medium | High | Flexible |

### Image Templates

| Template | Best For | Speed | Quality | Customization |
| --- | --- | --- | --- | --- |
| Diffusion | General image generation | Fast | High | Extensive |
| Stable Diffusion XL | High-resolution images | Medium | Highest | Good |
| ControlNet | Controlled generation | Medium | High | Specialized |

## Creating Custom Templates

### Simple Template Function

```python
def build_custom_nlp_chute(
    username: str,
    model_name: str,
    node_selector: NodeSelector,
    task_type: str = "classification",
) -> Chute:
    """Custom NLP template for classification and NER."""
    # Create custom image
    image = (
        Image(username=username, name="custom-nlp", tag="1.0")
        .from_base("nvidia/cuda:12.1-runtime-ubuntu22.04")
        .with_python("3.11")
        .run_command("pip install transformers torch scikit-learn")
    )

    # Create chute
    chute = Chute(
        username=username,
        name=f"nlp-{task_type}",
        image=image,
        node_selector=node_selector,
        tagline=f"Custom {task_type} service",
    )

    # Add model loading
    @chute.on_startup()
    async def load_model(self):
        from transformers import AutoTokenizer, AutoModelForSequenceClassification
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)

    # Add API endpoint
    @chute.cord(public_api_path=f"/{task_type}")
    async def classify(self, text: str) -> dict:
        inputs = self.tokenizer(text, return_tensors="pt")
        outputs = self.model(**inputs)
        predictions = outputs.logits.softmax(dim=-1)
        return {"predictions": predictions.tolist()}

    return chute

# Use the custom template
custom_chute = build_custom_nlp_chute(
    username="myuser",
    model_name="distilbert-base-uncased-finetuned-sst-2-english",
    node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=8),
    task_type="sentiment",
)
```

### Advanced Template with Configuration

```python
from dataclasses import dataclass

@dataclass
class CustomNLPConfig:
    batch_size: int = 32
    max_length: int = 512
    use_gpu: bool = True
    cache_size: int = 1000

def build_advanced_nlp_chute(
    username: str,
    model_name: str,
    node_selector: NodeSelector,
    config: CustomNLPConfig = None,
) -> Chute:
    """Advanced NLP template with configuration."""
    if config is None:
        config = CustomNLPConfig()

    # Build image with config-specific optimizations
    image = (
        Image(username=username, name="advanced-nlp", tag="1.0")
        .from_base("nvidia/cuda:12.1-runtime-ubuntu22.04")
        .with_python("3.11")
        .run_command("pip install transformers torch accelerate")
    )
    if config.use_gpu:
        image = image.with_env("CUDA_VISIBLE_DEVICES", "0")

    chute = Chute(
        username=username,
        name="advanced-nlp",
        image=image,
        node_selector=node_selector,
    )

    @chute.on_startup()
    async def setup(self):
        # Initialize with configuration
        self.config = config
        self.cache = {}  # Simple caching

        # Load model
        from transformers import AutoTokenizer, AutoModel
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        if config.use_gpu:
            self.model = self.model.cuda()

    @chute.cord(public_api_path="/process")
    async def process_text(self, texts: list[str]) -> dict:
        import torch

        # Batch processing with configuration
        results = []
        for i in range(0, len(texts), self.config.batch_size):
            batch = texts[i:i + self.config.batch_size]

            # Only embed texts that are not already cached
            computed = {}
            new_texts = [text for text in batch if text not in self.cache]
            if new_texts:
                inputs = self.tokenizer(
                    new_texts,
                    return_tensors="pt",
                    padding=True,
                    truncation=True,
                    max_length=self.config.max_length,
                )
                if self.config.use_gpu:
                    inputs = {k: v.cuda() for k, v in inputs.items()}
                with torch.no_grad():
                    outputs = self.model(**inputs)

                # Cache results, bounded by cache_size
                for text, output in zip(new_texts, outputs.last_hidden_state):
                    result = output.mean(dim=0).cpu().tolist()
                    computed[text] = result
                    if len(self.cache) < self.config.cache_size:
                        self.cache[text] = result

            # Preserve the original input order
            results.extend(
                self.cache[text] if text in self.cache else computed[text]
                for text in batch
            )

        return {"embeddings": results, "count": len(results)}

    return chute
```

## Template Best Practices

### 1. **Use Appropriate Templates**

```python
# For LLM inference
vllm_chute = build_vllm_chute(...)

# For embedding generation
tei_chute = build_tei_chute(...)

# For image generation
diffusion_chute = build_diffusion_chute(...)
```

### 2. **Customize Hardware Requirements**

```python
# Small models
small_selector = NodeSelector(
    gpu_count=1,
    min_vram_gb_per_gpu=8,
)

# Large models
large_selector = NodeSelector(
    gpu_count=2,
    min_vram_gb_per_gpu=40,
)
```

### 3. **Version Control Your Models**

```python
# Always specify revision
chute = build_vllm_chute(
    username="myuser",
    model_name="microsoft/DialoGPT-medium",
    revision="main",  # or a specific commit hash
)
```

### 4. **Document Your Deployments**

```python
chute = build_vllm_chute(
    username="myuser",
    model_name="microsoft/DialoGPT-medium",
    tagline="Customer service chatbot",
    readme="""
# Customer Service Bot

This chute provides automated customer service responses using DialoGPT-medium.

## Usage

Send POST requests to `/v1/chat/completions`
""",
)
```

## Next Steps

- **[VLLM Template](/docs/templates/vllm)** - Detailed VLLM documentation
- **[Diffusion Template](/docs/templates/diffusion)** - Image generation guide
- **[TEI Template](/docs/templates/tei)** - Text embeddings guide
- **[Custom Templates Guide](/docs/guides/custom-templates)** - Build your own templates

---

## SOURCE: https://chutes.ai/docs/sdk-reference/chute

# Chute API Reference

The `Chute` class is the core component of the Chutes framework, representing a deployable AI application unit. It extends FastAPI, so you can use all FastAPI features. This reference covers all methods, properties, and configuration options.

## Class Definition

```python
from chutes.chute import Chute

chute = Chute(
    username: str,
    name: str,
    image: str | Image,
    tagline: str = "",
    readme: str = "",
    standard_template: str = None,
    revision: str = None,
    node_selector: NodeSelector = None,
    concurrency: int = 1,
    max_instances: int = 1,
    shutdown_after_seconds: int = 300,
    scaling_threshold: float = 0.75,
    allow_external_egress: bool = False,
    encrypted_fs: bool = False,
    passthrough_headers: dict = {},
    tee: bool = False,
    **kwargs
)
```

## Constructor Parameters

### Required Parameters

#### `username: str`

The username or organization name for the chute deployment.

**Example:**

```python
chute = Chute(username="mycompany", name="ai-service", image="parachutes/python:3.12")
```

#### `name: str`

The name of the chute application.

**Example:**

```python
chute = Chute(username="mycompany", name="text-generator", image="parachutes/python:3.12")
```

#### `image: str | Image`

Docker image for the chute runtime environment (required).
**Example:**

```python
# Using a string reference to a pre-built image
chute = Chute(
    username="mycompany",
    name="text-generator",
    image="parachutes/python:3.12",
)

# Using a custom Image object
from chutes.image import Image

custom_image = Image(username="mycompany", name="custom-ai", tag="1.0")
chute = Chute(
    username="mycompany",
    name="text-generator",
    image=custom_image,
)
```

### Optional Parameters

#### `tagline: str = ""`

A brief description of what the chute does.

**Example:**

```python
chute = Chute(
    username="mycompany",
    name="text-generator",
    image="parachutes/python:3.12",
    tagline="Advanced text generation with GPT models",
)
```

#### `readme: str = ""`

Detailed documentation for the chute in Markdown format.

**Example:**

```python
readme = """
# Text Generation API

This chute provides advanced text generation capabilities.

## Features
- Multiple model support
- Customizable parameters
- Real-time streaming
"""

chute = Chute(
    username="mycompany",
    name="text-generator",
    image="parachutes/python:3.12",
    readme=readme,
)
```

#### `standard_template: str = None`

Reference to a standard template (e.g., "vllm", "sglang", "diffusion").

#### `revision: str = None`

Specific revision or version identifier for the chute.

#### `node_selector: NodeSelector = None`

Hardware requirements and preferences for the chute.

**Example:**

```python
from chutes.chute import NodeSelector

node_selector = NodeSelector(
    gpu_count=2,
    min_vram_gb_per_gpu=24,
    include=["h100", "a100"],
    exclude=["t4"],
)

chute = Chute(
    username="mycompany",
    name="text-generator",
    image="parachutes/python:3.12",
    node_selector=node_selector,
)
```

#### `concurrency: int = 1`

Maximum number of concurrent requests the chute can handle per instance.
**Example:**

```python
chute = Chute(
    username="mycompany",
    name="text-generator",
    image="parachutes/python:3.12",
    concurrency=8,  # Handle up to 8 concurrent requests
)
```

**Guidelines:**

- For vLLM/SGLang with continuous batching: 64-256
- For single-request models (diffusion): 1
- For models with some parallelism: 4-16

#### `max_instances: int = 1`

Maximum number of instances that can be scaled up.

**Example:**

```python
chute = Chute(
    username="mycompany",
    name="text-generator",
    image="parachutes/python:3.12",
    max_instances=10,  # Scale up to 10 instances
)
```

#### `shutdown_after_seconds: int = 300`

Time in seconds to wait before shutting down an idle instance. Default is 5 minutes.

**Example:**

```python
chute = Chute(
    username="mycompany",
    name="text-generator",
    image="parachutes/python:3.12",
    shutdown_after_seconds=600,  # Shut down after 10 minutes idle
)
```

#### `scaling_threshold: float = 0.75`

Utilization threshold at which to trigger scaling (0.0 to 1.0).

#### `allow_external_egress: bool = False`

Whether to allow external network connections after startup.

**Important:** By default, external network access is blocked after initialization. Set to `True` if your chute needs to fetch external resources at runtime (e.g., image URLs for vision models).

**Example:**

```python
# For vision language models that need to fetch images
chute = Chute(
    username="mycompany",
    name="vision-model",
    image="parachutes/python:3.12",
    allow_external_egress=True,
)
```

#### `encrypted_fs: bool = False`

Whether to use an encrypted filesystem for the chute.

#### `passthrough_headers: dict = {}`

Headers to pass through to passthrough cord endpoints.

#### `tee: bool = False`

Whether this chute runs in a Trusted Execution Environment.

#### `**kwargs`

Additional keyword arguments passed to the underlying FastAPI application.

## Decorators

### Lifecycle Decorators

#### `@chute.on_startup(priority: int = 50)`

Decorator for functions to run during chute startup.
**Signature:**

```python
@chute.on_startup(priority: int = 50)
async def initialization_function(self) -> None:
    """Function to run on startup."""
    pass
```

**Parameters:**

- `priority`: Execution order (lower values execute first, default=50)
  - 0-20: Early initialization
  - 30-70: Normal operations
  - 80-100: Late initialization

**Example:**

```python
@chute.on_startup(priority=10)  # Runs early
async def load_model(self):
    """Load the AI model during startup."""
    from transformers import AutoTokenizer, AutoModelForCausalLM
    self.tokenizer = AutoTokenizer.from_pretrained("gpt2")
    self.model = AutoModelForCausalLM.from_pretrained("gpt2")
    print("Model loaded successfully")

@chute.on_startup(priority=90)  # Runs late
async def log_startup(self):
    print("All initialization complete")
```

**Use Cases:**

- Load AI models
- Initialize databases
- Set up caches
- Configure services

#### `@chute.on_shutdown(priority: int = 50)`

Decorator for functions to run during chute shutdown.

**Signature:**

```python
@chute.on_shutdown(priority: int = 50)
async def cleanup_function(self) -> None:
    """Function to run on shutdown."""
    pass
```

**Example:**

```python
@chute.on_shutdown(priority=10)
async def cleanup_resources(self):
    """Clean up resources during shutdown."""
    if hasattr(self, 'model'):
        del self.model
    print("Resources cleaned up")
```

### API Endpoint Decorator

#### `@chute.cord()`

Decorator to create HTTP API endpoints. See [Cord Decorator Reference](/docs/sdk-reference/cord) for detailed documentation.

**Basic Example:**

```python
@chute.cord(public_api_path="/generate", public_api_method="POST")
async def generate_text(self, prompt: str) -> str:
    """Generate text from a prompt."""
    return await self.model.generate(prompt)
```

### Job Decorator

#### `@chute.job()`

Decorator to create long-running jobs or server rentals. See [Job Decorator Reference](/docs/sdk-reference/job) for detailed documentation.
**Basic Example:**

```python
from chutes.chute.job import Port

@chute.job(ports=[Port(name="web", port=8080, proto="http")], timeout=3600)
async def training_job(self, **job_data):
    """Long-running training job."""
    output_dir = job_data["output_dir"]
    # Perform training...
    return {"status": "completed"}
```

## Properties

### `chute.name`

The name of the chute.

**Type:** `str`

### `chute.uid`

The unique identifier for the chute.

**Type:** `str`

### `chute.readme`

The readme/documentation for the chute.

**Type:** `str`

### `chute.tagline`

The tagline for the chute.

**Type:** `str`

### `chute.image`

The image configuration for the chute.

**Type:** `str | Image`

### `chute.node_selector`

The hardware requirements for the chute.

**Type:** `NodeSelector | None`

### `chute.standard_template`

The standard template name if using a template.

**Type:** `str | None`

### `chute.cords`

List of cord endpoints registered with the chute.

**Type:** `list[Cord]`

### `chute.jobs`

List of jobs registered with the chute.

**Type:** `list[Job]`

## Methods

### `async chute.initialize()`

Initialize the chute by running all startup hooks. Called automatically when the chute starts in remote context.
```python await chute.initialize() ``` ## FastAPI Integration Since `Chute` extends `FastAPI`, you can use all FastAPI features directly: ### Adding Middleware ```python from fastapi.middleware.cors import CORSMiddleware @chute.on_startup() async def setup_middleware(self): self.add_middleware( CORSMiddleware, allow_origins=["*"], allow_methods=["*"], allow_headers=["*"] ) ``` ### Adding Custom Routes ```python @chute.on_startup() async def add_custom_routes(self): @self.get("/custom") async def custom_endpoint(): return {"message": "Custom endpoint"} ``` ### Using Dependencies ```python from fastapi import Depends, HTTPException async def verify_token(token: str): if token != "secret": raise HTTPException(401, "Invalid token") return token @chute.cord(public_api_path="/protected") async def protected_endpoint(self, token: str = Depends(verify_token)): return {"message": "Protected content"} ``` ## Complete Example ```python from chutes.chute import Chute, NodeSelector from chutes.image import Image from pydantic import BaseModel, Field # Define custom image image = ( Image(username="myuser", name="my-chute", tag="1.0") .from_base("parachutes/python:3.12") .run_command("pip install transformers torch") ) # Define input/output schemas class GenerationInput(BaseModel): prompt: str = Field(..., description="Input prompt") max_tokens: int = Field(100, ge=1, le=1000) class GenerationOutput(BaseModel): text: str tokens_used: int # Create chute chute = Chute( username="myuser", name="text-generator", tagline="Generate text with transformers", readme="## Text Generator\n\nGenerates text from prompts.", image=image, node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16 ), concurrency=4, max_instances=3, shutdown_after_seconds=300, allow_external_egress=False ) @chute.on_startup() async def load_model(self): """Load model during startup.""" from transformers import pipeline self.generator = pipeline("text-generation", model="gpt2", device=0) print("Model loaded!") 
@chute.cord( public_api_path="/generate", public_api_method="POST", minimal_input_schema=GenerationInput ) async def generate(self, input_data: GenerationInput) -> GenerationOutput: """Generate text from a prompt.""" result = self.generator( input_data.prompt, max_length=input_data.max_tokens )[0]["generated_text"] return GenerationOutput( text=result, tokens_used=len(result.split()) ) @chute.cord(public_api_path="/health", public_api_method="GET") async def health(self) -> dict: """Health check endpoint.""" return { "status": "healthy", "model_loaded": hasattr(self, "generator") } ``` ## Best Practices ### 1. Use Appropriate Concurrency ```python # For LLMs with continuous batching chute = Chute(..., concurrency=64) # For single-request models chute = Chute(..., concurrency=1) ``` ### 2. Set Reasonable Shutdown Timers ```python # Development - short timeout chute = Chute(..., shutdown_after_seconds=60) # Production - longer timeout to avoid cold starts chute = Chute(..., shutdown_after_seconds=300) ``` ### 3. Use Type Hints and Schemas ```python from pydantic import BaseModel class MyInput(BaseModel): text: str @chute.cord( public_api_path="/process", minimal_input_schema=MyInput ) async def process(self, data: MyInput) -> dict: return {"result": data.text.upper()} ``` ### 4. 
Handle Errors Gracefully ```python from fastapi import HTTPException @chute.cord(public_api_path="/generate") async def generate(self, prompt: str): if not prompt.strip(): raise HTTPException(400, "Prompt cannot be empty") try: return await self.model.generate(prompt) except Exception as e: raise HTTPException(500, f"Generation failed: {e}") ``` ## See Also - **[Cord Decorator](/docs/sdk-reference/cord)** - Detailed cord documentation - **[Job Decorator](/docs/sdk-reference/job)** - Job and server rental documentation - **[Image Class](/docs/sdk-reference/image)** - Custom image building - **[NodeSelector](/docs/sdk-reference/node-selector)** - Hardware requirements - **[Templates](/docs/sdk-reference/templates)** - Pre-built templates --- ## SOURCE: https://chutes.ai/docs/sdk-reference/cord # Cord Decorator API Reference The `@chute.cord()` decorator is used to create HTTP API endpoints in Chutes applications. Cords are the primary way to expose functionality from your chute. This reference covers all parameters, patterns, and best practices. ## Decorator Signature ```python @chute.cord( path: str = None, passthrough_path: str = None, passthrough: bool = False, passthrough_port: int = None, public_api_path: str = None, public_api_method: str = "POST", stream: bool = False, provision_timeout: int = 180, input_schema: Optional[Any] = None, minimal_input_schema: Optional[Any] = None, output_content_type: Optional[str] = None, output_schema: Optional[Dict] = None, **session_kwargs ) ``` ## Parameters ### `public_api_path: str` The URL path where the endpoint will be accessible via the public API. 
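The allowed format (see the rules that follow) reduces to a single regex, so a deployment script can validate paths before registering a chute. A minimal checker — a sketch, not part of the SDK; note the pattern covers literal lowercase segments only, not `{parameter}` placeholders:

```python
import re

# Pattern for public_api_path values, taken from this reference.
PUBLIC_API_PATH_RE = re.compile(r"^(/[a-z0-9_]+[a-z0-9-_]*)+$")

def is_valid_public_api_path(path: str) -> bool:
    """True if `path` satisfies the literal-path format rules."""
    return bool(PUBLIC_API_PATH_RE.match(path))
```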
**Format Rules:**
- Must start with `/`
- Literal segments must match the pattern `^(/[a-z0-9_]+[a-z0-9-_]*)+$`
- Can include path parameters with `{parameter_name}` syntax
- Must be lowercase (the pattern rejects uppercase characters)

**Examples:**
```python
# Simple path
@chute.cord(public_api_path="/generate")

# Path with parameter
@chute.cord(public_api_path="/users/{user_id}")

# Nested resource
@chute.cord(public_api_path="/models/{model_id}/generate")
```

### `public_api_method: str = "POST"`

The HTTP method for the public API endpoint.

**Supported Methods:**
- `GET` - Retrieve data
- `POST` - Create or process data (default)
- `PUT` - Update existing data
- `DELETE` - Remove data
- `PATCH` - Partial updates

**Examples:**
```python
# GET for data retrieval
@chute.cord(public_api_path="/models", public_api_method="GET")
async def list_models(self):
    return {"models": ["gpt-3.5", "gpt-4"]}

# POST for data processing (default)
@chute.cord(public_api_path="/generate", public_api_method="POST")
async def generate_text(self, prompt: str):
    return await self.model.generate(prompt)

# DELETE for removal
@chute.cord(public_api_path="/cache", public_api_method="DELETE")
async def clear_cache(self):
    self.cache.clear()
    return {"status": "cache cleared"}
```

### `path: str = None`

Internal path for the endpoint. Defaults to the function name if not specified.

### `stream: bool = False`

Enable streaming responses for real-time data transmission.
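Streamed responses in the examples below are emitted as server-sent events, one `data: {...}` line per chunk. On the client side, each received line can be decoded with a small helper — a sketch, not part of the SDK:

```python
import json
from typing import Optional

def parse_sse_line(line: str) -> Optional[dict]:
    """Decode one 'data: {...}' server-sent-event line; ignore other lines."""
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank keep-alive lines, comments, etc.
    return json.loads(line[len("data:"):].strip())
```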
**When to Use Streaming:** - Long-running text generation - Real-time progress updates - Token-by-token LLM output - Large data processing **Streaming Example:** ```python from fastapi.responses import StreamingResponse import json @chute.cord( public_api_path="/stream_generate", public_api_method="POST", stream=True ) async def stream_text_generation(self, prompt: str): async def generate_stream(): async for token in self.model.stream_generate(prompt): data = {"token": token, "finished": False} yield f"data: {json.dumps(data)}\n\n" # Send completion signal yield f"data: {json.dumps({'token': '', 'finished': True})}\n\n" return StreamingResponse( generate_stream(), media_type="text/event-stream" ) ``` ### `input_schema: Optional[Any] = None` Pydantic model for input validation and documentation. **Benefits:** - Automatic input validation - Auto-generated API documentation - Type safety - Error handling **Example:** ```python from pydantic import BaseModel, Field class TextGenerationInput(BaseModel): prompt: str = Field(..., description="Text prompt for generation") max_tokens: int = Field(100, ge=1, le=2000, description="Maximum tokens") temperature: float = Field(0.7, ge=0.0, le=2.0, description="Sampling temperature") @chute.cord( public_api_path="/generate", public_api_method="POST", input_schema=TextGenerationInput ) async def generate_text(self, input_data: TextGenerationInput): return await self.model.generate( input_data.prompt, max_tokens=input_data.max_tokens, temperature=input_data.temperature ) ``` ### `minimal_input_schema: Optional[Any] = None` Simplified schema for basic API documentation and testing. Useful when you have complex input but want simpler examples. 
**Example:** ```python class FullInput(BaseModel): prompt: str max_tokens: int = 100 temperature: float = 0.7 top_p: float = 0.9 frequency_penalty: float = 0.0 class SimpleInput(BaseModel): prompt: str = Field(..., description="Just the prompt for quick testing") @chute.cord( public_api_path="/generate", input_schema=FullInput, minimal_input_schema=SimpleInput ) async def generate_flexible(self, input_data: FullInput): return await self.model.generate(**input_data.dict()) ``` ### `output_content_type: Optional[str] = None` The MIME type of the response content. Auto-detected for JSON/text, but should be specified for binary responses. **Common Content Types:** - `application/json` - JSON responses (auto-detected) - `text/plain` - Plain text (auto-detected) - `image/png`, `image/jpeg` - Images - `audio/wav`, `audio/mpeg` - Audio files - `text/event-stream` - Server-sent events **Image Response Example:** ```python from fastapi import Response @chute.cord( public_api_path="/generate_image", public_api_method="POST", output_content_type="image/png" ) async def generate_image(self, prompt: str) -> Response: image_data = await self.image_model.generate(prompt) return Response( content=image_data, media_type="image/png", headers={"Content-Disposition": "inline; filename=generated.png"} ) ``` **Audio Response Example:** ```python @chute.cord( public_api_path="/text_to_speech", public_api_method="POST", output_content_type="audio/wav" ) async def text_to_speech(self, text: str) -> Response: audio_data = await self.tts_model.synthesize(text) return Response( content=audio_data, media_type="audio/wav" ) ``` ### `output_schema: Optional[Dict] = None` Schema for output validation and documentation. Auto-extracted from return type hints. ### `passthrough: bool = False` Enable passthrough mode to forward requests to an underlying service. **Use Case:** When you're running a service like vLLM or SGLang that has its own HTTP server, you can use passthrough to forward requests. 
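The passthrough target has to be started somehow — typically in an `@chute.on_startup()` hook. A sketch of launching such a backend as a subprocess (the vLLM module path and model name here are illustrative assumptions, not something this SDK provides):

```python
import asyncio

# Illustrative command for an OpenAI-compatible backend (vLLM's server
# module; the model name is a placeholder -- adjust for your engine).
ENGINE_CMD = [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "myorg/my-model",
    "--port", "8000",  # must agree with passthrough_port on the cord
]

async def start_engine(self):
    """Body of an @chute.on_startup() hook that launches the backend."""
    self.engine = await asyncio.create_subprocess_exec(*ENGINE_CMD)
```

Once the backend is listening on the port, passthrough cords simply forward requests to it, as shown in the example below.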
**Example:** ```python @chute.cord( public_api_path="/v1/completions", public_api_method="POST", passthrough=True, passthrough_path="/v1/completions", passthrough_port=8000 ) async def completions(self, **kwargs): # Request is forwarded to localhost:8000/v1/completions pass ``` ### `passthrough_path: str = None` The path to forward requests to when using passthrough mode. ### `passthrough_port: int = None` The port to forward requests to when using passthrough mode. Defaults to 8000. ### `provision_timeout: int = 180` Timeout in seconds for waiting for the chute to provision. Default is 3 minutes. ## Function Patterns ### Simple Functions ```python # Basic function with primitive parameters @chute.cord(public_api_path="/simple") async def simple_endpoint(self, text: str, number: int = 10): return {"text": text, "number": number} # Function with optional parameters @chute.cord(public_api_path="/optional") async def optional_params( self, required_param: str, optional_param: str = None, default_param: int = 100 ): return { "required": required_param, "optional": optional_param, "default": default_param } ``` ### Schema-Based Functions ```python from pydantic import BaseModel class MyInput(BaseModel): text: str count: int = 1 class MyOutput(BaseModel): results: list[str] @chute.cord( public_api_path="/process", input_schema=MyInput, output_schema=MyOutput ) async def process_with_schemas(self, data: MyInput) -> MyOutput: results = [data.text] * data.count return MyOutput(results=results) ``` ### File Responses ```python from fastapi.responses import FileResponse @chute.cord( public_api_path="/download", public_api_method="GET", output_content_type="application/pdf" ) async def download_file(self) -> FileResponse: return FileResponse( "report.pdf", media_type="application/pdf", filename="report.pdf" ) ``` ## Error Handling ```python from fastapi import HTTPException @chute.cord(public_api_path="/generate") async def generate_with_errors(self, prompt: str): # Validate 
input if not prompt.strip(): raise HTTPException( status_code=400, detail="Prompt cannot be empty" ) if len(prompt) > 10000: raise HTTPException( status_code=400, detail="Prompt too long (max 10,000 characters)" ) try: result = await self.model.generate(prompt) return {"generated_text": result} except Exception as e: raise HTTPException( status_code=500, detail=f"Generation failed: {str(e)}" ) ``` ## Complete Example ```python from chutes.chute import Chute, NodeSelector from chutes.image import Image from pydantic import BaseModel, Field from fastapi import HTTPException from fastapi.responses import StreamingResponse import json image = ( Image(username="myuser", name="text-gen", tag="1.0") .from_base("parachutes/python:3.12") .run_command("pip install transformers torch") ) chute = Chute( username="myuser", name="text-generator", image=image, node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=16), concurrency=4 ) class GenerationInput(BaseModel): prompt: str = Field(..., min_length=1, max_length=10000) max_tokens: int = Field(100, ge=1, le=2000) temperature: float = Field(0.7, ge=0.0, le=2.0) class SimpleInput(BaseModel): prompt: str @chute.on_startup() async def load_model(self): from transformers import pipeline self.generator = pipeline("text-generation", model="gpt2", device=0) @chute.cord( public_api_path="/generate", public_api_method="POST", input_schema=GenerationInput, minimal_input_schema=SimpleInput ) async def generate(self, params: GenerationInput) -> dict: """Generate text from a prompt.""" result = self.generator( params.prompt, max_length=params.max_tokens, temperature=params.temperature )[0]["generated_text"] return { "generated_text": result, "tokens_used": len(result.split()) } @chute.cord( public_api_path="/stream", public_api_method="POST", stream=True ) async def stream_generate(self, prompt: str): """Stream text generation token by token.""" async def generate(): # Simulated streaming words = prompt.split() for word in words: 
yield f"data: {json.dumps({'token': word + ' '})}\n\n" yield f"data: {json.dumps({'finished': True})}\n\n" return StreamingResponse(generate(), media_type="text/event-stream") @chute.cord(public_api_path="/health", public_api_method="GET") async def health(self) -> dict: """Health check endpoint.""" return { "status": "healthy", "model_loaded": hasattr(self, "generator") } ``` ## Best Practices ### 1. Use Descriptive Paths ```python # Good @chute.cord(public_api_path="/generate_text") @chute.cord(public_api_path="/analyze_sentiment") # Avoid @chute.cord(public_api_path="/api") @chute.cord(public_api_path="/do") ``` ### 2. Choose Appropriate Methods ```python # GET for read-only operations @chute.cord(public_api_path="/models", public_api_method="GET") # POST for AI generation/processing @chute.cord(public_api_path="/generate", public_api_method="POST") ``` ### 3. Use Input Schemas for Validation ```python from pydantic import BaseModel, Field class ValidatedInput(BaseModel): prompt: str = Field(..., min_length=1, max_length=10000) temperature: float = Field(0.7, ge=0.0, le=2.0) @chute.cord(public_api_path="/generate", input_schema=ValidatedInput) async def generate(self, params: ValidatedInput): # Input is automatically validated pass ``` ### 4. Handle Errors Gracefully ```python @chute.cord(public_api_path="/generate") async def generate(self, prompt: str): if not prompt.strip(): raise HTTPException(400, "Prompt cannot be empty") try: return await self.model.generate(prompt) except Exception as e: raise HTTPException(500, f"Generation failed: {e}") ``` ### 5. 
Use Streaming for Long Operations ```python @chute.cord(public_api_path="/generate", stream=True) async def stream_generate(self, prompt: str): async def stream(): async for token in self.model.stream(prompt): yield f"data: {json.dumps({'token': token})}\n\n" return StreamingResponse(stream(), media_type="text/event-stream") ``` ## See Also - **[Chute Class](/docs/sdk-reference/chute)** - Main chute documentation - **[Job Decorator](/docs/sdk-reference/job)** - Background job documentation - **[Streaming Guide](/docs/guides/streaming)** - Detailed streaming patterns --- ## SOURCE: https://chutes.ai/docs/sdk-reference/image # Image API Reference The `Image` class is used to build custom Docker images for Chutes applications. This reference covers all methods, configuration options, and best practices for creating optimized container images. ## Class Definition ```python from chutes.image import Image image = Image( username: str, name: str, tag: str, readme: str = "" ) ``` ## Constructor Parameters ### Required Parameters #### `username: str` The username or organization name for the image. **Example:** ```python image = Image(username="mycompany", name="custom-ai", tag="1.0") ``` **Rules:** - Must match pattern `^[a-z0-9][a-z0-9-_\.]*$` - Should match your Chutes username #### `name: str` The name of the Docker image. **Example:** ```python image = Image(username="mycompany", name="text-processor", tag="1.0") ``` **Rules:** - Must match pattern `^[a-z0-9][a-z0-9-_\.]*$` - Should be descriptive of the image purpose #### `tag: str` Version tag for the image. **Examples:** ```python # Version tag image = Image(username="mycompany", name="ai-model", tag="1.0.0") # Development tag image = Image(username="mycompany", name="ai-model", tag="dev") ``` **Best Practices:** - Use semantic versioning (1.0.0, 1.1.0, etc.) 
- Use descriptive tags for different environments - Avoid using "latest" in production ### Optional Parameters #### `readme: str = ""` Documentation for the image in Markdown format. **Example:** ```python readme = """ # Custom AI Processing Image This image contains optimized libraries for AI text processing. ## Features - PyTorch 2.0 with CUDA support - Transformers library - Optimized for GPU inference """ image = Image( username="mycompany", name="ai-processor", tag="1.0.0", readme=readme ) ``` ## Default Base Image By default, images use `parachutes/python:3.12` as the base image, which includes: - CUDA 12.x support - Python 3.12 - OpenCL libraries - Common system dependencies **We highly recommend using this base image** to avoid dependency issues. ## Methods ### `.from_base(base_image: str)` Replace the base image. **Signature:** ```python def from_base(self, base_image: str) -> Image ``` **Examples:** ```python # Use recommended Chutes base image (default) image = Image("myuser", "myapp", "1.0").from_base("parachutes/python:3.12") # Use NVIDIA CUDA base images image = Image("myuser", "myapp", "1.0").from_base("nvidia/cuda:12.2-runtime-ubuntu22.04") # Use Python base images image = Image("myuser", "myapp", "1.0").from_base("python:3.11-slim") ``` **Choosing Base Images:** - **parachutes/python:3.12**: Recommended for most use cases - **nvidia/cuda:\***: For GPU-accelerated applications needing specific CUDA versions - **python:3.11-slim**: Lightweight, CPU-only workloads ### `.run_command(command: str)` Execute shell commands during image build. 
**Signature:** ```python def run_command(self, command: str) -> Image ``` **Examples:** ```python # Install Python packages image = ( Image("myuser", "myapp", "1.0") .from_base("parachutes/python:3.12") .run_command("pip install torch transformers accelerate") ) # Multiple commands in one call image = ( Image("myuser", "myapp", "1.0") .from_base("parachutes/python:3.12") .run_command(""" pip install --upgrade pip && pip install torch transformers && pip install accelerate datasets """) ) # Install from requirements file image = ( Image("myuser", "myapp", "1.0") .from_base("parachutes/python:3.12") .add("requirements.txt", "/tmp/requirements.txt") .run_command("pip install -r /tmp/requirements.txt") ) ``` ### `.add(source: str, dest: str)` Add files from the build context to the image. **Signature:** ```python def add(self, source: str, dest: str) -> Image ``` **Examples:** ```python # Add single file image = ( Image("myuser", "myapp", "1.0") .from_base("parachutes/python:3.12") .add("requirements.txt", "/app/requirements.txt") ) # Add directory image = ( Image("myuser", "myapp", "1.0") .from_base("parachutes/python:3.12") .add("src/", "/app/src/") ) # Add multiple files image = ( Image("myuser", "myapp", "1.0") .from_base("parachutes/python:3.12") .add("requirements.txt", "/app/requirements.txt") .add("config.yaml", "/app/config.yaml") .add("src/", "/app/src/") ) ``` **Best Practices:** ```python # Add requirements first for better caching image = ( Image("myuser", "myapp", "1.0") .from_base("parachutes/python:3.12") .add("requirements.txt", "/tmp/requirements.txt") # Add early .run_command("pip install -r /tmp/requirements.txt") # Install deps .add("src/", "/app/src/") # Add code last (changes frequently) ) ``` ### `.with_env(key: str, value: str)` Set environment variables in the image. 
**Signature:** ```python def with_env(self, key: str, value: str) -> Image ``` **Examples:** ```python # Basic environment variables image = ( Image("myuser", "myapp", "1.0") .from_base("parachutes/python:3.12") .with_env("PYTHONPATH", "/app") .with_env("PYTHONUNBUFFERED", "1") ) # Model cache configuration image = ( Image("myuser", "ai-app", "1.0") .from_base("parachutes/python:3.12") .with_env("TRANSFORMERS_CACHE", "/opt/models") .with_env("HF_HOME", "/opt/huggingface") .with_env("TORCH_HOME", "/opt/torch") ) # Application configuration image = ( Image("myuser", "myapp", "1.0") .from_base("parachutes/python:3.12") .with_env("APP_ENV", "production") .with_env("LOG_LEVEL", "INFO") ) ``` **Common Environment Variables:** ```python # Python optimization image = image.with_env("PYTHONOPTIMIZE", "2") image = image.with_env("PYTHONDONTWRITEBYTECODE", "1") image = image.with_env("PYTHONUNBUFFERED", "1") # PyTorch optimizations image = image.with_env("TORCH_BACKENDS_CUDNN_BENCHMARK", "1") ``` ### `.set_workdir(directory: str)` Set the working directory for the container. **Signature:** ```python def set_workdir(self, directory: str) -> Image ``` **Examples:** ```python # Set working directory image = ( Image("myuser", "myapp", "1.0") .from_base("parachutes/python:3.12") .set_workdir("/app") .add("src/", "/app/src/") ) # Multiple working directories for different stages image = ( Image("myuser", "myapp", "1.0") .from_base("parachutes/python:3.12") .set_workdir("/tmp") .add("requirements.txt", "requirements.txt") .run_command("pip install -r requirements.txt") .set_workdir("/app") .add("src/", "src/") ) ``` ### `.set_user(user: str)` Set the user for running commands and the container. 
**Signature:** ```python def set_user(self, user: str) -> Image ``` **Examples:** ```python # Create and use non-root user image = ( Image("myuser", "myapp", "1.0") .from_base("parachutes/python:3.12") .run_command("useradd -m -u 1000 appuser") .run_command("mkdir -p /app && chown appuser:appuser /app") .set_user("appuser") .set_workdir("/app") ) # Use existing user image = ( Image("myuser", "myapp", "1.0") .from_base("ubuntu:22.04") .set_user("nobody") ) ``` ### `.apt_install(package: str | List[str])` Install system packages using apt. **Signature:** ```python def apt_install(self, package: str | List[str]) -> Image ``` **Examples:** ```python # Install single package image = image.apt_install("git") # Install multiple packages image = image.apt_install(["git", "curl", "wget", "ffmpeg"]) ``` ### `.apt_remove(package: str | List[str])` Remove system packages using apt. **Signature:** ```python def apt_remove(self, package: str | List[str]) -> Image ``` **Example:** ```python # Remove packages after use image = ( image .apt_install(["build-essential", "cmake"]) .run_command("pip install some-package-that-needs-compilation") .apt_remove(["build-essential", "cmake"]) ) ``` ### `.with_python(version: str = "3.10.15")` Install a specific version of Python from source. **Signature:** ```python def with_python(self, version: str = "3.10.15") -> Image ``` **Example:** ```python # Install specific Python version image = ( Image("myuser", "myapp", "1.0") .from_base("ubuntu:22.04") .with_python("3.11.5") ) ``` **Note:** This builds Python from source, which can be slow. Consider using `parachutes/python:3.12` as your base image instead. ### `.with_maintainer(maintainer: str)` Set the maintainer for the image. **Signature:** ```python def with_maintainer(self, maintainer: str) -> Image ``` **Example:** ```python image = image.with_maintainer("team@mycompany.com") ``` ### `.with_entrypoint(*args)` Set the container entrypoint. 
**Signature:** ```python def with_entrypoint(self, *args) -> Image ``` **Examples:** ```python # Python module entrypoint image = image.with_entrypoint("python", "-m", "myapp") # Shell script entrypoint image = ( image .add("entrypoint.sh", "/entrypoint.sh") .run_command("chmod +x /entrypoint.sh") .with_entrypoint("/entrypoint.sh") ) ``` ## Complete Examples ### Basic ML Image ```python from chutes.image import Image image = ( Image(username="myuser", name="ml-app", tag="1.0") .from_base("parachutes/python:3.12") .run_command("pip install torch transformers accelerate") .add("requirements.txt", "/app/requirements.txt") .run_command("pip install -r /app/requirements.txt") .add("src/", "/app/src/") .set_workdir("/app") .with_env("PYTHONPATH", "/app") ) ``` ### Optimized PyTorch Image ```python image = ( Image(username="myuser", name="pytorch-app", tag="1.0", readme="## PyTorch Application\nOptimized for GPU inference.") .from_base("parachutes/python:3.12") # System dependencies .apt_install(["git", "curl", "ffmpeg"]) # Python packages .run_command(""" pip install --upgrade pip && pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 && pip install transformers accelerate datasets tokenizers """) # Environment optimization .with_env("PYTHONUNBUFFERED", "1") .with_env("TRANSFORMERS_CACHE", "/opt/models") .with_env("TORCH_BACKENDS_CUDNN_BENCHMARK", "1") # Application code .add("requirements.txt", "/app/requirements.txt") .run_command("pip install -r /app/requirements.txt") .add("src/", "/app/src/") .set_workdir("/app") ) ``` ### Image with System Dependencies ```python image = ( Image(username="myuser", name="audio-processor", tag="1.0") .from_base("parachutes/python:3.12") # Audio processing dependencies .apt_install([ "ffmpeg", "libsndfile1", "libportaudio2", "libsox-fmt-all" ]) # Python audio libraries .run_command(""" pip install soundfile librosa pydub torchaudio """) .add("src/", "/app/src/") .set_workdir("/app") ) ``` ## 
Layer Caching Best Practices For faster builds, order your directives from least to most frequently changing: ```python # Good: Optimal layer ordering image = ( Image("myuser", "myapp", "1.0") .from_base("parachutes/python:3.12") # 1. System packages (rarely change) .apt_install(["git", "curl"]) # 2. Python dependencies from requirements (change occasionally) .add("requirements.txt", "/tmp/requirements.txt") .run_command("pip install -r /tmp/requirements.txt") # 3. Application code (changes frequently) .add("src/", "/app/src/") .set_workdir("/app") ) # Bad: Frequent changes early invalidate cache image = ( Image("myuser", "myapp", "1.0") .from_base("parachutes/python:3.12") .add("src/", "/app/src/") # Changes often - invalidates all later layers! .apt_install(["git", "curl"]) .run_command("pip install torch") ) ``` ## Combining Commands Combine related commands into single layers to reduce image size: ```python # Good: Single layer with cleanup image = image.run_command(""" apt-get update && apt-get install -y git curl && rm -rf /var/lib/apt/lists/* """) # Less optimal: Multiple layers image = ( image .run_command("apt-get update") .run_command("apt-get install -y git curl") .run_command("rm -rf /var/lib/apt/lists/*") # Cleanup in separate layer doesn't reduce size ) ``` ## Properties ### `image.uid` The unique identifier for the image. **Type:** `str` ### `image.name` The name of the image. **Type:** `str` ### `image.tag` The tag/version of the image. **Type:** `str` ### `image.readme` The documentation for the image. **Type:** `str` ### `image.username` The username/organization for the image. 
**Type:** `str` ## See Also - **[Chute Class](/docs/sdk-reference/chute)** - Using images with chutes - **[Building Images](/docs/cli/build)** - CLI build commands - **[Templates](/docs/sdk-reference/templates)** - Pre-built image templates --- ## SOURCE: https://chutes.ai/docs/sdk-reference/job # Job Decorator API Reference The `@chute.job()` decorator is used to create long-running jobs or server rentals in Chutes applications. Jobs are different from API endpoints (cords) and are designed for tasks that need persistent compute resources, specific network ports, or long-running processes. ## Decorator Signature ```python @chute.job( ports: list[Port] = [], timeout: Optional[int] = None, upload: bool = True, ssh: bool = False ) ``` ## Port Class Jobs can expose network ports for external access using the `Port` class: ```python from chutes.chute.job import Port port = Port( name: str, # Port identifier (lowercase letters + optional numbers) port: int, # Port number (2202 or 8002-65535) proto: str # Protocol: "tcp", "udp", or "http" ) ``` ### Port Rules - Port must be 2202 (reserved for SSH) or in range 8002-65535 - Each port must have a unique number within the job - Name must match pattern `^[a-z]+[0-9]*$` (e.g., "web", "api", "metrics1") ## Parameters ### `ports: list[Port] = []` List of network ports to expose for the job. **Examples:** ```python from chutes.chute.job import Port # Single HTTP port @chute.job(ports=[Port(name="web", port=8080, proto="http")]) async def web_server_job(self, **job_data): pass # Multiple ports @chute.job(ports=[ Port(name="api", port=8000, proto="http"), Port(name="metrics", port=9090, proto="http"), Port(name="grpc", port=8001, proto="tcp") ]) async def multi_port_job(self, **job_data): pass # No ports (compute-only job) @chute.job() async def compute_job(self, **job_data): pass ``` ### `timeout: Optional[int] = None` Maximum execution time for the job in seconds. 
**Constraints:** - If specified, must be between 30 seconds and 86400 seconds (24 hours) - `None` means no timeout (job can run indefinitely - useful for server rentals) **Examples:** ```python # Job with 1 hour timeout @chute.job(timeout=3600) async def training_job(self, **job_data): """Model training with 1 hour limit.""" await self.train_model() # Long-running server with no timeout @chute.job(timeout=None) async def server_job(self, **job_data): """Persistent server process.""" await self.start_server() # Short batch job (5 minutes) @chute.job(timeout=300) async def quick_batch_job(self, **job_data): """Quick data processing job.""" await self.process_batch() ``` ### `upload: bool = True` Whether to automatically upload output files generated by the job. **Purpose:** - Automatically collects and uploads files created in the job's output directory - Useful for jobs that generate artifacts, model weights, logs, or result files **Examples:** ```python # Job with file upload (default) @chute.job(upload=True) async def generate_report_job(self, **job_data): """Generate report and upload results.""" output_dir = job_data["output_dir"] # Files written to output_dir will be automatically uploaded with open(f"{output_dir}/report.pdf", "wb") as f: f.write(await self.generate_pdf_report()) with open(f"{output_dir}/results.json", "w") as f: json.dump(self.results, f) # Job without file upload @chute.job(upload=False) async def streaming_job(self, **job_data): """Streaming job that doesn't generate files.""" while not self.cancel_event.is_set(): await self.process_stream() await asyncio.sleep(1) ``` ### `ssh: bool = False` Whether to enable SSH access to the job container. 
**Purpose:** - Debug running jobs - Interactive development - Manual intervention when needed **Examples:** ```python # Job with SSH access for debugging @chute.job(ssh=True, timeout=7200) async def debug_job(self, **job_data): """Job with SSH access for debugging.""" # SSH key should be provided in job_data["_ssh_public_key"] await self.complex_operation() # Regular job without SSH @chute.job(ssh=False) # Default async def regular_job(self, **job_data): """Regular job without SSH access.""" await self.standard_operation() ``` **Note:** When `ssh=True`, port 2202 is automatically added to the job's ports for SSH access. ## Job Function Signature Job functions receive keyword arguments containing job data: ```python @chute.job() async def my_job(self, **job_data): # job_data contains: # - "output_dir": Directory path for output files # - "_ssh_public_key": SSH public key (if ssh=True and provided) # - Any other data passed when starting the job output_dir = job_data["output_dir"] # Your job logic here return {"status": "completed"} ``` ## Job Lifecycle ### Cancellation Support Jobs have access to a cancel event that can be used to gracefully handle cancellation: ```python @chute.job(timeout=3600) async def cancellable_job(self, **job_data): """Job that handles cancellation gracefully.""" for i in range(100): # Check for cancellation if self.cancel_event.is_set(): print("Job cancelled, cleaning up...") break await self.process_step(i) await asyncio.sleep(1) return {"processed_steps": i} ``` ### Output Directory Jobs receive an `output_dir` in `job_data` where they can write files: ```python @chute.job(upload=True) async def job_with_outputs(self, **job_data): output_dir = job_data["output_dir"] # Write output files model_path = f"{output_dir}/model.pt" torch.save(self.model.state_dict(), model_path) # Write logs with open(f"{output_dir}/training_log.txt", "w") as f: f.write("\n".join(self.logs)) # Files in output_dir are automatically uploaded when job completes 
return {"model_path": model_path} ``` ## Complete Examples ### Model Training Job ```python from chutes.chute import Chute, NodeSelector from chutes.chute.job import Port from chutes.image import Image image = ( Image(username="myuser", name="training", tag="1.0") .from_base("parachutes/python:3.12") .run_command("pip install torch transformers") ) chute = Chute( username="myuser", name="model-trainer", image=image, node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=24) ) @chute.on_startup() async def setup(self): import torch self.device = "cuda" if torch.cuda.is_available() else "cpu" @chute.job( timeout=7200, # 2 hours upload=True, ssh=True # Enable SSH for debugging ) async def train_model(self, **job_data): """Train a model and save checkpoints.""" import torch from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer output_dir = job_data["output_dir"] model_name = job_data.get("model_name", "gpt2") epochs = job_data.get("epochs", 3) print(f"Loading model: {model_name}") tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModelForCausalLM.from_pretrained(model_name) model.to(self.device) # Training loop for epoch in range(epochs): if self.cancel_event.is_set(): print("Training cancelled") break print(f"Epoch {epoch + 1}/{epochs}") # ... training logic ... 
# Save checkpoint checkpoint_path = f"{output_dir}/checkpoint_epoch_{epoch}.pt" torch.save(model.state_dict(), checkpoint_path) # Save final model final_path = f"{output_dir}/final_model.pt" torch.save(model.state_dict(), final_path) return { "status": "completed", "epochs_completed": epoch + 1, "model_path": final_path } ``` ### Web Server Job ```python from chutes.chute.job import Port @chute.job( ports=[ Port(name="web", port=8080, proto="http"), Port(name="metrics", port=9090, proto="http") ], timeout=None, # Run indefinitely upload=False ) async def web_server_job(self, **job_data): """Run a web server as a long-running job.""" from fastapi import FastAPI import uvicorn app = FastAPI() @app.get("/") async def root(): return {"message": "Hello from job!"} @app.get("/health") async def health(): return {"status": "healthy"} config = uvicorn.Config(app, host="0.0.0.0", port=8080) server = uvicorn.Server(config) # Run until cancelled await server.serve() ``` ### Batch Processing Job ```python @chute.job(timeout=1800, upload=True) async def batch_processing_job(self, **job_data): """Process a batch of items.""" import json output_dir = job_data["output_dir"] items = job_data.get("items", []) results = [] processed = 0 failed = 0 for item in items: if self.cancel_event.is_set(): print(f"Cancelled after processing {processed} items") break try: result = await self.process_item(item) results.append(result) processed += 1 except Exception as e: print(f"Failed to process item: {e}") failed += 1 # Save results with open(f"{output_dir}/results.json", "w") as f: json.dump(results, f) return { "status": "completed", "processed": processed, "failed": failed, "total": len(items) } ``` ## Error Handling Jobs should handle errors gracefully and return an appropriate status: ```python @chute.job(timeout=3600) async def robust_job(self, **job_data): """Job with comprehensive error handling.""" import asyncio import json output_dir = job_data["output_dir"] try: # Perform main work result = await self.do_work() # Save 
output with open(f"{output_dir}/output.json", "w") as f: json.dump(result, f) return { "status": "completed", "result": result } except asyncio.CancelledError: # Handle cancellation print("Job was cancelled") raise except ValueError as e: # Handle known errors return { "status": "failed", "error": "invalid_input", "message": str(e) } except Exception as e: # Handle unexpected errors print(f"Unexpected error: {e}") # Save error log with open(f"{output_dir}/error.log", "w") as f: f.write(f"Error: {e}\n") import traceback f.write(traceback.format_exc()) return { "status": "error", "error": str(e) } ``` ## Best Practices ### 1. Always Check for Cancellation ```python @chute.job(timeout=3600) async def long_job(self, **job_data): for i in range(1000): if self.cancel_event.is_set(): return {"status": "cancelled", "progress": i} await self.process_step(i) ``` ### 2. Use Appropriate Timeouts ```python # Short job - use explicit timeout @chute.job(timeout=300) async def quick_job(self, **job_data): pass # Long training - longer timeout @chute.job(timeout=86400) # 24 hours async def training_job(self, **job_data): pass # Server rental - no timeout @chute.job(timeout=None) async def server_job(self, **job_data): pass ``` ### 3. Write Important Data to Output Directory ```python @chute.job(upload=True) async def job_with_checkpoints(self, **job_data): output_dir = job_data["output_dir"] for epoch in range(100): # Train... # Save checkpoint periodically if epoch % 10 == 0: torch.save(model, f"{output_dir}/checkpoint_{epoch}.pt") ``` ### 4. Use SSH for Debugging Complex Jobs ```python @chute.job(ssh=True, timeout=7200) async def debuggable_job(self, **job_data): """Enable SSH so you can connect and debug if needed.""" pass ``` ### 5. 
Return Meaningful Status ```python @chute.job() async def well_documented_job(self, **job_data): return { "status": "completed", "items_processed": 150, "errors": 2, "duration_seconds": 342, "output_files": ["results.json", "model.pt"] } ``` ## See Also - **[Chute Class](/docs/sdk-reference/chute)** - Main chute documentation - **[Cord Decorator](/docs/sdk-reference/cord)** - API endpoint documentation - **[NodeSelector](/docs/sdk-reference/node-selector)** - Hardware requirements --- ## SOURCE: https://chutes.ai/docs/sdk-reference/node-selector # NodeSelector API Reference The `NodeSelector` class specifies hardware requirements for Chutes deployments. This reference covers all configuration options, GPU types, and best practices for optimal resource allocation. ## Class Definition ```python from chutes.chute import NodeSelector node_selector = NodeSelector( gpu_count: int = 1, min_vram_gb_per_gpu: int = 16, include: Optional[List[str]] = None, exclude: Optional[List[str]] = None ) ``` ## Parameters ### `gpu_count: int = 1` Number of GPUs required for the deployment. **Constraints:** 1-8 GPUs **Examples:** ```python # Single GPU (default) node_selector = NodeSelector(gpu_count=1) # Multiple GPUs for large models node_selector = NodeSelector(gpu_count=4) # Maximum supported GPUs node_selector = NodeSelector(gpu_count=8) ``` **Use Cases:** | GPU Count | Use Case | |-----------|----------| | 1 | Standard AI models (BERT, GPT-2, 7B LLMs) | | 2-4 | Larger language models (13B-30B parameters) | | 4-8 | Very large models (70B+ parameters) | ### `min_vram_gb_per_gpu: int = 16` Minimum VRAM (Video RAM) required per GPU in gigabytes. 
**Constraints:** 16-140 GB **Examples:** ```python # Default minimum (suitable for most models) node_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16 ) # Medium models requiring more VRAM node_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24 ) # Large models node_selector = NodeSelector( gpu_count=2, min_vram_gb_per_gpu=48 ) # Ultra-large models (H100 80GB required) node_selector = NodeSelector( gpu_count=4, min_vram_gb_per_gpu=80 ) ``` **VRAM Requirements by Model Size:** | Model Size | Min VRAM | Example Models | |------------|----------|----------------| | 1-3B params | 16GB | DistilBERT, GPT-2 | | 7B params | 24GB | Llama-2-7B, Mistral-7B | | 13B params | 32-40GB | Llama-2-13B | | 30B params | 48GB | CodeLlama-34B | | 70B+ params | 80GB+ | Llama-2-70B, DeepSeek-R1 | ### `include: Optional[List[str]] = None` List of GPU types to include in selection. Only these GPU types will be considered. **Examples:** ```python # Only high-end GPUs node_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24, include=["a100", "h100"] ) # Cost-effective options node_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=48, include=["l40", "a6000"] ) # H100 only for maximum performance node_selector = NodeSelector( gpu_count=2, min_vram_gb_per_gpu=80, include=["h100"] ) ``` ### `exclude: Optional[List[str]] = None` List of GPU types to exclude from selection. 
**Examples:** ```python # Avoid older GPUs node_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24, exclude=["t4"] ) # Cost optimization - exclude expensive GPUs node_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24, exclude=["h100", "a100-80gb"] ) ``` ## Available GPU Types ### High-Performance GPUs | GPU | VRAM | Notes | |-----|------|-------| | `h100` | 80GB | Latest Hopper architecture, best performance | | `h200` | 141GB | Hopper with HBM3e, maximum memory | | `a100-80gb` | 80GB | Ampere, excellent for training/inference | | `a100` | 40GB | Ampere, high performance tier | ### Professional GPUs | GPU | VRAM | Notes | |-----|------|-------| | `l40` | 48GB | Ada Lovelace, good balance of cost/performance | | `a6000` | 48GB | Professional-grade, good for development | | `a5000` | 24GB | Professional-grade, medium workloads | | `a4000` | 16GB | Entry professional GPU | ### Consumer/Entry GPUs | GPU | VRAM | Notes | |-----|------|-------| | `rtx4090` | 24GB | Consumer, cost-effective | | `rtx3090` | 24GB | Previous gen consumer | | `a10` | 24GB | Good for smaller models | | `t4` | 16GB | Entry-level, inference-focused | ### AMD GPUs | GPU | VRAM | Notes | |-----|------|-------| | `mi300x` | 192GB | AMD Instinct, very high memory | ## Common Selection Patterns ### Cost-Optimized ```python # Small models - minimize cost budget_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16, include=["t4", "a4000", "a10"] ) # Medium models - balance cost/performance balanced_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24, include=["l40", "a5000", "rtx4090"], exclude=["h100", "a100-80gb"] ) ``` ### Performance-Optimized ```python # Maximum performance performance_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=80, include=["h100", "a100-80gb"] ) # High throughput serving throughput_selector = NodeSelector( gpu_count=4, min_vram_gb_per_gpu=48, include=["l40", "a100"] ) ``` ### Model-Specific ```python # 7B parameter models 
(e.g., Mistral-7B, Llama-2-7B) llm_7b_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24, include=["l40", "a5000", "rtx4090"] ) # 13B parameter models llm_13b_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=40, include=["l40", "a100", "a6000"] ) # 70B parameter models llm_70b_selector = NodeSelector( gpu_count=2, min_vram_gb_per_gpu=80, include=["h100", "a100-80gb"] ) # DeepSeek-R1 (671B parameters) deepseek_selector = NodeSelector( gpu_count=8, min_vram_gb_per_gpu=141, include=["h200"] ) ``` ### Image Generation ```python # Stable Diffusion / SDXL diffusion_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24, include=["l40", "a5000", "rtx4090"] ) # FLUX models flux_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=48, include=["l40", "a6000", "a100"] ) ``` ## Integration Examples ### With Chute Definition ```python from chutes.chute import Chute, NodeSelector from chutes.image import Image node_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24, include=["l40", "a100"] ) chute = Chute( username="myuser", name="my-model-server", image=Image(username="myuser", name="my-image", tag="1.0"), node_selector=node_selector ) ``` ### With Templates ```python from chutes.chute.template import build_vllm_chute from chutes.chute import NodeSelector chute = build_vllm_chute( username="myuser", model_name="meta-llama/Llama-2-7b-chat-hf", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24, include=["l40", "a5000"] ) ) ``` ### Dynamic Selection Based on Model ```python def get_node_selector(model_size: str) -> NodeSelector: """Get appropriate NodeSelector based on model size.""" configs = { "small": { # < 3B parameters "gpu_count": 1, "min_vram_gb_per_gpu": 16 }, "medium": { # 7-13B parameters "gpu_count": 1, "min_vram_gb_per_gpu": 32, "exclude": ["t4"] }, "large": { # 30-70B parameters "gpu_count": 2, "min_vram_gb_per_gpu": 48, "include": ["a100", "l40", "h100"] }, "xlarge": { # 70B+ parameters "gpu_count": 4, 
"min_vram_gb_per_gpu": 80, "include": ["h100", "a100-80gb"] } } return NodeSelector(**configs.get(model_size, configs["medium"])) ``` ## Common Issues and Solutions ### "No available nodes match your requirements" **Solution 1:** Broaden your requirements ```python # Too restrictive strict_selector = NodeSelector( gpu_count=8, min_vram_gb_per_gpu=80, include=["h100"] ) # More flexible flexible_selector = NodeSelector( gpu_count=4, min_vram_gb_per_gpu=48, include=["h100", "a100", "l40"] ) ``` **Solution 2:** Reduce GPU count ```python # Try multiple smaller GPUs multi_gpu = NodeSelector( gpu_count=2, min_vram_gb_per_gpu=40 ) ``` ### "Out of memory" errors Increase VRAM requirements: ```python # Increase min_vram_gb_per_gpu higher_vram = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=48 # Increased from 24 ) ``` ## Best Practices ### 1. Right-Size Your Requirements Don't over-provision - it wastes resources and costs more: ```python # Bad - wastes resources for a 7B model oversized = NodeSelector( gpu_count=8, min_vram_gb_per_gpu=80 ) # Good - matches actual needs rightsized = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24 ) ``` ### 2. Use Include/Exclude Wisely ```python # Be specific when you have known requirements specific_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=48, include=["l40", "a6000"] # Known compatible GPUs ) # Exclude known incompatible GPUs compatible_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24, exclude=["t4"] # Known to be too slow for your use case ) ``` ### 3. 
Development vs Production ```python # Development - prioritize cost dev_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16, include=["t4", "a4000"] ) # Production - prioritize performance prod_selector = NodeSelector( gpu_count=2, min_vram_gb_per_gpu=48, include=["l40", "a100"], exclude=["t4", "a4000"] ) ``` ## Summary The NodeSelector provides control over GPU hardware selection with four parameters: | Parameter | Default | Range | Description | |-----------|---------|-------|-------------| | `gpu_count` | 1 | 1-8 | Number of GPUs | | `min_vram_gb_per_gpu` | 16 | 16-140 | Minimum VRAM per GPU | | `include` | None | List[str] | Whitelist GPU types | | `exclude` | None | List[str] | Blacklist GPU types | Start with minimum requirements and adjust based on performance needs and availability. ## See Also - **[Chute Class](/docs/sdk-reference/chute)** - Using NodeSelector with chutes - **[Templates](/docs/sdk-reference/templates)** - Pre-built templates with NodeSelector - **[Cost Optimization](/docs/guides/cost-optimization)** - GPU selection for cost efficiency --- ## SOURCE: https://chutes.ai/docs/sdk-reference/README # SDK Reference Complete SDK reference for the Chutes Python SDK. Each page documents the classes, functions, decorators, and methods available. 
## Core Classes - **[Chute Class](/docs/sdk-reference/chute)** - The main class for defining AI applications - **[Cord Decorator](/docs/sdk-reference/cord)** - HTTP API endpoint decorator - **[Job Decorator](/docs/sdk-reference/job)** - Long-running jobs and server rentals - **[Image Class](/docs/sdk-reference/image)** - Docker image building - **[NodeSelector Class](/docs/sdk-reference/node-selector)** - Hardware requirements ## Templates - **[Template Functions](/docs/sdk-reference/templates)** - Pre-built templates for vLLM, SGLang, Diffusion, and Embeddings ## Quick Links | Class | Import | Purpose | |-------|--------|---------| | `Chute` | `from chutes.chute import Chute` | Define AI applications | | `NodeSelector` | `from chutes.chute import NodeSelector` | Specify GPU requirements | | `Image` | `from chutes.image import Image` | Build custom images | | `Port` | `from chutes.chute.job import Port` | Define job network ports | | `build_vllm_chute` | `from chutes.chute.template import build_vllm_chute` | vLLM template | ## Reference Format Each API reference includes: - Class/function signature - Parameter descriptions with types and defaults - Usage examples - Best practices --- ## SOURCE: https://chutes.ai/docs/sdk-reference/templates # Templates API Reference Chutes provides pre-built templates for common AI/ML frameworks and use cases. Templates are factory functions that create pre-configured `Chute` instances with optimized settings for specific AI frameworks. 
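The factory-function idea behind these templates can be illustrated with a plain-Python sketch: merge caller overrides over framework-appropriate defaults and return a configured object. The names below (`FakeChute`, `build_fake_chute`, the default values) are illustrative stand-ins, not the SDK's internals.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

# Stand-in for a configured deployment; the real templates return a Chute.
@dataclass
class FakeChute:
    username: str
    model_name: str
    concurrency: int
    engine_args: Dict[str, Any] = field(default_factory=dict)

def build_fake_chute(username: str, model_name: str,
                     concurrency: int = 64,
                     engine_args: Optional[Dict[str, Any]] = None) -> FakeChute:
    """Merge caller-supplied engine args over illustrative defaults."""
    defaults = {"gpu_memory_utilization": 0.9, "max_model_len": 4096}
    merged = {**defaults, **(engine_args or {})}
    return FakeChute(username, model_name, concurrency, merged)

chute = build_fake_chute("myuser", "mistralai/Mistral-7B-Instruct-v0.3",
                         engine_args={"max_model_len": 8192})
# The caller's max_model_len wins; unspecified defaults remain in place.
print(chute.engine_args)
```

This is why every template parameter is overridable: the factory only supplies defaults, it never locks in configuration.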
## Overview Templates provide: - **Quick Setup**: Instant deployment of popular AI models - **Best Practices**: Pre-configured optimization settings - **Standard APIs**: OpenAI-compatible endpoints for LLMs - **Customization**: Override any parameter as needed ## Available Templates | Template | Use Case | Import | |----------|----------|--------| | `build_vllm_chute` | LLM serving with vLLM | `from chutes.chute.template import build_vllm_chute` | | `build_sglang_chute` | LLM serving with SGLang | `from chutes.chute.template.sglang import build_sglang_chute` | | `build_diffusion_chute` | Image generation | `from chutes.chute.template.diffusion import build_diffusion_chute` | | `build_embedding_chute` | Text embeddings | `from chutes.chute.template.embedding import build_embedding_chute` | ## vLLM Template ### `build_vllm_chute()` Create a chute optimized for vLLM (high-performance LLM serving) with OpenAI-compatible API endpoints. **Import:** ```python from chutes.chute.template import build_vllm_chute ``` **Signature:** ```python def build_vllm_chute( username: str, model_name: str, node_selector: NodeSelector, image: str | Image = VLLM, tagline: str = "", readme: str = "", concurrency: int = 64, engine_args: Dict[str, Any] = {}, revision: str = None, max_instances: int = 1, scaling_threshold: float = 0.75, shutdown_after_seconds: int = 300, allow_external_egress: bool = False ) -> Chute ``` **Parameters:** - **`username`** - Your Chutes username (required) - **`model_name`** - HuggingFace model identifier (required) - **`node_selector`** - Hardware requirements (required) - **`image`** - Custom vLLM image (defaults to built-in) - **`tagline`** - Brief description - **`readme`** - Detailed documentation - **`concurrency`** - Max concurrent requests (default: 64) - **`engine_args`** - vLLM engine configuration - **`revision`** - Model revision - **`max_instances`** - Max scaling instances (default: 1) - **`scaling_threshold`** - Scaling trigger threshold (default: 
0.75) - **`shutdown_after_seconds`** - Idle shutdown time (default: 300) - **`allow_external_egress`** - Allow external network access (default: False) **Basic Example:** ```python from chutes.chute.template import build_vllm_chute from chutes.chute import NodeSelector chute = build_vllm_chute( username="myuser", model_name="mistralai/Mistral-7B-Instruct-v0.3", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24 ) ) ``` **Advanced Example:** ```python from chutes.chute.template import build_vllm_chute from chutes.chute import NodeSelector chute = build_vllm_chute( username="myuser", model_name="meta-llama/Llama-2-70b-chat-hf", node_selector=NodeSelector( gpu_count=8, min_vram_gb_per_gpu=48, exclude=["l40", "a6000"] ), engine_args={ "gpu_memory_utilization": 0.97, "max_model_len": 4096, "max_num_seqs": 8, "trust_remote_code": True, "tensor_parallel_size": 8 }, concurrency=8, max_instances=3 ) ``` **Common vLLM Engine Arguments:** ```python engine_args = { # Memory management "gpu_memory_utilization": 0.95, # Use 95% of GPU memory "swap_space": 4, # GB of CPU swap space # Model configuration "max_model_len": 4096, # Maximum sequence length "max_num_seqs": 256, # Maximum concurrent sequences "trust_remote_code": False, # Allow custom model code # Performance optimization "enable_prefix_caching": True, # Cache prefixes for efficiency "use_v2_block_manager": True, # Improved block manager # Quantization "quantization": None, # e.g., "awq", "gptq", "fp8" "dtype": "auto", # Model data type # Distributed inference "tensor_parallel_size": 1, # GPUs for tensor parallelism # Tokenizer "tokenizer_mode": "auto", # Tokenizer mode # Mistral-specific "config_format": "mistral", # For Mistral models "load_format": "mistral", "tool_call_parser": "mistral", "enable_auto_tool_choice": True } ``` **Provided Endpoints:** vLLM template provides OpenAI-compatible endpoints: - `POST /v1/chat/completions` - Chat completions - `POST /v1/completions` - Text completions - `POST 
/v1/tokenize` - Tokenization - `POST /v1/detokenize` - Detokenization - `GET /v1/models` - List available models ## SGLang Template ### `build_sglang_chute()` Create a chute optimized for SGLang (structured generation language serving). **Import:** ```python from chutes.chute.template.sglang import build_sglang_chute ``` **Signature:** ```python def build_sglang_chute( username: str, model_name: str, node_selector: NodeSelector, image: str | Image = SGLANG, tagline: str = "", readme: str = "", concurrency: int = 64, engine_args: Dict[str, Any] = {}, revision: str = None, max_instances: int = 1, scaling_threshold: float = 0.75, shutdown_after_seconds: int = 300, allow_external_egress: bool = False ) -> Chute ``` **Example:** ```python from chutes.chute.template.sglang import build_sglang_chute from chutes.chute import NodeSelector chute = build_sglang_chute( username="myuser", model_name="deepseek-ai/DeepSeek-R1", node_selector=NodeSelector( gpu_count=8, include=["h200"], min_vram_gb_per_gpu=141 ), engine_args={ "host": "0.0.0.0", "port": 30000, "tp_size": 8, "trust_remote_code": True, "context_length": 65536, "mem_fraction_static": 0.8 }, concurrency=4 ) ``` **Common SGLang Engine Arguments:** ```python engine_args = { # Server configuration "host": "0.0.0.0", "port": 30000, # Model configuration "context_length": 4096, "trust_remote_code": True, # Performance "tp_size": 1, # Tensor parallelism "mem_fraction_static": 0.9, # Static memory fraction "chunked_prefill_size": 512, # Features "enable_flashinfer": True } ``` ## Diffusion Template ### `build_diffusion_chute()` Create a chute optimized for diffusion model inference (image generation). 
**Import:** ```python from chutes.chute.template.diffusion import build_diffusion_chute ``` **Example:** ```python from chutes.chute.template.diffusion import build_diffusion_chute from chutes.chute import NodeSelector chute = build_diffusion_chute( username="myuser", model_name="black-forest-labs/FLUX.1-dev", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=48, include=["l40", "a100"] ), engine_args={ "torch_dtype": "bfloat16", "guidance_scale": 3.5, "num_inference_steps": 28 }, concurrency=1 # Image generation is typically 1 concurrent request ) ``` **Generation Input Schema:** ```python from typing import Optional from pydantic import BaseModel, Field class GenerationInput(BaseModel): prompt: str negative_prompt: str = "" height: int = Field(default=1024, ge=128, le=2048) width: int = Field(default=1024, ge=128, le=2048) num_inference_steps: int = Field(default=25, ge=1, le=50) guidance_scale: float = Field(default=7.5, ge=1.0, le=20.0) seed: Optional[int] = Field(default=None, ge=0, le=2**32 - 1) ``` **Provided Endpoints:** - `POST /generate` - Generate image from prompt ## Embedding Template ### `build_embedding_chute()` Create a chute optimized for text embeddings using vLLM. 
**Import:** ```python from chutes.chute.template.embedding import build_embedding_chute ``` **Signature:** ```python def build_embedding_chute( username: str, model_name: str, node_selector: NodeSelector, image: str | Image = VLLM, tagline: str = "", readme: str = "", concurrency: int = 32, engine_args: Dict[str, Any] = {}, revision: str = None, max_instances: int = 1, scaling_threshold: float = 0.75, shutdown_after_seconds: int = 300, pooling_type: str = "auto", max_embed_len: int = 3072000, enable_chunked_processing: bool = True, allow_external_egress: bool = False ) -> Chute ``` **Example:** ```python from chutes.chute.template.embedding import build_embedding_chute from chutes.chute import NodeSelector chute = build_embedding_chute( username="myuser", model_name="BAAI/bge-large-en-v1.5", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16 ), pooling_type="auto", # Auto-detect optimal pooling concurrency=32 ) ``` **Pooling Types:** - `"auto"` - Auto-detect based on model name - `"MEAN"` - Mean pooling (E5, Jina models) - `"CLS"` - CLS token pooling (BGE models) - `"LAST"` - Last token pooling (GTE, Qwen models) **Provided Endpoints:** - `POST /v1/embeddings` - OpenAI-compatible embeddings endpoint ## Extending Templates Templates can be extended with custom functionality: ```python from chutes.chute.template import build_vllm_chute from chutes.chute import NodeSelector # Create base chute from template chute = build_vllm_chute( username="myuser", model_name="mistralai/Mistral-7B-Instruct-v0.3", node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=24) ) # Add custom endpoint @chute.cord(public_api_path="/summarize", public_api_method="POST") async def summarize(self, text: str) -> dict: """Summarize text using the loaded model.""" prompt = f"Summarize the following text:\n\n{text}\n\nSummary:" # Use the template's built-in generation result = await self.generate(prompt=prompt, max_tokens=200) return {"summary": result} # Add custom startup 
logic @chute.on_startup(priority=90) # Run after template initialization async def custom_setup(self): """Custom initialization after model loads.""" print("Custom setup complete!") ``` ## Model-Specific Configurations ### Mistral Models ```python chute = build_vllm_chute( username="myuser", model_name="mistralai/Mistral-7B-Instruct-v0.3", node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=24), engine_args={ "tokenizer_mode": "mistral", "config_format": "mistral", "load_format": "mistral", "tool_call_parser": "mistral", "enable_auto_tool_choice": True } ) ``` ### Llama Models ```python chute = build_vllm_chute( username="myuser", model_name="meta-llama/Llama-2-70b-chat-hf", node_selector=NodeSelector( gpu_count=4, min_vram_gb_per_gpu=48 ), engine_args={ "max_model_len": 4096, "gpu_memory_utilization": 0.95, "tensor_parallel_size": 4 } ) ``` ### DeepSeek Models ```python from chutes.chute.template.sglang import build_sglang_chute chute = build_sglang_chute( username="myuser", model_name="deepseek-ai/DeepSeek-R1", node_selector=NodeSelector( gpu_count=8, min_vram_gb_per_gpu=141, include=["h200"] ), engine_args={ "tp_size": 8, "trust_remote_code": True, "context_length": 65536 } ) ``` ### FLUX Image Generation ```python from chutes.chute.template.diffusion import build_diffusion_chute chute = build_diffusion_chute( username="myuser", model_name="black-forest-labs/FLUX.1-dev", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=48 ), engine_args={ "torch_dtype": "bfloat16", "guidance_scale": 3.5, "num_inference_steps": 28 } ) ``` ## Best Practices ### 1. Choose the Right Template ```python # For OpenAI-compatible LLM API vllm_chute = build_vllm_chute(...) # For structured generation and reasoning sglang_chute = build_sglang_chute(...) # For text embeddings embedding_chute = build_embedding_chute(...) # For image generation diffusion_chute = build_diffusion_chute(...) ``` ### 2. 
Match Hardware to Model ```python # 7B model - single GPU node_selector = NodeSelector(gpu_count=1, min_vram_gb_per_gpu=24) # 70B model - multiple GPUs with tensor parallelism node_selector = NodeSelector(gpu_count=4, min_vram_gb_per_gpu=48) engine_args = {"tensor_parallel_size": 4} ``` ### 3. Set Appropriate Concurrency ```python # vLLM/SGLang with continuous batching - high concurrency chute = build_vllm_chute(..., concurrency=64) # Image generation - low concurrency chute = build_diffusion_chute(..., concurrency=1) # Embeddings - medium-high concurrency chute = build_embedding_chute(..., concurrency=32) ``` ### 4. Use Auto-Scaling for Production ```python chute = build_vllm_chute( ..., max_instances=10, scaling_threshold=0.75, shutdown_after_seconds=300 ) ``` ## See Also - **[Chute Class](/docs/sdk-reference/chute)** - Chute class reference - **[NodeSelector](/docs/sdk-reference/node-selector)** - Hardware requirements - **[vLLM Template Guide](/docs/templates/vllm)** - Detailed vLLM documentation - **[SGLang Template Guide](/docs/templates/sglang)** - Detailed SGLang documentation - **[Diffusion Template Guide](/docs/templates/diffusion)** - Image generation guide --- ## SOURCE: https://chutes.ai/docs/api-reference/overview # API Reference Complete REST API reference for the Chutes platform. 
## Available APIs ### [Users](users) 38 endpoints ### [Chutes](chutes) 25 endpoints ### [Images](images) 5 endpoints ### [Nodes](nodes) 5 endpoints ### [Pricing](pricing) 6 endpoints ### [Instances](instances) 18 endpoints ### [Invocations](invocations) 6 endpoints ### [Authentication](authentication) 5 endpoints ### [Miner](miner) 16 endpoints ### [Logo](logo) 2 endpoints ### [Configguesser](configguesser) 1 endpoint ### [Audit](audit) 3 endpoints ### [Job](job) 7 endpoints ### [Secret](secret) 4 endpoints ### [Miscellaneous](miscellaneous) 2 endpoints ### [Servers](servers) 11 endpoints ### [Identity Provider](identity-provider) 22 endpoints ### [E2e Encryption](e2e-encryption) 2 endpoints ### [Model Aliases](model-aliases) 3 endpoints ### [General](general) 3 endpoints --- ## SOURCE: https://chutes.ai/docs/api-reference/audit # Audit API Reference This section covers all endpoints related to audit. ## Add Miner Audit Data **Endpoint:** `POST /audit/miner_data` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Block | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## List Audit Entries List all audit reports from the past week. **Endpoint:** `GET /audit/` ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | --- ## Download Audit Data Download report data. 
**Endpoint:** `GET /audit/download` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | path | string | Yes | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- --- ## SOURCE: https://chutes.ai/docs/api-reference/authentication # Authentication API Reference This section covers all endpoints related to authentication. ## Registry Auth Authenticates registry/docker pull requests. **Endpoint:** `GET /registry/auth` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## List Keys List (and optionally filter/paginate) keys. **Endpoint:** `GET /api_keys/` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | page | integer \| null | No | | | limit | integer \| null | No | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Create Api Key Create a new API key. 
**Endpoint:** `POST /api_keys/` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Request Body | Field | Type | Required | Description | |-------|------|----------|-------------| | admin | boolean | Yes | | | name | string | Yes | | | scopes | ScopeArgs[] \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Get Key Get a single key. **Endpoint:** `GET /api_keys/{api_key_id}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | api_key_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Delete Api Key Delete an API key by ID. **Endpoint:** `DELETE /api_keys/{api_key_id}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | api_key_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. 
---

---

## SOURCE: https://chutes.ai/docs/api-reference/chutes

# Chutes API Reference

This section covers all endpoints related to chutes.

## Share Chute

Share a chute with another user.

**Endpoint:** `POST /chutes/share`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Request Body

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| chute_id_or_name | string | Yes | |
| user_id_or_name | string | Yes | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

## Unshare Chute

Unshare a chute with another user.

**Endpoint:** `POST /chutes/unshare`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Request Body

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| chute_id_or_name | string | Yes | |
| user_id_or_name | string | Yes | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

## Make Public

Promote subnet chutes to public visibility, owned by the calling subnet admin user.

**Endpoint:** `POST /chutes/make_public`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Request Body

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| chutes | string[] | Yes | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

## List Boosted Chutes

Get a list of chutes that have a boost.

**Endpoint:** `GET /chutes/boosted`

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |

---

## List Available Affine Chutes

Get a list of affine chutes where the creator/user has a non-zero balance.

**Endpoint:** `GET /chutes/affine_available`

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |

---

## List Chutes

List (and optionally filter/paginate) chutes.

**Endpoint:** `GET /chutes/`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| include_public | boolean \| null | No | |
| template | string \| null | No | |
| name | string \| null | No | |
| exclude | string \| null | No | |
| image | string \| null | No | |
| slug | string \| null | No | |
| page | integer | No | |
| limit | integer | No | |
| offset | integer | No | |
| include_schemas | boolean \| null | No | |
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

## Deploy Chute

Standard deploy from the CDK.

**Endpoint:** `POST /chutes/`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| accept_fee | boolean \| null | No | |
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Request Body

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| name | string | Yes | |
| tagline | string \| null | No | |
| readme | string \| null | No | |
| tool_description | string \| null | No | |
| logo_id | string \| null | No | |
| image | string | Yes | |
| public | boolean | Yes | |
| code | string | Yes | |
| filename | string | Yes | |
| ref_str | string | Yes | |
| standard_template | string \| null | No | |
| node_selector | NodeSelector | Yes | |
| cords | Cord[] \| null | No | |
| jobs | Job[] \| null | No | |
| concurrency | integer \| null | No | |
| revision | string \| null | No | |
| max_instances | integer \| null | No | |
| scaling_threshold | number \| null | No | |
| shutdown_after_seconds | integer \| null | No | |
| allow_external_egress | boolean \| null | No | |
| encrypted_fs | boolean \| null | No | |
| tee | boolean \| null | No | |
| lock_modules | boolean \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

## List Rolling Updates

**Endpoint:** `GET /chutes/rolling_updates`

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |

---

## Get Gpu Count History

**Endpoint:** `GET /chutes/gpu_count_history`

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |

---

## Get Chute Miner Mean Index

**Endpoint:** `GET /chutes/miner_means`

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |

---

## Get Chute Miner Means

Load a chute's mean TPS and output token count by miner ID.

**Endpoint:** `GET /chutes/miner_means/{chute_id}.{ext}`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| chute_id | string | Yes | |
| ext | string \| null | Yes | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

---

## Get Chute Miner Means

Load a chute's mean TPS and output token count by miner ID.

**Endpoint:** `GET /chutes/miner_means/{chute_id}`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| chute_id | string | Yes | |
| ext | string \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

---

## Get Chute Code

Load a chute's code by ID or name.
**Endpoint:** `GET /chutes/code/{chute_id}`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| chute_id | string | Yes | |
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

## Get Chute Hf Info

Return Hugging Face repo_id and revision for a chute so miners can predownload the model. Miner-only; responses are cached by chute_id via aiocache.

**Endpoint:** `GET /chutes/{chute_id}/hf_info`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| chute_id | string | Yes | |
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

## Warm Up Chute

Warm up a chute.

**Endpoint:** `GET /chutes/warmup/{chute_id_or_name}`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| chute_id_or_name | string | Yes | |
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

## Get Chute Utilization

Get chute utilization data from the most recent capacity log.

**Endpoint:** `GET /chutes/utilization`

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |

---

## Get Tee Chute Evidence

Get TEE evidence for all instances of a chute (TDX quote, GPU evidence, certificate per instance).

Args:
- chute_id_or_name: Chute ID or name
- nonce: User-provided nonce (64 hex characters, 32 bytes)

Returns: TeeChuteEvidence with an array of TEE evidence per instance.

Raises:
- 404: Chute not found
- 400: Invalid nonce format or chute not TEE-enabled
- 403: User cannot access chute
- 429: Rate limit exceeded
- 500: Server attestation failures

**Endpoint:** `GET /chutes/{chute_id_or_name}/evidence`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| chute_id_or_name | string | Yes | |
| nonce | string | Yes | |
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

## Get Chute

Load a chute by ID or name.

**Endpoint:** `GET /chutes/{chute_id_or_name}`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| chute_id_or_name | string | Yes | |
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

## Update Common Attributes

Update readme, tagline, etc. (but not code, image, etc.).

**Endpoint:** `PUT /chutes/{chute_id_or_name}`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| chute_id_or_name | string | Yes | |
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Request Body

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| tagline | string \| null | No | |
| readme | string \| null | No | |
| tool_description | string \| null | No | |
| logo_id | string \| null | No | |
| max_instances | integer \| null | No | |
| scaling_threshold | number \| null | No | |
| shutdown_after_seconds | integer \| null | No | |
| disabled | boolean \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

## Delete Chute

Delete a chute by ID.

**Endpoint:** `DELETE /chutes/{chute_id}`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| chute_id | string | Yes | |
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

## Easy Deploy Vllm Chute

Easy/templated vLLM deployment.
**Endpoint:** `POST /chutes/vllm`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Request Body

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| model | string | Yes | |
| logo_id | string \| null | No | |
| tagline | string \| null | No | |
| tool_description | string \| null | No | |
| readme | string \| null | No | |
| public | boolean \| null | No | |
| node_selector | NodeSelector \| null | No | |
| engine_args | VLLMEngineArgs \| null | No | |
| revision | string \| null | No | |
| concurrency | integer \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

## Easy Deploy Diffusion Chute

Easy/templated diffusion deployment.

**Endpoint:** `POST /chutes/diffusion`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Request Body

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| model | string | Yes | |
| name | string | Yes | |
| logo_id | string \| null | No | |
| tagline | string \| null | No | |
| tool_description | string \| null | No | |
| readme | string \| null | No | |
| public | boolean \| null | No | |
| node_selector | NodeSelector \| null | No | |
| concurrency | integer \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

## Teeify Chute

Create a new TEE-enabled chute from an existing affine chute.

**Endpoint:** `PUT /chutes/{chute_id}/teeify`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| chute_id | string | Yes | |
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

## Get Bounty List

List available bounties, if any.

**Endpoint:** `GET /bounties/`

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |

---

## Increase Chute Bounty

Increase a chute's bounty value (creating it if it does not exist).

**Endpoint:** `GET /bounties/{chute_id}/increase`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| chute_id | string | Yes | |
| boost | number \| null | No | |
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

---

## SOURCE: https://chutes.ai/docs/api-reference/configguesser

# Configguesser API Reference

This section covers all endpoints related to configguesser.

## Analyze Model

Attempt to guess the required GPU count and VRAM for a model on Hugging Face, assuming safetensors format.

**Endpoint:** `GET /guess/vllm_config`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| model | string | Yes | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

---

---

## SOURCE: https://chutes.ai/docs/api-reference/e2e-encryption

# E2e Encryption API Reference

This section covers all endpoints related to e2e encryption.

## Get E2E Instances

Discover E2E-capable instances for a chute and get nonces for invocation.

**Endpoint:** `GET /e2e/instances/{chute_id}`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| chute_id | string | Yes | |
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.
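The `/guess/vllm_config` endpoint above takes a single required `model` query parameter (a Hugging Face repo ID). A minimal stdlib sketch of building that request URL — the helper name is illustrative:

```python
from urllib.parse import urlencode

def vllm_config_url(model: str, base: str = "https://api.chutes.ai") -> str:
    """Build the GET /guess/vllm_config URL; urlencode percent-escapes the repo slash."""
    return f"{base}/guess/vllm_config?" + urlencode({"model": model})

# e.g. vllm_config_url("org/model-7b")
# → "https://api.chutes.ai/guess/vllm_config?model=org%2Fmodel-7b"
```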
---

## E2E Invoke

Relay an E2E encrypted invocation to a specific instance.

**Endpoint:** `POST /e2e/invoke`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| X-Chute-Id | string | Yes | |
| X-Instance-Id | string | Yes | |
| X-E2E-Nonce | string | Yes | |
| X-E2E-Stream | string | No | |
| X-E2E-Path | string | No | |
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

---

## SOURCE: https://chutes.ai/docs/api-reference/general

# General API Reference

This section covers all endpoints related to general.

## Ping

**Endpoint:** `GET /ping`

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |

---

## Get Latest Metrics

**Endpoint:** `GET /_metrics`

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |

---

## Openid Configuration

Root OpenID Connect Discovery endpoint.

**Endpoint:** `GET /.well-known/openid-configuration`

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |

---

---

## SOURCE: https://chutes.ai/docs/api-reference/identity-provider

# Identity Provider API Reference

This section covers all endpoints related to identity provider.

## List Scopes

List all available OAuth2 scopes with descriptions. This endpoint is public and can be used for documentation or scope selection UIs.

**Endpoint:** `GET /idp/scopes`

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |

---

## Get Cli Login Nonce

Get a nonce for CLI-based hotkey signature login.
**Endpoint:** `GET /idp/cli_login/nonce`

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |

---

## Cli Login

CLI login endpoint for hotkey signature authentication.

**Endpoint:** `GET /idp/cli_login`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| hotkey | string | Yes | |
| signature | string | Yes | |
| nonce | string | Yes | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

---

## List Apps

List OAuth applications. By default, returns apps owned by the current user, public apps, and apps shared with the user. Set include_public=false to exclude public apps. Set include_shared=false to exclude apps shared with the user. Use search to filter by name or description.

**Endpoint:** `GET /idp/apps`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| include_public | boolean \| null | No | |
| include_shared | boolean \| null | No | |
| search | string \| null | No | |
| page | integer \| null | No | |
| limit | integer \| null | No | |
| user_id | string \| null | No | |
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

## Create App

Create a new OAuth application.

**Endpoint:** `POST /idp/apps`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Request Body

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| name | string | Yes | |
| description | string \| null | No | |
| redirect_uris | string[] | Yes | |
| homepage_url | string \| null | No | |
| logo_url | string \| null | No | |
| public | boolean | No | |
| refresh_token_lifetime_days | integer \| null | No | |
| allowed_scopes | string[] \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

## Get App

Get details of an OAuth application.

**Endpoint:** `GET /idp/apps/{app_id}`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| app_id | string | Yes | |
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

## Update App

Update an OAuth application.

**Endpoint:** `PATCH /idp/apps/{app_id}`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| app_id | string | Yes | |
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Request Body

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| name | string \| null | No | |
| description | string \| null | No | |
| redirect_uris | string[] \| null | No | |
| homepage_url | string \| null | No | |
| logo_url | string \| null | No | |
| active | boolean \| null | No | |
| public | boolean \| null | No | |
| refresh_token_lifetime_days | integer \| null | No | |
| allowed_scopes | string[] \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

## Delete App

Delete an OAuth application.

**Endpoint:** `DELETE /idp/apps/{app_id}`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| app_id | string | Yes | |
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

## Regenerate App Secret

Regenerate the client secret for an OAuth application.

**Endpoint:** `POST /idp/apps/{app_id}/regenerate-secret`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| app_id | string | Yes | |
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

## Share App

Share an OAuth application with another user.

**Endpoint:** `POST /idp/apps/{app_id}/share`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| app_id | string | Yes | |
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Request Body

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| app_id_or_name | string | Yes | |
| user_id_or_name | string | Yes | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

## Unshare App

Remove sharing of an OAuth application with a user.

**Endpoint:** `DELETE /idp/apps/{app_id}/share/{user_id}`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| app_id | string | Yes | |
| user_id | string | Yes | |
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.
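Clients of the authorization endpoint below that use its `code_challenge`/`code_challenge_method` parameters derive the S256 challenge from a random verifier, as defined in RFC 7636 (PKCE). A stdlib sketch — the function name is illustrative:

```python
import base64
import hashlib
import secrets

def make_pkce_pair() -> tuple[str, str]:
    """Return (code_verifier, code_challenge) for code_challenge_method=S256:
    the challenge is the unpadded base64url-encoded SHA-256 of the verifier."""
    verifier = secrets.token_urlsafe(64)  # 43-128 characters, per RFC 7636
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return verifier, challenge
```

The verifier stays client-side and is presented later to the token endpoint; only the challenge is sent in the authorize request.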
---

## List App Shares

List users an OAuth application is shared with.

**Endpoint:** `GET /idp/apps/{app_id}/shares`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| app_id | string | Yes | |
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

## List Authorizations

List apps the current user has authorized.

**Endpoint:** `GET /idp/authorizations`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| page | integer \| null | No | |
| limit | integer \| null | No | |
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

## Revoke App Authorization

Revoke authorization for an app.

**Endpoint:** `DELETE /idp/authorizations/{app_id}`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| app_id | string | Yes | |
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

## Authorize Get

OAuth2 Authorization Endpoint. Displays the login page if not authenticated, or the consent page if authenticated. Checks for an existing chutes-session-token cookie for SSO.

**Endpoint:** `GET /idp/authorize`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| response_type | string | Yes | |
| client_id | string | Yes | |
| redirect_uri | string | Yes | |
| scope | string \| null | No | |
| state | string \| null | No | |
| code_challenge | string \| null | No | |
| code_challenge_method | string \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

---

## Login Post

Handle login form submission.

**Endpoint:** `POST /idp/login`

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

---

## Authorize Consent Page

Show the authorization consent page.

**Endpoint:** `GET /idp/authorize/consent`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| session_id | string | Yes | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

---

## Authorize Consent

Handle authorization consent form submission.

**Endpoint:** `POST /idp/authorize/consent`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| session_id | string | Yes | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

---

## Token Endpoint

OAuth2 Token Endpoint.

**Endpoint:** `POST /idp/token`

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

---

## Revoke Token Endpoint

OAuth2 Token Revocation Endpoint (RFC 7009).

**Endpoint:** `POST /idp/token/revoke`

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

---

## Userinfo Endpoint

OpenID Connect UserInfo Endpoint.

**Endpoint:** `GET /idp/userinfo`

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |

---

## Introspect Token

OAuth2 Token Introspection Endpoint (RFC 7662). The token format includes an embedded token_id for O(1) lookup, so client auth is optional. Allows clients to check if a token is still valid and get metadata about it. Useful for determining if a user needs to re-authenticate.

Returns:
- active: Whether the token is currently valid
- exp: Expiration timestamp (Unix epoch)
- iat: Issued-at timestamp
- scope: Space-separated list of scopes
- client_id: The client that the token was issued to
- username: The user's username
- sub: The user's ID

**Endpoint:** `POST /idp/token/introspect`

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

---

---

## SOURCE: https://chutes.ai/docs/api-reference/images

# Images API Reference

This section covers all endpoints related to images.

## Stream Build Logs

**Endpoint:** `GET /images/{image_id}/logs`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| image_id | string | Yes | |
| offset | string \| null | No | |
| X-Chutes-Hotkey | string \| null | No | |
| X-Chutes-Signature | string \| null | No | |
| X-Chutes-Nonce | string \| null | No | |
| Authorization | string \| null | No | |

### Responses

| Status Code | Description |
|-------------|-------------|
| 200 | Successful Response |
| 422 | Validation Error |

### Authentication

This endpoint requires authentication.

---

## List Images

List (and optionally filter/paginate) images.
**Endpoint:** `GET /images/` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | include_public | boolean \| null | No | | | name | string \| null | No | | | tag | string \| null | No | | | page | integer \| null | No | | | limit | integer \| null | No | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Create Image Create an image; really here we're just storing the metadata in the DB and kicking off the image build asynchronously. **Endpoint:** `POST /images/` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 202 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Get Image Load a single image by ID or name. **Endpoint:** `GET /images/{image_id_or_name}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | image_id_or_name | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Delete Image Delete an image by ID or name:tag. 
**Endpoint:** `DELETE /images/{image_id_or_name}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | image_id_or_name | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- --- ## SOURCE: https://chutes.ai/docs/api-reference/instances # Instances API Reference This section covers all endpoints related to instances. ## Get Instance Reconciliation Csv Get all instance audit records (instance_id, deleted_at) to help reconcile audit data. **Endpoint:** `GET /instances/reconciliation_csv` ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | --- ## Get Instance Compute History Csv Get instance_compute_history records for the scoring period (last 7 days + buffer). Used by the auditor to reconcile compute history data on startup. **Endpoint:** `GET /instances/compute_history_csv` ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | --- ## Get Launch Config **Endpoint:** `GET /instances/launch_config` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | chute_id | string | Yes | | | server_id | string \| null | No | | | job_id | string \| null | No | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication.
--- ## Get Rint Nonce Get runtime integrity nonce for a launch config. This endpoint consumes the nonce from Redis (one-time use). Only available for chutes_version >= 0.4.9. **Endpoint:** `GET /instances/launch_config/{config_id}/nonce` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | config_id | string | Yes | | | Authorization | string | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error |
--- ## Claim Tee Launch Config Claim a TEE launch config, verify attestation, and receive symmetric key. **Endpoint:** `POST /instances/launch_config/{config_id}/tee` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | config_id | string | Yes | | | Authorization | string | No | | | X-Chutes-Nonce | string \| null | No | | ### Request Body | Field | Type | Required | Description | |-------|------|----------|-------------| | gpus | object[] | Yes | | | host | string | Yes | | | port_mappings | PortMap[] | Yes | | | fsv | string \| null | No | | | egress | boolean \| null | No | | | lock_modules | boolean \| null | No | | | netnanny_hash | string \| null | No | | | run_path | string \| null | No | | | py_dirs | string[] \| null | No | | | rint_commitment | string \| null | No | | | rint_nonce | string \| null | No | | | rint_pubkey | string \| null | No | | | tls_cert | string \| null | No | | | tls_cert_sig | string \| null | No | | | tls_ca_cert | string \| null | No | | | tls_client_cert | string \| null | No | | | tls_client_key | string \| null | No | | | tls_client_key_password | string \| null | No | | | e2e_pubkey | string \| null | No | | | cllmv_session_init | string \| null | No | | | env | string | Yes | | | code | string \| null | No | | | run_code | string \| null | No | | | inspecto | string \| null | No | | | deployment_id | string | Yes | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error |
--- ## Verify Tee Launch Config Instance Verify TEE launch config instance by validating symmetric key usage via dummy ports. **Endpoint:** `PUT /instances/launch_config/{config_id}/tee` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | config_id | string | Yes | | | Authorization | string | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error |
--- ## Validate Tee Launch Config Instance **Endpoint:** `POST /instances/launch_config/{config_id}/attest` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | config_id | string | Yes | | | Authorization | string | No | | | X-Chutes-Nonce | string \| null | No | | ### Request Body | Field | Type | Required | Description | |-------|------|----------|-------------| | gpus | object[] | Yes | | | host | string | Yes | | | port_mappings | PortMap[] | Yes | | | fsv | string \| null | No | | | egress | boolean \| null | No | | | lock_modules | boolean \| null | No | | | netnanny_hash | string \| null | No | | | run_path | string \| null | No | | | py_dirs | string[] \| null | No | | | rint_commitment | string \| null | No | | | rint_nonce | string \| null | No | | | rint_pubkey | string \| null | No | | | tls_cert | string \| null | No | | | tls_cert_sig | string \| null | No | | | tls_ca_cert | string \| null | No | | | tls_client_cert | string \| null | No | | | tls_client_key | string \| null | No | | | tls_client_key_password | string \| null | No | | | e2e_pubkey | string \| null | No | | | cllmv_session_init | string \| null | No | | | env | string | Yes | | | code | string \| null | No | | | run_code | string \| null | No | | | inspecto | string \| null | No | | | gpu_evidence | object[] | Yes | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error |
--- ## Claim Launch Config **Endpoint:** `POST /instances/launch_config/{config_id}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | config_id | string | Yes | | | Authorization | string | No | | ### Request Body | Field | Type | Required | Description | |-------|------|----------|-------------| | gpus | object[] | Yes | | | host | string | Yes | | | port_mappings | PortMap[] | Yes | | | fsv | string \| null | No | | | egress | boolean \| null | No | | | lock_modules | boolean \| null | No | | | netnanny_hash | string \| null | No | | | run_path | string \| null | No | | | py_dirs | string[] \| null | No | | | rint_commitment | string \| null | No | | | rint_nonce | string \| null | No | | | rint_pubkey | string \| null | No | | | tls_cert | string \| null | No | | | tls_cert_sig | string \| null | No | | | tls_ca_cert | string \| null | No | | | tls_client_cert | string \| null | No | | | tls_client_key | string \| null | No | | | tls_client_key_password | string \| null | No | | | e2e_pubkey | string \| null | No | | | cllmv_session_init | string \| null | No | | | env | string | Yes | | | code | string \| null | No | | | run_code | string \| null | No | | | inspecto | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error |
--- ## Verify Launch Config Instance **Endpoint:** `PUT /instances/launch_config/{config_id}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | config_id | string | Yes | | | Authorization | string | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error |
--- ## Claim Graval Launch Config Claim a Graval launch config and receive PoVW challenge. **Endpoint:** `POST /instances/launch_config/{config_id}/graval` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | config_id | string | Yes | | | Authorization | string | No | | ### Request Body | Field | Type | Required | Description | |-------|------|----------|-------------| | gpus | object[] | Yes | | | host | string | Yes | | | port_mappings | PortMap[] | Yes | | | fsv | string \| null | No | | | egress | boolean \| null | No | | | lock_modules | boolean \| null | No | | | netnanny_hash | string \| null | No | | | run_path | string \| null | No | | | py_dirs | string[] \| null | No | | | rint_commitment | string \| null | No | | | rint_nonce | string \| null | No | | | rint_pubkey | string \| null | No | | | tls_cert | string \| null | No | | | tls_cert_sig | string \| null | No | | | tls_ca_cert | string \| null | No | | | tls_client_cert | string \| null | No | | | tls_client_key | string \| null | No | | | tls_client_key_password | string \| null | No | | | e2e_pubkey | string \| null | No | | | cllmv_session_init | string \| null | No | | | env | string | Yes | | | code | string \| null | No | | | run_code | string \| null | No | | | inspecto | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error |
--- ## Verify Graval Launch Config Instance Verify Graval launch config instance by validating PoVW proof and symmetric key usage.
**Endpoint:** `PUT /instances/launch_config/{config_id}/graval` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | config_id | string | Yes | | | Authorization | string | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- ## Activate Launch Config Instance **Endpoint:** `GET /instances/launch_config/{config_id}/activate` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | config_id | string | Yes | | | Authorization | string | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- ## Get Instance Nonce Generate a nonce for TEE instance verification. This endpoint is called by chute instances during TEE verification (Phase 1). The nonce is used to bind the attestation evidence to this specific verification request. **Endpoint:** `GET /instances/nonce` ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | --- ## Get Token **Endpoint:** `GET /instances/token_check` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | salt | string | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- ## Get Tee Instance Evidence Get TEE evidence for a specific instance (TDX quote, GPU evidence, certificate). 
Args: instance_id (the instance ID) and nonce (a user-provided nonce: 64 hex characters, 32 bytes). Returns: TeeInstanceEvidence with quote, gpu_evidence, and certificate. Raises: 404 if the instance is not found; 400 for an invalid nonce format or a non-TEE-enabled instance; 403 if the user cannot access the instance; 429 if the rate limit is exceeded; 500 on server attestation failures. **Endpoint:** `GET /instances/{instance_id}/evidence` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | instance_id | string | Yes | | | nonce | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Stream Logs Fetch raw Kubernetes pod logs. NOTE: These are pod logs, not request data, so they will never include prompts, responses, etc. Used for troubleshooting and checking the status of warmup, etc. **Endpoint:** `GET /instances/{instance_id}/logs` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | instance_id | string | Yes | | | backfill | integer \| null | No | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication.
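The evidence endpoint above expects a caller-supplied nonce of exactly 64 hex characters (32 bytes). A minimal sketch of generating and validating such a nonce (the helper names are illustrative):

```python
import re
import secrets

NONCE_RE = re.compile(r"^[0-9a-fA-F]{64}$")

def make_evidence_nonce() -> str:
    # 32 random bytes rendered as 64 lowercase hex characters
    return secrets.token_hex(32)

def is_valid_evidence_nonce(nonce: str) -> bool:
    # Matches the documented format: 64 hex characters (32 bytes)
    return NONCE_RE.fullmatch(nonce) is not None

nonce = make_evidence_nonce()
print(len(nonce), is_valid_evidence_nonce(nonce))  # 64 True
```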
--- ## Disable Instance Endpoint **Endpoint:** `POST /instances/{chute_id}/{instance_id}/disable` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | chute_id | string | Yes | | | instance_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Delete Instance **Endpoint:** `DELETE /instances/{chute_id}/{instance_id}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | chute_id | string | Yes | | | instance_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- --- ## SOURCE: https://chutes.ai/docs/api-reference/invocations # Invocations API Reference This section covers all endpoints related to invocations. ## Get Usage Get aggregated usage data, i.e., the revenue that would be received if no usage were free.
**Endpoint:** `GET /invocations/usage` ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | --- ## Get Llm Stats **Endpoint:** `GET /invocations/stats/llm` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | start_date | string | No | | | end_date | string | No | | | chute_id | string | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- ## Get Diffusion Stats **Endpoint:** `GET /invocations/stats/diffusion` ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | --- ## Get Export Get invocation exports (and reports) for a particular hour. **Endpoint:** `GET /invocations/exports/{year}/{month}/{day}/{hour_format}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | year | integer | Yes | | | month | integer | Yes | | | day | integer | Yes | | | hour_format | string | Yes | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- ## Get Recent Export Get an export for recent data, which may not yet be in S3. 
**Endpoint:** `GET /invocations/exports/recent` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | hotkey | string \| null | No | | | limit | integer \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- ## Report Invocation **Endpoint:** `POST /invocations/{invocation_id}/report` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | invocation_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Request Body | Field | Type | Required | Description | |-------|------|----------|-------------| | reason | string | Yes | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- --- ## SOURCE: https://chutes.ai/docs/api-reference/job # Job API Reference This section covers all endpoints related to job. ## Create Job Create a job. **Endpoint:** `POST /jobs/{chute_id}/{method}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | chute_id | string | Yes | | | method | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Delete Job Delete a job. 
**Endpoint:** `DELETE /jobs/{job_id}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | job_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Finish Job And Get Upload Targets Mark a job as complete (which could be failed; "done" either way) **Endpoint:** `POST /jobs/{job_id}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | job_id | string | Yes | | | token | string | Yes | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- ## Complete Job Final update, which checks the file uploads to see which were successfully transferred etc. **Endpoint:** `PUT /jobs/{job_id}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | job_id | string | Yes | | | token | string | Yes | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- ## Get Job Get a job. **Endpoint:** `GET /jobs/{job_id}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | job_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. 
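The job lifecycle documented above ends with two token-authorized calls: `POST /jobs/{job_id}` to mark the job done and get upload targets, then `PUT /jobs/{job_id}` for the final update once output files are uploaded. A sketch of assembling those two requests; note the assumption that `token` travels as a query parameter, since the tables above document it only as a required string:

```python
from urllib.parse import urlencode

# Illustrative: build the (method, url) pairs for finishing and then
# completing a job. Passing `token` as a query parameter is an assumption;
# the reference documents it only as a required string parameter.
def job_finish_requests(job_id: str, token: str,
                        base_url: str = "https://api.chutes.ai") -> list[tuple[str, str]]:
    url = f"{base_url}/jobs/{job_id}?{urlencode({'token': token})}"
    return [
        ("POST", url),  # mark the job done and receive upload targets
        ("PUT", url),   # final update after output files are uploaded
    ]

for method, url in job_finish_requests("job-42", "tok_abc"):
    print(method, url)
```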
--- ## Upload Job File Upload a job's output file. **Endpoint:** `PUT /jobs/{job_id}/upload` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | job_id | string | Yes | | | token | string | Yes | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- ## Download Output File Download a job's output file. **Endpoint:** `GET /jobs/{job_id}/download/{file_id}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | job_id | string | Yes | | | file_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- --- ## SOURCE: https://chutes.ai/docs/api-reference/logo # Logo API Reference This section covers all endpoints related to logo. ## Create Logo Create/upload a new logo. **Endpoint:** `POST /logos/` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Render Logo Logo image response. 
**Endpoint:** `GET /logos/{logo_id}.{extension}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | logo_id | string | Yes | | | extension | string | Yes | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- --- ## SOURCE: https://chutes.ai/docs/api-reference/miner # Miner API Reference This section covers all endpoints related to miner. ## List Chutes **Endpoint:** `GET /miner/chutes/` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## List Images **Endpoint:** `GET /miner/images/` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. 
--- ## List Nodes **Endpoint:** `GET /miner/nodes/` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## List Servers List all servers for the authenticated miner, with nested GPU info. Provides full visibility into server inventory (including servers with no GPUs, duplicate IPs, or name collisions). **Endpoint:** `GET /miner/servers/` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## List Instances **Endpoint:** `GET /miner/instances/` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | explicit_null | boolean \| null | No | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. 
--- ## List Available Jobs **Endpoint:** `GET /miner/jobs/` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Release Job **Endpoint:** `DELETE /miner/jobs/{job_id}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | job_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Get Full Inventory **Endpoint:** `GET /miner/inventory` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. 
--- ## Metrics **Endpoint:** `GET /miner/metrics/` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- ## List Active Instances Get all active instances across the platform. Used by miners to make informed preemption decisions based on global state. **Endpoint:** `GET /miner/active_instances/` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Get Chute **Endpoint:** `GET /miner/chutes/{chute_id}/{version}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | chute_id | string | Yes | | | version | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Get Stats Get miner stats over different intervals based on instance data (matching actual scoring). Returns instance-based metrics (total_instances, compute_seconds, compute_units, bounty_count) which align with how miners are actually scored for validator weights. 
**Endpoint:** `GET /miner/stats` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | miner_hotkey | string \| null | No | | | per_chute | boolean \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- ## Get Scores **Endpoint:** `GET /miner/scores` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | hotkey | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- ## Unique Chute History **Endpoint:** `GET /miner/unique_chute_history/{hotkey}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | hotkey | string | Yes | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- ## Get Thrash Cooldowns Return all chutes where this miner is currently in a thrash cooldown, along with when the cooldown expires. **Endpoint:** `GET /miner/thrash_cooldowns` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. 
--- ## Get Metagraph **Endpoint:** `GET /miner/metagraph` ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | --- --- ## SOURCE: https://chutes.ai/docs/api-reference/miscellaneous # Miscellaneous API Reference This section covers all endpoints related to miscellaneous. ## Proxy **Endpoint:** `GET /misc/proxy` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | url | string | Yes | | | stream | boolean | No | Stream the response for large files/videos | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- ## Get Hf Repo Info Proxy endpoint for HF repo file info. **Endpoint:** `GET /misc/hf_repo_info` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | repo_id | string | Yes | | | repo_type | string | No | | | revision | string | No | | | hf_token | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- --- ## SOURCE: https://chutes.ai/docs/api-reference/model-aliases # Model Aliases API Reference This section covers all endpoints related to model aliases. ## List Aliases **Endpoint:** `GET /model_aliases/` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. 
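Model aliases map a single name onto one or more chute IDs through a JSON body containing `alias` and `chute_ids` (see the create/update endpoint below). A minimal sketch of assembling that payload; the helper name and the non-empty check are illustrative, not part of the API:

```python
import json

# Illustrative: build the POST /model_aliases/ request body
# ({"alias": ..., "chute_ids": [...]}), with a light sanity check.
def alias_payload(alias: str, chute_ids: list[str]) -> bytes:
    if not chute_ids:
        raise ValueError("an alias must map to at least one chute_id")
    return json.dumps({"alias": alias, "chute_ids": chute_ids}).encode()

body = alias_payload("my-llm", ["chute-a", "chute-b"])
print(json.loads(body))
```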
--- ## Create Or Update Alias **Endpoint:** `POST /model_aliases/` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Request Body | Field | Type | Required | Description | |-------|------|----------|-------------| | alias | string | Yes | | | chute_ids | string[] | Yes | | ### Responses | Status Code | Description | |-------------|-------------| | 201 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Delete Alias **Endpoint:** `DELETE /model_aliases/{alias}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | alias | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 204 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- --- ## SOURCE: https://chutes.ai/docs/api-reference/nodes # Nodes API Reference This section covers all endpoints related to nodes. ## List Nodes List full inventory, optionally in detailed view (which lists chutes). **Endpoint:** `GET /nodes/` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | model | string \| null | No | | | detailed | boolean \| null | No | | | hotkey | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- ## Create Nodes Add nodes/GPUs to inventory. 
**Endpoint:** `POST /nodes/` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Request Body | Field | Type | Required | Description | |-------|------|----------|-------------| | server_id | string | Yes | | | server_name | string \| null | No | | | nodes | NodeArgs[] | Yes | | ### Responses | Status Code | Description | |-------------|-------------| | 202 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## List Supported Gpus Show all currently supported GPUs. **Endpoint:** `GET /nodes/supported` ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | --- ## Check Verification Status Check taskiq task status, to see if the validator has finished GPU verification. **Endpoint:** `GET /nodes/verification_status` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | task_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Delete Node Remove a node from inventory. 
**Endpoint:** `DELETE /nodes/{node_id}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | node_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- --- ## SOURCE: https://chutes.ai/docs/api-reference/pricing # Pricing API Reference This section covers all endpoints related to pricing. ## Get Daily Revenue Summary Get the summary of daily revenue including paygo, invoiced users, subscriptions and pending private instances. **Endpoint:** `GET /daily_revenue_summary` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | days | integer \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- ## Get Tao Payment Totals Get the amount (as USD equivalent) of payments made by tao for today, the current month, and total. **Endpoint:** `GET /payments/summary/tao` ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | --- ## Get Fmv Get the current FMV for tao. **Endpoint:** `GET /fmv` ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | --- ## Get Pricing Get the current compute unit pricing. 
**Endpoint:** `GET /pricing` ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | --- ## Return Developer Deposit **Endpoint:** `POST /return_developer_deposit` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Request Body | Field | Type | Required | Description | |-------|------|----------|-------------| | address | string | Yes | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## List Payments List all payments. **Endpoint:** `GET /payments` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | page | integer \| null | No | | | limit | integer \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- --- ## SOURCE: https://chutes.ai/docs/api-reference/secret # Secret API Reference This section covers all endpoints related to secret. ## List Secrets List secrets. **Endpoint:** `GET /secrets/` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | page | integer \| null | No | | | limit | integer \| null | No | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Create Secret Create a secret (e.g. private HF token). 
**Endpoint:** `POST /secrets/` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Request Body | Field | Type | Required | Description | |-------|------|----------|-------------| | purpose | string | Yes | | | key | string | Yes | | | value | string | Yes | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Get Secret Load a single secret by ID. **Endpoint:** `GET /secrets/{secret_id}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | secret_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Delete Secret Delete a secret by ID. **Endpoint:** `DELETE /secrets/{secret_id}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | secret_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. 
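The secret endpoints above take a small JSON body. A sketch of building it, plus a log-safe copy since the value is sensitive (both helper names are mine, not SDK functions):

```python
def make_secret_payload(purpose: str, key: str, value: str) -> dict:
    """Body for POST /secrets/, e.g. storing a private Hugging Face token."""
    return {"purpose": purpose, "key": key, "value": value}

def redacted(payload: dict) -> dict:
    """Copy of a secret payload that is safe to log: the value is masked."""
    return {**payload, "value": "***"}
```

The `redacted` copy leaves the original payload untouched, so you can log the request shape without ever writing the secret value to disk.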
--- --- ## SOURCE: https://chutes.ai/docs/api-reference/servers # Servers API Reference This section covers all endpoints related to servers. ## Get Nonce Generate a nonce for boot attestation. This endpoint is called by VMs during boot before any registration. No authentication required as the VM doesn't exist in the system yet. **Endpoint:** `GET /servers/nonce` ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | --- ## Verify Boot Attestation Verify boot attestation and return LUKS passphrase. This endpoint verifies the TDX quote against expected boot measurements and returns the LUKS passphrase for disk decryption if valid. **Endpoint:** `POST /servers/boot/attestation` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Nonce | string \| null | No | | ### Request Body | Field | Type | Required | Description | |-------|------|----------|-------------| | quote | string | Yes | Base64 encoded TDX quote | | miner_hotkey | string | Yes | Miner hotkey that owns this VM | | vm_name | string | Yes | VM name/identifier | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- ## Get Cache Luks Passphrase Retrieve existing LUKS passphrase for cache volume encryption. This endpoint is called when the initramfs detects that the cache volume is already encrypted. It retrieves the passphrase that was previously generated for this VM configuration (miner_hotkey + vm_name). The hotkey must be provided as a query parameter. The boot token must be provided in the X-Boot-Token header. 
**Endpoint:** `GET /servers/{vm_name}/luks` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | vm_name | string | Yes | | | hotkey | string | Yes | | | X-Boot-Token | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- ## Sync Luks Passphrases Sync LUKS passphrases: VM sends volume list; API returns keys for existing volumes, creates keys for new volumes, rekeys volumes in rekey list, and prunes stored keys for volumes not in the list. Boot token is consumed after successful POST. **Endpoint:** `POST /servers/{vm_name}/luks` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | vm_name | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Boot-Token | string \| null | No | | ### Request Body | Field | Type | Required | Description | |-------|------|----------|-------------| | volumes | string[] | Yes | Volume names the VM is managing (defines full set) | | rekey | string[] \| null | No | Volume names that must receive new passphrases (no reuse); must be subset of volumes | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- ## Create Server Register a new server. This is called via CLI after the server has booted and decrypted its disk. Links the server to any existing boot attestation history via server ip. 
**Endpoint:** `POST /servers/` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Request Body | Field | Type | Required | Description | |-------|------|----------|-------------| | host | string | Yes | Public IP address or DNS Name of the server | | id | string | Yes | Server ID (e.g. k8s node uid) | | name | string \| null | No | Server name (defaults to server id if omitted) | | gpus | NodeArgs[] | Yes | GPU info for this server | ### Responses | Status Code | Description | |-------------|-------------| | 201 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Patch Server Name Update name for an existing server. Path is server_id; query param is the new name. The server row is updated when hotkey and server_id match. **Endpoint:** `PATCH /servers/{server_id}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | server_id | string | Yes | | | server_name | string | Yes | New VM name to set | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Get Server Details Get details for a specific server by miner hotkey and server id. 
**Endpoint:** `GET /servers/{server_id}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | server_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Remove Server Remove a server by miner hotkey and server id or VM name (path param server_name_or_id). **Endpoint:** `DELETE /servers/{server_name_or_id}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | server_name_or_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Get Runtime Nonce Generate a nonce for runtime attestation. **Endpoint:** `GET /servers/{server_id}/nonce` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | server_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Verify Runtime Attestation Verify runtime attestation with full measurement validation. 
**Endpoint:** `POST /servers/{server_id}/attestation` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | server_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Request Body | Field | Type | Required | Description | |-------|------|----------|-------------| | quote | string | Yes | Base64 encoded TDX quote | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Get Attestation Status Get current attestation status for a server by miner hotkey and server id. **Endpoint:** `GET /servers/{server_id}/attestation/status` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | server_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- --- ## SOURCE: https://chutes.ai/docs/api-reference/users # Users API Reference This section covers all endpoints related to users. 
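Both attestation flows in the Servers section above submit a base64-encoded TDX quote in the request body. A minimal sketch of preparing those bodies (the function names are illustrative):

```python
import base64

def boot_attestation_payload(raw_quote: bytes, miner_hotkey: str, vm_name: str) -> dict:
    """Body for POST /servers/boot/attestation (returns the LUKS passphrase if valid)."""
    return {
        "quote": base64.b64encode(raw_quote).decode(),
        "miner_hotkey": miner_hotkey,
        "vm_name": vm_name,
    }

def runtime_attestation_payload(raw_quote: bytes) -> dict:
    """Body for POST /servers/{server_id}/attestation (runtime verification)."""
    return {"quote": base64.b64encode(raw_quote).decode()}
```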
## Get User Growth **Endpoint:** `GET /users/growth` ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | --- ## List Chute Shares **Endpoint:** `GET /users/{user_id}/shares` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | user_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Admin User Id Lookup **Endpoint:** `GET /users/user_id_lookup` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | username | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Admin Balance Lookup **Endpoint:** `GET /users/{user_id_or_username}/balance` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | user_id_or_username | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. 
--- ## Admin Invoiced User List **Endpoint:** `GET /users/invoiced_user_list` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Admin Batch User Lookup **Endpoint:** `POST /users/batch_user_lookup` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Admin Balance Change **Endpoint:** `POST /users/admin_balance_change` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Request Body | Field | Type | Required | Description | |-------|------|----------|-------------| | user_id | string | Yes | | | amount | number | Yes | | | reason | string | Yes | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Balance Transfer Transfer balance from the authenticated user to a target user. Supports three authentication methods: 1. 
Hotkey authentication (X-Chutes-Hotkey + X-Chutes-Signature + X-Chutes-Nonce) 2. Admin API key (Authorization: cpk_...) 3. Fingerprint (Authorization: ) **Endpoint:** `POST /users/balance_transfer` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Request Body | Field | Type | Required | Description | |-------|------|----------|-------------| | user_id | string | Yes | | | amount | number \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- ## Grant Subnet Role **Endpoint:** `POST /users/grant_subnet_role` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Request Body | Field | Type | Required | Description | |-------|------|----------|-------------| | user | string | Yes | | | netuid | integer | Yes | | | admin | boolean | Yes | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. 
--- ## Revoke Subnet Role **Endpoint:** `POST /users/revoke_subnet_role` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Request Body | Field | Type | Required | Description | |-------|------|----------|-------------| | user | string | Yes | | | netuid | integer | Yes | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Admin Quotas Change **Endpoint:** `POST /users/{user_id}/quotas` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | user_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Admin Get User Quotas Load quotas for a user. **Endpoint:** `GET /users/{user_id}/quotas` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | user_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. 
--- ## Admin Quota Effective Date Change **Endpoint:** `PUT /users/{user_id}/quotas/{chute_id}/effective_date` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | user_id | string | Yes | | | chute_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Request Body | Field | Type | Required | Description | |-------|------|----------|-------------| | effective_date | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Admin Discounts Change **Endpoint:** `POST /users/{user_id}/discounts` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | user_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Admin List Discounts **Endpoint:** `GET /users/{user_id}/discounts` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | user_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. 
--- ## Admin Enable Invoicing **Endpoint:** `POST /users/{user_id}/enable_invoicing` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | user_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## My Quotas Load quotas for the current user. **Endpoint:** `GET /users/me/quotas` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## My Discounts Load discounts for the current user. **Endpoint:** `GET /users/me/discounts` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## My Price Overrides Load price overrides for the current user. 
**Endpoint:** `GET /users/me/price_overrides` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Chute Quota Usage Check the current quota usage for a chute. **Endpoint:** `GET /users/me/quota_usage/{chute_id}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | chute_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## My Subscription Usage Get current subscription usage and caps for the authenticated user. Returns monthly and 4-hour window usage vs limits. **Endpoint:** `GET /users/me/subscription_usage` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Delete My User Delete account. 
**Endpoint:** `DELETE /users/me` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | Authorization | string | Yes | Authorization header | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- ## Set Logo Set the logo for the current user. **Endpoint:** `GET /users/set_logo` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | logo_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Check Username Check if a username is valid and available. **Endpoint:** `GET /users/name_check` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | username | string | Yes | | | readonly | boolean \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- ## Register Register a user. 
**Endpoint:** `POST /users/register` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | token | string \| null | No | | | X-Chutes-Hotkey | string | Yes | The hotkey of the user | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Request Body | Field | Type | Required | Description | |-------|------|----------|-------------| | username | string | Yes | | | coldkey | string | Yes | | | logo_id | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Get Registration Token Initial form with Cloudflare + hCaptcha to generate a registration token. **Endpoint:** `GET /users/registration_token` ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | --- ## Post Rtok Verify hCaptcha and get a short-lived registration token. **Endpoint:** `POST /users/registration_token` ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | --- ## Admin Create User Create a new user manually from an admin account; no Bittensor wallet required. 
**Endpoint:** `POST /users/create_user` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Request Body | Field | Type | Required | Description | |-------|------|----------|-------------| | username | string | Yes | | | coldkey | string \| null | No | | | hotkey | string \| null | No | | | logo_id | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Change Fingerprint Reset a user's fingerprint using either the hotkey or coldkey. **Endpoint:** `POST /users/change_fingerprint` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | Authorization | string \| null | No | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Coldkey | string \| null | No | | | X-Chutes-Nonce | string | No | Nonce | | X-Chutes-Signature | string | No | Hotkey signature | ### Request Body | Field | Type | Required | Description | |-------|------|----------|-------------| | fingerprint | string | Yes | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- ## Get Login Nonce Get a nonce for hotkey signature login. The nonce is a UUID4 string that must be signed by the user's hotkey. Valid for 5 minutes. **Endpoint:** `GET /users/login/nonce` ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | --- ## Login Exchange credentials for a JWT. Supports two authentication methods: 1. Fingerprint: {"fingerprint": "your-fingerprint"} 2. 
Hotkey signature: {"hotkey": "5...", "signature": "hex...", "nonce": "uuid"} For hotkey auth, first call GET /users/login/nonce to get a nonce, sign it with your hotkey (e.g., `btcli w sign --message `), then submit the hotkey, signature, and nonce. **Endpoint:** `POST /users/login` ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | --- ## Change Bt Auth Change the bittensor hotkey/coldkey associated with an account via fingerprint auth. **Endpoint:** `POST /users/change_bt_auth` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | Authorization | string | Yes | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- ## Update Squad Access Enable squad access. **Endpoint:** `PUT /users/squad_access` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## List Usage List usage summary data. 
**Endpoint:** `GET /users/{user_id}/usage` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | user_id | string | Yes | | | page | integer \| null | No | | | limit | integer \| null | No | | | per_chute | boolean \| null | No | | | chute_id | string \| null | No | | | start_date | string \| null | No | | | end_date | string \| null | No | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Get User Info Get user info. **Endpoint:** `GET /users/{user_id}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | user_id | string | Yes | | | X-Chutes-Hotkey | string \| null | No | | | X-Chutes-Signature | string \| null | No | | | X-Chutes-Nonce | string \| null | No | | | Authorization | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | ### Authentication This endpoint requires authentication. --- ## Agent Registration Register an AI agent programmatically using hotkey/coldkey/signature. Returns a payment address where the agent must send TAO to complete registration. 
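As a rough sketch, this agent flow can be driven with stdlib HTTP calls. The endpoint paths and request-body fields are the documented ones; the helper names are hypothetical, and the signature itself must be produced externally (e.g. with your hotkey via btcli):

```python
import json
import urllib.request

API = "https://api.chutes.ai"

def registration_payload(hotkey, coldkey, signature, username=None):
    # Request body for POST /users/agent_registration; username is optional.
    body = {"hotkey": hotkey, "coldkey": coldkey, "signature": signature}
    if username is not None:
        body["username"] = username
    return body

def register_agent(payload):
    # Submit the registration; the response includes the TAO payment address.
    req = urllib.request.Request(
        f"{API}/users/agent_registration",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def registration_status(hotkey):
    # Poll GET /users/agent_registration/{hotkey} until the registration is
    # no longer pending payment (i.e., completed or expired).
    with urllib.request.urlopen(f"{API}/users/agent_registration/{hotkey}") as resp:
        return json.load(resp)
```

An agent would call `register_agent`, send TAO to the returned payment address, then poll `registration_status` until the registration completes.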
**Endpoint:** `POST /users/agent_registration` ### Request Body | Field | Type | Required | Description | |-------|------|----------|-------------| | hotkey | string | Yes | | | coldkey | string | Yes | | | signature | string | Yes | | | username | string \| null | No | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- ## Get Agent Registration Status Check the status of an agent registration by hotkey. Handles all states: pending payment, completed (converted to user), or expired. **Endpoint:** `GET /users/agent_registration/{hotkey}` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | hotkey | string | Yes | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- ## Agent Setup One-time setup endpoint for agent-registered users. Requires hotkey signature to prove ownership. Returns API key and config.ini template. **Endpoint:** `POST /users/{user_id}/agent_setup` ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | user_id | string | Yes | | ### Request Body | Field | Type | Required | Description | |-------|------|----------|-------------| | hotkey | string | Yes | | | signature | string | Yes | | ### Responses | Status Code | Description | |-------------|-------------| | 200 | Successful Response | | 422 | Validation Error | --- --- ## SOURCE: https://chutes.ai/docs/cli/account # Account Management This section covers CLI commands for managing your Chutes account, registration, authentication, and API keys. ## Account Registration ### `chutes register` Create a new account with the Chutes platform. 
```bash chutes register [OPTIONS] ``` **Options:** - `--config-path TEXT`: Custom path to config file - `--username TEXT`: Desired username - `--wallets-path TEXT`: Path to Bittensor wallets directory (default: `~/.bittensor/wallets`) - `--wallet TEXT`: Name of the wallet to use - `--hotkey TEXT`: Hotkey to register with **Examples:** ```bash # Basic registration with interactive prompts chutes register # Register with specific username chutes register --username myusername # Register with specific wallet chutes register --wallet my_wallet --hotkey my_hotkey ``` **Registration Process:** 1. **Choose Username**: Select a unique username for your account 2. **Wallet Selection**: Choose from available Bittensor wallets 3. **Hotkey Selection**: Select which hotkey to use for signing 4. **Token Verification**: Complete registration token verification 5. **Config Generation**: Configuration file is generated and saved **What Happens During Registration:** - Creates your Chutes account - Generates initial configuration file at `~/.chutes/config.ini` - Sets up your payment address for adding balance - Provides your account fingerprint (keep this safe!) ## API Key Management API keys provide programmatic access to your Chutes account and are essential for CI/CD and automation. ### `chutes keys list` List all API keys for your account. 
```bash chutes keys list [OPTIONS] ``` **Options:** - `--name TEXT`: Filter by name - `--limit INTEGER`: Number of items per page (default: 25) - `--page INTEGER`: Page number (default: 0) **Example:** ```bash chutes keys list ``` **Output:** ``` ┌──────────┬─────────────────────┬─────────┬──────────────────────────┐ │ ID │ Name │ Admin │ Scopes │ ├──────────┼─────────────────────┼─────────┼──────────────────────────┤ │ key_123 │ admin │ true │ - │ │ key_456 │ ci-cd │ false │ {"action": "invoke"...} │ │ key_789 │ dev │ false │ {"action": "read"...} │ └──────────┴─────────────────────┴─────────┴──────────────────────────┘ ``` ### `chutes keys create` Create a new API key. ```bash chutes keys create [OPTIONS] ``` **Options:** - `--name TEXT`: Name for the API key (required) - `--admin`: Create admin key with full permissions - `--images`: Allow full access to images - `--chutes`: Allow full access to chutes - `--image-ids TEXT`: Allow access to specific image IDs (can be repeated) - `--chute-ids TEXT`: Allow access to specific chute IDs (can be repeated) - `--action [read|write|delete|invoke]`: Specify action scope - `--json-input TEXT`: Provide raw scopes document as JSON for advanced usage - `--config-path TEXT`: Custom config path **Examples:** ```bash # Create admin key with full permissions chutes keys create --name admin --admin # Create key for invoking all chutes chutes keys create --name invoke-all --chutes --action invoke # Create key for reading specific chute chutes keys create --name readonly-key --chute-ids my-chute-id --action read # Create key for managing images chutes keys create --name image-manager --images --action write # Create key with advanced scopes using JSON chutes keys create --name advanced-key --json-input '{"scopes": [{"object_type": "chutes", "action": "invoke"}]}' ``` **Key Types:** - **Admin Keys**: Full account access including all resources - **Scoped Keys**: Limited access based on object type and action **Using Your API Key:** 
After creating a key, you'll receive output like: ``` API key created successfully { "api_key_id": "...", "name": "my-key", "secret_key": "cpk_xxxxxxxxxxxxxxxx" } To use the key, add "Authorization: Bearer cpk_xxxxxxxxxxxxxxxx" to your headers! ``` ### `chutes keys get` Get details about a specific API key. ```bash chutes keys get ``` **Example:** ```bash chutes keys get my-key ``` ### `chutes keys delete` Delete an API key. ```bash chutes keys delete ``` **Example:** ```bash # Delete by name chutes keys delete old-key ``` **Safety Notes:** - Deleted keys cannot be recovered - Active deployments using the key will lose access - Always rotate keys before deletion in production ## Secrets Management Secrets allow you to securely store sensitive values (like API tokens) that your chutes need at runtime. ### `chutes secrets create` Create a new secret for a chute. ```bash chutes secrets create [OPTIONS] ``` **Options:** - `--purpose TEXT`: The chute UUID or name this secret is for (required) - `--key TEXT`: The secret key/environment variable name (required) - `--value TEXT`: The secret value (required) - `--config-path TEXT`: Custom config path **Examples:** ```bash # Create a HuggingFace token secret for a chute chutes secrets create --purpose my-llm-chute --key HF_TOKEN --value hf_xxxxxxxxxxxx # Create an API key secret chutes secrets create --purpose my-chute --key EXTERNAL_API_KEY --value sk-xxxxxxxx ``` ### `chutes secrets list` List your secrets. 
```bash chutes secrets list [OPTIONS] ``` **Options:** - `--limit INTEGER`: Number of items per page (default: 25) - `--page INTEGER`: Page number (default: 0) **Output:** ``` ┌────────────────┬─────────────────┬─────────────┬─────────────────────┐ │ Secret ID │ Purpose │ Key │ Created │ ├────────────────┼─────────────────┼─────────────┼─────────────────────┤ │ sec_123abc │ my-llm-chute │ HF_TOKEN │ 2024-01-15 10:30:00 │ │ sec_456def │ my-chute │ API_KEY │ 2024-01-20 14:45:00 │ └────────────────┴─────────────────┴─────────────┴─────────────────────┘ ``` ### `chutes secrets get` Get details about a specific secret. ```bash chutes secrets get ``` ### `chutes secrets delete` Delete a secret. ```bash chutes secrets delete ``` ## Configuration Management ### Config File Structure The Chutes configuration file (`~/.chutes/config.ini`) stores your account settings: ```ini [api] base_url = https://api.chutes.ai [auth] username = myusername user_id = user_123abc456def hotkey_seed = your_hotkey_seed hotkey_name = my_hotkey hotkey_ss58address = 5xxxxx... [payment] address = 5xxxxx... 
``` ### Environment Variables Override config settings with environment variables: ```bash # Config path export CHUTES_CONFIG_PATH=/path/to/config.ini # API URL (for development/testing) export CHUTES_API_URL=https://api.chutes.ai # Allow missing config (useful during registration) export CHUTES_ALLOW_MISSING=true ``` ### Multiple Configurations Manage multiple accounts or environments: ```bash # Create environment-specific configs mkdir -p ~/.chutes/environments # Production config chutes register --config-path ~/.chutes/environments/prod.ini # Staging config chutes register --config-path ~/.chutes/environments/staging.ini # Use specific config for commands chutes build my_app:chute --config-path ~/.chutes/environments/prod.ini ``` ## Security Best Practices ### API Key Security ```bash # Use separate keys for different purposes chutes keys create --name production-deploy --chutes --action write chutes keys create --name monitoring --chutes --action read chutes keys create --name ci-invoke --chutes --action invoke # Rotate keys regularly chutes keys create --name new-prod-key --admin # Update your deployments to use new key chutes keys delete old-prod-key ``` ### Account Security - **Keep Your Fingerprint Safe**: Your fingerprint is shown during registration - don't share it - **Secure Your Hotkey**: The hotkey seed in your config file should be kept private - **Regular Audits**: Review your API keys periodically and delete unused ones - **Environment Separation**: Use different keys for dev/staging/prod ### CI/CD Security ```yaml # GitHub Actions example env: CHUTES_API_KEY: ${{ secrets.CHUTES_API_KEY }} steps: - name: Deploy to Chutes run: | pip install chutes mkdir -p ~/.chutes cat > ~/.chutes/config.ini << EOF [api] base_url = https://api.chutes.ai [auth] # Use API key authentication EOF chutes deploy my_app:chute --accept-fee ``` ## Troubleshooting ### Common Issues **Registration fails?** ```bash # Check network connectivity curl -I 
https://api.chutes.ai/ping # Try with different username (may already be taken) chutes register --username alternative_username # Verify wallet path exists ls ~/.bittensor/wallets/ ``` **API key not working?** ```bash # Verify key exists and check scopes chutes keys list chutes keys get my-key # Ensure you're using the secret_key value with "Authorization: Bearer" header ``` **Configuration issues?** ```bash # Check config file exists and has correct format cat ~/.chutes/config.ini # Verify environment variables aren't overriding echo $CHUTES_CONFIG_PATH echo $CHUTES_API_URL ``` ### Getting Help - **Account Issues**: [Discord Community](https://discord.gg/wHrXwWkCRz) - **Technical Support**: [GitHub Issues](https://github.com/chutesai/chutes/issues) - **Documentation**: [Chutes Docs](https://chutes.ai/docs) ## Next Steps - **[Building Images](/docs/cli/build)** - Learn to build Docker images - **[Deploying Chutes](/docs/cli/deploy)** - Deploy your applications - **[Managing Resources](/docs/cli/manage)** - Manage your deployments - **[CLI Overview](/docs/cli/overview)** - Return to command overview --- ## SOURCE: https://chutes.ai/docs/cli/build # Building Images The `chutes build` command creates Docker images for your chutes with all necessary dependencies and optimizations for the Chutes platform. ## Basic Build Command ### `chutes build` Build a Docker image for your chute. 
```bash chutes build [OPTIONS] ``` **Arguments:** - `chute_ref`: Chute reference in format `module:chute_name` **Options:** - `--config-path TEXT`: Custom config path - `--logo TEXT`: Path to logo image for the image - `--local`: Build locally instead of remotely (useful for testing/debugging) - `--debug`: Enable debug logging - `--include-cwd`: Include entire current directory in build context recursively - `--wait`: Wait for remote build to complete and stream logs - `--public`: Mark image as public/available to anyone ## Build Examples ### Basic Remote Build ```bash # Build on Chutes infrastructure (recommended) chutes build my_chute:chute --wait ``` **Benefits of Remote Building:** - 🚀 Faster build times with powerful infrastructure - 📦 Optimized caching and layer sharing - 🔒 Secure build environment - 💰 No local resource usage ### Local Development Build ```bash # Build locally for testing and development chutes build my_chute:chute --local --debug ``` **When to Use Local Builds:** - 🧪 Quick development iterations - 🔍 Debugging build issues - 🌐 Limited internet connectivity - 🔒 Sensitive code that shouldn't leave your machine ### Production Build with Assets ```bash # Build with logo and make public chutes build my_chute:chute --logo ./assets/logo.png --public --wait ``` ## Build Process ### What Happens During Build 1. **Code Analysis**: Chutes analyzes your Python code and image directives 2. **Context Packaging**: Build context files are packaged and uploaded 3. **Image Creation**: Dockerfile is generated from your Image definition 4. **Dependency Installation**: Python packages and system dependencies installed 5. **Validation**: Image is validated for compatibility ### Build Stages ```bash # Example build output Building chute: my_chute:chute ✓ Analyzing code structure ✓ Packaging build context ✓ Uploading to build server ✓ Building image layers ✓ Installing dependencies ✓ Pushing to registry Build completed successfully! 
Image ID: img_abc123def456 ``` ### Build Context When building remotely, the CLI will: 1. Collect all files referenced in your `Image` directives 2. Show you which files will be uploaded 3. Ask for confirmation before uploading 4. Package and send to the build server ```bash Found 15 files to include in build context -- these will be uploaded for remote builds! requirements.txt src/main.py src/utils.py ... Confirm submitting build context? (y/n) ``` ## Image Definition Images are defined in Python using the `Image` class: ```python from chutes.image import Image image = ( Image(username="myuser", name="my-chute", tag="1.0") .from_base("parachutes/python:3.12") .run_command("apt-get update && apt-get install -y git") .add("requirements.txt", "/app/requirements.txt") .run_command("pip install -r /app/requirements.txt") .add("src/", "/app/src/") ) ``` ### Recommended Base Image We **highly recommend** starting with our base image to avoid dependency issues: ```python .from_base("parachutes/python:3.12") ``` This base image includes: - CUDA 12.x installation - Python 3.12 - OpenCL libraries - Common ML dependencies ### Build Context Optimization Organize your directives for optimal caching: ```python # Good: Stable operations first, frequently changing code last image = ( Image(username="myuser", name="my-app", tag="1.0") .from_base("parachutes/python:3.12") # System deps (rarely change) .run_command("apt-get update && apt-get install -y git curl") # Python deps (change occasionally) .add("requirements.txt", "/app/requirements.txt") .run_command("pip install -r /app/requirements.txt") # Application code (changes frequently) .add("src/", "/app/src/") ) ``` ## Including Files ### Automatic Context Detection The build system automatically detects files referenced in your `Image.add()` directives: ```python image = ( Image(...) 
.add("requirements.txt", "/app/requirements.txt") # Only this file included .add("src/", "/app/src/") # This directory included ) ``` ### Including Entire Directory Use `--include-cwd` to include the entire current directory: ```bash chutes build my_chute:chute --include-cwd --wait ``` This is useful when your code has implicit dependencies not captured in the Image definition. ## Troubleshooting Builds ### Common Build Issues **Build fails with dependency errors?** ```bash # Build with debug to see full output chutes build my_chute:chute --local --debug # Check your requirements.txt versions are compatible cat requirements.txt ``` **Image already exists?** ```bash # Check existing images chutes images list --name my-chute # Delete old image if needed chutes images delete my-chute:1.0 ``` **Build takes too long?** - Use remote building (usually faster): `chutes build my_chute:chute --wait` - Optimize Docker layers in your Image definition - Put stable dependencies (like torch) before frequently changing code **Permission errors (local build)?** ```bash # Check Docker daemon is running sudo systemctl status docker # Check file permissions ls -la ``` ### Debug Commands ```bash # Inspect generated Dockerfile python -c "from my_chute import chute; print(chute.image)" # Check image exists after build chutes images list --name my-chute chutes images get my-chute ``` ## Build Strategies ### Development Workflow ```bash # Fast iteration during development with local builds chutes build my_chute:chute --local # Test the built image locally docker run --rm -it -p 8000:8000 my_chute:1.0 chutes run my_chute:chute --dev # Once stable, build remotely chutes build my_chute:chute --wait ``` ### CI/CD Integration ```yaml # GitHub Actions example name: Build and Deploy on: push: branches: [main] jobs: build: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Setup Python uses: actions/setup-python@v4 with: python-version: "3.11" - name: Install Chutes run: pip install 
chutes - name: Configure Chutes env: CHUTES_CONFIG: ${{ secrets.CHUTES_CONFIG }} run: | mkdir -p ~/.chutes echo "$CHUTES_CONFIG" > ~/.chutes/config.ini - name: Build Image run: chutes build my_app:chute --wait ``` ### Production Builds ```bash #!/bin/bash set -e echo "Building production image..." # 1. Ensure clean workspace git status --porcelain [ -z "$(git status --porcelain)" ] || { echo "Uncommitted changes found"; exit 1; } # 2. Run tests python -m pytest tests/ # 3. Build image chutes build my_chute:chute --wait # 4. Deploy chutes deploy my_chute:chute --accept-fee echo "Production build and deploy completed!" ``` ## Best Practices ### 1. Pin Dependencies ```txt # requirements.txt - Good torch==2.1.0 transformers==4.30.2 numpy==1.24.3 # Bad - versions can change and break builds torch transformers numpy ``` ### 2. Use the Recommended Base Image ```python # Recommended .from_base("parachutes/python:3.12") # Not recommended unless you know what you're doing .from_base("nvidia/cuda:12.2-runtime-ubuntu22.04") ``` ### 3. Optimize Layer Order Put things that change less frequently earlier in your Image definition: 1. System packages 2. Python packages (requirements.txt) 3. Application code ### 4. Clean Up in Commands ```python # Good: Clean up in the same layer .run_command(""" apt-get update && apt-get install -y git curl && rm -rf /var/lib/apt/lists/* """) # Less optimal: Separate commands create more layers .run_command("apt-get update") .run_command("apt-get install -y git curl") ``` ### 5. Review Build Context Always review which files will be uploaded before confirming: ```bash Found 15 files to include in build context requirements.txt src/main.py ... Confirm submitting build context? (y/n) ``` Make sure no sensitive files (`.env`, credentials) are included. 
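Both of those review steps can be automated with a small pre-build check. This is an illustrative sketch only (the helper names and the pattern list are hypothetical, not part of the Chutes CLI):

```python
def unpinned_requirements(lines):
    """Return requirement lines that lack an exact '==' version pin."""
    flagged = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if "==" not in line:
            flagged.append(line)
    return flagged

def sensitive_files(paths, patterns=(".env", "id_rsa", "credentials", ".pem")):
    """Flag build-context paths whose names suggest secrets."""
    return [p for p in paths if any(pat in p for pat in patterns)]

if __name__ == "__main__":
    reqs = ["torch==2.1.0", "transformers", "numpy==1.24.3"]
    print(unpinned_requirements(reqs))  # ['transformers']
    print(sensitive_files(["src/main.py", ".env", "deploy/key.pem"]))  # ['.env', 'deploy/key.pem']
```

Run a check like this against `requirements.txt` and the file list shown in the build-context prompt before answering `y`.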
## Next Steps - **[Deploying Chutes](/docs/cli/deploy)** - Deploy your built images - **[Managing Resources](/docs/cli/manage)** - Manage your chutes and images - **[Account Management](/docs/cli/account)** - API keys and configuration - **[CLI Overview](/docs/cli/overview)** - Return to command overview --- ## SOURCE: https://chutes.ai/docs/cli/deploy # Deploying Chutes The `chutes deploy` command takes your built images and deploys them as live, scalable AI applications on the Chutes platform. ## Basic Deploy Command ### `chutes deploy` Deploy a chute to the platform. ```bash chutes deploy [OPTIONS] ``` **Arguments:** - `chute_ref`: Chute reference in format `module:chute_name` **Options:** - `--config-path TEXT`: Custom config path - `--logo TEXT`: Path to logo image for the chute - `--debug`: Enable debug logging - `--public`: Mark chute as public/available to anyone - `--accept-fee`: Acknowledge and accept the deployment fee ## Deployment Examples ### Basic Deployment ```bash # Deploy with fee acknowledgment chutes deploy my_chute:chute --accept-fee ``` **What happens:** - ✅ Validates image exists and is built - ✅ Creates deployment configuration - ✅ Registers chute with the platform - ✅ Returns chute ID and version ### Production Deployment ```bash # Deploy with logo chutes deploy my_chute:chute \ --logo ./assets/logo.png \ --accept-fee ``` ### Private vs Public Deployments ```bash # Private deployment (default) - only you can access chutes deploy my_chute:chute --accept-fee # Public deployment (requires special permissions) chutes deploy my_chute:chute --public --accept-fee ``` > **Note:** Public chutes require special permissions. If you need to share your chute, use the `chutes share` command instead. ## Deployment Process ### Deployment Stages ```bash # Example deployment output Deploying chute: my_chute:chute You are about to upload my_chute.py and deploy my-chute, confirm? 
(y/n) y Successfully deployed chute my-chute chute_id=abc123 version=1 ``` ### What Gets Deployed When you deploy, the following is sent to the platform: - **Chute Configuration**: Name, readme, tagline - **Node Selector**: GPU requirements - **Cords**: API endpoints your chute exposes - **Code Reference**: Your chute's Python code - **Image Reference**: The built image to use ## Deployment Fees Deployment incurs a one-time fee based on your NodeSelector configuration: ```bash # Deploy and acknowledge the fee chutes deploy my_chute:chute --accept-fee ``` If you don't include `--accept-fee`, you may receive a 402 error indicating the deployment fee needs to be acknowledged. ### Fee Structure Deployment fees are calculated based on: - **GPU Type**: Higher-end GPUs cost more - **GPU Count**: More GPUs = higher fee - **VRAM Requirements**: Higher VRAM requirements cost more Example fee calculation: - Single RTX 3090 at $0.12/hr = $0.36 deployment fee - Multiple GPUs or premium GPUs will have higher fees ## Pre-Deployment Checklist Before deploying, ensure: ### 1. Image is Built and Ready ```bash # Check image status chutes images list --name my-image chutes images get my-image # Should show status: "built and pushed" ``` ### 2. Chute Configuration is Correct ```python # Verify your chute definition from chutes.chute import Chute, NodeSelector chute = Chute( username="myuser", name="my-chute", tagline="My awesome AI chute", readme="## My Chute\n\nDescription here...", image=my_image, concurrency=4, node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16, ), ) ``` ### 3. 
Cords are Defined ```python @chute.cord() async def my_function(self, input_data: str) -> str: return f"Processed: {input_data}" @chute.cord( public_api_path="/generate", public_api_method="POST", ) async def generate(self, prompt: str) -> str: # Your logic here return result ``` ## Chute Configuration Options ### NodeSelector Control which GPUs your chute runs on: ```python from chutes.chute import NodeSelector node_selector = NodeSelector( gpu_count=1, # Number of GPUs (1-8) min_vram_gb_per_gpu=16, # Minimum VRAM per GPU (16-80) include=["rtx4090"], # Only use these GPU types exclude=["rtx3090"], # Don't use these GPU types ) ``` ### Concurrency Set how many concurrent requests your chute can handle: ```python chute = Chute( ... concurrency=4, # Handle 4 concurrent requests per instance ) ``` ### Auto-Scaling Configure automatic scaling behavior: ```python chute = Chute( ... max_instances=10, # Maximum number of instances scaling_threshold=0.8, # Scale up threshold shutdown_after_seconds=300, # Shutdown idle instances after 5 minutes ) ``` ### Network Egress Control external network access: ```python chute = Chute( ... allow_external_egress=True, # Allow external network access ) ``` > **Note:** By default, `allow_external_egress` is **true** for custom chutes but **false** for vllm/sglang templates. Set to `True` if your chute needs to fetch external resources (e.g., image URLs for vision models). ## Sharing Chutes After deployment, you can share your chute with other users: ```bash # Share with another user chutes share --chute-id my-chute --user-id colleague # Remove sharing chutes share --chute-id my-chute --user-id colleague --remove ``` ### Billing When Sharing When you share a chute: - **You** (chute owner) pay the hourly rate while instances are running - **The user you shared with** pays the standard usage rate (per token, per step, etc.) 
## Troubleshooting Deployments ### Common Deployment Issues **"Image is not available to be used (yet)!"** ```bash # Image hasn't finished building - check status chutes images get my-image # Wait for status: "built and pushed" ``` **"Unable to create public chutes from non-public images"** ```bash # If deploying public chute, image must also be public # Rebuild image with --public flag chutes build my_chute:chute --public --wait ``` **402 Payment Required** ```bash # Include --accept-fee flag chutes deploy my_chute:chute --accept-fee ``` **409 Conflict** ```bash # Chute with this name already exists # Delete existing chute first chutes chutes delete my-chute # Or use a different name in your chute definition ``` ### Debug Commands ```bash # Enable debug logging chutes deploy my_chute:chute --debug --accept-fee # Check existing chutes chutes chutes list chutes chutes get my-chute # Check image status chutes images get my-image ``` ## CI/CD Integration ### GitHub Actions ```yaml name: Deploy to Chutes on: push: branches: [main] jobs: deploy: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Setup Python uses: actions/setup-python@v4 with: python-version: '3.11' - name: Install Chutes run: pip install chutes - name: Configure Chutes env: CHUTES_CONFIG: ${{ secrets.CHUTES_CONFIG }} run: | mkdir -p ~/.chutes echo "$CHUTES_CONFIG" > ~/.chutes/config.ini - name: Build and Deploy run: | chutes build my_app:chute --wait chutes deploy my_app:chute --accept-fee ``` ### GitLab CI ```yaml deploy: stage: deploy script: - pip install chutes - mkdir -p ~/.chutes - echo "$CHUTES_CONFIG" > ~/.chutes/config.ini - chutes build my_app:chute --wait - chutes deploy my_app:chute --accept-fee only: - main ``` ## Production Deployment Checklist ### Pre-Deployment ```bash # ✅ Run tests locally python -m pytest tests/ # ✅ Build image and verify chutes build my_chute:chute --wait chutes images get my-chute # ✅ Test locally if possible docker run --rm -it -p 8000:8000 
my_chute:tag chutes run my_chute:chute --dev ``` ### Deployment ```bash # ✅ Deploy with fee acknowledgment chutes deploy my_chute:chute --accept-fee # ✅ Note the chute_id and version from output ``` ### Post-Deployment ```bash # ✅ Verify deployment chutes chutes get my-chute # ✅ Warm up the chute chutes warmup my-chute # ✅ Test the endpoint curl -X POST https://your-chute-url/your-endpoint \ -H "Authorization: Bearer your-api-key" \ -H "Content-Type: application/json" \ -d '{"input": "test"}' ``` ## Best Practices ### 1. Use Meaningful Names ```python chute = Chute( name="sentiment-analyzer-v2", # Clear, versioned name tagline="Analyze sentiment in text using BERT", readme="## Sentiment Analyzer\n\n...", ) ``` ### 2. Set Appropriate Concurrency ```python # For LLMs with continuous batching (vllm/sglang) concurrency=64 # For single-request models (diffusion, custom) concurrency=1 # For models with some parallelism concurrency=4 ``` ### 3. Configure Shutdown Timer ```python # For development/testing - short timeout shutdown_after_seconds=60 # For production - longer timeout to avoid cold starts shutdown_after_seconds=300 ``` ### 4. Right-Size GPU Requirements ```python # Match your model's actual requirements NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24, # For ~13B parameter models ) # Don't over-provision NodeSelector( gpu_count=1, min_vram_gb_per_gpu=80, # Only if you actually need A100 ) ``` ## Next Steps - **[Managing Resources](/docs/cli/manage)** - Monitor and manage deployments - **[Building Images](/docs/cli/build)** - Optimize your build process - **[Account Management](/docs/cli/account)** - API keys and configuration - **[CLI Overview](/docs/cli/overview)** - Return to command overview --- ## SOURCE: https://chutes.ai/docs/cli/deployment-troubleshooting # Deployment troubleshooting steps ### Chute is deployed but cold Your chute image built and deployed successfully but is in a cold state. 
**Run the warmup endpoint**

```bash
curl -X GET "https://api.chutes.ai/chutes/warmup/{chute_id_or_name}" -H "Content-Type: application/json"
```

This endpoint notifies miners that the Chute is ready for use and causes them to deploy instances. Depending on your Chute's configuration, the warmup can take some time while any required files are downloaded, so be aware that the request may time out before the Chute actually goes hot.

**Check for active instances**

```bash
# Get the Chute's info and filter out everything but the instances.
chutes chutes get your_chute_ID_here | jq '.instances'
```

**Example output**

```bash
Mac:~ chutist$ chutes chutes get 8f2105c5-b200-5aa5-969f-0720f7690f3c | jq '.instances'
2026-01-14 14:43:26.979 | INFO | chutes.config:get_config:55 - Loading chutes config from /Users/algowarry/.chutes/config.ini...
2026-01-14 14:43:26.979 | DEBUG | chutes.config:get_config:78 - Configured chutes: with api_base_url=https://api.chutes.ai
2026-01-14 14:43:26.979 | DEBUG | chutes.util.auth:sign_request:67 - Signing message: 5CPBAsApHkvdhviSry2xoN38MLX5xG5FUPy3QVWVPe46JxCT:1768419806:chutes
2026-01-14 14:43:27.189 | INFO | chutes.crud:_get_object:161 - chute 8f2105c5-b200-5aa5-969f-0720f7690f3c:
[
  {
    "instance_id": "b3f083a3-3460-4d37-83e1-6c945c5d3c61",
    "region": "n/a",
    "active": true,
    "verified": true,
    "last_verified_at": "2026-01-14T19:38:24.773055Z"
  },
  {
    "instance_id": "dbf02090-aa98-4584-98c1-2f36899050b8",
    "region": "n/a",
    "active": true,
    "verified": true,
    "last_verified_at": "2026-01-14T19:38:24.773055Z"
  },
  {
    "instance_id": "e6c2eb0e-4b61-43da-94cd-e17c933e6159",
    "region": "n/a",
    "active": true,
    "verified": true,
    "last_verified_at": "2026-01-14T19:38:24.773055Z"
  }
]
```

In the listed instances, look at the "active" flag. If it is set to true, the Chute is hot. If all of the "active" flags are set to false, those instances are still starting and have not become hot yet.
"active": true is hot and "active": false is starting up. If you see no instances at all then the warmup command failed and the Chute is sitting cold in an unused state. ## You ran the warmup but your Chute wont go hot Sometimes your build and deployment will succeed but the warmup will fail. This is usually due to a configuration error or limitation within your source file. The first common places to check are as follows. ### Model revision One of the first places to check on your source file is the model revision. This identifier is pulled from the huggingface.co repo for the model you are attempting to deploy. **Example source file with revision tag** ```bash chute = build_sglang_chute( username="chutes", readme="Qwen/Qwen3-32B", model_name="Qwen/Qwen3-32B", image="chutes/sglang:nightly-2025120900", revision="ba1f828c09458ab0ae83d42eaacc2cf8720c7957", concurrency=64, ``` To locate this revision identifier go to the models page on huggingface.co and click the Files and versions tab. Then on the right side click the commit history button. On the commit history page you will see all commits that have been made to the model. Click the small copy icon next to the most recent commit and you will have the revision identifier. This is the identifier that need to be on the revision tag in your source file. If the revision identifier in your source file does not match one on the models commit history page the Chute will fail to go hot. If you have confirmed the revisions are correct the next place to check is the node selector. ## Node Selector The node selector is the part of the Chutes source file that dictates the number of GPU's and GB of VRAM required to run your Chute. **Example correct Node Selector** ```bash node_selector=NodeSelector( gpu_count=4, min_vram_gb_per_gpu=80, ), ``` The node selector listed above is simplified and allows for a broad variety of cards to power your Chute. 
This is the recommended format, as it allows any available node that meets your minimum GB of VRAM. **Example potentially problematic node selector** ```python node_selector=NodeSelector( gpu_count=1, include=["A100", "h100"], ), ``` This node selector is very specific about the types of GPUs it wants. This can cause an issue if there are a limited number of those cards in the available inventory. If a matching card is not available, the Chute will fail to go hot even if everything else is correct. ## Pull Logs If you have confirmed all of the above and your revision and node selector are correct, the next step is to review logs. To do this, follow these steps. **Warm up the chute** ```bash curl -X GET "https://api.chutes.ai/chutes/warmup/{chute_id_or_name}" -H "Content-Type: application/json" ``` Now check for instances and note the instance IDs (see above for examples). **Check for instances** ```bash # get the Chute's info and filter out everything but the instances. chutes chutes get your_chute_ID_here | jq '.instances' ``` Once you have those instance IDs, you can pull live logs from those instances. **Pull logs** ```bash curl -N -H "Authorization: Bearer $YOUR_API_KEY_HERE" https://api.chutes.ai/instances/INSTANCE_ID_HERE/logs ``` You will now get live logs from the instance as it prepares your Chute. If the instance fails, the logs will end and you will see the error that caused the instance to fail. Use that to troubleshoot your Chute and resolve the issue. If needed, create a ticket either on the site or Discord and present those logs for one of our team to review. 
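The log-pulling curl above can also be scripted. The sketch below uses only the Python standard library; the endpoint path and Bearer auth scheme are copied from the curl example, and the instance ID and API key are placeholders you supply yourself.

```python
# Stream live logs for an instance, mirroring the curl command above.
# Uses only the standard library; instance_id and api_key are placeholders.
import urllib.request

API_BASE = "https://api.chutes.ai"

def logs_request(instance_id: str, api_key: str) -> urllib.request.Request:
    """Build the request for the instance logs endpoint shown above."""
    url = f"{API_BASE}/instances/{instance_id}/logs"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})

def stream_logs(instance_id: str, api_key: str) -> None:
    """Print log lines as the instance emits them."""
    with urllib.request.urlopen(logs_request(instance_id, api_key)) as resp:
        for raw in resp:  # the endpoint streams line-delimited output
            print(raw.decode("utf-8", errors="replace").rstrip())

# stream_logs("INSTANCE_ID_HERE", "YOUR_API_KEY_HERE")
```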
### Support Resources - 📖 **Documentation**: [Complete Docs](/docs) - 💬 **Discord**: [Community Chat](https://discord.gg/wHrXwWkCRz) - 📨 **Support**: [Email](support@chutes.ai) - 🐛 **Issues**: [GitHub Issues](https://github.com/chutesai/chutes/issues) --- Continue to specific command documentation: - **[Account Management](/docs/cli/account)** - Detailed account commands - **[Building Images](/docs/cli/build)** - Advanced build options - **[Deploying Chutes](/docs/cli/deploy)** - Deployment strategies - **[Managing Resources](/docs/cli/manage)** - Resource management --- ## SOURCE: https://chutes.ai/docs/cli/manage # Managing Resources This section covers CLI commands for managing your deployed chutes, images, API keys, and secrets. ## Chute Management ### `chutes chutes list` List all your deployed chutes. ```bash chutes chutes list [OPTIONS] ``` **Options:** - `--name TEXT`: Filter by name - `--limit INTEGER`: Number of items per page (default: 25) - `--page INTEGER`: Page number (default: 0) - `--include-public`: Include public chutes in results **Examples:** ```bash # List all your chutes chutes chutes list # Filter by name chutes chutes list --name sentiment # Include public chutes chutes chutes list --include-public --limit 50 ``` **Output:** ``` ┌─────────────────┬─────────────────────┬────────┬───────────────────────────────┐ │ ID │ Name │ Status │ Cords │ ├─────────────────┼─────────────────────┼────────┼───────────────────────────────┤ │ chute_abc123 │ sentiment-api │ hot │ analyze │ │ │ │ │ stream=False │ │ │ │ │ POST /analyze │ ├─────────────────┼─────────────────────┼────────┼───────────────────────────────┤ │ chute_def456 │ image-gen │ cold │ generate │ │ │ │ │ stream=True │ │ │ │ │ POST /generate │ └─────────────────┴─────────────────────┴────────┴───────────────────────────────┘ ``` ### `chutes chutes get` Get detailed information about a specific chute. 
```bash chutes chutes get ``` **Arguments:** - `name_or_id`: Name or UUID of the chute **Example:** ```bash chutes chutes get my-chute ``` **Output:** ```json { "chute_id": "abc123-def456-...", "name": "my-chute", "tagline": "My awesome AI chute", "slug": "myuser/my-chute", "hot": true, "created_at": "2024-01-15T10:30:00Z", "node_selector": { "gpu_count": 1, "min_vram_gb_per_gpu": 24 }, ... } ``` ### `chutes chutes delete` Delete a chute and all its resources. ```bash chutes chutes delete ``` **Arguments:** - `name_or_id`: Name or UUID of the chute to delete **Example:** ```bash chutes chutes delete old-chute ``` **Confirmation:** ``` Are you sure you want to delete chutes/old-chute? This action is irreversible. (y/n): y Successfully deleted chute chute_abc123 ``` > **⚠️ Warning:** Deletion is permanent and cannot be undone! ## Image Management ### `chutes images list` List all your Docker images. ```bash chutes images list [OPTIONS] ``` **Options:** - `--name TEXT`: Filter by name - `--limit INTEGER`: Number of items per page (default: 25) - `--page INTEGER`: Page number (default: 0) - `--include-public`: Include public images in results **Examples:** ```bash # List all your images chutes images list # Filter by name chutes images list --name my-app # Include public images chutes images list --include-public ``` **Output:** ``` ┌─────────────────┬─────────────────┬─────────┬──────────────────┬─────────────────────┐ │ ID │ Name │ Tag │ Status │ Created │ ├─────────────────┼─────────────────┼─────────┼──────────────────┼─────────────────────┤ │ img_abc123 │ sentiment-api │ 1.0 │ built and pushed │ 2024-01-15 10:30:00 │ │ img_def456 │ image-gen │ 2.1 │ built and pushed │ 2024-01-20 14:45:00 │ │ img_ghi789 │ test-app │ dev │ building │ 2024-01-25 09:15:00 │ └─────────────────┴─────────────────┴─────────┴──────────────────┴─────────────────────┘ ``` ### `chutes images get` Get detailed information about a specific image. 
```bash chutes images get ``` **Arguments:** - `name_or_id`: Name or UUID of the image **Example:** ```bash chutes images get my-app ``` ### `chutes images delete` Delete an image. ```bash chutes images delete ``` **Arguments:** - `name_or_id`: Name or UUID of the image to delete **Example:** ```bash chutes images delete old-image:1.0 ``` > **Note:** You cannot delete images that are currently in use by deployed chutes. ## Sharing Chutes ### `chutes share` Share a chute with another user or remove sharing. ```bash chutes share [OPTIONS] ``` **Options:** - `--chute-id TEXT`: The chute UUID or name to share (required) - `--user-id TEXT`: The user UUID or username to share with (required) - `--config-path TEXT`: Custom config path - `--remove`: Remove sharing instead of adding **Examples:** ```bash # Share a chute with another user chutes share --chute-id my-chute --user-id colleague # Share by UUIDs chutes share --chute-id abc123-def456 --user-id user789-xyz # Remove sharing chutes share --chute-id my-chute --user-id colleague --remove ``` ### Sharing and Billing When you share a chute: - **Chute Owner**: Pays the hourly compute rate while instances are running - **Shared User**: Pays the standard invocation rate (per token, per step, etc.) This allows you to provide access to your deployed models while sharing the costs appropriately. ## Warming Up Chutes ### `chutes warmup` Warm up a chute to ensure an instance is ready to handle requests. ```bash chutes warmup [OPTIONS] ``` **Arguments:** - `chute_id_or_ref`: The chute UUID, name, or file reference (`filename:chutevarname`) **Options:** - `--config-path TEXT`: Custom config path - `--debug`: Enable debug logging **Examples:** ```bash # Warm up by name chutes warmup my-chute # Warm up by UUID chutes warmup abc123-def456 # Warm up from file reference chutes warmup my_chute:chute ``` **Output:** ``` Status: cold -- Starting instance... Status: warming -- Loading model... Status: hot -- Instance is ready! 
``` Use warmup to reduce latency for the first request to a cold chute. ## Common Workflows ### Deploying Updates ```bash # 1. Build new image chutes build my_chute:chute --wait # 2. Delete old chute (if needed) chutes chutes delete my-chute # 3. Deploy new version chutes deploy my_chute:chute --accept-fee # 4. Warm up chutes warmup my-chute ``` ### Cleaning Up Resources **Important:** You must delete chutes *before* deleting the images they use. Images tied to existing chutes (even if not currently running) cannot be deleted. ```bash # List all chutes chutes chutes list # Delete unused chutes first chutes chutes delete old-chute-1 chutes chutes delete old-chute-2 # List all images chutes images list # Delete unused images (after their chutes are removed) chutes images delete old-image:1.0 chutes images delete test-image:dev ``` ### Sharing with Team Members ```bash # Share with multiple users chutes share --chute-id my-model --user-id alice chutes share --chute-id my-model --user-id bob chutes share --chute-id my-model --user-id charlie # Later, remove access chutes share --chute-id my-model --user-id bob --remove ``` ## Automation and Scripting ### Bash Scripting ```bash #!/bin/bash # Deploy and warm up script set -e CHUTE_REF="my_chute:chute" CHUTE_NAME="my-chute" echo "Building image..." chutes build $CHUTE_REF --wait echo "Deploying chute..." chutes deploy $CHUTE_REF --accept-fee echo "Warming up..." chutes warmup $CHUTE_NAME echo "Deployment complete!" 
``` ### Python Scripting ```python #!/usr/bin/env python3 import subprocess import sys def run_command(command): """Run a chutes CLI command.""" result = subprocess.run( f"chutes {command}".split(), capture_output=True, text=True ) if result.returncode != 0: print(f"Error: {result.stderr}") sys.exit(1) return result.stdout def main(): # List all chutes print("Your chutes:") output = run_command("chutes list") print(output) # Check specific chute print("\nChute details:") output = run_command("chutes get my-chute") print(output) if __name__ == "__main__": main() ``` ## Troubleshooting ### Common Issues **Chute not found?** ```bash # Check exact name/ID chutes chutes list # Use the exact name or UUID from the list chutes chutes get exact-chute-name ``` **Cannot delete chute?** The deletion requires confirmation. Type `y` when prompted: ```bash chutes chutes delete my-chute # Are you sure you want to delete chutes/my-chute? This action is irreversible. (y/n): y ``` **Image status not "built and pushed"?** ```bash # Check image status chutes images get my-image # If status is "building", wait for build to complete # If status shows an error, rebuild the image chutes build my_chute:chute --wait ``` **Warmup fails?** ```bash # Enable debug logging chutes warmup my-chute --debug # Check chute exists chutes chutes get my-chute ``` ## Best Practices ### 1. Regular Cleanup Periodically review and delete unused resources: ```bash # Review chutes chutes chutes list # Review images chutes images list # Delete what you no longer need chutes chutes delete unused-chute chutes images delete old-image:tag ``` ### 2. Use Descriptive Names Name your chutes and images clearly: ``` # Good sentiment-analyzer-bert-v2 image-gen-sdxl-1.0 llm-llama3-8b-instruct # Not as good test1 my-app chute ``` ### 3. Warm Up Before Critical Usage If you need low latency, warm up your chute before sending requests: ```bash chutes warmup my-chute # Wait for "hot" status # Then send your requests ``` ### 4. 
Share Instead of Making Public For most use cases, sharing with specific users is better than making chutes public: ```bash # Better: Share with specific users chutes share --chute-id my-chute --user-id trusted-user # Only if needed: Deploy as public (requires permissions) chutes deploy my_chute:chute --public --accept-fee ``` ## Next Steps - **[Building Images](/docs/cli/build)** - Optimize your images - **[Deploying Chutes](/docs/cli/deploy)** - Advanced deployment strategies - **[Account Management](/docs/cli/account)** - API keys and billing - **[CLI Overview](/docs/cli/overview)** - Return to command overview --- ## SOURCE: https://chutes.ai/docs/cli/overview # CLI Command Overview The Chutes CLI provides a complete set of commands for managing your AI applications, from account setup to deployment and monitoring. ## Installation The CLI is included when you install the Chutes SDK: ```bash pip install chutes ``` Verify installation: ```bash chutes --help ``` ## Command Structure All Chutes commands follow this pattern: ```bash chutes [subcommand] [options] [arguments] ``` ## Account Management ### `chutes register` Create a new account with the Chutes platform. ```bash chutes register [OPTIONS] ``` **Options:** - `--config-path TEXT`: Custom path to config file - `--username TEXT`: Desired username - `--wallets-path TEXT`: Path to Bittensor wallets directory (default: `~/.bittensor/wallets`) - `--wallet TEXT`: Name of the wallet to use - `--hotkey TEXT`: Hotkey to register with **Example:** ```bash chutes register --username myuser ``` ## Building & Deployment ### `chutes build` Build a Docker image for your chute. 
```bash chutes build [OPTIONS] ``` **Arguments:** - `chute_ref`: Chute reference in format `module:chute_name` **Options:** - `--config-path TEXT`: Custom config path - `--logo TEXT`: Path to logo image - `--local`: Build locally instead of remotely - `--debug`: Enable debug logging - `--include-cwd`: Include entire current directory in build context - `--wait`: Wait for build to complete - `--public`: Mark image as public **Examples:** ```bash # Build remotely and wait for completion chutes build my_chute:chute --wait # Build locally for testing chutes build my_chute:chute --local # Build with a logo and make public chutes build my_chute:chute --logo ./logo.png --public ``` ### `chutes deploy` Deploy a chute to the platform. ```bash chutes deploy [OPTIONS] ``` **Arguments:** - `chute_ref`: Chute reference in format `module:chute_name` **Options:** - `--config-path TEXT`: Custom config path - `--logo TEXT`: Path to logo image - `--debug`: Enable debug logging - `--public`: Mark chute as public - `--accept-fee`: Acknowledge the deployment fee and accept being charged **Examples:** ```bash # Basic deployment chutes deploy my_chute:chute # Deploy with logo chutes deploy my_chute:chute --logo ./logo.png # Deploy and accept the deployment fee chutes deploy my_chute:chute --accept-fee ``` ### `chutes run` Run a chute locally for development and testing. ```bash chutes run [OPTIONS] ``` **Arguments:** - `chute_ref`: Chute reference in format `module:chute_name` **Options:** - `--host TEXT`: Host to bind to (default: 0.0.0.0) - `--port INTEGER`: Port to listen on (default: 8000) - `--debug`: Enable debug logging - `--dev`: Enable development mode **Examples:** ```bash # Run on default port chutes run my_chute:chute --dev # Run on custom port with debug chutes run my_chute:chute --port 8080 --debug --dev ``` ### `chutes share` Share a chute with another user. 
```bash chutes share [OPTIONS] ``` **Options:** - `--chute-id TEXT`: The chute UUID or name to share (required) - `--user-id TEXT`: The user UUID or username to share with (required) - `--config-path TEXT`: Custom config path - `--remove`: Unshare/remove the share instead of adding **Examples:** ```bash # Share a chute with another user chutes share --chute-id my-chute --user-id anotheruser # Remove sharing chutes share --chute-id my-chute --user-id anotheruser --remove ``` ### `chutes warmup` Warm up a chute to ensure an instance is ready for requests. ```bash chutes warmup [OPTIONS] ``` **Arguments:** - `chute_id_or_ref`: The chute UUID, name, or file reference (format: `filename:chutevarname`) **Options:** - `--config-path TEXT`: Custom config path - `--debug`: Enable debug logging **Example:** ```bash chutes warmup my-chute ``` ## Resource Management ### `chutes chutes` Manage your deployed chutes. #### `chutes chutes list` List your chutes. ```bash chutes chutes list [OPTIONS] ``` **Options:** - `--name TEXT`: Filter by name - `--limit INTEGER`: Number of items per page (default: 25) - `--page INTEGER`: Page number (default: 0) - `--include-public`: Include public chutes **Example:** ```bash chutes chutes list --limit 10 --include-public ``` #### `chutes chutes get` Get detailed information about a specific chute. ```bash chutes chutes get ``` **Example:** ```bash chutes chutes get my-awesome-chute ``` #### `chutes chutes delete` Delete a chute. ```bash chutes chutes delete ``` **Example:** ```bash chutes chutes delete my-old-chute ``` ### `chutes images` Manage your Docker images. #### `chutes images list` List your images. ```bash chutes images list [OPTIONS] ``` **Options:** - `--name TEXT`: Filter by name - `--limit INTEGER`: Number of items per page (default: 25) - `--page INTEGER`: Page number (default: 0) - `--include-public`: Include public images #### `chutes images get` Get detailed information about a specific image. 
```bash chutes images get ``` #### `chutes images delete` Delete an image. ```bash chutes images delete ``` ### `chutes keys` Manage API keys. #### `chutes keys create` Create a new API key. ```bash chutes keys create [OPTIONS] ``` **Options:** - `--name TEXT`: Name for the API key (required) - `--admin`: Create admin key with full permissions - `--images`: Allow full access to images - `--chutes`: Allow full access to chutes - `--image-ids TEXT`: Specific image IDs to allow (can be repeated) - `--chute-ids TEXT`: Specific chute IDs to allow (can be repeated) - `--action [read|write|delete|invoke]`: Specify action scope - `--json-input TEXT`: Provide raw scopes document as JSON for advanced usage - `--config-path TEXT`: Custom config path **Examples:** ```bash # Admin key chutes keys create --name admin-key --admin # Key with invoke access to all chutes chutes keys create --name invoke-key --chutes --action invoke # Key with access to specific chute chutes keys create --name readonly-key --chute-ids 12345 --action read ``` #### `chutes keys list` List your API keys. ```bash chutes keys list [OPTIONS] ``` **Options:** - `--name TEXT`: Filter by name - `--limit INTEGER`: Number of items per page (default: 25) - `--page INTEGER`: Page number (default: 0) #### `chutes keys get` Get details about a specific API key. ```bash chutes keys get ``` #### `chutes keys delete` Delete an API key. ```bash chutes keys delete ``` ### `chutes secrets` Manage secrets for your chutes (e.g., HuggingFace tokens for private models). #### `chutes secrets create` Create a new secret. 
```bash chutes secrets create [OPTIONS] ``` **Options:** - `--purpose TEXT`: The chute UUID or name this secret is for (required) - `--key TEXT`: The secret key/name (required) - `--value TEXT`: The secret value (required) - `--config-path TEXT`: Custom config path **Example:** ```bash chutes secrets create --purpose my-chute --key HF_TOKEN --value hf_xxxxxxxxxxxx ``` #### `chutes secrets list` List your secrets. ```bash chutes secrets list [OPTIONS] ``` **Options:** - `--limit INTEGER`: Number of items per page (default: 25) - `--page INTEGER`: Page number (default: 0) #### `chutes secrets get` Get details about a specific secret. ```bash chutes secrets get ``` #### `chutes secrets delete` Delete a secret. ```bash chutes secrets delete ``` ## Utilities ### `chutes report` Report an invocation for billing/tracking purposes. ```bash chutes report [OPTIONS] ``` ### `chutes refinger` Change your fingerprint. ```bash chutes refinger [OPTIONS] ``` ## Global Options These options work with most commands: - `--help`: Show help message - `--config-path TEXT`: Path to custom config file - `--debug`: Enable debug logging ## Configuration ### Config File Location Default: `~/.chutes/config.ini` Override with: ```bash export CHUTES_CONFIG_PATH=/path/to/config.ini ``` ### Environment Variables - `CHUTES_CONFIG_PATH`: Custom config file path - `CHUTES_API_URL`: API base URL - `CHUTES_ALLOW_MISSING`: Allow missing config ## Common Workflows ### 1. First-Time Setup ```bash # Register account chutes register # Create admin API key chutes keys create --name admin --admin ``` ### 2. Develop and Deploy ```bash # Build your image chutes build my_app:chute --wait # Test locally docker run --rm -it -e CHUTES_EXECUTION_CONTEXT=REMOTE -p 8000:8000 my_app:tag chutes run my_app:chute --port 8000 --dev # Deploy to production chutes deploy my_app:chute --accept-fee ``` ### 3. 
Manage Resources ```bash # List your chutes chutes chutes list # Get detailed info chutes chutes get my-app # Warm up a chute chutes warmup my-app # Share with another user chutes share --chute-id my-app --user-id colleague # Clean up old resources chutes chutes delete old-chute chutes images delete old-image ``` ### Support Resources - 📖 **Documentation**: [Complete Docs](/docs) - 💬 **Discord**: [Community Chat](https://discord.gg/wHrXwWkCRz) - 📨 **Support**: [Email](support@chutes.ai) - 🐛 **Issues**: [GitHub Issues](https://github.com/chutesai/chutes/issues) --- Continue to specific command documentation: - **[Account Management](/docs/cli/account)** - Detailed account commands - **[Building Images](/docs/cli/build)** - Advanced build options - **[Deploying Chutes](/docs/cli/deploy)** - Deployment strategies - **[Managing Resources](/docs/cli/manage)** - Resource management --- ## SOURCE: https://chutes.ai/docs/cli/troubleshooting # Troubleshooting the CLI ### Common Issues **Command not found** ```bash # Check installation pip show chutes # Try with Python module python -m chutes --help ``` **Authentication errors** ```bash # Re-register if needed chutes register # Check config file cat ~/.chutes/config.ini ``` **Build failures** ```bash # Try local build for debugging chutes build my_app:chute --local --debug # Check image syntax python -c "from my_app import chute; print(chute.image)" ``` **Deployment issues** ```bash # Verify image exists and is built chutes images list --name my-image chutes images get my-image # Check chute status chutes chutes get my-chute ``` ### Debug Mode Enable debug logging for detailed output: ```bash chutes build my_app:chute --debug ``` ## Getting Help ### Built-in Help ```bash # General help chutes --help # Command-specific help chutes build --help chutes deploy --help chutes chutes list --help ``` ### Support Resources - 📖 **Documentation**: [Complete Docs](/docs) - 💬 **Discord**: [Community Chat](https://discord.gg/wHrXwWkCRz) - 
📨 **Support**: [Email](support@chutes.ai) - 🐛 **Issues**: [GitHub Issues](https://github.com/chutesai/chutes/issues) --- Continue to specific command documentation: - **[Account Management](/docs/cli/account)** - Detailed account commands - **[Building Images](/docs/cli/build)** - Advanced build options - **[Deploying Chutes](/docs/cli/deploy)** - Deployment strategies - **[Managing Resources](/docs/cli/manage)** - Resource management --- ## SOURCE: https://chutes.ai/docs/cli/website-account-update # Updating an account made on the website If you created your account on the website and now wish to use the CLI, follow this guide. Through it you will create a Bittensor wallet and a Chutes config file, then sync that info with your account. ## Updating an account made with the website There are several steps required to use the CLI if you originally registered on the website and did not provide a hotkey/coldkey. To begin, if you do not already have a Bittensor wallet, install the Bittensor CLI and create one. ## Install the Bittensor CLI and create a new wallet Install the Bittensor CLI: ```bash pip install bittensor-cli ``` Verify installation: ```bash btcli --version ``` Create a new wallet with a coldkey and hotkey: ```bash btcli wallet create --wallet.name --wallet.hotkey ``` You will then be prompted to configure the wallet by setting a password for the coldkey and choosing the desired mnemonic length. Completing the prompts creates a complete Bittensor wallet by setting up both the coldkey and hotkey. A unique mnemonic is generated for each key and output to the terminal upon creation. Your new wallet can then be found here: ```bash ~/.bittensor/wallets ``` You can see the full contents like this: ```bash tree ~/.bittensor/ ``` It should look something like this: ```bash tree ~/.bittensor/ /Users/docwriter/.bittensor/ # The Bittensor root directory. └── wallets # The folder contains all Bittensor wallets. 
└── my_coldkey # The name of the wallet.    ├── coldkey # The password-encrypted coldkey.    ├── coldkeypub.txt # The unencrypted version of the coldkey.    └── hotkeys # The folder contains all this coldkey's hotkeys.    └── my_hotkey # The unencrypted hotkey information. ``` You can then check the data in any of these files like this: ```bash cd ~/.bittensor/wallets/my_coldkey cat coldkeypub.txt | jq { "accountId": "0x36e49805b105af2b5572cfc86426247df111df2f584767ca739d9fa085246c51", "publicKey": "0x36e49805b105af2b5572cfc86426247df111df2f584767ca739d9fa085246c51", "privateKey": null, "secretPhrase": null, "secretSeed": null, "ss58Address": "5DJgMDvzC27QTBfmgGQaNWBQd8CKP9z5A12yjbG6TZ5bxNE1" } ``` Once the wallet is created, you can move on to the next step: creating the config.ini file. ## Creating your Chutes config.ini file Create a file called config.ini and place it in the `~/.chutes` folder; the final path should be `~/.chutes/config.ini`. The contents of the config.ini file should be as follows: ```ini [api] base_url = https://api.chutes.ai [auth] username = me user_id = uid hotkey_seed = replaceme hotkey_name = replaceme hotkey_ss58address = replaceme [payment] address = replaceme ``` You can get your username and user_id with the get user info API endpoint: ```bash curl -X GET "https://api.chutes.ai/users/me" \ -H "Content-Type: application/json" \ -H "Authorization: Bearer " ``` Add the username and user_id from the output of this command to the config.ini file in their designated spots. The hotkey_name is the base file name of your hotkey. In this example it would be my_hotkey. 
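If you prefer generating the file to hand-editing it, the same layout can be written with Python's standard-library configparser. This is a sketch of the file structure shown above; every value passed in is a placeholder that you must replace with your own data from the steps in this guide.

```python
# Write a ~/.chutes/config.ini with the layout shown above.
# All argument values are placeholders; fill them in per this guide.
import configparser
from pathlib import Path

def write_chutes_config(path, username, user_id, hotkey_seed,
                        hotkey_name, hotkey_ss58address, payment_address):
    config = configparser.ConfigParser()
    config["api"] = {"base_url": "https://api.chutes.ai"}
    config["auth"] = {
        "username": username,
        "user_id": user_id,
        "hotkey_seed": hotkey_seed,               # secretSeed, without the 0x prefix
        "hotkey_name": hotkey_name,               # base file name of the hotkey
        "hotkey_ss58address": hotkey_ss58address,
    }
    config["payment"] = {"address": payment_address}  # coldkey ss58Address
    target = Path(path).expanduser()
    target.parent.mkdir(parents=True, exist_ok=True)
    with open(target, "w") as f:
        config.write(f)

# write_chutes_config("~/.chutes/config.ini", "me", "uid", "replaceme",
#                     "my_hotkey", "replaceme", "replaceme")
```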
Next, locate the required hotkey info from this location: ```bash cd ~/.bittensor/wallets/my_coldkey/hotkeys cat my_hotkey | jq { "accountId": "0xc66695556006c79e278f487b01d44cf4bc611f195615a321bf3208f5e351621e", "publicKey": "0xc66695556006c79e278f487b01d44cf4bc611f195615a321bf3208f5e351621e", "privateKey": "0x38d3ae3b6e4b5df8415d15f44f * * * 0f975749f835fc221b * * * cbaac9f5ba6b1c90978e3858 * * * f0e0470be681c0b28fe2d64", "secretPhrase": "pyramid xxx wide slush xxx hub xxx crew spin xxx easily xxx", "secretSeed": "0x6c359cc52ff1256c9e5 * * * 5536c * * * 892e9ffe4e4066ad2a6e35561d6964e", "ss58Address": "5GYqp3eKu6W7KxhCNrHrVaPjsJHHLuAs5jbYWfeNzVudH8DE" } ``` Update the missing fields in the config.ini file with the info found here. In the hotkey_seed field, place the value from secretSeed. (Remove the `0x` prefix from the front of the secret seed before you add it to config.ini, or it will not work.) In the hotkey_ss58address field, place the value from ss58Address. Finally, locate the coldkey ss58Address and put it in the address field of the payment section. ```bash cd ~/.bittensor/wallets/my_coldkey cat coldkeypub.txt | jq { "accountId": "0x36e49805b105af2b5572cfc86426247df111df2f584767ca739d9fa085246c51", "publicKey": "0x36e49805b105af2b5572cfc86426247df111df2f584767ca739d9fa085246c51", "privateKey": null, "secretPhrase": null, "secretSeed": null, "ss58Address": "5DJgMDvzC27QTBfmgGQaNWBQd8CKP9z5A12yjbG6TZ5bxNE1" } ``` The config.ini file is now complete; save it and close it. 
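The two copy-paste steps above (secretSeed without its `0x` prefix, plus the ss58Address) are easy to get wrong, so here is a small sketch that pulls both values out of an unencrypted hotkey file. It assumes only the JSON key layout shown above.

```python
# Extract config.ini auth values from an unencrypted hotkey JSON file,
# stripping the 0x prefix from secretSeed as required above.
import json

def hotkey_config_values(hotkey_json: str) -> dict:
    key = json.loads(hotkey_json)
    seed = key["secretSeed"]
    if seed.startswith("0x"):   # the 0x prefix must be removed,
        seed = seed[2:]         # or authentication will not work
    return {"hotkey_seed": seed, "hotkey_ss58address": key["ss58Address"]}

sample = '{"secretSeed": "0xabc123", "ss58Address": "5GYqp3eKu6W7..."}'
print(hotkey_config_values(sample))
# {'hotkey_seed': 'abc123', 'hotkey_ss58address': '5GYqp3eKu6W7...'}
```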
## Update your Chutes account The final step is to update your Chutes account with the newly created hotkey and coldkey: ```bash curl -XPOST https://api.chutes.ai/users/change_bt_auth -H "Authorization: " -H "Content-Type: application/json" -d '{"coldkey": "ss58 of the coldkey, from ~/.bittensor/wallets/your-coldkey/coldkeypub.txt", "hotkey": "ss58Address from the hotkey"}' ``` When the command completes, check your Chutes account on the website and confirm that the hotkey and coldkey match those in your wallet. ### Support Resources - 📖 **Documentation**: [Complete Docs](/docs) - 💬 **Discord**: [Community Chat](https://discord.gg/wHrXwWkCRz) - 📨 **Support**: [Email](support@chutes.ai) - 🐛 **Issues**: [GitHub Issues](https://github.com/chutesai/chutes/issues) --- ## SOURCE: https://chutes.ai/docs/examples/audio-processing # Audio Processing with Chutes This guide demonstrates comprehensive audio processing capabilities using Chutes, from basic audio manipulation to advanced machine learning tasks like speech recognition, synthesis, and audio analysis. ## Overview Audio processing with Chutes enables: - **Speech Recognition**: Convert speech to text with high accuracy - **Text-to-Speech**: Generate natural-sounding speech from text - **Audio Enhancement**: Noise reduction, audio restoration, and quality improvement - **Music Analysis**: Beat detection, genre classification, and audio fingerprinting - **Real-time Processing**: Stream audio processing with low latency - **Multi-format Support**: Handle various audio formats (WAV, MP3, FLAC, etc.) 
## Quick Start ### Basic Audio Processing Setup ```python from chutes.image import Image from chutes.chute import Chute, NodeSelector from pydantic import BaseModel from typing import List, Dict, Any, Optional import base64 class AudioProcessingConfig(BaseModel): input_format: str = "wav" output_format: str = "wav" sample_rate: int = 16000 channels: int = 1 bit_depth: int = 16 # Audio processing image with all dependencies audio_image = ( Image( username="myuser", name="audio-processing", tag="1.0.0", python_version="3.11" ) .run_command(""" apt-get update && apt-get install -y \\ ffmpeg \\ libsndfile1 \\ libsndfile1-dev \\ portaudio19-dev \\ libportaudio2 \\ libportaudiocpp0 \\ pulseaudio """) .run_command("pip install librosa==0.10.1 soundfile==0.12.1 pydub==0.25.1 pyaudio==0.2.11 numpy==1.24.3 scipy==1.11.4 torch==2.1.0 torchaudio==2.1.0 transformers==4.35.0 openai-whisper==20231117") .add("./audio_utils", "/app/audio_utils") .add("./models", "/app/models") ) ``` ## Speech Recognition ### Whisper-based Speech-to-Text ```python import whisper import librosa import soundfile as sf import numpy as np from pydantic import BaseModel from typing import Optional, List, Dict, Any import tempfile import os class TranscriptionRequest(BaseModel): audio_base64: str language: Optional[str] = None task: str = "transcribe" # "transcribe" or "translate" temperature: float = 0.0 word_timestamps: bool = False class TranscriptionResponse(BaseModel): text: str language: str segments: List[Dict[str, Any]] processing_time_ms: float class WhisperTranscriber: def __init__(self, model_size: str = "base"): self.model = whisper.load_model(model_size) self.model_size = model_size def preprocess_audio(self, audio_data: bytes) -> np.ndarray: """Preprocess audio for Whisper""" # Save bytes to temporary file with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as temp_file: temp_file.write(audio_data) temp_path = temp_file.name try: # Load and resample to 16kHz (Whisper requirement) audio, sr 
= librosa.load(temp_path, sr=16000, mono=True) return audio finally: os.unlink(temp_path) def transcribe_audio(self, audio_data: bytes, options: TranscriptionRequest) -> TranscriptionResponse: """Transcribe audio using Whisper""" import time start_time = time.time() # Preprocess audio audio = self.preprocess_audio(audio_data) # Transcription options transcribe_options = { "language": options.language, "task": options.task, "temperature": options.temperature, "word_timestamps": options.word_timestamps } # Remove None values transcribe_options = {k: v for k, v in transcribe_options.items() if v is not None} # Transcribe result = self.model.transcribe(audio, **transcribe_options) processing_time = (time.time() - start_time) * 1000 return TranscriptionResponse( text=result["text"].strip(), language=result["language"], segments=result["segments"], processing_time_ms=processing_time ) # Global transcriber instance transcriber = None def initialize_transcriber(model_size: str = "base"): """Initialize Whisper transcriber""" global transcriber transcriber = WhisperTranscriber(model_size) return {"status": "initialized", "model": model_size} async def transcribe_speech(inputs: Dict[str, Any]) -> Dict[str, Any]: """Speech recognition endpoint""" request = TranscriptionRequest(**inputs) # Decode base64 audio audio_data = base64.b64decode(request.audio_base64) # Transcribe result = transcriber.transcribe_audio(audio_data, request) return result.dict() ``` ### Real-time Speech Recognition ```python import pyaudio import threading import queue import numpy as np from collections import deque class RealTimeTranscriber: def __init__(self, model_size: str = "base", chunk_duration: float = 2.0): self.model = whisper.load_model(model_size) self.chunk_duration = chunk_duration self.sample_rate = 16000 self.chunk_size = int(chunk_duration * self.sample_rate) # Audio streaming setup self.audio_queue = queue.Queue() self.is_recording = False self.audio_buffer = 
deque(maxlen=self.sample_rate * 10) # 10 second buffer self.results_queue = queue.Queue() # transcriptions published by the worker thread def start_recording(self): """Start real-time audio recording""" self.is_recording = True audio = pyaudio.PyAudio() stream = audio.open( format=pyaudio.paFloat32, channels=1, rate=self.sample_rate, input=True, frames_per_buffer=1024, stream_callback=self._audio_callback ) stream.start_stream() # Start transcription thread transcription_thread = threading.Thread(target=self._transcription_worker) transcription_thread.start() return stream, audio def _audio_callback(self, in_data, frame_count, time_info, status): """Audio input callback""" audio_data = np.frombuffer(in_data, dtype=np.float32) self.audio_buffer.extend(audio_data) # Check if we have enough data for a chunk if len(self.audio_buffer) >= self.chunk_size: chunk = np.array(list(self.audio_buffer)[-self.chunk_size:]) self.audio_queue.put(chunk) return (None, pyaudio.paContinue) def _transcription_worker(self): """Background transcription worker; a thread target cannot be a generator, so results are published to results_queue""" import time while self.is_recording: try: # Get audio chunk audio_chunk = self.audio_queue.get(timeout=1.0) # Transcribe chunk result = self.model.transcribe(audio_chunk, language="en") if result["text"].strip(): self.results_queue.put({ "text": result["text"].strip(), "timestamp": time.time(), "confidence": self._estimate_confidence(result) }) except queue.Empty: continue except Exception as e: print(f"Transcription error: {e}") def _estimate_confidence(self, result): """Estimate transcription confidence""" # Simple confidence estimation based on segment probabilities if "segments" in result and result["segments"]: avg_prob = np.mean([seg.get("avg_logprob", -1.0) for seg in result["segments"]]) return max(0.0, min(1.0, (avg_prob + 1.0))) return 0.5 ``` ## Text-to-Speech ### Advanced TTS with Coqui TTS ```python import os import torch import librosa from TTS.api import TTS import tempfile import base64 from pydantic import BaseModel from typing import Optional, Dict, Any class TTSRequest(BaseModel): text: str speaker: Optional[str] = None language: str = "en" speed: float = 1.0 emotion: Optional[str] =
None class TTSResponse(BaseModel): audio_base64: str sample_rate: int duration_seconds: float processing_time_ms: float class AdvancedTTSService: def __init__(self): # Initialize Coqui TTS self.device = "cuda" if torch.cuda.is_available() else "cpu" # Load multi-speaker TTS model self.tts = TTS( model_name="tts_models/multilingual/multi-dataset/xtts_v2", progress_bar=False ).to(self.device) # Available speakers and languages self.speakers = self.tts.speakers if hasattr(self.tts, 'speakers') else [] self.languages = self.tts.languages if hasattr(self.tts, 'languages') else ["en"] def synthesize_speech(self, request: TTSRequest) -> TTSResponse: """Synthesize speech from text""" import time start_time = time.time() # Create temporary output file with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as temp_file: output_path = temp_file.name try: # Synthesize speech (speed is applied here; a second time-stretch pass would skew the reported duration without changing the encoded file) self.tts.tts_to_file( text=request.text, file_path=output_path, speaker=request.speaker, language=request.language, speed=request.speed ) # Load generated audio to compute the duration audio, sample_rate = librosa.load(output_path, sr=None) # Convert to base64 with open(output_path, "rb") as f: audio_base64 = base64.b64encode(f.read()).decode() processing_time = (time.time() - start_time) * 1000 duration = len(audio) / sample_rate return TTSResponse( audio_base64=audio_base64, sample_rate=sample_rate, duration_seconds=duration, processing_time_ms=processing_time ) finally: # Cleanup if os.path.exists(output_path): os.unlink(output_path) # Global TTS service tts_service = None def initialize_tts(): """Initialize TTS service""" global tts_service tts_service = AdvancedTTSService() return { "status": "initialized", "speakers": tts_service.speakers, "languages": tts_service.languages } async def synthesize_text(inputs: Dict[str, Any]) -> Dict[str, Any]: """Text-to-speech endpoint""" request =
TTSRequest(**inputs) result = tts_service.synthesize_speech(request) return result.dict() ``` ## Audio Enhancement ### Noise Reduction and Audio Restoration ```python import librosa import numpy as np import tempfile import base64 import soundfile as sf from scipy import signal from typing import Optional, List, Dict, Any import noisereduce as nr class AudioEnhancer: def __init__(self): self.sample_rate = 22050 def reduce_noise(self, audio: np.ndarray, noise_profile: Optional[np.ndarray] = None) -> np.ndarray: """Reduce background noise using spectral gating (noisereduce)""" if noise_profile is None: # Use first 0.5 seconds as noise profile noise_duration = int(0.5 * self.sample_rate) noise_profile = audio[:noise_duration] # Apply noise reduction using the noise profile reduced_noise = nr.reduce_noise( y=audio, sr=self.sample_rate, y_noise=noise_profile, stationary=True, prop_decrease=0.8 ) return reduced_noise def normalize_audio(self, audio: np.ndarray, target_level: float = -23.0) -> np.ndarray: """Peak-normalize audio to a target level in dBFS""" current_peak = np.max(np.abs(audio)) if current_peak > 0: target_peak = 10 ** (target_level / 20) normalization_factor = target_peak / current_peak return audio * normalization_factor return audio def apply_eq(self, audio: np.ndarray, eq_bands: List[Dict[str, float]]) -> np.ndarray: """Apply parametric EQ with multiple bands""" processed_audio = audio.copy() for band in eq_bands: frequency = band["frequency"] gain = band["gain"] q_factor = band.get("q", 1.0) # Design filter nyquist = self.sample_rate / 2 normalized_freq = frequency / nyquist if gain != 0: # iirpeak designs a band-pass filter, so isolate the band and mix it back in to boost (gain > 0) or cut (gain < 0) b, a = signal.iirpeak(normalized_freq, Q=q_factor) band_signal = signal.lfilter(b, a, processed_audio) processed_audio = processed_audio + (10 ** (gain / 20) - 1.0) * band_signal return processed_audio def remove_clicks_pops(self, audio: np.ndarray, threshold: float = 0.1) -> np.ndarray: """Remove clicks and pops from audio""" # Detect sudden amplitude changes
diff = np.diff(audio) click_indices = np.where(np.abs(diff) > threshold)[0] # Interpolate over detected clicks for idx in click_indices: if idx > 0 and idx < len(audio) - 1: # Linear interpolation audio[idx] = (audio[idx-1] + audio[idx+1]) / 2 return audio async def enhance_audio(inputs: Dict[str, Any]) -> Dict[str, Any]: """Audio enhancement endpoint""" # Decode input audio audio_base64 = inputs["audio_base64"] audio_data = base64.b64decode(audio_base64) # Load audio with tempfile.NamedTemporaryFile(suffix=".wav") as temp_file: temp_file.write(audio_data) temp_file.flush() audio, sr = librosa.load(temp_file.name, sr=None) enhancer = AudioEnhancer() # Apply enhancements based on options options = inputs.get("options", {}) if options.get("reduce_noise", False): audio = enhancer.reduce_noise(audio) if options.get("normalize", False): target_level = options.get("target_level", -23.0) audio = enhancer.normalize_audio(audio, target_level) if "eq_bands" in options: audio = enhancer.apply_eq(audio, options["eq_bands"]) if options.get("remove_clicks", False): audio = enhancer.remove_clicks_pops(audio) # Save enhanced audio with tempfile.NamedTemporaryFile(suffix=".wav") as temp_file: sf.write(temp_file.name, audio, sr) temp_file.seek(0) enhanced_audio_base64 = base64.b64encode(temp_file.read()).decode() return { "enhanced_audio_base64": enhanced_audio_base64, "sample_rate": sr, "duration_seconds": len(audio) / sr } ``` ## Music Analysis ### Beat Detection and Tempo Analysis ```python import librosa import numpy as np from typing import List, Tuple class MusicAnalyzer: def __init__(self): self.sample_rate = 22050 def detect_beats(self, audio: np.ndarray) -> Tuple[np.ndarray, float]: """Detect beats and estimate tempo""" # Extract tempo and beats tempo, beats = librosa.beat.beat_track( y=audio, sr=self.sample_rate, hop_length=512 ) # Convert beat frames to time beat_times = librosa.frames_to_time(beats, sr=self.sample_rate) return beat_times, tempo def 
analyze_key_signature(self, audio: np.ndarray) -> str: """Analyze musical key signature""" # Extract chromagram chroma = librosa.feature.chroma_stft(y=audio, sr=self.sample_rate) # Average chroma across time chroma_mean = np.mean(chroma, axis=1) # Key templates (major and minor) major_template = np.array([1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1]) minor_template = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0]) # Find best matching key keys = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B'] best_correlation = -1 best_key = 'C major' for i in range(12): # Test major major_corr = np.corrcoef(chroma_mean, np.roll(major_template, i))[0, 1] if major_corr > best_correlation: best_correlation = major_corr best_key = f"{keys[i]} major" # Test minor minor_corr = np.corrcoef(chroma_mean, np.roll(minor_template, i))[0, 1] if minor_corr > best_correlation: best_correlation = minor_corr best_key = f"{keys[i]} minor" return best_key def extract_spectral_features(self, audio: np.ndarray) -> Dict[str, float]: """Extract spectral features for music analysis""" # Compute spectral features spectral_centroid = np.mean(librosa.feature.spectral_centroid(y=audio, sr=self.sample_rate)) spectral_rolloff = np.mean(librosa.feature.spectral_rolloff(y=audio, sr=self.sample_rate)) spectral_bandwidth = np.mean(librosa.feature.spectral_bandwidth(y=audio, sr=self.sample_rate)) zero_crossing_rate = np.mean(librosa.feature.zero_crossing_rate(audio)) # MFCC features mfccs = librosa.feature.mfcc(y=audio, sr=self.sample_rate, n_mfcc=13) mfcc_means = np.mean(mfccs, axis=1) return { "spectral_centroid": float(spectral_centroid), "spectral_rolloff": float(spectral_rolloff), "spectral_bandwidth": float(spectral_bandwidth), "zero_crossing_rate": float(zero_crossing_rate), "mfcc_features": mfcc_means.tolist() } async def analyze_music(inputs: Dict[str, Any]) -> Dict[str, Any]: """Music analysis endpoint""" # Decode input audio audio_base64 = inputs["audio_base64"] audio_data = 
base64.b64decode(audio_base64) # Load audio with tempfile.NamedTemporaryFile(suffix=".wav") as temp_file: temp_file.write(audio_data) temp_file.flush() audio, sr = librosa.load(temp_file.name, sr=22050) analyzer = MusicAnalyzer() # Perform analysis beat_times, tempo = analyzer.detect_beats(audio) key_signature = analyzer.analyze_key_signature(audio) spectral_features = analyzer.extract_spectral_features(audio) return { "tempo": float(tempo), "beat_count": len(beat_times), "beat_times": beat_times.tolist(), "key_signature": key_signature, "spectral_features": spectral_features, "duration_seconds": len(audio) / sr } ``` ## Deployment Examples ### Speech Recognition Service ```python # Deploy speech recognition chute speech_chute = Chute( username="myuser", name="speech-recognition", image=audio_image, entry_file="speech_recognition.py", entry_point="transcribe_speech", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=8), timeout_seconds=300, concurrency=8 ) # Usage transcription_result = speech_chute.run({ "audio_base64": "...", # Base64 encoded audio "language": "en", "word_timestamps": True }) print(f"Transcription: {transcription_result['text']}") ``` ### Audio Enhancement Service ```python # Deploy audio enhancement chute enhancement_chute = Chute( username="myuser", name="audio-enhancement", image=audio_image, entry_file="audio_enhancement.py", entry_point="enhance_audio", node_selector=NodeSelector( gpu_count=0, # CPU-only for audio processing), timeout_seconds=120, concurrency=10 ) # Usage enhanced_result = enhancement_chute.run({ "audio_base64": "...", # Base64 encoded audio "options": { "reduce_noise": True, "normalize": True, "target_level": -20.0, "eq_bands": [ {"frequency": 100, "gain": -3.0, "q": 1.0}, {"frequency": 1000, "gain": 2.0, "q": 1.5}, {"frequency": 8000, "gain": 1.0, "q": 1.0} ] } }) ``` ## Real-time Audio Pipeline ### WebSocket Audio Streaming ```python import asyncio import websockets import json import numpy as np class 
RealTimeAudioProcessor: def __init__(self): self.transcriber = WhisperTranscriber("base") self.enhancer = AudioEnhancer() self.analyzer = MusicAnalyzer() async def process_audio_stream(self, websocket, path): """Handle real-time audio WebSocket connection""" try: async for message in websocket: data = json.loads(message) if data["type"] == "audio_chunk": # Process audio chunk audio_data = base64.b64decode(data["audio_base64"]) # Convert to numpy array audio = np.frombuffer(audio_data, dtype=np.float32) # Process based on request type if data.get("process_type") == "transcribe": result = await self.transcribe_chunk(audio) elif data.get("process_type") == "enhance": result = await self.enhance_chunk(audio) elif data.get("process_type") == "analyze": result = await self.analyze_chunk(audio) # Send result back await websocket.send(json.dumps({ "type": "result", "data": result })) except websockets.exceptions.ConnectionClosed: print("Client disconnected") async def transcribe_chunk(self, audio: np.ndarray) -> Dict[str, Any]: """Transcribe audio chunk""" # Simple transcription for real-time processing if len(audio) > 0: # Convert to bytes for transcriber audio_bytes = audio.tobytes() request = TranscriptionRequest( audio_base64=base64.b64encode(audio_bytes).decode(), temperature=0.0 ) result = self.transcriber.transcribe_audio(audio_bytes, request) return result.dict() return {"text": "", "confidence": 0.0} # Start WebSocket server async def start_audio_server(): processor = RealTimeAudioProcessor() server = await websockets.serve( processor.process_audio_stream, "0.0.0.0", 8765 ) print("Audio processing server started on ws://0.0.0.0:8765") await server.wait_closed() # Run the server if __name__ == "__main__": asyncio.run(start_audio_server()) ``` ## Next Steps - **[Music Generation](music-generation)** - Generate music and audio content - **[Text-to-Speech](text-to-speech)** - Advanced speech synthesis - **[Real-time Streaming](streaming-responses)** - Build streaming 
audio applications - **[Custom Training](custom-training)** - Train custom audio models For production audio processing pipelines, see the [Audio Infrastructure Guide](../guides/audio-infrastructure). --- ## SOURCE: https://chutes.ai/docs/examples/batch-processing # Batch Processing This example shows how to efficiently process multiple inputs in a single request, optimizing GPU utilization and reducing API overhead for high-throughput scenarios. ## What We'll Build A batch text processing service that: - 📊 **Processes multiple texts** in a single request - ⚡ **Optimizes GPU utilization** with efficient batching - 🔄 **Handles variable input sizes** with dynamic padding - 📈 **Provides performance metrics** and timing information - 🛡️ **Validates batch constraints** for stability ## Complete Example ### `batch_processor.py` ````python import torch import time from typing import List, Optional from transformers import AutoTokenizer, AutoModelForSequenceClassification from pydantic import BaseModel, Field, validator from fastapi import HTTPException from chutes.chute import Chute, NodeSelector from chutes.image import Image # === INPUT/OUTPUT SCHEMAS === class BatchTextInput(BaseModel): texts: List[str] = Field(..., min_items=1, max_items=100, description="List of texts to process") max_length: int = Field(512, ge=50, le=1024, description="Maximum token length") batch_size: int = Field(16, ge=1, le=32, description="Processing batch size") @validator('texts') def validate_texts(cls, v): for i, text in enumerate(v): if not text.strip(): raise ValueError(f'Text at index {i} cannot be empty') if len(text) > 10000: raise ValueError(f'Text at index {i} is too long (max 10000 chars)') return [text.strip() for text in v] class TextResult(BaseModel): text: str sentiment: str confidence: float token_count: int processing_order: int class BatchResult(BaseModel): results: List[TextResult] total_texts: int processing_time: float average_time_per_text: float batch_info: dict 
performance_metrics: dict # === CUSTOM IMAGE === image = ( Image(username="myuser", name="batch-processor", tag="1.0") .from_base("nvidia/cuda:12.2-runtime-ubuntu22.04") .with_python("3.11") .run_command("pip install torch==2.1.0 transformers==4.30.0 accelerate==0.20.0 numpy>=1.24.0") .with_env("TRANSFORMERS_CACHE", "/app/models") .with_env("TOKENIZERS_PARALLELISM", "false") # Avoid warnings ) # === CHUTE DEFINITION === chute = Chute( username="myuser", name="batch-processor", image=image, tagline="High-throughput batch text processing", readme=""" # Batch Text Processor Efficiently process multiple texts in a single request with optimized GPU utilization. ## Usage ```bash curl -X POST https://myuser-batch-processor.chutes.ai/process-batch \\ -H "Content-Type: application/json" \\ -d '{ "texts": [ "I love this product!", "This is terrible quality.", "Amazing service and support!" ], "batch_size": 8 }' ``` ## Features - Process up to 100 texts per request - Automatic batching for GPU optimization - Dynamic padding for efficient processing - Comprehensive performance metrics """, node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=12 ), concurrency=4 # Allow multiple concurrent requests ) # === MODEL LOADING === @chute.on_startup() async def load_model(self): """Load sentiment analysis model optimized for batch processing.""" print("Loading model for batch processing...") model_name = "cardiffnlp/twitter-roberta-base-sentiment-latest" # Load tokenizer and model self.tokenizer = AutoTokenizer.from_pretrained(model_name) self.model = AutoModelForSequenceClassification.from_pretrained(model_name) # Optimize for batch processing self.device = "cuda" if torch.cuda.is_available() else "cpu" self.model.to(self.device) self.model.eval() # Enable optimizations if torch.cuda.is_available(): # Enable mixed precision for faster processing self.scaler = torch.cuda.amp.GradScaler() # Enable TensorCore optimizations where available torch.backends.cudnn.benchmark = True # 
Cache for performance tracking self.batch_stats = { "total_requests": 0, "total_texts_processed": 0, "average_batch_time": 0.0, "peak_batch_size": 0 } print(f"Model loaded on {self.device} with batch optimizations enabled") # === BATCH PROCESSING ENDPOINTS === @chute.cord( public_api_path="/process-batch", method="POST", input_schema=BatchTextInput, output_content_type="application/json" ) async def process_batch(self, data: BatchTextInput) -> BatchResult: """Process multiple texts efficiently with batching.""" start_time = time.time() # Update statistics self.batch_stats["total_requests"] += 1 self.batch_stats["total_texts_processed"] += len(data.texts) try: # Process in chunks if batch is too large all_results = [] total_batches = 0 for chunk_start in range(0, len(data.texts), data.batch_size): chunk_end = min(chunk_start + data.batch_size, len(data.texts)) text_chunk = data.texts[chunk_start:chunk_end] # Process this chunk chunk_results = await self._process_chunk( text_chunk, data.max_length, chunk_start ) all_results.extend(chunk_results) total_batches += 1 # Calculate performance metrics processing_time = time.time() - start_time avg_time_per_text = processing_time / len(data.texts) # Update global stats self.batch_stats["average_batch_time"] = ( (self.batch_stats["average_batch_time"] * (self.batch_stats["total_requests"] - 1) + processing_time) / self.batch_stats["total_requests"] ) self.batch_stats["peak_batch_size"] = max( self.batch_stats["peak_batch_size"], len(data.texts) ) return BatchResult( results=all_results, total_texts=len(data.texts), processing_time=processing_time, average_time_per_text=avg_time_per_text, batch_info={ "requested_batch_size": data.batch_size, "actual_batches_used": total_batches, "max_length": data.max_length, "device": self.device }, performance_metrics={ "texts_per_second": len(data.texts) / processing_time, "gpu_memory_used": self._get_gpu_memory_usage(), "total_tokens_processed": sum(r.token_count for r in all_results) } ) 
except Exception as e: raise HTTPException(status_code=500, detail=f"Batch processing failed: {str(e)}") async def _process_chunk(self, texts: List[str], max_length: int, start_index: int) -> List[TextResult]: """Process a chunk of texts efficiently.""" # Tokenize all texts in the chunk encoded = self.tokenizer( texts, padding=True, truncation=True, max_length=max_length, return_tensors="pt" ) # Move to device input_ids = encoded['input_ids'].to(self.device) attention_mask = encoded['attention_mask'].to(self.device) # Process with mixed precision if available with torch.no_grad(): if torch.cuda.is_available(): with torch.cuda.amp.autocast(): outputs = self.model(input_ids=input_ids, attention_mask=attention_mask) else: outputs = self.model(input_ids=input_ids, attention_mask=attention_mask) # Get predictions predictions = torch.nn.functional.softmax(outputs.logits, dim=-1) predicted_classes = predictions.argmax(dim=-1) confidences = predictions.max(dim=-1).values # Convert to results labels = ["NEGATIVE", "NEUTRAL", "POSITIVE"] results = [] for i, (text, pred_class, confidence, tokens) in enumerate( zip(texts, predicted_classes, confidences, input_ids) ): results.append(TextResult( text=text, sentiment=labels[pred_class.item()], confidence=confidence.item(), token_count=tokens.ne(self.tokenizer.pad_token_id).sum().item(), processing_order=start_index + i )) return results def _get_gpu_memory_usage(self) -> Optional[float]: """Get current GPU memory usage in GB.""" if torch.cuda.is_available(): return torch.cuda.memory_allocated() / 1024**3 return None @chute.cord( public_api_path="/batch-stats", method="GET", output_content_type="application/json" ) async def get_batch_stats(self) -> dict: """Get performance statistics for batch processing.""" stats = self.batch_stats.copy() # Add current system info stats.update({ "device": self.device, "model_loaded": hasattr(self, 'model'), "current_gpu_memory": self._get_gpu_memory_usage(), "max_gpu_memory":
torch.cuda.max_memory_allocated() / 1024**3 if torch.cuda.is_available() else None }) return stats # === STREAMING BATCH PROCESSING === @chute.cord( public_api_path="/process-batch-stream", method="POST", input_schema=BatchTextInput, stream=True, output_content_type="application/json" ) async def process_batch_stream(self, data: BatchTextInput): """Process batch with streaming progress updates.""" start_time = time.time() yield { "status": "started", "total_texts": len(data.texts), "batch_size": data.batch_size, "estimated_batches": (len(data.texts) + data.batch_size - 1) // data.batch_size } all_results = [] for batch_idx, chunk_start in enumerate(range(0, len(data.texts), data.batch_size)): chunk_end = min(chunk_start + data.batch_size, len(data.texts)) text_chunk = data.texts[chunk_start:chunk_end] yield { "status": "processing_batch", "batch_number": batch_idx + 1, "batch_size": len(text_chunk), "progress": chunk_end / len(data.texts) } # Process chunk batch_start = time.time() chunk_results = await self._process_chunk(text_chunk, data.max_length, chunk_start) batch_time = time.time() - batch_start all_results.extend(chunk_results) yield { "status": "batch_complete", "batch_number": batch_idx + 1, "batch_time": batch_time, "texts_per_second": len(text_chunk) / batch_time, "partial_results": chunk_results } # Final results total_time = time.time() - start_time yield { "status": "completed", "total_time": total_time, "average_time_per_text": total_time / len(data.texts), "final_results": all_results } # Test locally if __name__ == "__main__": import asyncio async def test_batch_processing(): # Simulate startup await load_model(chute) # Test batch test_texts = [ "I love this product!", "Terrible quality, very disappointed.", "Pretty good, would recommend.", "Outstanding service and delivery!", "Not worth the money spent.", "Amazing features and great design!"
] test_input = BatchTextInput( texts=test_texts, batch_size=3 ) result = await process_batch(chute, test_input) print(f"Processed {result.total_texts} texts in {result.processing_time:.2f}s") print(f"Average time per text: {result.average_time_per_text:.3f}s") for r in result.results: print(f"'{r.text[:30]}...' -> {r.sentiment} ({r.confidence:.2f})") asyncio.run(test_batch_processing()) ```` ## Performance Optimization Techniques ### 1. **Dynamic Batching** ```python # Automatically adjust batch size based on text lengths def optimize_batch_size(texts: List[str], max_tokens: int = 8192) -> int: avg_length = sum(len(text.split()) for text in texts) / len(texts) estimated_tokens_per_text = avg_length * 1.3 # Account for subword tokenization optimal_batch_size = max(1, int(max_tokens / estimated_tokens_per_text)) return min(optimal_batch_size, 32) # Cap at 32 for memory safety ``` ### 2. **Memory-Efficient Processing** ```python # Process very large batches in chunks async def process_large_batch(self, texts: List[str], chunk_size: int = 50): results = [] for i in range(0, len(texts), chunk_size): chunk = texts[i:i + chunk_size] chunk_results = await self._process_chunk(chunk, 512, i) results.extend(chunk_results) # Clear GPU cache between chunks if torch.cuda.is_available(): torch.cuda.empty_cache() return results ``` ### 3. **Mixed Precision Training** ```python # Use automatic mixed precision for faster processing with torch.cuda.amp.autocast(): outputs = self.model(input_ids=input_ids, attention_mask=attention_mask) ``` ## Testing the Batch API ### Simple Batch Test ```python import requests import time # Prepare test data texts = [ "I absolutely love this new product!", "Worst purchase I've ever made.", "It's okay, nothing special.", "Fantastic quality and great service!", "Complete waste of money.", "Highly recommend to everyone!", "Poor customer support experience.", "Exceeded all my expectations!", "Not worth the high price.", "Perfect for my needs!" 
] # Test different batch sizes for batch_size in [2, 5, 10]: print(f"\nTesting batch size: {batch_size}") start_time = time.time() response = requests.post( "https://myuser-batch-processor.chutes.ai/process-batch", json={ "texts": texts, "batch_size": batch_size, "max_length": 256 } ) result = response.json() print(f"Total time: {result['processing_time']:.2f}s") print(f"Texts/second: {result['performance_metrics']['texts_per_second']:.1f}") print(f"Avg time per text: {result['average_time_per_text']:.3f}s") ``` ### Performance Comparison ```python import asyncio import aiohttp import time async def compare_batch_vs_individual(): """Compare batch processing vs individual requests.""" texts = ["Sample text for testing"] * 20 # Test individual requests start_time = time.time() individual_results = [] async with aiohttp.ClientSession() as session: tasks = [] for text in texts: task = session.post( "https://myuser-batch-processor.chutes.ai/analyze-single", json={"text": text} ) tasks.append(task) responses = await asyncio.gather(*tasks) for resp in responses: result = await resp.json() individual_results.append(result) individual_time = time.time() - start_time # Test batch processing start_time = time.time() async with aiohttp.ClientSession() as session: async with session.post( "https://myuser-batch-processor.chutes.ai/process-batch", json={"texts": texts, "batch_size": 10} ) as resp: batch_result = await resp.json() batch_time = time.time() - start_time print(f"Individual requests: {individual_time:.2f}s") print(f"Batch processing: {batch_time:.2f}s") print(f"Speedup: {individual_time / batch_time:.1f}x") asyncio.run(compare_batch_vs_individual()) ``` ### Streaming Batch Processing ```python import asyncio import aiohttp import json async def test_streaming_batch(): """Test streaming batch processing with progress updates.""" texts = [f"Test message number {i} for batch processing" for i in range(25)] async with aiohttp.ClientSession() as session: async with 
session.post( "https://myuser-batch-processor.chutes.ai/process-batch-stream", json={"texts": texts, "batch_size": 5} ) as response: async for line in response.content: if line: try: data = json.loads(line.decode()) if data['status'] == 'processing_batch': print(f"Processing batch {data['batch_number']} ({data['progress']:.1%} complete)") elif data['status'] == 'batch_complete': print(f"Batch {data['batch_number']} completed in {data['batch_time']:.2f}s") elif data['status'] == 'completed': print(f"All processing completed in {data['total_time']:.2f}s") except json.JSONDecodeError: continue asyncio.run(test_streaming_batch()) ``` ## Key Performance Concepts ### 1. **Batch Size Optimization** ```python # Find optimal batch size for your hardware def find_optimal_batch_size(model, tokenizer, device, max_length=512): batch_sizes = [1, 2, 4, 8, 16, 32] test_texts = ["Sample text for testing"] * 32 best_throughput = 0 best_batch_size = 1 for batch_size in batch_sizes: try: start_time = time.time() # Test processing for i in range(0, len(test_texts), batch_size): batch = test_texts[i:i + batch_size] encoded = tokenizer(batch, padding=True, truncation=True, max_length=max_length, return_tensors="pt") with torch.no_grad(): _ = model(**encoded.to(device)) total_time = time.time() - start_time throughput = len(test_texts) / total_time if throughput > best_throughput: best_throughput = throughput best_batch_size = batch_size except RuntimeError as e: if "out of memory" in str(e): break return best_batch_size, best_throughput ``` ### 2. 
**Memory Management** ```python # Monitor and manage GPU memory def manage_gpu_memory(): if torch.cuda.is_available(): # Clear cache between large batches torch.cuda.empty_cache() # Get memory usage allocated = torch.cuda.memory_allocated() / 1024**3 cached = torch.cuda.memory_reserved() / 1024**3 print(f"GPU Memory - Allocated: {allocated:.2f}GB, Cached: {cached:.2f}GB") # Set memory fraction if needed torch.cuda.set_per_process_memory_fraction(0.8) ``` ### 3. **Padding Optimization** ```python # Minimize padding for better efficiency def optimize_padding(texts, tokenizer, max_length, batch_size=16): # Sort by length to minimize padding text_lengths = [(len(text), i, text) for i, text in enumerate(texts)] text_lengths.sort() batches = [] current_batch = [] for length, original_idx, text in text_lengths: current_batch.append((original_idx, text)) # Create batch when we have enough similar-length texts if len(current_batch) >= batch_size: batches.append(current_batch) current_batch = [] if current_batch: batches.append(current_batch) return batches ``` ## Common Batch Processing Patterns ### 1. **Classification Tasks** ```python # Sentiment analysis batch processing async def batch_sentiment_analysis(texts: List[str]) -> List[dict]: results = [] batch_size = 16 for i in range(0, len(texts), batch_size): batch = texts[i:i + batch_size] batch_results = await process_sentiment_batch(batch) results.extend(batch_results) return results ``` ### 2.
**Text Generation** ```python # Batch text generation with different prompts async def batch_text_generation(prompts: List[str], batch_size: int = 8) -> List[str]: generated_texts = [] # Process prompts in batches for batch_start in range(0, len(prompts), batch_size): batch_prompts = prompts[batch_start:batch_start + batch_size] # Generate for batch batch_outputs = model.generate( **tokenizer(batch_prompts, return_tensors="pt", padding=True), max_length=100, num_return_sequences=1 ) batch_texts = tokenizer.batch_decode(batch_outputs, skip_special_tokens=True) generated_texts.extend(batch_texts) return generated_texts ``` ### 3. **Embedding Generation** ```python # Batch embedding generation async def batch_embeddings(texts: List[str], batch_size: int = 16) -> List[List[float]]: embeddings = [] for i in range(0, len(texts), batch_size): batch = texts[i:i + batch_size] # Tokenize batch encoded = tokenizer(batch, padding=True, truncation=True, return_tensors="pt") # Generate embeddings with torch.no_grad(): outputs = model(**encoded.to(device)) batch_embeddings = outputs.last_hidden_state.mean(dim=1) embeddings.extend(batch_embeddings.cpu().tolist()) return embeddings ``` ## Next Steps - **[Multi-Model Analysis](/docs/examples/multi-model-analysis)** - Combine multiple AI models - **[Performance Optimization](/docs/guides/performance)** - Advanced speed optimization - **[Production Deployment](/docs/guides/production)** - Scale to production workloads - **[Cost Optimization](/docs/guides/cost-optimization)** - Manage processing costs --- ## SOURCE: https://chutes.ai/docs/examples/custom-chute-complete # Complete Text Analysis Service This guide demonstrates building a comprehensive text analysis service that combines multiple AI models for sentiment analysis, entity recognition, text classification, and content moderation. 
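The sections that follow define the full request and response schema. As a quick preview, the payload a client sends and the per-text results it receives can be handled with plain dictionaries; the helper names below are illustrative sketches, not part of the Chutes SDK:

```python
from typing import Any, Dict, List, Sequence


def build_analysis_request(texts: Sequence[str],
                           analysis_types: Sequence[str] = ("all",)) -> Dict[str, Any]:
    """Shape a request body matching the InputArgs schema defined in this guide."""
    return {
        "texts": [{"text": t, "id": f"text_{i}"} for i, t in enumerate(texts)],
        "analysis_types": list(analysis_types),
        "include_confidence": True,
    }


def summarize_results(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten per-text analysis results into a compact summary."""
    return [
        {
            "id": r.get("text_id"),
            "sentiment": (r.get("sentiment") or {}).get("label"),
            "flagged": (r.get("moderation") or {}).get("is_inappropriate", False),
        }
        for r in results
    ]


if __name__ == "__main__":
    payload = build_analysis_request(["I love this!", "This is awful."])
    print(payload["texts"][0])  # {'text': 'I love this!', 'id': 'text_0'}
```

Keeping request construction and result post-processing in small pure functions like these makes them easy to unit-test without a deployed chute.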
## Overview This complete example showcases: - **Multi-model Architecture**: Combining different AI models in a single service - **Sentiment Analysis**: Understanding emotional tone of text - **Named Entity Recognition**: Extracting people, places, organizations - **Text Classification**: Categorizing content by topic or intent - **Content Moderation**: Detecting inappropriate or harmful content - **Batch Processing**: Handling multiple texts efficiently - **Error Handling**: Robust error management across models - **Monitoring**: Built-in metrics and health checks - **Caching**: Performance optimization for repeated queries ## Complete Implementation ### Input Schema Design Define comprehensive input validation for text analysis: ```python from pydantic import BaseModel, Field from typing import Optional, List, Dict, Any from enum import Enum class AnalysisType(str, Enum): SENTIMENT = "sentiment" ENTITIES = "entities" CLASSIFICATION = "classification" MODERATION = "moderation" ALL = "all" class TextInput(BaseModel): text: str = Field(..., min_length=1, max_length=10000) id: Optional[str] = Field(None, description="Optional identifier for tracking") metadata: Optional[Dict[str, Any]] = Field(default_factory=dict) class InputArgs(BaseModel): texts: List[TextInput] = Field(..., min_items=1, max_items=100) analysis_types: List[AnalysisType] = Field(default=[AnalysisType.ALL]) include_confidence: bool = Field(default=True) language: Optional[str] = Field(default="en", description="ISO language code") ``` ### Custom Image with Multiple Models Build a comprehensive image with all required AI models: ```python from chutes.image import Image from chutes.chute import Chute, NodeSelector image = ( Image( username="myuser", name="text-analysis-complete", tag="1.0.0", python_version="3.11" ) .run_command("pip install transformers==4.35.0 torch==2.1.0 spacy==3.7.2 scikit-learn==1.3.0 numpy==1.24.3 pandas==2.0.3 redis==5.0.0 prometheus-client==0.18.0") .run_command("python -m 
spacy download en_core_web_sm") .run_command("python -m spacy download en_core_web_lg") .add("./models", "/app/models") .add("./config", "/app/config") ) ``` ### Multi-Model Service Implementation Create a comprehensive service that orchestrates multiple AI models: ```python import asyncio import json import time from typing import Dict, List, Any, Optional from dataclasses import dataclass from datetime import datetime import logging import torch import spacy import redis from transformers import ( AutoTokenizer, AutoModelForSequenceClassification, pipeline ) from prometheus_client import Counter, Histogram, start_http_server import numpy as np # Metrics REQUEST_COUNT = Counter('analysis_requests_total', 'Total analysis requests', ['type']) REQUEST_DURATION = Histogram('analysis_duration_seconds', 'Request duration', ['type']) ERROR_COUNT = Counter('analysis_errors_total', 'Total errors', ['type', 'error']) @dataclass class AnalysisResult: text_id: Optional[str] sentiment: Optional[Dict[str, Any]] = None entities: Optional[List[Dict[str, Any]]] = None classification: Optional[Dict[str, Any]] = None moderation: Optional[Dict[str, Any]] = None processing_time_ms: Optional[float] = None metadata: Optional[Dict[str, Any]] = None class TextAnalysisService: def __init__(self, cache_enabled: bool = True): self.logger = logging.getLogger(__name__) self.cache_enabled = cache_enabled # Initialize Redis cache if cache_enabled: try: self.cache = redis.Redis(host='localhost', port=6379, db=0) self.cache.ping() self.logger.info("Cache connection established") except Exception as e: self.logger.warning(f"Cache disabled: {e}") self.cache_enabled = False # Load models self._load_models() # Start metrics server start_http_server(8001) self.logger.info("Metrics server started on port 8001") def _load_models(self): """Load all AI models with proper error handling""" self.logger.info("Loading AI models...") try: # Sentiment Analysis Model self.sentiment_tokenizer = 
AutoTokenizer.from_pretrained( "cardiffnlp/twitter-roberta-base-sentiment-latest" ) self.sentiment_model = AutoModelForSequenceClassification.from_pretrained( "cardiffnlp/twitter-roberta-base-sentiment-latest" ) self.logger.info("✓ Sentiment model loaded") # Text Classification Model (zero-shot, so it accepts candidate labels) self.classifier = pipeline( "zero-shot-classification", model="facebook/bart-large-mnli", device=0 if torch.cuda.is_available() else -1 ) self.logger.info("✓ Classification model loaded") # Content Moderation Model self.moderation_pipeline = pipeline( "text-classification", model="unitary/toxic-bert", device=0 if torch.cuda.is_available() else -1 ) self.logger.info("✓ Moderation model loaded") # Named Entity Recognition self.nlp = spacy.load("en_core_web_lg") self.logger.info("✓ NER model loaded") except Exception as e: self.logger.error(f"Failed to load models: {e}") raise def _get_cache_key(self, text: str, analysis_type: str) -> str: """Generate cache key for text and analysis type""" import hashlib text_hash = hashlib.md5(text.encode()).hexdigest() return f"analysis:{analysis_type}:{text_hash}" def _get_cached_result(self, cache_key: str) -> Optional[Dict]: """Retrieve cached analysis result""" if not self.cache_enabled: return None try: cached = self.cache.get(cache_key) if cached: return json.loads(cached) except Exception as e: self.logger.warning(f"Cache read error: {e}") return None def _cache_result(self, cache_key: str, result: Dict, ttl: int = 3600): """Cache analysis result with TTL""" if not self.cache_enabled: return try: self.cache.setex( cache_key, ttl, json.dumps(result, default=str) ) except Exception as e: self.logger.warning(f"Cache write error: {e}") async def analyze_sentiment(self, text: str) -> Dict[str, Any]: """Perform sentiment analysis with caching""" cache_key = self._get_cache_key(text, "sentiment") cached = self._get_cached_result(cache_key) if cached: return cached with REQUEST_DURATION.labels(type='sentiment').time(): try: inputs = self.sentiment_tokenizer( 
text, return_tensors="pt", truncation=True, max_length=512 ) with torch.no_grad(): outputs = self.sentiment_model(**inputs) predictions = torch.nn.functional.softmax(outputs.logits, dim=-1) labels = ['negative', 'neutral', 'positive'] scores = predictions[0].tolist() result = { 'label': labels[np.argmax(scores)], 'confidence': float(max(scores)), 'scores': {label: float(score) for label, score in zip(labels, scores)} } self._cache_result(cache_key, result) REQUEST_COUNT.labels(type='sentiment').inc() return result except Exception as e: ERROR_COUNT.labels(type='sentiment', error=type(e).__name__).inc() raise Exception(f"Sentiment analysis failed: {e}") async def extract_entities(self, text: str) -> List[Dict[str, Any]]: """Extract named entities with caching (note: spaCy's NER does not expose per-entity confidence scores)""" cache_key = self._get_cache_key(text, "entities") cached = self._get_cached_result(cache_key) if cached: return cached with REQUEST_DURATION.labels(type='entities').time(): try: doc = self.nlp(text) entities = [] for ent in doc.ents: entities.append({ 'text': ent.text, 'label': ent.label_, 'description': spacy.explain(ent.label_), 'start': ent.start_char, 'end': ent.end_char }) self._cache_result(cache_key, entities) REQUEST_COUNT.labels(type='entities').inc() return entities except Exception as e: ERROR_COUNT.labels(type='entities', error=type(e).__name__).inc() raise Exception(f"Entity extraction failed: {e}") async def classify_text(self, text: str, categories: List[str] = None) -> Dict[str, Any]: """Classify text into categories""" if categories is None: categories = [ "technology", "business", "health", "sports", "entertainment", "politics", "science", "education" ] cache_key = self._get_cache_key(f"{text}:{','.join(categories)}", "classification") cached = self._get_cached_result(cache_key) if cached: return cached with REQUEST_DURATION.labels(type='classification').time(): try: # Use zero-shot classification candidate_labels = categories result = 
self.classifier(text, candidate_labels) classification_result = { 'predicted_category': result['labels'][0], 'confidence': float(result['scores'][0]), 'all_scores': { label: float(score) for label, score in zip(result['labels'], result['scores']) } } self._cache_result(cache_key, classification_result) REQUEST_COUNT.labels(type='classification').inc() return classification_result except Exception as e: ERROR_COUNT.labels(type='classification', error=type(e).__name__).inc() raise Exception(f"Text classification failed: {e}") async def moderate_content(self, text: str) -> Dict[str, Any]: """Detect inappropriate content""" cache_key = self._get_cache_key(text, "moderation") cached = self._get_cached_result(cache_key) if cached: return cached with REQUEST_DURATION.labels(type='moderation').time(): try: result = self.moderation_pipeline(text, top_k=None) # Process toxicity detection result is_toxic = any(item['label'] == 'toxic' and item['score'] > 0.7 for item in result) max_toxicity_score = max((item['score'] for item in result if item['label'] == 'toxic'), default=0.0) moderation_result = { 'is_inappropriate': is_toxic, 'toxicity_score': float(max_toxicity_score), 'categories': result, 'action_required': is_toxic } self._cache_result(cache_key, moderation_result) REQUEST_COUNT.labels(type='moderation').inc() return moderation_result except Exception as e: ERROR_COUNT.labels(type='moderation', error=type(e).__name__).inc() raise Exception(f"Content moderation failed: {e}") async def analyze_single_text( self, text_input: TextInput, analysis_types: List[AnalysisType], include_confidence: bool = True ) -> AnalysisResult: """Analyze a single text with specified analysis types""" start_time = time.time() result = AnalysisResult(text_id=text_input.id, metadata=text_input.metadata) try: # Determine which analyses to run run_all = AnalysisType.ALL in analysis_types tasks = [] if run_all or AnalysisType.SENTIMENT in analysis_types: tasks.append(("sentiment", 
self.analyze_sentiment(text_input.text))) if run_all or AnalysisType.ENTITIES in analysis_types: tasks.append(("entities", self.extract_entities(text_input.text))) if run_all or AnalysisType.CLASSIFICATION in analysis_types: tasks.append(("classification", self.classify_text(text_input.text))) if run_all or AnalysisType.MODERATION in analysis_types: tasks.append(("moderation", self.moderate_content(text_input.text))) # Run analyses concurrently if tasks: task_names, task_coroutines = zip(*tasks) results = await asyncio.gather(*task_coroutines, return_exceptions=True) for name, task_result in zip(task_names, results): if isinstance(task_result, Exception): self.logger.error(f"Analysis {name} failed: {task_result}") else: setattr(result, name, task_result) result.processing_time_ms = (time.time() - start_time) * 1000 return result except Exception as e: self.logger.error(f"Text analysis failed: {e}") result.processing_time_ms = (time.time() - start_time) * 1000 raise Exception(f"Analysis failed: {e}") async def analyze_batch(self, inputs: InputArgs) -> List[AnalysisResult]: """Analyze multiple texts concurrently""" self.logger.info(f"Processing batch of {len(inputs.texts)} texts") # Process texts concurrently with controlled concurrency semaphore = asyncio.Semaphore(10) # Limit concurrent analyses async def analyze_with_semaphore(text_input): async with semaphore: return await self.analyze_single_text( text_input, inputs.analysis_types, inputs.include_confidence ) tasks = [analyze_with_semaphore(text_input) for text_input in inputs.texts] results = await asyncio.gather(*tasks, return_exceptions=True) # Convert exceptions to error results final_results = [] for i, result in enumerate(results): if isinstance(result, Exception): error_result = AnalysisResult( text_id=inputs.texts[i].id, metadata={"error": str(result)} ) final_results.append(error_result) else: final_results.append(result) return final_results # Global service instance service = None def get_service() -> 
TextAnalysisService: """Get or create the global service instance""" global service if service is None: service = TextAnalysisService() return service async def run(inputs: InputArgs) -> List[Dict[str, Any]]: """Main entry point for the chute""" analysis_service = get_service() try: results = await analysis_service.analyze_batch(inputs) # Convert results to serializable format output = [] for result in results: result_dict = { 'text_id': result.text_id, 'processing_time_ms': result.processing_time_ms, 'metadata': result.metadata } if result.sentiment: result_dict['sentiment'] = result.sentiment if result.entities: result_dict['entities'] = result.entities if result.classification: result_dict['classification'] = result.classification if result.moderation: result_dict['moderation'] = result.moderation output.append(result_dict) return output except Exception as e: logging.error(f"Batch processing failed: {e}") raise Exception(f"Analysis service error: {e}") ``` ### Creating the Complete Chute Deploy the comprehensive text analysis service: ```python from chutes.chute import Chute, NodeSelector # Create the complete text analysis chute chute = Chute( username="myuser", name="text-analysis-complete", image=image, entry_file="analysis_service.py", entry_point="run", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16), timeout_seconds=300, concurrency=5 ) # Deploy the service print("Deploying comprehensive text analysis service...") # Use the CLI to deploy: # chutes deploy analysis_service:chute print("✅ Service deployed! (Use `chutes deploy` CLI command)") ``` ## Usage Examples ### Basic Text Analysis ```python # Analyze a single text with all models response = chute.run({ "texts": [ { "text": "I absolutely love this new AI technology! 
It's revolutionary and will change everything.", "id": "text_1" } ], "analysis_types": ["all"], "include_confidence": True }) # Response includes all analysis types result = response[0] print(f"Sentiment: {result['sentiment']['label']} ({result['sentiment']['confidence']:.2f})") print(f"Category: {result['classification']['predicted_category']}") print(f"Entities: {[ent['text'] for ent in result['entities']]}") print(f"Content Safe: {not result['moderation']['is_inappropriate']}") ``` ### Batch Processing ```python # Analyze multiple texts efficiently texts = [ {"text": "This product is amazing!", "id": "review_1"}, {"text": "The service was terrible and slow.", "id": "review_2"}, {"text": "Apple Inc. reported strong quarterly earnings.", "id": "news_1"}, {"text": "The new iPhone features advanced AI capabilities.", "id": "tech_1"} ] response = chute.run({ "texts": texts, "analysis_types": ["sentiment", "entities", "classification"], "include_confidence": True }) # Process results for result in response: print(f"\nText ID: {result['text_id']}") print(f"Processing time: {result['processing_time_ms']:.2f}ms") if 'sentiment' in result: print(f"Sentiment: {result['sentiment']['label']}") if 'entities' in result: print(f"Entities: {[ent['text'] for ent in result['entities']]}") ``` ### Selective Analysis ```python # Run only specific analysis types response = chute.run({ "texts": [ {"text": "Breaking: Tech giant announces major acquisition", "id": "headline_1"} ], "analysis_types": ["entities", "classification"], # Only NER and classification "include_confidence": True }) ``` ### Content Moderation Focus ```python # Focus on content safety user_comments = [ {"text": "This is a great discussion!", "id": "comment_1"}, {"text": "I disagree but respect your opinion.", "id": "comment_2"}, {"text": "This platform needs better moderation.", "id": "comment_3"} ] response = chute.run({ "texts": user_comments, "analysis_types": ["moderation", "sentiment"], "include_confidence": 
True }) # Filter inappropriate content safe_comments = [ result for result in response if not result['moderation']['is_inappropriate'] ] ``` ## Performance Optimization ### Caching Strategy The service implements intelligent caching: - **Redis-based caching** for repeated text analyses - **1-hour TTL** for cached results - **Cache keys** based on text content and analysis type - **Graceful degradation** when cache is unavailable ### Concurrent Processing - **Semaphore-controlled concurrency** (max 10 concurrent analyses) - **Async/await patterns** for non-blocking operations - **Batch processing** for multiple texts - **Error isolation** prevents single failures from affecting the batch ### Resource Management ```python # Optimized node selection for production chute = Chute( username="myuser", name="text-analysis-production", image=image, entry_file="analysis_service.py", entry_point="run", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24, # Larger VRAM for complex models preferred_provider="runpod" # Specify provider if needed ), timeout_seconds=600, # Longer timeout for large batches concurrency=10 # Higher concurrency for production ) ``` ## Monitoring and Observability ### Built-in Metrics The service exposes Prometheus metrics on port 8001: - `analysis_requests_total` - Total requests by analysis type - `analysis_duration_seconds` - Request duration histograms - `analysis_errors_total` - Error counts by type ### Health Checks ```python # Health check endpoint async def health_check(): service = get_service() # Test all models with sample text test_text = "Hello world" try: await service.analyze_sentiment(test_text) await service.extract_entities(test_text) await service.classify_text(test_text) await service.moderate_content(test_text) return {"status": "healthy", "timestamp": datetime.now().isoformat()} except Exception as e: return {"status": "unhealthy", "error": str(e)} ``` ### Logging Configuration ```python import 
logging # Configure structured logging logging.basicConfig( level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', handlers=[ logging.StreamHandler(), logging.FileHandler('/app/logs/analysis.log') ] ) ``` ## Error Handling and Recovery ### Graceful Degradation ```python async def robust_analysis(text_input: TextInput) -> AnalysisResult: """Analysis with fallback strategies""" result = AnalysisResult(text_id=text_input.id) # Try sentiment analysis with fallback try: result.sentiment = await analyze_sentiment(text_input.text) except Exception as e: result.sentiment = {"error": "Sentiment analysis unavailable", "fallback": True} logger.warning(f"Sentiment analysis failed: {e}") # Continue with other analyses even if one fails try: result.entities = await extract_entities(text_input.text) except Exception as e: result.entities = [] logger.warning(f"Entity extraction failed: {e}") return result ``` ### Circuit Breaker Pattern ```python class CircuitBreaker: def __init__(self, failure_threshold=5, timeout=60): self.failure_threshold = failure_threshold self.timeout = timeout self.failure_count = 0 self.last_failure_time = None self.state = "CLOSED" # CLOSED, OPEN, HALF_OPEN async def call(self, func, *args, **kwargs): if self.state == "OPEN": if time.time() - self.last_failure_time > self.timeout: self.state = "HALF_OPEN" else: raise Exception("Circuit breaker is OPEN") try: result = await func(*args, **kwargs) if self.state == "HALF_OPEN": self.state = "CLOSED" self.failure_count = 0 return result except Exception as e: self.failure_count += 1 self.last_failure_time = time.time() if self.failure_count >= self.failure_threshold: self.state = "OPEN" raise e ``` ## Advanced Features ### Custom Model Integration ```python # Add custom models to the service class CustomTextAnalysisService(TextAnalysisService): def _load_models(self): super()._load_models() # Load custom domain-specific model self.custom_classifier = pipeline( 
"text-classification", model="/app/models/custom-domain-classifier", device=0 if torch.cuda.is_available() else -1 ) async def custom_classification(self, text: str) -> Dict[str, Any]: """Domain-specific classification""" result = self.custom_classifier(text) return { 'custom_category': result[0]['label'], 'confidence': result[0]['score'] } ``` ### Multi-language Support ```python # Language detection and processing from langdetect import detect async def analyze_multilingual_text(self, text: str, language: str = None) -> Dict: """Analyze text with language-specific models""" # Auto-detect language if not provided if language is None: language = detect(text) # Load language-specific models if language == "es": nlp = spacy.load("es_core_news_sm") elif language == "fr": nlp = spacy.load("fr_core_news_sm") else: nlp = self.nlp # Default English model # Process with appropriate model doc = nlp(text) return self._extract_entities_from_doc(doc) ``` ## Deployment Best Practices ### Production Configuration ```python # Production-ready deployment production_chute = Chute( username="mycompany", name="text-analysis-prod", image=image, entry_file="analysis_service.py", entry_point="run", node_selector=NodeSelector( gpu_count=2, min_vram_gb_per_gpu=24preferred_provider="runpod", instance_type="RTX A6000" ), environment={ "REDIS_URL": "redis://cache.example.com:6379", "LOG_LEVEL": "INFO", "CACHE_TTL": "3600", "MAX_BATCH_SIZE": "100" }, timeout_seconds=900, concurrency=20, auto_scale=True, min_instances=2, max_instances=10 ) ``` ### Cost Optimization ```python # Cost-optimized configuration for development dev_chute = Chute( username="myuser", name="text-analysis-dev", image=image, entry_file="analysis_service.py", entry_point="run", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=8), timeout_seconds=300, concurrency=3, auto_scale=False ) ``` This comprehensive example demonstrates how to build a production-ready text analysis service that combines multiple AI 
models, implements proper error handling, includes monitoring and caching, and provides a robust API for various text analysis tasks. --- ## SOURCE: https://chutes.ai/docs/examples/custom-images # Custom Docker Images for Chutes This guide demonstrates how to build custom Docker images for specialized use cases and advanced configurations in your Chutes applications. ## Overview Custom images allow you to: - **Pre-install Dependencies**: Include specific libraries, models, or tools - **Optimize Performance**: Use custom Python versions or optimized libraries - **Add System Tools**: Include CLI tools, databases, or other services - **Custom Base Images**: Start from specialized base images (CUDA, Ubuntu, etc.) - **Security Hardening**: Apply security configurations and patches ## Quick Examples ### Basic Custom Image ```python from chutes.image import Image # Simple custom image with additional packages image = ( Image( username="myuser", name="custom-nlp", tag="1.0.0", python_version="3.11" ) .run_command("pip install transformers==4.35.0 torch==2.1.0 spacy==3.7.2") .run_command("python -m spacy download en_core_web_sm") ) ``` ### GPU-Optimized Image ```python from chutes.image import Image # CUDA-optimized image for deep learning image = ( Image( username="myuser", name="gpu-ml", tag="cuda-12.1", base_image="nvidia/cuda:12.1-devel-ubuntu22.04", python_version="3.11" ) .run_command("apt-get update && apt-get install -y git wget") .run_command("pip install torch==2.1.0+cu121 torchvision==0.16.0+cu121 torchaudio==2.1.0+cu121 --extra-index-url https://download.pytorch.org/whl/cu121") .run_command("pip install transformers>=4.35.0 accelerate>=0.24.0 bitsandbytes>=0.41.0") ) ``` ## Advanced Configurations ### Multi-Stage Build ```python from chutes.image import Image # Multi-stage build for smaller final image image = ( Image( username="myuser", name="optimized-app", tag="slim", python_version="3.11" ) # Build stage - install build dependencies .run_command(""" apt-get 
update && apt-get install -y \\ build-essential \\ cmake \\ git \\ wget """) .run_command("pip install torch==2.1.0 transformers==4.35.0 opencv-python==4.8.0.76") # Cleanup stage - remove build dependencies .run_command(""" apt-get autoremove -y build-essential cmake && \\ apt-get clean && \\ rm -rf /var/lib/apt/lists/* """) .add("./app", "/app") .set_workdir("/app") ) ``` ### Custom Base with Pre-trained Models ```python from chutes.image import Image # Include pre-downloaded models in the image image = ( Image( username="myuser", name="llm-server", tag="mistral-7b", base_image="python:3.11-slim" ) .run_command("mkdir -p /models") .add("./models/mistral-7b-instruct", "/models/mistral-7b-instruct") .run_command("pip install vllm==0.2.5 transformers==4.35.0 torch==2.1.0") .with_env("MODEL_PATH", "/models/mistral-7b-instruct") .with_env("CUDA_VISIBLE_DEVICES", "0") ) ``` ### Database Integration ```python from chutes.image import Image # Image with PostgreSQL and Redis image = ( Image( username="myuser", name="full-stack-ai", tag="latest", base_image="ubuntu:22.04" ) .run_command(""" apt-get update && apt-get install -y \\ python3.11 \\ python3-pip \\ postgresql-14 \\ redis-server \\ supervisor """) .run_command("pip install fastapi==0.104.1 uvicorn==0.24.0 psycopg2-binary==2.9.7 redis==5.0.0 sqlalchemy==2.0.23") .add("./config/supervisor.conf", "/etc/supervisor/conf.d/") .add("./app", "/app") .set_workdir("/app") ) ``` ## Specialized Use Cases ### Computer Vision Pipeline ```python from chutes.image import Image # OpenCV + deep learning for computer vision image = ( Image( username="myuser", name="cv-pipeline", tag="opencv-4.8", python_version="3.11" ) .run_command(""" apt-get update && apt-get install -y \\ libopencv-dev \\ libglib2.0-0 \\ libsm6 \\ libxext6 \\ libxrender-dev \\ libgomp1 """) .run_command("pip install opencv-python==4.8.0.76 opencv-contrib-python==4.8.0.76 pillow==10.0.1 numpy==1.24.3 scikit-image==0.21.0 ultralytics==8.0.206") 
.add("./models/yolo", "/app/models/yolo") .add("./utils", "/app/utils") ) ``` ### Audio Processing ```python from chutes.image import Image # Specialized audio processing environment image = ( Image( username="myuser", name="audio-ml", tag="latest", python_version="3.11" ) .run_command(""" apt-get update && apt-get install -y \\ ffmpeg \\ libsndfile1 \\ libsndfile1-dev \\ portaudio19-dev """) .run_command("pip install librosa==0.10.1 soundfile==0.12.1 pyaudio==0.2.11 pydub==0.25.1 whisper==1.1.10 torch==2.1.0 torchaudio==2.1.0") .add("./audio_models", "/app/models") ) ``` ### Scientific Computing ```python from chutes.image import Image # Scientific Python stack with CUDA support image = ( Image( username="myuser", name="scientific-gpu", tag="cuda-scipy", base_image="nvidia/cuda:12.1-devel-ubuntu22.04" ) .run_command(""" apt-get update && apt-get install -y \\ python3.11 \\ python3.11-pip \\ libhdf5-dev \\ libnetcdf-dev \\ gfortran """) .run_command("pip install numpy==1.24.3 scipy==1.11.4 pandas==2.0.3 matplotlib==3.7.2 seaborn==0.12.2 jupyter==1.0.0 cupy-cuda12x==12.3.0 numba==0.58.1") ) ``` ## Performance Optimization ### Layer Caching Strategy ```python from chutes.image import Image # Optimize layer caching for faster builds image = ( Image( username="myuser", name="cached-build", tag="optimized" ) # 1. Install system dependencies first (rarely change) .run_command("apt-get update && apt-get install -y git wget") # 2. Install stable Python packages next .run_command("pip install numpy==1.24.3 pandas==2.0.3 requests==2.31.0") # 3. Install ML frameworks (change occasionally) .run_command("pip install torch==2.1.0 transformers==4.35.0") # 4. 
Copy application code last (changes frequently) .add("./src", "/app/src") .add("requirements-dev.txt", "/app/") .run_command("pip install -r /app/requirements-dev.txt") ) ``` ### Minimizing Image Size ```python from chutes.image import Image # Minimal production image image = ( Image( username="myuser", name="minimal-prod", tag="slim", base_image="python:3.11-slim" ) # Install only runtime dependencies .run_command(""" apt-get update && \\ apt-get install -y --no-install-recommends \\ libgomp1 && \\ apt-get clean && \\ rm -rf /var/lib/apt/lists/* """) # Use --no-deps and specific versions .run_command("pip install torch==2.1.0+cpu transformers==4.35.0 --extra-index-url https://download.pytorch.org/whl/cpu") # Remove unnecessary files .run_command(""" find /usr/local/lib/python3.11/site-packages -name "*.pyc" -delete && \\ find /usr/local/lib/python3.11/site-packages -name "__pycache__" -delete """) ) ``` ## Security Best Practices ### Secure Base Configuration ```python from chutes.image import Image # Security-hardened image image = ( Image( username="myuser", name="secure-app", tag="hardened", python_version="3.11" ) # Create non-root user .run_command(""" groupadd -r appuser && \\ useradd -r -g appuser -d /app -s /sbin/nologin appuser """) # Install security updates .run_command(""" apt-get update && \\ apt-get upgrade -y && \\ apt-get install -y --no-install-recommends \\ ca-certificates && \\ apt-get clean """) # Set up application directory .run_command("mkdir -p /app && chown -R appuser:appuser /app") .add("./app", "/app") .run_command("chown -R appuser:appuser /app") .set_workdir("/app") .set_user("appuser") ) ``` ### Environment Variables Management ```python from chutes.image import Image # Secure environment setup image = ( Image( username="myuser", name="secure-env", tag="latest" ) .with_env("PYTHONUNBUFFERED", "1") .with_env("PYTHONHASHSEED", "random") .with_env("PIP_NO_CACHE_DIR", "off") .with_env("PIP_DISABLE_PIP_VERSION_CHECK", "on") # Security 
settings .with_env("PYTHONDONTWRITEBYTECODE", "1") .with_env("PYTHONASYNCIODEBUG", "0") ) ``` ## Integration Examples ### Using Custom Images in Chutes ```python from chutes.chute import Chute, NodeSelector # Deploy with custom image chute = Chute( username="myuser", name="custom-ml-service", image=image, # Your custom image from above entry_file="app.py", entry_point="run", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16 ), timeout_seconds=300, concurrency=4 ) result = chute.deploy() print(f"Deployed with custom image: {result}") ``` ### Multi-Environment Deployment ```python # Development image dev_image = Image( username="myuser", name="ml-app", tag="dev" ).run_command("pip install pytest black flake8") # Production image prod_image = Image( username="myuser", name="ml-app", tag="prod" ).run_command("pip install gunicorn prometheus-client") # Use different images per environment if environment == "development": chute = Chute(name="ml-dev", image=dev_image, ...) else: chute = Chute(name="ml-prod", image=prod_image, ...) 
``` ## Troubleshooting ### Common Issues **Build Failures:** ```python # Fix: Pin exact package versions .run_command("pip install torch==2.1.0 numpy==1.24.3") ``` **Large Image Sizes:** ```python # Fix: Install build tools, build, and clean up in a single layer .run_command(""" apt-get update && apt-get install -y build-essential && \\ pip install package && \\ apt-get remove -y build-essential && \\ apt-get autoremove -y && \\ rm -rf /var/lib/apt/lists/* """) ``` **Permission Issues:** ```python # Fix: Set proper ownership .add("./app", "/app") .run_command("chown -R appuser:appuser /app") ``` ### Debugging Images ```python # Add debugging tools during development debug_image = ( base_image .run_command("pip install ipdb pdbpp memory-profiler") .run_command("apt-get install -y htop curl") ) ``` ## Next Steps - **[Performance Guide](../guides/performance)** - Optimize your custom images - **[Best Practices](../guides/best-practices)** - Production deployment patterns - **[Security Guide](../guides/security)** - Secure your applications - **[Template Images](../templates/)** - Pre-built optimized images For more complex configurations and enterprise use cases, see the [Advanced Docker Guide](../guides/advanced-docker). --- ## SOURCE: https://chutes.ai/docs/examples/custom-training # Custom Model Training with Chutes This guide demonstrates how to train custom machine learning models using Chutes, from data preparation through deployment of the trained models.
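Before wiring up the full transformers pipeline below, the two pieces of bookkeeping it relies on throughout — mapping string labels to integer ids and scoring predictions against targets — can be sketched in plain, dependency-free Python. The function names here are illustrative, not part of the Chutes SDK:

```python
# Minimal sketch of two helpers the training guide uses repeatedly:
# encoding string labels as integer ids, and computing accuracy.
# (Illustrative only; the full pipeline below uses transformers' Trainer.)

def build_label_to_id(labels):
    """Map the sorted unique string labels to integer ids."""
    return {name: idx for idx, name in enumerate(sorted(set(labels)))}

def accuracy(predictions, targets):
    """Fraction of predictions that match the targets."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

labels = ["negative", "positive", "neutral", "positive"]
label_to_id = build_label_to_id(labels)
encoded = [label_to_id[l] for l in labels]

print(label_to_id)                      # {'negative': 0, 'neutral': 1, 'positive': 2}
print(encoded)                          # [0, 2, 1, 2]
print(accuracy([0, 2, 1, 1], encoded))  # 0.75
```

The same label-to-id mapping appears later inside `tokenize_function`, and the accuracy calculation mirrors what `compute_metrics` reports during evaluation.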
## Overview Custom training enables: - **Fine-tuning Pre-trained Models**: Adapt existing models to your specific use case - **Training from Scratch**: Build models for unique domains or tasks - **Distributed Training**: Scale training across multiple GPUs and nodes - **Experiment Tracking**: Monitor training progress and compare experiments - **Model Versioning**: Manage different model versions and deployments ## Quick Start ### Basic Fine-tuning Setup ```python from chutes.image import Image from chutes.chute import Chute, NodeSelector from pydantic import BaseModel from typing import List, Dict, Any, Optional class TrainingConfig(BaseModel): model_name: str dataset_path: str num_epochs: int = 3 batch_size: int = 16 learning_rate: float = 2e-5 output_dir: str = "/models/output" save_steps: int = 500 eval_steps: int = 100 # Training image with ML frameworks training_image = ( Image( username="myuser", name="custom-training", tag="1.0.0", base_image="nvidia/cuda:12.1.0-devel-ubuntu22.04", python_version="3.11" ) .run_command("pip install torch==2.1.0+cu121 transformers==4.35.0 datasets==2.14.0 accelerate==0.24.0 wandb==0.16.0 tensorboard==2.15.0 --extra-index-url https://download.pytorch.org/whl/cu121") .add("./training", "/app/training") .add("./data", "/app/data") ) ``` ## Text Classification Fine-tuning ### Complete Training Pipeline ```python import torch from transformers import ( AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding ) from datasets import Dataset, load_dataset import wandb import numpy as np from sklearn.metrics import accuracy_score, precision_recall_fscore_support import logging class TextClassificationTrainer: def __init__(self, config: TrainingConfig): self.config = config self.tokenizer = None self.model = None self.train_dataset = None self.val_dataset = None # Initialize logging logging.basicConfig(level=logging.INFO) self.logger = logging.getLogger(__name__) # Initialize W&B for
experiment tracking wandb.init( project="chutes-training", config=config.dict(), name=f"training-{config.model_name.replace('/', '-')}" ) def load_model_and_tokenizer(self): """Load pre-trained model and tokenizer""" self.logger.info(f"Loading model: {self.config.model_name}") self.tokenizer = AutoTokenizer.from_pretrained(self.config.model_name) # Add padding token if missing if self.tokenizer.pad_token is None: self.tokenizer.pad_token = self.tokenizer.eos_token # Load model with number of labels self.model = AutoModelForSequenceClassification.from_pretrained( self.config.model_name, num_labels=len(self.get_label_names()) ) # Resize token embeddings if necessary self.model.resize_token_embeddings(len(self.tokenizer)) def load_and_prepare_data(self): """Load and preprocess training data""" self.logger.info(f"Loading dataset from: {self.config.dataset_path}") # Load dataset (assumes CSV format with 'text' and 'label' columns) if self.config.dataset_path.endswith('.csv'): dataset = load_dataset('csv', data_files=self.config.dataset_path)['train'] else: dataset = load_dataset(self.config.dataset_path)['train'] # Split into train/validation dataset = dataset.train_test_split(test_size=0.2, seed=42) # Tokenize datasets self.train_dataset = dataset['train'].map( self.tokenize_function, batched=True, remove_columns=dataset['train'].column_names ) self.val_dataset = dataset['test'].map( self.tokenize_function, batched=True, remove_columns=dataset['test'].column_names ) self.logger.info(f"Training samples: {len(self.train_dataset)}") self.logger.info(f"Validation samples: {len(self.val_dataset)}") def tokenize_function(self, examples): """Tokenize text data""" tokenized = self.tokenizer( examples['text'], truncation=True, padding=False, # Will be handled by data collator max_length=512 ) # Convert labels to integers if they're strings if isinstance(examples['label'][0], str): label_names = self.get_label_names() label_to_id = {name: idx for idx, name in 
enumerate(label_names)} tokenized['labels'] = [label_to_id[label] for label in examples['label']] else: tokenized['labels'] = examples['label'] return tokenized def get_label_names(self): """Get unique label names from dataset""" # This should be implemented based on your specific dataset # For example, for sentiment analysis: return ["negative", "neutral", "positive"] def compute_metrics(self, eval_pred): """Compute evaluation metrics""" predictions, labels = eval_pred predictions = np.argmax(predictions, axis=1) precision, recall, f1, _ = precision_recall_fscore_support( labels, predictions, average='weighted' ) accuracy = accuracy_score(labels, predictions) return { 'accuracy': accuracy, 'f1': f1, 'precision': precision, 'recall': recall } def train(self): """Train the model""" self.logger.info("Starting training...") # Training arguments training_args = TrainingArguments( output_dir=self.config.output_dir, num_train_epochs=self.config.num_epochs, per_device_train_batch_size=self.config.batch_size, per_device_eval_batch_size=self.config.batch_size, learning_rate=self.config.learning_rate, weight_decay=0.01, logging_dir=f"{self.config.output_dir}/logs", logging_steps=50, evaluation_strategy="steps", eval_steps=self.config.eval_steps, save_strategy="steps", save_steps=self.config.save_steps, load_best_model_at_end=True, metric_for_best_model="f1", greater_is_better=True, warmup_steps=100, fp16=True, # Enable mixed precision training dataloader_num_workers=4, report_to="wandb" ) # Data collator data_collator = DataCollatorWithPadding( tokenizer=self.tokenizer, padding=True ) # Initialize trainer trainer = Trainer( model=self.model, args=training_args, train_dataset=self.train_dataset, eval_dataset=self.val_dataset, tokenizer=self.tokenizer, data_collator=data_collator, compute_metrics=self.compute_metrics ) # Train the model train_result = trainer.train() # Save the final model trainer.save_model() trainer.save_state() # Log final metrics 
self.logger.info(f"Training completed!") self.logger.info(f"Final train loss: {train_result.training_loss}") # Final evaluation eval_result = trainer.evaluate() self.logger.info(f"Final evaluation: {eval_result}") return trainer async def run_training(inputs: Dict[str, Any]) -> Dict[str, Any]: """Main training entry point""" config = TrainingConfig(**inputs['config']) trainer = TextClassificationTrainer(config) # Load model and data trainer.load_model_and_tokenizer() trainer.load_and_prepare_data() # Train the model trained_model = trainer.train() return { "status": "completed", "model_path": config.output_dir, "training_samples": len(trainer.train_dataset), "validation_samples": len(trainer.val_dataset) } ``` ### Deploy Training Chute ```python # Create training chute training_chute = Chute( username="myuser", name="text-classification-training", image=training_image, entry_file="training.py", entry_point="run_training", node_selector=NodeSelector( gpu_count=2, min_vram_gb_per_gpu=24), timeout_seconds=3600, # 1 hour for training concurrency=1 # Training should run sequentially ) # Start training training_config = { "config": { "model_name": "bert-base-uncased", "dataset_path": "/app/data/sentiment_dataset.csv", "num_epochs": 3, "batch_size": 16, "learning_rate": 2e-5, "output_dir": "/models/sentiment-classifier" } } result = training_chute.run(training_config) print(f"Training result: {result}") ``` ## Computer Vision Training ### Image Classification ```python import torch import torch.nn as nn from torchvision import transforms, models, datasets from torch.utils.data import DataLoader import timm from PIL import Image class ImageClassificationTrainer: def __init__(self, config: TrainingConfig): self.config = config self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') self.model = None self.train_loader = None self.val_loader = None def load_model(self, num_classes: int): """Load pre-trained vision model""" if "vit" in 
self.config.model_name.lower(): # Vision Transformer self.model = timm.create_model( self.config.model_name, pretrained=True, num_classes=num_classes ) else: # ResNet or other CNN self.model = models.resnet50(pretrained=True) self.model.fc = nn.Linear(self.model.fc.in_features, num_classes) self.model.to(self.device) def prepare_data(self): """Prepare image datasets""" # Data transforms train_transform = transforms.Compose([ transforms.Resize((224, 224)), transforms.RandomHorizontalFlip(), transforms.RandomRotation(10), transforms.ColorJitter(brightness=0.2, contrast=0.2), transforms.ToTensor(), transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) ]) val_transform = transforms.Compose([ transforms.Resize((224, 224)), transforms.ToTensor(), transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) ]) # Load datasets train_dataset = datasets.ImageFolder( root=f"{self.config.dataset_path}/train", transform=train_transform ) val_dataset = datasets.ImageFolder( root=f"{self.config.dataset_path}/val", transform=val_transform ) # Data loaders self.train_loader = DataLoader( train_dataset, batch_size=self.config.batch_size, shuffle=True, num_workers=4, pin_memory=True ) self.val_loader = DataLoader( val_dataset, batch_size=self.config.batch_size, shuffle=False, num_workers=4, pin_memory=True ) return len(train_dataset.classes) def train(self): """Train the vision model""" num_classes = self.prepare_data() self.load_model(num_classes) criterion = nn.CrossEntropyLoss() optimizer = torch.optim.AdamW( self.model.parameters(), lr=self.config.learning_rate, weight_decay=0.01 ) scheduler = torch.optim.lr_scheduler.CosineAnnealingLR( optimizer, T_max=self.config.num_epochs ) best_val_acc = 0.0 for epoch in range(self.config.num_epochs): # Training phase self.model.train() train_loss = 0.0 train_correct = 0 train_total = 0 for batch_idx, (data, targets) in enumerate(self.train_loader): data, targets = data.to(self.device), 
targets.to(self.device) optimizer.zero_grad() outputs = self.model(data) loss = criterion(outputs, targets) loss.backward() optimizer.step() train_loss += loss.item() _, predicted = outputs.max(1) train_total += targets.size(0) train_correct += predicted.eq(targets).sum().item() if batch_idx % 100 == 0: print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}') # Validation phase val_acc = self.evaluate() scheduler.step() # Save best model if val_acc > best_val_acc: best_val_acc = val_acc torch.save(self.model.state_dict(), f"{self.config.output_dir}/best_model.pth") print(f'Epoch {epoch}: Train Acc: {100.*train_correct/train_total:.2f}%, ' f'Val Acc: {val_acc:.2f}%') def evaluate(self): """Evaluate model on validation set""" self.model.eval() correct = 0 total = 0 with torch.no_grad(): for data, targets in self.val_loader: data, targets = data.to(self.device), targets.to(self.device) outputs = self.model(data) _, predicted = outputs.max(1) total += targets.size(0) correct += predicted.eq(targets).sum().item() return 100. 
* correct / total ``` ## Distributed Training ### Multi-GPU Training Setup ```python import torch.distributed as dist import torch.multiprocessing as mp from torch.nn.parallel import DistributedDataParallel as DDP from torch.utils.data.distributed import DistributedSampler class DistributedTrainer: def __init__(self, rank, world_size, config): self.rank = rank self.world_size = world_size self.config = config # Initialize distributed training dist.init_process_group( backend='nccl', rank=rank, world_size=world_size ) torch.cuda.set_device(rank) self.device = torch.device(f'cuda:{rank}') def setup_model(self, model): """Setup model for distributed training""" model = model.to(self.device) model = DDP(model, device_ids=[self.rank]) return model def setup_dataloader(self, dataset, batch_size): """Setup distributed dataloader""" sampler = DistributedSampler( dataset, num_replicas=self.world_size, rank=self.rank, shuffle=True ) dataloader = DataLoader( dataset, batch_size=batch_size, sampler=sampler, num_workers=4, pin_memory=True ) return dataloader, sampler def train_epoch(self, model, dataloader, optimizer, criterion, epoch): """Train one epoch with distributed setup""" model.train() total_loss = 0 for batch_idx, (data, targets) in enumerate(dataloader): data, targets = data.to(self.device), targets.to(self.device) optimizer.zero_grad() outputs = model(data) loss = criterion(outputs, targets) loss.backward() optimizer.step() total_loss += loss.item() if self.rank == 0 and batch_idx % 100 == 0: print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}') return total_loss / len(dataloader) def run_distributed_training(rank, world_size, config): """Run distributed training on multiple GPUs""" trainer = DistributedTrainer(rank, world_size, config) # Setup model, data, etc. # ... 
(model and data setup code) # Cleanup dist.destroy_process_group() async def run_multi_gpu_training(inputs: Dict[str, Any]) -> Dict[str, Any]: """Launch multi-GPU training""" config = TrainingConfig(**inputs['config']) world_size = torch.cuda.device_count() if world_size > 1: mp.spawn( run_distributed_training, args=(world_size, config), nprocs=world_size, join=True ) else: # Single GPU training trainer = TextClassificationTrainer(config) trainer.train() return {"status": "completed", "gpus_used": world_size} ``` ## Model Deployment Pipeline ### Trained Model Serving ```python from chutes.chute import Chute from transformers import AutoTokenizer, AutoModelForSequenceClassification import torch class ModelInferenceService: def __init__(self, model_path: str): self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') self.tokenizer = AutoTokenizer.from_pretrained(model_path) self.model = AutoModelForSequenceClassification.from_pretrained(model_path) self.model.to(self.device) self.model.eval() def predict(self, text: str) -> Dict[str, Any]: """Make prediction on input text""" inputs = self.tokenizer( text, return_tensors="pt", truncation=True, padding=True, max_length=512 ).to(self.device) with torch.no_grad(): outputs = self.model(**inputs) probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1) predicted_class = torch.argmax(probabilities, dim=-1).item() confidence = probabilities[0][predicted_class].item() return { "predicted_class": predicted_class, "confidence": confidence, "probabilities": probabilities[0].tolist() } # Global model instance model_service = None async def load_model(model_path: str): """Load trained model for inference""" global model_service model_service = ModelInferenceService(model_path) return {"status": "model_loaded"} async def predict(inputs: Dict[str, Any]) -> Dict[str, Any]: """Inference endpoint""" text = inputs["text"] result = model_service.predict(text) return result # Deploy inference service 
inference_chute = Chute( username="myuser", name="trained-model-inference", image=training_image, # Reuse training image entry_file="inference.py", entry_point="predict", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=8 ), timeout_seconds=60, concurrency=10 ) ``` ## Experiment Tracking ### Advanced Monitoring ```python import mlflow import mlflow.pytorch class ExperimentTracker: def __init__(self, experiment_name: str): mlflow.set_experiment(experiment_name) self.run = mlflow.start_run() def log_params(self, params: Dict[str, Any]): """Log hyperparameters""" for key, value in params.items(): mlflow.log_param(key, value) def log_metrics(self, metrics: Dict[str, float], step: int = None): """Log metrics""" for key, value in metrics.items(): mlflow.log_metric(key, value, step=step) def log_model(self, model, model_name: str): """Log trained model""" mlflow.pytorch.log_model(model, model_name) def log_artifacts(self, local_path: str): """Log training artifacts""" mlflow.log_artifacts(local_path) def finish(self): """End experiment run""" mlflow.end_run() # Integration with training class TrackedTrainer(TextClassificationTrainer): def __init__(self, config: TrainingConfig, experiment_name: str): super().__init__(config) self.tracker = ExperimentTracker(experiment_name) # Log hyperparameters self.tracker.log_params(config.dict()) def train(self): """Training with experiment tracking""" trainer = super().train() # Log final model self.tracker.log_model(self.model, "final_model") self.tracker.log_artifacts(self.config.output_dir) self.tracker.finish() return trainer ``` ## Next Steps - **[Model Deployment](../guides/model-deployment)** - Deploy trained models at scale - **[Performance Optimization](../guides/performance)** - Optimize training performance - **[MLOps Pipelines](../guides/mlops)** - Production ML workflows - **[Advanced Training](../guides/advanced-training)** - Advanced
training techniques For production training workflows, see the [Enterprise Training Guide](../guides/enterprise-training). --- ## SOURCE: https://chutes.ai/docs/examples/embeddings # Text Embeddings with TEI This guide demonstrates how to build powerful text embedding services using Text Embeddings Inference (TEI), enabling semantic search, similarity analysis, and retrieval-augmented generation (RAG) applications. ## Overview Text Embeddings Inference (TEI) is a high-performance embedding server that provides: - **Fast Inference**: Optimized for batch processing and low latency - **Multiple Models**: Support for various embedding architectures - **Similarity Search**: Built-in similarity and ranking capabilities - **Pooling Strategies**: Multiple pooling methods for optimal embeddings - **Batch Processing**: Efficient handling of multiple texts - **Production Ready**: Auto-scaling and error handling ## Complete Implementation ### Input Schema Design Define comprehensive input validation for embedding operations: ```python from pydantic import BaseModel, Field from typing import List, Optional, Union from enum import Enum class PoolingStrategy(str, Enum): CLS = "cls" # Use [CLS] token MEAN = "mean" # Mean pooling MAX = "max" # Max pooling MEAN_SQRT_LEN = "mean_sqrt_len" # Mean pooling with sqrt normalization class EmbeddingInput(BaseModel): inputs: Union[str, List[str]] # Single text or batch normalize: bool = Field(default=True) truncate: bool = Field(default=True) pooling: Optional[PoolingStrategy] = PoolingStrategy.MEAN class SimilarityInput(BaseModel): source_text: str target_texts: List[str] = Field(max_items=100) normalize: bool = Field(default=True) class RerankInput(BaseModel): query: str texts: List[str] = Field(max_items=50) top_k: Optional[int] = Field(default=None, ge=1, le=50) class SearchInput(BaseModel): query: str corpus: List[str] = Field(max_items=1000) top_k: int = Field(default=10, ge=1, le=100) threshold: Optional[float] = Field(default=None, 
ge=0.0, le=1.0) ``` ### Custom Image with TEI Build a custom image with Text Embeddings Inference: ```python from chutes.image import Image from chutes.chute import Chute, NodeSelector image = ( Image( username="myuser", name="text-embeddings", tag="0.0.1", readme="High-performance text embeddings with TEI") .from_base("parachutes/base-python:3.11") .run_command("pip install --upgrade pip") .run_command("pip install text-embeddings-inference-client") .run_command("pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118") .run_command("pip install transformers sentence-transformers") .run_command("pip install numpy scikit-learn faiss-cpu") .run_command("pip install loguru pydantic fastapi") # Install TEI server .run_command( "wget https://github.com/huggingface/text-embeddings-inference/releases/download/v1.2.3/text-embeddings-inference-1.2.3-x86_64-unknown-linux-gnu.tar.gz && " "tar -xzf text-embeddings-inference-1.2.3-x86_64-unknown-linux-gnu.tar.gz && " "chmod +x text-embeddings-inference && " "mv text-embeddings-inference /usr/local/bin/" ) ) ``` ### Chute Configuration Configure the service with appropriate GPU and memory requirements: ```python chute = Chute( username="myuser", name="text-embeddings-service", tagline="High-performance text embeddings and similarity search", readme="Production-ready text embedding service with similarity search, reranking, and semantic analysis capabilities", image=image, node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16, # Sufficient for most embedding models ), concurrency=8, # Handle multiple concurrent requests ) ``` ### Model Initialization Initialize the embedding model and TEI server: ```python import subprocess import time import requests from loguru import logger @chute.on_startup() async def initialize_embeddings(self): """ Initialize TEI server and embedding capabilities. 
""" import torch import numpy as np from sentence_transformers import SentenceTransformer # Model configuration self.model_name = "sentence-transformers/all-MiniLM-L6-v2" # Default model self.tei_port = 8080 self.tei_url = f"http://localhost:{self.tei_port}" # Start TEI server in background logger.info("Starting TEI server...") self.tei_process = subprocess.Popen([ "text-embeddings-inference", "--model-id", self.model_name, "--port", str(self.tei_port), "--max-concurrent-requests", "32", "--max-batch-tokens", "16384", "--max-batch-requests", "16" ]) # Wait for server to start max_wait = 120 for i in range(max_wait): try: response = requests.get(f"{self.tei_url}/health", timeout=5) if response.status_code == 200: logger.success("TEI server started successfully") break except requests.exceptions.RequestException: if i < max_wait - 1: time.sleep(1) else: raise Exception("TEI server failed to start") # Initialize fallback model for local processing logger.info("Loading fallback sentence transformer...") self.sentence_transformer = SentenceTransformer(self.model_name) # Store utilities self.torch = torch self.numpy = np # Initialize vector storage (in-memory for this example) self.vector_store = {} self.text_store = {} # Warmup await self._warmup_model() async def _warmup_model(self): """Perform warmup embedding generation.""" warmup_texts = [ "This is a warmup sentence to initialize the embedding model.", "Another test sentence for model warming.", "Final warmup text to ensure optimal performance." 
] try: # Warmup TEI server response = requests.post( f"{self.tei_url}/embed", json={"inputs": warmup_texts}, timeout=30 ) if response.status_code == 200: logger.info("TEI server warmed up successfully") else: logger.warning("TEI warmup failed, using fallback model") # Warmup fallback model _ = self.sentence_transformer.encode(warmup_texts) except Exception as e: logger.warning(f"Warmup failed: {e}, using fallback model") _ = self.sentence_transformer.encode(warmup_texts) ``` ### Core Embedding Functions Implement core embedding functionality: ```python import hashlib import numpy as np import requests from loguru import logger from typing import List, Dict, Tuple, Union async def get_embeddings(self, texts: Union[str, List[str]], normalize: bool = True) -> np.ndarray: """ Get embeddings for text(s) using TEI server or fallback. """ if isinstance(texts, str): texts = [texts] try: # Try TEI server first response = requests.post( f"{self.tei_url}/embed", json={ "inputs": texts, "normalize": normalize, "truncate": True }, timeout=30 ) if response.status_code == 200: embeddings = self.numpy.array(response.json()) return embeddings else: logger.warning(f"TEI server error: {response.status_code}, using fallback") except Exception as e: logger.warning(f"TEI server failed: {e}, using fallback") # Fallback to local model embeddings = self.sentence_transformer.encode( texts, normalize_embeddings=normalize, convert_to_numpy=True ) return embeddings def compute_similarity(self, embeddings1: np.ndarray, embeddings2: np.ndarray) -> np.ndarray: """Compute cosine similarity between embeddings.""" # Ensure 2-D shape for the matrix operations below if embeddings1.ndim == 1: embeddings1 = embeddings1.reshape(1, -1) if embeddings2.ndim == 1: embeddings2 = embeddings2.reshape(1, -1) # Compute cosine similarity dot_product = self.numpy.dot(embeddings1, embeddings2.T) norms1 = self.numpy.linalg.norm(embeddings1, axis=1, keepdims=True) norms2 = self.numpy.linalg.norm(embeddings2, axis=1, keepdims=True) similarities = dot_product / (norms1 * norms2.T) return
similarities def add_to_vector_store(self, texts: List[str], embeddings: np.ndarray, collection: str = "default"): """Add texts and embeddings to vector store.""" if collection not in self.vector_store: self.vector_store[collection] = [] self.text_store[collection] = [] for text, embedding in zip(texts, embeddings): text_id = hashlib.md5(text.encode()).hexdigest() self.vector_store[collection].append({ "id": text_id, "embedding": embedding, "text": text }) self.text_store[collection].append(text) ``` ### Embedding Generation Endpoints Create endpoints for different embedding operations: ```python from fastapi import HTTPException @chute.cord( public_api_path="/embed", public_api_method="POST", stream=False) async def generate_embeddings(self, args: EmbeddingInput) -> Dict: """ Generate embeddings for input text(s). """ try: embeddings = await get_embeddings(self, args.inputs, args.normalize) # Convert to list for JSON serialization embeddings_list = embeddings.tolist() if isinstance(args.inputs, str): return { "embeddings": embeddings_list[0], "model": self.model_name, "dimension": len(embeddings_list[0]) } else: return { "embeddings": embeddings_list, "model": self.model_name, "dimension": len(embeddings_list[0]), "count": len(embeddings_list) } except Exception as e: logger.error(f"Embedding generation failed: {e}") raise HTTPException(status_code=500, detail=f"Embedding generation failed: {str(e)}") @chute.cord( public_api_path="/similarity", public_api_method="POST", stream=False) async def compute_text_similarity(self, args: SimilarityInput) -> Dict: """ Compute similarity between source text and target texts. 
""" try: # Get embeddings for all texts all_texts = [args.source_text] + args.target_texts embeddings = await get_embeddings(self, all_texts, args.normalize) # Separate source and target embeddings source_embedding = embeddings[0:1] target_embeddings = embeddings[1:] # Compute similarities similarities = compute_similarity(self, source_embedding, target_embeddings) similarity_scores = similarities[0].tolist() # Create results with metadata results = [] for i, (text, score) in enumerate(zip(args.target_texts, similarity_scores)): results.append({ "text": text, "similarity": float(score), "rank": i + 1 }) # Sort by similarity (descending) results.sort(key=lambda x: x["similarity"], reverse=True) # Update ranks for i, result in enumerate(results): result["rank"] = i + 1 return { "source_text": args.source_text, "results": results, "model": self.model_name } except Exception as e: logger.error(f"Similarity computation failed: {e}") raise HTTPException(status_code=500, detail=f"Similarity computation failed: {str(e)}") @chute.cord( public_api_path="/rerank", public_api_method="POST", stream=False) async def rerank_texts(self, args: RerankInput) -> Dict: """ Rerank texts based on relevance to query. 
""" try: # Get embeddings query_embedding = await get_embeddings(self, args.query, normalize=True) text_embeddings = await get_embeddings(self, args.texts, normalize=True) # Compute similarities similarities = compute_similarity(self, query_embedding, text_embeddings) scores = similarities[0].tolist() # Create scored results scored_texts = [ { "text": text, "score": float(score), "index": i } for i, (text, score) in enumerate(zip(args.texts, scores)) ] # Sort by score (descending) scored_texts.sort(key=lambda x: x["score"], reverse=True) # Apply top_k limit if specified if args.top_k: scored_texts = scored_texts[:args.top_k] # Add ranks for rank, item in enumerate(scored_texts): item["rank"] = rank + 1 return { "query": args.query, "results": scored_texts, "total_results": len(scored_texts), "model": self.model_name } except Exception as e: logger.error(f"Reranking failed: {e}") raise HTTPException(status_code=500, detail=f"Reranking failed: {str(e)}") ``` ### Semantic Search Implementation Build a complete semantic search system: ```python @chute.cord( public_api_path="/search", public_api_method="POST", stream=False) async def semantic_search(self, args: SearchInput) -> Dict: """ Perform semantic search over a corpus of texts. 
""" try: # Get query embedding query_embedding = await get_embeddings(self, args.query, normalize=True) # Get corpus embeddings (batch processing for efficiency) corpus_embeddings = await get_embeddings(self, args.corpus, normalize=True) # Compute similarities similarities = compute_similarity(self, query_embedding, corpus_embeddings) scores = similarities[0] # Create results with scores results = [] for i, (text, score) in enumerate(zip(args.corpus, scores)): if args.threshold is None or score >= args.threshold: results.append({ "text": text, "score": float(score), "corpus_index": i }) # Sort by score (descending) and take top_k results.sort(key=lambda x: x["score"], reverse=True) results = results[:args.top_k] # Add ranks for rank, result in enumerate(results): result["rank"] = rank + 1 return { "query": args.query, "results": results, "total_corpus_size": len(args.corpus), "results_returned": len(results), "model": self.model_name, "threshold": args.threshold } except Exception as e: logger.error(f"Semantic search failed: {e}") raise HTTPException(status_code=500, detail=f"Semantic search failed: {str(e)}") ``` ## Advanced Features ### Vector Store Management Implement persistent vector storage: ```python class VectorStoreInput(BaseModel): collection: str = "default" texts: List[str] metadata: Optional[Dict] = None class SearchStoreInput(BaseModel): collection: str = "default" query: str top_k: int = Field(default=10, ge=1, le=100) filter_metadata: Optional[Dict] = None @chute.cord(public_api_path="/store/add", method="POST") async def add_to_store(self, args: VectorStoreInput) -> Dict: """Add texts to persistent vector store.""" try: # Generate embeddings embeddings = await get_embeddings(self, args.texts, normalize=True) # Add to store add_to_vector_store(self, args.texts, embeddings, args.collection) return { "collection": args.collection, "added_count": len(args.texts), "total_in_collection": len(self.text_store.get(args.collection, [])) } except Exception 
as e: raise HTTPException(status_code=500, detail=f"Failed to add to store: {str(e)}") @chute.cord(public_api_path="/store/search", public_api_method="POST") async def search_store(self, args: SearchStoreInput) -> Dict: """Search within a specific collection.""" if args.collection not in self.vector_store: raise HTTPException(status_code=404, detail=f"Collection '{args.collection}' not found") try: # Get query embedding query_embedding = await get_embeddings(self, args.query, normalize=True) # Get stored embeddings stored_items = self.vector_store[args.collection] stored_embeddings = self.numpy.array([item["embedding"] for item in stored_items]) # Compute similarities similarities = compute_similarity(self, query_embedding, stored_embeddings) scores = similarities[0] # Create results results = [] for item, score in zip(stored_items, scores): results.append({ "text": item["text"], "score": float(score), "id": item["id"] }) # Sort and limit results.sort(key=lambda x: x["score"], reverse=True) results = results[:args.top_k] # Add ranks for rank, result in enumerate(results): result["rank"] = rank + 1 return { "collection": args.collection, "query": args.query, "results": results, "total_in_collection": len(stored_items) } except Exception as e: raise HTTPException(status_code=500, detail=f"Store search failed: {str(e)}") @chute.cord(public_api_path="/store/collections", public_api_method="GET") async def list_collections(self) -> Dict: """List all available collections.""" collections = [] for name, texts in self.text_store.items(): collections.append({ "name": name, "size": len(texts), "sample_texts": texts[:3] if texts else [] }) return {"collections": collections} ``` ### Batch Processing Optimization Optimize for large-scale batch operations: ```python class BatchEmbeddingInput(BaseModel): texts: List[str] = Field(max_items=1000) batch_size: int = Field(default=32, ge=1, le=128) normalize: bool = True @chute.cord(public_api_path="/embed/batch", public_api_method="POST") async def
batch_embeddings(self, args: BatchEmbeddingInput) -> Dict: """Process large batches of texts efficiently.""" try: all_embeddings = [] processed_count = 0 # Process in batches for i in range(0, len(args.texts), args.batch_size): batch_texts = args.texts[i:i + args.batch_size] batch_embeddings = await get_embeddings(self, batch_texts, args.normalize) all_embeddings.extend(batch_embeddings.tolist()) processed_count += len(batch_texts) # Optional: yield progress for very large batches if processed_count % 100 == 0: logger.info(f"Processed {processed_count}/{len(args.texts)} texts") return { "embeddings": all_embeddings, "processed_count": processed_count, "batch_size": args.batch_size, "model": self.model_name, "dimension": len(all_embeddings[0]) if all_embeddings else 0 } except Exception as e: logger.error(f"Batch embedding failed: {e}") raise HTTPException(status_code=500, detail=f"Batch processing failed: {str(e)}") ``` ### Clustering and Analysis Add text clustering capabilities: ```python from sklearn.cluster import KMeans from sklearn.decomposition import PCA class ClusterInput(BaseModel): texts: List[str] = Field(min_items=2, max_items=500) n_clusters: int = Field(default=5, ge=2, le=20) method: str = Field(default="kmeans") @chute.cord(public_api_path="/cluster", method="POST") async def cluster_texts(self, args: ClusterInput) -> Dict: """Cluster texts based on semantic similarity.""" try: # Get embeddings embeddings = await get_embeddings(self, args.texts, normalize=True) # Perform clustering if args.method == "kmeans": # Adjust number of clusters if needed n_clusters = min(args.n_clusters, len(args.texts)) kmeans = KMeans(n_clusters=n_clusters, random_state=42) cluster_labels = kmeans.fit_predict(embeddings) # Get cluster centers cluster_centers = kmeans.cluster_centers_ else: raise HTTPException(status_code=400, detail=f"Unsupported clustering method: {args.method}") # Organize results by cluster clusters = {} for i, (text, label) in 
enumerate(zip(args.texts, cluster_labels)): label = int(label) if label not in clusters: clusters[label] = [] clusters[label].append({ "text": text, "index": i }) # Calculate cluster statistics cluster_stats = [] for label, items in clusters.items(): # Find centroid text (closest to cluster center) cluster_embeddings = embeddings[[item["index"] for item in items]] center = cluster_centers[label] # Compute distances to center distances = self.numpy.linalg.norm(cluster_embeddings - center, axis=1) centroid_idx = self.numpy.argmin(distances) cluster_stats.append({ "cluster_id": label, "size": len(items), "centroid_text": items[centroid_idx]["text"], "texts": [item["text"] for item in items] }) return { "clusters": cluster_stats, "n_clusters": len(clusters), "method": args.method, "total_texts": len(args.texts) } except Exception as e: logger.error(f"Clustering failed: {e}") raise HTTPException(status_code=500, detail=f"Clustering failed: {str(e)}") ``` ## Deployment and Usage ### Deploy the Service ```bash # Build and deploy the embeddings service chutes deploy my_embeddings:chute # Monitor the deployment chutes chutes get my-embeddings ``` ### Using the API #### Basic Embedding Generation ```bash curl -X POST "https://myuser-my-embeddings.chutes.ai/embed" \ -H "Content-Type: application/json" \ -d '{ "inputs": "This is a sample text for embedding generation", "normalize": true }' ``` #### Similarity Search ```bash curl -X POST "https://myuser-my-embeddings.chutes.ai/similarity" \ -H "Content-Type: application/json" \ -d '{ "source_text": "machine learning algorithms", "target_texts": [ "artificial intelligence techniques", "cooking recipes", "neural network models", "gardening tips", "deep learning frameworks" ], "normalize": true }' ``` #### Python Client Example ```python import requests from typing import List, Dict, Optional class EmbeddingsClient: def __init__(self, base_url: str): self.base_url = base_url.rstrip('/') def embed(self, texts: Union[str, 
List[str]], normalize: bool = True) -> Dict: """Generate embeddings for text(s).""" response = requests.post( f"{self.base_url}/embed", json={ "inputs": texts, "normalize": normalize } ) if response.status_code == 200: return response.json() else: raise Exception(f"Embedding failed: {response.status_code} - {response.text}") def similarity(self, source_text: str, target_texts: List[str]) -> Dict: """Compute similarity between source and target texts.""" response = requests.post( f"{self.base_url}/similarity", json={ "source_text": source_text, "target_texts": target_texts, "normalize": True } ) return response.json() def search(self, query: str, corpus: List[str], top_k: int = 10) -> Dict: """Perform semantic search over corpus.""" response = requests.post( f"{self.base_url}/search", json={ "query": query, "corpus": corpus, "top_k": top_k } ) return response.json() def rerank(self, query: str, texts: List[str], top_k: Optional[int] = None) -> Dict: """Rerank texts by relevance to query.""" payload = { "query": query, "texts": texts } if top_k: payload["top_k"] = top_k response = requests.post( f"{self.base_url}/rerank", json=payload ) return response.json() def add_to_store(self, texts: List[str], collection: str = "default") -> Dict: """Add texts to vector store.""" response = requests.post( f"{self.base_url}/store/add", json={ "texts": texts, "collection": collection } ) return response.json() def search_store(self, query: str, collection: str = "default", top_k: int = 10) -> Dict: """Search within stored collection.""" response = requests.post( f"{self.base_url}/store/search", json={ "query": query, "collection": collection, "top_k": top_k } ) return response.json() def cluster(self, texts: List[str], n_clusters: int = 5) -> Dict: """Cluster texts by semantic similarity.""" response = requests.post( f"{self.base_url}/cluster", json={ "texts": texts, "n_clusters": n_clusters, "method": "kmeans" } ) return response.json() # Usage examples client = 
EmbeddingsClient("https://myuser-my-embeddings.chutes.ai") # Generate embeddings result = client.embed("This is a test sentence") embedding = result["embeddings"] print(f"Embedding dimension: {result['dimension']}") # Batch embeddings batch_result = client.embed([ "First document about machine learning", "Second document about cooking", "Third document about artificial intelligence" ]) # Find similar texts similarity_result = client.similarity( source_text="artificial intelligence research", target_texts=[ "machine learning algorithms", "cooking recipes", "neural networks", "gardening techniques" ] ) print("Most similar texts:") for result in similarity_result["results"][:3]: print(f"- {result['text']} (similarity: {result['similarity']:.3f})") # Build a knowledge base documents = [ "Python is a programming language", "Machine learning uses algorithms to learn patterns", "Deep learning is a subset of machine learning", "Natural language processing analyzes text", "Computer vision processes images", "Reinforcement learning learns through trial and error" ] # Add to vector store client.add_to_store(documents, collection="ai_knowledge") # Search the knowledge base search_result = client.search_store( query="algorithms for learning", collection="ai_knowledge", top_k=3 ) print("Knowledge base search results:") for result in search_result["results"]: print(f"- {result['text']} (score: {result['score']:.3f})") # Cluster documents cluster_result = client.cluster(documents, n_clusters=3) print(f"Clustered into {cluster_result['n_clusters']} groups:") for cluster in cluster_result["clusters"]: print(f"Cluster {cluster['cluster_id']} ({cluster['size']} items):") print(f" Centroid: {cluster['centroid_text']}") ``` ## Best Practices ### 1. 
Model Selection ```python # Different models for different use cases model_recommendations = { "general_purpose": "sentence-transformers/all-MiniLM-L6-v2", # Fast, good quality "multilingual": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2", "high_quality": "sentence-transformers/all-mpnet-base-v2", # Best quality "domain_specific": "sentence-transformers/allenai-specter", # Scientific papers "code": "microsoft/codebert-base", # Code similarity } def select_model_for_use_case(use_case: str) -> str: """Select optimal model based on use case.""" return model_recommendations.get(use_case, model_recommendations["general_purpose"]) ``` ### 2. Text Preprocessing ```python import re from typing import List def preprocess_text(text: str) -> str: """Clean and prepare text for embedding.""" # Remove excessive whitespace text = re.sub(r'\s+', ' ', text) # Remove special characters if needed text = re.sub(r'[^\w\s\-\.]', '', text) # Normalize case (optional, depends on model) # text = text.lower() # Remove very short texts if len(text.strip()) < 3: return "" return text.strip() def batch_preprocess(texts: List[str]) -> List[str]: """Preprocess batch of texts.""" processed = [] for text in texts: cleaned = preprocess_text(text) if cleaned: # Only add non-empty texts processed.append(cleaned) return processed ``` ### 3. 
Caching and Performance ```python import hashlib from typing import Dict import pickle class EmbeddingCache: """Simple LRU cache for embeddings.""" def __init__(self, max_size: int = 1000): self.cache: Dict[str, np.ndarray] = {} self.access_order = [] self.max_size = max_size def get_key(self, text: str, model: str) -> str: """Generate cache key.""" content = f"{text}_{model}" return hashlib.md5(content.encode()).hexdigest() def get(self, text: str, model: str) -> Optional[np.ndarray]: """Get cached embedding.""" key = self.get_key(text, model) if key in self.cache: # Update access order self.access_order.remove(key) self.access_order.append(key) return self.cache[key] return None def set(self, text: str, model: str, embedding: np.ndarray): """Cache embedding.""" key = self.get_key(text, model) # Remove oldest if at capacity if len(self.cache) >= self.max_size and key not in self.cache: oldest_key = self.access_order.pop(0) del self.cache[oldest_key] self.cache[key] = embedding if key not in self.access_order: self.access_order.append(key) # Add to chute initialization @chute.on_startup() async def initialize_with_cache(self): # ... existing initialization ... 
self.embedding_cache = EmbeddingCache(max_size=2000) async def get_embeddings_cached(self, texts: Union[str, List[str]], normalize: bool = True) -> np.ndarray: """Get embeddings with caching.""" if isinstance(texts, str): texts = [texts] cached_embeddings = [] uncached_texts = [] uncached_indices = [] # Check cache for i, text in enumerate(texts): cached = self.embedding_cache.get(text, self.model_name) if cached is not None: cached_embeddings.append((i, cached)) else: uncached_texts.append(text) uncached_indices.append(i) # Generate uncached embeddings if uncached_texts: new_embeddings = await get_embeddings(self, uncached_texts, normalize) # Cache new embeddings for text, embedding in zip(uncached_texts, new_embeddings): self.embedding_cache.set(text, self.model_name, embedding) # Combine cached and new embeddings all_embeddings = [None] * len(texts) # Place cached embeddings for orig_idx, embedding in cached_embeddings: all_embeddings[orig_idx] = embedding # Place new embeddings for new_idx, orig_idx in enumerate(uncached_indices): all_embeddings[orig_idx] = new_embeddings[new_idx] return self.numpy.array(all_embeddings) else: # All cached return self.numpy.array([emb for _, emb in sorted(cached_embeddings)]) ``` ### 4. Error Handling and Monitoring ```python import time from loguru import logger @chute.cord(public_api_path="/robust_embed", method="POST") async def robust_embeddings(self, args: EmbeddingInput) -> Dict: """Embeddings with comprehensive error handling.""" start_time = time.time() try: # Validate input if isinstance(args.inputs, list) and len(args.inputs) > 1000: raise HTTPException( status_code=400, detail="Batch size too large. Maximum 1000 texts allowed." 
) # Preprocess texts if isinstance(args.inputs, str): processed_texts = preprocess_text(args.inputs) if not processed_texts: raise HTTPException(status_code=400, detail="Text too short after preprocessing") else: processed_texts = batch_preprocess(args.inputs) if not processed_texts: raise HTTPException(status_code=400, detail="No valid texts after preprocessing") # Generate embeddings with retry logic max_retries = 3 for attempt in range(max_retries): try: embeddings = await get_embeddings_cached(self, processed_texts, args.normalize) break except Exception as e: if attempt == max_retries - 1: raise e logger.warning(f"Embedding attempt {attempt + 1} failed: {e}") time.sleep(1) generation_time = time.time() - start_time logger.info(f"Embedding generation completed in {generation_time:.2f}s") # Return results embeddings_list = embeddings.tolist() return { "embeddings": embeddings_list if isinstance(args.inputs, list) else embeddings_list[0], "model": self.model_name, "dimension": len(embeddings_list[0]), "generation_time": generation_time, "processed_count": len(processed_texts) } except HTTPException: raise except Exception as e: error_time = time.time() - start_time logger.error(f"Embedding generation failed after {error_time:.2f}s: {e}") raise HTTPException( status_code=500, detail=f"Embedding generation failed: {str(e)}" ) ``` ## Performance Optimization ### Batch Size Tuning ```python def get_optimal_batch_size(text_lengths: List[int], max_tokens: int = 16384) -> int: """Calculate optimal batch size based on text lengths.""" # Estimate tokens (rough approximation: 1 token ≈ 4 characters) estimated_tokens = [length // 4 for length in text_lengths] # Calculate how many texts can fit in max_tokens cumulative_tokens = 0 optimal_batch = 0 for tokens in estimated_tokens: if cumulative_tokens + tokens <= max_tokens: cumulative_tokens += tokens optimal_batch += 1 else: break return max(1, optimal_batch) ``` ### Memory Management ```python async def 
memory_efficient_embeddings(self, texts: List[str], max_batch_size: int = 32) -> np.ndarray: """Generate embeddings with memory management.""" all_embeddings = [] for i in range(0, len(texts), max_batch_size): batch = texts[i:i + max_batch_size] # Clear cache before each batch if hasattr(self, 'torch'): self.torch.cuda.empty_cache() batch_embeddings = await get_embeddings(self, batch, normalize=True) all_embeddings.extend(batch_embeddings) # Optional: yield progress if (i + max_batch_size) % 100 == 0: logger.info(f"Processed {min(i + max_batch_size, len(texts))}/{len(texts)} texts") return self.numpy.array(all_embeddings) ``` ## Next Steps - **Fine-tuning**: Train custom embedding models on domain-specific data - **Advanced Search**: Implement hybrid search (dense + sparse) - **Real-time Updates**: Build dynamic vector databases - **Multimodal**: Extend to image and audio embeddings For more advanced examples, see: - [Custom Training](/docs/examples/custom-training) - [Vector Databases](/docs/examples/vector-databases) - [RAG Applications](/docs/examples/rag-applications) --- ## SOURCE: https://chutes.ai/docs/examples/image-generation # Image Generation with Diffusion Models This guide demonstrates how to build powerful image generation services using state-of-the-art diffusion models like FLUX.1. You'll learn to create a complete image generation API with custom parameters, validation, and optimization. 
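As a preview of where this guide is headed, the sketch below builds a request payload whose fields mirror the `GenerationInput` schema defined in the next section. The helper function and its client-side range checks are illustrative only, not part of the Chutes SDK:

```python
from typing import Optional

def build_generate_payload(
    prompt: str,
    width: int = 1024,
    height: int = 1024,
    num_inference_steps: int = 10,
    guidance_scale: float = 7.5,
    seed: Optional[int] = None,
) -> dict:
    """Assemble a /generate payload matching the GenerationInput schema (illustrative helper)."""
    # Mirror the server-side Pydantic bounds so bad requests fail fast on the client
    if not (128 <= width <= 2048 and 128 <= height <= 2048):
        raise ValueError("width/height must be in [128, 2048]")
    payload = {
        "prompt": prompt,
        "width": width,
        "height": height,
        "num_inference_steps": num_inference_steps,
        "guidance_scale": guidance_scale,
    }
    if seed is not None:
        payload["seed"] = seed  # Only include the seed when reproducibility is wanted
    return payload

payload = build_generate_payload("a snowy mountain at dawn", seed=42)
```

POSTing a payload like this to your deployment's `/generate` endpoint returns raw JPEG bytes, as the curl and Python client examples later in this guide show.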
## Overview

The Chutes platform makes it easy to deploy advanced image generation models:

- **FLUX.1 [dev]**: 12 billion parameter rectified flow transformer
- **Stable Diffusion**: Various versions and fine-tuned models
- **Custom Models**: Support for any diffusion architecture
- **GPU Optimization**: Automatic scaling and memory management

## Complete FLUX.1 Implementation

### Input Schema Design

First, define comprehensive input validation using Pydantic:

```python
from pydantic import BaseModel, Field
from typing import Optional

class GenerationInput(BaseModel):
    prompt: str
    height: int = Field(default=1024, ge=128, le=2048)
    width: int = Field(default=1024, ge=128, le=2048)
    num_inference_steps: int = Field(default=10, ge=1, le=30)
    guidance_scale: float = Field(default=7.5, ge=1.0, le=20.0)
    seed: Optional[int] = Field(default=None, ge=0, le=2**32 - 1)

# Simplified input for basic usage
class MinifiedGenerationInput(BaseModel):
    prompt: str = "a beautiful mountain landscape"
```

### Custom Image Configuration

Create a pre-built image with the FLUX.1 model:

```python
from chutes.image import Image

# Create a markdown readme from the model documentation
readme = """`FLUX.1 [dev]` is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions.

# Key Features
1. Cutting-edge output quality, second only to our state-of-the-art model `FLUX.1 [pro]`.
2. Competitive prompt following, matching the performance of closed source alternatives.
3. Trained using guidance distillation, making `FLUX.1 [dev]` more efficient.
4. Open weights to drive new scientific research, and empower artists to develop innovative workflows.
5. Generated outputs can be used for personal, scientific, and commercial purposes.
"""

# Use the pre-built image with the FLUX.1 model
image = (
    Image(username="myuser", name="flux.1-dev", tag="0.0.2", readme=readme)
    .from_base("parachutes/flux.1-dev:latest")
)
```

### Chute Configuration

Set up the service with appropriate hardware requirements:

```python
from chutes.chute import Chute, NodeSelector

chute = Chute(
    username="myuser",
    name="FLUX.1-dev-generator",
    readme=readme,
    image=image,
    # This model is quite large, so we'll require GPUs with at least 80GB VRAM to run it.
    node_selector=NodeSelector(
        gpu_count=1,
        min_vram_gb_per_gpu=80,  # 80GB for optimal performance
    ),
    # Limit to one request at a time.
    concurrency=1,
)
```

### Model Initialization

Initialize the diffusion pipeline on startup:

```python
@chute.on_startup()
async def initialize_pipeline(self):
    """
    Initialize the pipeline, downloading the model if necessary.

    This code never runs on your machine directly; it runs on the GPU nodes powering Chutes.
    """
    import torch
    from diffusers import FluxPipeline

    self.torch = torch
    torch.cuda.empty_cache()
    torch.cuda.init()
    torch.cuda.set_device(0)
    self.pipeline = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev",
        torch_dtype=torch.bfloat16,
        local_files_only=True,
        cache_dir="/home/chutes/.cache/huggingface/hub",
    ).to("cuda")
```

### Generation Endpoint

Create the main image generation endpoint:

```python
import uuid
from io import BytesIO
from fastapi import Response

@chute.cord(
    # Expose this function via the subdomain-based chutes.ai HTTP invocation, e.g.
    # this becomes https://{username}-{chute slug}.chutes.ai/generate
    public_api_path="/generate",
    # The function is invoked in the subdomain-based system via POSTs.
    method="POST",
    # Input/minimal input schemas.
    input_schema=GenerationInput,
    minimal_input_schema=MinifiedGenerationInput,
    # Set the output content type header to image/jpeg so we can return the raw image.
    output_content_type="image/jpeg",
)
async def generate(self, params: GenerationInput) -> Response:
    """Generate an image."""
    generator = None
    if params.seed is not None:
        generator = self.torch.Generator(device="cuda").manual_seed(params.seed)
    with self.torch.inference_mode():
        result = self.pipeline(
            prompt=params.prompt,
            height=params.height,
            width=params.width,
            num_inference_steps=params.num_inference_steps,
            guidance_scale=params.guidance_scale,
            max_sequence_length=256,
            generator=generator,
        )
    image = result.images[0]
    buffer = BytesIO()
    image.save(buffer, format="JPEG", quality=85)
    buffer.seek(0)
    return Response(
        content=buffer.getvalue(),
        media_type="image/jpeg",
        headers={"Content-Disposition": f'attachment; filename="{uuid.uuid4()}.jpg"'},
    )
```

## Alternative: Stable Diffusion Setup

For a more customizable approach using Stable Diffusion:

```python
from chutes.image import Image
from chutes.chute import Chute, NodeSelector

# Build a custom Stable Diffusion image
image = (
    Image(username="myuser", name="stable-diffusion", tag="2.1")
    .from_base("nvidia/cuda:12.4.1-runtime-ubuntu22.04")
    .with_python("3.11")
    .run_command("apt update && apt install -y python3 python3-pip git")
    .run_command("pip3 install torch>=2.4.0 torchvision --index-url https://download.pytorch.org/whl/cu124")
    .run_command("pip3 install diffusers>=0.29.0 transformers>=4.44.0 accelerate>=0.33.0")
    .run_command("pip3 install fastapi uvicorn pydantic pillow")
    .set_workdir("/app")
)

chute = Chute(
    username="myuser",
    name="stable-diffusion-xl",
    image=image,
    node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=24),
    concurrency=2,
)

@chute.on_startup()
async def load_sd_pipeline(self):
    """Load the Stable Diffusion XL pipeline."""
    from diffusers import StableDiffusionXLPipeline
    import torch

    self.pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16,
        use_safetensors=True,
    ).to("cuda")
    # Enable attention slicing to reduce peak memory usage
    self.pipe.enable_attention_slicing()

@chute.cord(public_api_path="/sdxl", method="POST")
async def generate_sdxl(self, prompt: str, width: int = 1024, height: int = 1024):
    """Generate images with Stable Diffusion XL."""
    images = self.pipe(prompt, width=width, height=height, num_inference_steps=20).images

    # Return the first image as base64
    import base64
    from io import BytesIO

    buffer = BytesIO()
    images[0].save(buffer, format="PNG")
    return {"image": base64.b64encode(buffer.getvalue()).decode()}
```

## Advanced Features

### Batch Generation

Generate multiple images in a single request:

```python
import base64
from typing import List

class BatchGenerationInput(BaseModel):
    prompts: List[str] = Field(max_items=4)  # Limit batch size
    width: int = Field(default=1024, ge=512, le=2048)
    height: int = Field(default=1024, ge=512, le=2048)
    num_inference_steps: int = Field(default=20, ge=10, le=50)

@chute.cord(public_api_path="/batch", method="POST")
async def generate_batch(self, params: BatchGenerationInput) -> List[str]:
    """Generate multiple images from prompts."""
    results = []
    for prompt in params.prompts:
        with self.torch.inference_mode():
            result = self.pipeline(
                prompt=prompt,
                width=params.width,
                height=params.height,
                num_inference_steps=params.num_inference_steps,
            )

        # Convert to base64
        buffer = BytesIO()
        result.images[0].save(buffer, format="JPEG", quality=90)
        b64_image = base64.b64encode(buffer.getvalue()).decode()
        results.append(b64_image)

    return results
```

### Image-to-Image Generation

Transform existing images with text prompts:

```python
import base64
from PIL import Image as PILImage

class Img2ImgInput(BaseModel):
    prompt: str
    image_b64: str  # Base64-encoded input image
    strength: float = Field(default=0.75, ge=0.1, le=1.0)
    guidance_scale: float = Field(default=7.5, ge=1.0, le=20.0)

@chute.cord(public_api_path="/img2img", method="POST")
async def image_to_image(self, params: Img2ImgInput) -> Response:
    """Transform images with text prompts."""
    # Decode the input image
    image_data = base64.b64decode(params.image_b64)
    init_image = PILImage.open(BytesIO(image_data)).convert("RGB")

    # Generate the transformed image
    with self.torch.inference_mode():
        result = self.pipeline(
            prompt=params.prompt,
            image=init_image,
            strength=params.strength,
            guidance_scale=params.guidance_scale,
        )

    # Return as JPEG
    buffer = BytesIO()
    result.images[0].save(buffer, format="JPEG", quality=85)
    buffer.seek(0)
    return Response(content=buffer.getvalue(), media_type="image/jpeg")
```

### Inpainting Support

Fill or edit specific regions of images:

```python
import torch

class InpaintInput(BaseModel):
    prompt: str
    image_b64: str  # Original image
    mask_b64: str   # Mask (white = inpaint, black = keep)
    strength: float = Field(default=0.75, ge=0.1, le=1.0)

@chute.on_startup()
async def load_inpaint_pipeline(self):
    """Load an inpainting-specific pipeline."""
    from diffusers import StableDiffusionInpaintPipeline

    self.inpaint_pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting",
        torch_dtype=torch.float16,
    ).to("cuda")

@chute.cord(public_api_path="/inpaint", method="POST")
async def inpaint(self, params: InpaintInput) -> Response:
    """Inpaint regions of images."""
    # Decode the images
    image_data = base64.b64decode(params.image_b64)
    mask_data = base64.b64decode(params.mask_b64)
    image = PILImage.open(BytesIO(image_data)).convert("RGB")
    mask = PILImage.open(BytesIO(mask_data)).convert("L")

    # Generate the inpainted result
    result = self.inpaint_pipe(
        prompt=params.prompt,
        image=image,
        mask_image=mask,
        strength=params.strength,
    )

    # Return the result
    buffer = BytesIO()
    result.images[0].save(buffer, format="PNG")
    buffer.seek(0)
    return Response(content=buffer.getvalue(), media_type="image/png")
```

## Deployment and Usage

### Deploy Your Service

```bash
# Build and deploy the image generation service
chutes deploy my_image_gen:chute

# Monitor deployment status
chutes chutes get my-image-gen
```

### Using the API

#### Basic Generation

```bash
curl -X POST "https://myuser-my-image-gen.chutes.ai/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a majestic dragon flying over a crystal lake at sunset",
    "width": 1024,
    "height": 1024,
    "num_inference_steps": 20,
    "guidance_scale": 7.5,
    "seed": 42
  }' \
  --output generated_image.jpg
```

#### Python Client

```python
import requests
from io import BytesIO
from PIL import Image

def generate_image(prompt, **kwargs):
    """Generate an image using your Chutes service."""
    url = "https://myuser-my-image-gen.chutes.ai/generate"
    payload = {"prompt": prompt, **kwargs}

    response = requests.post(url, json=payload)
    if response.status_code == 200:
        # Save the image
        with open("generated.jpg", "wb") as f:
            f.write(response.content)

        # Or display in Jupyter
        image = Image.open(BytesIO(response.content))
        return image

    print(f"Error: {response.status_code}")
    return None

# Generate an image
image = generate_image(
    "a cyberpunk cityscape with neon lights and flying cars",
    width=1920,
    height=1080,
    num_inference_steps=25,
    seed=123,
)
```

## Performance Optimization

### Memory Management

```python
# Enable xFormers memory-efficient attention (requires the xformers package)
self.pipeline.enable_xformers_memory_efficient_attention()

# Use attention slicing for large images
self.pipeline.enable_attention_slicing()

# Enable CPU offloading for very large models
self.pipeline.enable_model_cpu_offload()
```

### Speed Optimizations

```python
# Compile the UNet for faster inference
self.pipeline.unet = torch.compile(self.pipeline.unet, mode="reduce-overhead")

# Use faster schedulers
from diffusers import DPMSolverMultistepScheduler
self.pipeline.scheduler = DPMSolverMultistepScheduler.from_config(
    self.pipeline.scheduler.config
)
```

### Hardware Scaling

```python
# Scale up for higher throughput
node_selector = NodeSelector(
    gpu_count=2,  # Multi-GPU setup
    min_vram_gb_per_gpu=40,
)

# Or scale out with multiple instances
chute = Chute(
    # ... configuration ...
    concurrency=4,  # Handle more concurrent requests
)
```

## Best Practices

### 1. Prompt Engineering

```python
# Good prompts are specific and detailed
good_prompt = """
a photorealistic portrait of a wise old wizard with a long white beard,
wearing a starry blue robe, holding a glowing crystal staff,
in a mystical forest clearing with soft golden sunlight filtering through trees,
highly detailed, 8k resolution, fantasy art style
"""

# Add negative prompts to avoid unwanted elements
negative_prompt = """
blurry, low quality, deformed, ugly, bad anatomy,
watermark, signature, text, cropped
"""
```

### 2. Parameter Tuning

```python
# High-quality settings
high_quality_params = {
    "num_inference_steps": 50,
    "guidance_scale": 7.5,
    "width": 1024,
    "height": 1024,
}

# Fast generation settings
fast_params = {
    "num_inference_steps": 15,
    "guidance_scale": 5.0,
    "width": 512,
    "height": 512,
}
```

### 3. Error Handling

```python
@chute.cord(public_api_path="/generate", method="POST")
async def generate_with_fallback(self, params: GenerationInput) -> Response:
    """Generate with proper error handling."""
    try:
        # Try high-quality generation first
        result = self.pipeline(
            prompt=params.prompt,
            width=params.width,
            height=params.height,
            num_inference_steps=params.num_inference_steps,
        )
    except torch.cuda.OutOfMemoryError:
        # Fall back to a lower resolution
        logger.warning("OOM error, reducing resolution")
        result = self.pipeline(
            prompt=params.prompt,
            width=params.width // 2,
            height=params.height // 2,
            num_inference_steps=params.num_inference_steps // 2,
        )
    except Exception as e:
        logger.error(f"Generation failed: {e}")
        raise HTTPException(status_code=500, detail="Generation failed")

    # Return the image...
```

## Monitoring and Scaling

### Resource Monitoring

```bash
# Check GPU utilization
chutes chutes metrics my-image-gen

# View generation logs
chutes chutes logs my-image-gen --tail 100

# Monitor request patterns
chutes chutes status my-image-gen
```

### Auto-scaling Configuration

```python
# Configure auto-scaling based on queue length
chute = Chute(
    # ... other config ...
    concurrency=2,           # Base concurrency
    max_replicas=5,          # Scale up to 5 instances
    scale_up_threshold=10,   # Scale when queue > 10
    scale_down_delay=300,    # Wait 5 min before scaling down
)
```

## Next Steps

- **Advanced Models**: Experiment with ControlNet, LoRA fine-tuning
- **Custom Training**: Train models on your own datasets
- **Integration**: Build web interfaces and mobile apps
- **Optimization**: Implement caching and CDN distribution

For more advanced examples, see:

- [Video Generation](/docs/examples/video-generation)
- [Custom Images](/docs/examples/custom-images)
- [Streaming Responses](/docs/examples/streaming-responses)

---

## SOURCE: https://chutes.ai/docs/examples/llm-chat

# LLM Chat Applications

This guide shows how to build powerful chat applications using Large Language Models (LLMs) with Chutes. We'll cover both high-performance VLLM serving and flexible SGLang implementations.

## Overview

Chutes provides pre-built templates for popular LLM serving frameworks:

- **VLLM**: High-performance serving with OpenAI-compatible APIs
- **SGLang**: Advanced serving with structured generation capabilities

Both frameworks support:

- Multi-GPU scaling for large models
- OpenAI-compatible endpoints
- Streaming responses
- Custom model configurations

## Quick Start: VLLM Chat Service

### Basic VLLM Setup

```python
from chutes.chute import NodeSelector
from chutes.chute.template.vllm import build_vllm_chute

# Create a high-performance chat service
chute = build_vllm_chute(
    username="myuser",
    readme="## Meta Llama 3.2 1B Instruct\n### Hello.",
    model_name="unsloth/Llama-3.2-1B-Instruct",
    node_selector=NodeSelector(gpu_count=1),
    concurrency=4,
)
```

### Production VLLM Configuration

For production workloads with larger models:

```python
from chutes.chute import NodeSelector
from chutes.chute.template.vllm import build_vllm_chute
from chutes.image import Image

image = (
    Image(
        username="chutes",
        name="vllm_gemma",
        tag="0.8.1",
        readme="## vLLM - fast, flexible llm inference",
    )
    .from_base("parachutes/base-python:3.12.9")
    .run_command(
        "pip install --no-cache wheel packaging git+https://github.com/huggingface/transformers.git qwen-vl-utils[decord]==0.0.8"
    )
    .run_command("pip install --upgrade vllm==0.8.1")
    .run_command("pip install --no-cache flash-attn")
    .add("gemma_chat_template.jinja", "/app/gemma_chat_template.jinja")
)

chute = build_vllm_chute(
    username="chutes",
    readme="Gemma 3 1B IT",
    model_name="unsloth/gemma-3-1b-it",
    image=image,
    node_selector=NodeSelector(
        gpu_count=8,
        min_vram_gb_per_gpu=48,
    ),
    concurrency=8,
    engine_args=dict(
        revision="284477f075e7d8bfa2c7e2e0131c3fe4055baa7f",
        num_scheduler_steps=8,
        enforce_eager=False,
        max_num_seqs=8,
        tool_call_parser="pythonic",
        enable_auto_tool_choice=True,
        chat_template="/app/gemma_chat_template.jinja",
    ),
)
```

## Advanced: SGLang with Custom Image

For more control and advanced features, use SGLang with a custom image:

```python
import os

from chutes.chute import NodeSelector
from chutes.chute.template.sglang import build_sglang_chute
from chutes.image import Image

# Optimize networking for multi-GPU setups
os.environ["NO_PROXY"] = "localhost,127.0.0.1"
for key in ["NCCL_P2P_DISABLE", "NCCL_IB_DISABLE", "NCCL_NET_GDR_LEVEL"]:
    if key in os.environ:
        del os.environ[key]

# Build a custom SGLang image with optimizations
image = (
    Image(
        username="myuser",
        name="sglang-optimized",
        tag="0.4.9.dev1",
        readme="SGLang with performance optimizations for large models",
    )
    .from_base("parachutes/python:3.12")
    .run_command("pip install --upgrade pip")
    .run_command("pip install --upgrade 'sglang[all]'")
    .run_command(
        "git clone https://github.com/sgl-project/sglang sglang_src && "
        "cd sglang_src && pip install -e python[all]"
    )
    .run_command(
        "pip install torch torchvision torchaudio "
        "--index-url https://download.pytorch.org/whl/cu128 --upgrade"
    )
    .run_command("pip install datasets blobfile accelerate tiktoken")
    .run_command("pip install nvidia-nccl-cu12==2.27.6 --force-reinstall --no-deps")
    .with_env("SGL_ENABLE_JIT_DEEPGEMM", "1")
)

# Deploy the Kimi K2 Instruct model
chute = build_sglang_chute(
    username="myuser",
    readme="Moonshot AI Kimi K2 Instruct - Advanced reasoning model",
    model_name="moonshotai/Kimi-K2-Instruct",
    image=image,
    concurrency=3,
    node_selector=NodeSelector(
        gpu_count=8,
        include=["h200"],  # Use latest H200 GPUs
    ),
    engine_args=(
        "--trust-remote-code "
        "--cuda-graph-max-bs 3 "
        "--mem-fraction-static 0.97 "
        "--context-length 65536 "
        "--revision d1e2b193ddeae7776463443e7a9aa3c3cdc51003 "
    ),
)
```

## Reasoning Models: DeepSeek R1

For advanced reasoning capabilities:

```python
from chutes.chute import NodeSelector
from chutes.chute.template.sglang import build_sglang_chute

# Deploy the DeepSeek R1 reasoning model
chute = build_sglang_chute(
    username="myuser",
    readme="DeepSeek R1 - Advanced reasoning and problem-solving model",
    model_name="deepseek-ai/DeepSeek-R1",
    image="chutes/sglang:0.4.6.post5b",
    concurrency=24,
    node_selector=NodeSelector(
        gpu_count=8,
        min_vram_gb_per_gpu=140,  # Large memory requirement
        include=["h200"],
    ),
    engine_args=(
        "--trust-remote-code "
        "--revision f7361cd9ff99396dbf6bd644ad846015e59ed4fc"
    ),
)
```

## Using Your Chat Service

### Deploy the Service

```bash
# Build and deploy your chat service
chutes deploy my_chat:chute

# Monitor deployment
chutes chutes get my-chat
```

### OpenAI-Compatible API

Both VLLM and SGLang provide OpenAI-compatible endpoints:

```bash
# Chat completions endpoint
curl -X POST "https://myuser-my-chat.chutes.ai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [
      {"role": "user", "content": "Hello! How are you?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```

### Streaming Responses

Enable real-time streaming for a better user experience:

```bash
curl -X POST "https://myuser-my-chat.chutes.ai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [
      {"role": "user", "content": "Write a short story about AI"}
    ],
    "stream": true,
    "max_tokens": 500
  }'
```

### Python Client Example

```python
import openai

# Configure the client to use your Chutes deployment
client = openai.OpenAI(
    base_url="https://myuser-my-chat.chutes.ai/v1",
    api_key="your-api-key",  # Or use an environment variable
)

# Chat completion
response = client.chat.completions.create(
    model="microsoft/DialoGPT-medium",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."},
    ],
    max_tokens=200,
    temperature=0.7,
)
print(response.choices[0].message.content)

# Streaming chat
stream = client.chat.completions.create(
    model="microsoft/DialoGPT-medium",
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
```

## Performance Optimization

### GPU Selection

Choose appropriate hardware for your model size:

```python
# For smaller models (7B-13B parameters)
node_selector = NodeSelector(
    gpu_count=1,
    min_vram_gb_per_gpu=24,
)

# For medium models (30B-70B parameters)
node_selector = NodeSelector(
    gpu_count=4,
    min_vram_gb_per_gpu=80,
)

# For large models (100B+ parameters)
node_selector = NodeSelector(
    gpu_count=8,
    min_vram_gb_per_gpu=140,
    include=["h200"],  # Use latest hardware
)
```

### Engine Optimization

Tune engine parameters for best performance:

```python
# VLLM optimizations
engine_args = dict(
    gpu_memory_utilization=0.97,  # Use most GPU memory
    max_model_len=32768,  # Context length
    max_num_seqs=16,  # Batch size
    trust_remote_code=True,  # 
Enable custom models enforce_eager=False, # Use CUDA graphs disable_log_requests=True, # Reduce logging overhead ) # SGLang optimizations engine_args = ( "--trust-remote-code " "--cuda-graph-max-bs 8 " # CUDA graph batch size "--mem-fraction-static 0.95 " # Memory allocation "--context-length 32768 " # Context window ) ``` ### Concurrency Settings Balance throughput and resource usage: ```python # High throughput setup chute = build_vllm_chute( # ... other parameters concurrency=16, # Handle many concurrent requests engine_args=dict( max_num_seqs=32, # Large batch size gpu_memory_utilization=0.90) ) # Low latency setup chute = build_vllm_chute( # ... other parameters concurrency=4, # Fewer concurrent requests engine_args=dict( max_num_seqs=8, # Smaller batch size gpu_memory_utilization=0.95) ) ``` ## Monitoring and Troubleshooting ### Check Service Status ```bash # View service health chutes chutes get my-chat # View recent logs chutes chutes logs my-chat # Monitor resource usage chutes chutes metrics my-chat ``` ### Common Issues **Out of Memory (OOM)** ```python # Reduce memory usage engine_args = dict( gpu_memory_utilization=0.85, # Lower memory usage max_model_len=16384, # Shorter context max_num_seqs=4, # Smaller batch ) ``` **Slow Response Times** ```python # Optimize for speed engine_args = dict( enforce_eager=False, # Enable CUDA graphs disable_log_requests=True, # Reduce logging quantization="awq", # Use quantization ) ``` **Connection Timeouts** ```python # Increase capacity so queued requests are served before they time out chute = build_vllm_chute( # ... other parameters concurrency=8, # Increase concurrent capacity engine_args=dict( max_num_seqs=16, # Larger batches ) ) ``` ## Best Practices ### 1. Model Selection - **For general chat**: Mistral, Llama, or Qwen models - **For reasoning**: DeepSeek R1, GPT-4 style models - **For coding**: CodeLlama, DeepSeek Coder - **For multilingual**: Qwen, multilingual Mistral variants ### 2. 
Resource Planning - Start with smaller configurations and scale up - Monitor GPU utilization and adjust concurrency - Use appropriate GPU types for your model size - Consider cost vs. performance trade-offs ### 3. Development Workflow ```bash # 1. Deploy and test with a small model first chutes deploy test-chat:chute --wait # 2. Validate API endpoints curl https://myuser-test-chat.chutes.ai/v1/models # 3. Load test with production model chutes deploy prod-chat:chute --wait # 4. Monitor and optimize chutes chutes metrics prod-chat ``` ### 4. Security Considerations - Use API keys for authentication - Implement rate limiting if needed - Monitor usage and costs - Keep model revisions pinned for reproducibility ## Next Steps - **Advanced Features**: Explore function calling and tool use - **Custom Templates**: Build specialized chat applications - **Integration**: Connect with web frontends and mobile apps - **Scaling**: Implement load balancing across multiple deployments For more examples, see: - [Streaming Responses](/docs/examples/streaming-responses) - [Custom Images](/docs/examples/custom-images) - [Templates Documentation](/docs/templates/) --- ## SOURCE: https://chutes.ai/docs/examples/multi-model-analysis # Multi-Model Analysis with Chutes This guide demonstrates how to build sophisticated analysis systems that combine multiple AI models to provide comprehensive insights from text, images, audio, and other data types. 
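Before diving into the full pipeline, the core execution pattern (fan out to several models concurrently, degrade gracefully when one fails) can be sketched with just the standard library. The model functions below are hypothetical stand-ins, not part of the Chutes SDK:

```python
import asyncio

# Hypothetical stand-ins for real model services; each returns a result dict.
async def sentiment(text: str) -> dict:
    return {"model": "sentiment", "label": "positive", "confidence": 0.85}

async def entities(text: str) -> dict:
    raise RuntimeError("NER service unavailable")  # simulate one model failing

async def analyze(text: str) -> list[dict]:
    # return_exceptions=True lets healthy models finish even when one raises,
    # which is the basis of the graceful-degradation pattern in this guide.
    names = ["sentiment", "entities"]
    results = await asyncio.gather(sentiment(text), entities(text),
                                   return_exceptions=True)
    report = []
    for name, result in zip(names, results):
        if isinstance(result, Exception):
            report.append({"model": name, "status": "error", "error": str(result)})
        else:
            report.append({**result, "status": "success"})
    return report

report = asyncio.run(analyze("Chutes makes multi-model analysis simple"))
print(report)
```

Because `asyncio.gather(..., return_exceptions=True)` returns the exception object in place of a result, one failing model yields an error entry instead of aborting the whole analysis; the `MultiModelAnalyzer` in this guide applies the same idea at scale.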
## Overview Multi-model analysis enables: - **Comprehensive Understanding**: Combine different AI models for deeper insights - **Cross-Modal Analysis**: Analyze relationships between text, images, and audio - **Ensemble Predictions**: Improve accuracy by combining multiple model outputs - **Specialized Processing**: Use domain-specific models for different aspects - **Robust Error Handling**: Graceful degradation when individual models fail ## Architecture Patterns ### Sequential Processing Pipeline ```python from pydantic import BaseModel, Field from typing import List, Dict, Any, Optional, Union import asyncio from dataclasses import dataclass import logging import time @dataclass class ModelResult: model_name: str result: Dict[str, Any] confidence: float processing_time_ms: float status: str = "success" error: Optional[str] = None class MultiModelRequest(BaseModel): text: Optional[str] = None image_base64: Optional[str] = None audio_base64: Optional[str] = None analysis_types: List[str] = Field(default=["sentiment", "entities", "classification"]) combine_results: bool = True confidence_threshold: float = 0.5 class MultiModelResponse(BaseModel): individual_results: List[ModelResult] combined_analysis: Optional[Dict[str, Any]] = None overall_confidence: float total_processing_time_ms: float metadata: Dict[str, Any] = Field(default_factory=dict) class MultiModelAnalyzer: def __init__(self): self.models = {} self.logger = logging.getLogger(__name__) # Initialize individual model services self._initialize_models() def _initialize_models(self): """Initialize all AI model services""" # Text analysis models self.models["sentiment"] = SentimentAnalyzer() self.models["entities"] = EntityExtractor() self.models["classification"] = TextClassifier() self.models["summarization"] = TextSummarizer() # Image analysis models self.models["image_classification"] = ImageClassifier() self.models["object_detection"] = ObjectDetector() self.models["ocr"] = OpticalCharacterRecognition() 
# Audio analysis models self.models["speech_recognition"] = SpeechRecognizer() self.models["audio_classification"] = AudioClassifier() # Cross-modal models self.models["image_captioning"] = ImageCaptioner() self.models["visual_qa"] = VisualQuestionAnswering() async def analyze(self, request: MultiModelRequest) -> MultiModelResponse: """Perform multi-model analysis""" start_time = time.time() results = [] # Determine which models to run based on available inputs models_to_run = self._select_models(request) # Run models in parallel where possible tasks = [] for model_name in models_to_run: task = self._run_model_safe(model_name, request) tasks.append(task) # Execute all tasks model_results = await asyncio.gather(*tasks, return_exceptions=True) # Process results for model_name, result in zip(models_to_run, model_results): if isinstance(result, Exception): results.append(ModelResult( model_name=model_name, result={}, confidence=0.0, processing_time_ms=0.0, status="error", error=str(result) )) else: results.append(result) # Combine results if requested combined_analysis = None if request.combine_results: combined_analysis = self._combine_results(results, request) # Calculate overall metrics successful_results = [r for r in results if r.status == "success"] overall_confidence = ( sum(r.confidence for r in successful_results) / len(successful_results) if successful_results else 0.0 ) total_time = (time.time() - start_time) * 1000 return MultiModelResponse( individual_results=results, combined_analysis=combined_analysis, overall_confidence=overall_confidence, total_processing_time_ms=total_time, metadata={ "models_run": len(models_to_run), "successful_models": len(successful_results), "failed_models": len(results) - len(successful_results) } ) def _select_models(self, request: MultiModelRequest) -> List[str]: """Select which models to run based on available inputs and analysis types""" models_to_run = [] # Text-based models if request.text: if "sentiment" in 
request.analysis_types: models_to_run.append("sentiment") if "entities" in request.analysis_types: models_to_run.append("entities") if "classification" in request.analysis_types: models_to_run.append("classification") if "summarization" in request.analysis_types: models_to_run.append("summarization") # Image-based models if request.image_base64: if "image_classification" in request.analysis_types: models_to_run.append("image_classification") if "object_detection" in request.analysis_types: models_to_run.append("object_detection") if "ocr" in request.analysis_types: models_to_run.append("ocr") if "image_captioning" in request.analysis_types: models_to_run.append("image_captioning") # Audio-based models if request.audio_base64: if "speech_recognition" in request.analysis_types: models_to_run.append("speech_recognition") if "audio_classification" in request.analysis_types: models_to_run.append("audio_classification") # Cross-modal models if request.text and request.image_base64: if "visual_qa" in request.analysis_types: models_to_run.append("visual_qa") return models_to_run async def _run_model_safe(self, model_name: str, request: MultiModelRequest) -> ModelResult: """Safely run a model with error handling""" start_time = time.time() try: model = self.models[model_name] result = await self._execute_model(model, model_name, request) processing_time = (time.time() - start_time) * 1000 return ModelResult( model_name=model_name, result=result["output"], confidence=result.get("confidence", 0.5), processing_time_ms=processing_time ) except Exception as e: self.logger.error(f"Model {model_name} failed: {e}") processing_time = (time.time() - start_time) * 1000 return ModelResult( model_name=model_name, result={}, confidence=0.0, processing_time_ms=processing_time, status="error", error=str(e) ) async def _execute_model(self, model, model_name: str, request: MultiModelRequest) -> Dict[str, Any]: """Execute a specific model based on its type""" if model_name in ["sentiment", 
"entities", "classification", "summarization"]: return await model.analyze(request.text) elif model_name in ["image_classification", "object_detection", "ocr"]: return await model.analyze(request.image_base64) elif model_name == "image_captioning": return await model.generate_caption(request.image_base64) elif model_name in ["speech_recognition", "audio_classification"]: return await model.analyze(request.audio_base64) elif model_name == "visual_qa": return await model.answer(request.text, request.image_base64) else: raise ValueError(f"Unknown model: {model_name}") def _combine_results(self, results: List[ModelResult], request: MultiModelRequest) -> Dict[str, Any]: """Combine results from multiple models intelligently""" combined = { "summary": {}, "confidence_scores": {}, "cross_modal_insights": {}, "consensus": {} } # Extract successful results successful_results = [r for r in results if r.status == "success"] # Sentiment consensus sentiment_results = [r for r in successful_results if r.model_name == "sentiment"] if sentiment_results: combined["summary"]["sentiment"] = sentiment_results[0].result combined["confidence_scores"]["sentiment"] = sentiment_results[0].confidence # Entity consolidation entity_results = [r for r in successful_results if r.model_name == "entities"] if entity_results: entities = entity_results[0].result.get("entities", []) # Group entities by type entity_groups = {} for entity in entities: entity_type = entity.get("label", "UNKNOWN") if entity_type not in entity_groups: entity_groups[entity_type] = [] entity_groups[entity_type].append(entity["text"]) combined["summary"]["entities"] = entity_groups combined["confidence_scores"]["entities"] = entity_results[0].confidence # Cross-modal insights if request.text and request.image_base64: text_sentiment = next((r.result for r in successful_results if r.model_name == "sentiment"), None) image_caption = next((r.result for r in successful_results if r.model_name == "image_captioning"), None) if 
text_sentiment and image_caption: combined["cross_modal_insights"]["text_image_alignment"] = self._analyze_text_image_alignment( text_sentiment, image_caption ) # Generate overall consensus combined["consensus"] = self._generate_consensus(successful_results) return combined def _analyze_text_image_alignment(self, text_sentiment: Dict, image_caption: Dict) -> Dict[str, Any]: """Analyze alignment between text sentiment and image content""" # Simple alignment analysis text_polarity = text_sentiment.get("label", "neutral") caption_text = image_caption.get("caption", "") # Basic keyword matching for alignment positive_keywords = ["happy", "smile", "bright", "beautiful", "joy"] negative_keywords = ["sad", "dark", "angry", "broken", "disappointed"] caption_lower = caption_text.lower() positive_matches = sum(1 for word in positive_keywords if word in caption_lower) negative_matches = sum(1 for word in negative_keywords if word in caption_lower) if positive_matches > negative_matches: image_sentiment = "positive" elif negative_matches > positive_matches: image_sentiment = "negative" else: image_sentiment = "neutral" alignment_score = 1.0 if text_polarity == image_sentiment else 0.5 return { "text_sentiment": text_polarity, "inferred_image_sentiment": image_sentiment, "alignment_score": alignment_score, "caption": caption_text } def _generate_consensus(self, results: List[ModelResult]) -> Dict[str, Any]: """Generate consensus view across all successful models""" consensus = { "primary_insights": [], "confidence_level": "low", "recommendation": "further_analysis_needed" } # Aggregate confidence scores avg_confidence = sum(r.confidence for r in results) / len(results) if results else 0.0 if avg_confidence > 0.8: consensus["confidence_level"] = "high" consensus["recommendation"] = "results_reliable" elif avg_confidence > 0.6: consensus["confidence_level"] = "medium" consensus["recommendation"] = "results_moderately_reliable" # Extract key insights for result in results: if 
result.confidence > 0.7: if result.model_name == "sentiment": consensus["primary_insights"].append( f"Text sentiment: {result.result.get('label', 'unknown')}" ) elif result.model_name == "classification": consensus["primary_insights"].append( f"Content category: {result.result.get('predicted_class', 'unknown')}" ) elif result.model_name == "object_detection": objects = result.result.get("objects", []) if objects: consensus["primary_insights"].append( f"Key objects detected: {', '.join([obj['class'] for obj in objects[:3]])}" ) return consensus # Model implementations (simplified interfaces) class SentimentAnalyzer: async def analyze(self, text: str) -> Dict[str, Any]: # Implementation would use actual sentiment model return { "output": {"label": "positive", "score": 0.85}, "confidence": 0.85 } class EntityExtractor: async def analyze(self, text: str) -> Dict[str, Any]: # Implementation would use actual NER model return { "output": { "entities": [ {"text": "Apple", "label": "ORG", "start": 0, "end": 5} ] }, "confidence": 0.9 } class TextClassifier: async def analyze(self, text: str) -> Dict[str, Any]: # Implementation would use actual text classifier return { "output": {"predicted_class": "technology", "score": 0.95}, "confidence": 0.95 } class TextSummarizer: async def analyze(self, text: str) -> Dict[str, Any]: # Implementation would use actual summarizer return { "output": {"summary": "This is a summary."}, "confidence": 0.9 } class ImageClassifier: async def analyze(self, image_base64: str) -> Dict[str, Any]: # Implementation would use actual image classification model return { "output": {"class": "cat", "score": 0.92}, "confidence": 0.92 } class ObjectDetector: async def analyze(self, image_base64: str) -> Dict[str, Any]: # Implementation would use actual object detector return { "output": {"objects": [{"class": "cat", "box": [0, 0, 100, 100]}]}, "confidence": 0.9 } class OpticalCharacterRecognition: async def analyze(self, image_base64: str) -> Dict[str, Any]: 
# Implementation would use actual OCR return { "output": {"text": "Extracted text"}, "confidence": 0.85 } class ImageCaptioner: async def generate_caption(self, image_base64: str) -> Dict[str, Any]: # Implementation would use actual image captioning model return { "output": {"caption": "A cat sitting on a windowsill"}, "confidence": 0.88 } class VisualQuestionAnswering: async def answer(self, text: str, image_base64: str) -> Dict[str, Any]: # Implementation would use VQA model return { "output": {"answer": "Yes"}, "confidence": 0.9 } class SpeechRecognizer: async def analyze(self, audio_base64: str) -> Dict[str, Any]: # Implementation would use ASR model return { "output": {"text": "Transcribed audio"}, "confidence": 0.95 } class AudioClassifier: async def analyze(self, audio_base64: str) -> Dict[str, Any]: # Implementation would use audio classifier return { "output": {"class": "music"}, "confidence": 0.8 } # Global analyzer instance multi_analyzer = None def initialize_analyzer(): """Initialize the multi-model analyzer""" global multi_analyzer multi_analyzer = MultiModelAnalyzer() return {"status": "initialized", "models_available": len(multi_analyzer.models)} async def analyze_multi_modal(inputs: Dict[str, Any]) -> Dict[str, Any]: """Main multi-model analysis endpoint""" if multi_analyzer is None: # Lazily initialize on first request initialize_analyzer() request = MultiModelRequest(**inputs) result = await multi_analyzer.analyze(request) return result.dict() ``` ## Production Deployment ### Scalable Multi-Model Service ```python from chutes.image import Image from chutes.chute import Chute, NodeSelector # Comprehensive multi-model image multi_model_image = ( Image( username="myuser", name="multi-model-analysis", tag="1.0.0", base_image="nvidia/cuda:12.1-devel-ubuntu22.04", python_version="3.11" ) .run_command("pip install 'torch>=2.4.0' 'transformers>=4.44.0' 'sentence-transformers>=3.0.0' 'opencv-python>=4.10.0' 'pillow>=10.4.0' 'ultralytics>=8.2.0' 'librosa>=0.10.2' 'soundfile>=0.12.1' 'pytesseract>=0.3.10' 'easyocr>=1.7.1' 'numpy>=1.26.0' 'scipy>=1.14.0' 'scikit-learn>=1.5.0' 'redis>=5.0.0'") .run_command("apt-get update && apt-get install -y tesseract-ocr libgl1-mesa-glx") .add("./models", "/app/models") .add("./multi_model", "/app/multi_model") ) # Deploy multi-model service multi_model_chute = Chute( username="myuser", name="multi-model-analysis", image=multi_model_image, entry_file="multi_model_analyzer.py", entry_point="analyze_multi_modal", node_selector=NodeSelector( gpu_count=2, min_vram_gb_per_gpu=16), timeout_seconds=600, concurrency=5 ) # result = multi_model_chute.deploy() # print(f"Multi-model service deployed: {result}") ``` ## Advanced Use Cases ### Document Intelligence ```python class DocumentIntelligenceAnalyzer(MultiModelAnalyzer): """Specialized analyzer for document processing""" async def analyze_document(self, document_image: str, document_text: str = None) -> Dict[str, Any]: """Comprehensive document analysis""" # Extract text using OCR if not provided if not document_text: ocr_result = await self.models["ocr"].analyze(document_image) document_text = ocr_result["output"]["text"] # Parallel analysis tasks = [ self.models["entities"].analyze(document_text), # Named entities self.models["classification"].analyze(document_text), # Document type self.models["sentiment"].analyze(document_text), # Sentiment/tone self.models["object_detection"].analyze(document_image), # Layout analysis self._extract_document_structure(document_image), # Structure analysis self._detect_signatures_stamps(document_image) # Signature detection ] results = await asyncio.gather(*tasks, return_exceptions=True) # Combine into document intelligence report; model results nest their payloads under "output" intelligence_report = { "document_type": results[1]["output"].get("predicted_class") if len(results) > 1 and isinstance(results[1], dict) else "unknown", "extracted_entities": results[0]["output"].get("entities", []) if len(results) > 0 and isinstance(results[0], dict) else [], "document_sentiment": results[2]["output"].get("label") if len(results) > 2 and isinstance(results[2], dict) else "neutral", "layout_elements": results[3]["output"].get("objects", []) if len(results) > 3 and isinstance(results[3], dict) else [], "structure_analysis": 
results[4] if len(results) > 4 else {}, "signature_analysis": results[5] if len(results) > 5 else {}, "extracted_text": document_text, "confidence_score": self._calculate_document_confidence(results) } return intelligence_report async def _extract_document_structure(self, image_base64: str) -> Dict[str, Any]: """Analyze document structure and layout""" # Implementation would use layout analysis model return { "sections": ["header", "body", "footer"], "tables_detected": 2, "figures_detected": 1, "text_blocks": 5 } async def _detect_signatures_stamps(self, image_base64: str) -> Dict[str, Any]: """Detect signatures and stamps in document""" # Implementation would use specialized signature detection return { "signatures_detected": 1, "stamps_detected": 0, "signature_locations": [{"x": 450, "y": 600, "width": 150, "height": 50}] } def _calculate_document_confidence(self, results: List[Any]) -> float: """Calculate overall confidence for document analysis""" # Simplified calculation confidences = [r.get("confidence", 0) for r in results if isinstance(r, dict)] return sum(confidences) / len(confidences) if confidences else 0.0 async def analyze_document_intelligence(inputs: Dict[str, Any]) -> Dict[str, Any]: """Document intelligence analysis endpoint""" analyzer = DocumentIntelligenceAnalyzer() result = await analyzer.analyze_document( document_image=inputs["document_image_base64"], document_text=inputs.get("document_text") ) return result ``` ### Social Media Content Analysis ```python class SocialMediaAnalyzer(MultiModelAnalyzer): """Specialized analyzer for social media content""" async def analyze_social_post(self, post_data: Dict[str, Any]) -> Dict[str, Any]: """Comprehensive social media post analysis""" text = post_data.get("text", "") images = post_data.get("images", []) video = post_data.get("video") audio = post_data.get("audio") analysis_tasks = [] # Text analysis if text: analysis_tasks.extend([ ("sentiment", self.models["sentiment"].analyze(text)), 
("entities", self.models["entities"].analyze(text)), ("classification", self.models["classification"].analyze(text)), ("toxicity", self._analyze_toxicity(text)), ("engagement_prediction", self._predict_engagement(text)) ]) # Image analysis for i, image in enumerate(images): analysis_tasks.extend([ (f"image_{i}_classification", self.models["image_classification"].analyze(image)), (f"image_{i}_objects", self.models["object_detection"].analyze(image)), (f"image_{i}_caption", self.models["image_captioning"].generate_caption(image)), (f"image_{i}_faces", self._detect_faces(image)) ]) # Audio analysis (if present) if audio: analysis_tasks.extend([ ("speech_to_text", self.models["speech_recognition"].analyze(audio)), ("audio_mood", self.models["audio_classification"].analyze(audio)) ]) # Execute all analyses if not analysis_tasks: return {"error": "No content to analyze"} task_names, tasks = zip(*analysis_tasks) results = await asyncio.gather(*tasks, return_exceptions=True) # Compile comprehensive report social_analysis = { "content_summary": self._generate_content_summary(text, images, audio), "engagement_factors": self._analyze_engagement_factors(results, task_names), "risk_assessment": self._assess_content_risks(results, task_names), "recommendations": self._generate_recommendations(results, task_names), "virality_score": self._calculate_virality_score(results, task_names), "target_audience": self._identify_target_audience(results, task_names) } return social_analysis async def _analyze_toxicity(self, text: str) -> Dict[str, Any]: """Analyze text for toxic content""" # Implementation would use toxicity detection model return {"toxicity_score": 0.1, "is_toxic": False} async def _predict_engagement(self, text: str) -> Dict[str, Any]: """Predict engagement potential of text""" # Implementation would use engagement prediction model return {"predicted_likes": 150, "predicted_shares": 25, "predicted_comments": 10} async def _detect_faces(self, image: str) -> Dict[str, Any]: 
"""Detect faces in image""" # Implementation would use face detection model return {"face_count": 1, "emotions": ["happy"]} def _generate_content_summary(self, text, images, audio) -> Dict[str, Any]: """Generate summary of content types present""" return { "has_text": bool(text), "image_count": len(images), "has_audio": bool(audio), "has_video": False # Not implemented yet } def _analyze_engagement_factors(self, results, task_names) -> Dict[str, Any]: """Analyze factors contributing to engagement""" return {"sentiment_impact": "positive", "visual_impact": "high"} def _assess_content_risks(self, results, task_names) -> Dict[str, Any]: """Assess potential content risks""" return {"risk_level": "low", "flagged_content": []} def _generate_recommendations(self, results, task_names) -> List[str]: """Generate content improvement recommendations""" return ["Add more hashtags", "Use brighter images"] def _identify_target_audience(self, results, task_names) -> str: """Identify potential target audience""" return "General" def _calculate_virality_score(self, results: List, task_names: List[str]) -> float: """Calculate potential virality score""" # Complex scoring algorithm based on multiple factors base_score = 0.5 # Boost for positive sentiment sentiment_idx = next((i for i, name in enumerate(task_names) if name == "sentiment"), None) if sentiment_idx is not None and not isinstance(results[sentiment_idx], Exception): sentiment = results[sentiment_idx].get("label", "neutral") if sentiment == "positive": base_score += 0.2 # Boost for visual content image_count = sum(1 for name in task_names if "image_" in name and "_classification" in name) base_score += min(image_count * 0.1, 0.3) return min(base_score, 1.0) async def analyze_social_media(inputs: Dict[str, Any]) -> Dict[str, Any]: """Social media analysis endpoint""" analyzer = SocialMediaAnalyzer() result = await analyzer.analyze_social_post(inputs["post_data"]) return result ``` ## Performance Optimization ### Caching and 
Load Balancing ```python import redis import pickle import hashlib from typing import Optional class CachedMultiModelAnalyzer(MultiModelAnalyzer): """Multi-model analyzer with Redis caching""" def __init__(self, redis_url: str = "redis://localhost:6379"): super().__init__() self.redis_client = redis.from_url(redis_url) self.cache_ttl = 3600 # 1 hour def _generate_cache_key(self, request: MultiModelRequest) -> str: """Generate cache key for request""" request_str = f"{request.text or ''}{request.image_base64 or ''}{request.audio_base64 or ''}" return f"multi_model:{hashlib.md5(request_str.encode()).hexdigest()}" async def analyze(self, request: MultiModelRequest) -> MultiModelResponse: """Analyze with caching""" cache_key = self._generate_cache_key(request) # Try to get from cache cached_result = self._get_from_cache(cache_key) if cached_result: return cached_result # Perform analysis result = await super().analyze(request) # Cache result self._store_in_cache(cache_key, result) return result def _get_from_cache(self, key: str) -> Optional[MultiModelResponse]: """Get result from Redis cache""" try: cached_data = self.redis_client.get(key) if cached_data: return MultiModelResponse(**pickle.loads(cached_data)) except Exception as e: self.logger.warning(f"Cache read error: {e}") return None def _store_in_cache(self, key: str, result: MultiModelResponse): """Store result in Redis cache""" try: serialized_data = pickle.dumps(result.dict()) self.redis_client.setex(key, self.cache_ttl, serialized_data) except Exception as e: self.logger.warning(f"Cache write error: {e}") # Model load balancing class LoadBalancedMultiModelAnalyzer(CachedMultiModelAnalyzer): """Multi-model analyzer with load balancing across model instances""" def __init__(self, model_endpoints: Dict[str, List[str]], redis_url: str = "redis://localhost:6379"): super().__init__(redis_url) self.model_endpoints = model_endpoints self.current_endpoints = {model: 0 for model in model_endpoints} def 
_get_next_endpoint(self, model_name: str) -> str: """Get next endpoint using round-robin load balancing""" if model_name not in self.model_endpoints: raise ValueError(f"No endpoints configured for model: {model_name}") endpoints = self.model_endpoints[model_name] current_idx = self.current_endpoints[model_name] endpoint = endpoints[current_idx] # Update for next request self.current_endpoints[model_name] = (current_idx + 1) % len(endpoints) return endpoint async def _execute_model(self, model, model_name: str, request: MultiModelRequest) -> Dict[str, Any]: """Execute model with load balancing""" endpoint = self._get_next_endpoint(model_name) # Make HTTP request to model endpoint import httpx async with httpx.AsyncClient() as client: if model_name in ["sentiment", "entities", "classification"]: response = await client.post(f"{endpoint}/analyze", json={"text": request.text}) elif model_name in ["image_classification", "object_detection"]: response = await client.post(f"{endpoint}/analyze", json={"image": request.image_base64}) else: # Add more model types as needed; raising avoids an unbound `response` below raise ValueError(f"No remote handler for model: {model_name}") response.raise_for_status() return response.json() ``` ## Monitoring and Observability ```python from prometheus_client import Counter, Histogram, Gauge, start_http_server import time # Metrics MODEL_REQUESTS = Counter('model_requests_total', 'Total model requests', ['model_name', 'status']) MODEL_DURATION = Histogram('model_duration_seconds', 'Model execution time', ['model_name']) ACTIVE_ANALYSES = Gauge('active_analyses', 'Number of active analyses') CACHE_HITS = Counter('cache_hits_total', 'Cache hits', ['type']) class MonitoredMultiModelAnalyzer(LoadBalancedMultiModelAnalyzer): """Multi-model analyzer with comprehensive monitoring""" async def analyze(self, request: MultiModelRequest) -> MultiModelResponse: """Analyze with monitoring""" ACTIVE_ANALYSES.inc() try: start_time = time.time() result = await super().analyze(request) # Record success metrics MODEL_REQUESTS.labels(model_name='multi_model', 
status='success').inc() MODEL_DURATION.labels(model_name='multi_model').observe(time.time() - start_time) return result except Exception as e: MODEL_REQUESTS.labels(model_name='multi_model', status='error').inc() raise finally: ACTIVE_ANALYSES.dec() async def _run_model_safe(self, model_name: str, request: MultiModelRequest) -> ModelResult: """Run model with individual monitoring""" MODEL_REQUESTS.labels(model_name=model_name, status='started').inc() with MODEL_DURATION.labels(model_name=model_name).time(): result = await super()._run_model_safe(model_name, request) status = 'success' if result.status == 'success' else 'error' MODEL_REQUESTS.labels(model_name=model_name, status=status).inc() return result # Start metrics server # start_http_server(8001) ``` ## Usage Examples ### Comprehensive Content Analysis ```python # Deploy the multi-model service # comprehensive_result = multi_model_chute.run({ # "text": "Just visited the most amazing restaurant! The food was incredible and the view was breathtaking. 
Highly recommend!", # "image_base64": "...", # Base64 encoded restaurant photo # "analysis_types": [ # "sentiment", "entities", "classification", # "image_classification", "object_detection", "image_captioning" # ], # "combine_results": True, # "confidence_threshold": 0.6 # }) # print("Individual Results:") # for result in comprehensive_result["individual_results"]: # print(f"- {result['model_name']}: {result['confidence']:.2f} confidence") # print("\nCombined Analysis:") # print(f"Overall sentiment: {comprehensive_result['combined_analysis']['summary']['sentiment']['label']}") # print(f"Entities found: {comprehensive_result['combined_analysis']['summary']['entities']}") # print(f"Cross-modal alignment: {comprehensive_result['combined_analysis']['cross_modal_insights']}") ``` ## Next Steps - **[Custom Training](custom-training)** - Train specialized models for your use case - **[Performance Optimization](../guides/performance)** - Scale multi-model systems - **[Production Deployment](../guides/best-practices)** - Deploy at enterprise scale --- ## SOURCE: https://chutes.ai/docs/examples/music-generation # Music Generation with DiffRhythm This guide demonstrates how to build a sophisticated music generation service using DiffRhythm, capable of creating music from text prompts and lyrics with advanced rhythm and style control. 
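Throughout this guide, lyrics use the LRC format, where every line is prefixed with a `[mm:ss.xx]` timestamp. As a quick orientation before diving in, here is a minimal, hypothetical parsing sketch (`parse_lrc_line` is illustrative only; the service below performs much stricter validation):

```python
import re

# Same timestamp pattern the service uses for LRC lines
LRC_RE = re.compile(r"\[(\d+):(\d+\.\d+)\]")

def parse_lrc_line(line: str):
    """Return (seconds, text) for one LRC line, or None if untimestamped."""
    match = LRC_RE.match(line)
    if not match:
        return None
    minutes, seconds = int(match.group(1)), float(match.group(2))
    # Offset in seconds from song start, plus the lyric text itself
    return minutes * 60 + seconds, line[match.end():].strip()

print(parse_lrc_line("[01:05.50]With the music in our hearts"))
# -> (65.5, 'With the music in our hearts')
```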
## Overview DiffRhythm (ASLP-lab/DiffRhythm) is a state-of-the-art music generation model that can: - Generate music from text descriptions and style prompts - Convert lyrics with timing information into musical performances - Use reference audio to guide musical style - Support multiple languages and musical genres - Generate high-quality 44.1kHz audio output ## Complete Implementation ### Input Schema Design Define comprehensive input validation for music generation: ```python import re from typing import Optional from pydantic import BaseModel from fastapi import HTTPException, status # Regex for validating LRC (lyric) format timestamps LRC_RE = re.compile(r"\[(\d+):(\d+\.\d+)\]") class InputArgs(BaseModel): style_prompt: Optional[str] = None lyrics: Optional[str] = None audio_b64: Optional[str] = None # Reference audio in base64 ``` ### Custom Image with DiffRhythm Build a custom image with all required dependencies: ```python from chutes.image import Image from chutes.chute import Chute, NodeSelector image = ( Image( username="myuser", name="diffrhythm", tag="0.0.2", readme="Music generation with ASLP-lab/DiffRhythm") .from_base("parachutes/base-python:3.12.9") .set_user("root") .run_command("apt update && apt -y install espeak-ng") # For text processing .set_user("chutes") .run_command("git clone https://github.com/ASLP-lab/DiffRhythm.git") .run_command("pip install -r DiffRhythm/requirements.txt") .run_command("pip install pybase64 py3langid") # Additional dependencies .run_command("mv -f /app/DiffRhythm/* /app") # Move to app directory .with_env("PYTHONPATH", "/app/infer") # Set Python path ) ``` ### Chute Configuration Configure the service with appropriate GPU requirements: ```python chute = Chute( username="myuser", name="diffrhythm-music", tagline="AI Music Generation with DiffRhythm", readme="Generate music from text descriptions and lyrics using advanced AI", image=image, node_selector=NodeSelector(gpu_count=1), # Single GPU sufficient ) ``` ### Model 
Initialization Load and initialize all required models on startup: ```python @chute.on_startup() async def initialize(self): """ Initialize DiffRhythm models and dependencies. """ from huggingface_hub import snapshot_download import torchaudio import torch import soundfile from infer_utils import ( decode_audio, get_lrc_token, get_negative_style_prompt, get_reference_latent, get_style_prompt, load_checkpoint, CNENTokenizer) from infer import inference from muq import MuQMuLan from model import DiT, CFM import json import os # Download required models revision = "613846abae8e5b869b3845a5dfabc9ecc37ecdab" repo_id = "ASLP-lab/DiffRhythm-full" path = snapshot_download(repo_id, revision=revision) vae_path = snapshot_download( "ASLP-lab/DiffRhythm-vae", revision="4656f626776f5f924c03471acb25bea6734e774f" ) # Load model configuration dit_config_path = "/app/config/diffrhythm-1b.json" with open(dit_config_path) as f: model_config = json.load(f) # Initialize models dit_model_cls = DiT self.max_frames = 6144 # CFM (Conditional Flow Matching) model self.cfm = CFM( transformer=dit_model_cls(**model_config["model"], max_frames=self.max_frames), num_channels=model_config["model"]["mel_dim"], max_frames=self.max_frames ).to("cuda") # Load trained weights self.cfm = load_checkpoint( self.cfm, os.path.join(path, "cfm_model.pt"), device="cuda", use_ema=False ) # Initialize tokenizer and style model self.tokenizer = CNENTokenizer() self.muq = MuQMuLan.from_pretrained( "OpenMuQ/MuQ-MuLan-large", revision="8a081dbcf84edd47ea7db3c4ecb8fd1ec1ddacfe" ).to("cuda") # Load VAE for audio decoding vae_ckpt_path = os.path.join(vae_path, "vae_model.pt") self.vae = torch.jit.load(vae_ckpt_path, map_location="cpu").to("cuda") # Warmup with example generation await self._warmup_model() # Store utilities self.torchaudio = torchaudio self.torch = torch self.soundfile = soundfile self.decode_audio = decode_audio self.inference = inference self.get_lrc_token = get_lrc_token self.get_reference_latent = 
get_reference_latent self.get_style_prompt = get_style_prompt async def _warmup_model(self): """Perform warmup generation to load models into memory.""" from infer_utils import get_lrc_token, get_negative_style_prompt, get_reference_latent, get_style_prompt from infer import inference # Load example lyrics with open("/app/infer/example/eg_en_full.lrc", "r", encoding="utf-8") as infile: lrc = infile.read() # Prepare warmup data lrc_prompt, start_time = get_lrc_token(self.max_frames, lrc, self.tokenizer, "cuda") self.negative_style_prompt = get_negative_style_prompt("cuda") self.latent_prompt = get_reference_latent("cuda", self.max_frames) style_prompt = get_style_prompt(self.muq, prompt="classical genres, hopeful mood, piano.") # Perform warmup generation with self.torch.no_grad(): generated_song = inference( cfm_model=self.cfm, vae_model=self.vae, cond=self.latent_prompt, text=lrc_prompt, duration=self.max_frames, style_prompt=style_prompt, negative_style_prompt=self.negative_style_prompt, start_time=start_time, chunked=True) # Save warmup output output_path = "/app/warmup.mp3" self.torchaudio.save(output_path, generated_song, sample_rate=44100, format="mp3") ``` ### Audio Processing Utilities Add utilities for handling audio input: ```python import pybase64 as base64 import tempfile from io import BytesIO from loguru import logger def load_audio(self, audio_b64): """ Convert base64 audio to tensor for style extraction. 
""" try: audio_bytes = BytesIO(base64.b64decode(audio_b64)) with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as temp_file: temp_file.write(audio_bytes.getvalue()) temp_path = temp_file.name # Load once to verify the audio actually decodes before returning the path waveform, sample_rate = self.torchaudio.load(temp_path) return temp_path except Exception as exc: logger.error(f"Error loading audio: {exc}") raise HTTPException( status_code=status.HTTP_400_BAD_REQUEST, detail=f"Invalid input audio_b64 provided: {exc}") ``` ### Lyrics Validation Implement comprehensive lyrics validation with timing: ```python def validate_lyrics(lyrics: str, total_length: int): """ Validate LRC format lyrics for proper timing, ordering, and format. """ def format_time(seconds: float) -> str: minutes = int(seconds // 60) remaining_seconds = seconds % 60 return f"{minutes:02d}:{remaining_seconds:05.2f}" previous_time = -1.0 last_timestamp = 0.0 try: for line_num, line in enumerate(lyrics.splitlines()): if not line.strip(): continue # Check line length if len(line) > 256: raise HTTPException( status_code=status.HTTP_400_BAD_REQUEST, detail=f"Line {line_num} exceeds 256 characters: {len(line)} chars") # Validate timestamp format valid_match = LRC_RE.match(line) if valid_match: minutes = int(valid_match.group(1)) seconds = float(valid_match.group(2)) current_time = minutes * 60 + seconds last_timestamp = max(last_timestamp, current_time) # Check chronological order if current_time < previous_time: raise HTTPException( status_code=status.HTTP_400_BAD_REQUEST, detail=f"Line {line_num}: Timestamp {format_time(current_time)} " f"is before previous timestamp {format_time(previous_time)}") previous_time = current_time except HTTPException: # Propagate our own validation errors without re-wrapping the detail raise except Exception as exc: raise HTTPException( status_code=status.HTTP_400_BAD_REQUEST, detail=f"Error validating lyrics: {exc}") # Check total duration if last_timestamp > total_length: raise HTTPException( status_code=status.HTTP_400_BAD_REQUEST, detail=f"Total duration ({format_time(last_timestamp)}) " f"exceeds maximum allowed length 
({format_time(total_length)})") ``` ### Music Generation Endpoint Create the main generation endpoint: ```python import uuid import os from fastapi.responses import Response @chute.cord( public_api_path="/generate", public_api_method="POST", stream=False, output_content_type="audio/mp3") async def generate(self, args: InputArgs) -> Response: """ Generate music from style prompts and/or lyrics. """ input_path = None inference_kwargs = dict( cfm_model=self.cfm, vae_model=self.vae, cond=self.latent_prompt, duration=self.max_frames, negative_style_prompt=self.negative_style_prompt, chunked=True) # Extract style from prompt or reference audio style_prompt = None if args.style_prompt: style_prompt = self.get_style_prompt(self.muq, prompt=args.style_prompt) elif args.audio_b64: input_path = load_audio(self, args.audio_b64) try: style_prompt = self.get_style_prompt(self.muq, input_path) except Exception as exc: raise HTTPException( status_code=status.HTTP_400_BAD_REQUEST, detail=f"Invalid input audio: {exc}") finally: if input_path and os.path.exists(input_path): os.remove(input_path) if style_prompt is None: raise HTTPException( status_code=status.HTTP_400_BAD_REQUEST, detail="You must provide either style_prompt or audio_b64!") inference_kwargs["style_prompt"] = style_prompt # Process lyrics if provided if args.lyrics: validate_lyrics(args.lyrics, 285) # Max ~4.75 minutes lrc_prompt, start_time = self.get_lrc_token( self.max_frames, args.lyrics or "", self.tokenizer, "cuda" ) inference_kwargs["text"] = lrc_prompt inference_kwargs["start_time"] = start_time # Generate the music output_path = f"/tmp/{uuid.uuid4()}.mp3" try: with self.torch.no_grad(): generated_song = self.inference(**inference_kwargs) self.torchaudio.save( output_path, generated_song, sample_rate=44100, format="mp3" ) # Return audio file with open(output_path, "rb") as infile: return Response( content=infile.read(), media_type="audio/mp3", headers={ "Content-Disposition": f"attachment; 
filename={uuid.uuid4()}.mp3", }) finally: if os.path.exists(output_path): os.remove(output_path) ``` ## Advanced Features ### Style-Guided Generation Create endpoint for style-specific music generation: ```python class StyleRequest(BaseModel): style_description: str mood: Optional[str] = "neutral" genre: Optional[str] = "pop" instruments: Optional[str] = "piano, guitar" tempo: Optional[str] = "medium" @chute.cord(public_api_path="/style_generate", method="POST") async def generate_with_style(self, request: StyleRequest) -> Response: """Generate music with detailed style control.""" # Construct detailed style prompt style_prompt = f"{request.genre} genre, {request.mood} mood, {request.instruments}" if request.tempo: style_prompt += f", {request.tempo} tempo" if request.style_description: style_prompt += f", {request.style_description}" # Generate using style prompt args = InputArgs(style_prompt=style_prompt) return await self.generate(args) ``` ### Lyrics-to-Music with Timing Example of properly formatted lyrics with timestamps: ```python # Example LRC format lyrics example_lyrics = """ [00:00.00]Verse 1 [00:05.50]In the morning light so bright [00:10.00]I can see a better sight [00:15.50]Dreams are calling out my name [00:20.00]Nothing will be quite the same [00:25.00]Chorus [00:27.50]We are rising with the sun [00:32.00]A new journey has begun [00:37.50]Every step we take today [00:42.00]Leads us down a brighter way [00:47.00]Verse 2 [00:50.00]Through the valleys and the hills [00:55.50]We will chase away our fears [01:00.00]With the music in our hearts [01:05.50]We will make a brand new start """ class LyricsRequest(BaseModel): lyrics: str style_prompt: str = "uplifting pop song, piano and strings" @chute.cord(public_api_path="/lyrics_to_music", method="POST") async def lyrics_to_music(self, request: LyricsRequest) -> Response: """Convert timestamped lyrics into a complete song.""" args = InputArgs( style_prompt=request.style_prompt, lyrics=request.lyrics ) 
return await self.generate(args) ``` ### Reference Audio Style Transfer Extract musical style from uploaded audio: ```python from pydantic import Field class StyleTransferRequest(BaseModel): reference_audio_b64: str new_lyrics: Optional[str] = None style_blend: float = Field(default=1.0, ge=0.1, le=1.0) # reserved for future blending control; unused by the basic endpoint @chute.cord(public_api_path="/style_transfer", method="POST") async def style_transfer(self, request: StyleTransferRequest) -> Response: """Generate music using the style from reference audio.""" args = InputArgs( audio_b64=request.reference_audio_b64, lyrics=request.new_lyrics ) return await self.generate(args) ``` ## Deployment and Usage ### Deploy the Service ```bash # Build and deploy the music generation service chutes deploy my_music_gen:chute # Monitor the deployment chutes chutes get my-music-gen ``` ### Using the API #### Generate with Style Prompt ```bash curl -X POST "https://myuser-my-music-gen.chutes.ai/generate" \ -H "Content-Type: application/json" \ -d '{ "style_prompt": "upbeat electronic dance music, synthesizers, energetic" }' \ --output generated_music.mp3 ``` #### Generate with Lyrics ```bash curl -X POST "https://myuser-my-music-gen.chutes.ai/lyrics_to_music" \ -H "Content-Type: application/json" \ -d '{ "lyrics": "[00:00.00]Hello world\n[00:05.00]This is my song\n[00:10.00]Made with AI", "style_prompt": "acoustic folk, guitar and violin, heartfelt" }' \ --output lyrical_song.mp3 ``` #### Python Client Example ```python import requests import base64 class MusicGenerator: def __init__(self, base_url): self.base_url = base_url def generate_from_style(self, style_prompt): """Generate music from style description.""" response = requests.post( f"{self.base_url}/generate", json={"style_prompt": style_prompt} ) if response.status_code == 200: return response.content else: raise Exception(f"Generation failed: {response.status_code}") def generate_from_lyrics(self, lyrics, style="pop"): """Generate music from timestamped lyrics.""" response = requests.post( 
f"{self.base_url}/lyrics_to_music", json={ "lyrics": lyrics, "style_prompt": f"{style} style, full band arrangement" } ) return response.content def style_transfer(self, reference_audio_path, new_lyrics=None): """Generate music using style from reference audio.""" with open(reference_audio_path, "rb") as f: audio_b64 = base64.b64encode(f.read()).decode() payload = {"reference_audio_b64": audio_b64} if new_lyrics: payload["new_lyrics"] = new_lyrics response = requests.post( f"{self.base_url}/style_transfer", json=payload ) return response.content # Usage example generator = MusicGenerator("https://myuser-my-music-gen.chutes.ai") # Generate upbeat electronic music music = generator.generate_from_style( "energetic electronic dance music, heavy bass, futuristic sounds" ) with open("edm_track.mp3", "wb") as f: f.write(music) # Generate from lyrics lyrics = """ [00:00.00]Verse 1 [00:03.00]AI creates the beat [00:06.00]Technology so sweet [00:09.00]Music from the future [00:12.00]Is here to greet ya """ song = generator.generate_from_lyrics(lyrics, "electronic pop") with open("ai_song.mp3", "wb") as f: f.write(song) ``` ## Best Practices ### 1. Lyrics Formatting ```python # Good LRC format - clear timing and structure good_lyrics = """ [00:00.00]Intro [00:08.00]Verse 1 [00:10.50]Walking down the street tonight [00:15.00]City lights are shining bright [00:20.50]Every step I take feels right [00:25.00]In this neon-colored light [00:30.00]Chorus [00:32.50]We are alive, we are free [00:37.00]This is who we're meant to be [00:42.50]Dancing through eternity [00:47.00]In perfect harmony """ # Bad format - inconsistent timing bad_lyrics = """ [00:00]Start [0:5]Some lyrics here [15.5]More lyrics without proper format Random text without timestamp """ ``` ### 2. 
Style Prompt Engineering ```python # Effective style prompts are specific and descriptive effective_styles = [ "jazz ballad, piano and saxophone, slow tempo, romantic mood", "rock anthem, electric guitars, powerful drums, energetic", "classical orchestral, strings and brass, dramatic, cinematic", "ambient electronic, synthesizers, dreamy, ethereal atmosphere", "country folk, acoustic guitar, harmonica, storytelling style" ] # Avoid vague prompts vague_styles = [ "good music", "nice song", "popular style" ] ``` ### 3. Audio Quality Optimization ```python # For highest quality output @chute.cord(public_api_path="/hq_generate", method="POST") async def high_quality_generate(self, args: InputArgs) -> Response: """Generate high-quality music with extended processing.""" # Use maximum duration for better quality inference_kwargs = dict( cfm_model=self.cfm, vae_model=self.vae, cond=self.latent_prompt, duration=self.max_frames, # Use full duration negative_style_prompt=self.negative_style_prompt, chunked=False, # Don't chunk for better coherence ) # ... rest of generation logic ``` ### 4. 
Error Handling and Validation ```python def validate_audio_input(audio_b64: str, max_size_mb: int = 10): """Validate audio input size and format.""" try: audio_data = base64.b64decode(audio_b64) size_mb = len(audio_data) / (1024 * 1024) if size_mb > max_size_mb: raise HTTPException( status_code=400, detail=f"Audio file too large: {size_mb:.1f}MB (max: {max_size_mb}MB)" ) return audio_data except HTTPException: raise except Exception as e: raise HTTPException( status_code=400, detail=f"Invalid audio data: {str(e)}" ) ``` ## Performance and Scaling ### Memory Optimization ```python # Clear GPU memory between generations; registered on its own path so it does not clash with the /generate cord above @chute.cord(public_api_path="/generate_optimized", method="POST") async def generate_optimized(self, args: InputArgs) -> Response: """Memory-optimized generation.""" try: # Clear cache before generation if hasattr(self, 'torch'): self.torch.cuda.empty_cache() # Generate music result = await self.generate(args) return result finally: # Clean up after generation if hasattr(self, 'torch'): self.torch.cuda.empty_cache() ``` ### Concurrent Processing ```python # Configure for multiple concurrent generations chute = Chute( username="myuser", name="diffrhythm-music", image=image, node_selector=NodeSelector( gpu_count=2, # Multiple GPUs for parallel processing min_vram_gb_per_gpu=24 ), concurrency=4, # Handle multiple requests ) ``` ## Monitoring and Troubleshooting ### Common Issues and Solutions ```bash # Check service health chutes chutes get my-music-gen # View generation logs chutes chutes logs my-music-gen --tail 50 # Monitor GPU utilization chutes chutes metrics my-music-gen ``` ### Performance Monitoring ```python import time from loguru import logger @chute.cord(public_api_path="/generate_timed", method="POST") async def generate_with_timing(self, args: InputArgs) -> Response: """Generation with performance monitoring.""" start_time = time.time() try: result = await self.generate(args) generation_time = time.time() - start_time logger.info(f"Generation completed in {generation_time:.2f} 
seconds") return result except Exception as e: error_time = time.time() - start_time logger.error(f"Generation failed after {error_time:.2f} seconds: {e}") raise ``` ## Next Steps - **Custom Models**: Train DiffRhythm on your own musical datasets - **Style Control**: Experiment with different musical genres and moods - **Integration**: Build music creation apps and platforms - **Real-time**: Implement streaming music generation For more advanced examples, see: - [Audio Processing](/docs/examples/audio-processing) - [Custom Training](/docs/examples/custom-training) - [Real-time Streaming](/docs/examples/streaming-responses) --- ## SOURCE: https://chutes.ai/docs/examples/semantic-search # Semantic Search with Text Embeddings This guide demonstrates how to build a complete semantic search application using text embeddings with Chutes. We'll create a search system that understands meaning, not just keywords. ## Overview Semantic search enables: - **Meaning-based Search**: Find documents based on meaning, not just exact keywords - **Similarity Matching**: Discover related content even with different wording - **Multi-language Support**: Search across different languages - **Contextual Understanding**: Understand context and intent in queries - **Scalable Indexing**: Handle large document collections efficiently ## Quick Start ### Basic Semantic Search Service ```python from chutes.chute import Chute, NodeSelector from chutes.chute.template.tei import build_tei_chute # Create text embedding service embedding_chute = build_tei_chute( username="myuser", model_name="sentence-transformers/all-MiniLM-L6-v2", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=4 ), concurrency=8 ) print("Deploying embedding service...") result = embedding_chute.deploy() print(f"✅ Embedding service deployed: {result}") ``` ### Search Application ```python from pydantic import BaseModel, Field from typing import List, Dict, Any, Optional import numpy as np from sklearn.metrics.pairwise 
import cosine_similarity import json class Document(BaseModel): id: str content: str metadata: Optional[Dict[str, Any]] = Field(default_factory=dict) embedding: Optional[List[float]] = None class SearchQuery(BaseModel): query: str max_results: int = Field(default=10, le=100) similarity_threshold: float = Field(default=0.7, ge=0.0, le=1.0) filters: Optional[Dict[str, Any]] = Field(default_factory=dict) class SearchResult(BaseModel): document: Document similarity_score: float rank: int class SearchResponse(BaseModel): query: str results: List[SearchResult] total_matches: int search_time_ms: float class SemanticSearchEngine: def __init__(self, embedding_chute_url: str): self.embedding_chute_url = embedding_chute_url self.documents: List[Document] = [] self.embeddings_matrix = None async def embed_text(self, text: str) -> List[float]: """Generate embedding for text using TEI service""" import httpx async with httpx.AsyncClient() as client: response = await client.post( f"{self.embedding_chute_url}/embed", json={"inputs": text} ) response.raise_for_status() return response.json()[0] async def add_document(self, document: Document) -> None: """Add document to search index""" if document.embedding is None: document.embedding = await self.embed_text(document.content) self.documents.append(document) self._update_embeddings_matrix() async def add_documents(self, documents: List[Document]) -> None: """Add multiple documents efficiently""" # Generate embeddings for documents without them texts_to_embed = [] doc_indices = [] for i, doc in enumerate(documents): if doc.embedding is None: texts_to_embed.append(doc.content) doc_indices.append(i) if texts_to_embed: embeddings = await self._embed_batch(texts_to_embed) for idx, embedding in zip(doc_indices, embeddings): documents[idx].embedding = embedding self.documents.extend(documents) self._update_embeddings_matrix() async def _embed_batch(self, texts: List[str]) -> List[List[float]]: """Generate embeddings for multiple texts""" 
import httpx async with httpx.AsyncClient() as client: response = await client.post( f"{self.embedding_chute_url}/embed", json={"inputs": texts} ) response.raise_for_status() return response.json() def _update_embeddings_matrix(self): """Update the embeddings matrix for similarity search""" if self.documents: embeddings = [doc.embedding for doc in self.documents] self.embeddings_matrix = np.array(embeddings) async def search(self, query: SearchQuery) -> SearchResponse: """Perform semantic search""" import time start_time = time.time() # Generate query embedding query_embedding = await self.embed_text(query.query) query_vector = np.array(query_embedding).reshape(1, -1) # Calculate similarities similarities = cosine_similarity(query_vector, self.embeddings_matrix)[0] # Apply similarity threshold valid_indices = np.where(similarities >= query.similarity_threshold)[0] valid_similarities = similarities[valid_indices] # Sort by similarity (descending) sorted_indices = valid_indices[np.argsort(valid_similarities)[::-1]] # Apply filters and limit results results = [] for rank, idx in enumerate(sorted_indices[:query.max_results]): document = self.documents[idx] # Apply filters if specified if query.filters and not self._apply_filters(document, query.filters): continue results.append(SearchResult( document=document, similarity_score=float(similarities[idx]), rank=rank + 1 )) search_time = (time.time() - start_time) * 1000 return SearchResponse( query=query.query, results=results, total_matches=len(results), search_time_ms=search_time ) def _apply_filters(self, document: Document, filters: Dict[str, Any]) -> bool: """Apply metadata filters to document""" for key, value in filters.items(): if key not in document.metadata: return False if document.metadata[key] != value: return False return True # Global search engine instance search_engine = None async def initialize_search_engine(embedding_url: str, documents_data: List[Dict] = None): """Initialize the search engine with 
documents""" global search_engine search_engine = SemanticSearchEngine(embedding_url) if documents_data: documents = [Document(**doc_data) for doc_data in documents_data] await search_engine.add_documents(documents) async def run(inputs: Dict[str, Any]) -> Dict[str, Any]: """Main search service entry point""" global search_engine action = inputs.get("action", "search") if action == "initialize": embedding_url = inputs["embedding_service_url"] documents_data = inputs.get("documents", []) await initialize_search_engine(embedding_url, documents_data) return {"status": "initialized", "document_count": len(documents_data)} elif action == "add_document": document_data = inputs["document"] document = Document(**document_data) await search_engine.add_document(document) return {"status": "added", "document_id": document.id} elif action == "add_documents": documents_data = inputs["documents"] documents = [Document(**doc_data) for doc_data in documents_data] await search_engine.add_documents(documents) return {"status": "added", "count": len(documents)} elif action == "search": query_data = inputs["query"] query = SearchQuery(**query_data) response = await search_engine.search(query) return response.dict() else: raise ValueError(f"Unknown action: {action}") ``` ## Complete Example Implementation ### Document Indexing Service ```python from chutes.image import Image from chutes.chute import Chute, NodeSelector # Custom image with search dependencies search_image = ( Image( username="myuser", name="semantic-search", tag="1.0.0", python_version="3.11" ) .pip_install([ "scikit-learn==1.3.0", "numpy==1.24.3", "httpx==0.25.0", "pydantic==2.4.2", "fastapi==0.104.1", "uvicorn==0.24.0" ]) ) # Deploy search service search_chute = Chute( username="myuser", name="semantic-search-service", image=search_image, entry_file="search_engine.py", entry_point="run", node_selector=NodeSelector( gpu_count=0, # CPU-only for search logic), timeout_seconds=300, concurrency=10 ) result = 
search_chute.deploy() print(f"Search service deployed: {result}") ``` ### Usage Examples #### Initialize with Documents ```python # Sample documents documents = [ { "id": "doc1", "content": "Artificial intelligence is transforming healthcare through machine learning algorithms.", "metadata": {"category": "healthcare", "author": "Dr. Smith", "year": 2024} }, { "id": "doc2", "content": "Machine learning models can predict patient outcomes with high accuracy.", "metadata": {"category": "healthcare", "author": "Dr. Johnson", "year": 2024} }, { "id": "doc3", "content": "Climate change affects global weather patterns and ocean temperatures.", "metadata": {"category": "environment", "author": "Prof. Green", "year": 2023} } ] # Initialize search service response = search_chute.run({ "action": "initialize", "embedding_service_url": "https://your-embedding-service.com", "documents": documents }) print(f"Initialized: {response}") ``` #### Perform Searches ```python # Search for healthcare AI content search_response = search_chute.run({ "action": "search", "query": { "query": "AI in medical diagnosis", "max_results": 5, "similarity_threshold": 0.6, "filters": {"category": "healthcare"} } }) print(f"Found {search_response['total_matches']} results:") for result in search_response['results']: print(f"- {result['document']['id']}: {result['similarity_score']:.3f}") ``` #### Add New Documents ```python # Add new document to index new_doc = { "id": "doc4", "content": "Deep learning networks excel at image recognition tasks in medical imaging.", "metadata": {"category": "healthcare", "author": "Dr. 
Lee", "year": 2024} } response = search_chute.run({ "action": "add_document", "document": new_doc }) print(f"Added document: {response}") ``` ## Advanced Features ### Multi-Modal Search ```python class MultiModalDocument(BaseModel): id: str text_content: str image_path: Optional[str] = None text_embedding: Optional[List[float]] = None image_embedding: Optional[List[float]] = None metadata: Dict[str, Any] = Field(default_factory=dict) class MultiModalSearchEngine(SemanticSearchEngine): def __init__(self, text_embedding_url: str, image_embedding_url: str): super().__init__(text_embedding_url) self.image_embedding_url = image_embedding_url async def embed_image(self, image_path: str) -> List[float]: """Generate embedding for image using CLIP or similar""" import httpx async with httpx.AsyncClient() as client: with open(image_path, "rb") as f: files = {"image": f} response = await client.post( f"{self.image_embedding_url}/embed", files=files ) response.raise_for_status() return response.json() async def hybrid_search(self, text_query: str, image_query: str = None, text_weight: float = 0.7) -> SearchResponse: """Perform hybrid text + image search""" text_embedding = await self.embed_text(text_query) if image_query: image_embedding = await self.embed_image(image_query) # Combine embeddings with weights combined_embedding = ( np.array(text_embedding) * text_weight + np.array(image_embedding) * (1 - text_weight) ) else: combined_embedding = np.array(text_embedding) # Perform similarity search with combined embedding # Implementation similar to regular search... 
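# --- Hypothetical helper: the ranking step elided above --------------------
# The comment above skips the actual similarity search. As a standalone
# sketch (rank_by_cosine is not part of the original engine), scoring a
# combined query embedding against an index matrix could look like this:
import numpy as np

def rank_by_cosine(query_embedding, index_matrix, top_k=10):
    """Return up to top_k (row_index, cosine_similarity) pairs, best first."""
    q = np.asarray(query_embedding, dtype=float)
    q = q / np.linalg.norm(q)
    m = np.asarray(index_matrix, dtype=float)
    # Normalize each row so plain dot products become cosine similarities
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    sims = m @ q
    order = np.argsort(sims)[::-1][:top_k]
    return [(int(i), float(sims[i])) for i in order]

# e.g. call rank_by_cosine(combined_embedding, self.embeddings_matrix) inside
# hybrid_search, then map the returned row indices back to self.documents.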
``` ### Real-time Search API ```python from fastapi import FastAPI, HTTPException from fastapi.middleware.cors import CORSMiddleware app = FastAPI(title="Semantic Search API") app.add_middleware( CORSMiddleware, allow_origins=["*"], allow_credentials=True, allow_methods=["*"], allow_headers=["*"]) @app.post("/search") async def search_documents(query: SearchQuery) -> SearchResponse: """Search documents endpoint""" try: return await search_engine.search(query) except Exception as e: raise HTTPException(status_code=500, detail=str(e)) @app.post("/documents") async def add_document(document: Document) -> Dict[str, str]: """Add document endpoint""" try: await search_engine.add_document(document) return {"status": "success", "document_id": document.id} except Exception as e: raise HTTPException(status_code=500, detail=str(e)) @app.get("/health") async def health_check(): """Health check endpoint""" return {"status": "healthy", "documents": len(search_engine.documents)} # Run with: uvicorn app:app --host 0.0.0.0 --port 8000 ``` ### Vector Database Integration ```python import chromadb from chromadb.config import Settings class VectorDBSearchEngine: def __init__(self, embedding_service_url: str): self.embedding_service_url = embedding_service_url self.client = chromadb.Client(Settings( chroma_db_impl="duckdb+parquet", persist_directory="./chroma_db" )) self.collection = self.client.get_or_create_collection( name="documents", metadata={"hnsw:space": "cosine"} ) async def add_documents(self, documents: List[Document]): """Add documents to vector database""" # Generate embeddings texts = [doc.content for doc in documents] embeddings = await self._embed_batch(texts) # Add to ChromaDB self.collection.add( embeddings=embeddings, documents=texts, metadatas=[doc.metadata for doc in documents], ids=[doc.id for doc in documents] ) async def search(self, query: SearchQuery) -> SearchResponse: """Search using vector database""" query_embedding = await self.embed_text(query.query) 
results = self.collection.query( query_embeddings=[query_embedding], n_results=query.max_results, where=query.filters if query.filters else None ) # Format response search_results = [] for i, (doc_id, content, metadata, distance) in enumerate(zip( results['ids'][0], results['documents'][0], results['metadatas'][0], results['distances'][0] )): similarity = 1 - distance # Convert distance to similarity if similarity >= query.similarity_threshold: search_results.append(SearchResult( document=Document( id=doc_id, content=content, metadata=metadata ), similarity_score=similarity, rank=i + 1 )) return SearchResponse( query=query.query, results=search_results, total_matches=len(search_results), search_time_ms=0 # ChromaDB handles timing ) ``` ## Production Deployment ### Scalable Architecture ```python # High-performance embedding service embedding_chute = build_tei_chute( username="mycompany", model_name="sentence-transformers/all-mpnet-base-v2", node_selector=NodeSelector( gpu_count=2, min_vram_gb_per_gpu=16, preferred_provider="runpod" ), concurrency=16, auto_scale=True, min_instances=2, max_instances=8 ) # Search service with caching search_chute = Chute( username="mycompany", name="search-prod", image=search_image, entry_file="search_api.py", entry_point="app", node_selector=NodeSelector( gpu_count=0), environment={ "REDIS_URL": "redis://cache.example.com:6379", "VECTOR_DB_PATH": "/data/chroma", "EMBEDDING_SERVICE_URL": embedding_chute.url }, timeout_seconds=300, concurrency=20 ) ``` ### Performance Monitoring ```python from prometheus_client import Counter, Histogram, start_http_server import time # Metrics SEARCH_REQUESTS = Counter('search_requests_total', 'Total search requests') SEARCH_DURATION = Histogram('search_duration_seconds', 'Search duration') EMBEDDING_CACHE_HITS = Counter('embedding_cache_hits_total', 'Cache hits') class MonitoredSearchEngine(SemanticSearchEngine): async def search(self, query: SearchQuery) -> SearchResponse: SEARCH_REQUESTS.inc() with 
SEARCH_DURATION.time(): return await super().search(query) async def embed_text(self, text: str) -> List[float]: # Check cache first cache_key = f"embed:{hash(text)}" cached = await self._get_from_cache(cache_key) if cached: EMBEDDING_CACHE_HITS.inc() return cached # Generate new embedding embedding = await super().embed_text(text) await self._store_in_cache(cache_key, embedding) return embedding # Start metrics server start_http_server(8001) ``` ## Next Steps - **[Text Embeddings Guide](../templates/tei)** - Deep dive into embedding models - **[Vector Databases](vector-databases)** - Scalable vector storage solutions - **[RAG Applications](rag-applications)** - Retrieval-Augmented Generation - **[Performance Optimization](../guides/performance)** - Scale your search service For enterprise-scale deployments, see the [Production Search Architecture](../guides/production-search) guide. --- ## SOURCE: https://chutes.ai/docs/examples/simple-text-analysis # Simple Text Analysis Chute This example shows how to build a basic text analysis service using transformers and custom API endpoints. Perfect for getting started with custom Chutes. ## What We'll Build A simple text sentiment analysis service that: - 📊 **Analyzes sentiment** using a pre-trained model - 🔍 **Validates input** with Pydantic schemas - 🚀 **Provides REST API** for easy integration - 📦 **Uses custom Docker image** with optimized dependencies ## Complete Example ### `sentiment_analyzer.py` ````python import torch from transformers import AutoTokenizer, AutoModelForSequenceClassification from pydantic import BaseModel, Field from fastapi import HTTPException from chutes.chute import Chute, NodeSelector from chutes.image import Image # === INPUT/OUTPUT SCHEMAS === class TextInput(BaseModel): text: str = Field(..., min_length=5, max_length=1000, description="Text to analyze") class Config: schema_extra = { "example": { "text": "I love using this new AI service!" 
} } class SentimentResult(BaseModel): text: str sentiment: str # POSITIVE or NEGATIVE (DistilBERT SST-2 has only two classes) confidence: float processing_time: float # === CUSTOM IMAGE === image = ( Image(username="myuser", name="sentiment-analyzer", tag="1.0") .from_base("nvidia/cuda:12.4.1-runtime-ubuntu22.04") .with_python("3.11") .run_command("pip install 'torch>=2.4.0' 'transformers>=4.44.0' 'accelerate>=0.33.0'") .with_env("TRANSFORMERS_CACHE", "/app/models") .run_command("mkdir -p /app/models") ) # === CHUTE DEFINITION === chute = Chute( username="myuser", name="sentiment-analyzer", image=image, tagline="Simple sentiment analysis with transformers", readme=""" # Sentiment Analyzer A simple sentiment analysis service using DistilBERT. ## Usage Send a POST request to `/analyze`: ```bash curl -X POST https://myuser-sentiment-analyzer.chutes.ai/analyze \\ -H "Content-Type: application/json" \\ -d '{"text": "I love this product!"}' ``` ## Response ```json { "text": "I love this product!", "sentiment": "POSITIVE", "confidence": 0.99, "processing_time": 0.05 } ``` """, node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=8 ) ) # === MODEL LOADING === @chute.on_startup() async def load_model(self): """Load the sentiment analysis model on startup.""" print("Loading sentiment analysis model...") model_name = "distilbert-base-uncased-finetuned-sst-2-english" # Load tokenizer and model self.tokenizer = AutoTokenizer.from_pretrained(model_name) self.model = AutoModelForSequenceClassification.from_pretrained(model_name) # Move to GPU if available self.device = "cuda" if torch.cuda.is_available() else "cpu" self.model.to(self.device) self.model.eval() # Set to evaluation mode print(f"Model loaded on device: {self.device}") # === API ENDPOINTS === @chute.cord( public_api_path="/analyze", method="POST", input_schema=TextInput, output_content_type="application/json" ) async def analyze_sentiment(self, data: TextInput) -> SentimentResult: """Analyze the sentiment of the input text.""" import time
start_time = time.time() try: # Tokenize input inputs = self.tokenizer( data.text, return_tensors="pt", truncation=True, padding=True, max_length=512 ).to(self.device) # Run inference with torch.no_grad(): outputs = self.model(**inputs) predictions = torch.nn.functional.softmax(outputs.logits, dim=-1) # Get results labels = ["NEGATIVE", "POSITIVE"] # DistilBERT SST-2 labels predicted_class = predictions.argmax(dim=-1).item() confidence = predictions[0][predicted_class].item() processing_time = time.time() - start_time return SentimentResult( text=data.text, sentiment=labels[predicted_class], confidence=confidence, processing_time=processing_time ) except Exception as e: raise HTTPException(status_code=500, detail=f"Analysis failed: {str(e)}") @chute.cord( public_api_path="/health", method="GET", output_content_type="application/json" ) async def health_check(self) -> dict: """Simple health check endpoint.""" return { "status": "healthy", "model_loaded": hasattr(self, 'model'), "device": getattr(self, 'device', 'unknown') } # Test the chute locally (optional) if __name__ == "__main__": import asyncio async def test_locally(): # Simulate startup await load_model(chute) # Test analysis test_input = TextInput(text="I love this new AI service!") result = await analyze_sentiment(chute, test_input) print(f"Result: {result}") asyncio.run(test_locally()) ```` ## Step-by-Step Breakdown ### 1. Define Input/Output Schemas ```python class TextInput(BaseModel): text: str = Field(..., min_length=5, max_length=1000) ``` - **Validation**: Ensures text is between 5 and 1000 characters - **Documentation**: Provides clear API documentation - **Type Safety**: Automatic parsing and validation ### 2.
Build Custom Image ```python image = ( Image(username="myuser", name="sentiment-analyzer", tag="1.0") .from_base("nvidia/cuda:12.4.1-runtime-ubuntu22.04") .with_python("3.11") .run_command("pip install 'torch>=2.4.0' 'transformers>=4.44.0'") ) ``` - **Base Image**: CUDA-enabled Ubuntu for GPU support - **Dependencies**: Only what we need for sentiment analysis - **Optimization**: Runtime image (smaller than devel) ### 3. Model Loading ```python @chute.on_startup() async def load_model(self): self.tokenizer = AutoTokenizer.from_pretrained(model_name) self.model = AutoModelForSequenceClassification.from_pretrained(model_name) self.model.to(self.device) ``` - **Startup Hook**: Load model once when chute starts - **GPU Support**: Automatically use GPU if available - **State Management**: Store in chute instance ### 4. API Endpoint ```python @chute.cord(public_api_path="/analyze", input_schema=TextInput) async def analyze_sentiment(self, data: TextInput) -> SentimentResult: # Process the input return SentimentResult(...) ``` - **Path Mapping**: Creates `/analyze` endpoint - **Input Validation**: Automatic validation using schema - **Typed Response**: Structured output with SentimentResult ## Building and Deploying ### 1. Build the Image ```bash chutes build sentiment_analyzer:chute --wait ``` ### 2. Deploy the Chute ```bash chutes deploy sentiment_analyzer:chute ``` ### 3. Test Your Deployment ```bash curl -X POST https://myuser-sentiment-analyzer.chutes.ai/analyze \ -H "Content-Type: application/json" \ -d '{"text": "This is amazing!"}' ``` Expected response: ```json { "text": "This is amazing!", "sentiment": "POSITIVE", "confidence": 0.99, "processing_time": 0.05 } ``` ## Testing Different Texts ```python import requests texts = [ "I love this product!", # Should be POSITIVE "This is terrible.", # Should be NEGATIVE "It's okay, nothing special.", # Could be NEGATIVE or POSITIVE "Amazing technology!", # Should be POSITIVE "Poor quality."
# Should be NEGATIVE ] for text in texts: response = requests.post( "https://myuser-sentiment-analyzer.chutes.ai/analyze", json={"text": text} ) result = response.json() print(f"'{text}' -> {result['sentiment']} ({result['confidence']:.2f})") ``` ## Key Concepts Learned ### 1. **Custom Images** - How to build optimized Docker environments - Installing Python packages efficiently - Setting environment variables ### 2. **Model Management** - Loading models at startup (not per request) - GPU detection and utilization - Memory optimization ### 3. **API Design** - Input validation with Pydantic - Structured responses - Error handling ### 4. **Performance** - Model reuse across requests - Efficient tokenization - GPU acceleration ## Next Steps Now that you understand the basics, try: - **[Streaming Responses](../examples/streaming-responses)** - Real-time analysis - **[Batch Processing](../examples/batch-processing)** - Process multiple texts - **[Multi-Model Setup](../examples/multi-model-analysis)** - Combine multiple models - **[Custom Image Building](../guides/custom-images)** - Advanced Docker ## Common Issues & Solutions **Model not loading?** - Check GPU requirements in NodeSelector - Verify model name is correct - Ensure sufficient VRAM **Slow responses?** - Model loads on first request (normal) - Consider warming up with health check - Check GPU utilization **Out of memory?** - Reduce max_length in tokenizer - Use smaller model variant - Increase VRAM requirements --- ## SOURCE: https://chutes.ai/docs/examples/streaming-responses # Streaming Responses This example shows how to create **streaming API endpoints** that send results in real-time as they become available. Perfect for long-running AI tasks where you want to show progress. ## Real-time Text Streaming Real-time text streaming allows you to process and return results as they become available, providing immediate feedback to users instead of waiting for all processing to complete. 
This is especially valuable for: - **Long-running AI operations** - Show progress during model inference - **Interactive applications** - Provide immediate feedback as users type - **Large text processing** - Stream results chunk by chunk - **Multi-step workflows** - Display each processing step as it completes ## What We'll Build A text processing service that streams results as they're computed: - 🌊 **Streaming responses** with real-time updates - 📊 **Progress tracking** for long operations - 🔄 **Multiple processing steps** shown incrementally - 📝 **Chunked text processing** for large inputs ## Complete Example ### `streaming_processor.py` ```python import asyncio import time import json from typing import AsyncGenerator from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline from pydantic import BaseModel, Field from fastapi import HTTPException from chutes.chute import Chute, NodeSelector from chutes.image import Image # === INPUT SCHEMAS === class StreamingTextInput(BaseModel): text: str = Field(..., min_length=10, max_length=5000) include_sentiment: bool = Field(True, description="Include sentiment analysis") include_summary: bool = Field(True, description="Include text summarization") include_entities: bool = Field(False, description="Include named entity recognition") chunk_size: int = Field(200, ge=50, le=500, description="Text chunk size for processing") # === CUSTOM IMAGE === image = ( Image(username="myuser", name="streaming-processor", tag="1.0") .from_base("nvidia/cuda:12.4.1-runtime-ubuntu22.04") .with_python("3.11") .run_command("pip install 'torch>=2.4.0' 'transformers>=4.44.0' 'accelerate>=0.33.0' 'spacy>=3.7.0'") .run_command("python -m spacy download en_core_web_sm") .with_env("TRANSFORMERS_CACHE", "/app/models") ) # === CHUTE DEFINITION === chute = Chute( username="myuser", name="streaming-processor", image=image, tagline="Real-time streaming text processing", readme=""" # Streaming Text Processor Process text with
real-time streaming results. ## Usage ```bash curl -X POST https://myuser-streaming-processor.chutes.ai/process-stream \\ -H "Content-Type: application/json" \\ -d '{"text": "Your long text here..."}' \\ --no-buffer ``` Each response line contains JSON with the current processing step. """, node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=12 ) ) # === MODEL LOADING === @chute.on_startup() async def load_models(self): """Load all models needed for processing.""" print("Loading models for streaming processing...") import torch # Sentiment analysis model sentiment_model_name = "cardiffnlp/twitter-roberta-base-sentiment-latest" self.sentiment_tokenizer = AutoTokenizer.from_pretrained(sentiment_model_name) self.sentiment_model = AutoModelForSequenceClassification.from_pretrained(sentiment_model_name) # Summarization pipeline self.summarizer = pipeline( "summarization", model="facebook/bart-large-cnn", device=0 if torch.cuda.is_available() else -1 ) # Named entity recognition (spaCy) import spacy self.nlp = spacy.load("en_core_web_sm") # Move sentiment model to GPU self.device = "cuda" if torch.cuda.is_available() else "cpu" self.sentiment_model.to(self.device) print(f"All models loaded on device: {self.device}") # === STREAMING ENDPOINTS === @chute.cord( public_api_path="/process-stream", method="POST", input_schema=StreamingTextInput, stream=True, output_content_type="application/json" ) async def process_text_stream(self, data: StreamingTextInput) -> AsyncGenerator[dict, None]: """Process text with streaming results.""" start_time = time.time() # Initial status yield { "status": "started", "message": "Beginning text processing...", "timestamp": time.time(), "text_length": len(data.text) } # Step 1: Text chunking yield {"status": "chunking", "message": "Splitting text into chunks..."} chunks = [] text = data.text for i in range(0, len(text), data.chunk_size): chunk = text[i:i + data.chunk_size] chunks.append(chunk) yield { "status": "chunked", "message": 
f"Split into {len(chunks)} chunks", "chunks": len(chunks) } # Step 2: Sentiment Analysis (if requested) if data.include_sentiment: yield {"status": "sentiment_processing", "message": "Analyzing sentiment..."} import torch try: sentiments = [] for i, chunk in enumerate(chunks): # Process chunk inputs = self.sentiment_tokenizer( chunk, return_tensors="pt", truncation=True, max_length=512 ).to(self.device) with torch.no_grad(): outputs = self.sentiment_model(**inputs) predictions = torch.nn.functional.softmax(outputs.logits, dim=-1) # Get sentiment labels = ["NEGATIVE", "NEUTRAL", "POSITIVE"] predicted_class = predictions.argmax().item() confidence = predictions[0][predicted_class].item() chunk_sentiment = { "chunk": i + 1, "sentiment": labels[predicted_class], "confidence": confidence } sentiments.append(chunk_sentiment) # Stream progress yield { "status": "sentiment_progress", "progress": (i + 1) / len(chunks), "chunk_result": chunk_sentiment } # Small delay to show streaming effect await asyncio.sleep(0.1) # Overall sentiment positive_chunks = sum(1 for s in sentiments if s["sentiment"] == "POSITIVE") negative_chunks = sum(1 for s in sentiments if s["sentiment"] == "NEGATIVE") if positive_chunks > negative_chunks: overall_sentiment = "POSITIVE" elif negative_chunks > positive_chunks: overall_sentiment = "NEGATIVE" else: overall_sentiment = "NEUTRAL" yield { "status": "sentiment_complete", "overall_sentiment": overall_sentiment, "chunk_sentiments": sentiments, "positive_chunks": positive_chunks, "negative_chunks": negative_chunks } except Exception as e: yield {"status": "sentiment_error", "error": str(e)} # Step 3: Summarization (if requested) if data.include_summary and len(data.text) > 100: yield {"status": "summarization_processing", "message": "Generating summary..."} try: # Summarize the full text summary_result = self.summarizer( data.text, max_length=130, min_length=30, do_sample=False ) summary = summary_result[0]['summary_text'] yield { "status": 
"summarization_complete", "summary": summary, "compression_ratio": len(summary) / len(data.text) } except Exception as e: yield {"status": "summarization_error", "error": str(e)} # Step 4: Named Entity Recognition (if requested) if data.include_entities: yield {"status": "entities_processing", "message": "Extracting entities..."} import spacy try: doc = self.nlp(data.text) entities = [] for ent in doc.ents: entities.append({ "text": ent.text, "label": ent.label_, "description": spacy.explain(ent.label_), "start": ent.start_char, "end": ent.end_char }) # Group by entity type entity_types = {} for ent in entities: label = ent["label"] if label not in entity_types: entity_types[label] = [] entity_types[label].append(ent) yield { "status": "entities_complete", "entities": entities, "entity_types": entity_types, "total_entities": len(entities) } except Exception as e: yield {"status": "entities_error", "error": str(e)} # Final status total_time = time.time() - start_time yield { "status": "completed", "message": "All processing complete!", "total_processing_time": total_time, "timestamp": time.time() } @chute.cord( public_api_path="/generate-stream", method="POST", stream=True, output_content_type="text/plain" ) async def generate_text_stream(self, prompt: str) -> AsyncGenerator[str, None]: """Simple text generation with streaming (simulated).""" # Simulate text generation word by word words = [ "Artificial", "intelligence", "is", "revolutionizing", "how", "we", "process", "and", "understand", "text", "data.", "With", "advanced", "models", "like", "transformers,", "we", "can", "perform", "complex", "natural", "language", "tasks", "with", "unprecedented", "accuracy." 
] yield f"Prompt: {prompt}\n\nGenerated text: " for word in words: yield word + " " await asyncio.sleep(0.2) # Simulate processing time yield "\n\n[Generation complete]" # === REGULAR (NON-STREAMING) ENDPOINT FOR COMPARISON === @chute.cord( public_api_path="/process-batch", method="POST", input_schema=StreamingTextInput, output_content_type="application/json" ) async def process_text_batch(self, data: StreamingTextInput) -> dict: """Non-streaming version that returns all results at once.""" import torch start_time = time.time() results = {} # Sentiment analysis if data.include_sentiment: inputs = self.sentiment_tokenizer( data.text, return_tensors="pt", truncation=True, max_length=512 ).to(self.device) with torch.no_grad(): outputs = self.sentiment_model(**inputs) predictions = torch.nn.functional.softmax(outputs.logits, dim=-1) labels = ["NEGATIVE", "NEUTRAL", "POSITIVE"] predicted_class = predictions.argmax().item() confidence = predictions[0][predicted_class].item() results["sentiment"] = { "label": labels[predicted_class], "confidence": confidence } # Summarization if data.include_summary and len(data.text) > 100: summary_result = self.summarizer(data.text, max_length=130, min_length=30) results["summary"] = summary_result[0]['summary_text'] # Entities if data.include_entities: doc = self.nlp(data.text) entities = [{"text": ent.text, "label": ent.label_} for ent in doc.ents] results["entities"] = entities results["processing_time"] = time.time() - start_time return results ``` ## Testing the Streaming API ### Using curl ```bash # Test streaming processing curl -X POST https://myuser-streaming-processor.chutes.ai/process-stream \ -H "Content-Type: application/json" \ -d '{ "text": "I love using this amazing new technology! It has completely transformed how I work. The artificial intelligence capabilities are impressive and the user interface is intuitive. 
However, there are still some areas that could be improved.", "include_sentiment": true, "include_summary": true, "include_entities": true, "chunk_size": 100 }' \ --no-buffer ``` ### Using Python ```python import asyncio import aiohttp import json async def stream_text_processing(): """Test the streaming text processing endpoint.""" payload = { "text": """ Artificial intelligence is rapidly transforming industries across the globe. Companies like Google, Microsoft, and OpenAI are leading the charge with innovative models and applications. The technology is being used in healthcare, finance, education, and many other sectors. While the potential is enormous, there are also important ethical considerations that need to be addressed. """, "include_sentiment": True, "include_summary": True, "include_entities": True, "chunk_size": 150 } async with aiohttp.ClientSession() as session: url = "https://myuser-streaming-processor.chutes.ai/process-stream" async with session.post(url, json=payload) as response: async for line in response.content: if line: try: data = json.loads(line.decode()) print(f"[{data['status']}] {data.get('message', '')}") # Handle specific result types if data['status'] == 'sentiment_complete': print(f"Overall sentiment: {data['overall_sentiment']}") elif data['status'] == 'summarization_complete': print(f"Summary: {data['summary']}") elif data['status'] == 'entities_complete': print(f"Found {data['total_entities']} entities") except json.JSONDecodeError: continue # Run the test asyncio.run(stream_text_processing()) ``` ### Using JavaScript/Browser ```javascript async function streamTextProcessing() { const response = await fetch('/process-stream', { method: 'POST', headers: { 'Content-Type': 'application/json' }, body: JSON.stringify({ text: 'Your text here...', include_sentiment: true, include_summary: true, chunk_size: 200 }) }); const reader = response.body.getReader(); const decoder = new TextDecoder(); while (true) { const { done, value } = await 
reader.read(); if (done) break; const lines = decoder.decode(value).split('\n'); for (const line of lines) { if (line.trim()) { try { const data = JSON.parse(line); console.log(`[${data.status}]`, data.message || ''); // Update UI based on status updateProgressUI(data); } catch (e) { // Skip invalid JSON } } } } } function updateProgressUI(data) { const statusDiv = document.getElementById('status'); const resultsDiv = document.getElementById('results'); statusDiv.textContent = data.message || data.status; if (data.status === 'sentiment_complete') { resultsDiv.innerHTML += `
Sentiment: ${data.overall_sentiment}
`; } else if (data.status === 'summarization_complete') { resultsDiv.innerHTML += `
Summary: ${data.summary}
`; } } ``` ## Key Streaming Concepts ### 1. **AsyncGenerator Pattern** ```python async def my_stream() -> AsyncGenerator[dict, None]: for i in range(10): yield {"step": i, "data": f"Processing item {i}"} await asyncio.sleep(0.1) # Simulate work ``` ### 2. **Progress Tracking** ```python total_items = len(items) for i, item in enumerate(items): # Process item result = await process_item(item) # Yield progress yield { "status": "processing", "progress": (i + 1) / total_items, "current_item": i + 1, "total_items": total_items, "result": result } ``` ### 3. **Error Handling in Streams** ```python try: result = await risky_operation() yield {"status": "success", "result": result} except Exception as e: yield {"status": "error", "error": str(e)} # Continue with other operations if possible ``` ### 4. **Multiple Content Types** ```python # JSON streaming @chute.cord(stream=True, output_content_type="application/json") async def json_stream(self): yield {"message": "JSON data"} # Plain text streaming @chute.cord(stream=True, output_content_type="text/plain") async def text_stream(self): yield "Plain text data\n" ``` ## Performance Considerations ### 1. **Chunk Size Optimization** ```python # Too small: many HTTP chunks, overhead chunk_size = 10 # Too large: delayed responses, memory usage chunk_size = 10000 # Just right: balance responsiveness and efficiency chunk_size = 200 ``` ### 2. **Async Processing** ```python # Good: Non-blocking delays await asyncio.sleep(0.1) # Bad: Blocking operations (use sparingly) time.sleep(0.1) ``` ### 3. **Memory Management** ```python # Process in chunks to avoid memory issues async def process_large_text(text: str): chunk_size = 1000 for i in range(0, len(text), chunk_size): chunk = text[i:i + chunk_size] result = await process_chunk(chunk) yield {"chunk": i // chunk_size, "result": result} # Chunk is automatically garbage collected ``` ## Use Cases for Streaming ### 1. 
**Long-Running AI Tasks** - Model training progress - Large text processing - Image/video generation ### 2. **Real-Time Analysis** - Live sentiment monitoring - Stream processing - Progressive enhancement ### 3. **User Experience** - Show progress to users - Provide intermediate results - Reduce perceived latency ## Next Steps - **[Batch Processing](../examples/batch-processing)** - Handle multiple inputs efficiently - **[Multi-Model Analysis](../examples/multi-model-analysis)** - Combine different AI models - **[Custom Images Guide](../guides/custom-images)** - Advanced Docker setups - **[Performance Optimization](../guides/performance)** - Speed up your chutes --- ## SOURCE: https://chutes.ai/docs/examples/text-to-speech # Text-to-Speech with CSM-1B This guide demonstrates how to build a sophisticated text-to-speech (TTS) service using CSM-1B (Conversational Speech Model), capable of generating natural-sounding speech with context awareness and multiple speaker support. ## Overview CSM-1B from Sesame is a state-of-the-art speech generation model that: - Generates high-quality speech from text input - Supports multiple speakers (2 speakers available) - Uses context from previous audio/text for continuity - Employs a Llama backbone with a specialized audio decoder - Produces Mimi audio codes for natural speech output - Supports configurable duration limits ## Complete Implementation ### Input Schema Design Define comprehensive input validation for TTS generation: ```python from pydantic import BaseModel, Field from typing import Optional, List class Context(BaseModel): text: str speaker: int = Field(0, ge=0, le=1) audio_b64: str # Base64 encoded reference audio class InputArgs(BaseModel): text: str context: Optional[List[Context]] = [] speaker: Optional[int] = Field(1, ge=0, le=1) max_duration_ms: Optional[int] = 10000 # Maximum output duration ``` ### Custom Image with CSM-1B Build a custom image with all required dependencies: ```python from chutes.image import
Image from chutes.chute import Chute, NodeSelector image = ( Image( username="myuser", name="csm-1b", tag="0.0.2", readme="## Text-to-speech using sesame/csm-1b") .from_base("parachutes/base-python:3.12.9") .run_command( "pip install -r https://huggingface.co/chutesai/csm-1b/resolve/main/requirements.txt" ) .run_command("pip install pybase64") # For audio encoding/decoding .run_command( "wget -O /app/generator.py https://huggingface.co/chutesai/csm-1b/resolve/main/generator.py" ) .run_command( "wget -O /app/models.py https://huggingface.co/chutesai/csm-1b/resolve/main/models.py" ) .run_command( "wget -O /app/watermarking.py https://huggingface.co/chutesai/csm-1b/resolve/main/watermarking.py" ) ) ``` ### Chute Configuration Configure the service with appropriate GPU requirements: ```python chute = Chute( username="myuser", name="csm-1b-tts", tagline="High-quality text-to-speech with CSM-1B", readme="CSM (Conversational Speech Model) generates natural speech from text with context awareness and multiple speaker support.", image=image, node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24 # 24GB required for optimal performance )) ``` ### Model Initialization Load and initialize the CSM-1B model on startup: ```python @chute.on_startup() async def initialize(self): """ Initialize the CSM-1B model and perform warmup. 
""" from huggingface_hub import snapshot_download from generator import Generator from models import Model import torchaudio import torch # Download the model with specific revision revision = "01e2ed64be01915391ec7881f666d6dda0e1d509" snapshot_download("chutesai/csm-1b", revision=revision) # Store torchaudio for later use self.torchaudio = torchaudio # Initialize the model model = Model.from_pretrained("chutesai/csm-1b", revision=revision) model.to(device="cuda", dtype=torch.bfloat16) # Create the generator self.generator = Generator(model) # Warmup generation to load models into memory _ = self.generator.generate( text="Warming up Sesame...", speaker=0, context=[], max_audio_length_ms=10000) ``` ### Audio Processing Utilities Add utilities for handling audio input and output: ```python import pybase64 as base64 import tempfile import os from io import BytesIO from loguru import logger from fastapi import HTTPException, status def load_audio(self, audio_b64): """ Convert base64 audio data into audio tensor. Ensures the output is a 1D tensor [T] for compatibility. 
""" try: # Decode base64 to audio bytes audio_bytes = BytesIO(base64.b64decode(audio_b64)) # Save to temporary file for processing with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as temp_file: temp_file.write(audio_bytes.getvalue()) temp_path = temp_file.name # Load audio with torchaudio waveform, sample_rate = self.torchaudio.load(temp_path) os.unlink(temp_path) # Clean up temp file # Convert to mono if stereo if waveform.shape[0] > 1: waveform = waveform.mean(dim=0) else: waveform = waveform.squeeze(0) # Resample to model's expected sample rate audio_tensor = self.torchaudio.functional.resample( waveform, orig_freq=sample_rate, new_freq=self.generator.sample_rate) # Ensure 1D tensor if audio_tensor.dim() > 1: audio_tensor = audio_tensor.squeeze() return audio_tensor except Exception as exc: logger.error(f"Error loading audio: {exc}") raise HTTPException( status_code=status.HTTP_400_BAD_REQUEST, detail=f"Invalid input audio_b64 provided: {exc}") ``` ### Text-to-Speech Endpoint Create the main TTS generation endpoint: ```python import uuid from fastapi import Response @chute.cord( public_api_path="/speak", public_api_method="POST", stream=False, output_content_type="audio/wav") async def speak(self, args: InputArgs) -> Response: """ Convert text to speech with optional context. 
""" from generator import Segment # Process context if provided segments = [] if args.context: for ctx in args.context: audio_tensor = load_audio(self, ctx.audio_b64) segments.append( Segment( text=ctx.text, speaker=ctx.speaker, audio=audio_tensor) ) # Generate speech audio audio = self.generator.generate( text=args.text, speaker=args.speaker, context=segments, max_audio_length_ms=args.max_duration_ms) # Save to temporary file path = f"/tmp/{uuid.uuid4()}.wav" self.torchaudio.save( path, audio.unsqueeze(0).cpu(), self.generator.sample_rate ) try: # Return audio file with open(path, "rb") as infile: return Response( content=infile.read(), media_type="audio/wav", headers={ "Content-Disposition": f"attachment; filename={uuid.uuid4()}.wav", }) finally: # Clean up temporary file if os.path.exists(path): os.remove(path) ``` ## Advanced Features ### Multi-Speaker Conversation Create an endpoint for generating a conversation between speakers: ```python class ConversationTurn(BaseModel): speaker: int = Field(ge=0, le=1) text: str pause_ms: Optional[int] = Field(default=500, ge=0, le=2000) class ConversationInput(BaseModel): turns: List[ConversationTurn] max_total_duration_ms: int = Field(default=30000, ge=5000, le=60000) @chute.cord(public_api_path="/conversation", method="POST") async def generate_conversation(self, args: ConversationInput) -> Response: """Generate a conversation between multiple speakers.""" from generator import Segment import torch # needed for torch.zeros/torch.cat below conversation_audio = [] context_segments = [] for turn in args.turns: # Generate speech for this turn with accumulated context audio = self.generator.generate( text=turn.text, speaker=turn.speaker, context=context_segments, max_audio_length_ms=args.max_total_duration_ms // len(args.turns)) conversation_audio.append(audio) # Add silence between turns if turn.pause_ms > 0: silence_samples = int(turn.pause_ms * self.generator.sample_rate / 1000) silence = torch.zeros(silence_samples) conversation_audio.append(silence) # Add this turn to context
        # ... for future turns
        context_segments.append(
            Segment(
                text=turn.text,
                speaker=turn.speaker,
                audio=audio)
        )

    # Concatenate all audio
    full_audio = torch.cat(conversation_audio, dim=0)

    # Save and return
    path = f"/tmp/conversation_{uuid.uuid4()}.wav"
    self.torchaudio.save(path, full_audio.unsqueeze(0).cpu(), self.generator.sample_rate)
    try:
        with open(path, "rb") as infile:
            return Response(
                content=infile.read(),
                media_type="audio/wav",
                headers={"Content-Disposition": "attachment; filename=conversation.wav"})
    finally:
        if os.path.exists(path):
            os.remove(path)
```

### Voice Cloning with Reference Audio

Clone a voice from a reference audio sample:

```python
class VoiceCloningInput(BaseModel):
    text: str
    reference_audio_b64: str
    reference_text: str  # Text that was spoken in reference audio
    max_duration_ms: int = Field(default=15000, ge=1000, le=30000)

@chute.cord(public_api_path="/clone_voice", method="POST")
async def clone_voice(self, args: VoiceCloningInput) -> Response:
    """Generate speech using a reference voice sample."""
    from generator import Segment

    # Load reference audio
    reference_audio = load_audio(self, args.reference_audio_b64)

    # Create a context segment from the reference
    reference_segment = Segment(
        text=args.reference_text,
        speaker=0,  # Use speaker 0 as base
        audio=reference_audio)

    # Generate new speech with the reference voice characteristics
    audio = self.generator.generate(
        text=args.text,
        speaker=0,
        context=[reference_segment],
        max_audio_length_ms=args.max_duration_ms)

    # Save and return
    path = f"/tmp/cloned_{uuid.uuid4()}.wav"
    self.torchaudio.save(path, audio.unsqueeze(0).cpu(), self.generator.sample_rate)
    try:
        with open(path, "rb") as infile:
            return Response(
                content=infile.read(),
                media_type="audio/wav",
                headers={"Content-Disposition": "attachment; filename=cloned_voice.wav"})
    finally:
        if os.path.exists(path):
            os.remove(path)
```

### Batch Processing

Process multiple texts efficiently:

```python
class BatchTTSInput(BaseModel):
    texts: List[str] = Field(max_items=10)  #
Limit batch size speaker: int = Field(default=0, ge=0, le=1) max_duration_per_text_ms: int = Field(default=10000, ge=1000, le=20000) @chute.cord(public_api_path="/batch_speak", method="POST") async def batch_speak(self, args: BatchTTSInput) -> List[str]: """Generate speech for multiple texts and return as base64 list.""" results = [] for text in args.texts: # Generate audio for each text audio = self.generator.generate( text=text, speaker=args.speaker, context=[], max_audio_length_ms=args.max_duration_per_text_ms) # Convert to WAV bytes path = f"/tmp/batch_{uuid.uuid4()}.wav" self.torchaudio.save(path, audio.unsqueeze(0).cpu(), self.generator.sample_rate) try: with open(path, "rb") as infile: audio_b64 = base64.b64encode(infile.read()).decode() results.append(audio_b64) finally: if os.path.exists(path): os.remove(path) return results ``` ## Deployment and Usage ### Deploy the Service ```bash # Build and deploy the TTS service chutes deploy my_tts:chute # Monitor the deployment chutes chutes get my-tts ``` ### Using the API #### Basic Text-to-Speech ```bash curl -X POST "https://myuser-my-tts.chutes.ai/speak" \ -H "Content-Type: application/json" \ -d '{ "text": "Hello, this is a demonstration of high-quality text-to-speech synthesis.", "speaker": 0, "max_duration_ms": 15000 }' \ --output speech.wav ``` #### Voice Cloning ```bash # First, encode your reference audio to base64 # base64 -i reference.wav > reference.b64 curl -X POST "https://myuser-my-tts.chutes.ai/clone_voice" \ -H "Content-Type: application/json" \ -d '{ "text": "This is new text spoken in the reference voice", "reference_audio_b64": "'$(cat reference.b64)'", "reference_text": "Original text that was spoken in the reference audio", "max_duration_ms": 20000 }' \ --output cloned_speech.wav ``` #### Python Client Example ```python import requests import base64 import io from pydantic import BaseModel from typing import List, Optional class TTSClient: def __init__(self, base_url: str): self.base_url = 
base_url.rstrip('/') def speak(self, text: str, speaker: int = 0, max_duration_ms: int = 10000) -> bytes: """Generate speech from text.""" response = requests.post( f"{self.base_url}/speak", json={ "text": text, "speaker": speaker, "max_duration_ms": max_duration_ms } ) if response.status_code == 200: return response.content else: raise Exception(f"TTS failed: {response.status_code} - {response.text}") def clone_voice(self, text: str, reference_audio_path: str, reference_text: str) -> bytes: """Generate speech using voice cloning.""" # Encode reference audio with open(reference_audio_path, "rb") as f: reference_b64 = base64.b64encode(f.read()).decode() response = requests.post( f"{self.base_url}/clone_voice", json={ "text": text, "reference_audio_b64": reference_b64, "reference_text": reference_text, "max_duration_ms": 20000 } ) return response.content def generate_conversation(self, turns: List[dict]) -> bytes: """Generate a conversation between speakers.""" response = requests.post( f"{self.base_url}/conversation", json={ "turns": turns, "max_total_duration_ms": 30000 } ) return response.content def batch_speak(self, texts: List[str], speaker: int = 0) -> List[bytes]: """Generate speech for multiple texts.""" response = requests.post( f"{self.base_url}/batch_speak", json={ "texts": texts, "speaker": speaker, "max_duration_per_text_ms": 10000 } ) if response.status_code == 200: b64_results = response.json() return [base64.b64decode(b64) for b64 in b64_results] else: raise Exception(f"Batch TTS failed: {response.status_code}") # Usage examples client = TTSClient("https://myuser-my-tts.chutes.ai") # Basic TTS speech_audio = client.speak("Hello, world! 
This is synthesized speech.") with open("hello.wav", "wb") as f: f.write(speech_audio) # Voice cloning cloned_audio = client.clone_voice( text="This is new content in the cloned voice", reference_audio_path="reference_voice.wav", reference_text="This was the original reference text" ) with open("cloned.wav", "wb") as f: f.write(cloned_audio) # Conversation generation conversation_turns = [ {"speaker": 0, "text": "Hello, how are you today?", "pause_ms": 1000}, {"speaker": 1, "text": "I'm doing great, thanks for asking!", "pause_ms": 800}, {"speaker": 0, "text": "That's wonderful to hear.", "pause_ms": 500} ] conversation_audio = client.generate_conversation(conversation_turns) with open("conversation.wav", "wb") as f: f.write(conversation_audio) ``` ## Best Practices ### 1. Text Preprocessing ```python import re def preprocess_text(text: str) -> str: """Clean and prepare text for TTS.""" # Expand common abbreviations text = text.replace("Dr.", "Doctor") text = text.replace("Mr.", "Mister") text = text.replace("Mrs.", "Missus") text = text.replace("&", "and") # Handle numbers (basic example) text = re.sub(r'\b(\d+)\b', lambda m: num_to_words(int(m.group(1))), text) # Remove excessive punctuation text = re.sub(r'[.]{2,}', '.', text) text = re.sub(r'[!]{2,}', '!', text) text = re.sub(r'[?]{2,}', '?', text) return text.strip() def num_to_words(num: int) -> str: """Convert numbers to words (basic implementation).""" if num == 0: return "zero" elif num == 1: return "one" # Add more number conversions as needed else: return str(num) # Fallback ``` ### 2. 
Context Management ```python class ContextManager: """Manage conversation context for better continuity.""" def __init__(self, max_context_length: int = 5): self.context_segments = [] self.max_length = max_context_length def add_segment(self, text: str, speaker: int, audio_tensor): """Add a new segment to context.""" from generator import Segment segment = Segment(text=text, speaker=speaker, audio=audio_tensor) self.context_segments.append(segment) # Keep only recent context if len(self.context_segments) > self.max_length: self.context_segments = self.context_segments[-self.max_length:] def get_context(self) -> List: """Get current context for generation.""" return self.context_segments.copy() def clear(self): """Clear all context.""" self.context_segments = [] # Usage in endpoint @chute.cord(public_api_path="/contextual_speak", method="POST") async def contextual_speak(self, args: InputArgs) -> Response: """Generate speech with persistent context.""" if not hasattr(self, 'context_manager'): self.context_manager = ContextManager() # Generate with context audio = self.generator.generate( text=args.text, speaker=args.speaker, context=self.context_manager.get_context(), max_audio_length_ms=args.max_duration_ms) # Add to context for future generations self.context_manager.add_segment(args.text, args.speaker, audio) # Return audio... ``` ### 3. 
Quality Control

```python
def validate_audio_quality(audio_tensor, sample_rate: int) -> bool:
    """Check generated audio quality."""
    import torch

    # Check for silence (all zeros)
    if torch.all(audio_tensor == 0):
        return False

    # Check for clipping
    if torch.max(torch.abs(audio_tensor)) > 0.99:
        return False

    # Check minimum duration (avoid too-short clips)
    min_duration_ms = 500
    min_samples = int(min_duration_ms * sample_rate / 1000)
    if len(audio_tensor) < min_samples:
        return False

    return True

@chute.cord(public_api_path="/quality_speak", method="POST")
async def quality_controlled_speak(self, args: InputArgs) -> Response:
    """Generate speech with quality validation."""
    max_retries = 3
    for attempt in range(max_retries):
        audio = self.generator.generate(
            text=args.text,
            speaker=args.speaker,
            context=[],
            max_audio_length_ms=args.max_duration_ms)

        if validate_audio_quality(audio, self.generator.sample_rate):
            # Quality passed, return audio
            break
        else:
            logger.warning(f"Audio quality check failed, attempt {attempt + 1}")
            if attempt == max_retries - 1:
                raise HTTPException(
                    status_code=500,
                    detail="Failed to generate quality audio after multiple attempts"
                )

    # Save and return validated audio...
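    # A sketch of the elided save-and-return step, assuming it mirrors the
    # /speak endpoint shown earlier (same self.torchaudio / self.generator
    # attributes; adjust the filename as needed):
    path = f"/tmp/{uuid.uuid4()}.wav"
    self.torchaudio.save(path, audio.unsqueeze(0).cpu(), self.generator.sample_rate)
    try:
        with open(path, "rb") as infile:
            return Response(
                content=infile.read(),
                media_type="audio/wav",
                headers={"Content-Disposition": "attachment; filename=speech.wav"})
    finally:
        if os.path.exists(path):
            os.remove(path)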
```

## Performance Optimization

### Memory Management

```python
@chute.cord(public_api_path="/optimized_speak", method="POST")
async def optimized_speak(self, args: InputArgs) -> Response:
    """Memory-optimized speech generation."""
    import torch
    from generator import Segment

    try:
        # Clear cache before generation
        torch.cuda.empty_cache()

        # Convert any provided context into Segment objects (as in /speak)
        segments = []
        if args.context:
            for ctx in args.context:
                segments.append(
                    Segment(
                        text=ctx.text,
                        speaker=ctx.speaker,
                        audio=load_audio(self, ctx.audio_b64))
                )

        # Generate with memory efficiency
        with torch.inference_mode():
            audio = self.generator.generate(
                text=args.text,
                speaker=args.speaker,
                context=segments,
                max_audio_length_ms=args.max_duration_ms)

        # Process and return immediately
        path = f"/tmp/{uuid.uuid4()}.wav"
        self.torchaudio.save(path, audio.unsqueeze(0).cpu(), self.generator.sample_rate)

        # Read and clean up immediately
        with open(path, "rb") as infile:
            content = infile.read()
        os.remove(path)

        return Response(
            content=content,
            media_type="audio/wav",
            headers={"Content-Disposition": "attachment; filename=speech.wav"})
    finally:
        # Always clear cache after generation
        torch.cuda.empty_cache()
```

### Caching for Repeated Requests

```python
import hashlib
from typing import Dict

class TTSCache:
    """Simple cache for TTS results."""

    def __init__(self, max_size: int = 100):
        self.cache: Dict[str, bytes] = {}
        self.max_size = max_size

    def get_key(self, text: str, speaker: int) -> str:
        """Generate a cache key."""
        content = f"{text}_{speaker}"
        return hashlib.md5(content.encode()).hexdigest()

    def get(self, text: str, speaker: int) -> Optional[bytes]:
        """Get a cached result."""
        key = self.get_key(text, speaker)
        return self.cache.get(key)

    def set(self, text: str, speaker: int, audio_bytes: bytes):
        """Cache a result."""
        if len(self.cache) >= self.max_size:
            # Remove oldest item (simple FIFO)
            oldest_key = next(iter(self.cache))
            del self.cache[oldest_key]
        key = self.get_key(text, speaker)
        self.cache[key] = audio_bytes

# Add to chute initialization
@chute.on_startup()
async def initialize_with_cache(self):
    # ... existing initialization ...
    self.tts_cache = TTSCache(max_size=200)

@chute.cord(public_api_path="/cached_speak", method="POST")
async def cached_speak(self, args: InputArgs) -> Response:
    """TTS with caching for repeated requests."""
    # Check cache first (only for simple requests without context)
    if not args.context:
        cached_result = self.tts_cache.get(args.text, args.speaker)
        if cached_result:
            return Response(
                content=cached_result,
                media_type="audio/wav",
                headers={"Content-Disposition": "attachment; filename=cached_speech.wav"})

    # Generate new audio
    audio = self.generator.generate(
        text=args.text,
        speaker=args.speaker,
        context=[],
        max_audio_length_ms=args.max_duration_ms)

    # Save to file and cache
    path = f"/tmp/{uuid.uuid4()}.wav"
    self.torchaudio.save(path, audio.unsqueeze(0).cpu(), self.generator.sample_rate)
    with open(path, "rb") as infile:
        audio_bytes = infile.read()
    os.remove(path)

    # Cache the result
    if not args.context:
        self.tts_cache.set(args.text, args.speaker, audio_bytes)

    return Response(
        content=audio_bytes,
        media_type="audio/wav",
        headers={"Content-Disposition": "attachment; filename=speech.wav"})
```

## Monitoring and Troubleshooting

### Performance Monitoring

```bash
# Check service health
chutes chutes get my-tts

# View generation logs
chutes chutes logs my-tts --tail 100

# Monitor GPU utilization
chutes chutes metrics my-tts
```

### Common Issues and Solutions

```python
import torch  # needed for torch.cuda.OutOfMemoryError handling below

# Handle common TTS issues
@chute.cord(public_api_path="/robust_speak", method="POST")
async def robust_speak(self, args: InputArgs) -> Response:
    """TTS with comprehensive error handling."""
    try:
        # Preprocess text
        processed_text = preprocess_text(args.text)

        # Validate text length
        if len(processed_text) > 1000:
            raise HTTPException(
                status_code=400,
                detail="Text too long. Maximum 1000 characters allowed."
            )

        # Generate audio
        audio = self.generator.generate(
            text=processed_text,
            speaker=args.speaker,
            context=[],
            max_audio_length_ms=args.max_duration_ms)

        # Validate output
        if not validate_audio_quality(audio, self.generator.sample_rate):
            raise HTTPException(
                status_code=500,
                detail="Generated audio failed quality checks"
            )

        # Return successful result
        path = f"/tmp/{uuid.uuid4()}.wav"
        self.torchaudio.save(path, audio.unsqueeze(0).cpu(), self.generator.sample_rate)
        with open(path, "rb") as infile:
            content = infile.read()
        os.remove(path)

        return Response(
            content=content,
            media_type="audio/wav",
            headers={"Content-Disposition": "attachment; filename=speech.wav"})
    except torch.cuda.OutOfMemoryError:
        raise HTTPException(
            status_code=503,
            detail="GPU memory exhausted. Please try again or reduce duration."
        )
    except Exception as e:
        logger.error(f"TTS generation failed: {e}")
        raise HTTPException(
            status_code=500,
            detail=f"Speech generation failed: {str(e)}"
        )
```

## Next Steps

- **Custom Voice Training**: Train CSM-1B on your own voice data
- **Multilingual Support**: Experiment with different languages
- **Real-time Streaming**: Implement streaming TTS for live applications
- **Integration**: Build voice assistants and interactive applications

For more advanced examples, see:

- [Real-time Streaming](/docs/examples/streaming-responses)
- [Custom Training](/docs/examples/custom-training)
- [Audio Processing](/docs/examples/audio-processing)

---

## SOURCE: https://chutes.ai/docs/examples/video-generation

# Video Generation with Wan2.1

This guide demonstrates how to build a sophisticated video generation service using Wan2.1-14B from Alibaba, capable of generating high-quality videos from text prompts and transforming images into videos.
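As a quick orientation, the request bodies the service will accept can be sketched as small payload builders. This is a minimal sketch: the deployment URL is a hypothetical placeholder, and the field names follow the `VideoGenInput` and `I2VInput` schemas defined in this guide.

```python
import base64

# Hypothetical deployment URL; substitute your own chute's hostname.
BASE_URL = "https://myuser-wan2-1-14b.chutes.ai"

def t2v_payload(prompt: str, resolution: str = "832*480",
                steps: int = 25, frames: int = 81) -> dict:
    """JSON body for POST /text2video; the response is raw MP4 bytes."""
    return {"prompt": prompt, "resolution": resolution,
            "steps": steps, "frames": frames}

def i2v_payload(prompt: str, image_bytes: bytes, steps: int = 25) -> dict:
    """JSON body for POST /image2video; the input image is base64-encoded."""
    return {"prompt": prompt,
            "image_b64": base64.b64encode(image_bytes).decode(),
            "steps": steps}

payload = t2v_payload("a goat jumping off a boat")
```

The rest of the guide builds the schemas, image, and endpoints that consume these payloads.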
## Overview

Wan2.1-14B is a state-of-the-art video generation model that supports:

- **Text-to-Video (T2V)**: Generate videos from text descriptions
- **Image-to-Video (I2V)**: Transform images into dynamic videos
- **Text-to-Image (T2I)**: Generate single frames from text
- **Multiple Resolutions**: Support for various aspect ratios
- **High Quality**: Up to 720p video generation with 44.1kHz audio
- **Distributed Processing**: Multi-GPU support for large-scale deployment

## Complete Implementation

### Input Schema Design

Define comprehensive input validation for video generation:

```python
from pydantic import BaseModel, Field
from typing import Optional
from enum import Enum

class Resolution(str, Enum):
    SIXTEEN_NINE = "1280*720"  # 16:9 widescreen
    NINE_SIXTEEN = "720*1280"  # 9:16 portrait (mobile)
    WIDESCREEN = "832*480"     # Cinematic widescreen
    PORTRAIT = "480*832"       # Portrait
    SQUARE = "1024*1024"       # Square format

class VideoGenInput(BaseModel):
    prompt: str
    negative_prompt: Optional[str] = (
        "Vibrant colors, overexposed, static, blurry details, subtitles, style, artwork, "
        "painting, picture, still, overall grayish, worst quality, low quality, JPEG compression artifacts, "
        "ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn face, deformed, disfigured, "
        "malformed limbs, fused fingers, motionless image, cluttered background, three legs, "
        "many people in the background, walking backwards, slow motion"
    )
    resolution: Optional[Resolution] = Resolution.WIDESCREEN
    sample_shift: Optional[float] = Field(None, ge=1.0, le=7.0)
    guidance_scale: Optional[float] = Field(5.0, ge=1.0, le=7.5)
    seed: Optional[int] = 42
    steps: int = Field(25, ge=10, le=30)
    fps: int = Field(16, ge=16, le=60)
    frames: Optional[int] = Field(81, ge=81, le=241)
    single_frame: Optional[bool] = False

class ImageGenInput(BaseModel):
    prompt: str
    negative_prompt: Optional[str] = (
        "Vibrant colors, overexposed, static, blurry details, subtitles, style, artwork, "
        "painting, picture, still, overall
grayish, worst quality, low quality, JPEG compression artifacts, " "ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn face, deformed, disfigured, " "malformed limbs, fused fingers, motionless image, cluttered background, three legs, " "many people in the background, walking backwards, slow motion" ) resolution: Optional[Resolution] = Resolution.WIDESCREEN sample_shift: Optional[float] = Field(None, ge=1.0, le=7.0) guidance_scale: Optional[float] = Field(5.0, ge=1.0, le=7.5) seed: Optional[int] = 42 class I2VInput(BaseModel): prompt: str negative_prompt: Optional[str] = ( "Vibrant colors, overexposed, static, blurry details, subtitles, style, artwork, " "painting, picture, still, overall grayish, worst quality, low quality, JPEG compression artifacts, " "ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn face, deformed, disfigured, " "malformed limbs, fused fingers, motionless image, cluttered background, three legs, " "many people in the background, walking backwards, slow motion" ) sample_shift: Optional[float] = Field(None, ge=1.0, le=7.0) guidance_scale: Optional[float] = Field(5.0, ge=1.0, le=7.5) seed: Optional[int] = 42 image_b64: str # Base64 encoded input image steps: int = Field(25, ge=20, le=50) fps: int = Field(16, ge=16, le=60) single_frame: Optional[bool] = False ``` ### Custom Image with Wan2.1 Build a custom image with all required dependencies: ```python from chutes.image import Image as ChutesImage from chutes.chute import Chute, NodeSelector import os import time from loguru import logger # Set up environment for large model downloads T2V_PATH = os.path.join(os.getenv("HF_HOME", "/cache"), "Wan2.1-T2V-14B") I2V_480_PATH = os.path.join(os.getenv("HF_HOME", "/cache"), "Wan2.1-I2V-14B-480P") # Download models if in remote execution context if os.getenv("CHUTES_EXECUTION_CONTEXT") == "REMOTE": from huggingface_hub import snapshot_download cache_dir = os.getenv("HF_HOME", "/cache") for _ in range(3): # Retry downloads 
try: snapshot_download( repo_id="Wan-AI/Wan2.1-I2V-14B-480P", revision="6b73f84e66371cdfe870c72acd6826e1d61cf279", local_dir=I2V_480_PATH) snapshot_download( repo_id="Wan-AI/Wan2.1-T2V-14B", revision="b1cbf2d3d13dca5164463128885ab8e551e93e40", local_dir=T2V_PATH) break except Exception as exc: logger.warning(f"Error downloading models: {exc}") time.sleep(30) # Build custom image with video generation capabilities image = ( ChutesImage( username="myuser", name="wan21", tag="0.0.1", readme="## Video and image generation/editing model from Alibaba") .from_base("parachutes/base-python:3.12.7") .set_user("root") .run_command("apt update") .apt_install("ffmpeg") # Required for video processing .set_user("chutes") .run_command( "git clone https://github.com/Wan-Video/Wan2.1 && " "cd Wan2.1 && " "pip install --upgrade pip && " "pip install setuptools wheel && " "pip install torch torchvision && " "pip install -r requirements.txt --no-build-isolation && " "pip install xfuser && " # Apply critical patches for performance "perl -pi -e 's/sharding_strategy=sharding_strategy,/sharding_strategy=sharding_strategy,\\n use_orig_params=True,/g' wan/distributed/fsdp.py && " "perl -pi -e 's/dtype=torch.float32,/dtype=torch.bfloat16,/g' wan/modules/t5.py && " "mv -f /app/Wan2.1/wan /home/chutes/.local/lib/python3.12/site-packages/" ) ) ``` ### Chute Configuration Configure the service with high-end GPU requirements: ```python chute = Chute( username="myuser", name="wan2.1-14b", tagline="Text-to-video, image-to-video, text-to-image with Wan2.1 14B", readme="High-quality video generation using Wan2.1 14B model with support for multiple formats and resolutions", image=image, node_selector=NodeSelector( gpu_count=8, # Multi-GPU setup required include=["h100", "h800", "h100_nvl", "h100_sxm", "h200"] # Latest GPUs only )) ``` ### Distributed Model Initialization Initialize models across multiple GPUs using distributed processing: ```python def initialize_model(rank, world_size, task_queue): 
""" Initialize Wan2.1 models in distributed fashion across GPUs. """ import torch import torch.distributed as dist import wan from wan.configs import WAN_CONFIGS from xfuser.core.distributed import initialize_model_parallel, init_distributed_environment # Set up distributed environment os.environ["RANK"] = str(rank) os.environ["WORLD_SIZE"] = str(world_size) os.environ["LOCAL_RANK"] = str(rank) local_rank = rank device = local_rank torch.cuda.set_device(local_rank) logger.info(f"Initializing distributed inference on {rank=}...") dist.init_process_group( backend="nccl", init_method="tcp://127.0.0.1:29501", rank=rank, world_size=world_size ) init_distributed_environment(rank=dist.get_rank(), world_size=dist.get_world_size()) initialize_model_parallel( sequence_parallel_degree=dist.get_world_size(), ring_degree=1, ulysses_degree=8) # Initialize text-to-video model cfg = WAN_CONFIGS["t2v-14B"] base_seed = [42] if rank == 0 else [None] dist.broadcast_object_list(base_seed, src=0) logger.info(f"Loading text-to-video model on {rank=}") wan_t2v = wan.WanT2V( config=cfg, checkpoint_dir=T2V_PATH, device_id=device, rank=rank, t5_fsdp=True, dit_fsdp=True, use_usp=True, t5_cpu=False) # Compile models for optimal performance logger.info("Compiling text-to-video model...") wan_t2v.text_encoder = torch.compile(wan_t2v.text_encoder) wan_t2v.vae.model = torch.compile(wan_t2v.vae.model) wan_t2v.model = torch.compile(wan_t2v.model) # Initialize image-to-video model logger.info(f"Loading 480P image-to-video model on {rank=}") cfg = WAN_CONFIGS["i2v-14B"] wan_i2v_480 = wan.WanI2V( config=cfg, checkpoint_dir=I2V_480_PATH, device_id=device, rank=rank, t5_fsdp=True, dit_fsdp=True, use_usp=True, t5_cpu=False) logger.info("Compiling 480P image-to-video model...") wan_i2v_480.text_encoder = torch.compile(wan_i2v_480.text_encoder) wan_i2v_480.vae.model = torch.compile(wan_i2v_480.vae.model) wan_i2v_480.model = torch.compile(wan_i2v_480.model) logger.info(f"Finished loading models on {rank=}") 
if rank == 0: return wan_t2v, wan_i2v_480 else: # Worker processes handle task queue while True: task = task_queue.get() prompt = task.get("prompt") args = task.get("args") if task.get("type") == "T2V": logger.info(f"Process {rank} executing T2V task...") _ = wan_t2v.generate(prompt, **args) else: # I2V task logger.info(f"Process {rank} executing I2V 480P task...") _ = wan_i2v_480.generate(prompt, task["image"], **args) dist.barrier() @chute.on_startup() async def initialize(self): """ Initialize distributed video generation system. """ import torch import torch.multiprocessing as torch_mp import multiprocessing import numpy as np from wan.configs import MAX_AREA_CONFIGS, SIZE_CONFIGS from PIL import Image start_time = int(time.time()) self.world_size = torch.cuda.device_count() torch_mp.set_start_method("spawn", force=True) # Create task queue for distributed processing processes = [] self.task_queue = multiprocessing.Queue() logger.info(f"Starting {self.world_size} processes for distributed execution...") # Start worker processes for rank in range(1, self.world_size): p = torch_mp.Process( target=initialize_model, args=(rank, self.world_size, self.task_queue) ) p.start() processes.append(p) self.processes = processes # Initialize main process models self.wan_t2v, self.wan_i2v_480 = initialize_model(0, self.world_size, self.task_queue) delta = int(time.time()) - start_time logger.success(f"Initialized T2V and I2V models in {delta} seconds!") # Perform warmup generations await self._warmup_models() async def _warmup_models(self): """Warmup both T2V and I2V models with test generations.""" import numpy as np from PIL import Image from wan.configs import MAX_AREA_CONFIGS, SIZE_CONFIGS # Create synthetic warmup image array = np.zeros((480, 832, 3), dtype=np.uint8) for x in range(832): for y in range(480): r = int(255 * x / 832) g = int(255 * y / 480) b = int(255 * (x + y) / (832 + 480)) array[y, x] = [r, g, b] warmup_image = Image.fromarray(array) # Warmup I2V model 
prompt_args = { "max_area": MAX_AREA_CONFIGS[Resolution.WIDESCREEN.value], "frame_num": 81, "shift": 3.0, "sample_solver": "unipc", "sampling_steps": 25, "guide_scale": 5.0, "seed": 42, "offload_model": False, } logger.info("Warming up image-to-video model...") _infer(self, "Shifting gradient.", image=warmup_image, single_frame=False, **prompt_args) # Warmup T2V model for all resolutions for resolution in ( Resolution.SIXTEEN_NINE, Resolution.NINE_SIXTEEN, Resolution.WIDESCREEN, Resolution.PORTRAIT, Resolution.SQUARE): prompt_args = { "size": SIZE_CONFIGS[resolution.value], "frame_num": 81, "shift": 5.0, "sample_solver": "unipc", "sampling_steps": 25, "guide_scale": 5.0, "seed": 42, "offload_model": False, } logger.info(f"Warming up text-to-video model with {resolution=}") _infer(self, "a goat jumping off a boat", image=None, single_frame=False, **prompt_args) ``` ### Core Inference Function Create the unified inference function for all generation types: ```python def _infer(self, prompt, image=None, single_frame=False, **prompt_args): """ Unified inference function for T2V, I2V, and T2I generation. 
""" import torch.distributed as dist from wan.utils.utils import cache_video, cache_image import uuid from io import BytesIO from fastapi import Response # Determine task type task_type = "I2V" if image else "T2V" if task_type == "I2V": _, height = image.size task_type += f"_{height}" # Distribute task to worker processes for _ in range(self.world_size - 1): self.task_queue.put({ "type": task_type, "prompt": prompt, "image": image, "args": prompt_args }) # Generate on main process model = getattr(self, f"wan_{task_type.lower()}") if image: video = model.generate(prompt, image, **prompt_args) else: video = model.generate(prompt, **prompt_args) # Wait for all processes to complete dist.barrier() # Save result (only on rank 0) if os.getenv("RANK") == "0": extension = "png" if single_frame else "mp4" output_file = f"/tmp/{uuid.uuid4()}.{extension}" try: if single_frame: output_file = cache_image( tensor=video.squeeze(1)[None], save_file=output_file, nrow=1, normalize=True, value_range=(-1, 1)) else: output_file = cache_video( tensor=video[None], save_file=output_file, fps=prompt_args.get("fps", 16), nrow=1, normalize=True, value_range=(-1, 1)) if not output_file: raise Exception("Failed to save output!") # Read file and return response buffer = BytesIO() with open(output_file, "rb") as infile: buffer.write(infile.read()) buffer.seek(0) media_type = "video/mp4" if not single_frame else "image/png" return Response( content=buffer.getvalue(), media_type=media_type, headers={ "Content-Disposition": f'attachment; filename="{uuid.uuid4()}.{extension}"' }) finally: if output_file and os.path.exists(output_file): os.remove(output_file) ``` ### Video Generation Endpoints Create endpoints for different generation modes: ```python import base64 from io import BytesIO from PIL import Image from fastapi import HTTPException, status @chute.cord( public_api_path="/text2video", public_api_method="POST", stream=False, output_content_type="video/mp4") async def text_to_video(self, args: 
VideoGenInput): """ Generate video from text description. """ from wan.configs import SIZE_CONFIGS if args.sample_shift is None: args.sample_shift = 5.0 if args.single_frame: args.frames = 1 elif args.frames % 4 != 1: # Ensure frame count is compatible args.frames = args.frames - (args.frames % 4) + 1 if not args.frames: args.frames = 81 prompt_args = { "size": SIZE_CONFIGS[args.resolution.value], "frame_num": args.frames, "shift": args.sample_shift, "sample_solver": "unipc", "sampling_steps": args.steps, "guide_scale": args.guidance_scale, "seed": args.seed, "offload_model": False, } return _infer( self, args.prompt, image=None, single_frame=args.single_frame, **prompt_args ) @chute.cord( public_api_path="/text2image", public_api_method="POST", stream=False, output_content_type="image/png") async def text_to_image(self, args: ImageGenInput): """ Generate single image from text description. """ # Convert to video input with single frame vargs = VideoGenInput(**args.model_dump()) vargs.single_frame = True return await text_to_video(self, vargs) def prepare_input_image(args): """ Resize and crop input image to target resolution. 
""" target_width = 832 target_height = 480 try: input_image = Image.open(BytesIO(base64.b64decode(args.image_b64))) orig_width, orig_height = input_image.size # Calculate scaling to maintain aspect ratio width_ratio = target_width / orig_width height_ratio = target_height / orig_height scale_factor = max(width_ratio, height_ratio) new_width = int(orig_width * scale_factor) new_height = int(orig_height * scale_factor) # Resize image input_image = input_image.resize((new_width, new_height), Image.Resampling.LANCZOS) # Center crop to target dimensions width, height = input_image.size left = (width - target_width) // 2 top = (height - target_height) // 2 right = left + target_width bottom = top + target_height input_image = input_image.crop((left, top, right, bottom)).convert("RGB") except Exception as exc: raise HTTPException( status_code=status.HTTP_400_BAD_REQUEST, detail=f"Invalid image input! {exc}") return input_image @chute.cord( public_api_path="/image2video", public_api_method="POST", stream=False, output_content_type="video/mp4") async def image_to_video(self, args: I2VInput): """ Generate video from input image and text prompt. 
""" from wan.configs import MAX_AREA_CONFIGS if args.sample_shift is None: args.sample_shift = 3.0 # Process and validate input image input_image = prepare_input_image(args) prompt_args = { "max_area": MAX_AREA_CONFIGS[Resolution.WIDESCREEN.value], "frame_num": 81, # Fixed frame count for stability "shift": args.sample_shift, "sample_solver": "unipc", "sampling_steps": args.steps, "guide_scale": args.guidance_scale, "seed": args.seed, "offload_model": False, } return _infer( self, args.prompt, image=input_image, single_frame=False, **prompt_args ) ``` ## Advanced Features ### Batch Video Generation Process multiple prompts efficiently: ```python class BatchVideoInput(BaseModel): prompts: List[str] = Field(max_items=5) # Limit for resource management resolution: Resolution = Resolution.WIDESCREEN steps: int = Field(20, ge=10, le=30) frames: int = Field(81, ge=81, le=161) @chute.cord(public_api_path="/batch_video", method="POST") async def batch_video_generation(self, args: BatchVideoInput) -> List[str]: """Generate multiple videos and return as base64 list.""" from wan.configs import SIZE_CONFIGS results = [] for prompt in args.prompts: prompt_args = { "size": SIZE_CONFIGS[args.resolution.value], "frame_num": args.frames, "shift": 5.0, "sample_solver": "unipc", "sampling_steps": args.steps, "guide_scale": 5.0, "seed": 42, "offload_model": False, } response = _infer(self, prompt, image=None, single_frame=False, **prompt_args) # Convert response to base64 video_b64 = base64.b64encode(response.body).decode() results.append(video_b64) return results ``` ### Style-Guided Video Generation Add style control to video generation: ```python class StyledVideoInput(BaseModel): prompt: str style: str = "cinematic" # Style guidance mood: str = "dramatic" # Mood control camera_movement: str = "static" # Camera motion resolution: Resolution = Resolution.WIDESCREEN steps: int = Field(25, ge=15, le=35) @chute.cord(public_api_path="/styled_video", method="POST") async def 
styled_video_generation(self, args: StyledVideoInput) -> Response: """Generate video with style and mood control.""" # Enhance prompt with style guidance enhanced_prompt = f"{args.prompt}, {args.style} style, {args.mood} mood" if args.camera_movement != "static": enhanced_prompt += f", {args.camera_movement} camera movement" # Generate with enhanced prompt video_args = VideoGenInput( prompt=enhanced_prompt, resolution=args.resolution, steps=args.steps, frames=81, single_frame=False ) return await text_to_video(self, video_args) ``` ### Video Interpolation Create smooth transitions between keyframes: ```python class InterpolationInput(BaseModel): start_prompt: str end_prompt: str interpolation_steps: int = Field(5, ge=3, le=10) resolution: Resolution = Resolution.WIDESCREEN @chute.cord(public_api_path="/interpolate_video", method="POST") async def video_interpolation(self, args: InterpolationInput) -> Response: """Generate video that interpolates between two prompts.""" # Generate interpolated prompts interpolated_prompts = [] for i in range(args.interpolation_steps): weight = i / (args.interpolation_steps - 1) if weight == 0: prompt = args.start_prompt elif weight == 1: prompt = args.end_prompt else: # Simple linear interpolation in text space prompt = f"transitioning from {args.start_prompt} to {args.end_prompt}, step {i+1}" interpolated_prompts.append(prompt) # Generate sequence of videos video_segments = [] for prompt in interpolated_prompts: video_args = VideoGenInput( prompt=prompt, resolution=args.resolution, frames=41, # Shorter segments for smooth transition steps=20 ) segment_response = await text_to_video(self, video_args) video_segments.append(segment_response.body) # Concatenate videos (simplified - would need ffmpeg for production) # For now, return the last segment return Response( content=video_segments[-1], media_type="video/mp4", headers={"Content-Disposition": "attachment; filename=interpolated_video.mp4"} ) ``` ## Deployment and Usage ### Deploy 
the Service ```bash # Build and deploy the video generation service chutes deploy my_video_gen:chute # Monitor the deployment (this will take time due to model size) chutes chutes get my-video-gen ``` ### Using the API #### Text-to-Video Generation ```bash curl -X POST "https://myuser-my-video-gen.chutes.ai/text2video" \ -H "Content-Type: application/json" \ -d '{ "prompt": "a majestic eagle soaring over mountain peaks at golden hour", "resolution": "1280*720", "steps": 25, "frames": 81, "fps": 24, "seed": 12345 }' \ --output eagle_video.mp4 ``` #### Image-to-Video Generation ```bash # First encode your image to base64 base64 -i input_image.jpg > image.b64 curl -X POST "https://myuser-my-video-gen.chutes.ai/image2video" \ -H "Content-Type: application/json" \ -d '{ "prompt": "gentle waves lapping against the shore", "image_b64": "'$(cat image.b64)'", "steps": 30, "fps": 16, "seed": 42 }' \ --output animated_image.mp4 ``` #### Python Client Example ```python import requests import base64 from typing import List, Optional from enum import Enum class VideoGenerator: def __init__(self, base_url: str): self.base_url = base_url.rstrip('/') def text_to_video( self, prompt: str, resolution: str = "832*480", steps: int = 25, frames: int = 81, fps: int = 16, seed: Optional[int] = None ) -> bytes: """Generate video from text prompt.""" payload = { "prompt": prompt, "resolution": resolution, "steps": steps, "frames": frames, "fps": fps, "single_frame": False } if seed is not None: payload["seed"] = seed response = requests.post( f"{self.base_url}/text2video", json=payload, timeout=300 # Extended timeout for video generation ) if response.status_code == 200: return response.content else: raise Exception(f"Video generation failed: {response.status_code} - {response.text}") def image_to_video( self, prompt: str, image_path: str, steps: int = 25, fps: int = 16, seed: Optional[int] = None ) -> bytes: """Generate video from image and text prompt.""" # Encode image to base64 with 
open(image_path, "rb") as f: image_b64 = base64.b64encode(f.read()).decode() payload = { "prompt": prompt, "image_b64": image_b64, "steps": steps, "fps": fps, "single_frame": False } if seed is not None: payload["seed"] = seed response = requests.post( f"{self.base_url}/image2video", json=payload, timeout=300 ) return response.content def text_to_image( self, prompt: str, resolution: str = "1024*1024", seed: Optional[int] = None ) -> bytes: """Generate single image from text.""" payload = { "prompt": prompt, "resolution": resolution } if seed is not None: payload["seed"] = seed response = requests.post( f"{self.base_url}/text2image", json=payload, timeout=120 ) return response.content def styled_video( self, prompt: str, style: str = "cinematic", mood: str = "dramatic", camera_movement: str = "static" ) -> bytes: """Generate styled video.""" payload = { "prompt": prompt, "style": style, "mood": mood, "camera_movement": camera_movement, "resolution": "1280*720", "steps": 25 } response = requests.post( f"{self.base_url}/styled_video", json=payload, timeout=300 ) return response.content # Usage examples generator = VideoGenerator("https://myuser-my-video-gen.chutes.ai") # Generate cinematic video video = generator.text_to_video( prompt="A time-lapse of a bustling city street transitioning from day to night", resolution="1280*720", frames=121, fps=24, seed=12345 ) with open("city_timelapse.mp4", "wb") as f: f.write(video) # Animate a photograph animated = generator.image_to_video( prompt="gentle autumn breeze causing leaves to fall", image_path="autumn_scene.jpg", steps=30, fps=16 ) with open("animated_autumn.mp4", "wb") as f: f.write(animated) # Generate styled content styled_video = generator.styled_video( prompt="a lone warrior walking through a desert", style="epic fantasy", mood="heroic", camera_movement="slow pan" ) with open("epic_warrior.mp4", "wb") as f: f.write(styled_video) ``` ## Performance Optimization ### Memory Management The model requires significant 
GPU memory and careful management: ```python # Monitor and optimize memory usage @chute.cord(public_api_path="/optimized_video", method="POST") async def optimized_video_generation(self, args: VideoGenInput) -> Response: """Memory-optimized video generation.""" import torch try: # Clear cache before generation torch.cuda.empty_cache() # Reduce frame count for memory efficiency if needed if args.frames > 161: args.frames = 161 logger.warning("Reduced frame count for memory efficiency") # Generate with memory monitoring result = await text_to_video(self, args) return result except torch.cuda.OutOfMemoryError: # Fallback to lower resolution/frame count logger.warning("OOM detected, falling back to lower settings") args.resolution = Resolution.WIDESCREEN # Smaller resolution args.frames = 81 # Fewer frames torch.cuda.empty_cache() return await text_to_video(self, args) finally: # Always clean up torch.cuda.empty_cache() ``` ### Quality vs Speed Trade-offs ```python class QualityPreset(str, Enum): FAST = "fast" # 15 steps, 720p, 81 frames BALANCED = "balanced" # 25 steps, 1080p, 121 frames QUALITY = "quality" # 35 steps, 1080p, 161 frames @chute.cord(public_api_path="/preset_video", method="POST") async def preset_video_generation(self, prompt: str, preset: QualityPreset = QualityPreset.BALANCED) -> Response: """Generate video with quality presets.""" if preset == QualityPreset.FAST: args = VideoGenInput( prompt=prompt, resolution=Resolution.WIDESCREEN, steps=15, frames=81, fps=16 ) elif preset == QualityPreset.BALANCED: args = VideoGenInput( prompt=prompt, resolution=Resolution.SIXTEEN_NINE, steps=25, frames=121, fps=24 ) else: # QUALITY args = VideoGenInput( prompt=prompt, resolution=Resolution.SIXTEEN_NINE, steps=35, frames=161, fps=30 ) return await text_to_video(self, args) ``` ## Best Practices ### 1. 
Prompt Engineering for Video ```python # Effective video prompts include motion and temporal elements good_video_prompts = [ "a cat gracefully leaping from a windowsill to a nearby table", "ocean waves gently rolling onto a sandy beach at sunset", "time-lapse of cherry blossoms blooming in spring", "a paper airplane gliding through the air in slow motion", "raindrops creating ripples on a calm pond surface" ] # Avoid static descriptions better suited for images avoid_for_video = [ "a beautiful mountain landscape", # Too static "portrait of a person", # No implied motion "a red car", # Lacks temporal context ] # Add temporal and motion keywords def enhance_video_prompt(base_prompt: str) -> str: """Enhance prompts for better video generation.""" motion_words = [ "flowing", "moving", "swaying", "drifting", "gliding", "rotating", "spinning", "floating", "cascading", "rippling" ] temporal_words = [ "slowly", "gently", "gradually", "smoothly", "continuously", "rhythmically", "gracefully", "elegantly" ] # Simple enhancement (would be more sophisticated in practice) if not any(word in base_prompt.lower() for word in motion_words + temporal_words): return f"{base_prompt}, gently moving, smooth motion" return base_prompt ``` ### 2. Resolution and Aspect Ratio Selection ```python def select_optimal_resolution(content_type: str, platform: str = "web") -> Resolution: """Select optimal resolution based on content and platform.""" if platform == "mobile": return Resolution.NINE_SIXTEEN # Mobile-friendly portrait elif platform == "social": return Resolution.SQUARE # Social media posts elif content_type == "cinematic": return Resolution.SIXTEEN_NINE # Widescreen cinematic elif content_type == "portrait": return Resolution.PORTRAIT # Portrait orientation else: return Resolution.WIDESCREEN # General purpose ``` ### 3. 
Error Handling and Fallbacks

```python
import asyncio

import torch
from fastapi import HTTPException
from loguru import logger

async def robust_video_generation(self, args: VideoGenInput) -> Response:
    """Generate video with comprehensive error handling and fallbacks."""
    max_retries = 3
    for attempt in range(max_retries):
        try:
            # Validate input parameters
            if args.frames > 241:
                args.frames = 241
            if args.steps > 35:
                args.steps = 35

            # Generate video
            result = await text_to_video(self, args)
            return result
        except torch.cuda.OutOfMemoryError:
            logger.warning(f"OOM on attempt {attempt + 1}, reducing settings")

            # Progressive fallback strategy
            if attempt == 0:
                args.frames = min(args.frames, 121)  # Reduce frames
            elif attempt == 1:
                args.resolution = Resolution.WIDESCREEN  # Smaller resolution
                args.frames = 81
            else:
                args.steps = 15  # Faster generation
                args.frames = 41
            torch.cuda.empty_cache()
        except Exception as e:
            logger.error(f"Generation failed on attempt {attempt + 1}: {e}")
            if attempt == max_retries - 1:
                raise HTTPException(
                    status_code=500,
                    detail=f"Video generation failed after {max_retries} attempts"
                )
            await asyncio.sleep(5)  # Non-blocking wait before retry
```

## Monitoring and Troubleshooting

### Resource Monitoring

```bash
# Monitor service health and resource usage
chutes chutes get my-video-gen

# View detailed logs
chutes chutes logs my-video-gen --tail 200

# Monitor GPU utilization across all devices
chutes chutes metrics my-video-gen --detailed
```

### Performance Metrics

```python
import time

import torch
from loguru import logger

@chute.cord(public_api_path="/monitored_video", method="POST")
async def monitored_video_generation(self, args: VideoGenInput) -> Response:
    """Video generation with performance monitoring."""
    start_time = time.time()
    gpu_memory_start = torch.cuda.memory_allocated()

    try:
        result = await text_to_video(self, args)

        generation_time = time.time() - start_time
        gpu_memory_peak = torch.cuda.max_memory_allocated()
        logger.info(
            f"Video generation completed - "
            f"Time: {generation_time:.2f}s, "
            f"Frames: {args.frames}, "
            f"Resolution: {args.resolution}, "
            f"GPU Memory: {gpu_memory_peak / 1024**3:.2f}GB"
        )
        return result
    except Exception as e:
        error_time = time.time() - start_time
        logger.error(f"Video generation failed after {error_time:.2f}s: {e}")
        raise
    finally:
        torch.cuda.reset_peak_memory_stats()
```

## Next Steps

- **Custom Training**: Fine-tune Wan2.1 on your own video datasets
- **Advanced Effects**: Implement video filters and post-processing
- **Real-time Streaming**: Build live video generation systems
- **Integration**: Connect with video editing and content creation tools

For more advanced examples, see:

- [Custom Training](/docs/examples/custom-training)
- [Streaming Applications](/docs/examples/streaming-responses)
- [Performance Optimization](/docs/examples/performance-optimization)

---

## SOURCE: https://chutes.ai/docs/guides/agents-and-tools

# Function Calling, Agents, and Tool Use

This guide demonstrates how to build advanced AI applications using **function calling** (tool use) and **autonomous agents** on the Chutes platform. You'll learn how to enable models to interact with external tools, databases, and APIs.

## Overview

Chutes supports function calling through its optimized serving templates (vLLM and SGLang), enabling:

- **Structured Data Extraction**: Get JSON outputs guaranteed to match a schema
- **Tool Execution**: Allow models to call Python functions
- **Agentic Workflows**: Build multi-step reasoning agents
- **External Integrations**: Connect LLMs to APIs, databases, and the web

## Quick Start: Enabling Function Calling

Use the `vLLM` template with specific arguments to enable tool support.

### 1. Deploy a Tool-Compatible Model

Models like **Mistral**, **Llama 3**, and **Qwen** have excellent function calling capabilities.
```python # deploy_agent_chute.py from chutes.chute import NodeSelector from chutes.chute.template import build_vllm_chute chute = build_vllm_chute( username="myuser", model_name="mistralai/Mistral-7B-Instruct-v0.3", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24 ), engine_args={ "enable_auto_tool_choice": True, # Enable tool parsing "tool_call_parser": "mistral", # Use specific parser (or "llama3_json") "max_model_len": 8192 } ) ``` Deploy this chute: ```bash chutes deploy deploy_agent_chute:chute --wait ``` ## Building a Simple Agent Here is a complete example of a Python client interacting with your deployed chute to execute tools. ### The Client Code ```python import openai import json import math # 1. Define the tools def calculate_square_root(x: float) -> float: """Calculates the square root of a number.""" return math.sqrt(x) def get_weather(location: str) -> str: """Get the current weather for a location.""" # Mock response return json.dumps({"location": location, "temperature": "72F", "condition": "Sunny"}) tools = [ { "type": "function", "function": { "name": "calculate_square_root", "description": "Calculates the square root of a number", "parameters": { "type": "object", "properties": { "x": {"type": "number", "description": "The number to calculate the root of"} }, "required": ["x"] } } }, { "type": "function", "function": { "name": "get_weather", "description": "Get the current weather for a location", "parameters": { "type": "object", "properties": { "location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"} }, "required": ["location"] } } } ] # 2. Initialize Client client = openai.OpenAI( base_url="https://myuser-mistral-7b.chutes.ai/v1", api_key="your-api-key" ) # 3. 
Chat Loop with Tool Execution messages = [ {"role": "system", "content": "You are a helpful assistant with access to tools."}, {"role": "user", "content": "What is the square root of 144 and what's the weather in Miami?"} ] # First call: Model decides to call tools response = client.chat.completions.create( model="mistralai/Mistral-7B-Instruct-v0.3", messages=messages, tools=tools, tool_choice="auto" ) response_message = response.choices[0].message tool_calls = response_message.tool_calls if tool_calls: # Append the model's response (containing tool calls) to history messages.append(response_message) # Execute each tool call for tool_call in tool_calls: function_name = tool_call.function.name function_args = json.loads(tool_call.function.arguments) print(f"🛠️ Executing {function_name} with {function_args}...") if function_name == "calculate_square_root": result = str(calculate_square_root(**function_args)) elif function_name == "get_weather": result = get_weather(**function_args) else: result = "Error: Unknown function" # Append tool result to history messages.append({ "tool_call_id": tool_call.id, "role": "tool", "name": function_name, "content": result }) # Second call: Model uses tool results to generate final answer final_response = client.chat.completions.create( model="mistralai/Mistral-7B-Instruct-v0.3", messages=messages ) print(f"🤖 Agent: {final_response.choices[0].message.content}") ``` ## Structured Output (JSON Mode) Sometimes you don't need to execute a function, but just want **guaranteed JSON output**. 
```python
# Define the schema you want
schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "score": {"type": "number"},
        "keywords": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["sentiment", "score", "keywords"]
}

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[
        {"role": "user", "content": "Analyze this review: 'The product is decent but expensive.'"}
    ],
    # Force JSON mode
    response_format={"type": "json_object"},
    # Optionally pass the schema in the system prompt, or use guided decoding parameters if using SGLang
)

print(response.choices[0].message.content)
# Output: {"sentiment": "neutral", "score": 0.5, "keywords": ["decent", "expensive"]}
```

## Advanced: SGLang for High-Speed Agents

For complex agentic workflows requiring **constrained generation** (e.g., "Output must be valid SQL"), SGLang is superior.

### 1. Deploy SGLang Chute

```python
from chutes.chute.template.sglang import build_sglang_chute

chute = build_sglang_chute(
    username="myuser",
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    node_selector=NodeSelector(gpu_count=1),
    engine_args={
        "disable_flashinfer": False
    }
)
```

### 2. Using Regex Constraints (Client-Side)

SGLang supports `extra_body` parameters for regex constraints:

```python
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What is the IP address of localhost?"}],
    extra_body={
        "regex": r"((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
    }
)

print(response.choices[0].message.content)  # Guaranteed to be a valid IP format
```

## Building a RAG Agent

Combine **Function Calling** with **Chutes Embeddings** for a RAG (Retrieval Augmented Generation) agent.

### Architecture

1. **Vector Store**: Stores your documents (e.g., Qdrant/pgvector running in a separate Chute or externally).
2.
**Embedding Chute**: TEI template for generating query embeddings. 3. **Agent Chute**: vLLM/SGLang model with a `search_knowledge_base` tool. ### Implementation Sketch ```python def search_knowledge_base(query: str): """Tool exposed to the LLM.""" # 1. Embed query using Chutes TEI endpoint embedding = requests.post( "https://myuser-embeddings.chutes.ai/embed", json={"inputs": query} ).json() # 2. Search vector DB results = vector_db.search(embedding) # 3. Return context return json.dumps(results) # ... Add this tool to the tools list in the Client Code example above ... ``` ## Best Practices for Agents 1. **System Prompts**: Clearly define the agent's persona and constraints. - _Bad:_ "You are a bot." - _Good:_ "You are a data analysis assistant. You have access to a SQL database. Always verify schemas before querying." 2. **Tool Descriptions**: Models rely heavily on tool descriptions. Be verbose and precise. 3. **Error Handling**: If a tool fails, feed the error message back to the model as a "tool" role message. The model can often self-correct. 4. **Concurrency**: For agents that make parallel tool calls, use Python's `asyncio.gather` to execute them concurrently before responding to the model. ## Next Steps - **[Embedding Service](/docs/examples/embeddings)** - Set up your RAG backend - **[SGLang Template](/docs/templates/sglang)** - Advanced constrained generation - **[vLLM Template](/docs/templates/vllm)** - High-performance tool serving --- ## SOURCE: https://chutes.ai/docs/guides/best-practices # Best Practices for Production-Ready Chutes This comprehensive guide covers production best practices for building, deploying, and maintaining robust, scalable, and secure Chutes applications in production environments. 
## Overview Production-ready Chutes applications require: - **Scalable Architecture**: Design for growth and varying loads - **Security**: Protect data, models, and infrastructure - **Performance**: Optimize for speed, memory, and resource efficiency - **Reliability**: Handle failures gracefully with high availability - **Monitoring**: Complete observability and alerting - **Maintainability**: Code quality, documentation, and operational procedures ## Application Architecture ### Modular Design Patterns ```python from abc import ABC, abstractmethod from typing import Protocol, TypeVar, Generic, Any, Optional, Dict from dataclasses import dataclass import logging # Define clear interfaces class ModelInterface(Protocol): """Protocol for AI model implementations.""" async def load(self) -> None: """Load the model into memory.""" ... async def predict(self, input_data: Any) -> Any: """Make prediction on input data.""" ... async def unload(self) -> None: """Unload model from memory.""" ... class CacheInterface(Protocol): """Protocol for caching implementations.""" async def get(self, key: str) -> Optional[Any]: ... async def set(self, key: str, value: Any, ttl: int = None) -> None: ... async def delete(self, key: str) -> None: ... 
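# Illustrative addition (a sketch, not from the original guide): a minimal
# in-memory implementation satisfying CacheInterface, useful for local
# testing before wiring up a real backend such as Redis or Memcached.
from typing import Any, Dict, Optional  # re-imported so this sketch runs standalone

class InMemoryCache:
    """Naive dict-backed cache; the ttl argument is accepted but ignored here."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    async def get(self, key: str) -> Optional[Any]:
        return self._store.get(key)

    async def set(self, key: str, value: Any, ttl: int = None) -> None:
        self._store[key] = value  # a production cache would honor ttl

    async def delete(self, key: str) -> None:
        self._store.pop(key, None)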
# Implement dependency injection @dataclass class Dependencies: """Application dependencies container.""" model: ModelInterface cache: CacheInterface logger: logging.Logger metrics: Any # Metrics collector config: Dict[str, Any] class ServiceBase(ABC): """Base class for application services.""" def __init__(self, deps: Dependencies): self.deps = deps self.logger = deps.logger self.model = deps.model self.cache = deps.cache @abstractmethod async def initialize(self) -> None: """Initialize the service.""" pass @abstractmethod async def cleanup(self) -> None: """Cleanup service resources.""" pass class TextGenerationService(ServiceBase): """Text generation service implementation.""" async def initialize(self) -> None: """Initialize text generation service.""" await self.model.load() self.logger.info("Text generation service initialized") async def generate(self, prompt: str, **kwargs) -> Dict[str, Any]: """Generate text with caching and error handling.""" # Create cache key cache_key = self._create_cache_key(prompt, kwargs) # Try cache first cached_result = await self.cache.get(cache_key) if cached_result: self.logger.info("Cache hit for text generation") return cached_result # Generate new result try: result = await self.model.predict(prompt, **kwargs) # Cache result await self.cache.set(cache_key, result, ttl=3600) return result except Exception as e: self.logger.error(f"Text generation failed: {e}") raise def _create_cache_key(self, prompt: str, kwargs: Dict) -> str: """Create deterministic cache key.""" import hashlib import json key_data = {"prompt": prompt, "params": sorted(kwargs.items())} key_str = json.dumps(key_data, sort_keys=True) return f"text_gen:{hashlib.md5(key_str.encode()).hexdigest()}" async def cleanup(self) -> None: """Cleanup resources.""" await self.model.unload() self.logger.info("Text generation service cleaned up") # Chute implementation with dependency injection from chutes.chute import Chute chute = Chute(username="production", 
name="text-service") @chute.on_startup() async def initialize_app(self): """Initialize application with proper dependency injection.""" # Configure logging logging.basicConfig( level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s' ) logger = logging.getLogger("text-service") # Initialize model model = await self._create_model() # Initialize cache cache = await self._create_cache() # Initialize metrics metrics = await self._create_metrics() # Load configuration config = await self._load_config() # Create dependencies container self.deps = Dependencies( model=model, cache=cache, logger=logger, metrics=metrics, config=config ) # Initialize services self.text_service = TextGenerationService(self.deps) await self.text_service.initialize() async def _create_model(self): """Factory method for model creation.""" # Implementation depends on your specific model pass async def _create_cache(self): """Factory method for cache creation.""" # Could be Redis, Memcached, or in-memory cache pass async def _create_metrics(self): pass async def _load_config(self): return {} ``` ### Configuration Management ```python import os from typing import Optional, Union from pydantic import BaseSettings, Field, validator from pathlib import Path class ApplicationConfig(BaseSettings): """Production application configuration.""" # Environment environment: str = Field("production", env="APP_ENV") debug: bool = Field(False, env="APP_DEBUG") # Model settings model_name: str = Field(..., env="MODEL_NAME") model_path: Optional[str] = Field(None, env="MODEL_PATH") max_batch_size: int = Field(8, env="MAX_BATCH_SIZE") # Performance settings max_workers: int = Field(4, env="MAX_WORKERS") request_timeout: float = Field(30.0, env="REQUEST_TIMEOUT") max_memory_usage: float = Field(0.9, env="MAX_MEMORY_USAGE") # Cache settings cache_backend: str = Field("redis", env="CACHE_BACKEND") cache_url: str = Field("redis://localhost:6379", env="CACHE_URL") cache_ttl: int = Field(3600, 
env="CACHE_TTL") # Logging log_level: str = Field("INFO", env="LOG_LEVEL") log_format: str = Field("json", env="LOG_FORMAT") # Security api_key_required: bool = Field(True, env="API_KEY_REQUIRED") allowed_origins: list = Field(["*"], env="ALLOWED_ORIGINS") rate_limit_requests: int = Field(100, env="RATE_LIMIT_REQUESTS") rate_limit_window: int = Field(60, env="RATE_LIMIT_WINDOW") # Monitoring metrics_enabled: bool = Field(True, env="METRICS_ENABLED") health_check_interval: int = Field(30, env="HEALTH_CHECK_INTERVAL") @validator('log_level') def validate_log_level(cls, v): valid_levels = ['DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'] if v.upper() not in valid_levels: raise ValueError(f'Log level must be one of: {valid_levels}') return v.upper() @validator('max_memory_usage') def validate_memory_usage(cls, v): if not 0.1 <= v <= 1.0: raise ValueError('Memory usage must be between 0.1 and 1.0') return v class Config: env_file = ".env" env_file_encoding = "utf-8" # Environment-specific configurations class DevelopmentConfig(ApplicationConfig): """Development environment configuration.""" environment: str = "development" debug: bool = True log_level: str = "DEBUG" api_key_required: bool = False class StagingConfig(ApplicationConfig): """Staging environment configuration.""" environment: str = "staging" debug: bool = False log_level: str = "INFO" class ProductionConfig(ApplicationConfig): """Production environment configuration.""" environment: str = "production" debug: bool = False log_level: str = "WARNING" api_key_required: bool = True def get_config() -> ApplicationConfig: """Get configuration based on environment.""" env = os.getenv("APP_ENV", "production").lower() config_classes = { "development": DevelopmentConfig, "staging": StagingConfig, "production": ProductionConfig } config_class = config_classes.get(env, ProductionConfig) # Note: In a real app you'd instantiate this properly with env vars # return config_class() return 
config_class(model_name="default-model") # simplified for example # Usage in Chute @chute.on_startup() async def load_configuration(self): """Load and validate configuration.""" self.config = get_config() # Configure logging based on config import logging logging.basicConfig( level=getattr(logging, self.config.log_level), format='%(asctime)s - %(name)s - %(levelname)s - %(message)s' ) self.logger = logging.getLogger(f"chute.{self.config.environment}") self.logger.info(f"Application started in {self.config.environment} mode") ``` ## Performance Optimization See the [Performance Optimization Guide](performance) for detailed strategies. Key areas include: - **Dynamic Batching**: Group requests for efficient GPU usage. - **Caching**: Cache expensive model outputs using Redis or in-memory stores. - **Quantization**: Use 8-bit or 4-bit quantization to reduce memory footprint and increase speed. - **Async Processing**: Use async/await to handle concurrent requests without blocking. ## Security Best Practices See the [Security Guide](security) for a deep dive. Essentials: - **Authentication**: Always use API keys or JWTs in production. - **Input Validation**: Validate and sanitize all inputs using Pydantic schemas. - **Rate Limiting**: Prevent abuse by limiting requests per user/IP. - **Secrets Management**: Use environment variables or mounted volumes for secrets; never hardcode them. ## Monitoring and Observability Implement structured logging and metrics to track the health of your application. ```python import time from contextlib import contextmanager from datetime import datetime import json import logging class StructuredLogger: def __init__(self, name): self.logger = logging.getLogger(name) # Configure JSON handler... def info(self, message, **kwargs): self.logger.info(json.dumps({"message": message, **kwargs})) class PerformanceMonitor: def __init__(self): # Initialize prometheus metrics... 
pass @contextmanager def measure_request(self, endpoint): start = time.time() try: yield finally: duration = time.time() - start # Record metric... ``` ## Deployment Best Practices ### Production Deployment Checklist ```python class ProductionDeploymentChecklist: """Comprehensive production deployment checklist.""" CHECKLIST = { "Security": [ "✓ Enable HTTPS/TLS encryption", "✓ Configure API authentication", "✓ Set up rate limiting", "✓ Sanitize all inputs", "✓ Secrets management", ], "Performance": [ "✓ Load testing completed", "✓ Memory usage optimized", "✓ Caching implemented", "✓ Auto-scaling rules configured", ], "Reliability": [ "✓ Health checks implemented", "✓ Error handling comprehensive", "✓ Graceful shutdown handled", ], "Monitoring": [ "✓ Application metrics", "✓ Error tracking", "✓ Log aggregation", "✓ Alert configuration", ], } ``` ## Summary and Next Steps This guide covers the essential patterns for building production-grade Chutes. ### Implementation Priority 1. **Security**: Authentication and input validation. 2. **Monitoring**: Logging and basic metrics. 3. **Performance**: Caching and resource management. 4. **Reliability**: Error handling and health checks. For more specific guides, see: - [Error Handling Guide](error-handling) - [Custom Images Guide](custom-images) - [Streaming Guide](streaming) - [Templates Guide](templates) - [Performance Optimization](performance) --- ## SOURCE: https://chutes.ai/docs/guides/cost-optimization # Cost Optimization Guide This guide provides strategies to optimize costs while maintaining performance and reliability for your Chutes applications. 
## Overview

Cost optimization in Chutes involves:

- **Resource Right-sizing**: Choose appropriate hardware configurations
- **Auto-scaling**: Scale resources based on demand
- **Spot Instances**: Use cost-effective computing options
- **Efficient Scheduling**: Optimize when workloads run
- **Model Optimization**: Reduce computational requirements

## Resource Right-sizing

### Choose Appropriate Hardware

Select the right GPU and memory configuration:

```python
from chutes.chute import Chute, NodeSelector

# Cost-optimized for inference
inference_chute = Chute(
    username="myuser",
    name="cost-optimized-inference",
    image=your_image,
    entry_file="app.py",
    entry_point="run",
    node_selector=NodeSelector(
        gpu_count=1,
        min_vram_gb_per_gpu=8,  # Right-size for your model; minimal RAM requirements
        preferred_provider="vast"  # Often more cost-effective
    ),
    timeout_seconds=300,
    concurrency=8
)

# For batch processing with higher throughput needs
batch_chute = Chute(
    username="myuser",
    name="batch-processing",
    image=your_image,
    entry_file="batch_app.py",
    entry_point="run",
    node_selector=NodeSelector(
        gpu_count=2,
        min_vram_gb_per_gpu=24
    ),
    timeout_seconds=1800,
    concurrency=4
)
```

## Spot Instance Strategy

### Using Spot Instances

Leverage spot instances for significant cost savings:

```python
import asyncio

from chutes.chute import Chute, NodeSelector

# Spot instance configuration
spot_chute = Chute(
    username="myuser",
    name="spot-training",
    image=training_image,
    entry_file="training.py",
    entry_point="run",
    node_selector=NodeSelector(
        gpu_count=4,
        min_vram_gb_per_gpu=16,
        max_spot_price=0.50  # Set maximum price you're willing to pay
    ),
    timeout_seconds=7200,  # Longer timeout for training
    concurrency=1,
    auto_scale=False
)

# Fault-tolerant batch processing with spot instances
class SpotInstanceManager:
    def __init__(self, chute_config):
        self.chute_config = chute_config
        self.retry_count = 3

    async def run_with_retry(self, inputs):
        """Run job with automatic retry on spot interruption"""
        for attempt in
range(self.retry_count): try: # Create chute with spot instance chute = Chute(**self.chute_config) result = chute.run(inputs) return result except Exception as e: if attempt == self.retry_count - 1: # All retries exhausted raise e # Wait before retry await asyncio.sleep(30) raise Exception("Failed after all retry attempts") ``` ## Smart Scaling Strategies ### Time-based Scaling Scale based on predictable usage patterns: ```python import schedule import time from datetime import datetime class TimeBasedScaler: def __init__(self, chute_name): self.chute_name = chute_name self.setup_schedule() def setup_schedule(self): """Set up scaling schedule based on usage patterns""" # Scale up during business hours schedule.every().monday.at("08:00").do(self.scale_up) schedule.every().tuesday.at("08:00").do(self.scale_up) schedule.every().wednesday.at("08:00").do(self.scale_up) schedule.every().thursday.at("08:00").do(self.scale_up) schedule.every().friday.at("08:00").do(self.scale_up) # Scale down after hours schedule.every().monday.at("18:00").do(self.scale_down) schedule.every().tuesday.at("18:00").do(self.scale_down) schedule.every().wednesday.at("18:00").do(self.scale_down) schedule.every().thursday.at("18:00").do(self.scale_down) schedule.every().friday.at("18:00").do(self.scale_down) # Minimal scaling on weekends schedule.every().saturday.at("00:00").do(self.scale_minimal) schedule.every().sunday.at("00:00").do(self.scale_minimal) def scale_up(self): """Scale up for peak hours""" self.update_chute_config({ "min_instances": 3, "max_instances": 10, "concurrency": 20 }) def scale_down(self): """Scale down for off-peak hours""" self.update_chute_config({ "min_instances": 1, "max_instances": 3, "concurrency": 8 }) def scale_minimal(self): """Minimal scaling for weekends""" self.update_chute_config({ "min_instances": 0, "max_instances": 2, "concurrency": 4 }) def update_chute_config(self, config): """Update chute configuration""" # Implementation to update chute scaling 
settings pass def run(self): """Run the scheduler""" while True: schedule.run_pending() time.sleep(60) ``` ### Demand-based Auto-scaling Implement intelligent auto-scaling: ```python import time class DemandBasedScaler: def __init__(self, chute, target_utilization=0.7): self.chute = chute self.target_utilization = target_utilization self.metrics_history = [] self.scale_cooldown = 300 # 5 minutes self.last_scale_time = 0 async def monitor_and_scale(self): """Monitor metrics and scale accordingly""" current_metrics = await self.get_current_metrics() self.metrics_history.append(current_metrics) # Keep only last 10 minutes of metrics if len(self.metrics_history) > 10: self.metrics_history.pop(0) # Calculate average utilization avg_utilization = sum(m['utilization'] for m in self.metrics_history) / len(self.metrics_history) current_time = time.time() time_since_last_scale = current_time - self.last_scale_time # Only scale if cooldown period has passed if time_since_last_scale < self.scale_cooldown: return if avg_utilization > self.target_utilization + 0.1: # Scale up await self.scale_up() self.last_scale_time = current_time elif avg_utilization < self.target_utilization - 0.2: # Scale down await self.scale_down() self.last_scale_time = current_time async def get_current_metrics(self): """Get current performance metrics""" # Implementation to get actual metrics return { 'utilization': 0.8, 'response_time': 200, 'queue_length': 5 } async def scale_up(self): """Scale up instances""" current_instances = await self.get_current_instance_count() new_count = min(current_instances + 1, self.chute.max_instances) await self.set_instance_count(new_count) async def scale_down(self): """Scale down instances""" current_instances = await self.get_current_instance_count() new_count = max(current_instances - 1, self.chute.min_instances) await self.set_instance_count(new_count) ``` ## Workload Optimization ### Batch Processing for Cost Efficiency Process multiple requests together: ```python import
asyncio import time from typing import List, Dict, Any class CostOptimizedBatchProcessor: def __init__(self, max_batch_size=32, max_wait_time=5.0): self.max_batch_size = max_batch_size self.max_wait_time = max_wait_time self.pending_requests = [] self.processing = False async def add_request(self, request_data: Dict[str, Any]) -> Any: """Add request to batch queue""" future = asyncio.Future() self.pending_requests.append({ 'data': request_data, 'future': future }) # Start processing if not already running if not self.processing: asyncio.create_task(self.process_batch()) return await future async def process_batch(self): """Process accumulated requests as a batch""" if self.processing: return self.processing = True # Wait for batch to fill up or timeout start_time = time.time() while (len(self.pending_requests) < self.max_batch_size and time.time() - start_time < self.max_wait_time): await asyncio.sleep(0.1) if not self.pending_requests: self.processing = False return # Extract batch batch = self.pending_requests[:self.max_batch_size] self.pending_requests = self.pending_requests[self.max_batch_size:] try: # Process batch batch_data = [req['data'] for req in batch] results = await self.process_batch_data(batch_data) # Return results to futures for req, result in zip(batch, results): req['future'].set_result(result) except Exception as e: # Handle batch errors for req in batch: req['future'].set_exception(e) finally: self.processing = False # Process remaining requests if any if self.pending_requests: asyncio.create_task(self.process_batch()) async def process_batch_data(self, batch_data: List[Dict[str, Any]]) -> List[Any]: """Process the actual batch - implement your logic here""" # Example: AI model inference on batch results = [] for data in batch_data: # Process individual item result = await self.model_inference(data) results.append(result) return results # Usage in chute batch_processor = CostOptimizedBatchProcessor(max_batch_size=16, max_wait_time=2.0) async def
run_cost_optimized(inputs: Dict[str, Any]) -> Any: """Cost-optimized endpoint using batching""" result = await batch_processor.add_request(inputs) return result ``` ## Model Optimization for Cost ### Model Quantization Reduce computational costs through quantization: ```python import torch from typing import List, Dict, Any from chutes.chute import Chute, NodeSelector from transformers import AutoModelForSequenceClassification, AutoTokenizer class QuantizedModelForCost: def __init__(self, model_name: str): self.tokenizer = AutoTokenizer.from_pretrained(model_name) # Load model with 8-bit quantization self.model = AutoModelForSequenceClassification.from_pretrained( model_name, load_in_8bit=True, # Reduces memory usage by ~50% device_map="auto" ) async def predict(self, texts: List[str]) -> List[Dict[str, Any]]: """Batch prediction with quantized model""" # Process in batches for efficiency batch_size = 16 results = [] for i in range(0, len(texts), batch_size): batch = texts[i:i + batch_size] # Tokenize batch inputs = self.tokenizer( batch, padding=True, truncation=True, return_tensors="pt", max_length=512 ) # Inference with torch.no_grad(): outputs = self.model(**inputs) predictions = torch.nn.functional.softmax(outputs.logits, dim=-1) # Extract results for j, prediction in enumerate(predictions): results.append({ 'text': batch[j], 'prediction': prediction.cpu().numpy().tolist(), 'confidence': float(torch.max(prediction)) }) return results # Deploy with cost-optimized settings cost_optimized_chute = Chute( username="myuser", name="quantized-inference", image=quantized_image, entry_file="quantized_model.py", entry_point="run", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=8 # Reduced from 16GB due to quantization ), concurrency=12, # Higher concurrency due to reduced memory usage timeout_seconds=120 ) ``` ### Model Caching Strategy Implement intelligent caching to reduce compute costs: ```python import hashlib import pickle import redis from typing import Optional, Dict, Any class CostOptimizedCache: def __init__(self, redis_url:
str = "redis://localhost:6379"): self.redis_client = redis.from_url(redis_url) self.hit_count = 0 self.miss_count = 0 def get_cache_key(self, inputs: Dict[str, Any]) -> str: """Generate cache key from inputs""" # Create deterministic hash of inputs input_str = str(sorted(inputs.items())) return f"model_cache:{hashlib.md5(input_str.encode()).hexdigest()}" async def get_cached_result(self, inputs: Dict[str, Any]) -> Optional[Any]: """Get cached result if available""" cache_key = self.get_cache_key(inputs) try: cached_data = self.redis_client.get(cache_key) if cached_data: self.hit_count += 1 return pickle.loads(cached_data) except Exception: pass self.miss_count += 1 return None async def cache_result(self, inputs: Dict[str, Any], result: Any, ttl: int = 3600): """Cache computation result""" cache_key = self.get_cache_key(inputs) try: serialized_result = pickle.dumps(result) self.redis_client.setex(cache_key, ttl, serialized_result) except Exception: pass def get_cache_stats(self) -> Dict[str, float]: """Get cache performance statistics""" total_requests = self.hit_count + self.miss_count if total_requests == 0: return {"hit_rate": 0.0, "miss_rate": 0.0} return { "hit_rate": self.hit_count / total_requests, "miss_rate": self.miss_count / total_requests, "total_requests": total_requests } # Global cache instance cost_cache = CostOptimizedCache() async def run_with_cost_cache(inputs: Dict[str, Any]) -> Any: """Run with intelligent caching for cost optimization""" # Try to get cached result first cached_result = await cost_cache.get_cached_result(inputs) if cached_result is not None: return { "result": cached_result, "cached": True, "cache_stats": cost_cache.get_cache_stats() } # Compute result if not cached result = await expensive_computation(inputs) # Cache result for future requests await cost_cache.cache_result(inputs, result, ttl=1800) # 30 minutes return { "result": result, "cached": False, "cache_stats": cost_cache.get_cache_stats() } ``` ## Cost Monitoring and 
Analytics ### Cost Tracking Monitor and track costs in real-time: ```python import time from typing import Any, Dict, List from dataclasses import dataclass from datetime import datetime, timedelta @dataclass class CostMetric: timestamp: float gpu_hours: float compute_cost: float request_count: int cache_hit_rate: float class CostMonitor: def __init__(self): self.cost_history: List[CostMetric] = [] self.hourly_costs: Dict[str, float] = {} self.daily_budgets: Dict[str, float] = {} def record_usage(self, gpu_hours: float, compute_cost: float, request_count: int, cache_hit_rate: float): """Record usage metrics""" metric = CostMetric( timestamp=time.time(), gpu_hours=gpu_hours, compute_cost=compute_cost, request_count=request_count, cache_hit_rate=cache_hit_rate ) self.cost_history.append(metric) # Update hourly tracking hour_key = datetime.now().strftime("%Y-%m-%d-%H") if hour_key not in self.hourly_costs: self.hourly_costs[hour_key] = 0 self.hourly_costs[hour_key] += compute_cost def get_daily_cost(self, date: str = None) -> float: """Get total cost for a specific day""" if date is None: date = datetime.now().strftime("%Y-%m-%d") daily_cost = 0 for hour_key, cost in self.hourly_costs.items(): if hour_key.startswith(date): daily_cost += cost return daily_cost def check_budget_alert(self, daily_budget: float) -> Dict[str, Any]: """Check if approaching budget limits""" current_cost = self.get_daily_cost() budget_usage = current_cost / daily_budget alert_level = "green" if budget_usage > 0.9: alert_level = "red" elif budget_usage > 0.7: alert_level = "yellow" return { "current_cost": current_cost, "daily_budget": daily_budget, "budget_usage": budget_usage, "alert_level": alert_level, "remaining_budget": daily_budget - current_cost } def get_cost_optimization_suggestions(self) -> List[str]: """Generate cost optimization suggestions""" suggestions = [] # Analyze recent metrics recent_metrics = self.cost_history[-10:] if len(self.cost_history) >= 10 else self.cost_history if
recent_metrics: avg_cache_hit_rate = sum(m.cache_hit_rate for m in recent_metrics) / len(recent_metrics) avg_cost_per_request = sum(m.compute_cost / max(m.request_count, 1) for m in recent_metrics) / len(recent_metrics) if avg_cache_hit_rate < 0.5: suggestions.append("Consider increasing cache TTL or implementing better caching strategy") if avg_cost_per_request > 0.01: # Threshold for expensive requests suggestions.append("Consider using smaller models or batch processing") # Check for usage patterns hourly_usage = {} for metric in recent_metrics: hour = datetime.fromtimestamp(metric.timestamp).hour if hour not in hourly_usage: hourly_usage[hour] = [] hourly_usage[hour].append(metric.compute_cost) # Suggest time-based scaling if usage varies significantly if len(hourly_usage) > 3: costs = [sum(costs) for costs in hourly_usage.values()] if max(costs) / min(costs) > 3: suggestions.append("Consider time-based scaling to reduce costs during low-usage periods") return suggestions # Global cost monitor cost_monitor = CostMonitor() async def run_with_cost_monitoring(inputs: Dict[str, Any]) -> Any: """Run with cost monitoring""" start_time = time.time() # Execute request result = await process_request(inputs) # Calculate metrics execution_time = time.time() - start_time gpu_hours = execution_time / 3600 # Convert to hours estimated_cost = gpu_hours * 0.50 # $0.50 per GPU hour (example rate) # Record usage cost_monitor.record_usage( gpu_hours=gpu_hours, compute_cost=estimated_cost, request_count=1, cache_hit_rate=0.8 # From cache system ) # Check budget budget_status = cost_monitor.check_budget_alert(daily_budget=50.0) return { "result": result, "cost_info": { "execution_time": execution_time, "estimated_cost": estimated_cost, "budget_status": budget_status } } ``` ## Cost Optimization Best Practices ### 1. 
Resource Selection - Choose the smallest GPU that meets your performance requirements - Use CPU-only instances for non-AI workloads - Consider memory requirements carefully ### 2. Scaling Strategy - Implement auto-scaling based on actual demand - Use time-based scaling for predictable patterns - Set appropriate scale-down policies ### 3. Workload Optimization - Batch requests when possible - Implement intelligent caching - Use model quantization for inference workloads ### 4. Monitoring and Alerts - Set up budget alerts and monitoring - Track cost per request and optimization opportunities - Regular review of usage patterns ## Next Steps - **[Performance Guide](performance)** - Optimize performance while controlling costs - **[Best Practices](best-practices)** - General optimization strategies - **[Monitoring](../monitoring)** - Advanced cost and performance monitoring For enterprise cost optimization, see the [Enterprise Cost Management Guide](../enterprise/cost-management). --- ## SOURCE: https://chutes.ai/docs/guides/custom-chutes # Building Custom Chutes This guide walks you through creating custom Chutes from scratch, covering everything from basic setup to advanced patterns for production applications. 
## Overview Custom Chutes give you complete control over your AI application architecture, allowing you to: - **Build Complex Logic**: Implement sophisticated AI pipelines - **Custom Dependencies**: Use any Python packages or system libraries - **Multiple Models**: Combine different AI models in a single service - **Advanced Processing**: Add preprocessing, postprocessing, and business logic - **Custom APIs**: Design exactly the endpoints you need ## Basic Custom Chute Structure ### Minimal Example Here's the simplest possible custom Chute: ```python from chutes.chute import Chute from chutes.image import Image # Create custom image image = ( Image(username="myuser", name="my-custom-app", tag="1.0") .from_base("python:3.11-slim") .run_command("pip install numpy pandas") ) # Create chute chute = Chute( username="myuser", name="my-custom-app", image=image ) @chute.on_startup() async def initialize(self): """Initialize any resources needed by your app.""" self.message = "Hello from custom chute!" 
@chute.cord(public_api_path="/hello", method="GET") async def hello(self): """Simple endpoint that returns a greeting.""" return {"message": self.message} ``` ### Adding Dependencies and Models ```python from chutes.chute import Chute, NodeSelector from chutes.image import Image from pydantic import BaseModel from typing import List, Optional # Define input/output schemas class AnalysisInput(BaseModel): text: str options: Optional[List[str]] = [] class AnalysisOutput(BaseModel): result: str confidence: float metadata: dict # Create custom image with AI dependencies image = ( Image(username="myuser", name="text-analyzer", tag="1.0") .from_base("nvidia/cuda:11.8-devel-ubuntu22.04") .run_command("apt update && apt install -y python3 python3-pip") .run_command("pip3 install torch transformers tokenizers") .run_command("pip3 install numpy pandas scikit-learn") .run_command("pip3 install fastapi uvicorn pydantic") .set_workdir("/app") ) # Create chute with GPU support chute = Chute( username="myuser", name="text-analyzer", image=image, node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=8 ), concurrency=4 ) @chute.on_startup() async def initialize_models(self): """Load AI models during startup.""" from transformers import pipeline import torch # Load sentiment analysis model self.sentiment_analyzer = pipeline( "sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment-latest", device=0 if torch.cuda.is_available() else -1 ) # Load text classification model self.classifier = pipeline( "zero-shot-classification", model="facebook/bart-large-mnli", device=0 if torch.cuda.is_available() else -1 ) @chute.cord( public_api_path="/analyze", method="POST", input_schema=AnalysisInput, output_schema=AnalysisOutput ) async def analyze_text(self, input_data: AnalysisInput) -> AnalysisOutput: """Analyze text with multiple AI models.""" # Sentiment analysis sentiment_result = self.sentiment_analyzer(input_data.text)[0] # Classification (if options provided) 
classification_result = None if input_data.options: classification_result = self.classifier( input_data.text, input_data.options ) # Combine results result = f"Sentiment: {sentiment_result['label']}" if classification_result: result += f", Category: {classification_result['labels'][0]}" return AnalysisOutput( result=result, confidence=sentiment_result['score'], metadata={ "sentiment": sentiment_result, "classification": classification_result } ) ``` ## Advanced Patterns ### Multi-Model Pipeline ```python from chutes.chute import Chute, NodeSelector from chutes.image import Image from pydantic import BaseModel, Field from typing import List, Dict, Any, Optional import asyncio class DocumentInput(BaseModel): text: str analyze_sentiment: bool = True extract_entities: bool = True summarize: bool = False max_summary_length: int = Field(default=150, ge=50, le=500) class DocumentOutput(BaseModel): original_text: str sentiment: Optional[Dict[str, Any]] = None entities: Optional[List[Dict[str, Any]]] = None summary: Optional[str] = None processing_time: float # Advanced image with multiple AI libraries image = ( Image(username="myuser", name="document-processor", tag="2.0") .from_base("nvidia/cuda:11.8-devel-ubuntu22.04") .run_command("apt update && apt install -y python3 python3-pip git") .run_command("pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118") .run_command("pip3 install transformers tokenizers") .run_command("pip3 install spacy") .run_command("python3 -m spacy download en_core_web_sm") .run_command("pip3 install sumy nltk") .run_command("pip3 install aiofiles") .set_workdir("/app") ) chute = Chute( username="myuser", name="document-processor", image=image, node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16 ), concurrency=6 ) @chute.on_startup() async def initialize_pipeline(self): """Initialize multiple AI models for document processing.""" from transformers import pipeline import spacy import torch import
time self.device = 0 if torch.cuda.is_available() else -1 # Load models print("Loading sentiment analyzer...") self.sentiment_analyzer = pipeline( "sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment-latest", device=self.device ) print("Loading NER model...") self.ner_model = pipeline( "ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", device=self.device, aggregation_strategy="simple" ) print("Loading summarization model...") self.summarizer = pipeline( "summarization", model="facebook/bart-large-cnn", device=self.device ) print("Loading spaCy model...") self.nlp = spacy.load("en_core_web_sm") print("All models loaded successfully!") async def analyze_sentiment_async(self, text: str) -> Dict[str, Any]: """Asynchronous sentiment analysis.""" loop = asyncio.get_event_loop() result = await loop.run_in_executor( None, lambda: self.sentiment_analyzer(text)[0] ) return result async def extract_entities_async(self, text: str) -> List[Dict[str, Any]]: """Asynchronous named entity recognition.""" loop = asyncio.get_event_loop() # Use transformers NER ner_results = await loop.run_in_executor( None, lambda: self.ner_model(text) ) # Also use spaCy for additional entity types spacy_results = await loop.run_in_executor( None, lambda: [(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in self.nlp(text).ents] ) # Combine results entities = [] # Add transformer results for entity in ner_results: entities.append({ "text": entity["word"], "label": entity["entity_group"], "confidence": entity["score"], "start": entity["start"], "end": entity["end"], "source": "transformers" }) # Add spaCy results for text_span, label, start, end in spacy_results: entities.append({ "text": text_span, "label": label, "confidence": 1.0, # spaCy doesn't provide confidence "start": start, "end": end, "source": "spacy" }) return entities async def summarize_async(self, text: str, max_length: int = 150) -> str: """Asynchronous text summarization.""" if 
len(text.split()) < 50: return text # Too short to summarize loop = asyncio.get_event_loop() result = await loop.run_in_executor( None, lambda: self.summarizer( text, max_length=max_length, min_length=30, do_sample=False )[0] ) return result["summary_text"] @chute.cord( public_api_path="/process", method="POST", input_schema=DocumentInput, output_schema=DocumentOutput ) async def process_document(self, input_data: DocumentInput) -> DocumentOutput: """Process document with multiple AI models in parallel.""" import time start_time = time.time() # Create tasks for parallel processing tasks = [] if input_data.analyze_sentiment: tasks.append(analyze_sentiment_async(self, input_data.text)) else: tasks.append(asyncio.create_task(asyncio.sleep(0, result=None))) if input_data.extract_entities: tasks.append(extract_entities_async(self, input_data.text)) else: tasks.append(asyncio.create_task(asyncio.sleep(0, result=None))) if input_data.summarize: tasks.append(summarize_async(self, input_data.text, input_data.max_summary_length)) else: tasks.append(asyncio.create_task(asyncio.sleep(0, result=None))) # Execute all tasks in parallel sentiment_result, entities_result, summary_result = await asyncio.gather(*tasks) processing_time = time.time() - start_time return DocumentOutput( original_text=input_data.text, sentiment=sentiment_result, entities=entities_result, summary=summary_result, processing_time=processing_time ) ``` ### State Management and Caching ```python from chutes.chute import Chute from chutes.image import Image from pydantic import BaseModel import asyncio from typing import Dict, Any, Optional import hashlib import json import time class StatefulChute(Chute): """Custom chute with built-in state management.""" def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) self.cache = {} self.session_data = {} self.request_history = [] # Create image with caching dependencies image = ( Image(username="myuser", name="stateful-app", tag="1.0") .from_base("python:3.11-slim")
.run_command("pip install redis aioredis") .run_command("pip install sqlalchemy aiosqlite") .run_command("pip install fastapi uvicorn pydantic") ) chute = StatefulChute( username="myuser", name="stateful-app", image=image ) @chute.on_startup() async def initialize_storage(self): """Initialize storage systems.""" import aioredis # In-memory cache self.memory_cache = {} self.cache_ttl = {} # Try to connect to Redis (optional) try: self.redis = await aioredis.create_redis_pool('redis://localhost') self.has_redis = True except: self.redis = None self.has_redis = False print("Redis not available, using memory cache only") # Session storage self.sessions = {} # Request tracking self.request_count = 0 self.last_requests = [] async def get_cached(self, key: str) -> Optional[Any]: """Get value from cache (Redis or memory).""" # Check memory cache first if key in self.memory_cache: if key in self.cache_ttl and time.time() > self.cache_ttl[key]: del self.memory_cache[key] del self.cache_ttl[key] else: return self.memory_cache[key] # Check Redis if available if self.has_redis: try: value = await self.redis.get(key) if value: return json.loads(value) except: pass return None async def set_cached(self, key: str, value: Any, ttl: int = 3600): """Set value in cache with TTL.""" # Store in memory cache self.memory_cache[key] = value self.cache_ttl[key] = time.time() + ttl # Store in Redis if available if self.has_redis: try: await self.redis.setex(key, ttl, json.dumps(value)) except: pass def get_cache_key(self, data: str, operation: str) -> str: """Generate cache key from data and operation.""" content = f"{operation}:{data}" return hashlib.md5(content.encode()).hexdigest() class ProcessingRequest(BaseModel): text: str operation: str = "analyze" use_cache: bool = True session_id: Optional[str] = None @chute.cord( public_api_path="/process_cached", method="POST", input_schema=ProcessingRequest ) async def process_with_caching(self, input_data: ProcessingRequest) -> Dict[str, Any]: 
"""Process request with caching and session management.""" # Track request self.request_count += 1 request_info = { "timestamp": time.time(), "operation": input_data.operation, "session_id": input_data.session_id } self.last_requests.append(request_info) # Keep only last 100 requests if len(self.last_requests) > 100: self.last_requests = self.last_requests[-100:] # Check cache cache_key = get_cache_key(self, input_data.text, input_data.operation) if input_data.use_cache: cached_result = await get_cached(self, cache_key) if cached_result: cached_result["from_cache"] = True cached_result["request_id"] = self.request_count return cached_result # Process request (simulate AI processing) await asyncio.sleep(0.1) # Simulate processing time result = { "text": input_data.text, "operation": input_data.operation, "result": f"Processed: {input_data.text[:50]}...", "timestamp": time.time(), "request_id": self.request_count, "from_cache": False } # Store in cache if input_data.use_cache: await set_cached(self, cache_key, result, ttl=1800) # 30 minutes # Update session data if input_data.session_id: if input_data.session_id not in self.sessions: self.sessions[input_data.session_id] = { "created": time.time(), "requests": [] } self.sessions[input_data.session_id]["requests"].append({ "request_id": self.request_count, "operation": input_data.operation, "timestamp": time.time() }) return result @chute.cord(public_api_path="/stats", method="GET") async def get_stats(self) -> Dict[str, Any]: """Get service statistics.""" cache_size = len(self.memory_cache) session_count = len(self.sessions) # Recent request stats recent_requests = [r for r in self.last_requests if time.time() - r["timestamp"] < 3600] # Last hour operation_counts = {} for req in recent_requests: op = req["operation"] operation_counts[op] = operation_counts.get(op, 0) + 1 return { "total_requests": self.request_count, "cache_size": cache_size, "session_count": session_count, "recent_requests_1h": len(recent_requests), 
"operation_counts": operation_counts, "has_redis": self.has_redis } ``` ### Background Jobs and Queues ```python from chutes.chute import Chute from chutes.image import Image from fastapi import HTTPException from pydantic import BaseModel, Field from typing import Dict, List, Optional import asyncio import uuid import time from enum import Enum class JobStatus(str, Enum): PENDING = "pending" RUNNING = "running" COMPLETED = "completed" FAILED = "failed" class JobRequest(BaseModel): task_type: str data: Dict priority: int = Field(default=1, ge=1, le=5) class JobResponse(BaseModel): job_id: str status: JobStatus created_at: float started_at: Optional[float] = None completed_at: Optional[float] = None result: Optional[Dict] = None error: Optional[str] = None # Create image with job processing capabilities image = ( Image(username="myuser", name="job-processor", tag="1.0") .from_base("python:3.11-slim") .run_command("pip install aiofiles") .run_command("pip install celery redis") # For advanced job queues .run_command("pip install fastapi uvicorn pydantic") ) chute = Chute( username="myuser", name="job-processor", image=image, concurrency=8 ) @chute.on_startup() async def initialize_job_system(self): """Initialize job processing system.""" # Job storage self.jobs = {} self.job_queue = asyncio.Queue() # Job processing self.workers = [] self.max_workers = 4 # Start background workers for i in range(self.max_workers): worker = asyncio.create_task(self.job_worker(f"worker-{i}")) self.workers.append(worker) print(f"Started {self.max_workers} job workers") async def job_worker(self, worker_name: str): """Background worker to process jobs.""" while True: try: # Get job from queue job_id = await self.job_queue.get() if job_id not in self.jobs: continue job = self.jobs[job_id] # Update job status job["status"] = JobStatus.RUNNING job["started_at"] = time.time() job["worker"] = worker_name print(f"{worker_name} processing job {job_id}") # Process job based on type try: if job["task_type"] == "text_analysis":
result = await self.process_text_analysis(job["data"]) elif job["task_type"] == "data_processing": result = await self.process_data(job["data"]) elif job["task_type"] == "file_conversion": result = await self.process_file_conversion(job["data"]) else: raise ValueError(f"Unknown task type: {job['task_type']}") # Job completed successfully job["status"] = JobStatus.COMPLETED job["completed_at"] = time.time() job["result"] = result except Exception as e: # Job failed job["status"] = JobStatus.FAILED job["completed_at"] = time.time() job["error"] = str(e) print(f"Job {job_id} failed: {e}") # Mark task as done self.job_queue.task_done() except Exception as e: print(f"Worker {worker_name} error: {e}") await asyncio.sleep(1) async def process_text_analysis(self, data: Dict) -> Dict: """Process text analysis job.""" text = data.get("text", "") # Simulate AI processing await asyncio.sleep(2) # Simulate processing time return { "text": text, "length": len(text), "word_count": len(text.split()), "analysis": "Text analysis completed" } async def process_data(self, data: Dict) -> Dict: """Process data processing job.""" items = data.get("items", []) # Simulate data processing await asyncio.sleep(len(items) * 0.1) return { "processed_items": len(items), "total_value": sum(item.get("value", 0) for item in items) } async def process_file_conversion(self, data: Dict) -> Dict: """Process file conversion job.""" file_type = data.get("file_type", "") target_type = data.get("target_type", "") # Simulate file conversion await asyncio.sleep(3) return { "source_type": file_type, "target_type": target_type, "status": "converted", "file_size": "1.2MB" } @chute.cord( public_api_path="/jobs", method="POST", input_schema=JobRequest ) async def submit_job(self, job_request: JobRequest) -> Dict[str, str]: """Submit a new job for processing.""" job_id = str(uuid.uuid4()) # Create job record job = { "id": job_id, "task_type": job_request.task_type, "data": job_request.data, "priority": 
        job_request.priority,
        "status": JobStatus.PENDING,
        "created_at": time.time(),
        "started_at": None,
        "completed_at": None,
        "result": None,
        "error": None,
        "worker": None
    }

    self.jobs[job_id] = job

    # Add to queue
    await self.job_queue.put(job_id)

    return {"job_id": job_id, "status": "submitted"}

from fastapi import HTTPException  # required by the endpoints below

@chute.cord(public_api_path="/jobs/{job_id}", method="GET")
async def get_job_status(self, job_id: str) -> JobResponse:
    """Get status of a specific job."""
    if job_id not in self.jobs:
        raise HTTPException(status_code=404, detail="Job not found")

    job = self.jobs[job_id]

    return JobResponse(
        job_id=job["id"],
        status=job["status"],
        created_at=job["created_at"],
        started_at=job["started_at"],
        completed_at=job["completed_at"],
        result=job["result"],
        error=job["error"]
    )

@chute.cord(public_api_path="/jobs", method="GET")
async def list_jobs(self, status: Optional[JobStatus] = None, limit: int = 50) -> Dict:
    """List jobs with optional filtering."""
    jobs = list(self.jobs.values())

    # Filter by status if specified
    if status:
        jobs = [job for job in jobs if job["status"] == status]

    # Sort by creation time (newest first)
    jobs.sort(key=lambda x: x["created_at"], reverse=True)

    # Limit results
    jobs = jobs[:limit]

    # Convert to response format
    job_list = []
    for job in jobs:
        job_list.append(JobResponse(
            job_id=job["id"],
            status=job["status"],
            created_at=job["created_at"],
            started_at=job["started_at"],
            completed_at=job["completed_at"],
            result=job["result"],
            error=job["error"]
        ))

    return {
        "jobs": job_list,
        "total": len(job_list),
        "queue_size": self.job_queue.qsize()
    }

# Background job decorator
@chute.job()
async def cleanup_old_jobs(self):
    """Clean up completed jobs older than 24 hours."""
    cutoff_time = time.time() - (24 * 60 * 60)  # 24 hours ago

    jobs_to_remove = []
    for job_id, job in self.jobs.items():
        if (job["status"] in [JobStatus.COMPLETED, JobStatus.FAILED]
                and job["completed_at"]
                and job["completed_at"] < cutoff_time):
            jobs_to_remove.append(job_id)

    for job_id in jobs_to_remove:
        del self.jobs[job_id]

    if jobs_to_remove:
        print(f"Cleaned up {len(jobs_to_remove)} old jobs")
```

## Best Practices

### 1. Error Handling

```python
import asyncio
import traceback

from fastapi import HTTPException
from loguru import logger

@chute.cord(public_api_path="/robust", method="POST")
async def robust_endpoint(self, input_data: Dict) -> Dict:
    """Endpoint with comprehensive error handling."""
    try:
        # Validate input
        if not input_data.get("text"):
            raise HTTPException(
                status_code=400,
                detail="Missing required field: text"
            )

        # Process with timeout
        result = await asyncio.wait_for(
            self.process_text(input_data["text"]),
            timeout=30.0
        )

        return {"result": result, "status": "success"}

    except asyncio.TimeoutError:
        logger.error("Processing timeout")
        raise HTTPException(
            status_code=408,
            detail="Processing timeout - request took too long"
        )
    except ValueError as e:
        logger.error(f"Validation error: {e}")
        raise HTTPException(
            status_code=400,
            detail=f"Invalid input: {str(e)}"
        )
    except Exception as e:
        logger.error(f"Unexpected error: {e}\n{traceback.format_exc()}")
        raise HTTPException(
            status_code=500,
            detail="Internal server error"
        )
```

### 2. Resource Management

```python
@chute.on_startup()
async def initialize_with_resource_management(self):
    """Initialize with proper resource management."""
    import aiohttp
    import torch

    # GPU memory management
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        self.device = torch.device("cuda")

        # Monitor GPU memory
        self.gpu_memory_threshold = 0.9  # 90% usage threshold
    else:
        self.device = torch.device("cpu")

    # Connection pools
    self.session = aiohttp.ClientSession(
        connector=aiohttp.TCPConnector(limit=100)
    )

    # Resource cleanup tracking
    self.cleanup_tasks = []

@chute.on_shutdown()
async def cleanup_resources(self):
    """Clean up resources on shutdown."""
    import torch

    # Close HTTP session
    if hasattr(self, 'session'):
        await self.session.close()

    # Cancel background tasks
    for task in self.cleanup_tasks:
        task.cancel()

    # Clear GPU memory
    if hasattr(self, 'device') and self.device.type == 'cuda':
        torch.cuda.empty_cache()

    print("Resources cleaned up successfully")
```

### 3. Monitoring and Metrics

```python
import functools
import time
from collections import defaultdict

@chute.on_startup()
async def initialize_metrics(self):
    """Initialize metrics collection."""
    self.metrics = {
        "request_count": 0,
        "error_count": 0,
        "response_times": [],
        "endpoint_usage": defaultdict(int)
    }

    # Start metrics collection task
    self.metrics_task = asyncio.create_task(self.collect_metrics())

async def collect_metrics(self):
    """Background task to collect and log metrics."""
    while True:
        try:
            await asyncio.sleep(60)  # Collect every minute

            if self.metrics["response_times"]:
                avg_response_time = sum(self.metrics["response_times"]) / len(self.metrics["response_times"])
                self.metrics["response_times"] = []  # Reset
            else:
                avg_response_time = 0

            logger.info(f"Metrics - Requests: {self.metrics['request_count']}, "
                        f"Errors: {self.metrics['error_count']}, "
                        f"Avg Response Time: {avg_response_time:.2f}s")
        except Exception as e:
            logger.error(f"Metrics collection error: {e}")

# Decorator for automatic metrics collection
def with_metrics(func):
    """Decorator to automatically collect metrics."""
    @functools.wraps(func)  # preserve name/signature for route introspection
    async def wrapper(self, *args, **kwargs):
        start_time = time.time()
        try:
            self.metrics["request_count"] += 1
            self.metrics["endpoint_usage"][func.__name__] += 1

            result = await func(self, *args, **kwargs)

            response_time = time.time() - start_time
            self.metrics["response_times"].append(response_time)

            return result
        except Exception:
            self.metrics["error_count"] += 1
            raise
    return wrapper

@chute.cord(public_api_path="/monitored", method="POST")
@with_metrics
async def monitored_endpoint(self, input_data: Dict) -> Dict:
    """Endpoint with automatic metrics collection."""
    # Your processing logic here
    await asyncio.sleep(0.1)  # Simulate work
    return {"result": "processed", "input": input_data}
```

## Testing and Development

### Local Testing

```python
# test_custom_chute.py
import pytest
import asyncio
from unittest.mock import Mock, AsyncMock

@pytest.mark.asyncio
async def test_chute_initialization():
    """Test chute startup."""
    # Mock the chute
    chute_mock = Mock()
    chute_mock.initialize_models = AsyncMock()

    # Test initialization
    await chute_mock.initialize_models()
    assert chute_mock.initialize_models.called

@pytest.mark.asyncio
async def test_endpoint_functionality():
    """Test endpoint logic."""
    # Create test instance
    chute_instance = Mock()
    chute_instance.process_text = AsyncMock(return_value="processed result")

    # Test data
    test_input = {"text": "test input"}

    # Call function
    result = await chute_instance.process_text(test_input["text"])
    assert result == "processed result"

# Run tests:
# pytest test_custom_chute.py -v
```

### Development Workflow

```bash
# 1. Create and test locally
python my_chute.py  # Test locally first

# 2. Build image
chutes build my-custom-app:chute --wait

# 3. Deploy to staging
chutes deploy my-custom-app:chute --wait

# 4. Test deployed service
curl https://myuser-my-custom-app.chutes.ai/hello

# 5. Monitor and iterate
chutes chutes logs my-custom-app
chutes chutes metrics my-custom-app
```

## Advanced Topics

### 1. Custom Middleware

```python
from fastapi import Request, Response
import time

@chute.middleware("http")
async def add_process_time_header(request: Request, call_next):
    """Add processing time header to all responses."""
    start_time = time.time()
    response = await call_next(request)
    process_time = time.time() - start_time
    response.headers["X-Process-Time"] = str(process_time)
    return response
```

### 2. Custom Dependencies

```python
from fastapi import Depends, Header, HTTPException

async def verify_api_key(api_key: str = Header(None)) -> str:
    """Verify API key dependency."""
    if not api_key or api_key != "your-secret-key":
        raise HTTPException(status_code=401, detail="Invalid API key")
    return api_key

@chute.cord(public_api_path="/secure", method="POST")
async def secure_endpoint(
    self,
    input_data: Dict,
    api_key: str = Depends(verify_api_key)
) -> Dict:
    """Secure endpoint requiring API key."""
    return {"message": "Access granted", "data": input_data}
```

### 3. WebSocket Support

```python
from fastapi import WebSocket

@chute.websocket("/ws")
async def websocket_endpoint(self, websocket: WebSocket):
    """WebSocket endpoint for real-time communication."""
    await websocket.accept()
    try:
        while True:
            # Receive message
            data = await websocket.receive_text()

            # Process message
            response = await self.process_message(data)

            # Send response
            await websocket.send_text(response)
    except Exception as e:
        print(f"WebSocket error: {e}")
    finally:
        await websocket.close()
```

## Next Steps

- **Production Deployment**: Scale and monitor custom chutes
- **Advanced Patterns**: Implement microservices architectures
- **Integration**: Connect with external APIs and databases
- **Optimization**: Profile and optimize performance

For more advanced topics, see:

- [Error Handling Guide](error-handling)
- [Best Practices](best-practices)
- [Performance Optimization](performance-optimization)

---

## SOURCE: https://chutes.ai/docs/guides/custom-images

# Custom Image Building

This guide covers advanced Docker image
building techniques for Chutes, enabling you to create optimized, production-ready containers for AI applications with custom dependencies, performance tuning, and security considerations. ## Overview Custom images in Chutes provide: - **Full Control**: Complete control over the software stack - **Optimization**: Fine-tuned performance for specific workloads - **Custom Dependencies**: Any Python packages, system libraries, or tools - **Reproducibility**: Versioned, immutable deployments - **Caching**: Intelligent layer caching for fast rebuilds - **Security**: Hardened containers with minimal attack surface ## Basic Image Building ### Simple Custom Image ```python from chutes.image import Image # Basic custom image image = ( Image(username="myuser", name="my-app", tag="1.0") .from_base("python:3.11-slim") .run_command("pip install numpy pandas scikit-learn") .with_workdir("/app") ) ``` ### Fluent API Patterns The Chutes Image class uses a fluent API for building complex Docker images: ```python image = ( Image(username="myuser", name="ai-pipeline", tag="2.1") .from_base("nvidia/cuda:11.8-devel-ubuntu22.04") # System setup .run_command("apt update && apt install -y python3 python3-pip git curl") .run_command("apt install -y ffmpeg libsm6 libxext6") # OpenCV dependencies # Python environment .run_command("pip3 install --upgrade pip setuptools wheel") .run_command("pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118") # AI libraries .run_command("pip3 install transformers accelerate") .run_command("pip3 install opencv-python pillow") .run_command("pip3 install fastapi uvicorn pydantic") # Environment configuration .with_env("PYTHONPATH", "/app") .with_env("CUDA_VISIBLE_DEVICES", "0") .with_workdir("/app") # User setup for security .run_command("useradd -m -u 1000 appuser") .run_command("chown -R appuser:appuser /app") .with_user("appuser") ) ``` ## Advanced Image Building Patterns ### Multi-Stage Builds Use multi-stage builds 
for smaller, more secure production images:

```python
# Build stage
build_image = (
    Image(username="myuser", name="ai-builder", tag="build")
    .from_base("nvidia/cuda:11.8-devel-ubuntu22.04")
    .run_command("apt update && apt install -y python3 python3-pip git build-essential")
    .run_command("pip3 install --upgrade pip setuptools wheel")

    # Install build dependencies
    .run_command("pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu118")
    .run_command("pip3 install transformers[torch]")
    .run_command("pip3 install accelerate bitsandbytes")

    # Compile custom CUDA kernels if needed
    .run_command("pip3 install flash-attn --no-build-isolation")
    .run_command("pip3 install apex --no-build-isolation")

    # Copy application code
    .copy_file("requirements.txt", "/tmp/requirements.txt")
    .run_command("pip3 install -r /tmp/requirements.txt")
)

# Production stage - smaller runtime image
production_image = (
    Image(username="myuser", name="ai-runtime", tag="1.0")
    .from_base("nvidia/cuda:11.8-runtime-ubuntu22.04")  # Runtime only, not devel
    .run_command("apt update && apt install -y python3 python3-pip")
    .run_command("rm -rf /var/lib/apt/lists/*")  # Clean up package cache

    # Note: copy_from_image not available - use an external build process

    # Application setup
    .run_command("useradd -m -u 1000 appuser && mkdir -p /app && chown appuser:appuser /app")
    .with_workdir("/app")
    .with_user("appuser")  # Non-root user
)
```

### GPU-Optimized Images

Build images optimized for different GPU architectures:

```python
def create_gpu_optimized_image(gpu_arch: str = "ampere"):
    """Create GPU-optimized image for specific architecture."""

    # Base images optimized for different GPU generations
    base_images = {
        "pascal": "nvidia/cuda:11.2-devel-ubuntu20.04",  # GTX 10xx, P100
        "volta": "nvidia/cuda:11.4-devel-ubuntu20.04",   # V100, Titan V
        "turing": "nvidia/cuda:11.6-devel-ubuntu20.04",  # RTX 20xx, T4
        "ampere": "nvidia/cuda:11.8-devel-ubuntu22.04",  # RTX 30xx, A100
        "ada": "nvidia/cuda:12.1-devel-ubuntu22.04",     # RTX 40xx
        "hopper": "nvidia/cuda:12.2-devel-ubuntu22.04",  # H100
    }

    #
Architecture-specific optimizations torch_arch_flags = { "pascal": "6.0;6.1", "volta": "7.0", "turing": "7.5", "ampere": "8.0;8.6", "ada": "8.9", "hopper": "9.0" } base_image = base_images.get(gpu_arch, base_images["ampere"]) arch_flags = torch_arch_flags.get(gpu_arch, "8.0;8.6") return ( Image(username="myuser", name=f"gpu-{gpu_arch}", tag="1.0") .from_base(base_image) .with_env("TORCH_CUDA_ARCH_LIST", arch_flags) .with_env("CUDA_ARCHITECTURES", arch_flags.replace(";", " ")) # Install optimized PyTorch .run_command("pip3 install --upgrade pip") .run_command("pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118") # Compile architecture-specific kernels .run_command(f"pip3 install flash-attn --no-build-isolation") .run_command("pip3 install xformers") # Memory-efficient attention # Install performance libraries .run_command("pip3 install triton") # CUDA kernel compilation .run_command("pip3 install apex --no-build-isolation") # Mixed precision ) # Usage ampere_image = create_gpu_optimized_image("ampere") # For A100, RTX 30xx hopper_image = create_gpu_optimized_image("hopper") # For H100 ``` ### AI Framework-Specific Images Create specialized images for different AI frameworks: ```python class AIFrameworkImages: """Collection of framework-specific image builders.""" @staticmethod def pytorch_image(version: str = "2.1.0", cuda_version: str = "11.8"): """PyTorch optimized image.""" return ( Image(username="myuser", name="pytorch", tag=version) .from_base(f"nvidia/cuda:{cuda_version}-devel-ubuntu22.04") .run_command("apt update && apt install -y python3 python3-pip") # Install PyTorch with CUDA support .run_command(f"pip3 install torch=={version} torchvision torchaudio --index-url https://download.pytorch.org/whl/cu{cuda_version.replace('.', '')}") # Performance optimizations .run_command("pip3 install accelerate") .run_command("pip3 install xformers") # Memory-efficient transformers .run_command("pip3 install flash-attn 
--no-build-isolation") # Common ML libraries .run_command("pip3 install transformers datasets tokenizers") .run_command("pip3 install numpy scipy scikit-learn pandas") # Environment optimizations .with_env("TORCH_BACKENDS_CUDNN_BENCHMARK", "1") .with_env("TORCH_BACKENDS_CUDNN_DETERMINISTIC", "0") ) @staticmethod def tensorflow_image(version: str = "2.13.0"): """TensorFlow optimized image.""" return ( Image(username="myuser", name="tensorflow", tag=version) .from_base("tensorflow/tensorflow:2.13.0-gpu") # Additional TF ecosystem .run_command("pip3 install tensorflow-hub tensorflow-datasets") .run_command("pip3 install tensorflow-probability") .run_command("pip3 install tensorboard") # Optimization libraries .run_command("pip3 install tf-keras-vis") # Visualization .run_command("pip3 install tensorflow-model-optimization") # Quantization # Environment configuration .with_env("TF_FORCE_GPU_ALLOW_GROWTH", "true") .with_env("TF_GPU_MEMORY_ALLOCATION", "incremental") ) @staticmethod def jax_image(version: str = "0.4.14"): """JAX optimized image.""" return ( Image(username="myuser", name="jax", tag=version) .from_base("nvidia/cuda:11.8-devel-ubuntu22.04") .run_command("apt update && apt install -y python3 python3-pip") # Install JAX with CUDA .run_command(f"pip3 install jax[cuda11_local]=={version} -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html") .run_command("pip3 install flax optax") # Common JAX libraries .run_command("pip3 install chex dm-haiku") # DeepMind utilities # Performance libraries .run_command("pip3 install jaxlib") .run_command("pip3 install equinox") # Neural networks in JAX ) # Usage examples pytorch_img = AIFrameworkImages.pytorch_image("2.1.0", "11.8") tf_img = AIFrameworkImages.tensorflow_image("2.13.0") jax_img = AIFrameworkImages.jax_image("0.4.14") ``` ## Performance Optimization ### Compilation and Caching Optimize build times and runtime performance: ```python def create_optimized_ai_image(): """Create performance-optimized 
AI image.""" return ( Image(username="myuser", name="optimized-ai", tag="1.0") .from_base("nvidia/cuda:11.8-devel-ubuntu22.04") # System optimizations .run_command("apt update && apt install -y python3 python3-pip build-essential") .run_command("apt install -y ccache") # Compiler cache # Configure compilation cache .with_env("CCACHE_DIR", "/tmp/ccache") .with_env("CCACHE_MAXSIZE", "2G") # Python optimizations .with_env("PYTHONOPTIMIZE", "2") # Enable optimizations .with_env("PYTHONDONTWRITEBYTECODE", "1") # Don't write .pyc files # PyTorch compilation cache .with_env("TORCH_COMPILE_CACHE_DIR", "/tmp/torch_cache") .run_command("mkdir -p /tmp/torch_cache") # Install with optimizations .run_command("pip3 install --upgrade pip setuptools wheel") .run_command("CC='ccache gcc' pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu118") # Compile frequently used kernels ahead of time .run_command("python3 -c 'import torch; torch.compile(torch.nn.Linear(10, 1))'") # Clean up build artifacts .run_command("apt remove -y build-essential && apt autoremove -y") .run_command("rm -rf /var/lib/apt/lists/* /tmp/ccache") ) ``` ### Memory Optimization Create memory-efficient images: ```python def create_memory_optimized_image(): """Create memory-efficient image for resource-constrained environments.""" return ( Image(username="myuser", name="memory-optimized", tag="1.0") .from_base("python:3.11-slim") # Smaller base image # Minimal system dependencies .run_command("apt update && apt install -y --no-install-recommends python3-dev gcc") # Install only essential packages .run_command("pip3 install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu") # CPU-only for smaller size .run_command("pip3 install --no-cache-dir transformers[torch]") # Memory optimizations .with_env("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128") .with_env("TRANSFORMERS_CACHE", "/tmp/transformers_cache") # Clean up .run_command("apt remove -y gcc python3-dev && apt 
autoremove -y") .run_command("rm -rf /var/lib/apt/lists/*") .run_command("pip3 cache purge") ) ``` ## Security Hardening ### Secure Base Images Build security-hardened images: ```python def create_secure_image(): """Create security-hardened image.""" return ( Image(username="myuser", name="secure-ai", tag="1.0") .from_base("nvidia/cuda:11.8-runtime-ubuntu22.04") # Runtime, not devel # Security updates .run_command("apt update && apt upgrade -y") # Install only necessary packages .run_command("apt install -y --no-install-recommends python3 python3-pip") # Create non-root user .run_command("groupadd -r appgroup && useradd -r -g appgroup -u 1000 appuser") .run_command("mkdir -p /app && chown appuser:appgroup /app") # Remove unnecessary packages and files .run_command("apt remove -y --purge wget curl && apt autoremove -y") .run_command("rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*") # Security configurations .run_command("chmod 755 /app") .run_command("find /usr -type f -perm +6000 -exec chmod -s {} \\; || true") # Remove setuid/setgid # Switch to non-root user .with_user("appuser") .with_workdir("/app") # Security environment variables .with_env("PYTHONDONTWRITEBYTECODE", "1") .with_env("PYTHONUNBUFFERED", "1") ) ``` ### Secrets Management Handle secrets securely in images: ```python def create_image_with_secrets(): """Create image with proper secrets handling.""" return ( Image(username="myuser", name="secure-secrets", tag="1.0") .from_base("python:3.11-slim") # Install secrets management tools .run_command("pip3 install cryptography python-dotenv") # Create secrets directory with proper permissions .run_command("mkdir -p /app/secrets && chmod 700 /app/secrets") # Never embed secrets in image layers! 
# Use environment variables or mounted volumes instead .with_env("SECRETS_PATH", "/app/secrets") # Configure for external secret injection .run_command("echo '#!/bin/bash\n" "if [ -f /app/secrets/.env ]; then\n" " export $(cat /app/secrets/.env | grep -v ^# | xargs)\n" "fi\n" "exec \"$@\"' > /app/entrypoint.sh") .run_command("chmod +x /app/entrypoint.sh") # Use entrypoint for secret loading .with_entrypoint(["/app/entrypoint.sh"]) ) ``` ## Specialized Image Types ### Development Images Create development-friendly images with debugging tools: ```python def create_development_image(): """Create development image with debugging tools.""" return ( Image(username="myuser", name="dev-ai", tag="latest") .from_base("nvidia/cuda:11.8-devel-ubuntu22.04") # Development tools .run_command("apt update && apt install -y python3 python3-pip git vim curl htop") .run_command("apt install -y iputils-ping net-tools strace gdb") # Python development tools .run_command("pip3 install ipython jupyter notebook") .run_command("pip3 install debugpy pytest pytest-cov") .run_command("pip3 install black isort flake8 mypy") # AI libraries with debug symbols .run_command("pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu118") .run_command("pip3 install transformers[dev]") # Jupyter configuration .run_command("jupyter notebook --generate-config") .run_command("echo \"c.NotebookApp.ip = '0.0.0.0'\" >> ~/.jupyter/jupyter_notebook_config.py") .run_command("echo \"c.NotebookApp.token = ''\" >> ~/.jupyter/jupyter_notebook_config.py") # Development environment .with_env("PYTHONPATH", "/app") .with_env("JUPYTER_ENABLE_LAB", "yes") .with_workdir("/app") # Expose Jupyter port .expose_port(8888) ) ``` ### Production Images Create production-optimized images: ```python def create_production_image(): """Create production-ready image.""" return ( Image(username="myuser", name="prod-ai", tag="1.0") .from_base("nvidia/cuda:11.8-runtime-ubuntu22.04") # Runtime only # Minimal 
production dependencies .run_command("apt update && apt install -y --no-install-recommends python3 python3-pip") # Production Python packages .run_command("pip3 install --no-cache-dir torch torchvision --index-url https://download.pytorch.org/whl/cu118") .run_command("pip3 install --no-cache-dir transformers accelerate") .run_command("pip3 install --no-cache-dir fastapi uvicorn[standard]") # Production optimizations .with_env("PYTHONOPTIMIZE", "2") .with_env("PYTHONDONTWRITEBYTECODE", "1") .with_env("PYTHONUNBUFFERED", "1") # Health check script .run_command("echo '#!/bin/bash\ncurl -f http://localhost:8000/health || exit 1' > /app/healthcheck.sh") .run_command("chmod +x /app/healthcheck.sh") # Non-root user for security .run_command("useradd -m -u 1000 appuser") .run_command("mkdir -p /app && chown appuser:appuser /app") .with_user("appuser") # Clean up .run_command("rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*") # Health check .with_healthcheck(["CMD", "/app/healthcheck.sh"]) ) ``` ## Image Management and Versioning ### Semantic Versioning Implement proper versioning for images: ```python class VersionedImageBuilder: """Build images with semantic versioning.""" def __init__(self, username: str, name: str): self.username = username self.name = name self.major = 1 self.minor = 0 self.patch = 0 self.build = None def version(self, major: int, minor: int, patch: int, build: str = None): """Set version numbers.""" self.major = major self.minor = minor self.patch = patch self.build = build return self def get_version_tag(self) -> str: """Get formatted version tag.""" tag = f"{self.major}.{self.minor}.{self.patch}" if self.build: tag += f"-{self.build}" return tag def build_image(self, base_config_func): """Build image with version tags.""" version_tag = self.get_version_tag() image = base_config_func( Image(self.username, self.name, version_tag) ) # Add version metadata image = ( image .with_label("version", version_tag) .with_label("major", str(self.major)) 
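            # (Optional, illustrative) Standard OCI annotation keys can be set the
            # same way so registry tooling can read the version, assuming with_label
            # accepts dotted keys, e.g.:
            #   .with_label("org.opencontainers.image.version", version_tag)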
.with_label("minor", str(self.minor)) .with_label("patch", str(self.patch)) ) if self.build: image = image.with_label("build", self.build) return image # Usage def my_ai_config(image: Image) -> Image: return ( image .from_base("nvidia/cuda:11.8-runtime-ubuntu22.04") .run_command("pip3 install torch transformers") ) builder = VersionedImageBuilder("myuser", "my-ai-app") image = builder.version(2, 1, 0, "beta").build_image(my_ai_config) ``` ### Environment-Specific Images Build images for different environments: ```python class EnvironmentImageBuilder: """Build environment-specific images.""" @staticmethod def development(base_image: Image) -> Image: """Development environment configuration.""" return ( base_image .run_command("pip3 install ipython jupyter pytest debugpy") .with_env("FLASK_ENV", "development") .with_env("LOG_LEVEL", "DEBUG") .expose_port(8888) # Jupyter .expose_port(5678) # Debugger ) @staticmethod def staging(base_image: Image) -> Image: """Staging environment configuration.""" return ( base_image .with_env("FLASK_ENV", "staging") .with_env("LOG_LEVEL", "INFO") .with_healthcheck(["CMD", "curl", "-f", "http://localhost:8000/health"]) ) @staticmethod def production(base_image: Image) -> Image: """Production environment configuration.""" return ( base_image .with_env("FLASK_ENV", "production") .with_env("LOG_LEVEL", "WARNING") .with_env("PYTHONOPTIMIZE", "2") .run_command("pip3 cache purge") # Clean up cache .with_healthcheck(["CMD", "curl", "-f", "http://localhost:8000/health"]) ) # Usage base = Image("myuser", "my-app", "1.0").from_base("python:3.11-slim") dev_image = EnvironmentImageBuilder.development(base) staging_image = EnvironmentImageBuilder.staging(base) prod_image = EnvironmentImageBuilder.production(base) ``` ## Testing and Validation ### Image Testing Framework Test images before deployment: ```python import subprocess import tempfile import json class ImageTester: """Test framework for validating images.""" def __init__(self, image: 
Image):
        self.image = image
        self.test_results = []

    def test_python_imports(self, packages: list):
        """Test that Python packages can be imported."""
        # Note: {packages} is interpolated here; {{...}} escapes braces so the
        # generated script keeps its own f-strings intact.
        test_script = f"""
import sys

failed_imports = []
for package in {packages}:
    try:
        __import__(package)
        print(f"✓ {{package}}")
    except ImportError as e:
        failed_imports.append((package, str(e)))
        print(f"✗ {{package}}: {{e}}")

if failed_imports:
    sys.exit(1)
"""
        result = self._run_test_script(test_script)
        self.test_results.append({
            "test": "python_imports",
            "passed": result.returncode == 0,
            "output": result.stdout
        })
        return result.returncode == 0

    def test_gpu_availability(self):
        """Test GPU availability and CUDA setup."""
        test_script = """
import torch
import sys

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Device count: {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"Device {i}: {props.name} ({props.total_memory / 1024**3:.1f}GB)")
else:
    print("CUDA not available")
    sys.exit(1)
"""
        result = self._run_test_script(test_script)
        self.test_results.append({
            "test": "gpu_availability",
            "passed": result.returncode == 0,
            "output": result.stdout
        })
        return result.returncode == 0

    def test_model_loading(self, model_name: str):
        """Test that a specific model can be loaded."""
        test_script = f"""
from transformers import AutoTokenizer, AutoModel
import sys

try:
    tokenizer = AutoTokenizer.from_pretrained("{model_name}")
    model = AutoModel.from_pretrained("{model_name}")
    print(f"✓ Successfully loaded {model_name}")
    print(f"Model parameters: {{sum(p.numel() for p in model.parameters()):,}}")
except Exception as e:
    print(f"✗ Failed to load {model_name}: {{e}}")
    sys.exit(1)
"""
        result = self._run_test_script(test_script)
        self.test_results.append({
            "test": f"model_loading_{model_name}",
            "passed": result.returncode == 0,
            "output": result.stdout
}) return result.returncode == 0 def test_security(self): """Test security configurations.""" test_script = """ import os import pwd import sys # Check user user = pwd.getpwuid(os.getuid()) print(f"Running as user: {user.pw_name} (UID: {user.pw_uid})") if user.pw_uid == 0: print("✗ Running as root - security risk!") sys.exit(1) else: print("✓ Running as non-root user") # Check write permissions write_paths = ["/", "/etc", "/usr"] for path in write_paths: if os.access(path, os.W_OK): print(f"✗ Write access to {path} - security risk!") sys.exit(1) else: print(f"✓ No write access to {path}") """ result = self._run_test_script(test_script) self.test_results.append({ "test": "security", "passed": result.returncode == 0, "output": result.stdout }) return result.returncode == 0 def _run_test_script(self, script: str): """Run a test script in a container.""" with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f: f.write(script) script_path = f.name try: # This would run the script in the container # For actual implementation, you'd use docker run or similar result = subprocess.run([ "python3", script_path ], capture_output=True, text=True, timeout=60) return result finally: os.unlink(script_path) def run_all_tests(self): """Run all tests and return summary.""" tests_passed = 0 total_tests = len(self.test_results) for result in self.test_results: if result["passed"]: tests_passed += 1 return { "total_tests": total_tests, "tests_passed": tests_passed, "success_rate": tests_passed / total_tests if total_tests > 0 else 0, "results": self.test_results } # Usage image = Image("myuser", "test-image", "1.0") tester = ImageTester(image) tester.test_python_imports(["torch", "transformers", "numpy"]) tester.test_gpu_availability() tester.test_model_loading("bert-base-uncased") tester.test_security() summary = tester.run_all_tests() print(f"Tests passed: {summary['tests_passed']}/{summary['total_tests']}") ``` ## Troubleshooting and Debugging ### Common Build 
Issues Debug common image building problems: ```python class ImageDebugger: """Debug common image building issues.""" @staticmethod def diagnose_build_failure(build_log: str): """Analyze build log for common issues.""" issues = [] # Check for common problems if "E: Package" in build_log and "has no installation candidate" in build_log: issues.append({ "issue": "Package not found", "solution": "Update package lists with 'apt update' before installing packages" }) if "Permission denied" in build_log: issues.append({ "issue": "Permission denied", "solution": "Ensure user has proper permissions or run as root for system operations" }) if "No space left on device" in build_log: issues.append({ "issue": "Disk space", "solution": "Clean up unused files and caches, or increase disk space" }) if "CUDA_ERROR_OUT_OF_MEMORY" in build_log: issues.append({ "issue": "GPU memory insufficient", "solution": "Reduce batch size or use a GPU with more memory" }) if "ModuleNotFoundError" in build_log: issues.append({ "issue": "Python module not found", "solution": "Install missing dependencies or check PYTHONPATH" }) return issues @staticmethod def suggest_optimizations(image_size_mb: int, build_time_seconds: int): """Suggest optimizations based on image metrics.""" suggestions = [] if image_size_mb > 5000: # > 5GB suggestions.append("Image is large - consider multi-stage builds or smaller base images") if build_time_seconds > 600: # > 10 minutes suggestions.append("Build is slow - consider using pre-built base images or build caching") suggestions.extend([ "Combine RUN commands to reduce layers", "Clean up package caches and temporary files", "Use .dockerignore to exclude unnecessary files", "Order commands from least to most likely to change" ]) return suggestions # Usage debugger = ImageDebugger() issues = debugger.diagnose_build_failure(build_log_content) suggestions = debugger.suggest_optimizations(8000, 800) ``` ### Build Optimization Optimize build performance: ```python def 
create_optimized_build_image(): """Create image with build optimizations.""" return ( Image(username="myuser", name="optimized-build", tag="1.0") .from_base("nvidia/cuda:11.8-devel-ubuntu22.04") # Layer optimization - combine related commands .run_command( "apt update && " "apt install -y python3 python3-pip git && " "rm -rf /var/lib/apt/lists/*" # Clean up in same layer ) # Use build cache effectively .copy_file("requirements.txt", "/tmp/requirements.txt") # Copy requirements first .run_command("pip3 install -r /tmp/requirements.txt") # Install deps .copy_file(".", "/app") # Copy code last # Build-time variables for optimization .with_arg("MAKEFLAGS", "-j$(nproc)") # Parallel compilation .with_arg("PIP_NO_CACHE_DIR", "1") # Don't cache pip downloads # Multi-stage friendly structure .with_label("stage", "build") .with_workdir("/app") ) ``` ## Best Practices Summary ### Image Building Checklist ```python class ImageBuildingChecklist: """Comprehensive checklist for image building best practices.""" def __init__(self): self.checks = { "security": [ "Use non-root user", "Remove setuid/setgid binaries", "Don't embed secrets", "Use minimal base images", "Keep system packages updated" ], "performance": [ "Use appropriate base image", "Minimize layers", "Leverage build cache", "Clean up in same layer", "Use multi-stage builds" ], "maintainability": [ "Pin package versions", "Use semantic versioning", "Add descriptive labels", "Document custom configurations", "Include health checks" ], "size_optimization": [ "Remove package caches", "Use slim base images", "Avoid unnecessary dependencies", "Compress layers where possible", "Use .dockerignore" ] } def validate_image(self, image: Image) -> dict: """Validate image against best practices.""" # This would inspect the image and check against the checklist # For demo purposes, returning a structure return { "security_score": 85, "performance_score": 90, "maintainability_score": 80, "size_score": 75, "recommendations": [ "Consider 
using non-root user", "Add health check", "Clean up package caches" ] } # Usage checklist = ImageBuildingChecklist() image = Image("myuser", "my-app", "1.0") validation = checklist.validate_image(image) scores = [validation[k] for k in ("security_score", "performance_score", "maintainability_score", "size_score")] print(f"Overall score: {sum(scores) / 4}") ``` ## Next Steps - **Advanced Patterns**: Explore multi-stage builds and image optimization - **CI/CD Integration**: Automate image building and testing - **Registry Management**: Manage image repositories and distributions - **Security Scanning**: Implement vulnerability scanning in build pipeline For more advanced topics, see: - [Custom Chutes Guide](custom-chutes) - [Best Practices](best-practices) - [Security Guide](security) --- ## SOURCE: https://chutes.ai/docs/guides/custom-templates # Custom Templates Guide This guide shows how to create reusable templates for common AI workflows, making it easy to deploy similar applications with different configurations. ## Overview Custom templates in Chutes allow you to: - **Standardize Deployments**: Create consistent deployment patterns - **Reduce Code Duplication**: Reuse common configurations - **Simplify Complex Setups**: Abstract away complexity for end users - **Enable Team Collaboration**: Share best practices across teams ## Template Structure ### Basic Template Function A template is a Python function that returns a configured Chute: ```python from chutes.image import Image from chutes.chute import Chute, NodeSelector from typing import Optional, Dict, Any def build_text_classification_template( username: str, model_name: str, num_labels: int, node_selector: Optional[NodeSelector] = None, **kwargs ) -> Chute: """ Template for text classification models Args: username: Chutes username model_name: HuggingFace model name num_labels: Number of classification labels node_selector: Hardware requirements **kwargs: Additional chute configuration Returns: Configured Chute instance """ # Default node selector if node_selector is None: node_selector = 
NodeSelector( gpu_count=1, min_vram_gb_per_gpu=8) # Build custom image image = ( Image( username=username, name="text-classification", tag="latest", python_version="3.11" ) .pip_install([ "torch==2.1.0", "transformers==4.35.0", "datasets==2.14.0", "scikit-learn==1.3.0" ]) .copy_files("./templates/text_classification", "/app") ) # Create chute chute = Chute( username=username, name=f"text-classifier-{model_name.split('/')[-1]}", image=image, entry_file="classifier.py", entry_point="run", node_selector=node_selector, environment={ "MODEL_NAME": model_name, "NUM_LABELS": str(num_labels) }, timeout_seconds=300, concurrency=8, **kwargs ) return chute # Usage classifier_chute = build_text_classification_template( username="myuser", model_name="bert-base-uncased", num_labels=3 ) ``` ## Advanced Template Examples ### Computer Vision Template ```python def build_image_classification_template( username: str, model_name: str, image_size: int = 224, batch_size: int = 16, use_gpu: bool = True, **kwargs ) -> Chute: """Template for image classification models""" # Configure hardware based on requirements if use_gpu: node_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=12) else: node_selector = NodeSelector( gpu_count=0) # Build image with computer vision dependencies image = ( Image( username=username, name="image-classification", tag=f"v{model_name.replace('/', '-')}", python_version="3.11" ) .pip_install([ "torch==2.1.0", "torchvision==0.16.0", "timm==0.9.7", "pillow==10.0.1", "opencv-python==4.8.1.78" ]) .copy_files("./templates/image_classification", "/app") ) chute = Chute( username=username, name=f"image-classifier-{model_name.split('/')[-1]}", image=image, entry_file="image_classifier.py", entry_point="run", node_selector=node_selector, environment={ "MODEL_NAME": model_name, "IMAGE_SIZE": str(image_size), "BATCH_SIZE": str(batch_size) }, timeout_seconds=600, concurrency=4, **kwargs ) return chute # Example implementation file: 
templates/image_classification/image_classifier.py """ import os import torch import timm from PIL import Image import torchvision.transforms as transforms from typing import List, Dict, Any import base64 import io class ImageClassifier: def __init__(self): self.model_name = os.environ.get("MODEL_NAME", "resnet50") self.image_size = int(os.environ.get("IMAGE_SIZE", "224")) self.batch_size = int(os.environ.get("BATCH_SIZE", "16")) # Load model self.model = timm.create_model(self.model_name, pretrained=True) self.model.eval() # Define transforms self.transform = transforms.Compose([ transforms.Resize((self.image_size, self.image_size)), transforms.ToTensor(), transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) ]) def preprocess_image(self, image_b64: str) -> torch.Tensor: # Decode base64 image image_bytes = base64.b64decode(image_b64) image = Image.open(io.BytesIO(image_bytes)).convert('RGB') # Apply transforms tensor = self.transform(image) return tensor.unsqueeze(0) # Add batch dimension def predict(self, images: List[str]) -> List[Dict[str, Any]]: results = [] for i in range(0, len(images), self.batch_size): batch = images[i:i + self.batch_size] # Preprocess batch tensors = [self.preprocess_image(img) for img in batch] batch_tensor = torch.cat(tensors, dim=0) # Inference with torch.no_grad(): outputs = self.model(batch_tensor) probabilities = torch.nn.functional.softmax(outputs, dim=1) # Process results for j, probs in enumerate(probabilities): top5_probs, top5_indices = torch.topk(probs, 5) results.append({ "predictions": [ { "class_id": int(idx), "probability": float(prob) } for idx, prob in zip(top5_indices, top5_probs) ] }) return results # Global classifier instance classifier = ImageClassifier() async def run(inputs: Dict[str, Any]) -> Dict[str, Any]: images = inputs.get("images", []) if not images: return {"error": "No images provided"} results = classifier.predict(images) return {"results": results} """ ``` ### LLM Chat Template 
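The builder below sets `entry_file="chat_model.py"`, but unlike the vision template the guide does not show that file. Here is a minimal hedged sketch of its expected shape — the environment-variable names come from the template's `environment` dict, while the class name, default model, and stubbed `generate` call are illustrative assumptions (a real file would load the model with `transformers`):

```python
import os
from typing import Any, Dict

class ChatModel:
    """Skeleton for templates/llm_chat/chat_model.py (illustrative, not the official file)."""

    def __init__(self):
        # Variable names match the `environment` dict the template sets.
        self.model_name = os.environ.get("MODEL_NAME", "gpt2")  # placeholder default
        self.max_length = int(os.environ.get("MAX_LENGTH", "2048"))
        self.temperature = float(os.environ.get("TEMPERATURE", "0.7"))
        self.use_quantization = os.environ.get("USE_QUANTIZATION", "false") == "true"
        # A real implementation would load tokenizer/model here
        # (e.g. transformers.AutoModelForCausalLM); omitted to keep the sketch runnable.

    def generate(self, prompt: str) -> str:
        # Placeholder for the actual generation call.
        return f"[{self.model_name}] echo: {prompt[: self.max_length]}"

model = ChatModel()

async def run(inputs: Dict[str, Any]) -> Dict[str, Any]:
    prompt = inputs.get("prompt", "")
    if not prompt:
        return {"error": "No prompt provided"}
    return {"response": model.generate(prompt)}
```

The `run(inputs)` entry point mirrors the `entry_point="run"` convention used by the other templates in this guide.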
```python def build_llm_chat_template( username: str, model_name: str, max_length: int = 2048, temperature: float = 0.7, use_quantization: bool = False, **kwargs ) -> Chute: """Template for LLM chat applications""" # Determine hardware requirements based on model if "7b" in model_name.lower(): vram_gb = 16 if not use_quantization else 8 elif "13b" in model_name.lower(): vram_gb = 24 if not use_quantization else 12 elif "70b" in model_name.lower(): vram_gb = 80 if not use_quantization else 40 else: vram_gb = 16 # Default node_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=vram_gb) # Build image with LLM dependencies pip_packages = [ "torch==2.1.0", "transformers==4.35.0", "accelerate==0.24.0" ] if use_quantization: pip_packages.append("bitsandbytes==0.41.0") image = ( Image( username=username, name="llm-chat", tag=f"v{model_name.replace('/', '-')}", python_version="3.11" ) .pip_install(pip_packages) .copy_files("./templates/llm_chat", "/app") ) environment = { "MODEL_NAME": model_name, "MAX_LENGTH": str(max_length), "TEMPERATURE": str(temperature), "USE_QUANTIZATION": str(use_quantization).lower() } chute = Chute( username=username, name=f"llm-chat-{model_name.split('/')[-1]}", image=image, entry_file="chat_model.py", entry_point="run", node_selector=node_selector, environment=environment, timeout_seconds=300, concurrency=4, **kwargs ) return chute ``` ### Multi-Model Analysis Template ```python def build_multi_model_analysis_template( username: str, models_config: Dict[str, Dict[str, Any]], enable_caching: bool = True, **kwargs ) -> Chute: """ Template for multi-model analysis pipelines Args: username: Chutes username models_config: Dictionary of model configurations Example: { "sentiment": {"model": "cardiffnlp/twitter-roberta-base-sentiment"}, "ner": {"model": "dbmdz/bert-large-cased-finetuned-conll03-english"}, "classification": {"model": "facebook/bart-large-mnli"} } enable_caching: Whether to enable Redis caching """ # Calculate resource 
requirements based on models total_models = len(models_config) estimated_vram = total_models * 4 # 4GB per model estimate node_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=max(16, estimated_vram) ) # Build comprehensive image pip_packages = [ "torch==2.1.0", "transformers==4.35.0", "datasets==2.14.0", "scikit-learn==1.3.0", "numpy==1.24.3", "asyncio-pool==0.6.0" ] if enable_caching: pip_packages.append("redis==5.0.0") # stdlib pickle already supports protocol 5 on Python 3.8+ image = ( Image( username=username, name="multi-model-analysis", tag="latest", python_version="3.11" ) .pip_install(pip_packages) .copy_files("./templates/multi_model", "/app") ) # Environment configuration import json environment = { "MODELS_CONFIG": json.dumps(models_config), "ENABLE_CACHING": str(enable_caching).lower() } if enable_caching: environment["REDIS_URL"] = "redis://localhost:6379" chute = Chute( username=username, name="multi-model-analyzer", image=image, entry_file="multi_analyzer.py", entry_point="run", node_selector=node_selector, environment=environment, timeout_seconds=600, concurrency=6, **kwargs ) return chute # Usage example multi_model_chute = build_multi_model_analysis_template( username="myuser", models_config={ "sentiment": { "model": "cardiffnlp/twitter-roberta-base-sentiment-latest", "task": "sentiment-analysis" }, "ner": { "model": "dbmdz/bert-large-cased-finetuned-conll03-english", "task": "ner" }, "classification": { "model": "facebook/bart-large-mnli", "task": "zero-shot-classification" } }, enable_caching=True ) ``` ## Template Best Practices ### 1. 
Parameterization Make templates flexible with good defaults: ```python from typing import Dict, List, Optional def build_flexible_template( username: str, model_name: str, # Required parameters task_type: str, # Optional parameters with sensible defaults python_version: str = "3.11", timeout_seconds: int = 300, concurrency: int = 8, enable_monitoring: bool = True, enable_caching: bool = True, auto_scale: bool = False, # Hardware configuration gpu_count: int = 1, min_vram_gb: int = 8, # Advanced configuration environment_vars: Optional[Dict[str, str]] = None, custom_pip_packages: Optional[List[str]] = None, **kwargs ) -> Chute: """Highly flexible template with many configuration options""" # Merge environment variables base_env = { "MODEL_NAME": model_name, "TASK_TYPE": task_type, "ENABLE_MONITORING": str(enable_monitoring).lower(), "ENABLE_CACHING": str(enable_caching).lower() } if environment_vars: base_env.update(environment_vars) # Build pip packages list base_packages = [ "torch==2.1.0", "transformers==4.35.0" ] if enable_monitoring: base_packages.append("prometheus-client==0.18.0") if enable_caching: base_packages.append("redis==5.0.0") if custom_pip_packages: base_packages.extend(custom_pip_packages) # Configure node selector node_selector = NodeSelector( gpu_count=gpu_count, min_vram_gb_per_gpu=min_vram_gb) # Build image image = ( Image( username=username, name=f"{task_type}-model", tag=model_name.replace("/", "-"), python_version=python_version ) .pip_install(base_packages) .copy_files(f"./templates/{task_type}", "/app") ) # Create chute chute = Chute( username=username, name=f"{task_type}-{model_name.split('/')[-1]}", image=image, entry_file="app.py", entry_point="run", node_selector=node_selector, environment=base_env, timeout_seconds=timeout_seconds, concurrency=concurrency, auto_scale=auto_scale, **kwargs ) return chute ``` ### 2. 
Template Validation Add validation to prevent common errors: ```python def validate_template_inputs( model_name: str, task_type: str, gpu_count: int, min_vram_gb: int ) -> None: """Validate template inputs""" # Validate model name format if "/" not in model_name: raise ValueError("model_name should be in format 'organization/model'") # Validate task type valid_tasks = ["classification", "ner", "generation", "embedding"] if task_type not in valid_tasks: raise ValueError(f"task_type must be one of {valid_tasks}") # Validate hardware requirements if gpu_count < 0 or gpu_count > 8: raise ValueError("gpu_count must be between 0 and 8") if min_vram_gb < 4 or min_vram_gb > 80: raise ValueError("min_vram_gb must be between 4 and 80") # Model-specific validation if "70b" in model_name.lower() and min_vram_gb < 40: raise ValueError("70B models require at least 40GB VRAM") def build_validated_template(username: str, model_name: str, **kwargs) -> Chute: """Template with input validation""" # Extract and validate key parameters task_type = kwargs.get("task_type", "classification") gpu_count = kwargs.get("gpu_count", 1) min_vram_gb = kwargs.get("min_vram_gb", 8) validate_template_inputs(model_name, task_type, gpu_count, min_vram_gb) # Continue with template creation... return build_flexible_template(username, model_name, task_type, **kwargs) ``` ### 3. Template Documentation Document templates thoroughly: ```python def build_documented_template( username: str, model_name: str, **kwargs ) -> Chute: """ Production-ready template for ML model deployment This template provides a robust foundation for deploying machine learning models with monitoring, caching, and auto-scaling capabilities. 
Args: username (str): Your Chutes username model_name (str): HuggingFace model identifier (e.g., 'bert-base-uncased') Keyword Args: task_type (str): Type of ML task ('classification', 'ner', 'generation') Default: 'classification' gpu_count (int): Number of GPUs required (0-8) Default: 1 min_vram_gb (int): Minimum VRAM per GPU in GB (4-80) Default: 8 enable_monitoring (bool): Enable Prometheus metrics Default: True enable_caching (bool): Enable Redis caching Default: True auto_scale (bool): Enable auto-scaling Default: False Returns: Chute: Configured chute instance ready for deployment Example: >>> chute = build_documented_template( ... username="myuser", ... model_name="bert-base-uncased", ... task_type="classification", ... enable_monitoring=True, ... auto_scale=True ... ) >>> result = chute.deploy() Raises: ValueError: If invalid parameters are provided Note: This template automatically configures hardware requirements based on the model size. For 70B+ models, consider using multiple GPUs. """ # Template implementation... 
pass ``` ## Creating Template Packages ### Organizing Templates Structure templates as reusable packages: ``` my_chutes_templates/ ├── __init__.py ├── base.py ├── text/ │ ├── __init__.py │ ├── classification.py │ ├── generation.py │ └── embedding.py ├── vision/ │ ├── __init__.py │ ├── classification.py │ ├── detection.py │ └── segmentation.py ├── audio/ │ ├── __init__.py │ ├── transcription.py │ └── generation.py └── templates/ ├── text_classification/ │ ├── app.py │ └── requirements.txt ├── image_classification/ │ ├── app.py │ └── requirements.txt └── audio_transcription/ ├── app.py └── requirements.txt ``` ### Package Implementation ```python # my_chutes_templates/__init__.py from .text.classification import build_text_classification_template from .text.generation import build_text_generation_template from .vision.classification import build_image_classification_template __all__ = [ "build_text_classification_template", "build_text_generation_template", "build_image_classification_template" ] __version__ = "1.0.0" # my_chutes_templates/text/classification.py from ..base import BaseTemplate class TextClassificationTemplate(BaseTemplate): """Template for text classification models""" def __init__(self): super().__init__( template_name="text_classification", required_params=["model_name", "num_labels"], default_packages=[ "torch==2.1.0", "transformers==4.35.0", "scikit-learn==1.3.0" ] ) def build(self, username: str, **kwargs) -> Chute: return self._build_template(username, **kwargs) def build_text_classification_template(username: str, **kwargs) -> Chute: """Convenience function for building text classification template""" template = TextClassificationTemplate() return template.build(username, **kwargs) ``` ## Template Testing ### Unit Tests for Templates ```python import unittest from unittest.mock import patch, MagicMock from my_chutes_templates import build_text_classification_template class TestTextClassificationTemplate(unittest.TestCase): def 
test_template_creation(self): """Test basic template creation""" chute = build_text_classification_template( username="testuser", model_name="bert-base-uncased", num_labels=3 ) self.assertEqual(chute.username, "testuser") self.assertIn("bert-base-uncased", chute.name) self.assertEqual(chute.environment["NUM_LABELS"], "3") def test_invalid_parameters(self): """Test validation of invalid parameters""" with self.assertRaises(ValueError): build_text_classification_template( username="testuser", model_name="invalid-model", # Invalid format num_labels=3 ) @patch('chutes.chute.Chute.deploy') def test_template_deployment(self, mock_deploy): """Test template deployment""" mock_deploy.return_value = {"status": "success"} chute = build_text_classification_template( username="testuser", model_name="bert-base-uncased", num_labels=3 ) result = chute.deploy() self.assertEqual(result["status"], "success") mock_deploy.assert_called_once() if __name__ == "__main__": unittest.main() ``` ## Next Steps - **[Best Practices](best-practices)** - General deployment best practices - **[Templates Guide](templates)** - Using existing templates - **[Performance Optimization](performance)** - Optimize your custom templates For advanced template development, see the [Template Development Guide](../advanced/template-development). --- ## SOURCE: https://chutes.ai/docs/guides/error-handling # Error Handling and Resilience This guide covers comprehensive error handling strategies for Chutes applications, ensuring robust, production-ready AI services that gracefully handle failures and provide meaningful feedback. 
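As a concrete starting point for the patterns this guide covers, here is a minimal, self-contained sketch of the kind of structured, user-facing error payload a resilient service returns. The field names mirror the `to_dict()` shape used later in this guide, but treat the helper itself as illustrative, not part of the Chutes SDK:

```python
from datetime import datetime, timezone
from typing import Any, Dict

def error_response(message: str, error_type: str, is_retryable: bool = False) -> Dict[str, Any]:
    """Build a user-facing error payload (illustrative field names)."""
    return {
        "error": message,
        "error_type": error_type,
        "is_retryable": is_retryable,  # lets clients decide whether to retry
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

payload = error_response(
    "Model inference timed out after 30 seconds",
    "inference_timeout",
    is_retryable=True,
)
```

Keeping every error in one machine-readable shape is what makes the retry, circuit-breaker, and middleware layers below composable.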
## Overview Effective error handling in Chutes includes: - **Graceful Degradation**: Handle failures without complete system breakdown - **User-Friendly Messages**: Provide clear, actionable error information - **Logging and Monitoring**: Track errors for debugging and improvement - **Retry Strategies**: Automatically recover from transient failures - **Circuit Breakers**: Prevent cascading failures - **Fallback Mechanisms**: Provide alternative responses when primary methods fail ## Error Types and Classification ### AI Model Errors ```python from enum import Enum from typing import Optional, Dict, Any import logging from datetime import datetime class AIErrorType(Enum): """Classification of AI-specific errors.""" MODEL_LOADING_FAILED = "model_loading_failed" INFERENCE_TIMEOUT = "inference_timeout" OUT_OF_MEMORY = "out_of_memory" INVALID_INPUT = "invalid_input" MODEL_OVERLOADED = "model_overloaded" GENERATION_FAILED = "generation_failed" CONTEXT_LENGTH_EXCEEDED = "context_length_exceeded" class ModelError(Exception): """Base exception for model-related errors.""" def __init__( self, message: str, error_type: AIErrorType, details: Optional[Dict[str, Any]] = None, is_retryable: bool = False ): super().__init__(message) self.message = message self.error_type = error_type self.details = details or {} self.is_retryable = is_retryable self.timestamp = datetime.now() def to_dict(self) -> Dict[str, Any]: """Convert error to dictionary for API responses.""" return { "error": self.message, "error_type": self.error_type.value, "details": self.details, "is_retryable": self.is_retryable, "timestamp": self.timestamp.isoformat() } class OutOfMemoryError(ModelError): """GPU/CPU memory exhaustion error.""" def __init__(self, memory_used: Optional[int] = None, memory_available: Optional[int] = None): details = {} if memory_used is not None: details["memory_used_mb"] = memory_used if memory_available is not None: details["memory_available_mb"] = memory_available super().__init__( 
"Model inference failed due to insufficient memory", AIErrorType.OUT_OF_MEMORY, details=details, is_retryable=False ) class ContextLengthError(ModelError): """Input context too long for model.""" def __init__(self, input_length: int, max_length: int): super().__init__( f"Input length ({input_length}) exceeds model's maximum context length ({max_length})", AIErrorType.CONTEXT_LENGTH_EXCEEDED, details={ "input_length": input_length, "max_length": max_length, "suggestion": "Reduce input length or use a model with larger context window" }, is_retryable=False ) class InferenceTimeoutError(ModelError): """Model inference timeout.""" def __init__(self, timeout_seconds: float): super().__init__( f"Model inference timed out after {timeout_seconds} seconds", AIErrorType.INFERENCE_TIMEOUT, details={"timeout_seconds": timeout_seconds}, is_retryable=True ) ``` ### Input Validation Errors ```python from pydantic import ValidationError from fastapi import HTTPException class ValidationErrorHandler: """Handle and format validation errors.""" @staticmethod def format_pydantic_error(validation_error: ValidationError) -> Dict[str, Any]: """Format Pydantic validation error for user-friendly display.""" formatted_errors = {} for error in validation_error.errors(): field_path = " -> ".join(str(loc) for loc in error['loc']) error_type = error['type'] # Create user-friendly error messages if error_type == 'value_error.missing': message = "This field is required" elif error_type == 'type_error.str': message = "This field must be text" elif error_type == 'type_error.integer': message = "This field must be a whole number" elif error_type == 'type_error.float': message = "This field must be a number" elif error_type == 'value_error.number.not_ge': limit = error['ctx']['limit_value'] message = f"This field must be at least {limit}" elif error_type == 'value_error.number.not_le': limit = error['ctx']['limit_value'] message = f"This field must be at most {limit}" elif error_type == 
'value_error.str.regex': message = "This field has an invalid format" elif error_type == 'value_error.list.min_items': min_items = error['ctx']['limit_value'] message = f"This list must have at least {min_items} items" elif error_type == 'value_error.list.max_items': max_items = error['ctx']['limit_value'] message = f"This list can have at most {max_items} items" else: message = error['msg'] if field_path not in formatted_errors: formatted_errors[field_path] = [] formatted_errors[field_path].append(message) return { "error": "Validation failed", "error_type": "validation_error", "field_errors": formatted_errors, "is_retryable": False } @staticmethod def create_http_exception(validation_error: ValidationError) -> HTTPException: """Create HTTP exception from validation error.""" formatted_error = ValidationErrorHandler.format_pydantic_error(validation_error) return HTTPException( status_code=422, detail=formatted_error ) class InputSanitizer: """Sanitize and validate inputs with error handling.""" @staticmethod def sanitize_text(text: str, max_length: int = 10000) -> str: """Sanitize text input with error handling.""" if not isinstance(text, str): raise ValueError("Input must be text") # Remove null bytes and control characters sanitized = ''.join(char for char in text if ord(char) >= 32 or char in '\n\r\t') # Check length if len(sanitized) > max_length: raise ValueError(f"Input text too long (max {max_length} characters)") if len(sanitized.strip()) == 0: raise ValueError("Input text cannot be empty") return sanitized.strip() @staticmethod def validate_file_upload(file_data: bytes, allowed_types: list, max_size_mb: int = 10): """Validate file upload with comprehensive error checking.""" # Check size if len(file_data) > max_size_mb * 1024 * 1024: raise ValueError(f"File too large (max {max_size_mb}MB)") # Check if empty if len(file_data) == 0: raise ValueError("File is empty") # Basic file type detection file_signatures = { b'\xff\xd8\xff': 'image/jpeg', 
b'\x89PNG\r\n\x1a\n': 'image/png', b'GIF87a': 'image/gif', b'GIF89a': 'image/gif', b'%PDF': 'application/pdf' } detected_type = None for signature, mime_type in file_signatures.items(): if file_data.startswith(signature): detected_type = mime_type break if detected_type not in allowed_types: raise ValueError(f"File type not allowed. Allowed types: {', '.join(allowed_types)}") return detected_type ``` ## Error Handling Decorators ### Retry Mechanisms ```python import asyncio import functools from typing import Callable, Type, Tuple, Union import random def retry_with_backoff( max_retries: int = 3, base_delay: float = 1.0, max_delay: float = 60.0, backoff_factor: float = 2.0, jitter: bool = True, retryable_exceptions: Tuple[Type[Exception], ...] = (Exception,) ): """Decorator for retrying functions with exponential backoff.""" def decorator(func: Callable): @functools.wraps(func) async def async_wrapper(*args, **kwargs): last_exception = None for attempt in range(max_retries + 1): try: if asyncio.iscoroutinefunction(func): return await func(*args, **kwargs) else: return func(*args, **kwargs) except retryable_exceptions as e: last_exception = e # Don't retry on last attempt if attempt == max_retries: break # Calculate delay with exponential backoff delay = min(base_delay * (backoff_factor ** attempt), max_delay) # Add jitter to prevent thundering herd if jitter: delay *= (0.5 + random.random() * 0.5) logging.warning( f"Attempt {attempt + 1} failed for {func.__name__}: {str(e)}. " f"Retrying in {delay:.2f} seconds..." 
) await asyncio.sleep(delay) except Exception as e: # Non-retryable exception logging.error(f"Non-retryable error in {func.__name__}: {str(e)}") raise # All retries exhausted raise last_exception @functools.wraps(func) def sync_wrapper(*args, **kwargs): # Handle synchronous functions return asyncio.run(async_wrapper(*args, **kwargs)) if asyncio.iscoroutinefunction(func): return async_wrapper else: return sync_wrapper return decorator def circuit_breaker( failure_threshold: int = 5, timeout_duration: float = 60.0, expected_exception: Type[Exception] = Exception ): """Circuit breaker decorator to prevent cascading failures.""" def decorator(func: Callable): # Shared state across all calls state = { 'failures': 0, 'last_failure_time': None, 'state': 'CLOSED' # CLOSED, OPEN, HALF_OPEN } @functools.wraps(func) async def wrapper(*args, **kwargs): now = datetime.now().timestamp() # Check if circuit should transition to HALF_OPEN if (state['state'] == 'OPEN' and state['last_failure_time'] and now - state['last_failure_time'] > timeout_duration): state['state'] = 'HALF_OPEN' logging.info(f"Circuit breaker for {func.__name__} is now HALF_OPEN") # Reject if circuit is OPEN if state['state'] == 'OPEN': raise ModelError( f"Circuit breaker is OPEN for {func.__name__}", AIErrorType.MODEL_OVERLOADED, details={'circuit_state': 'OPEN'}, is_retryable=True ) try: result = await func(*args, **kwargs) if asyncio.iscoroutinefunction(func) else func(*args, **kwargs) # Success - reset circuit if it was HALF_OPEN if state['state'] == 'HALF_OPEN': state['state'] = 'CLOSED' state['failures'] = 0 logging.info(f"Circuit breaker for {func.__name__} is now CLOSED") return result except expected_exception as e: state['failures'] += 1 state['last_failure_time'] = now # Open circuit if threshold exceeded if state['failures'] >= failure_threshold: state['state'] = 'OPEN' logging.error(f"Circuit breaker for {func.__name__} is now OPEN") raise return wrapper return decorator # Usage examples 
@retry_with_backoff( max_retries=3, base_delay=1.0, retryable_exceptions=(InferenceTimeoutError, ModelError) ) @circuit_breaker( failure_threshold=5, timeout_duration=30.0, expected_exception=ModelError ) async def robust_model_inference(self, input_data: str) -> str: """Model inference with retry and circuit breaker protection.""" try: result = await self.model.generate(input_data) return result except torch.cuda.OutOfMemoryError: raise OutOfMemoryError() except TimeoutError: raise InferenceTimeoutError(30.0) ``` ### Error Context Management ```python import contextlib from typing import Optional, Dict, Any, List class ErrorContext: """Manage error context and correlation across operations.""" def __init__(self): self.context_stack: List[Dict[str, Any]] = [] self.correlation_id: Optional[str] = None def push_context(self, **kwargs): """Add context information.""" self.context_stack.append(kwargs) def pop_context(self): """Remove last context.""" if self.context_stack: self.context_stack.pop() def get_full_context(self) -> Dict[str, Any]: """Get complete context information.""" context = {} for ctx in self.context_stack: context.update(ctx) if self.correlation_id: context['correlation_id'] = self.correlation_id return context @contextlib.contextmanager def operation_context(self, **kwargs): """Context manager for operation-specific error context.""" self.push_context(**kwargs) try: yield self finally: self.pop_context() class ContextualError(Exception): """Exception that includes context information.""" def __init__(self, message: str, context: Optional[ErrorContext] = None): super().__init__(message) self.message = message self.context = context.get_full_context() if context else {} self.timestamp = datetime.now() def to_dict(self) -> Dict[str, Any]: """Convert to dictionary for logging/API response.""" return { "error": self.message, "context": self.context, "timestamp": self.timestamp.isoformat() } def with_error_context(error_context: ErrorContext): 
"""Decorator to add error context to functions.""" def decorator(func: Callable): @functools.wraps(func) async def wrapper(*args, **kwargs): try: if asyncio.iscoroutinefunction(func): return await func(*args, **kwargs) else: return func(*args, **kwargs) except Exception as e: # Wrap exception with context if not isinstance(e, ContextualError): raise ContextualError(str(e), error_context) from e raise return wrapper return decorator # Usage error_context = ErrorContext() error_context.correlation_id = "req-12345" @with_error_context(error_context) async def process_with_context(self, data: str): """Process data with error context tracking.""" with error_context.operation_context(operation="preprocessing", input_size=len(data)): # Preprocessing step cleaned_data = self.preprocess(data) with error_context.operation_context(operation="inference", model="gpt-3.5-turbo"): # Inference step result = await self.model.generate(cleaned_data) return result ``` ## Centralized Error Handling ### Error Handler Class ```python import traceback import sys from typing import Union, Optional class CentralizedErrorHandler: """Centralized error handling for Chutes applications.""" def __init__(self, logger: Optional[logging.Logger] = None): self.logger = logger or logging.getLogger(__name__) self.error_counts = {} self.error_history = [] async def handle_error( self, error: Exception, context: Optional[Dict[str, Any]] = None, user_message: Optional[str] = None ) -> Dict[str, Any]: """Handle error and return appropriate response.""" context = context or {} error_type = type(error).__name__ # Track error statistics self.error_counts[error_type] = self.error_counts.get(error_type, 0) + 1 # Create error record error_record = { "error_type": error_type, "message": str(error), "context": context, "timestamp": datetime.now().isoformat(), "traceback": traceback.format_exc() if self.logger.level <= logging.DEBUG else None } # Store in history (limited size) 
self.error_history.append(error_record) if len(self.error_history) > 1000: self.error_history.pop(0) # Log error self.logger.error( f"Error in {context.get('operation', 'unknown')}: {str(error)}", extra={ "error_type": error_type, "context": context, "correlation_id": context.get("correlation_id") } ) # Create user-facing response if isinstance(error, ModelError): response = error.to_dict() elif isinstance(error, ValidationError): response = ValidationErrorHandler.format_pydantic_error(error) elif isinstance(error, ContextualError): response = error.to_dict() else: # Generic error handling response = { "error": user_message or "An unexpected error occurred", "error_type": "internal_error", "is_retryable": False, "timestamp": datetime.now().isoformat() } # Add details in development mode if self.logger.level <= logging.DEBUG: response["details"] = { "original_error": str(error), "error_type": error_type } return response def get_error_statistics(self) -> Dict[str, Any]: """Get error statistics for monitoring.""" recent_errors = [ err for err in self.error_history if (datetime.now() - datetime.fromisoformat(err["timestamp"])).total_seconds() < 3600 ] return { "total_errors": len(self.error_history), "recent_errors_1h": len(recent_errors), "error_types": self.error_counts, "recent_error_types": { err_type: sum(1 for err in recent_errors if err["error_type"] == err_type) for err_type in set(err["error_type"] for err in recent_errors) } } async def handle_critical_error(self, error: Exception, context: Dict[str, Any]): """Handle critical errors that require immediate attention.""" self.logger.critical( f"CRITICAL ERROR: {str(error)}", extra={ "context": context, "traceback": traceback.format_exc() } ) # Could trigger alerts, notifications, etc.
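# A minimal sketch of what _trigger_alert might post (assumptions: aiohttp
# is installed and ALERT_WEBHOOK_URL is a hypothetical environment variable
# pointing at a Slack-compatible webhook):
#   async with aiohttp.ClientSession() as session:
#       await session.post(
#           os.environ["ALERT_WEBHOOK_URL"],
#           json={"text": f"CRITICAL: {error}", "context": context},
#       )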
await self._trigger_alert(error, context) async def _trigger_alert(self, error: Exception, context: Dict[str, Any]): """Trigger alert for critical errors (implement as needed).""" # This could send notifications to Slack, email, PagerDuty, etc. pass # Integrate with Chute @chute.on_startup() async def initialize_error_handler(self): """Initialize centralized error handler.""" self.error_handler = CentralizedErrorHandler(logger=logging.getLogger("chutes.errors")) ``` ### Error Middleware ```python from fastapi import Request, Response from fastapi.responses import JSONResponse import time import uuid class ErrorMiddleware: """Middleware to catch and handle all errors.""" def __init__(self, error_handler: CentralizedErrorHandler): self.error_handler = error_handler async def __call__(self, request: Request, call_next): """Process request with error handling.""" start_time = time.time() correlation_id = request.headers.get("X-Correlation-ID", str(uuid.uuid4())) # Add correlation ID to request state request.state.correlation_id = correlation_id try: response = await call_next(request) # Add correlation ID to response headers response.headers["X-Correlation-ID"] = correlation_id return response except Exception as error: # Create error context context = { "correlation_id": correlation_id, "request_path": str(request.url.path), "request_method": request.method, "processing_time": time.time() - start_time, "user_agent": request.headers.get("User-Agent"), "remote_addr": request.client.host if request.client else None } # Handle error error_response = await self.error_handler.handle_error(error, context) # Determine HTTP status code if isinstance(error, ValidationError): status_code = 422 elif isinstance(error, ModelError): if error.error_type == AIErrorType.OUT_OF_MEMORY: status_code = 507 # Insufficient Storage elif error.error_type == AIErrorType.INFERENCE_TIMEOUT: status_code = 504 # Gateway Timeout elif error.error_type == AIErrorType.INVALID_INPUT: status_code = 400 # Bad Request
else: status_code = 500 # Internal Server Error else: status_code = 500 # Create JSON response response = JSONResponse( content=error_response, status_code=status_code ) response.headers["X-Correlation-ID"] = correlation_id return response # Add middleware to Chute @chute.on_startup() async def add_error_middleware(self): """Add error handling middleware.""" self.app.middleware("http")(ErrorMiddleware(self.error_handler)) ``` ## Model-Specific Error Handling ### LLM Error Handling ```python class LLMErrorHandler: """Handle LLM-specific errors.""" @staticmethod async def safe_generate( model, tokenizer, prompt: str, max_tokens: int = 100, temperature: float = 0.7, timeout: float = 30.0 ) -> Dict[str, Any]: """Generate text with comprehensive error handling.""" try: # Validate input length inputs = tokenizer.encode(prompt, return_tensors="pt") if len(inputs[0]) > model.config.max_position_embeddings: raise ContextLengthError( len(inputs[0]), model.config.max_position_embeddings ) # Check available memory if torch.cuda.is_available(): memory_allocated = torch.cuda.memory_allocated() memory_total = torch.cuda.get_device_properties(0).total_memory if memory_allocated > memory_total * 0.9: raise OutOfMemoryError( memory_used=memory_allocated // (1024**2), memory_available=(memory_total - memory_allocated) // (1024**2) ) # Generate with timeout result = await asyncio.wait_for( LLMErrorHandler._generate_async(model, tokenizer, inputs, max_tokens, temperature), timeout=timeout ) return { "generated_text": result, "input_tokens": len(inputs[0]), "success": True } except asyncio.TimeoutError: raise InferenceTimeoutError(timeout) except torch.cuda.OutOfMemoryError: # Clear cache and retry once with a smaller token budget torch.cuda.empty_cache() try: result = await asyncio.wait_for( LLMErrorHandler._generate_async(model, tokenizer, inputs, max_tokens // 2, temperature), timeout=timeout ) return { "generated_text": result, "input_tokens": len(inputs[0]), "success": True, "warning":
"Reduced max_tokens due to memory constraints" } except Exception: raise OutOfMemoryError() except Exception as e: raise ModelError( f"Text generation failed: {str(e)}", AIErrorType.GENERATION_FAILED, details={"original_error": str(e)}, is_retryable=True ) @staticmethod async def _generate_async(model, tokenizer, inputs, max_tokens, temperature): """Async wrapper for model generation.""" def _generate(): with torch.no_grad(): outputs = model.generate( inputs, max_new_tokens=max_tokens, temperature=temperature, do_sample=True, pad_token_id=model.config.eos_token_id ) return tokenizer.decode(outputs[0], skip_special_tokens=True) # Run in thread pool to avoid blocking loop = asyncio.get_event_loop() return await loop.run_in_executor(None, _generate) # Usage in Chute @chute.cord(public_api_path="/generate", method="POST") async def generate_text_safe(self, prompt: str, max_tokens: int = 100): """Generate text with comprehensive error handling.""" try: result = await LLMErrorHandler.safe_generate( self.model, self.tokenizer, prompt, max_tokens=max_tokens ) return result except ModelError: # Let the middleware handle model errors raise except Exception as e: # Convert unexpected errors to ModelError raise ModelError( f"Unexpected error in text generation: {str(e)}", AIErrorType.GENERATION_FAILED, is_retryable=False ) ``` ### Image Generation Error Handling ```python class ImageGenerationErrorHandler: """Handle image generation specific errors.""" @staticmethod async def safe_generate_image( pipeline, prompt: str, width: int = 512, height: int = 512, num_inference_steps: int = 20, guidance_scale: float = 7.5 ) -> Dict[str, Any]: """Generate image with error handling.""" try: # Validate parameters if width * height > 1024 * 1024: raise ModelError( "Image resolution too high", AIErrorType.INVALID_INPUT, details={ "max_resolution": "1024x1024", "requested": f"{width}x{height}" } ) # Clear cached memory before generation if torch.cuda.is_available(): torch.cuda.empty_cache()
# Generate image image = pipeline( prompt=prompt, width=width, height=height, num_inference_steps=num_inference_steps, guidance_scale=guidance_scale ).images[0] # Convert to base64 import io import base64 img_buffer = io.BytesIO() image.save(img_buffer, format='PNG') img_b64 = base64.b64encode(img_buffer.getvalue()).decode() return { "image": img_b64, "width": width, "height": height, "steps": num_inference_steps, "success": True } except torch.cuda.OutOfMemoryError: torch.cuda.empty_cache() # Try with reduced parameters try: reduced_steps = max(10, num_inference_steps // 2) image = pipeline( prompt=prompt, width=min(512, width), height=min(512, height), num_inference_steps=reduced_steps, guidance_scale=guidance_scale ).images[0] # Convert to base64 img_buffer = io.BytesIO() image.save(img_buffer, format='PNG') img_b64 = base64.b64encode(img_buffer.getvalue()).decode() return { "image": img_b64, "width": min(512, width), "height": min(512, height), "steps": reduced_steps, "success": True, "warning": "Parameters reduced due to memory constraints" } except Exception: raise OutOfMemoryError() except Exception as e: raise ModelError( f"Image generation failed: {str(e)}", AIErrorType.GENERATION_FAILED, details={"prompt": prompt, "parameters": { "width": width, "height": height, "steps": num_inference_steps }}, is_retryable=True ) ``` ## Fallback Strategies ### Model Fallback Chain ```python class ModelFallbackChain: """Chain of fallback models for resilience.""" def __init__(self): self.primary_model = None self.fallback_models = [] self.model_health = {} def add_primary_model(self, model, name: str): """Set primary model.""" self.primary_model = {"model": model, "name": name} self.model_health[name] = {"failures": 0, "last_success": datetime.now()} def add_fallback_model(self, model, name: str, priority: int = 1): """Add fallback model.""" self.fallback_models.append({ "model": model, "name": name, "priority": priority }) self.model_health[name] =
{"failures": 0, "last_success": datetime.now()} # Sort by priority self.fallback_models.sort(key=lambda x: x["priority"]) async def generate_with_fallback(self, prompt: str, **kwargs) -> Dict[str, Any]: """Generate with automatic fallback on failure.""" # Try primary model first if self.primary_model and self._is_model_healthy(self.primary_model["name"]): try: result = await self._try_model(self.primary_model, prompt, **kwargs) self._record_success(self.primary_model["name"]) result["model_used"] = self.primary_model["name"] result["was_fallback"] = False return result except ModelError as e: self._record_failure(self.primary_model["name"]) logging.warning(f"Primary model {self.primary_model['name']} failed: {e}") # Try fallback models for fallback in self.fallback_models: if not self._is_model_healthy(fallback["name"]): continue try: result = await self._try_model(fallback, prompt, **kwargs) self._record_success(fallback["name"]) result["model_used"] = fallback["name"] result["was_fallback"] = True return result except ModelError as e: self._record_failure(fallback["name"]) logging.warning(f"Fallback model {fallback['name']} failed: {e}") continue # All models failed raise ModelError( "All models in fallback chain failed", AIErrorType.MODEL_OVERLOADED, details={"tried_models": [ self.primary_model["name"] if self.primary_model else None ] + [fb["name"] for fb in self.fallback_models]}, is_retryable=True ) async def _try_model(self, model_info: Dict, prompt: str, **kwargs) -> Dict[str, Any]: """Try generating with a specific model.""" model = model_info["model"] # Implement actual model generation here # This is a placeholder - replace with your actual model calls if hasattr(model, 'generate'): result = await model.generate(prompt, **kwargs) else: result = f"Generated by {model_info['name']}: {prompt}" return {"generated_text": result} def _is_model_healthy(self, model_name: str) -> bool: """Check if model is healthy (not in circuit breaker state).""" health = 
self.model_health.get(model_name, {}) # If too many recent failures, consider unhealthy if health.get("failures", 0) > 3: last_success = health.get("last_success", datetime.min) if (datetime.now() - last_success).total_seconds() < 300: # 5 minutes return False return True def _record_success(self, model_name: str): """Record successful model use.""" self.model_health[model_name].update({ "failures": 0, "last_success": datetime.now() }) def _record_failure(self, model_name: str): """Record model failure.""" self.model_health[model_name]["failures"] += 1 # Usage in Chute @chute.on_startup() async def initialize_fallback_chain(self): """Initialize model fallback chain.""" self.fallback_chain = ModelFallbackChain() # Add primary model self.fallback_chain.add_primary_model(self.primary_llm, "gpt-3.5-turbo") # Add fallback models self.fallback_chain.add_fallback_model(self.backup_llm, "gpt-3.5-turbo-backup", priority=1) self.fallback_chain.add_fallback_model(self.simple_llm, "simple-model", priority=2) @chute.cord(public_api_path="/generate_resilient", method="POST") async def generate_with_resilience(self, prompt: str): """Generate text with automatic fallback.""" return await self.fallback_chain.generate_with_fallback(prompt) ``` ### Graceful Degradation ```python class GracefulDegradationHandler: """Handle graceful degradation of service quality.""" def __init__(self): self.degradation_levels = { "full": {"quality": 1.0, "speed": 1.0}, "reduced": {"quality": 0.7, "speed": 1.5}, "minimal": {"quality": 0.4, "speed": 3.0} } self.current_level = "full" self.system_load = 0.0 def update_system_load(self, cpu_percent: float, memory_percent: float, gpu_percent: float): """Update system load metrics.""" self.system_load = max(cpu_percent, memory_percent, gpu_percent) / 100.0 # Automatically adjust degradation level if self.system_load > 0.9: self.current_level = "minimal" elif self.system_load > 0.7: self.current_level = "reduced" else: self.current_level = "full" def
get_adjusted_parameters(self, base_params: Dict[str, Any]) -> Dict[str, Any]: """Adjust parameters based on current degradation level.""" level_config = self.degradation_levels[self.current_level] adjusted_params = base_params.copy() # Adjust quality-related parameters if "num_inference_steps" in adjusted_params: adjusted_params["num_inference_steps"] = int( adjusted_params["num_inference_steps"] * level_config["quality"] ) if "max_tokens" in adjusted_params: adjusted_params["max_tokens"] = int( adjusted_params["max_tokens"] * level_config["quality"] ) # Adjust for speed (reduce batch size, etc.) if "batch_size" in adjusted_params: adjusted_params["batch_size"] = max(1, int( adjusted_params["batch_size"] / level_config["speed"] )) return adjusted_params def get_degradation_warning(self) -> Optional[str]: """Get warning message for current degradation level.""" if self.current_level == "reduced": return "Service is running in reduced quality mode due to high system load" elif self.current_level == "minimal": return "Service is running in minimal quality mode due to very high system load" return None # Usage in endpoint @chute.cord(public_api_path="/adaptive_generate", method="POST") async def adaptive_generate(self, prompt: str, max_tokens: int = 100): """Generate with adaptive quality based on system load.""" # Get system metrics (implement based on your monitoring) cpu_percent = self.get_cpu_usage() memory_percent = self.get_memory_usage() gpu_percent = self.get_gpu_usage() # Update degradation handler self.degradation_handler.update_system_load(cpu_percent, memory_percent, gpu_percent) # Adjust parameters base_params = {"max_tokens": max_tokens} adjusted_params = self.degradation_handler.get_adjusted_parameters(base_params) try: result = await self.generate_text(prompt, **adjusted_params) # Add degradation warning if applicable warning = self.degradation_handler.get_degradation_warning() if warning: result["warning"] = warning result["degradation_level"] = 
self.degradation_handler.current_level return result except ModelError: # If still failing, try with even more conservative parameters if self.degradation_handler.current_level != "minimal": conservative_params = self.degradation_handler.get_adjusted_parameters({ "max_tokens": max_tokens // 2 }) try: result = await self.generate_text(prompt, **conservative_params) result["warning"] = "Used emergency conservative parameters due to system stress" return result except Exception: pass raise ``` ## Monitoring and Alerting ### Error Metrics Collection ```python import time from collections import defaultdict, deque from typing import Dict, List, Tuple class ErrorMetricsCollector: """Collect and analyze error metrics.""" def __init__(self, window_size: int = 300): # 5 minute window self.window_size = window_size self.error_timeline = deque(maxlen=10000) # Recent errors self.error_rates = defaultdict(lambda: deque(maxlen=100)) self.error_patterns = defaultdict(int) def record_error( self, error_type: str, error_message: str, context: Dict[str, Any] = None ): """Record an error occurrence.""" timestamp = time.time() context = context or {} error_record = { "timestamp": timestamp, "error_type": error_type, "message": error_message, "context": context } self.error_timeline.append(error_record) self.error_rates[error_type].append(timestamp) # Track error patterns pattern_key = f"{error_type}:{context.get('operation', 'unknown')}" self.error_patterns[pattern_key] += 1 def get_error_rate(self, error_type: str = None, window_seconds: int = 60) -> float: """Get error rate (errors per minute).""" current_time = time.time() cutoff_time = current_time - window_seconds if error_type: recent_errors = [ t for t in self.error_rates[error_type] if t > cutoff_time ] else: recent_errors = [ err["timestamp"] for err in self.error_timeline if err["timestamp"] > cutoff_time ] return len(recent_errors) * (60 / window_seconds) # Errors per minute def get_top_error_patterns(self, limit: int = 10) -> List[Tuple[str,
int]]: """Get most common error patterns.""" return sorted( self.error_patterns.items(), key=lambda x: x[1], reverse=True )[:limit] def detect_error_spikes(self, threshold_multiplier: float = 3.0) -> List[Dict[str, Any]]: """Detect error rate spikes.""" alerts = [] current_time = time.time() for error_type in self.error_rates: # Compare recent rate to historical average recent_rate = self.get_error_rate(error_type, window_seconds=60) historical_rate = self.get_error_rate(error_type, window_seconds=3600) # 1 hour if historical_rate > 0 and recent_rate > historical_rate * threshold_multiplier: alerts.append({ "type": "error_spike", "error_type": error_type, "recent_rate": recent_rate, "historical_rate": historical_rate, "multiplier": recent_rate / historical_rate, "timestamp": current_time }) return alerts def get_error_summary(self) -> Dict[str, Any]: """Get comprehensive error summary.""" current_time = time.time() one_hour_ago = current_time - 3600 recent_errors = [ err for err in self.error_timeline if err["timestamp"] > one_hour_ago ] error_type_counts = defaultdict(int) for err in recent_errors: error_type_counts[err["error_type"]] += 1 return { "total_errors_1h": len(recent_errors), "error_rate_1h": len(recent_errors) / 60, # per minute "error_types": dict(error_type_counts), "top_patterns": self.get_top_error_patterns(), "spikes": self.detect_error_spikes() } # Integrate with error handler @chute.on_startup() async def initialize_metrics_collector(self): """Initialize error metrics collection.""" self.error_metrics = ErrorMetricsCollector() # Integrate with error handler original_handle_error = self.error_handler.handle_error async def handle_error_with_metrics(error, context=None, user_message=None): # Record metrics self.error_metrics.record_error( error_type=type(error).__name__, error_message=str(error), context=context ) # Call original handler return await original_handle_error(error, context, user_message) self.error_handler.handle_error = 
handle_error_with_metrics @chute.cord(public_api_path="/error_metrics", method="GET") async def get_error_metrics(self): """Get error metrics for monitoring.""" return self.error_metrics.get_error_summary() ``` ### Health Checks and Status ```python class HealthChecker: """Comprehensive health checking for Chutes applications.""" def __init__(self): self.health_checks = {} self.last_check_results = {} def register_check(self, name: str, check_func: Callable, critical: bool = False): """Register a health check.""" self.health_checks[name] = { "func": check_func, "critical": critical, "last_result": None, "last_check": None } async def run_all_checks(self) -> Dict[str, Any]: """Run all registered health checks.""" results = {} overall_status = "healthy" critical_failures = [] for name, check_info in self.health_checks.items(): try: start_time = time.time() result = await check_info["func"]() duration = time.time() - start_time check_result = { "status": "healthy" if result.get("healthy", True) else "unhealthy", "details": result, "duration_ms": duration * 1000, "timestamp": datetime.now().isoformat() } # Update tracking check_info["last_result"] = check_result check_info["last_check"] = time.time() results[name] = check_result # Check if this affects overall status if not result.get("healthy", True): if check_info["critical"]: overall_status = "critical" critical_failures.append(name) elif overall_status == "healthy": overall_status = "degraded" except Exception as e: error_result = { "status": "error", "error": str(e), "timestamp": datetime.now().isoformat() } results[name] = error_result if check_info["critical"]: overall_status = "critical" critical_failures.append(name) elif overall_status == "healthy": overall_status = "degraded" return { "overall_status": overall_status, "checks": results, "critical_failures": critical_failures, "timestamp": datetime.now().isoformat() } async def check_model_health(self) -> Dict[str, Any]: """Check model loading and basic 
inference.""" try: # Test basic model functionality test_result = await self.model.generate("test", max_tokens=1) return { "healthy": True, "model_loaded": True, "inference_working": True } except Exception as e: return { "healthy": False, "model_loaded": hasattr(self, 'model'), "inference_working": False, "error": str(e) } async def check_gpu_health(self) -> Dict[str, Any]: """Check GPU availability and memory.""" try: if not torch.cuda.is_available(): return { "healthy": False, "gpu_available": False, "message": "CUDA not available" } device_count = torch.cuda.device_count() device_info = [] for i in range(device_count): props = torch.cuda.get_device_properties(i) memory_allocated = torch.cuda.memory_allocated(i) memory_total = props.total_memory memory_percent = (memory_allocated / memory_total) * 100 device_info.append({ "device_id": i, "name": props.name, "memory_used_mb": memory_allocated // (1024**2), "memory_total_mb": memory_total // (1024**2), "memory_percent": memory_percent }) # Consider unhealthy if any GPU is over 95% memory gpu_healthy = all(info["memory_percent"] < 95 for info in device_info) return { "healthy": gpu_healthy, "gpu_available": True, "device_count": device_count, "devices": device_info } except Exception as e: return { "healthy": False, "gpu_available": False, "error": str(e) } async def check_disk_space(self) -> Dict[str, Any]: """Check available disk space.""" try: import shutil total, used, free = shutil.disk_usage("/") free_percent = (free / total) * 100 return { "healthy": free_percent > 10, # Unhealthy if less than 10% free "free_space_gb": free // (1024**3), "total_space_gb": total // (1024**3), "free_percent": free_percent } except Exception as e: return { "healthy": False, "error": str(e) } # Initialize health checks @chute.on_startup() async def initialize_health_checks(self): """Initialize health checking system.""" self.health_checker = HealthChecker() # Register health checks self.health_checker.register_check("model", 
self.health_checker.check_model_health, critical=True) self.health_checker.register_check("gpu", self.health_checker.check_gpu_health, critical=True) self.health_checker.register_check("disk", self.health_checker.check_disk_space, critical=False) @chute.cord(public_api_path="/health", method="GET") async def health_check(self): """Health check endpoint.""" return await self.health_checker.run_all_checks() # Detailed status endpoint @chute.cord(public_api_path="/status", method="GET") async def detailed_status(self): """Detailed system status including errors and health.""" health_results = await self.health_checker.run_all_checks() error_summary = self.error_metrics.get_error_summary() return { "health": health_results, "errors": error_summary, "uptime": time.time() - self.startup_time, "version": "1.0.0", # Your app version "timestamp": datetime.now().isoformat() } ``` ## Testing Error Handling ### Error Scenario Testing ```python import pytest from unittest.mock import AsyncMock, Mock, patch import asyncio class ErrorHandlingTests: """Test suite for error handling scenarios.""" @pytest.fixture def error_handler(self): """Create error handler for testing.""" return CentralizedErrorHandler() @pytest.fixture def mock_chute(self): """Create mock chute for testing.""" chute = Mock() chute.error_handler = CentralizedErrorHandler() chute.error_metrics = ErrorMetricsCollector() return chute @pytest.mark.asyncio async def test_out_of_memory_handling(self, mock_chute): """Test OOM error handling.""" # Simulate OOM error oom_error = OutOfMemoryError(memory_used=8000, memory_available=500) result = await mock_chute.error_handler.handle_error(oom_error) assert result["error_type"] == "out_of_memory" assert result["is_retryable"] is False assert "memory_used_mb" in result["details"] @pytest.mark.asyncio async def test_context_length_error(self, mock_chute): """Test context length error handling.""" context_error = ContextLengthError(input_length=5000, max_length=4096) result = await
mock_chute.error_handler.handle_error(context_error) assert result["error_type"] == "context_length_exceeded" assert "suggestion" in result["details"] assert result["details"]["input_length"] == 5000 @pytest.mark.asyncio async def test_retry_mechanism(self, mock_chute): """Test retry with backoff.""" call_count = 0 @retry_with_backoff(max_retries=2, base_delay=0.01) async def failing_function(): nonlocal call_count call_count += 1 if call_count < 3: raise InferenceTimeoutError(30.0) return "success" result = await failing_function() assert result == "success" assert call_count == 3 @pytest.mark.asyncio async def test_circuit_breaker(self, mock_chute): """Test circuit breaker functionality.""" call_count = 0 @circuit_breaker(failure_threshold=2, timeout_duration=0.1) async def unreliable_function(): nonlocal call_count call_count += 1 raise ModelError("Simulated failure", AIErrorType.GENERATION_FAILED) # First two calls should fail normally with pytest.raises(ModelError): await unreliable_function() with pytest.raises(ModelError): await unreliable_function() # Third call should be blocked by circuit breaker with pytest.raises(ModelError) as exc_info: await unreliable_function() assert "Circuit breaker is OPEN" in str(exc_info.value) @pytest.mark.asyncio async def test_fallback_chain(self, mock_chute): """Test model fallback chain.""" # Create mock models (AsyncMock so the awaited generate() calls work) primary_model = Mock() primary_model.generate = AsyncMock(side_effect=ModelError("Primary failed", AIErrorType.GENERATION_FAILED)) fallback_model = Mock() fallback_model.generate = AsyncMock(return_value="Fallback success") # Create fallback chain chain = ModelFallbackChain() chain.add_primary_model(primary_model, "primary") chain.add_fallback_model(fallback_model, "fallback") result = await chain.generate_with_fallback("test prompt") assert result["generated_text"] == "Fallback success" assert result["model_used"] == "fallback" assert result["was_fallback"] is True def test_error_metrics_collection(self, mock_chute): """Test
error metrics collection.""" metrics = ErrorMetricsCollector() # Record some errors metrics.record_error("ModelError", "Test error 1", {"operation": "inference"}) metrics.record_error("ValidationError", "Test error 2", {"operation": "input_validation"}) metrics.record_error("ModelError", "Test error 3", {"operation": "inference"}) # Check metrics model_error_rate = metrics.get_error_rate("ModelError", window_seconds=60) assert model_error_rate > 0 patterns = metrics.get_top_error_patterns() assert ("ModelError:inference", 2) in patterns @pytest.mark.asyncio async def test_graceful_degradation(self, mock_chute): """Test graceful degradation under load.""" degradation_handler = GracefulDegradationHandler() # Simulate high load degradation_handler.update_system_load(cpu_percent=95, memory_percent=85, gpu_percent=90) assert degradation_handler.current_level == "minimal" # Test parameter adjustment base_params = {"max_tokens": 100, "num_inference_steps": 20} adjusted_params = degradation_handler.get_adjusted_parameters(base_params) assert adjusted_params["max_tokens"] < base_params["max_tokens"] assert adjusted_params["num_inference_steps"] < base_params["num_inference_steps"] @pytest.mark.asyncio async def test_health_checks(self, mock_chute): """Test health check system.""" health_checker = HealthChecker() # Register mock health checks async def mock_healthy_check(): return {"healthy": True, "status": "OK"} async def mock_unhealthy_check(): return {"healthy": False, "status": "FAILED", "error": "Service down"} health_checker.register_check("service1", mock_healthy_check, critical=False) health_checker.register_check("service2", mock_unhealthy_check, critical=True) results = await health_checker.run_all_checks() assert results["overall_status"] == "critical" assert "service2" in results["critical_failures"] assert results["checks"]["service1"]["status"] == "healthy" assert results["checks"]["service2"]["status"] == "unhealthy" # Run tests if __name__ == "__main__": 
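# Tip (assumption: standard pytest CLI): target a single scenario with
# `pytest -k test_retry_mechanism -v` instead of running the full suite.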
pytest.main([__file__, "-v"]) ``` ## Best Practices Summary ### Error Handling Checklist ```python class ErrorHandlingBestPractices: """Best practices for error handling in Chutes applications.""" CHECKLIST = { "Input Validation": [ "Validate all inputs with Pydantic schemas", "Sanitize text inputs for security", "Check file uploads for type and size", "Provide clear validation error messages", "Handle edge cases (empty inputs, extreme values)" ], "Model Error Handling": [ "Wrap model calls with appropriate try-catch blocks", "Handle GPU memory errors gracefully", "Implement timeout mechanisms for inference", "Check context length before processing", "Provide fallback models for resilience" ], "System Resilience": [ "Implement retry mechanisms with exponential backoff", "Use circuit breakers to prevent cascading failures", "Monitor system resources and degrade gracefully", "Implement health checks for all critical components", "Log errors with sufficient context for debugging" ], "User Experience": [ "Return user-friendly error messages", "Avoid exposing internal system details", "Provide actionable guidance in error responses", "Maintain consistent error response format", "Include correlation IDs for support requests" ], "Monitoring and Alerting": [ "Collect comprehensive error metrics", "Set up alerts for error rate spikes", "Monitor health check failures", "Track error patterns and trends", "Implement performance degradation alerts" ] } @classmethod def validate_implementation(cls, chute_instance) -> Dict[str, bool]: """Validate error handling implementation.""" results = {} # Check for error handler results["has_error_handler"] = hasattr(chute_instance, 'error_handler') # Check for health checks results["has_health_checks"] = hasattr(chute_instance, 'health_checker') # Check for metrics collection results["has_error_metrics"] = hasattr(chute_instance, 'error_metrics') # Check for fallback mechanisms results["has_fallback_chain"] = hasattr(chute_instance, 
'fallback_chain') return results ``` ## Next Steps - **Advanced Monitoring**: Implement distributed tracing and APM integration - **Alert Management**: Set up PagerDuty, Slack, or email alerting - **Error Recovery**: Implement automatic recovery mechanisms - **Performance Impact**: Minimize error handling overhead in hot paths For more advanced topics, see: - [Monitoring and Observability](monitoring) - [Best Practices Guide](best-practices) - [Production Deployment](production-deployment) --- ## SOURCE: https://chutes.ai/docs/guides/modern-audio # Modern Audio Processing Guide This guide covers deploying state-of-the-art audio models on Chutes, specifically focusing on **Kokoro** for high-quality Text-to-Speech (TTS) and **Whisper v3** for Speech-to-Text (STT) transcription. ## High-Quality TTS with Kokoro-82M Kokoro is a frontier TTS model that produces extremely natural-sounding speech despite its small size (82M parameters). ### 1. Define the Image Kokoro requires specific system dependencies (`espeak-ng`, `git-lfs`) and Python packages (`phonemizer`, `scipy`, etc.). ```python from chutes.image import Image image = ( Image( username="myuser", name="kokoro-82m", tag="0.0.1", readme="## Text-to-speech using hexgrade/Kokoro-82M", ) .from_base("parachutes/base-python:3.12.7") # Install system dependencies as root .set_user("root") .run_command("apt update && apt install -y espeak-ng git-lfs") # Switch back to chutes user for python packages .set_user("chutes") .run_command("pip install phonemizer scipy munch torch transformers") # Download model weights into the image .run_command("git lfs install") .run_command("git clone https://huggingface.co/hexgrad/Kokoro-82M") .run_command("mv -f Kokoro-82M/* /app/") ) ``` ### 2. 
Define the Chute & Schemas ```python from enum import Enum from io import BytesIO import uuid from fastapi.responses import StreamingResponse from pydantic import BaseModel, Field from chutes.chute import Chute, NodeSelector class VoicePack(str, Enum): DEFAULT = "af" BELLA = "af_bella" SARAH = "af_sarah" ADAM = "am_adam" MICHAEL = "am_michael" class InputArgs(BaseModel): text: str voice: VoicePack = Field(default=VoicePack.DEFAULT) chute = Chute( username="myuser", name="kokoro-tts", image=image, node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=24), ) ``` ### 3. Initialize & Define Endpoint We load the model and voice packs into GPU memory on startup for low-latency inference. ```python @chute.on_startup() async def initialize(self): from models import build_model import torch import wave import numpy as np from kokoro import generate # Store libraries for use in the endpoint self.wave = wave self.np = np self.generate = generate # Load model self.model = build_model("kokoro-v0_19.pth", "cuda") # Pre-load voice packs self.voice_packs = {} for voice_id in VoicePack: self.voice_packs[voice_id.value] = torch.load( f"voices/{voice_id.value}.pt", weights_only=True ).to("cuda") @chute.cord( public_api_path="/speak", method="POST", output_content_type="audio/wav" ) async def speak(self, args: InputArgs) -> StreamingResponse: # Generate audio audio_data, _ = self.generate( self.model, args.text, self.voice_packs[args.voice.value], lang=args.voice.value[0] ) # Convert to WAV buffer = BytesIO() audio_int16 = (audio_data * 32768).astype(self.np.int16) with self.wave.open(buffer, "wb") as wav_file: wav_file.setnchannels(1) wav_file.setsampwidth(2) wav_file.setframerate(24000) wav_file.writeframes(audio_int16.tobytes()) buffer.seek(0) return StreamingResponse( buffer, media_type="audio/wav", headers={"Content-Disposition": f"attachment; filename={uuid.uuid4()}.wav"} ) ``` ## Speech Transcription with Whisper v3 Deploying OpenAI's Whisper Large v3 allows for 
state-of-the-art transcription and translation.

### 1. Setup

```python
from chutes.image import Image
from chutes.chute import Chute, NodeSelector
from pydantic import BaseModel, Field
from typing import Optional
import tempfile
import base64

# Simple image with transformers and acceleration
image = (
    Image(username="myuser", name="whisper-v3", tag="1.0")
    .from_base("parachutes/base-python:3.12.7")
    .run_command("pip install transformers torch accelerate")
)

chute = Chute(
    username="myuser",
    name="whisper-v3",
    image=image,
    node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=24)
)

class TranscriptionArgs(BaseModel):
    audio_b64: str = Field(..., description="Base64 encoded audio file")
    language: Optional[str] = Field(None, description="Target language code (e.g., 'en', 'fr')")
```

### 2. Initialize Pipeline

```python
@chute.on_startup()
async def load_model(self):
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
    import torch

    model_id = "openai/whisper-large-v3"
    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        use_safetensors=True
    ).to("cuda")

    processor = AutoProcessor.from_pretrained(model_id)

    self.pipe = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        torch_dtype=torch.float16,
        device="cuda",
    )
```

### 3.
Transcription Endpoint ```python @chute.cord(public_api_path="/transcribe", method="POST") async def transcribe(self, args: TranscriptionArgs): # Decode base64 audio to temporary file with tempfile.NamedTemporaryFile(mode="wb", suffix=".wav") as tmpfile: tmpfile.write(base64.b64decode(args.audio_b64)) tmpfile.flush() generate_kwargs = {} if args.language: generate_kwargs["language"] = args.language result = self.pipe( tmpfile.name, return_timestamps=True, generate_kwargs=generate_kwargs ) # Format chunks for cleaner output formatted_chunks = [ { "start": chunk["timestamp"][0], "end": chunk["timestamp"][1], "text": chunk["text"] } for chunk in result["chunks"] ] return {"text": result["text"], "chunks": formatted_chunks} ``` ## Usage Tips 1. **Latency**: For real-time applications (like voice bots), prefer smaller models or streaming architectures. Kokoro is extremely fast and suitable for near real-time use. 2. **Audio Format**: When sending audio to the API, standard formats like WAV or MP3 are supported. For base64 uploads, ensure you strip any data URI headers (e.g., `data:audio/wav;base64,`) before sending. 3. **VRAM**: `whisper-large-v3` typically requires ~10GB VRAM for inference. Kokoro is very lightweight (<4GB). A single 24GB GPU (e.g., A10G, 3090, 4090) can easily host both if combined into one chute! --- ## SOURCE: https://chutes.ai/docs/guides/performance # Performance Optimization Guide This comprehensive guide covers performance optimization strategies for Chutes applications, from model inference to network efficiency and resource management. 
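Before applying any of the techniques in this guide, it helps to establish a latency baseline so you can verify that each change actually improves things. Below is a minimal, stdlib-only timing harness; it is not part of the Chutes SDK, just a sketch you can wrap around a local call or an HTTP request to your chute.

```python
import time
import statistics
from typing import Callable, Dict, List

def measure_latency(fn: Callable[[], object], runs: int = 50, warmup: int = 5) -> Dict[str, float]:
    """Call `fn` repeatedly and report latency percentiles in milliseconds."""
    for _ in range(warmup):
        fn()  # warm caches, lazy imports, etc. before measuring
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[int(0.95 * (len(samples) - 1))],
        "max_ms": samples[-1],
    }

# Example: baseline a stand-in for an inference call
print(measure_latency(lambda: sum(range(100_000))))
```

Compare p50 against p95 and max: a large gap usually points at queuing, cold starts, or GC pauses rather than raw model speed.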
## Overview Performance optimization in Chutes involves several key areas: - **Model Optimization**: Quantization, compilation, and batching - **Resource Management**: Efficient GPU and memory usage - **Scaling Strategies**: Auto-scaling and load balancing - **Caching**: Reducing redundant computations - **Network Optimization**: Minimizing latency and payload size - **Monitoring**: Tracking metrics to identify bottlenecks ## Model Inference Optimization ### Dynamic Batching Processing requests in batches significantly improves GPU utilization. Here's a robust dynamic batcher implementation: ```python import asyncio import time from typing import List, Dict, Any from dataclasses import dataclass @dataclass class BatchRequest: data: Dict[str, Any] future: asyncio.Future timestamp: float class DynamicBatcher: def __init__(self, max_batch_size: int = 32, max_wait_time: float = 0.01): self.max_batch_size = max_batch_size self.max_wait_time = max_wait_time self.pending_requests: List[BatchRequest] = [] self.processing = False self.lock = asyncio.Lock() async def add_request(self, data: Dict[str, Any]) -> Any: """Add request to batch queue""" future = asyncio.Future() request = BatchRequest(data, future, time.time()) async with self.lock: self.pending_requests.append(request) if not self.processing: asyncio.create_task(self._process_batch()) return await future async def _process_batch(self): """Process accumulated requests""" async with self.lock: if self.processing or not self.pending_requests: return self.processing = True while True: # Wait for batch to accumulate or timeout start_time = time.time() while (len(self.pending_requests) < self.max_batch_size and time.time() - start_time < self.max_wait_time): await asyncio.sleep(0.001) async with self.lock: if not self.pending_requests: break # Extract batch batch_size = min(len(self.pending_requests), self.max_batch_size) batch = self.pending_requests[:batch_size] self.pending_requests = 
self.pending_requests[batch_size:] # Run inference try: batch_data = [req.data for req in batch] results = await self._run_inference(batch_data) for req, result in zip(batch, results): if not req.future.done(): req.future.set_result(result) except Exception as e: for req in batch: if not req.future.done(): req.future.set_exception(e) async with self.lock: self.processing = False async def _run_inference(self, batch_data: List[Dict]) -> List[Any]: """Override this with your actual inference logic""" # Example: # inputs = tokenizer([item["text"] for item in batch_data], padding=True, return_tensors="pt") # outputs = model(**inputs) # return outputs return [{"result": "mock_result"} for _ in batch_data] ``` ### Model Quantization Reduce model size and memory footprint using quantization (e.g., 8-bit or 4-bit): ```python from chutes.image import Image # Build image with quantization support image = ( Image(username="myuser", name="quantized-model", tag="1.0") .pip_install([ "torch", "transformers", "bitsandbytes", # Required for 8-bit/4-bit "accelerate" ]) ) # Loading a quantized model def load_quantized_model(): from transformers import AutoModelForCausalLM, BitsAndBytesConfig import torch quant_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16 ) model = AutoModelForCausalLM.from_pretrained( "microsoft/DialoGPT-medium", quantization_config=quant_config, device_map="auto" ) return model ``` ### TorchScript Compilation Compile PyTorch models for faster execution: ```python import torch def optimize_model(model, example_input): # Trace the model traced_model = torch.jit.trace(model, example_input) return torch.jit.freeze(traced_model) ``` ## Resource Management ### GPU Memory Management Properly managing GPU memory is critical to avoid OOM errors and maximize throughput. 
```python import torch import gc from contextlib import contextmanager class GPUMemoryManager: @contextmanager def optimization_context(self): """Context manager to clear cache before and after operations""" self.cleanup() try: yield finally: self.cleanup() def cleanup(self): """Aggressive memory cleanup""" gc.collect() if torch.cuda.is_available(): torch.cuda.empty_cache() torch.cuda.synchronize() def get_usage(self): if not torch.cuda.is_available(): return 0 return torch.cuda.memory_allocated() / torch.cuda.max_memory_allocated() # Usage memory_manager = GPUMemoryManager() async def run_inference(inputs): with memory_manager.optimization_context(): # Run heavy inference here pass ``` ## Scaling Strategies ### Auto-scaling Configuration Configure your chute to scale automatically based on load: ```python from chutes.chute import Chute, NodeSelector chute = Chute( # ... other args ... concurrency=10, # Max concurrent requests per instance # Auto-scaling settings auto_scale=True, min_instances=1, max_instances=10, scale_up_threshold=0.8, # Scale up when 80% concurrency reached scale_down_threshold=0.3, # Scale down when <30% utilized scale_up_cooldown=60, # Wait 60s before next scale up scale_down_cooldown=300 # Wait 5m before scaling down ) ``` ## Caching Strategies ### Redis Caching Use Redis for distributed caching across multiple chute instances: ```python import redis import pickle import hashlib class CacheManager: def __init__(self, redis_url="redis://localhost:6379"): self.redis = redis.from_url(redis_url) def get_key(self, prefix, *args, **kwargs): key_str = str(args) + str(sorted(kwargs.items())) return f"{prefix}:{hashlib.md5(key_str.encode()).hexdigest()}" def get(self, key): data = self.redis.get(key) return pickle.loads(data) if data else None def set(self, key, value, ttl=3600): self.redis.setex(key, ttl, pickle.dumps(value)) # Decorator usage def cached(ttl=3600): def decorator(func): async def wrapper(self, *args, **kwargs): key = 
self.cache.get_key(func.__name__, *args, **kwargs) result = self.cache.get(key) if result: return result result = await func(self, *args, **kwargs) self.cache.set(key, result, ttl) return result return wrapper return decorator ``` ## Network Optimization ### Response Compression Compress large JSON responses to reduce network transfer time: ```python import gzip import json def compress_response(data: dict) -> dict: json_str = json.dumps(data) if len(json_str) < 1024: # Don't compress small responses return data compressed = gzip.compress(json_str.encode()) return { "compressed": True, "data": compressed.hex() } ``` ### Streaming For long-running generations (like LLMs), use streaming to provide immediate feedback. See the [Streaming Guide](streaming) for details. ## Monitoring Track performance metrics to identify bottlenecks. ```python import time from prometheus_client import Histogram, Counter REQUEST_TIME = Histogram('request_processing_seconds', 'Time spent processing request') REQUEST_COUNT = Counter('request_count', 'Total request count') @chute.cord(public_api_path="/run", method="POST") async def run(self, data: dict): REQUEST_COUNT.inc() with REQUEST_TIME.time(): # Process request return await self.process(data) ``` ## Next Steps - **[Cost Optimization](cost-optimization)**: Balance performance with cost - **[Best Practices](best-practices)**: General deployment guidelines - **[Streaming Guide](streaming)**: Implement real-time responses --- ## SOURCE: https://chutes.ai/docs/guides/production-readiness # Production Readiness Guide Moving from a prototype to a production-grade application on Chutes requires attention to reliability, security, and scaling. This checklist covers the essential steps to ensure your chute is ready for the real world. ## 1. Reliability & Stability ### ✅ Handle Startup & Shutdown Ensure your `on_startup` logic is robust. Pre-download all necessary models and artifacts so the first request is fast. 
```python
import os

@chute.on_startup()
async def startup(self):
    # Fail fast if critical resources are missing
    if not os.path.exists("model.bin"):
        raise RuntimeError("Model file missing!")
    self.model = load_model("model.bin")
```

### ✅ Implement Health Checks

Define a lightweight cord for health monitoring (e.g., by load balancers).

```python
from fastapi import HTTPException

@chute.cord(public_api_path="/health", method="GET")
async def health(self):
    if self.model is None:
        raise HTTPException(503, "Model not loaded")
    return {"status": "ok"}
```

### ✅ Graceful Error Handling

Don't let internal errors crash your service or leak stack traces to users. Wrap logic in try/except blocks and return appropriate HTTP status codes.

```python
try:
    result = self.model.predict(data)
except ValueError:
    raise HTTPException(400, "Invalid input data")
except Exception as e:
    logger.error(f"Inference failed: {e}")
    raise HTTPException(500, "Internal inference error")
```

## 2. Performance & Scaling

### ✅ Concurrency Tuning

Set `concurrency` appropriately.

* **1**: For heavy, atomic workloads (e.g., image generation) where batching isn't possible.
* **High (e.g., 64+)**: For async engines like vLLM that handle internal batching.

### ✅ Auto-Scaling Configuration

Configure scaling parameters to handle traffic spikes without over-provisioning.

```python
chute = Chute(
    ...
    min_instances=1,             # Keep one warm if low latency is critical
    max_instances=10,            # Cap costs/resources
    scaling_threshold=0.75,      # Scale up when 75% utilized
    shutdown_after_seconds=300   # Scale down after 5 min idle
)
```

### ✅ Caching

Use internal caching (LRU) or external caches (Redis) for frequent, identical queries to save compute.

## 3. Security

### ✅ Scoped API Keys

Never use your admin API key in client-side code. Create scoped keys for specific functions.

```bash
# Create a key that can ONLY invoke this specific chute
chutes keys create --name "app-client" --action invoke --chute-ids
```

### ✅ Input Validation

Use Pydantic schemas strictly.
Validate string lengths, image sizes, and numeric ranges to prevent DOS attacks or memory overflows. ```python class Input(BaseModel): prompt: str = Field(..., max_length=1000) # Prevent massive prompt attacks steps: int = Field(..., ge=1, le=50) # Bound compute usage ``` ## 4. Observability ### ✅ Logging Log structured data (JSON) where possible for easy parsing. Log important events (startup, errors) but avoid logging sensitive user data (PII). ### ✅ Metrics Use the built-in Prometheus client if you need custom metrics (e.g., "images_generated_total"), or rely on the platform's standard metrics (requests/sec, latency). ## 5. Deployment Strategy ### ✅ Pinned Versions Always pin your dependencies in `requirements.txt` or your `Image` definition. * **Bad**: `pip install torch` * **Good**: `pip install torch==2.4.0` ### ✅ Immutable Tags Don't rely on `latest` tags for base images. Use specific SHA digests or version tags to ensure reproducibility. ### ✅ Staging Environment Deploy a separate "staging" chute (e.g., `my-app-staging`) to test changes before updating your production chute. ## Production Checklist Summary - [ ] **Model Loading**: Pre-loaded on startup, not per-request. - [ ] **Error Handling**: User-friendly HTTP errors, no stack traces. - [ ] **Validation**: Strict Pydantic schemas for all inputs. - [ ] **Scaling**: `max_instances` set to protect budget. - [ ] **Security**: Scoped API keys generated for clients. - [ ] **Dependencies**: All packages pinned to specific versions. - [ ] **Monitoring**: Health check endpoint exists and works. --- ## SOURCE: https://chutes.ai/docs/guides/rag-pipeline # Building a RAG Pipeline Retrieval-Augmented Generation (RAG) combines the power of Large Language Models (LLMs) with your own custom data. This guide walks through building a complete RAG pipeline on Chutes using **ChromaDB** for vector storage, **vLLM** for embeddings, and **SGLang/vLLM** for generation. 
## Architecture A standard RAG pipeline on Chutes consists of three components: 1. **Embedding Service**: Converts text into vector representations. 2. **Vector Database (Chroma)**: Stores vectors and performs similarity search. 3. **LLM (Generation)**: Takes the query + retrieved context and generates an answer. You can deploy these as separate chutes for scalability, or combine them for simplicity. Here, we'll deploy them as modular components. --- ## Step 1: Deploy Embedding Service Use the `embedding` template to deploy a high-performance embedding model like `bge-large-en-v1.5`. ```python # deploy_embedding.py from chutes.chute import NodeSelector from chutes.chute.template.embedding import build_embedding_chute chute = build_embedding_chute( username="myuser", model_name="BAAI/bge-large-en-v1.5", readme="High performance embeddings", node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=16), concurrency=32, ) ``` Deploy it: ```bash chutes deploy deploy_embedding:chute ``` ## Step 2: Deploy ChromaDB We'll create a custom chute that runs ChromaDB. Chroma is persistent, so we'll use a **Job** or a persistent storage pattern if we need data to survive restarts. For this example, we'll set up an ephemeral vector DB that ingests data on startup (great for read-only knowledge bases). 
```python # deploy_chroma.py from chutes.image import Image from chutes.chute import Chute, NodeSelector from pydantic import BaseModel, Field from typing import List image = ( Image(username="myuser", name="chroma-db", tag="0.1") .from_base("parachutes/base-python:3.12.7") .run_command("pip install chromadb") ) chute = Chute( username="myuser", name="rag-vector-db", image=image, node_selector=NodeSelector(gpu_count=0, min_cpu_count=2, min_memory_gb=8), ) class Query(BaseModel): query_embeddings: List[List[float]] n_results: int = 5 @chute.on_startup() async def setup_db(self): import chromadb self.client = chromadb.Client() self.collection = self.client.create_collection("knowledge_base") # INGESTION: In a real app, you might fetch this from S3 or a database documents = [ "Chutes is a serverless GPU platform.", "You can deploy LLMs, diffusion models, and custom code on Chutes.", "Chutes uses a decentralized network of GPUs." ] ids = [f"doc_{i}" for i in range(len(documents))] # Note: In a real setup, you'd generate embeddings for these docs first # For simplicity, we assume you send pre-computed embeddings or compute them here # self.collection.add(documents=documents, ids=ids, embeddings=...) print("ChromaDB initialized!") @chute.cord(public_api_path="/query", method="POST") async def query(self, q: Query): results = self.collection.query( query_embeddings=q.query_embeddings, n_results=q.n_results ) return results ``` ## Step 3: The RAG Controller (Client-Side or Chute) You can orchestrate the RAG flow from your client application, or deploy a "Controller Chute" that talks to the other services. Here is a Python client example that ties it all together. 
```python import requests import openai # Configuration EMBEDDING_URL = "https://myuser-bge-large.chutes.ai/v1/embeddings" CHROMA_URL = "https://myuser-rag-vector-db.chutes.ai/query" LLM_BASE_URL = "https://myuser-deepseek-r1.chutes.ai/v1" API_KEY = "your-api-key" def get_embedding(text): """Get embedding vector for text.""" resp = requests.post( EMBEDDING_URL, headers={"Authorization": API_KEY}, json={"input": text, "model": "BAAI/bge-large-en-v1.5"} ) return resp.json()["data"][0]["embedding"] def search_knowledge_base(embedding): """Search vector DB.""" resp = requests.post( CHROMA_URL, headers={"Authorization": API_KEY}, json={"query_embeddings": [embedding], "n_results": 3} ) # Format results into a context string results = resp.json() return "\n".join(results["documents"][0]) def generate_answer(query, context): """Generate answer using LLM.""" client = openai.OpenAI(base_url=LLM_BASE_URL, api_key=API_KEY) prompt = f""" Use the following context to answer the question. Context: {context} Question: {query} """ resp = client.chat.completions.create( model="deepseek-ai/DeepSeek-R1", messages=[{"role": "user", "content": prompt}], temperature=0.1 ) return resp.choices[0].message.content # Main Flow user_query = "What is Chutes?" print(f"Querying: {user_query}...") # 1. Embed vector = get_embedding(user_query) # 2. Retrieve context = search_knowledge_base(vector) print(f"Retrieved Context:\n{context}\n") # 3. Generate answer = generate_answer(user_query, context) print(f"Answer:\n{answer}") ``` ## Advanced: ComfyUI Workflow for RAG You can also use ComfyUI on Chutes to build visual RAG pipelines. The `chroma.py` example in the Chutes examples directory demonstrates how to wrap a ComfyUI workflow (which can include RAG nodes) inside a Chute API. 1. Build a ComfyUI workflow that includes text loading, embedding, and LLM query nodes. 2. Export the workflow as JSON API format. 3. 
Use the `chroma.py` pattern to load this workflow into a Chute, exposing inputs (like "prompt") as API parameters. This allows you to drag-and-drop your RAG logic and deploy it as a scalable API instantly. --- ## SOURCE: https://chutes.ai/docs/guides/reasoning-models # Reasoning Models Guide (DeepSeek R1) DeepSeek R1 is a powerful open-source reasoning model that rivals proprietary models like OpenAI's o1. This guide shows you how to deploy DeepSeek R1 on Chutes using the SGLang template, optimized for high-performance reasoning tasks. ## Overview DeepSeek R1 is a "reasoning model", meaning it is designed to "think" before it answers. This manifests as a chain-of-thought (CoT) process where the model explores the problem space, breaks down complex queries, and self-corrects before generating a final response. Key requirements for deploying DeepSeek R1: - **Large Context Window**: Reasoning traces can be long, requiring support for large context lengths (e.g., 65k-128k tokens). - **High VRAM**: The full 671B parameter model (even quantized) requires significant GPU memory (multiple H100s/H200s). - **Optimized Serving**: SGLang is recommended for its efficient handling of structured generation and long contexts. ## Quick Start: DeepSeek R1 Distill (Recommended) For most use cases, the distilled versions of DeepSeek R1 (based on Llama 3 or Qwen 2.5) offer an excellent balance of performance and cost. They can often run on single GPUs. ```python from chutes.chute import NodeSelector from chutes.chute.template.vllm import build_vllm_chute chute = build_vllm_chute( username="myuser", readme="DeepSeek R1 Distill Llama 8B - Efficient Reasoning", model_name="deepseek-ai/DeepSeek-R1-Distill-Llama-8B", revision="main", concurrency=16, node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24, # Fits comfortably on A10g, A100, etc. ), engine_args={ "max_model_len": 32768, # Reasoning models need context! 
"enable_prefix_caching": True,
    }
)
```

## Advanced: Full DeepSeek R1 (671B)

To deploy the full DeepSeek R1 model, you will need a multi-node or high-end multi-GPU setup. Chutes makes this accessible via the `sglang` template.

### Configuration

The full model is massive. We recommend using `chutes/sglang` images, which are highly optimized for this workload.

```python
import os
from chutes.chute import NodeSelector
from chutes.chute.template.sglang import build_sglang_chute

# Helper to configure environment for multi-node communication
os.environ["NO_PROXY"] = "localhost,127.0.0.1"

chute = build_sglang_chute(
    username="myuser",
    readme="## DeepSeek R1 (Full 671B)\n\nState-of-the-art open reasoning model.",
    model_name="deepseek-ai/DeepSeek-R1",
    # Use a recent SGLang image for best R1 support
    image="chutes/sglang:0.4.6.post5b",
    concurrency=24,
    # Hardware Requirements
    node_selector=NodeSelector(
        gpu_count=8,              # Requires 8 GPUs
        min_vram_gb_per_gpu=140,  # H200-class VRAM (141GB); this floor excludes 80GB cards
        include=["h200"],         # Specifically target H200s for best performance
    ),
    # SGLang Engine Arguments
    engine_args=(
        "--trust-remote-code "
        "--revision f7361cd9ff99396dbf6bd644ad846015e59ed4fc "  # Pin a known good revision
        "--tp-size 8 "                 # Tensor parallelism across 8 GPUs
        "--context-length 65536 "      # Large context for reasoning traces
        "--mem-fraction-static 0.90 "  # Optimize memory usage
    ),
)
```

### Deployment

Save the above code to `deepseek_r1.py` and deploy:

```bash
chutes deploy deepseek_r1:chute
```

*Note: This deployment uses high-end hardware (8x H200s). Ensure your account has sufficient limits and balance.*

## Using Reasoning Models

When interacting with reasoning models, the "thinking process" is often returned as part of the output, enclosed in specific tags (e.g., `<think>...</think>`).
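Client code typically separates this reasoning trace from the final answer before showing anything to users. A minimal sketch, assuming the completion begins with a `<think>...</think>` block (the convention R1-style models use):

```python
def split_reasoning(content: str) -> tuple[str, str]:
    """Split a completion into (thinking, answer) using <think>...</think> markers."""
    open_tag, close_tag = "<think>", "</think>"
    start = content.find(open_tag)
    end = content.find(close_tag)
    if start == -1 or end == -1:
        # No reasoning block present; treat the whole completion as the answer
        return "", content.strip()
    thinking = content[start + len(open_tag):end].strip()
    answer = content[end + len(close_tag):].strip()
    return thinking, answer

thinking, answer = split_reasoning("<think>Count: r, r, r.</think>There are 3 Rs.")
print(answer)  # There are 3 Rs.
```

Keeping the raw trace around (e.g., for logging) while displaying only the answer is a common pattern.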
### Example Request

```python
import openai

client = openai.OpenAI(
    base_url="https://myuser-deepseek-r1.chutes.ai/v1",
    api_key="your-api-key"
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[
        {"role": "user", "content": "How many Rs are in the word strawberry?"}
    ],
    temperature=0.6,
)

content = response.choices[0].message.content
print(content)
```

**Output Structure:**

```text
<think>
The user is asking for the count of the letter 'r' in "strawberry".
1. S-t-r-a-w-b-e-r-r-y
2. Let's count them:
   - s
   - t
   - r (1)
   - a
   - w
   - b
   - e
   - r (2)
   - r (3)
   - y
3. There are 3 Rs.
</think>

There are 3 Rs in the word "strawberry".
```

## Best Practices

1. **Prompting**: Reasoning models respond well to simple, direct prompts. You often don't need complex "Chain of Thought" prompting strategies because the model does this natively.
2. **Temperature**: Keep temperature slightly higher (0.5 - 0.7) than for standard code models (0.0) so the model can explore different reasoning paths, but not so high that the output becomes incoherent.
3. **Context Management**: The `<think>` traces consume tokens. Ensure your `max_model_len` / `context_length` is sufficient (e.g., 32k+) to accommodate long reasoning chains plus the final answer.
4. **Streaming**: Always use `stream=True` for a better user experience, as the initial "thinking" phase can take several seconds before the final answer begins to appear.

## Troubleshooting

* **OOM (Out of Memory)**: If the chute fails to start, try reducing `max_model_len` or `max_num_seqs` in `engine_args`. For the full 671B model, ensure you are targeting 8x80GB (A100/H100) or 8x141GB (H200) nodes.
* **Slow "Time to First Token"**: This is normal for reasoning models as they generate internal thought tokens before producing visible output.
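Because the thinking phase delays the visible answer, streaming is especially valuable for reasoning models. Below is a small helper for consuming an OpenAI-style chat completion stream; it is a sketch that pairs with the client shown in the example request above (the network call itself is left as a comment).

```python
from typing import Iterable

def consume_stream(stream: Iterable) -> str:
    """Echo streamed completion chunks as they arrive and return the full text.

    Expects OpenAI-style chunk objects (chunk.choices[0].delta.content).
    """
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # delta is None for role/finish chunks
            print(delta, end="", flush=True)
            parts.append(delta)
    print()
    return "".join(parts)

# Usage with the client from the example above (network call, not run here):
# stream = client.chat.completions.create(..., stream=True)
# full_text = consume_stream(stream)
```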
--- ## SOURCE: https://chutes.ai/docs/guides/schemas # Input/Output Schemas with Pydantic This guide covers how to use Pydantic for robust input/output validation in Chutes applications, enabling type safety, automatic API documentation, and data transformation. ## Overview Pydantic schemas in Chutes provide: - **Type Safety**: Automatic type validation and conversion - **API Documentation**: Auto-generated OpenAPI/Swagger docs - **Error Handling**: Clear validation error messages - **Data Transformation**: Automatic serialization/deserialization - **IDE Support**: Full autocomplete and type checking - **Validation Rules**: Custom validators and constraints ## Basic Schema Definition ### Simple Input/Output Schemas ```python from pydantic import BaseModel, Field from typing import Optional, List from datetime import datetime class TextInput(BaseModel): text: str = Field(..., min_length=1, max_length=5000, description="Input text to analyze") language: Optional[str] = Field("auto", description="Language code (auto-detect if not specified)") options: Optional[List[str]] = Field(default=[], description="Additional processing options") class AnalysisOutput(BaseModel): result: str = Field(..., description="Analysis result") confidence: float = Field(..., ge=0.0, le=1.0, description="Confidence score between 0 and 1") language_detected: Optional[str] = Field(None, description="Detected language code") processing_time: float = Field(..., gt=0, description="Processing time in seconds") timestamp: datetime = Field(default_factory=datetime.now, description="Processing timestamp") # Usage in chute from chutes.chute import Chute chute = Chute(username="myuser", name="text-analyzer") @chute.cord( public_api_path="/analyze", method="POST", input_schema=TextInput, output_schema=AnalysisOutput ) async def analyze_text(self, input_data: TextInput) -> AnalysisOutput: """Analyze text with full type safety.""" # Input is automatically validated and typed text = input_data.text 
language = input_data.language options = input_data.options # Process text (example) result = f"Analyzed: {text[:50]}..." confidence = 0.95 # Return validated output return AnalysisOutput( result=result, confidence=confidence, language_detected="en", processing_time=0.1 ) ``` ### Advanced Field Validation ```python from pydantic import BaseModel, Field, validator, root_validator from typing import Union, Literal import re class ImageGenerationInput(BaseModel): prompt: str = Field( ..., min_length=3, max_length=500, description="Text prompt for image generation" ) width: int = Field( 512, ge=128, le=2048, multiple_of=64, # Must be divisible by 64 description="Image width in pixels" ) height: int = Field( 512, ge=128, le=2048, multiple_of=64, description="Image height in pixels" ) steps: int = Field( 20, ge=1, le=100, description="Number of inference steps" ) guidance_scale: float = Field( 7.5, ge=1.0, le=20.0, description="Guidance scale for generation" ) style: Literal["realistic", "artistic", "cartoon", "abstract"] = Field( "realistic", description="Image style" ) seed: Optional[int] = Field( None, ge=0, le=2**32-1, description="Random seed for reproducibility" ) negative_prompt: Optional[str] = Field( None, max_length=500, description="Negative prompt to avoid certain elements" ) @validator('prompt') def validate_prompt(cls, v): """Custom prompt validation.""" # Remove excessive whitespace v = re.sub(r'\s+', ' ', v.strip()) # Check for inappropriate content (example) forbidden_words = ['violence', 'harmful'] if any(word in v.lower() for word in forbidden_words): raise ValueError('Prompt contains inappropriate content') return v @validator('width', 'height') def validate_dimensions(cls, v, field): """Validate image dimensions.""" if v % 64 != 0: raise ValueError(f'{field.name} must be divisible by 64') return v @root_validator def validate_aspect_ratio(cls, values): """Validate overall aspect ratio.""" width = values.get('width', 512) height = values.get('height', 
512) aspect_ratio = width / height if aspect_ratio > 4 or aspect_ratio < 0.25: raise ValueError('Extreme aspect ratios not supported (must be between 0.25 and 4)') return values class Config: # Generate example values for documentation schema_extra = { "example": { "prompt": "a beautiful sunset over mountains", "width": 1024, "height": 768, "steps": 25, "guidance_scale": 7.5, "style": "realistic", "seed": 42, "negative_prompt": "blurry, low quality" } } ``` ## Complex Schema Patterns ### Nested Schemas ```python from typing import List, Dict, Any from enum import Enum class ProcessingOptions(BaseModel): """Nested schema for processing options.""" enable_caching: bool = Field(True, description="Enable result caching") timeout_seconds: int = Field(30, ge=1, le=300, description="Processing timeout") parallel_processing: bool = Field(False, description="Enable parallel processing") class ModelConfig(BaseModel): """Model configuration schema.""" model_name: str = Field(..., description="Model identifier") temperature: float = Field(0.7, ge=0.0, le=2.0, description="Sampling temperature") max_tokens: int = Field(100, ge=1, le=4096, description="Maximum output tokens") top_p: float = Field(0.9, ge=0.0, le=1.0, description="Nucleus sampling parameter") class BatchProcessingInput(BaseModel): """Complex input schema with nested structures.""" texts: List[str] = Field(..., min_items=1, max_items=100, description="List of texts to process") model_config: ModelConfig = Field(..., description="Model configuration") processing_options: ProcessingOptions = Field(default_factory=ProcessingOptions, description="Processing options") metadata: Optional[Dict[str, Any]] = Field(None, description="Additional metadata") @validator('texts') def validate_texts(cls, v): """Validate text list.""" # Check each text for i, text in enumerate(v): if not text.strip(): raise ValueError(f'Text at index {i} cannot be empty') if len(text) > 5000: raise ValueError(f'Text at index {i} too long (max 5000 
characters)')
        return v

class ProcessingResult(BaseModel):
    """Individual processing result."""
    input_text: str
    output_text: str
    confidence: float = Field(..., ge=0.0, le=1.0)
    processing_time: float = Field(..., gt=0)
    model_used: str

class BatchProcessingOutput(BaseModel):
    """Complex output schema."""
    results: List[ProcessingResult] = Field(..., description="Processing results")
    total_processed: int = Field(..., ge=0, description="Total items processed")
    total_time: float = Field(..., gt=0, description="Total processing time")
    success_rate: float = Field(..., ge=0.0, le=1.0, description="Success rate")
    metadata: Dict[str, Any] = Field(default_factory=dict, description="Result metadata")

    @validator('success_rate')
    def validate_success_rate(cls, v, values):
        """Validate success rate consistency."""
        results = values.get('results', [])
        total_processed = values.get('total_processed', 0)
        if total_processed > 0:
            expected_rate = len(results) / total_processed
            if abs(v - expected_rate) > 0.01:  # Allow small floating point errors
                raise ValueError('Success rate inconsistent with results')
        return v
```

### Union Types and Polymorphic Schemas

```python
from typing import Literal, Optional, Union

from pydantic import BaseModel, Field

class TextTask(BaseModel):
    task_type: Literal["text"] = "text"
    text: str = Field(..., description="Input text")
    model: str = Field("gpt-3.5-turbo", description="Text model to use")

class ImageTask(BaseModel):
    task_type: Literal["image"] = "image"
    prompt: str = Field(..., description="Image generation prompt")
    width: int = Field(512, ge=128, le=2048)
    height: int = Field(512, ge=128, le=2048)

class AudioTask(BaseModel):
    task_type: Literal["audio"] = "audio"
    text: str = Field(..., description="Text to convert to speech")
    voice: str = Field("default", description="Voice to use")
    speed: float = Field(1.0, ge=0.5, le=2.0)

# Union type; the discriminator is declared on the field that uses it
TaskInput = Union[TextTask, ImageTask, AudioTask]

class UniversalProcessingInput(BaseModel):
    """Schema supporting
multiple task types.""" task: TaskInput = Field(..., discriminator='task_type', description="Task to process") priority: int = Field(1, ge=1, le=5, description="Task priority") callback_url: Optional[str] = Field(None, description="Callback URL for results") # Usage in endpoint @chute.cord( public_api_path="/process", method="POST", input_schema=UniversalProcessingInput ) async def process_universal(self, input_data: UniversalProcessingInput): """Process different types of tasks.""" task = input_data.task if task.task_type == "text": # Type narrowing - IDE knows this is TextTask return await self.process_text(task.text, task.model) elif task.task_type == "image": return await self.process_image(task.prompt, task.width, task.height) elif task.task_type == "audio": return await self.process_audio(task.text, task.voice, task.speed) ``` ## Advanced Validation Techniques ### Custom Validators ```python from pydantic import validator, ValidationError import base64 import mimetypes class FileUploadSchema(BaseModel): """Schema for file upload validation.""" filename: str = Field(..., description="Original filename") content_type: str = Field(..., description="MIME type") data: str = Field(..., description="Base64 encoded file data") max_size_mb: int = Field(10, ge=1, le=100, description="Maximum file size in MB") @validator('filename') def validate_filename(cls, v): """Validate filename.""" if not v or len(v.strip()) == 0: raise ValueError('Filename cannot be empty') # Check for path traversal if '..' 
in v or '/' in v or '\\' in v:
            raise ValueError('Invalid filename')
        return v.strip()

    @validator('content_type')
    def validate_content_type(cls, v):
        """Validate MIME type."""
        allowed_types = [
            'image/jpeg', 'image/png', 'image/gif',
            'text/plain', 'application/pdf',
            'audio/mpeg', 'audio/wav'
        ]
        if v not in allowed_types:
            raise ValueError(f'Content type {v} not allowed')
        return v

    @validator('data')
    def validate_base64_data(cls, v, values):
        """Validate base64 data and size."""
        try:
            # Decode base64
            decoded = base64.b64decode(v)
        except Exception:
            raise ValueError('Invalid base64 encoding')

        # Check file size
        max_size_mb = values.get('max_size_mb', 10)
        max_size_bytes = max_size_mb * 1024 * 1024
        if len(decoded) > max_size_bytes:
            raise ValueError(f'File size exceeds {max_size_mb}MB limit')

        # Validate content type matches data
        content_type = values.get('content_type')
        if content_type and content_type.startswith('image/'):
            # Simple magic-number check - in practice, use more sophisticated detection
            signatures = {
                'image/jpeg': b'\xff\xd8\xff',
                'image/png': b'\x89PNG\r\n\x1a\n',
                'image/gif': b'GIF8',
            }
            expected = signatures.get(content_type)
            if expected and not decoded.startswith(expected):
                raise ValueError('File content does not match declared type')
        return v

class ModelSelectionSchema(BaseModel):
    """Schema with model-specific validation."""
    model_name: str = Field(..., description="Model identifier")
    input_text: str = Field(..., description="Input text")
    parameters: Dict[str, Any] = Field(default_factory=dict, description="Model parameters")

    @validator('parameters')
    def validate_model_parameters(cls, v, values):
        """Validate parameters based on model."""
        model_name = values.get('model_name', '')

        # Model-specific parameter validation
        if 'gpt' in model_name.lower():
            # GPT models
            if 'temperature' in v and not (0.0 <= v['temperature'] <= 2.0):
                raise ValueError('Temperature must be between 0.0 and 2.0 for GPT models')
            if 'max_tokens' in v and not (1 <= v['max_tokens'] <= 4096):
                raise ValueError('max_tokens must be between 1 and 4096 for GPT models')
        elif 'bert' in
model_name.lower(): # BERT models don't use temperature if 'temperature' in v: raise ValueError('Temperature parameter not applicable for BERT models') return v ``` ### Dynamic Validation ```python from typing import Callable, Any import inspect class DynamicValidationSchema(BaseModel): """Schema with dynamic validation rules.""" operation: str = Field(..., description="Operation to perform") parameters: Dict[str, Any] = Field(..., description="Operation parameters") @validator('parameters') def validate_parameters_for_operation(cls, v, values): """Validate parameters based on operation type.""" operation = values.get('operation') validation_rules = { 'sentiment_analysis': { 'required': ['text'], 'optional': ['model', 'language'], 'types': {'text': str, 'model': str, 'language': str} }, 'image_generation': { 'required': ['prompt'], 'optional': ['width', 'height', 'steps'], 'types': {'prompt': str, 'width': int, 'height': int, 'steps': int}, 'ranges': {'width': (128, 2048), 'height': (128, 2048), 'steps': (1, 100)} }, 'translation': { 'required': ['text', 'target_language'], 'optional': ['source_language'], 'types': {'text': str, 'target_language': str, 'source_language': str} } } if operation not in validation_rules: raise ValueError(f'Unknown operation: {operation}') rules = validation_rules[operation] # Check required parameters for param in rules['required']: if param not in v: raise ValueError(f'Missing required parameter: {param}') # Check parameter types for param, value in v.items(): if param in rules['types']: expected_type = rules['types'][param] if not isinstance(value, expected_type): raise ValueError(f'Parameter {param} must be of type {expected_type.__name__}') # Check ranges if 'ranges' in rules: for param, (min_val, max_val) in rules['ranges'].items(): if param in v: if not (min_val <= v[param] <= max_val): raise ValueError(f'Parameter {param} must be between {min_val} and {max_val}') return v class ConfigurableSchema(BaseModel): """Schema that can 
be configured at runtime."""

    class Config:
        extra = "forbid"  # Don't allow extra fields by default

    @classmethod
    def create_with_extra_fields(cls, extra_fields: Dict[str, Any]):
        """Create schema variant that allows specific extra fields."""

        class DynamicSchema(cls):
            class Config:
                extra = "allow"

            @validator('*', pre=True, allow_reuse=True)
            def validate_extra_fields(cls, v, field):
                if field.name in extra_fields:
                    # Validate against provided rules
                    field_rules = extra_fields[field.name]
                    if 'type' in field_rules and not isinstance(v, field_rules['type']):
                        raise ValueError(f'Field {field.name} must be of type {field_rules["type"].__name__}')
                    if 'range' in field_rules:
                        min_val, max_val = field_rules['range']
                        if not (min_val <= v <= max_val):
                            raise ValueError(f'Field {field.name} must be between {min_val} and {max_val}')
                return v

        return DynamicSchema
```

## Error Handling and User-Friendly Messages

### Custom Error Messages

```python
from typing import Dict, List

from pydantic import BaseModel, Field, ValidationError, validator

class UserFriendlySchema(BaseModel):
    """Schema with user-friendly error messages."""

    # Note: "error_msg" is not a built-in Pydantic argument; unknown Field kwargs
    # are stored as extra field metadata for a custom error formatter to surface.
    email: str = Field(
        ...,
        regex=r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$',
        description="Valid email address",
        error_msg="Please enter a valid email address (e.g., user@example.com)"
    )
    age: int = Field(
        ...,
        ge=13,
        le=120,
        description="Age in years",
        error_msg="Age must be between 13 and 120 years"
    )
    password: str = Field(
        ...,
        min_length=8,
        description="Password (minimum 8 characters)",
        error_msg="Password must be at least 8 characters long"
    )

    @validator('password')
    def validate_password_strength(cls, v):
        """Validate password strength with clear messages."""
        if not any(c.isupper() for c in v):
            raise ValueError('Password must contain at least one uppercase letter')
        if not any(c.islower() for c in v):
            raise ValueError('Password must contain at least one lowercase letter')
        if not any(c.isdigit() for c in v):
            raise ValueError('Password must contain at least one number')
        return v

def
format_validation_errors(e: ValidationError) -> Dict[str, List[str]]: """Format validation errors for user-friendly display.""" error_dict = {} for error in e.errors(): field_path = " -> ".join(str(loc) for loc in error['loc']) error_msg = error['msg'] # Customize error messages if error['type'] == 'value_error.missing': error_msg = "This field is required" elif error['type'] == 'type_error.str': error_msg = "This field must be text" elif error['type'] == 'type_error.integer': error_msg = "This field must be a number" elif error['type'] == 'value_error.number.not_ge': error_msg = f"This field must be at least {error['ctx']['limit_value']}" elif error['type'] == 'value_error.number.not_le': error_msg = f"This field must be at most {error['ctx']['limit_value']}" if field_path not in error_dict: error_dict[field_path] = [] error_dict[field_path].append(error_msg) return error_dict # Usage in endpoint @chute.cord(public_api_path="/register", method="POST") async def register_user(self, input_data: UserFriendlySchema): """Register user with friendly error handling.""" try: # Process registration return {"message": "Registration successful"} except ValidationError as e: formatted_errors = format_validation_errors(e) raise HTTPException(status_code=422, detail=formatted_errors) ``` ### Validation Error Recovery ```python from typing import Union, Optional class FlexibleInputSchema(BaseModel): """Schema that attempts to recover from validation errors.""" text: str = Field(..., description="Input text") confidence_threshold: Union[float, str] = Field(0.5, description="Confidence threshold") max_results: Union[int, str] = Field(10, description="Maximum number of results") @validator('confidence_threshold', pre=True) def parse_confidence_threshold(cls, v): """Attempt to parse confidence threshold from string.""" if isinstance(v, str): try: v = float(v) except ValueError: raise ValueError('Confidence threshold must be a number between 0 and 1') if not isinstance(v, (int, 
float)): raise ValueError('Confidence threshold must be a number') if not (0.0 <= v <= 1.0): raise ValueError('Confidence threshold must be between 0 and 1') return float(v) @validator('max_results', pre=True) def parse_max_results(cls, v): """Attempt to parse max_results from string.""" if isinstance(v, str): try: v = int(v) except ValueError: raise ValueError('Max results must be a positive integer') if not isinstance(v, int): raise ValueError('Max results must be an integer') if v <= 0: raise ValueError('Max results must be positive') if v > 100: v = 100 # Auto-correct to maximum allowed return v class AutoCorrectingSchema(BaseModel): """Schema that auto-corrects common input errors.""" text: str = Field(..., description="Input text") language: str = Field("auto", description="Language code") @validator('text', pre=True) def clean_text(cls, v): """Clean and normalize text input.""" if not isinstance(v, str): v = str(v) # Normalize whitespace v = re.sub(r'\s+', ' ', v.strip()) # Remove common problematic characters v = v.replace('\x00', '') # Remove null bytes v = v.replace('\ufeff', '') # Remove BOM if len(v) == 0: raise ValueError('Text cannot be empty after cleaning') return v @validator('language', pre=True) def normalize_language(cls, v): """Normalize language codes.""" if not isinstance(v, str): v = str(v) v = v.lower().strip() # Common language code mappings language_mappings = { 'english': 'en', 'spanish': 'es', 'french': 'fr', 'german': 'de', 'chinese': 'zh', 'japanese': 'ja', 'korean': 'ko', 'auto-detect': 'auto', 'automatic': 'auto' } if v in language_mappings: v = language_mappings[v] # Validate language code format if v != 'auto' and not re.match(r'^[a-z]{2}(-[A-Z]{2})?$', v): raise ValueError(f'Invalid language code: {v}') return v ``` ## Schema Documentation and Examples ### Comprehensive Documentation ```python class DocumentedAPISchema(BaseModel): """Fully documented API schema with examples.""" prompt: str = Field( ..., min_length=1, 
max_length=1000, description="Text prompt for AI processing", example="Generate a creative story about space exploration" ) model: str = Field( "gpt-3.5-turbo", description="AI model to use for processing", example="gpt-4", regex=r'^(gpt-3\.5-turbo|gpt-4|claude-2)$' ) temperature: float = Field( 0.7, ge=0.0, le=2.0, description="Controls randomness in the output. Higher values make output more random.", example=0.8 ) max_tokens: int = Field( 100, ge=1, le=4096, description="Maximum number of tokens to generate", example=250 ) stop_sequences: Optional[List[str]] = Field( None, max_items=4, description="List of sequences where generation should stop", example=[".", "!", "?"] ) class Config: schema_extra = { "example": { "prompt": "Write a haiku about artificial intelligence", "model": "gpt-3.5-turbo", "temperature": 0.8, "max_tokens": 50, "stop_sequences": ["\n\n"] }, "examples": { "creative_writing": { "summary": "Creative writing example", "value": { "prompt": "Write a short story about a robot discovering emotions", "model": "gpt-4", "temperature": 0.9, "max_tokens": 500 } }, "technical_explanation": { "summary": "Technical explanation example", "value": { "prompt": "Explain how neural networks work", "model": "gpt-3.5-turbo", "temperature": 0.3, "max_tokens": 300 } } } } class ResponseSchema(BaseModel): """Well-documented response schema.""" generated_text: str = Field( ..., description="The generated text output from the AI model", example="Artificial intelligence learns,\nProcessing data endlessly,\nFuture unfolds bright." 
) model_used: str = Field( ..., description="The actual model used for generation", example="gpt-3.5-turbo" ) tokens_used: int = Field( ..., ge=0, description="Number of tokens consumed in generation", example=32 ) processing_time: float = Field( ..., gt=0, description="Time taken to process the request in seconds", example=1.25 ) finish_reason: Literal["completed", "max_tokens", "stop_sequence"] = Field( ..., description="Reason why generation finished", example="completed" ) ``` ### Schema Testing and Validation ```python import pytest from pydantic import ValidationError class SchemaTestSuite: """Test suite for schema validation.""" @staticmethod def test_valid_inputs(): """Test valid input scenarios.""" # Test basic valid input valid_data = { "prompt": "Hello world", "model": "gpt-3.5-turbo", "temperature": 0.7, "max_tokens": 100 } schema = DocumentedAPISchema(**valid_data) assert schema.prompt == "Hello world" assert schema.temperature == 0.7 # Test with optional fields valid_with_optional = { "prompt": "Test prompt", "stop_sequences": [".", "!"] } schema2 = DocumentedAPISchema(**valid_with_optional) assert schema2.model == "gpt-3.5-turbo" # Default value assert schema2.stop_sequences == [".", "!"] @staticmethod def test_invalid_inputs(): """Test invalid input scenarios.""" # Test missing required field with pytest.raises(ValidationError) as exc_info: DocumentedAPISchema(model="gpt-4") errors = exc_info.value.errors() assert any(error['type'] == 'value_error.missing' for error in errors) # Test invalid temperature with pytest.raises(ValidationError) as exc_info: DocumentedAPISchema(prompt="test", temperature=3.0) errors = exc_info.value.errors() assert any('temperature' in str(error['loc']) for error in errors) # Test invalid model with pytest.raises(ValidationError) as exc_info: DocumentedAPISchema(prompt="test", model="invalid-model") errors = exc_info.value.errors() assert any('regex' in error['type'] for error in errors) @staticmethod def 
test_edge_cases():
        """Test edge cases and boundary conditions."""
        # Test minimum values
        min_data = {
            "prompt": "a",  # Minimum length
            "temperature": 0.0,
            "max_tokens": 1
        }
        schema = DocumentedAPISchema(**min_data)
        assert schema.temperature == 0.0

        # Test maximum values
        max_data = {
            "prompt": "x" * 1000,  # Maximum length
            "temperature": 2.0,
            "max_tokens": 4096
        }
        schema = DocumentedAPISchema(**max_data)
        assert len(schema.prompt) == 1000

# Run tests
if __name__ == "__main__":
    test_suite = SchemaTestSuite()
    test_suite.test_valid_inputs()
    test_suite.test_invalid_inputs()
    test_suite.test_edge_cases()
    print("All schema tests passed!")
```

## Performance and Best Practices

### Schema Performance Optimization

```python
from typing import ClassVar, Dict

from pydantic import BaseModel, Field, validator

class OptimizedSchema(BaseModel):
    """Performance-optimized schema."""

    # Use ClassVar for constants to avoid creating fields
    MAX_TEXT_LENGTH: ClassVar[int] = 5000
    ALLOWED_MODELS: ClassVar[set] = {"gpt-3.5-turbo", "gpt-4", "claude-2"}

    text: str = Field(..., max_length=MAX_TEXT_LENGTH)
    model: str = Field("gpt-3.5-turbo")

    # allow_reuse is a validator option, not a Config option
    @validator('model', allow_reuse=True)
    def validate_model(cls, v):
        """Fast model validation using set lookup."""
        if v not in cls.ALLOWED_MODELS:
            raise ValueError(f'Model must be one of: {", ".join(cls.ALLOWED_MODELS)}')
        return v

    class Config:
        # Performance optimizations (Pydantic v1 option names)
        validate_assignment = False     # Don't validate on assignment
        anystr_strip_whitespace = True  # Auto-strip strings
        anystr_lower = False            # Don't auto-lowercase

class CachedValidationSchema(BaseModel):
    """Schema with cached validation results."""

    _validation_cache: ClassVar[Dict[str, bool]] = {}

    data: str = Field(...)
@validator('data')
    def validate_with_cache(cls, v):
        """Use caching for expensive validation."""
        # Check cache first
        if v in cls._validation_cache:
            if not cls._validation_cache[v]:
                raise ValueError('Cached validation failed')
            return v

        # Perform expensive validation
        is_valid = cls._expensive_validation(v)

        # Cache result
        cls._validation_cache[v] = is_valid
        if not is_valid:
            raise ValueError('Validation failed')
        return v

    @staticmethod
    def _expensive_validation(data: str) -> bool:
        """Simulate expensive validation."""
        # This would be your actual expensive validation logic
        return len(data) > 0 and not any(char in data for char in ['<', '>', '&'])
```

### Schema Composition and Reuse

```python
from datetime import datetime
from typing import Any, Dict, Generic, List, Optional, TypeVar

from pydantic import BaseModel, Field
from pydantic.generics import GenericModel

T = TypeVar('T')  # Define the type variable before using it in a generic model

# Base schemas for reuse
class TimestampMixin(BaseModel):
    """Mixin for timestamp fields."""
    created_at: datetime = Field(default_factory=datetime.now)
    updated_at: Optional[datetime] = None

class PaginationMixin(BaseModel):
    """Mixin for pagination parameters."""
    page: int = Field(1, ge=1, description="Page number")
    page_size: int = Field(20, ge=1, le=100, description="Items per page")

class MetadataMixin(BaseModel):
    """Mixin for metadata fields."""
    metadata: Dict[str, Any] = Field(default_factory=dict)
    tags: List[str] = Field(default_factory=list, max_items=10)

# Composed schemas
class UserInput(MetadataMixin):
    """User input with metadata support."""
    username: str = Field(..., min_length=3, max_length=50)
    email: str = Field(..., regex=r'^[^@]+@[^@]+\.[^@]+$')

class PaginatedResponse(GenericModel, TimestampMixin, PaginationMixin, Generic[T]):
    """Generic paginated response."""
    items: List[T] = Field(..., description="Response items")
    total: int = Field(..., ge=0, description="Total number of items")
    has_next: bool = Field(..., description="Whether there are more pages")

# Usage
class ProcessingResult(BaseModel):
    result: str
    confidence: float

# Create specific paginated response
PaginatedProcessingResponse =
PaginatedResponse[ProcessingResult] ``` ## Next Steps - **API Documentation**: Generate comprehensive API docs from schemas - **Client Generation**: Auto-generate typed clients from schemas - **Database Integration**: Connect schemas with ORMs and databases - **Testing Strategies**: Implement comprehensive schema testing For more advanced topics, see: - [Error Handling Guide](error-handling) - [Custom Chutes Guide](custom-chutes) --- ## SOURCE: https://chutes.ai/docs/guides/security # Security Guide This comprehensive guide covers security best practices for Chutes applications. For a deep dive into the Chutes platform's underlying security architecture, including Trusted Execution Environments (TEEs) and hardware attestation, please see the [Security Architecture](/docs/core-concepts/security-architecture) documentation. ## Overview Security in Chutes involves multiple layers: - **Authentication & Authorization**: Secure API access and user management - **Data Protection**: Encrypting sensitive data and communications - **Container Security**: Securing Docker images and runtime environments - **Network Security**: Protecting network communications - **Monitoring & Incident Response**: Detecting and responding to security threats ## Authentication & Authorization ### API Key Management Secure API key handling: ```python import os import hashlib import hmac import time from typing import Optional class APIKeyManager: def __init__(self): self.secret_key = os.environ.get("API_SECRET_KEY") if not self.secret_key: raise ValueError("API_SECRET_KEY environment variable is required") def generate_api_key(self, user_id: str) -> str: """Generate secure API key for user""" timestamp = str(int(time.time())) payload = f"{user_id}:{timestamp}" signature = hmac.new( self.secret_key.encode(), payload.encode(), hashlib.sha256 ).hexdigest() return f"{payload}:{signature}" def validate_api_key(self, api_key: str) -> Optional[str]: """Validate API key and return user_id if valid""" 
try: parts = api_key.split(":") if len(parts) != 3: return None user_id, timestamp, signature = parts payload = f"{user_id}:{timestamp}" # Verify signature expected_signature = hmac.new( self.secret_key.encode(), payload.encode(), hashlib.sha256 ).hexdigest() if not hmac.compare_digest(signature, expected_signature): return None # Check if key is expired (24 hours) key_age = time.time() - int(timestamp) if key_age > 86400: # 24 hours return None return user_id except Exception: return None # Use in chute api_manager = APIKeyManager() async def authenticate_request(headers: dict) -> Optional[str]: """Authenticate incoming request""" auth_header = headers.get("Authorization", "") if not auth_header.startswith("Bearer "): return None api_key = auth_header[7:] # Remove "Bearer " prefix return api_manager.validate_api_key(api_key) async def run_secure(inputs: dict) -> dict: """Secure endpoint with authentication""" headers = inputs.get("headers", {}) user_id = await authenticate_request(headers) if not user_id: return {"error": "Unauthorized", "status": 401} # Process authenticated request result = await process_for_user(user_id, inputs) return {"result": result, "user_id": user_id} ``` ### Role-Based Access Control Implement authorization: ```python from enum import Enum from typing import List, Set import json class Permission(Enum): READ = "read" WRITE = "write" DELETE = "delete" ADMIN = "admin" class Role: def __init__(self, name: str, permissions: Set[Permission]): self.name = name self.permissions = permissions class RBACManager: def __init__(self): # Define roles self.roles = { "user": Role("user", {Permission.READ}), "editor": Role("editor", {Permission.READ, Permission.WRITE}), "admin": Role("admin", {Permission.READ, Permission.WRITE, Permission.DELETE, Permission.ADMIN}) } # User role assignments (in production, store in database) self.user_roles = {} def assign_role(self, user_id: str, role_name: str): """Assign role to user""" if role_name not in 
self.roles: raise ValueError(f"Role {role_name} does not exist") self.user_roles[user_id] = role_name def check_permission(self, user_id: str, required_permission: Permission) -> bool: """Check if user has required permission""" role_name = self.user_roles.get(user_id) if not role_name: return False role = self.roles.get(role_name) if not role: return False return required_permission in role.permissions def require_permission(self, permission: Permission): """Decorator to require specific permission""" def decorator(func): async def wrapper(*args, **kwargs): # Extract user_id from inputs inputs = args[0] if args else kwargs.get("inputs", {}) user_id = inputs.get("user_id") if not user_id or not self.check_permission(user_id, permission): return {"error": "Forbidden", "status": 403} return await func(*args, **kwargs) return wrapper return decorator # Global RBAC manager rbac = RBACManager() @rbac.require_permission(Permission.WRITE) async def create_resource(inputs: dict) -> dict: """Endpoint that requires write permission""" # Create resource logic return {"message": "Resource created successfully"} @rbac.require_permission(Permission.ADMIN) async def admin_operation(inputs: dict) -> dict: """Admin-only endpoint""" # Admin operation logic return {"message": "Admin operation completed"} ``` ## Data Protection ### Input Validation & Sanitization Prevent injection attacks: ```python import re import html from typing import Any, Dict from pydantic import BaseModel, validator, Field class SecureInput(BaseModel): text: str = Field(..., max_length=10000) email: str = Field(..., regex=r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$') filename: str = Field(..., regex=r'^[a-zA-Z0-9._-]+$') @validator('text') def sanitize_text(cls, v): """Sanitize text input""" # Remove potentially dangerous characters sanitized = re.sub(r'[<>"\']', '', v) # HTML escape sanitized = html.escape(sanitized) return sanitized @validator('filename') def validate_filename(cls, v): """Validate 
filename for path traversal"""
        # Prevent path traversal
        if '..' in v or '/' in v or '\\' in v:
            raise ValueError("Invalid filename")
        return v

class InputSanitizer:
    @staticmethod
    def sanitize_sql_input(value: str) -> str:
        """Strip common SQL injection patterns.

        Note: keyword blacklists are defense-in-depth only and are easy to
        bypass -- always use parameterized queries as the primary defense.
        """
        dangerous_patterns = [
            r'(\bUNION\b)|(\bSELECT\b)|(\bINSERT\b)|(\bUPDATE\b)|(\bDELETE\b)',
            r'(\bDROP\b)|(\bCREATE\b)|(\bALTER\b)|(\bEXEC\b)',
            r'[;\'"`]'
        ]
        sanitized = value
        for pattern in dangerous_patterns:
            sanitized = re.sub(pattern, '', sanitized, flags=re.IGNORECASE)
        return sanitized.strip()

    @staticmethod
    def sanitize_file_path(path: str) -> str:
        """Sanitize file path to prevent directory traversal"""
        # Remove dangerous path components
        sanitized = re.sub(r'\.\.+', '', path)
        sanitized = re.sub(r'[/\\]+', '_', sanitized)
        return sanitized

async def run_with_validation(inputs: dict) -> dict:
    """Validate and sanitize all inputs"""
    try:
        # Validate using Pydantic model
        validated_input = SecureInput(**inputs)

        # Additional sanitization
        sanitizer = InputSanitizer()
        if "file_path" in inputs:
            inputs["file_path"] = sanitizer.sanitize_file_path(inputs["file_path"])

        # Process with sanitized inputs
        result = await process_secure_inputs(validated_input.dict())
        return {"result": result}
    except Exception as e:
        return {"error": f"Invalid input: {str(e)}", "status": 400}
```

### Data Encryption

Encrypt sensitive data:

```python
import base64
import os

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

class DataEncryption:
    def __init__(self, password: str = None):
        self.password = password or os.environ.get("ENCRYPTION_PASSWORD")
        if not self.password:
            raise ValueError("Encryption password is required")

        # Generate key from password
        self.key = self._generate_key(self.password)
        self.cipher = Fernet(self.key)

    def _generate_key(self, password: str) -> bytes:
        """Generate encryption
key from password""" # Use a fixed salt for consistent keys (in production, use random salt per data) salt = b'chutes_security_salt' kdf = PBKDF2HMAC( algorithm=hashes.SHA256(), length=32, salt=salt, iterations=100000) key = base64.urlsafe_b64encode(kdf.derive(password.encode())) return key def encrypt(self, data: str) -> str: """Encrypt string data""" encrypted_data = self.cipher.encrypt(data.encode()) return base64.urlsafe_b64encode(encrypted_data).decode() def decrypt(self, encrypted_data: str) -> str: """Decrypt string data""" encrypted_bytes = base64.urlsafe_b64decode(encrypted_data.encode()) decrypted_data = self.cipher.decrypt(encrypted_bytes) return decrypted_data.decode() def encrypt_dict(self, data: dict, sensitive_fields: list) -> dict: """Encrypt sensitive fields in dictionary""" encrypted_data = data.copy() for field in sensitive_fields: if field in encrypted_data: encrypted_data[field] = self.encrypt(str(encrypted_data[field])) return encrypted_data def decrypt_dict(self, data: dict, sensitive_fields: list) -> dict: """Decrypt sensitive fields in dictionary""" decrypted_data = data.copy() for field in sensitive_fields: if field in decrypted_data: decrypted_data[field] = self.decrypt(decrypted_data[field]) return decrypted_data # Global encryption instance encryption = DataEncryption() async def run_with_encryption(inputs: dict) -> dict: """Handle sensitive data with encryption""" sensitive_fields = ["personal_info", "api_keys", "passwords"] # Encrypt sensitive inputs encrypted_inputs = encryption.encrypt_dict(inputs, sensitive_fields) # Process with encrypted data result = await process_encrypted_data(encrypted_inputs) # Decrypt result if needed if "sensitive_result" in result: result["sensitive_result"] = encryption.decrypt(result["sensitive_result"]) return result ``` ## Container Security ### Secure Docker Images Build secure container images: ```python from chutes.image import Image # Security-hardened image secure_image = ( Image( 
username="myuser", name="secure-app", tag="hardened", base_image="python:3.11-slim", # Use minimal base image python_version="3.11" ) # Create non-root user .run_command(""" groupadd -r appuser && \\ useradd -r -g appuser -d /app -s /sbin/nologin appuser && \\ mkdir -p /app && \\ chown -R appuser:appuser /app """) # Install security updates .run_command(""" apt-get update && \\ apt-get upgrade -y && \\ apt-get install -y --no-install-recommends \\ ca-certificates && \\ apt-get clean && \\ rm -rf /var/lib/apt/lists/* """) # Install Python dependencies with security focus .pip_install([ "cryptography==41.0.7", # Pin specific versions "pydantic==2.4.2", "bcrypt==4.0.1" ]) # Copy application code with proper ownership .copy_files("./app", "/app", owner="appuser:appuser") # Set secure permissions .run_command("chmod -R 755 /app") # Security configurations .set_environment_variable("PYTHONUNBUFFERED", "1") .set_environment_variable("PYTHONDONTWRITEBYTECODE", "1") .set_environment_variable("PYTHONHASHSEED", "random") # Switch to non-root user .set_user("appuser") .set_working_directory("/app") ) ``` ### Runtime Security Implement runtime security measures: ```python import os import sys import signal import logging from contextlib import contextmanager class SecurityManager: def __init__(self): self.setup_logging() self.setup_signal_handlers() self.validate_environment() def setup_logging(self): """Configure secure logging""" logging.basicConfig( level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', handlers=[ logging.StreamHandler(sys.stdout), logging.FileHandler('/app/logs/security.log', mode='a') ] ) self.logger = logging.getLogger('security') def setup_signal_handlers(self): """Setup graceful shutdown handlers""" def signal_handler(signum, frame): self.logger.info(f"Received signal {signum}, shutting down gracefully") self.cleanup() sys.exit(0) signal.signal(signal.SIGTERM, signal_handler) signal.signal(signal.SIGINT, signal_handler) def 
validate_environment(self): """Validate security environment variables""" required_vars = ["API_SECRET_KEY", "ENCRYPTION_PASSWORD"] for var in required_vars: if not os.environ.get(var): self.logger.error(f"Required environment variable {var} is missing") raise ValueError(f"Missing required environment variable: {var}") def log_security_event(self, event_type: str, details: dict): """Log security events""" self.logger.warning(f"SECURITY EVENT: {event_type} - {details}") @contextmanager def secure_execution(self): """Context manager for secure code execution""" try: self.logger.info("Starting secure execution") yield except Exception as e: self.log_security_event("EXECUTION_ERROR", {"error": str(e)}) raise finally: self.logger.info("Secure execution completed") def cleanup(self): """Cleanup resources on shutdown""" self.logger.info("Performing security cleanup") # Clear sensitive data from memory # Close database connections # Cleanup temporary files # Global security manager security_manager = SecurityManager() async def run_secure_execution(inputs: dict) -> dict: """Execute with security monitoring""" with security_manager.secure_execution(): # Log request security_manager.logger.info(f"Processing request: {inputs.get('request_id', 'unknown')}") # Process request result = await process_secure_request(inputs) return result ``` ## Network Security ### TLS/SSL Configuration Secure network communications: ```python import ssl import aiohttp from typing import Optional class SecureHTTPClient: def __init__(self): # Create secure SSL context self.ssl_context = ssl.create_default_context() self.ssl_context.check_hostname = True self.ssl_context.verify_mode = ssl.CERT_REQUIRED # Additional security settings self.ssl_context.minimum_version = ssl.TLSVersion.TLSv1_2 self.ssl_context.set_ciphers('ECDHE+AESGCM:ECDHE+CHACHA20:DHE+AESGCM:DHE+CHACHA20:!aNULL:!MD5:!DSS') async def make_secure_request(self, url: str, data: dict = None, headers: dict = None) -> dict: """Make secure 
HTTPS request""" default_headers = { 'User-Agent': 'Chutes-Secure-Client/1.0', 'Accept': 'application/json', 'Content-Type': 'application/json' } if headers: default_headers.update(headers) timeout = aiohttp.ClientTimeout(total=30) async with aiohttp.ClientSession( timeout=timeout, connector=aiohttp.TCPConnector(ssl=self.ssl_context) ) as session: async with session.post(url, json=data, headers=default_headers) as response: if response.status != 200: raise Exception(f"Request failed: {response.status}") return await response.json() # Certificate pinning for critical services class CertificatePinnedClient: def __init__(self, pinned_cert_fingerprint: str): self.pinned_fingerprint = pinned_cert_fingerprint def verify_certificate(self, cert_der: bytes) -> bool: """Verify certificate against pinned fingerprint""" import hashlib cert_fingerprint = hashlib.sha256(cert_der).hexdigest() return cert_fingerprint == self.pinned_fingerprint async def make_pinned_request(self, url: str, data: dict) -> dict: """Make request with certificate pinning""" # Implementation would verify certificate fingerprint # This is a simplified example client = SecureHTTPClient() return await client.make_secure_request(url, data) ``` ### Rate Limiting Implement rate limiting: ```python import time import asyncio from collections import defaultdict, deque from typing import Dict, Optional class RateLimiter: def __init__(self, requests_per_minute: int = 60, requests_per_hour: int = 1000): self.rpm_limit = requests_per_minute self.rph_limit = requests_per_hour # Track requests per client self.minute_requests: Dict[str, deque] = defaultdict(deque) self.hour_requests: Dict[str, deque] = defaultdict(deque) def is_allowed(self, client_id: str) -> bool: """Check if request is allowed""" current_time = time.time() # Clean old requests self._cleanup_old_requests(client_id, current_time) # Check limits minute_count = len(self.minute_requests[client_id]) hour_count = len(self.hour_requests[client_id]) if 
minute_count >= self.rpm_limit or hour_count >= self.rph_limit: return False # Record request self.minute_requests[client_id].append(current_time) self.hour_requests[client_id].append(current_time) return True def _cleanup_old_requests(self, client_id: str, current_time: float): """Remove old requests from tracking""" minute_cutoff = current_time - 60 # 1 minute ago hour_cutoff = current_time - 3600 # 1 hour ago # Clean minute requests while (self.minute_requests[client_id] and self.minute_requests[client_id][0] < minute_cutoff): self.minute_requests[client_id].popleft() # Clean hour requests while (self.hour_requests[client_id] and self.hour_requests[client_id][0] < hour_cutoff): self.hour_requests[client_id].popleft() def get_reset_time(self, client_id: str) -> Dict[str, int]: """Get time until the rate limit resets, based on the oldest tracked request""" current_time = time.time() minute_reqs = self.minute_requests[client_id] hour_reqs = self.hour_requests[client_id] next_minute_reset = (minute_reqs[0] + 60 - current_time) if minute_reqs else 0 next_hour_reset = (hour_reqs[0] + 3600 - current_time) if hour_reqs else 0 return { "minute_reset": max(0, int(next_minute_reset)), "hour_reset": max(0, int(next_hour_reset)) } # Global rate limiter rate_limiter = RateLimiter(requests_per_minute=100, requests_per_hour=5000) async def run_with_rate_limiting(inputs: dict) -> dict: """Apply rate limiting to requests""" client_id = inputs.get("client_id") or inputs.get("user_id", "unknown") if not rate_limiter.is_allowed(client_id): reset_times = rate_limiter.get_reset_time(client_id) return { "error": "Rate limit exceeded", "status": 429, "reset_time": reset_times } # Process request result = await process_rate_limited_request(inputs) return result ``` ## Monitoring & Incident Response ### Security Monitoring Monitor for security threats: ```python import logging import time from collections import defaultdict from typing import Dict, List import json class SecurityMonitor: def __init__(self): self.logger = logging.getLogger('security_monitor') # Track suspicious activities self.failed_attempts: Dict[str, List[float]] = defaultdict(list) self.suspicious_patterns: Dict[str, int]
= defaultdict(int) # Threat detection thresholds self.max_failed_attempts = 5 self.time_window = 300 # 5 minutes self.alert_threshold = 10 def log_failed_authentication(self, client_id: str, details: dict): """Log failed authentication attempt""" current_time = time.time() self.failed_attempts[client_id].append(current_time) # Clean old attempts cutoff_time = current_time - self.time_window self.failed_attempts[client_id] = [ t for t in self.failed_attempts[client_id] if t > cutoff_time ] # Check for brute force attack if len(self.failed_attempts[client_id]) >= self.max_failed_attempts: self.alert_brute_force_attack(client_id, details) def alert_brute_force_attack(self, client_id: str, details: dict): """Alert on potential brute force attack""" alert = { "alert_type": "BRUTE_FORCE_ATTACK", "client_id": client_id, "attempt_count": len(self.failed_attempts[client_id]), "time_window": self.time_window, "details": details, "timestamp": time.time() } self.logger.critical(f"SECURITY ALERT: {json.dumps(alert)}") # In production, send to SIEM or alerting system self.send_security_alert(alert) def detect_suspicious_patterns(self, request_data: dict) -> bool: """Detect suspicious request patterns""" import re suspicious_indicators = [ # SQL injection patterns r'(\bUNION\b.*\bSELECT\b)|(\bSELECT\b.*\bFROM\b)', # XSS patterns r'<script[^>]*>', r'javascript:', # Path traversal patterns r'\.\./' ] for pattern in suspicious_indicators: for value in request_data.values(): if isinstance(value, str) and re.search(pattern, value, re.IGNORECASE): return True return False def log_suspicious_activity(self, activity_type: str, details: dict): """Track suspicious activity and alert on repeated occurrences""" self.suspicious_patterns[activity_type] += 1 if self.suspicious_patterns[activity_type] >= self.alert_threshold: self.alert_suspicious_pattern(activity_type, details) def alert_suspicious_pattern(self, pattern_type: str, details: dict): """Alert on suspicious activity pattern""" alert = { "alert_type": "SUSPICIOUS_PATTERN", "pattern_type": pattern_type, "occurrence_count": self.suspicious_patterns[pattern_type], "details": details, "timestamp": time.time() } self.logger.critical(f"SECURITY ALERT: {json.dumps(alert)}") self.send_security_alert(alert) def send_security_alert(self, alert: dict): """Send security alert to monitoring system""" # In production, integrate with: # - SIEM systems (Splunk, ELK Stack) # - Alerting platforms (PagerDuty, Slack) # 
- Security orchestration tools pass # Global security monitor security_monitor = SecurityMonitor() async def run_with_security_monitoring(inputs: dict) -> dict: """Monitor requests for security threats""" client_id = inputs.get("client_id", "unknown") # Check for suspicious patterns if security_monitor.detect_suspicious_patterns(inputs): return {"error": "Suspicious request blocked", "status": 403} try: # Process request result = await process_monitored_request(inputs) return result except Exception as e: # Log potential security incident security_monitor.log_suspicious_activity("REQUEST_ERROR", { "error": str(e), "client_id": client_id, "inputs": inputs }) raise ``` ### Incident Response Automated incident response: ```python import asyncio import time from enum import Enum from typing import Dict, List, Callable class IncidentSeverity(Enum): LOW = 1 MEDIUM = 2 HIGH = 3 CRITICAL = 4 class IncidentResponse: def __init__(self): self.response_handlers: Dict[str, Callable] = {} self.blocked_clients: set = set() self.incident_log: List[dict] = [] def register_handler(self, incident_type: str, handler: Callable): """Register incident response handler""" self.response_handlers[incident_type] = handler async def handle_incident(self, incident_type: str, severity: IncidentSeverity, details: dict): """Handle security incident""" incident = { "type": incident_type, "severity": severity.name, "details": details, "timestamp": time.time(), "status": "ACTIVE" } self.incident_log.append(incident) # Execute response handler if incident_type in self.response_handlers: await self.response_handlers[incident_type](incident) # Default responses based on severity if severity == IncidentSeverity.CRITICAL: await self.emergency_response(incident) elif severity == IncidentSeverity.HIGH: await self.high_priority_response(incident) async def emergency_response(self, incident: dict): """Emergency response for critical incidents""" client_id = incident["details"].get("client_id") # Immediately block client
if client_id: self.blocked_clients.add(client_id) # Notify security team await self.notify_security_team(incident) # Scale down if under attack await self.initiate_defensive_scaling() async def high_priority_response(self, incident: dict): """High priority incident response""" client_id = incident["details"].get("client_id") # Temporarily throttle client if client_id: await self.throttle_client(client_id) # Alert monitoring systems await self.send_alert(incident) async def notify_security_team(self, incident: dict): """Notify security team of critical incident""" # Integration with alerting systems pass async def initiate_defensive_scaling(self): """Scale resources defensively during attack""" # Implement defensive scaling logic pass async def throttle_client(self, client_id: str): """Apply temporary throttling to client""" # Implement client throttling pass def is_client_blocked(self, client_id: str) -> bool: """Check if client is blocked""" return client_id in self.blocked_clients # Global incident response incident_response = IncidentResponse() # Register handlers async def brute_force_handler(incident: dict): """Handle brute force attack""" client_id = incident["details"].get("client_id") if client_id: incident_response.blocked_clients.add(client_id) incident_response.register_handler("BRUTE_FORCE_ATTACK", brute_force_handler) async def run_with_incident_response(inputs: dict) -> dict: """Process requests with incident response""" client_id = inputs.get("client_id", "unknown") # Check if client is blocked if incident_response.is_client_blocked(client_id): return {"error": "Client blocked due to security incident", "status": 403} # Process request result = await process_secure_request(inputs) return result ``` ## Security Checklist ### Pre-deployment Security - [ ] Enable authentication and authorization - [ ] Implement input validation and sanitization - [ ] Use encryption for sensitive data - [ ] Build secure Docker images - [ ] Configure TLS/SSL properly - [ 
] Set up rate limiting - [ ] Implement security monitoring - [ ] Test for common vulnerabilities ### Runtime Security - [ ] Monitor for security events - [ ] Implement incident response procedures - [ ] Keep dependencies updated - [ ] Perform regular security audits - [ ] Maintain backup and recovery procedures - [ ] Enable access logging and monitoring ### Compliance Considerations - [ ] GDPR compliance for EU users - [ ] HIPAA compliance for healthcare data - [ ] SOC 2 compliance for enterprise customers - [ ] Industry-specific security requirements ## Next Steps - **[Best Practices](best-practices)** - General security best practices - **[Compliance Guide](../compliance)** - Meet regulatory requirements - **[Monitoring](../monitoring)** - Advanced security monitoring - **[Incident Response Playbook](../incident-response)** - Detailed response procedures For enterprise security requirements, see the [Enterprise Security Guide](../enterprise/security). --- ## SOURCE: https://chutes.ai/docs/guides/streaming # Real-time Streaming Responses This guide covers how to implement real-time streaming responses in Chutes, enabling live data transmission, progressive content delivery, and interactive AI applications.
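The server examples in this guide emit Server-Sent Events: frames of the form `data: {...}` terminated by a blank line, optionally preceded by an `event:` line. As a rough sketch of the client side (the `parse_sse_block` helper and the use of `aiohttp` here are illustrative conventions, not part of the Chutes SDK), a stream such as the `/stream_text` endpoint shown later can be consumed like this:

```python
import json

def parse_sse_block(block: str) -> dict:
    """Parse one SSE block (e.g. 'event: token\\ndata: {...}') into a dict."""
    event = {"event": "message", "data": None}
    for line in block.strip().splitlines():
        if line.startswith("event:"):
            event["event"] = line[len("event:"):].strip()
        elif line.startswith("data:"):
            # The examples in this guide always send JSON payloads
            event["data"] = json.loads(line[len("data:"):].strip())
    return event

async def consume_stream(url: str, payload: dict):
    """Hypothetical client: POST to a streaming endpoint and yield parsed events."""
    import aiohttp  # third-party dependency, assumed installed
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=payload) as resp:
            buffer = ""
            async for chunk in resp.content.iter_any():
                buffer += chunk.decode()
                # SSE blocks are separated by a blank line
                while "\n\n" in buffer:
                    block, buffer = buffer.split("\n\n", 1)
                    if block.strip():
                        yield parse_sse_block(block)
```

With a deployed chute, `async for event in consume_stream("https://<your-chute>/stream_text", {"prompt": "hi"})` would then yield one parsed dict per token frame.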
## Overview Streaming in Chutes provides: - **Real-time Response**: Send data as it's generated - **Better UX**: Users see progress instead of waiting - **Memory Efficiency**: Process large outputs without memory buildup - **Interactive Applications**: Enable chat-like experiences - **Scalability**: Handle long-running tasks efficiently - **WebSocket Support**: Full duplex communication ## Basic Streaming Concepts ### HTTP Streaming vs WebSockets ```python from chutes.chute import Chute from fastapi import Response, WebSocket from fastapi.responses import StreamingResponse import asyncio import json chute = Chute(username="myuser", name="streaming-demo") # HTTP Streaming - Server-sent events @chute.cord( public_api_path="/stream_text", method="POST", stream=True # Enable streaming ) async def stream_text_generation(self, prompt: str): """Stream text generation token by token.""" async def generate_tokens(): """Generate tokens progressively.""" # Simulate token generation tokens = ["Hello", " world", "!", " This", " is", " streaming", " text", "."] for token in tokens: # Yield each token as it's generated yield f"data: {json.dumps({'token': token})}\n\n" await asyncio.sleep(0.1) # Simulate processing time # Send completion signal yield f"data: {json.dumps({'done': True})}\n\n" return StreamingResponse( generate_tokens(), media_type="text/event-stream", headers={ "Cache-Control": "no-cache", "Connection": "keep-alive", "X-Accel-Buffering": "no" # Disable nginx buffering } ) # WebSocket - Full duplex communication @chute.websocket("/ws") async def websocket_endpoint(self, websocket: WebSocket): """WebSocket endpoint for interactive communication.""" await websocket.accept() try: while True: # Receive message from client data = await websocket.receive_text() # Process message response = await self.process_message(data) # Send response back await websocket.send_text(response) except Exception as e: print(f"WebSocket error: {e}") finally: await websocket.close() async def
process_message(self, message: str) -> str: """Process incoming message.""" return f"Echo: {message}" ``` ## AI Model Streaming ### Streaming LLM Text Generation ```python from typing import AsyncGenerator, Dict, Any import time import torch @chute.on_startup() async def initialize_streaming_llm(self): """Initialize streaming-capable LLM.""" from transformers import AutoTokenizer, AutoModelForCausalLM model_name = "microsoft/DialoGPT-medium" self.tokenizer = AutoTokenizer.from_pretrained(model_name) self.model = AutoModelForCausalLM.from_pretrained(model_name) if torch.cuda.is_available(): self.model = self.model.to("cuda") # Add padding token if not present if self.tokenizer.pad_token is None: self.tokenizer.pad_token = self.tokenizer.eos_token async def stream_llm_generation( self, prompt: str, max_tokens: int = 100, temperature: float = 0.7 ) -> AsyncGenerator[Dict[str, Any], None]: """Stream LLM generation token by token.""" # Tokenize input inputs = self.tokenizer.encode(prompt, return_tensors="pt") if torch.cuda.is_available(): inputs = inputs.to("cuda") # Generation parameters attention_mask = torch.ones_like(inputs) generated_tokens = 0 with torch.no_grad(): while generated_tokens < max_tokens: # Generate next token outputs = self.model(inputs, attention_mask=attention_mask) logits = outputs.logits[0, -1, :] # Apply temperature if temperature > 0: logits = logits / temperature probs = torch.softmax(logits, dim=-1) next_token = torch.multinomial(probs, 1) else: next_token = torch.argmax(logits, dim=-1, keepdim=True) # Decode token token_text = self.tokenizer.decode(next_token, skip_special_tokens=True) # Yield token data yield { "token": token_text, "token_id": next_token.item(), "generated_tokens": generated_tokens + 1, "is_complete": False } # Update inputs for next iteration inputs = torch.cat([inputs, next_token.unsqueeze(0)], dim=-1) attention_mask = torch.cat([attention_mask, torch.ones((1, 1), device=attention_mask.device)], dim=-1) generated_tokens += 1 # 
Check for end token if next_token.item() == self.tokenizer.eos_token_id: break # Small delay to prevent overwhelming the client await asyncio.sleep(0.01) # Send completion yield { "token": "", "token_id": None, "generated_tokens": generated_tokens, "is_complete": True } @chute.cord( public_api_path="/generate_stream", method="POST", stream=True ) async def generate_streaming_text(self, prompt: str, max_tokens: int = 100): """Generate streaming text response.""" async def stream_response(): # Send SSE headers yield "event: start\n" yield f"data: {json.dumps({'message': 'Starting generation'})}\n\n" async for token_data in self.stream_llm_generation(prompt, max_tokens): if token_data["is_complete"]: yield "event: complete\n" yield f"data: {json.dumps(token_data)}\n\n" else: yield "event: token\n" yield f"data: {json.dumps(token_data)}\n\n" return StreamingResponse( stream_response(), media_type="text/event-stream", headers={ "Cache-Control": "no-cache", "Connection": "keep-alive", "Access-Control-Allow-Origin": "*", "Access-Control-Allow-Headers": "Cache-Control" } ) ``` ### Streaming Image Generation ```python from PIL import Image import io import base64 class StreamingImageGenerator: """Stream image generation progress.""" def __init__(self, diffusion_model): self.model = diffusion_model async def stream_image_generation( self, prompt: str, steps: int = 20 ) -> AsyncGenerator[Dict[str, Any], None]: """Stream image generation progress.""" # Initialize generation yield { "step": 0, "total_steps": steps, "status": "initializing", "image": None } # Simulate diffusion steps for step in range(1, steps + 1): # Process one diffusion step await asyncio.sleep(0.1) # Simulate processing # Every few steps, send intermediate image if step % 5 == 0 or step == steps: # Generate intermediate or final image if step == steps: image = await self._generate_final_image(prompt) status = "complete" else: image = await self._generate_intermediate_image(prompt, step, steps) status = 
"processing" # Convert image to base64 img_buffer = io.BytesIO() image.save(img_buffer, format='JPEG', quality=85) img_b64 = base64.b64encode(img_buffer.getvalue()).decode() yield { "step": step, "total_steps": steps, "status": status, "image": img_b64, "progress": step / steps } else: # Send progress update without image yield { "step": step, "total_steps": steps, "status": "processing", "image": None, "progress": step / steps } async def _generate_intermediate_image(self, prompt: str, step: int, total_steps: int): """Generate intermediate image (placeholder for actual implementation).""" # This would use your actual diffusion model's intermediate output # For demo, create a simple placeholder img = Image.new('RGB', (512, 512), color=f'#{step*10:02x}{step*5:02x}{step*15:02x}') return img async def _generate_final_image(self, prompt: str): """Generate final high-quality image.""" # This would use your actual diffusion model img = Image.new('RGB', (512, 512), color='blue') return img @chute.cord( public_api_path="/generate_image_stream", method="POST", stream=True ) async def generate_streaming_image(self, prompt: str, steps: int = 20): """Stream image generation with progress updates.""" generator = StreamingImageGenerator(self.diffusion_model) async def stream_response(): async for update in generator.stream_image_generation(prompt, steps): yield f"data: {json.dumps(update)}\n\n" return StreamingResponse( stream_response(), media_type="text/event-stream" ) ``` ## Advanced Streaming Patterns ### Chunked Data Processing ```python from typing import AsyncIterator import hashlib class ChunkedProcessor: """Process large datasets in chunks with streaming updates.""" async def process_large_dataset( self, data: List[str], chunk_size: int = 10 ) -> AsyncIterator[Dict[str, Any]]: """Process data in chunks and stream results.""" total_items = len(data) processed_items = 0 results = [] # Process in chunks for i in range(0, total_items, chunk_size): chunk = data[i:i + 
chunk_size] # Process chunk chunk_results = await self._process_chunk(chunk) results.extend(chunk_results) processed_items += len(chunk) # Yield progress update yield { "type": "progress", "processed": processed_items, "total": total_items, "progress": processed_items / total_items, "chunk_results": chunk_results } # Allow other coroutines to run await asyncio.sleep(0) # Send final results yield { "type": "complete", "processed": processed_items, "total": total_items, "progress": 1.0, "all_results": results, "summary": self._generate_summary(results) } async def _process_chunk(self, chunk: List[str]) -> List[Dict[str, Any]]: """Process a single chunk of data.""" results = [] for item in chunk: # Simulate processing await asyncio.sleep(0.01) result = { "original": item, "processed": item.upper(), "length": len(item), "hash": hashlib.md5(item.encode()).hexdigest()[:8] } results.append(result) return results def _generate_summary(self, results: List[Dict[str, Any]]) -> Dict[str, Any]: """Generate summary statistics.""" total_length = sum(r["length"] for r in results) avg_length = total_length / len(results) if results else 0 return { "total_items": len(results), "total_length": total_length, "average_length": avg_length } @chute.cord( public_api_path="/process_stream", method="POST", stream=True ) async def process_data_stream(self, data: List[str], chunk_size: int = 10): """Stream large data processing.""" processor = ChunkedProcessor() async def stream_response(): async for update in processor.process_large_dataset(data, chunk_size): yield f"data: {json.dumps(update)}\n\n" return StreamingResponse( stream_response(), media_type="text/event-stream" ) ``` ### Multi-Model Streaming Pipeline ```python class StreamingPipeline: """Stream processing through multiple AI models.""" def __init__(self): self.models = {} async def stream_multi_model_processing( self, text: str ) -> AsyncIterator[Dict[str, Any]]: """Process text through multiple models with streaming updates.""" 
pipeline_steps = [ ("preprocessing", self._preprocess), ("sentiment", self._analyze_sentiment), ("entities", self._extract_entities), ("summary", self._generate_summary), ("translation", self._translate_text) ] current_data = {"text": text} for step_name, step_func in pipeline_steps: yield { "step": step_name, "status": "starting", "input": current_data } try: # Process step step_result = await step_func(current_data) current_data.update(step_result) yield { "step": step_name, "status": "completed", "result": step_result, "accumulated_data": current_data } except Exception as e: yield { "step": step_name, "status": "error", "error": str(e) } break # Send final result yield { "step": "pipeline_complete", "status": "completed", "final_result": current_data } async def _preprocess(self, data: Dict[str, Any]) -> Dict[str, Any]: """Preprocessing step.""" await asyncio.sleep(0.1) return { "cleaned_text": data["text"].strip().lower(), "word_count": len(data["text"].split()) } async def _analyze_sentiment(self, data: Dict[str, Any]) -> Dict[str, Any]: """Sentiment analysis step.""" await asyncio.sleep(0.2) # Simulate sentiment analysis return { "sentiment": "positive", "sentiment_score": 0.8 } async def _extract_entities(self, data: Dict[str, Any]) -> Dict[str, Any]: """Entity extraction step.""" await asyncio.sleep(0.15) return { "entities": [ {"text": "example", "type": "MISC", "confidence": 0.9} ] } async def _generate_summary(self, data: Dict[str, Any]) -> Dict[str, Any]: """Text summarization step.""" await asyncio.sleep(0.3) return { "summary": f"Summary of: {data['text'][:50]}..." 
} async def _translate_text(self, data: Dict[str, Any]) -> Dict[str, Any]: """Translation step.""" await asyncio.sleep(0.25) return { "translated_text": f"Translated: {data['text']}" } @chute.cord( public_api_path="/pipeline_stream", method="POST", stream=True ) async def stream_pipeline_processing(self, text: str): """Stream multi-model pipeline processing.""" pipeline = StreamingPipeline() async def stream_response(): async for update in pipeline.stream_multi_model_processing(text): yield f"data: {json.dumps(update)}\n\n" return StreamingResponse( stream_response(), media_type="text/event-stream" ) ``` ## WebSocket Applications ### Interactive Chat Application ```python from typing import Dict, Set import uuid class ChatManager: """Manage WebSocket chat sessions.""" def __init__(self): self.active_connections: Dict[str, WebSocket] = {} self.chat_sessions: Dict[str, Dict] = {} async def connect(self, websocket: WebSocket, session_id: str = None): """Connect a new WebSocket client.""" await websocket.accept() if session_id is None: session_id = str(uuid.uuid4()) self.active_connections[session_id] = websocket self.chat_sessions[session_id] = { "messages": [], "connected_at": time.time() } # Send welcome message await self.send_message(session_id, { "type": "system", "message": f"Connected to chat session {session_id}", "session_id": session_id }) return session_id async def disconnect(self, session_id: str): """Disconnect a WebSocket client.""" if session_id in self.active_connections: del self.active_connections[session_id] if session_id in self.chat_sessions: del self.chat_sessions[session_id] async def send_message(self, session_id: str, message: Dict): """Send message to specific session.""" if session_id in self.active_connections: websocket = self.active_connections[session_id] await websocket.send_text(json.dumps(message)) async def broadcast_message(self, message: Dict, exclude_session: str = None): """Broadcast message to all connected sessions.""" for 
session_id, websocket in self.active_connections.items(): if session_id != exclude_session: try: await websocket.send_text(json.dumps(message)) except: # Connection may be closed pass @chute.on_startup() async def initialize_chat(self): """Initialize chat manager.""" self.chat_manager = ChatManager() @chute.websocket("/chat") async def chat_websocket(self, websocket: WebSocket, session_id: str = None): """WebSocket endpoint for interactive chat.""" session_id = await self.chat_manager.connect(websocket, session_id) try: while True: # Receive message data = await websocket.receive_text() message_data = json.loads(data) # Process based on message type if message_data.get("type") == "user_message": await self._handle_user_message(session_id, message_data) elif message_data.get("type") == "typing": await self._handle_typing_indicator(session_id, message_data) elif message_data.get("type") == "ping": await self._handle_ping(session_id) except Exception as e: print(f"Chat error for session {session_id}: {e}") finally: await self.chat_manager.disconnect(session_id) async def _handle_user_message(self, session_id: str, message_data: Dict): """Handle user message and generate AI response.""" user_message = message_data.get("message", "") # Store user message self.chat_manager.chat_sessions[session_id]["messages"].append({ "role": "user", "content": user_message, "timestamp": time.time() }) # Send typing indicator await self.chat_manager.send_message(session_id, { "type": "ai_typing", "typing": True }) # Generate streaming AI response ai_response = "" async for token_data in self.stream_llm_generation(user_message): if not token_data["is_complete"]: ai_response += token_data["token"] # Send partial response await self.chat_manager.send_message(session_id, { "type": "ai_message_partial", "content": ai_response, "token": token_data["token"] }) else: # Send complete response await self.chat_manager.send_message(session_id, { "type": "ai_message_complete", "content": ai_response 
}) # Store AI message self.chat_manager.chat_sessions[session_id]["messages"].append({ "role": "assistant", "content": ai_response, "timestamp": time.time() }) async def _handle_typing_indicator(self, session_id: str, message_data: Dict): """Handle typing indicator.""" typing = message_data.get("typing", False) # Broadcast typing status to other users (if multi-user chat) await self.chat_manager.broadcast_message({ "type": "user_typing", "session_id": session_id, "typing": typing }, exclude_session=session_id) async def _handle_ping(self, session_id: str): """Handle ping for connection keepalive.""" await self.chat_manager.send_message(session_id, { "type": "pong", "timestamp": time.time() }) ``` ### Real-time Collaboration ```python class CollaborativeEditor: """Real-time collaborative document editing.""" def __init__(self): self.documents: Dict[str, Dict] = {} self.subscribers: Dict[str, Set[str]] = {} # doc_id -> set of session_ids self.session_connections: Dict[str, WebSocket] = {} async def join_document(self, doc_id: str, session_id: str, websocket: WebSocket): """Join a collaborative document.""" # Initialize document if doesn't exist if doc_id not in self.documents: self.documents[doc_id] = { "content": "", "version": 0, "last_modified": time.time() } self.subscribers[doc_id] = set() # Add subscriber self.subscribers[doc_id].add(session_id) self.session_connections[session_id] = websocket # Send current document state await websocket.send_text(json.dumps({ "type": "document_state", "doc_id": doc_id, "content": self.documents[doc_id]["content"], "version": self.documents[doc_id]["version"] })) # Notify other users await self._broadcast_to_document(doc_id, { "type": "user_joined", "session_id": session_id }, exclude_session=session_id) async def leave_document(self, doc_id: str, session_id: str): """Leave a collaborative document.""" if doc_id in self.subscribers: self.subscribers[doc_id].discard(session_id) if session_id in self.session_connections: del 
self.session_connections[session_id] # Notify other users await self._broadcast_to_document(doc_id, { "type": "user_left", "session_id": session_id }, exclude_session=session_id) async def apply_operation(self, doc_id: str, session_id: str, operation: Dict): """Apply an edit operation to the document.""" if doc_id not in self.documents: return doc = self.documents[doc_id] # Apply operation (simplified - real implementation would use OT) if operation["type"] == "insert": pos = operation["position"] text = operation["text"] content = doc["content"] doc["content"] = content[:pos] + text + content[pos:] elif operation["type"] == "delete": start = operation["start"] length = operation["length"] content = doc["content"] doc["content"] = content[:start] + content[start + length:] # Update version doc["version"] += 1 doc["last_modified"] = time.time() # Broadcast operation to other users await self._broadcast_to_document(doc_id, { "type": "operation", "operation": operation, "version": doc["version"], "author": session_id }, exclude_session=session_id) async def _broadcast_to_document(self, doc_id: str, message: Dict, exclude_session: str = None): """Broadcast message to all document subscribers.""" if doc_id not in self.subscribers: return for session_id in self.subscribers[doc_id]: if session_id != exclude_session and session_id in self.session_connections: try: websocket = self.session_connections[session_id] await websocket.send_text(json.dumps(message)) except: # Connection may be closed pass @chute.websocket("/collaborate/{doc_id}") async def collaborative_editing(self, websocket: WebSocket, doc_id: str): """WebSocket endpoint for collaborative editing.""" session_id = str(uuid.uuid4()) editor = getattr(self, 'collaborative_editor', None) if editor is None: self.collaborative_editor = CollaborativeEditor() editor = self.collaborative_editor await websocket.accept() await editor.join_document(doc_id, session_id, websocket) try: while True: data = await 
websocket.receive_text() message = json.loads(data) if message["type"] == "operation": await editor.apply_operation(doc_id, session_id, message["operation"]) elif message["type"] == "cursor_position": # Broadcast cursor position to other users await editor._broadcast_to_document(doc_id, { "type": "cursor_update", "session_id": session_id, "position": message["position"] }, exclude_session=session_id) except Exception as e: print(f"Collaboration error: {e}") finally: await editor.leave_document(doc_id, session_id) ``` ## Performance and Optimization ### Streaming Buffer Management ```python import asyncio from collections import deque class StreamingBuffer: """Manage streaming data with buffering and backpressure handling.""" def __init__(self, max_buffer_size: int = 1000): self.buffer = deque(maxlen=max_buffer_size) self.consumers = set() self.producer_task = None self.is_producing = False async def start_producing(self, producer_func): """Start producing data.""" if self.is_producing: return self.is_producing = True self.producer_task = asyncio.create_task(self._produce_data(producer_func)) async def stop_producing(self): """Stop producing data.""" self.is_producing = False if self.producer_task: self.producer_task.cancel() try: await self.producer_task except asyncio.CancelledError: pass async def _produce_data(self, producer_func): """Internal producer loop.""" try: async for data in producer_func(): self.buffer.append(data) # Notify consumers await self._notify_consumers(data) # Backpressure handling if len(self.buffer) >= self.buffer.maxlen * 0.8: await asyncio.sleep(0.01) # Slow down production except asyncio.CancelledError: pass except Exception as e: print(f"Producer error: {e}") finally: self.is_producing = False async def _notify_consumers(self, data): """Notify all consumers of new data.""" dead_consumers = set() for consumer in self.consumers: try: consumer.put_nowait(data) except asyncio.QueueFull: dead_consumers.add(consumer) # Remove dead consumers
self.consumers -= dead_consumers async def subscribe(self) -> asyncio.Queue: """Subscribe to the stream.""" consumer_queue = asyncio.Queue(maxsize=100) self.consumers.add(consumer_queue) # Replay only as much buffered data as fits, so a full queue cannot block the subscriber for data in list(self.buffer)[-consumer_queue.maxsize:]: consumer_queue.put_nowait(data) return consumer_queue def unsubscribe(self, consumer_queue: asyncio.Queue): """Unsubscribe from the stream.""" self.consumers.discard(consumer_queue) # Usage in streaming endpoint @chute.on_startup() async def initialize_streaming_buffer(self): """Initialize streaming buffer.""" self.streaming_buffer = StreamingBuffer(max_buffer_size=500) @chute.cord( public_api_path="/buffered_stream", method="GET", stream=True ) async def buffered_streaming_endpoint(self): """Stream with buffering and backpressure handling.""" # Start producing if not already started if not self.streaming_buffer.is_producing: await self.streaming_buffer.start_producing(self._data_producer) # Subscribe to stream consumer_queue = await self.streaming_buffer.subscribe() async def stream_response(): try: while True: # Get data from buffer data = await asyncio.wait_for(consumer_queue.get(), timeout=30.0) yield f"data: {json.dumps(data)}\n\n" except asyncio.TimeoutError: yield "event: timeout\ndata: {}\n\n" except Exception as e: yield f"event: error\ndata: {json.dumps({'error': str(e)})}\n\n" finally: self.streaming_buffer.unsubscribe(consumer_queue) return StreamingResponse( stream_response(), media_type="text/event-stream" ) async def _data_producer(self): """Example data producer.""" counter = 0 while True: yield { "timestamp": time.time(), "counter": counter, "data": f"Generated data {counter}" } counter += 1 await asyncio.sleep(0.1) ``` ### Connection Management ```python class ConnectionManager: """Manage WebSocket connections with health monitoring.""" def __init__(self): self.connections: Dict[str, Dict] = {} self.monitoring_task = None async def start_monitoring(self): """Start connection health monitoring.""" if
self.monitoring_task is None: self.monitoring_task = asyncio.create_task(self._monitor_connections()) async def stop_monitoring(self): """Stop connection monitoring.""" if self.monitoring_task: self.monitoring_task.cancel() try: await self.monitoring_task except asyncio.CancelledError: pass self.monitoring_task = None async def add_connection(self, session_id: str, websocket: WebSocket): """Add a new WebSocket connection.""" self.connections[session_id] = { "websocket": websocket, "connected_at": time.time(), "last_ping": time.time(), "is_alive": True } # Start monitoring if first connection if len(self.connections) == 1: await self.start_monitoring() async def remove_connection(self, session_id: str): """Remove a WebSocket connection.""" if session_id in self.connections: del self.connections[session_id] # Stop monitoring if no connections if len(self.connections) == 0: await self.stop_monitoring() async def send_to_connection(self, session_id: str, message: Dict) -> bool: """Send message to specific connection.""" if session_id not in self.connections: return False try: websocket = self.connections[session_id]["websocket"] await websocket.send_text(json.dumps(message)) return True except Exception: # Mark connection as dead self.connections[session_id]["is_alive"] = False return False async def broadcast(self, message: Dict, exclude: Set[str] = None): """Broadcast message to all connections.""" if exclude is None: exclude = set() dead_connections = [] for session_id, conn_info in self.connections.items(): if session_id not in exclude and conn_info["is_alive"]: success = await self.send_to_connection(session_id, message) if not success: dead_connections.append(session_id) # Clean up dead connections for session_id in dead_connections: await self.remove_connection(session_id) async def _monitor_connections(self): """Monitor connection health.""" try: while True: await asyncio.sleep(30) # Check every 30 seconds current_time = time.time() dead_connections = [] for session_id,
conn_info in self.connections.items(): # Check if connection is stale if current_time - conn_info["last_ping"] > 60: # 1 minute timeout dead_connections.append(session_id) continue # Send ping success = await self.send_to_connection(session_id, { "type": "ping", "timestamp": current_time }) if success: conn_info["last_ping"] = current_time else: dead_connections.append(session_id) # Clean up dead connections for session_id in dead_connections: await self.remove_connection(session_id) except asyncio.CancelledError: pass except Exception as e: print(f"Connection monitoring error: {e}") ``` ## Client-Side Integration ### JavaScript/TypeScript Client ```javascript class ChutesStreamingClient { constructor(baseUrl) { this.baseUrl = baseUrl; this.eventSource = null; this.websocket = null; } // HTTP Streaming (Server-Sent Events) streamHTTP(endpoint, options = {}) { return new Promise((resolve, reject) => { const url = `${this.baseUrl}${endpoint}`; this.eventSource = new EventSource(url); const results = []; this.eventSource.onmessage = (event) => { try { const data = JSON.parse(event.data); results.push(data); // Call progress callback if provided if (options.onProgress) { options.onProgress(data); } // Check for completion if (data.done || data.is_complete) { this.eventSource.close(); resolve(results); } } catch (e) { console.error('Failed to parse SSE data:', e); } }; this.eventSource.onerror = (error) => { this.eventSource.close(); reject(error); }; }); } // WebSocket Streaming async connectWebSocket(endpoint) { return new Promise((resolve, reject) => { const wsUrl = `ws${ this.baseUrl.startsWith('https') ? 
's' : '' }://${this.baseUrl.replace(/^https?:\/\//, '')}${endpoint}`; this.websocket = new WebSocket(wsUrl); this.websocket.onopen = () => { resolve(this); }; this.websocket.onerror = (error) => { reject(error); }; this.websocket.onclose = () => { console.log('WebSocket connection closed'); }; }); } // Send message via WebSocket sendMessage(message) { if (this.websocket && this.websocket.readyState === WebSocket.OPEN) { this.websocket.send(JSON.stringify(message)); } } // Set message handler onMessage(handler) { if (this.websocket) { this.websocket.onmessage = (event) => { try { const data = JSON.parse(event.data); handler(data); } catch (e) { console.error('Failed to parse WebSocket message:', e); } }; } } // Clean up connections disconnect() { if (this.eventSource) { this.eventSource.close(); this.eventSource = null; } if (this.websocket) { this.websocket.close(); this.websocket = null; } } } // Usage examples const client = new ChutesStreamingClient('https://myuser-my-chute.chutes.ai'); // HTTP Streaming example client .streamHTTP('/generate_stream', { onProgress: (data) => { console.log('Received token:', data.token); // Update UI with streaming content document.getElementById('output').textContent += data.token; } }) .then((results) => { console.log('Streaming complete:', results); }); // WebSocket example client.connectWebSocket('/chat').then(() => { client.onMessage((data) => { if (data.type === 'ai_message_partial') { // Update chat interface with partial message updateChatInterface(data.content); } }); // Send a message client.sendMessage({ type: 'user_message', message: 'Hello, AI!' 
}); }); ``` ### Python Client ```python import asyncio import aiohttp import json from typing import AsyncIterator, Callable, Optional class ChutesAsyncClient: """Async Python client for Chutes streaming APIs.""" def __init__(self, base_url: str): self.base_url = base_url.rstrip('/') self.session = None async def __aenter__(self): self.session = aiohttp.ClientSession() return self async def __aexit__(self, exc_type, exc_val, exc_tb): if self.session: await self.session.close() async def stream_http( self, endpoint: str, method: str = 'GET', data: dict = None, progress_callback: Callable = None ) -> AsyncIterator[dict]: """Stream data via HTTP Server-Sent Events.""" url = f"{self.base_url}{endpoint}" async with self.session.request( method, url, json=data, headers={'Accept': 'text/event-stream'} ) as response: async for line in response.content: line_str = line.decode('utf-8').strip() if line_str.startswith('data: '): try: data_str = line_str[6:] # Remove 'data: ' prefix data_obj = json.loads(data_str) if progress_callback: progress_callback(data_obj) yield data_obj except json.JSONDecodeError: continue async def connect_websocket( self, endpoint: str, message_handler: Callable = None ): """Connect to WebSocket endpoint.""" ws_url = f"ws{self.base_url[4:]}{endpoint}" async with self.session.ws_connect(ws_url) as ws: self.websocket = ws async for msg in ws: if msg.type == aiohttp.WSMsgType.TEXT: try: data = json.loads(msg.data) if message_handler: await message_handler(data) yield data except json.JSONDecodeError: continue elif msg.type == aiohttp.WSMsgType.ERROR: break async def send_websocket_message(self, message: dict): """Send message via WebSocket.""" if hasattr(self, 'websocket'): await self.websocket.send_str(json.dumps(message)) # Usage example async def example_usage(): async with ChutesAsyncClient('https://myuser-my-chute.chutes.ai') as client: # HTTP Streaming async for token_data in client.stream_http( '/generate_stream', method='POST', data={'prompt': 
'Tell me a story'}, progress_callback=lambda x: print(f"Token: {x.get('token', '')}") ): if token_data.get('is_complete'): print("Generation complete!") break # WebSocket example async for message in client.connect_websocket( '/chat', message_handler=lambda msg: print(f"Received: {msg}") ): if message.get('type') == 'system': # Send a message await client.send_websocket_message({ 'type': 'user_message', 'message': 'Hello from Python client!' }) # Run the example # asyncio.run(example_usage()) ``` ## Best Practices and Troubleshooting ### Error Handling in Streams ```python class StreamErrorHandler: """Handle errors in streaming applications.""" @staticmethod async def safe_stream_wrapper(stream_func, error_callback=None): """Wrap streaming function with error handling.""" try: async for item in stream_func(): yield item except asyncio.CancelledError: yield {"type": "error", "error": "Stream cancelled"} except Exception as e: error_msg = { "type": "error", "error": str(e), "error_type": type(e).__name__ } if error_callback: await error_callback(error_msg) yield error_msg @staticmethod async def retry_stream(stream_func, max_retries=3, delay=1.0): """Retry streaming function on failure.""" for attempt in range(max_retries): try: async for item in stream_func(): yield item return # Success, exit retry loop except Exception as e: if attempt == max_retries - 1: yield { "type": "error", "error": f"Failed after {max_retries} attempts: {str(e)}" } return yield { "type": "retry", "attempt": attempt + 1, "max_retries": max_retries, "error": str(e) } await asyncio.sleep(delay * (2 ** attempt)) # Exponential backoff # Usage @chute.cord(public_api_path="/safe_stream", method="POST", stream=True) async def safe_streaming_endpoint(self, prompt: str): """Streaming endpoint with error handling.""" async def stream_with_errors(): error_handler = StreamErrorHandler() async for item in error_handler.safe_stream_wrapper( lambda: self.stream_llm_generation(prompt), error_callback=lambda 
err: self.log_error(err) ): yield f"data: {json.dumps(item)}\n\n" return StreamingResponse( stream_with_errors(), media_type="text/event-stream" ) ``` ### Performance Monitoring ```python class StreamingMetrics: """Monitor streaming performance.""" def __init__(self): self.active_streams = 0 self.total_streams = 0 self.avg_stream_duration = 0 self.stream_start_times = {} def start_stream(self, stream_id: str): """Record stream start.""" self.active_streams += 1 self.total_streams += 1 self.stream_start_times[stream_id] = time.time() def end_stream(self, stream_id: str): """Record stream end.""" self.active_streams = max(0, self.active_streams - 1) if stream_id in self.stream_start_times: duration = time.time() - self.stream_start_times[stream_id] self.avg_stream_duration = ( (self.avg_stream_duration * (self.total_streams - 1) + duration) / self.total_streams ) del self.stream_start_times[stream_id] def get_metrics(self) -> dict: """Get current metrics.""" return { "active_streams": self.active_streams, "total_streams": self.total_streams, "avg_duration": self.avg_stream_duration, "current_streams": list(self.stream_start_times.keys()) } @chute.on_startup() async def initialize_metrics(self): """Initialize streaming metrics.""" self.streaming_metrics = StreamingMetrics() @chute.cord(public_api_path="/metrics", method="GET") async def get_streaming_metrics(self): """Get streaming performance metrics.""" return self.streaming_metrics.get_metrics() ``` ## Next Steps - **Advanced Protocols**: Implement WebRTC for peer-to-peer streaming - **Scale Optimization**: Handle thousands of concurrent streams - **Security**: Implement authentication and rate limiting for streams - **Integration**: Connect with real-time databases and message queues For more advanced topics, see: - [Error Handling Guide](error-handling) - [Best Practices](best-practices) - [Performance Optimization](performance-optimization) --- ## SOURCE: https://chutes.ai/docs/guides/templates # Using Pre-built 
Templates This guide covers how to effectively use Chutes' pre-built templates to rapidly deploy AI applications with minimal configuration while maintaining flexibility for customization. ## Overview Pre-built templates provide: - **Rapid Deployment**: Get AI models running in minutes - **Best Practices**: Optimized configurations and performance tuning - **Proven Architectures**: Battle-tested model serving patterns - **Easy Customization**: Modify templates to fit your needs - **Production Ready**: Built-in scaling, monitoring, and error handling ## Available Templates ### VLLM Template High-performance large language model serving with OpenAI compatibility. ```python from chutes.chute import NodeSelector from chutes.chute.template.vllm import build_vllm_chute # Basic VLLM deployment chute = build_vllm_chute( username="myuser", readme="microsoft/DialoGPT-medium for conversational AI", model_name="microsoft/DialoGPT-medium", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24 ), concurrency=4 ) ``` **Key Features:** - OpenAI-compatible API endpoints - Automatic batching and CUDA graph optimization - Support for all major open-source LLMs - Built-in streaming and function calling - Multi-GPU distributed inference ### SGLang Template Advanced structured generation with programmable text generation. ```python from chutes.chute import NodeSelector from chutes.chute.template.sglang import build_sglang_chute chute = build_sglang_chute( username="myuser", readme="Qwen2.5-7B-Instruct with SGLang", model_name="Qwen/Qwen2.5-7B-Instruct", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16 ), concurrency=8 ) ``` **Key Features:** - Advanced structured generation - Custom sampling and constraints - Batch processing optimizations - Memory-efficient serving - Real-time streaming responses ### TEI Template (Text Embeddings Inference) High-performance text embedding generation for similarity search and RAG. 
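Embedding vectors returned by an embedding service like this are typically compared with cosine similarity for search and RAG retrieval. As a minimal, framework-independent sketch (plain Python with toy hard-coded vectors; no Chutes-specific API is assumed):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Rank candidate document embeddings against a query embedding
query = [0.1, 0.9, 0.2]
candidates = {"doc_a": [0.1, 0.8, 0.3], "doc_b": [0.9, 0.1, 0.0]}
ranked = sorted(
    candidates,
    key=lambda name: cosine_similarity(query, candidates[name]),
    reverse=True,
)
# ranked[0] == "doc_a" (closest in direction to the query vector)
```

In a real RAG pipeline the vectors would come from the deployed embedding endpoint rather than being hard-coded, and a vector database would handle ranking at scale.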
```python from chutes.chute import NodeSelector from chutes.chute.template.tei import build_tei_chute chute = build_tei_chute( username="myuser", readme="sentence-transformers/all-MiniLM-L6-v2 embeddings", model_name="sentence-transformers/all-MiniLM-L6-v2", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=8 ), concurrency=16 ) ``` **Key Features:** - Optimized embedding generation - Batch processing for efficiency - Multiple pooling strategies - Built-in similarity computation - Support for various embedding models ### Diffusion Template Image generation using state-of-the-art diffusion models. ```python from chutes.chute import NodeSelector from chutes.chute.template.diffusion import build_diffusion_chute chute = build_diffusion_chute( username="myuser", readme="Stable Diffusion XL for image generation", model_name="stabilityai/stable-diffusion-xl-base-1.0", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24 ), concurrency=2 ) ``` **Key Features:** - Support for various diffusion architectures - Text-to-image and image-to-image generation - Optimized memory usage and inference - Built-in image processing and validation - Support for ControlNet and LoRA ## Template Customization ### Basic Parameter Tuning All templates support common parameters for customization: ```python from chutes.chute import NodeSelector from chutes.chute.template.vllm import build_vllm_chute # Customized VLLM deployment chute = build_vllm_chute( username="myuser", readme="Customized Llama 2 deployment", model_name="meta-llama/Llama-2-7b-chat-hf", # Hardware configuration node_selector=NodeSelector( gpu_count=2, # Multi-GPU setup min_vram_gb_per_gpu=40, # High memory requirement include=["h100", "a100"], # Prefer specific GPU types exclude=["k80", "v100"] # Exclude older GPUs ), # Performance settings concurrency=8, # Handle 8 concurrent requests # Model-specific arguments engine_args=dict( gpu_memory_utilization=0.95, # Use 95% of GPU memory max_model_len=4096, # 
Context length max_num_seqs=16, # Batch size temperature=0.7, # Default temperature trust_remote_code=True, # Enable custom models quantization="awq", # Use AWQ quantization tensor_parallel_size=2, # Use both GPUs ), # Custom image (optional) image="chutes/vllm:0.8.0", # Revision pinning for reproducibility revision="main" ) ``` ### Advanced Engine Configuration #### VLLM Advanced Settings ```python # Production VLLM configuration chute = build_vllm_chute( username="myuser", model_name="microsoft/WizardLM-2-8x22B", node_selector=NodeSelector( gpu_count=8, min_vram_gb_per_gpu=80, include=["h100", "h200"] ), engine_args=dict( # Memory optimization gpu_memory_utilization=0.97, cpu_offload_gb=0, # Performance tuning max_model_len=32768, max_num_seqs=32, max_paddings=256, # Advanced features enable_prefix_caching=True, use_v2_block_manager=True, enable_chunked_prefill=True, # Model loading load_format="auto", dtype="auto", quantization="fp8", # Distributed settings tensor_parallel_size=8, pipeline_parallel_size=1, # API compatibility served_model_name="wizardlm-2-8x22b", chat_template="chatml", # Logging and monitoring disable_log_requests=False, max_log_len=2048), concurrency=16 ) ``` #### SGLang Optimization ```python # Optimized SGLang configuration chute = build_sglang_chute( username="myuser", model_name="mistralai/Mistral-7B-Instruct-v0.2", engine_args=( "--host 0.0.0.0 " "--port 30000 " "--model-path mistralai/Mistral-7B-Instruct-v0.2 " "--tokenizer-path mistralai/Mistral-7B-Instruct-v0.2 " "--context-length 32768 " "--mem-fraction-static 0.9 " "--tp-size 1 " "--stream-interval 1 " "--disable-flashinfer " # For compatibility "--trust-remote-code" ), node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16 ) ) ``` ### Custom Images with Templates You can combine templates with custom images for additional dependencies: ```python from chutes.image import Image from chutes.chute.template.vllm import build_vllm_chute # Build custom image with additional 
packages custom_image = ( Image(username="myuser", name="custom-vllm", tag="1.0") .from_base("chutes/vllm:0.8.0") .run_command("pip install langchain openai tiktoken") .run_command("pip install numpy pandas matplotlib") .with_env("CUSTOM_CONFIG", "production") ) # Use custom image with template chute = build_vllm_chute( username="myuser", model_name="meta-llama/Llama-2-7b-chat-hf", image=custom_image, # Use our custom image node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24 ) ) ``` ## Template Patterns ### Multi-Model Deployment Deploy multiple models using templates: ```python # Deploy different models for different use cases from chutes.chute.template.vllm import build_vllm_chute from chutes.chute.template.tei import build_tei_chute # Chat model chat_chute = build_vllm_chute( username="myuser", name="chat-service", model_name="microsoft/DialoGPT-medium", node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=16) ) # Code model code_chute = build_vllm_chute( username="myuser", name="code-service", model_name="codellama/CodeLlama-7b-Python-hf", node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=16) ) # Embedding model embedding_chute = build_tei_chute( username="myuser", name="embedding-service", model_name="sentence-transformers/all-mpnet-base-v2", node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=8) ) ``` ### Template Inheritance and Extension Create your own template patterns based on existing ones: ```python from chutes.chute.template.vllm import build_vllm_chute from chutes.chute import NodeSelector from chutes.image import Image def build_chat_template( username: str, model_name: str, system_prompt: str = "You are a helpful assistant.", **kwargs ): """Custom template for chat applications.""" # Custom image with chat-specific tools image = ( Image(username=username, name="chat-optimized", tag="1.0") .from_base("chutes/vllm:latest") .run_command("pip install tiktoken langchain") .with_env("SYSTEM_PROMPT", system_prompt) 
.with_env("CHAT_MODE", "true") ) # Default settings optimized for chat default_engine_args = { "max_model_len": 8192, "temperature": 0.8, "top_p": 0.9, "max_tokens": 1024, "stream": True } # Merge with user-provided args engine_args = kwargs.pop("engine_args", {}) engine_args = {**default_engine_args, **engine_args} return build_vllm_chute( username=username, model_name=model_name, image=image, engine_args=engine_args, **kwargs ) # Use custom template chat_chute = build_chat_template( username="myuser", model_name="microsoft/DialoGPT-medium", system_prompt="You are a friendly customer service assistant.", node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=16) ) ``` ### Template-Based Microservices Build a complete AI system using multiple templates: ```python # microservices_deployment.py from chutes.chute.template.vllm import build_vllm_chute from chutes.chute.template.tei import build_tei_chute from chutes.chute.template.diffusion import build_diffusion_chute class AIServiceSuite: """Complete AI service suite using templates.""" def __init__(self, username: str): self.username = username self.services = {} def deploy_text_services(self): """Deploy text processing services.""" # Main chat model self.services["chat"] = build_vllm_chute( username=self.username, name="chat-llm", model_name="microsoft/DialoGPT-medium", node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=24), concurrency=8 ) # Specialized reasoning model self.services["reasoning"] = build_vllm_chute( username=self.username, name="reasoning-llm", model_name="deepseek-ai/deepseek-llm-7b-chat", node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=16), concurrency=4 ) # Embeddings for RAG self.services["embeddings"] = build_tei_chute( username=self.username, name="text-embeddings", model_name="sentence-transformers/all-mpnet-base-v2", node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=8), concurrency=16 ) def deploy_multimodal_services(self): """Deploy multimodal AI 
services.""" # Image generation self.services["image_gen"] = build_diffusion_chute( username=self.username, name="image-generator", model_name="stabilityai/stable-diffusion-xl-base-1.0", node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=24), concurrency=2 ) # Vision-language model self.services["vision"] = build_vllm_chute( username=self.username, name="vision-llm", model_name="llava-hf/llava-1.5-7b-hf", node_selector=NodeSelector(gpu_count=1, min_vram_gb_per_gpu=16), concurrency=4 ) def get_deployment_script(self): """Generate deployment script for all services.""" script_lines = ["#!/bin/bash", "set -e", ""] for service_name, chute in self.services.items(): script_lines.extend([ f"echo 'Deploying {service_name}...'", f"chutes deploy {chute.name}:chute --wait", f"echo '{service_name} deployed successfully'", "" ]) return "\n".join(script_lines) # Usage suite = AIServiceSuite("myuser") suite.deploy_text_services() suite.deploy_multimodal_services() # Generate deployment script deployment_script = suite.get_deployment_script() with open("deploy_ai_suite.sh", "w") as f: f.write(deployment_script) ``` ## Template Configuration Best Practices ### 1. 
Hardware Selection Choose appropriate hardware for each template: ```python # Memory requirements by model size hardware_configs = { "small_models": { # <7B parameters "node_selector": NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16, include=["rtx4090", "a40", "l40"] ), "concurrency": 8 }, "medium_models": { # 7B-30B parameters "node_selector": NodeSelector( gpu_count=1, min_vram_gb_per_gpu=48, include=["a100", "h100"] ), "concurrency": 4 }, "large_models": { # 30B+ parameters "node_selector": NodeSelector( gpu_count=2, min_vram_gb_per_gpu=80, include=["h100", "h200"] ), "concurrency": 2 } } def select_hardware(model_name: str): """Select hardware configuration based on model.""" # Simple heuristic based on model name if "7b" in model_name.lower(): return hardware_configs["small_models"] elif any(size in model_name.lower() for size in ["13b", "30b"]): return hardware_configs["medium_models"] else: return hardware_configs["large_models"] ``` ### 2. Environment-Specific Configurations ```python import os def get_config_for_environment(env: str = "production"): """Get configuration based on deployment environment.""" configs = { "development": { "concurrency": 2, "engine_args": { "gpu_memory_utilization": 0.8, "max_model_len": 2048, "disable_log_requests": False } }, "staging": { "concurrency": 4, "engine_args": { "gpu_memory_utilization": 0.9, "max_model_len": 4096, "disable_log_requests": False } }, "production": { "concurrency": 8, "engine_args": { "gpu_memory_utilization": 0.95, "max_model_len": 8192, "disable_log_requests": True, "enable_prefix_caching": True } } } return configs.get(env, configs["production"]) # Usage env = os.getenv("DEPLOYMENT_ENV", "production") config = get_config_for_environment(env) chute = build_vllm_chute( username="myuser", model_name="meta-llama/Llama-2-7b-chat-hf", **config ) ``` ### 3. 
Model-Specific Optimizations ```python def get_model_optimizations(model_name: str): """Get model-specific optimizations.""" optimizations = { # Llama models "llama": { "engine_args": { "quantization": "awq", "enable_prefix_caching": True, "use_v2_block_manager": True } }, # Mistral models "mistral": { "engine_args": { "tokenizer_mode": "mistral", "config_format": "mistral", "trust_remote_code": True } }, # CodeLlama models "code": { "engine_args": { "max_model_len": 16384, # Longer context for code "temperature": 0.1, # Lower temperature for code "enable_prefix_caching": True } }, # Chat models "chat": { "engine_args": { "temperature": 0.8, "top_p": 0.9, "max_tokens": 2048, "stream": True } } } # Detect model type from name model_lower = model_name.lower() if "llama" in model_lower: return optimizations["llama"] elif "mistral" in model_lower: return optimizations["mistral"] elif "code" in model_lower: return optimizations["code"] elif any(term in model_lower for term in ["chat", "instruct", "dialog"]): return optimizations["chat"] else: return {"engine_args": {}} # Usage model_name = "codellama/CodeLlama-7b-Python-hf" optimizations = get_model_optimizations(model_name) chute = build_vllm_chute( username="myuser", model_name=model_name, **optimizations ) ``` ## Monitoring and Debugging Templates ### Template Health Checks ```python import requests import time async def check_template_health(chute_url: str, template_type: str): """Check health of deployed template.""" health_checks = { "vllm": { "endpoint": "/v1/models", "expected_status": 200 }, "sglang": { "endpoint": "/health", "expected_status": 200 }, "tei": { "endpoint": "/health", "expected_status": 200 }, "diffusion": { "endpoint": "/health", "expected_status": 200 } } if template_type not in health_checks: return {"status": "unknown", "error": "Unknown template type"} check_config = health_checks[template_type] try: response = requests.get( f"{chute_url}{check_config['endpoint']}", timeout=10 ) if 
response.status_code == check_config["expected_status"]: return {"status": "healthy", "response_time": response.elapsed.total_seconds()} else: return {"status": "unhealthy", "status_code": response.status_code} except Exception as e: return {"status": "error", "error": str(e)} # Usage health = await check_template_health( "https://myuser-my-model.chutes.ai", "vllm" ) print(f"Service health: {health}") ``` ### Performance Monitoring ```python def monitor_template_performance(chute_name: str, duration_minutes: int = 60): """Monitor template performance over time.""" import subprocess import json # Collect metrics metrics_cmd = f"chutes chutes metrics {chute_name} --duration {duration_minutes}m --format json" result = subprocess.run(metrics_cmd, shell=True, capture_output=True, text=True) if result.returncode == 0: metrics = json.loads(result.stdout) # Analyze metrics analysis = { "avg_response_time": metrics.get("avg_response_time", 0), "request_count": metrics.get("request_count", 0), "error_rate": metrics.get("error_rate", 0), "gpu_utilization": metrics.get("gpu_utilization", 0), "memory_usage": metrics.get("memory_usage", 0) } # Performance recommendations recommendations = [] if analysis["avg_response_time"] > 5: recommendations.append("Consider increasing concurrency or using faster GPUs") if analysis["gpu_utilization"] < 50: recommendations.append("GPU underutilized - consider reducing instance size") if analysis["error_rate"] > 5: recommendations.append("High error rate - check logs and model configuration") return { "metrics": analysis, "recommendations": recommendations } else: return {"error": "Failed to collect metrics", "details": result.stderr} ``` ## Template Migration and Updates ### Upgrading Template Versions ```python def upgrade_template_safely( current_chute_name: str, new_template_version: str, model_name: str, username: str ): """Safely upgrade a template to a new version.""" # Create new chute with updated template staging_name = 
f"{current_chute_name}-staging" new_chute = build_vllm_chute( username=username, name=staging_name, model_name=model_name, image=f"chutes/vllm:{new_template_version}", # Copy the current configuration (get_current_node_selector and get_current_engine_args are placeholder helpers that would read the running chute's settings) node_selector=get_current_node_selector(current_chute_name), engine_args=get_current_engine_args(current_chute_name) ) # Deployment script upgrade_script = f""" # Deploy staging version chutes deploy {staging_name}:chute --wait # Test staging deployment python test_template.py --target {staging_name} # If tests pass, switch traffic if [ $? -eq 0 ]; then echo "Tests passed, deploying to production" chutes deploy {current_chute_name}:chute --wait chutes chutes delete {staging_name} else echo "Tests failed, keeping current version" chutes chutes delete {staging_name} fi """ return upgrade_script ``` ## Troubleshooting Templates ### Common Issues and Solutions ```python import subprocess def diagnose_template_issues(chute_name: str, template_type: str): """Diagnose common template deployment issues.""" issues = [] # Check deployment status status_cmd = f"chutes chutes get {chute_name}" status_result = subprocess.run(status_cmd, shell=True, capture_output=True, text=True) if "Failed" in status_result.stdout: issues.append({ "issue": "Deployment failed", "solution": "Check logs with: chutes chutes logs " + chute_name }) # Check resource usage metrics_cmd = f"chutes chutes metrics {chute_name}" metrics_result = subprocess.run(metrics_cmd, shell=True, capture_output=True, text=True) if "OutOfMemory" in metrics_result.stdout: issues.append({ "issue": "GPU out of memory", "solution": "Reduce gpu_memory_utilization or increase GPU size" }) # Template-specific checks if template_type == "vllm": # Check for VLLM-specific issues if "CUDA_ERROR_OUT_OF_MEMORY" in metrics_result.stdout: issues.append({ "issue": "VLLM CUDA memory error", "solution": "Reduce max_model_len or batch size (max_num_seqs)" }) elif template_type == "sglang": # Check for SGLang-specific issues if "RuntimeError" in 
metrics_result.stdout: issues.append({ "issue": "SGLang runtime error", "solution": "Check model compatibility and reduce memory usage" }) return issues # Quick diagnostics issues = diagnose_template_issues("my-llm-service", "vllm") for issue in issues: print(f"Issue: {issue['issue']}") print(f"Solution: {issue['solution']}\n") ``` ## Next Steps - **Custom Templates**: Build your own reusable templates - **Production Scaling**: Monitor and optimize template performance - **Advanced Patterns**: Combine templates for complex architectures - **CI/CD Integration**: Automate template deployments For more advanced topics, see: - [Custom Chutes Guide](custom-chutes) - [Performance Optimization](performance-optimization) - [Production Best Practices](best-practices) --- ## SOURCE: https://chutes.ai/docs/templates/diffusion # Diffusion Template The **Diffusion template** provides high-performance image generation using Stable Diffusion and other diffusion models. Perfect for text-to-image, image-to-image, and inpainting applications. ## What is Stable Diffusion? 
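At its core, a diffusion model starts from random noise and subtracts predicted noise over many iterations, which is why `num_inference_steps` (used throughout this page) trades generation speed against quality. A deliberately toy, single-value illustration of that loop (not the real algorithm, which predicts noise with a U-Net over image tensors):

```python
def toy_denoise(x: float, steps: int) -> float:
    """Shrink the 'noise' x by a fixed fraction per step.

    A real diffusion pipeline instead subtracts *predicted* noise from an
    image tensor at every step, so more steps yield a cleaner image.
    """
    for _ in range(steps):
        x -= 0.5 * x  # stand-in for one denoising step
    return x

print(toy_denoise(1.0, 3))  # 0.125: residual "noise" after 3 steps
```

More steps shrink the residual further, which mirrors why 20 steps is fast but rough while 50-80 steps is slower but cleaner.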
Stable Diffusion is a powerful diffusion model that generates high-quality images from text prompts: - 🎨 **Text-to-image** generation from prompts - 🖼️ **Image-to-image** transformation and editing - 🎭 **Inpainting** to fill missing parts of images - 🎯 **ControlNet** for guided generation - ⚡ **Optimized inference** with multiple acceleration techniques ## Quick Start ```python from chutes.chute import NodeSelector from chutes.chute.template.diffusion import build_diffusion_chute chute = build_diffusion_chute( username="myuser", model_name="stabilityai/stable-diffusion-xl-base-1.0", revision="main", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=12 ) ) ``` This creates a complete diffusion deployment with: - ✅ Optimized Stable Diffusion pipeline - ✅ Multiple generation modes (txt2img, img2img, inpaint) - ✅ Configurable generation parameters - ✅ Safety filtering and content moderation - ✅ Auto-scaling based on demand ## Function Reference ### `build_diffusion_chute()` ```python def build_diffusion_chute( username: str, model_name: str, revision: str = "main", node_selector: NodeSelector = None, image: str | Image = None, tagline: str = "", readme: str = "", concurrency: int = 1, # Diffusion-specific parameters pipeline_type: str = "text2img", scheduler: str = "euler_a", safety_checker: bool = True, requires_safety_checker: bool = False, guidance_scale: float = 7.5, num_inference_steps: int = 50, height: int = 512, width: int = 512, enable_xformers: bool = True, enable_cpu_offload: bool = False, **kwargs ) -> Chute: ``` #### Required Parameters - **`username`**: Your Chutes username - **`model_name`**: HuggingFace diffusion model identifier #### Diffusion Configuration - **`pipeline_type`**: Generation mode - "text2img", "img2img", or "inpaint" (default: "text2img") - **`scheduler`**: Sampling scheduler - "euler_a", "ddim", "dpm", etc. 
(default: "euler_a") - **`safety_checker`**: Enable NSFW content filtering (default: True) - **`guidance_scale`**: CFG guidance strength (default: 7.5) - **`num_inference_steps`**: Number of denoising steps (default: 50) - **`height`**: Default image height (default: 512) - **`width`**: Default image width (default: 512) - **`enable_xformers`**: Use memory-efficient attention (default: True) ## Complete Example ```python from chutes.chute import NodeSelector from chutes.chute.template.diffusion import build_diffusion_chute # Build diffusion chute for image generation chute = build_diffusion_chute( username="myuser", model_name="stabilityai/stable-diffusion-xl-base-1.0", revision="main", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16, include=["rtx4090", "a100"] ), tagline="High-quality image generation with SDXL", readme=""" # Image Generation Service Generate stunning images from text prompts using Stable Diffusion XL. ## Features - High-resolution image generation (up to 1024x1024) - Multiple generation modes - ControlNet support for guided generation - Safety filtering for appropriate content ## API Endpoints - `/generate` - Text-to-image generation - `/img2img` - Image-to-image transformation - `/inpaint` - Image inpainting """, # Optimize for SDXL scheduler="euler_a", guidance_scale=7.5, num_inference_steps=30, # SDXL works well with fewer steps height=1024, width=1024, safety_checker=True ) ``` ## API Endpoints ### Text-to-Image Generation ```bash curl -X POST https://myuser-diffusion-chute.chutes.ai/generate \ -H "Content-Type: application/json" \ -d '{ "prompt": "A beautiful landscape with mountains and a lake at sunset", "negative_prompt": "blurry, low quality, distorted", "width": 1024, "height": 1024, "num_inference_steps": 30, "guidance_scale": 7.5, "seed": 42 }' ``` ### Image-to-Image ```bash curl -X POST https://myuser-diffusion-chute.chutes.ai/img2img \ -F "image=@input_image.jpg" \ -F "prompt=A cyberpunk version of this scene" \ -F 
"strength=0.7" \ -F "guidance_scale=7.5" ``` ### Inpainting ```bash curl -X POST https://myuser-diffusion-chute.chutes.ai/inpaint \ -F "image=@original.jpg" \ -F "mask=@mask.jpg" \ -F "prompt=A beautiful garden" \ -F "num_inference_steps=50" ``` ## Model Recommendations ### Stable Diffusion 1.5 ```python # Classic SD 1.5 - good balance of quality and speed NodeSelector( gpu_count=1, min_vram_gb_per_gpu=8, include=["rtx3090", "rtx4090"] ) # Recommended models: # - runwayml/stable-diffusion-v1-5 # - stabilityai/stable-diffusion-2-1 # - prompthero/openjourney ``` ### Stable Diffusion XL ```python # SDXL - highest quality, more VRAM needed NodeSelector( gpu_count=1, min_vram_gb_per_gpu=12, include=["rtx4090", "a100"] ) # Recommended models: # - stabilityai/stable-diffusion-xl-base-1.0 # - stabilityai/stable-diffusion-xl-refiner-1.0 # - Lykon/DreamShaper-XL-1.0 ``` ### Specialized Models ```python # Anime/artistic styles NodeSelector( gpu_count=1, min_vram_gb_per_gpu=10, include=["rtx4090", "a100"] ) # Recommended models: # - Linaqruf/anything-v3.0 # - hakurei/waifu-diffusion # - SG161222/Realistic_Vision_V6.0_B1_noVAE ``` ## Use Cases ### 1. **Marketing Content Creation** ```python marketing_chute = build_diffusion_chute( username="myuser", model_name="stabilityai/stable-diffusion-xl-base-1.0", tagline="Marketing image generation", guidance_scale=8.0, # Higher guidance for consistent style num_inference_steps=40, height=1024, width=1024 ) ``` ### 2. **Art Generation** ```python art_chute = build_diffusion_chute( username="myuser", model_name="Lykon/DreamShaper-XL-1.0", tagline="Artistic image creation", guidance_scale=6.0, # Lower for more creative freedom scheduler="dpm_solver_multistep", safety_checker=False # For artistic freedom ) ``` ### 3. 
**Product Visualization** ```python product_chute = build_diffusion_chute( username="myuser", model_name="SG161222/Realistic_Vision_V6.0_B1_noVAE", tagline="Realistic product images", guidance_scale=7.5, num_inference_steps=50, # More steps for photorealism scheduler="euler_a" ) ``` ### 4. **Character Design** ```python character_chute = build_diffusion_chute( username="myuser", model_name="Linaqruf/anything-v3.0", tagline="Character and concept art", guidance_scale=7.0, height=768, width=512 # Portrait orientation ) ``` ## Advanced Features ### ControlNet Integration ```python # Enable ControlNet for guided generation controlnet_chute = build_diffusion_chute( username="myuser", model_name="stabilityai/stable-diffusion-xl-base-1.0", enable_controlnet=True, controlnet_models=[ "diffusers/controlnet-canny-sdxl-1.0", "diffusers/controlnet-depth-sdxl-1.0" ] ) ``` ### Custom VAE ```python # Use custom VAE for better image quality custom_vae_chute = build_diffusion_chute( username="myuser", model_name="stabilityai/stable-diffusion-xl-base-1.0", vae_model="madebyollin/sdxl-vae-fp16-fix", enable_vae_slicing=True ) ``` ### Multi-Model Pipeline ```python # SDXL with refiner for ultimate quality refiner_chute = build_diffusion_chute( username="myuser", model_name="stabilityai/stable-diffusion-xl-base-1.0", refiner_model="stabilityai/stable-diffusion-xl-refiner-1.0", refiner_strength=0.3, num_inference_steps=40 ) ``` ## Performance Optimization ### Speed Optimization ```python # Optimize for fast generation fast_chute = build_diffusion_chute( username="myuser", model_name="runwayml/stable-diffusion-v1-5", num_inference_steps=20, # Fewer steps guidance_scale=5.0, # Lower guidance enable_xformers=True, # Memory efficient attention scheduler="euler_a", # Fast scheduler enable_cpu_offload=False # Keep everything on GPU ) ``` ### Quality Optimization ```python # Optimize for highest quality quality_chute = build_diffusion_chute( username="myuser", 
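# NOTE (added comment): this "quality" configuration targets SDXL at
# 1024x1024 with 50 steps; as in the SDXL examples above, pair it with a
# NodeSelector requesting min_vram_gb_per_gpu of 12-16 or more.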
model_name="stabilityai/stable-diffusion-xl-base-1.0", num_inference_steps=50, # More steps guidance_scale=8.0, # Higher guidance scheduler="dpm_solver_multistep", # High-quality scheduler height=1024, width=1024 ) ``` ### Memory Optimization ```python # Optimize for lower VRAM usage memory_efficient_chute = build_diffusion_chute( username="myuser", model_name="runwayml/stable-diffusion-v1-5", enable_cpu_offload=True, # Offload to CPU when not in use enable_vae_slicing=True, # Slice VAE for memory efficiency enable_attention_slicing=True, # Slice attention layers height=512, width=512 ) ``` ## Testing Your Diffusion Chute ### Python Client ```python import requests import base64 from PIL import Image import io def generate_image(prompt, negative_prompt="", width=1024, height=1024): """Generate image from text prompt.""" response = requests.post( "https://myuser-diffusion-chute.chutes.ai/generate", json={ "prompt": prompt, "negative_prompt": negative_prompt, "width": width, "height": height, "num_inference_steps": 30, "guidance_scale": 7.5, "seed": -1 # Random seed } ) if response.status_code == 200: result = response.json() # Decode base64 image image_data = base64.b64decode(result["images"][0]) image = Image.open(io.BytesIO(image_data)) return image else: raise Exception(f"Generation failed: {response.text}") # Test image generation image = generate_image( prompt="A serene mountain lake at sunset with purple clouds", negative_prompt="blurry, low quality, distorted, text", width=1024, height=768 ) image.save("generated_image.png") print("Image saved as generated_image.png") ``` ### Batch Generation ```python import asyncio import aiohttp import json async def batch_generate_images(prompts): """Generate multiple images concurrently.""" async def generate_single(session, prompt): async with session.post( "https://myuser-diffusion-chute.chutes.ai/generate", json={ "prompt": prompt, "num_inference_steps": 25, "guidance_scale": 7.0, "width": 512, "height": 512 } ) as 
response: return await response.json() async with aiohttp.ClientSession() as session: tasks = [generate_single(session, prompt) for prompt in prompts] results = await asyncio.gather(*tasks) return results # Test batch generation prompts = [ "A majestic eagle soaring over mountains", "A cyberpunk cityscape at night with neon lights", "A peaceful garden with cherry blossoms", "A futuristic robot in a sci-fi laboratory" ] results = asyncio.run(batch_generate_images(prompts)) for i, result in enumerate(results): print(f"Generated image {i+1} successfully") ``` ### Image-to-Image Testing ```python import requests from PIL import Image def img2img_transform(input_image_path, prompt, strength=0.7): """Transform an existing image based on prompt.""" with open(input_image_path, 'rb') as f: files = {'image': f} data = { 'prompt': prompt, 'strength': strength, 'guidance_scale': 7.5, 'num_inference_steps': 30 } response = requests.post( "https://myuser-diffusion-chute.chutes.ai/img2img", files=files, data=data ) if response.status_code == 200: result = response.json() # Process result similar to text-to-image return result else: raise Exception(f"Transform failed: {response.text}") # Test image transformation result = img2img_transform( "input_photo.jpg", "Transform this into a watercolor painting", strength=0.8 ) ``` ## Generation Parameters Guide ### Prompt Engineering ```python # Effective prompt structure def create_effective_prompt(subject, style, quality_modifiers=""): """Create well-structured prompts.""" base_prompt = f"{subject}, {style}" if quality_modifiers: base_prompt += f", {quality_modifiers}" # Add quality enhancers quality_terms = "highly detailed, sharp focus, professional photography" return f"{base_prompt}, {quality_terms}" # Examples portrait_prompt = create_effective_prompt( subject="Portrait of a young woman with curly hair", style="Renaissance painting style", quality_modifiers="oil painting, classical lighting" ) landscape_prompt = 
create_effective_prompt( subject="Mountain landscape with a lake", style="digital art", quality_modifiers="golden hour lighting, cinematic composition" ) ``` ### Parameter Guidelines ```python # Parameter recommendations by use case # Photorealistic images photorealistic_params = { "guidance_scale": 7.5, "num_inference_steps": 50, "scheduler": "euler_a" } # Artistic/creative images artistic_params = { "guidance_scale": 6.0, "num_inference_steps": 30, "scheduler": "dpm_solver_multistep" } # Fast generation fast_params = { "guidance_scale": 5.0, "num_inference_steps": 20, "scheduler": "euler_a" } # High quality (slow) quality_params = { "guidance_scale": 8.5, "num_inference_steps": 80, "scheduler": "dpm_solver_multistep" } ``` ## Integration Examples ### Web Gallery Application ```python from flask import Flask, request, jsonify, render_template import requests import base64 app = Flask(__name__) @app.route('/') def gallery(): return render_template('gallery.html') @app.route('/generate', methods=['POST']) def generate(): data = request.json prompt = data.get('prompt') # Generate image response = requests.post( "https://myuser-diffusion-chute.chutes.ai/generate", json={ "prompt": prompt, "negative_prompt": "blurry, low quality", "width": 512, "height": 512, "num_inference_steps": 25 } ) if response.status_code == 200: result = response.json() return jsonify({ "success": True, "image": result["images"][0], # Base64 encoded "seed": result.get("seed") }) else: return jsonify({"success": False, "error": response.text}) if __name__ == '__main__': app.run(debug=True) ``` ### Image Processing Pipeline ```python import requests from PIL import Image, ImageEnhance import io import base64 class ImageProcessor: def __init__(self, chute_url): self.chute_url = chute_url def generate_base_image(self, prompt): """Generate initial image.""" response = requests.post( f"{self.chute_url}/generate", json={ "prompt": prompt, "width": 1024, "height": 1024, "num_inference_steps": 30 } ) 
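# (added) Fail fast on non-2xx responses; otherwise .json() on an HTML
# error page raises a confusing decode error instead of an HTTPError.
response.raise_for_status()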
result = response.json() image_data = base64.b64decode(result["images"][0]) return Image.open(io.BytesIO(image_data)) def refine_image(self, image, prompt, strength=0.5): """Refine existing image.""" # Convert PIL image to bytes img_buffer = io.BytesIO() image.save(img_buffer, format='PNG') img_buffer.seek(0) files = {'image': img_buffer} data = { 'prompt': prompt, 'strength': strength, 'num_inference_steps': 20 } response = requests.post( f"{self.chute_url}/img2img", files=files, data=data ) result = response.json() refined_data = base64.b64decode(result["images"][0]) return Image.open(io.BytesIO(refined_data)) def enhance_image(self, image): """Apply post-processing enhancements.""" # Enhance contrast enhancer = ImageEnhance.Contrast(image) image = enhancer.enhance(1.1) # Enhance color enhancer = ImageEnhance.Color(image) image = enhancer.enhance(1.05) return image # Usage example processor = ImageProcessor("https://myuser-diffusion-chute.chutes.ai") # Generate and refine base_image = processor.generate_base_image("A beautiful sunset over the ocean") refined_image = processor.refine_image( base_image, "A beautiful sunset over the ocean, cinematic lighting, golden hour", strength=0.3 ) final_image = processor.enhance_image(refined_image) final_image.save("final_artwork.png") ``` ## Troubleshooting ### Common Issues **Generation too slow?** - Reduce `num_inference_steps` (try 20-30) - Use a faster scheduler like "euler_a" - Lower the resolution (512x512 instead of 1024x1024) - Enable memory optimizations **Out of memory errors?** - Enable CPU offloading: `enable_cpu_offload=True` - Enable attention slicing: `enable_attention_slicing=True` - Reduce image resolution - Use a smaller model (SD 1.5 instead of SDXL) **Poor image quality?** - Increase `num_inference_steps` (try 50-80) - Adjust `guidance_scale` (7.5-12.0) - Improve prompts with quality modifiers - Use a higher resolution **NSFW content blocked?** - Adjust prompts to be more appropriate - Set 
`safety_checker=False` if appropriate for your use case - Use different negative prompts ## Best Practices ### 1. **Prompt Engineering** ```python # Good prompt structure good_prompt = "Portrait of a person, photorealistic, highly detailed, professional photography, sharp focus, beautiful lighting" # Include style modifiers style_prompt = "Landscape painting, oil on canvas, Bob Ross style, happy little trees, peaceful, serene" # Use negative prompts effectively negative_prompt = "blurry, low quality, distorted, ugly, bad anatomy, extra limbs, text, watermark" ``` ### 2. **Parameter Optimization** ```python # Balance quality and speed balanced_config = { "num_inference_steps": 30, "guidance_scale": 7.5, "width": 768, "height": 768 } # For batch processing batch_config = { "num_inference_steps": 20, "guidance_scale": 6.0, "width": 512, "height": 512 } ``` ### 3. **Memory Management** ```python # For limited VRAM memory_config = { "enable_cpu_offload": True, "enable_attention_slicing": True, "enable_vae_slicing": True, "width": 512, "height": 512 } ``` ### 4. **Content Safety** ```python # Enable safety checking for public-facing applications safe_config = { "safety_checker": True, "requires_safety_checker": True, "guidance_scale": 7.5 # Moderate guidance } ``` ## Next Steps - **[VLLM Template](/docs/templates/vllm)** - Text generation capabilities - **[TEI Template](/docs/templates/tei)** - Text embeddings for image search - **[Image Processing Guide](/docs/guides/image-processing)** - Advanced image manipulation - **[ControlNet Guide](/docs/guides/controlnet)** - Guided image generation --- ## SOURCE: https://chutes.ai/docs/templates/sglang # SGLang Template The **SGLang template** provides structured generation capabilities for complex prompting, reasoning, and multi-step AI workflows. SGLang (Structured Generation Language) excels at complex reasoning tasks and controlled text generation. ## What is SGLang? 
SGLang is a domain-specific language for complex prompting and generation that provides: - 🧠 **Structured reasoning** with multi-step prompts - 🔄 **Control flow** for dynamic generation - 📊 **State management** across generation steps - 🎯 **Guided generation** with constraints - 🔗 **Chain-of-thought** prompting patterns ## Quick Start ```python from chutes.chute import NodeSelector from chutes.chute.template.sglang import build_sglang_chute chute = build_sglang_chute( username="myuser", model_name="microsoft/DialoGPT-medium", revision="main", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16 ) ) ``` This creates a complete SGLang deployment with: - ✅ Structured generation engine - ✅ Multi-step reasoning capabilities - ✅ Custom prompting patterns - ✅ State-aware generation - ✅ Auto-scaling based on demand ## Function Reference ### `build_sglang_chute()` ```python def build_sglang_chute( username: str, model_name: str, revision: str = "main", node_selector: NodeSelector = None, image: str | Image = None, tagline: str = "", readme: str = "", concurrency: int = 1, # SGLang-specific parameters max_new_tokens: int = 512, temperature: float = 0.7, top_p: float = 0.9, guidance_scale: float = 1.0, enable_sampling: bool = True, structured_output: bool = True, **kwargs ) -> Chute: ``` #### Required Parameters - **`username`**: Your Chutes username - **`model_name`**: HuggingFace model identifier #### SGLang Configuration - **`max_new_tokens`**: Maximum tokens to generate (default: 512) - **`temperature`**: Sampling temperature (default: 0.7) - **`top_p`**: Nucleus sampling parameter (default: 0.9) - **`guidance_scale`**: Guidance strength for controlled generation (default: 1.0) - **`enable_sampling`**: Enable probabilistic sampling (default: True) - **`structured_output`**: Enable structured output formatting (default: True) ## Complete Example ```python from chutes.chute import NodeSelector from chutes.chute.template.sglang import build_sglang_chute # Build 
SGLang chute for complex reasoning chute = build_sglang_chute( username="myuser", model_name="microsoft/DialoGPT-medium", revision="main", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16 ), tagline="Advanced reasoning with SGLang", readme=""" # Advanced Reasoning Engine This chute provides structured generation capabilities using SGLang for complex reasoning and multi-step AI workflows. ## Features - Multi-step reasoning - Structured output generation - Chain-of-thought prompting - Guided generation ## API Endpoints - `/generate` - Basic text generation - `/reason` - Multi-step reasoning - `/structured` - Structured output generation """, # SGLang configuration max_new_tokens=1024, temperature=0.8, top_p=0.95, guidance_scale=1.2, structured_output=True ) ``` ## API Endpoints ### Basic Generation ```bash curl -X POST https://myuser-sglang-chute.chutes.ai/generate \ -H "Content-Type: application/json" \ -d '{ "prompt": "Explain quantum computing", "max_tokens": 200, "temperature": 0.7 }' ``` ### Structured Reasoning ```bash curl -X POST https://myuser-sglang-chute.chutes.ai/reason \ -H "Content-Type: application/json" \ -d '{ "problem": "What are the environmental impacts of renewable energy?", "steps": [ "analyze_benefits", "identify_drawbacks", "compare_alternatives", "provide_conclusion" ] }' ``` ### Chain-of-Thought ```bash curl -X POST https://myuser-sglang-chute.chutes.ai/chain-of-thought \ -H "Content-Type: application/json" \ -d '{ "question": "If a train travels 60 mph for 2.5 hours, how far does it go?", "show_reasoning": true }' ``` ## SGLang Programs ### Multi-Step Reasoning ```python @sglang.function def analyze_problem(s, problem): s += f"Problem: {problem}\n\n" s += "Let me think about this step by step:\n\n" s += "Step 1: Understanding the problem\n" s += sglang.gen("understanding", max_tokens=100) s += "\n\n" s += "Step 2: Identifying key factors\n" s += sglang.gen("factors", max_tokens=100) s += "\n\n" s += "Step 3: Analysis\n" s += 
sglang.gen("analysis", max_tokens=150) s += "\n\n" s += "Conclusion:\n" s += sglang.gen("conclusion", max_tokens=100) return s ``` ### Structured Output ```python @sglang.function def extract_information(s, text): s += f"Text: {text}\n\n" s += "Extract the following information:\n\n" s += "Name: " s += sglang.gen("name", max_tokens=20, stop=["\n"]) s += "\n" s += "Age: " s += sglang.gen("age", max_tokens=10, regex=r"\d+") s += "\n" s += "Occupation: " s += sglang.gen("occupation", max_tokens=30, stop=["\n"]) s += "\n" s += "Summary: " s += sglang.gen("summary", max_tokens=100) return s ``` ### Guided Generation ```python @sglang.function def generate_story(s, theme, character): s += f"Write a story about {character} with the theme of {theme}.\n\n" # Structured story generation s += "Title: " s += sglang.gen("title", max_tokens=20, stop=["\n"]) s += "\n\n" s += "Setting: " s += sglang.gen("setting", max_tokens=50, stop=["\n"]) s += "\n\n" s += "Plot:\n" for i in range(3): s += f"Chapter {i+1}: " s += sglang.gen(f"chapter_{i+1}", max_tokens=200) s += "\n\n" s += "Conclusion: " s += sglang.gen("conclusion", max_tokens=100) return s ``` ## Advanced Features ### Custom Templates ```python # Define custom reasoning template reasoning_template = """ Problem: {problem} Analysis Framework: 1. Context: What background information is relevant? 2. Constraints: What limitations or requirements exist? 3. Options: What are the possible approaches or solutions? 4. Evaluation: What are the pros and cons of each option? 5. Conclusion: What is the best approach and why? 
Let me work through this systematically: """ chute = build_sglang_chute( username="myuser", model_name="microsoft/DialoGPT-medium", custom_templates={"reasoning": reasoning_template}, guidance_scale=1.5 # Higher guidance for structured output ) ``` ### Constraint-Based Generation ```python # Configure constraints for specific output formats chute = build_sglang_chute( username="myuser", model_name="microsoft/DialoGPT-medium", constraints={ "json_format": True, "max_length": 500, "required_fields": ["summary", "key_points", "conclusion"], "stop_sequences": ["END", "STOP"] } ) ``` ### Multi-Modal Reasoning ```python # Enable multi-modal capabilities chute = build_sglang_chute( username="myuser", model_name="microsoft/DialoGPT-medium", multimodal=True, vision_enabled=True, audio_enabled=False ) ``` ## Model Recommendations ### Small Models (< 7B parameters) ```python # Good for basic structured generation NodeSelector( gpu_count=1, min_vram_gb_per_gpu=8, include=["rtx4090", "rtx3090"] ) # Recommended models: # - microsoft/DialoGPT-medium # - google/flan-t5-base # - microsoft/DialoGPT-small ``` ### Medium Models (7B - 13B parameters) ```python # Optimal for complex reasoning NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16, include=["rtx4090", "a100"] ) # Recommended models: # - microsoft/DialoGPT-large # - google/flan-t5-large # - meta-llama/Llama-2-7b-chat-hf ``` ### Large Models (13B+ parameters) ```python # Best for advanced reasoning NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24, include=["a100", "h100"] ) # Recommended models: # - meta-llama/Llama-2-13b-chat-hf # - microsoft/DialoGPT-xlarge # - google/flan-ul2 ``` ## Use Cases ### 1. 
**Educational Tutoring** ```python tutoring_chute = build_sglang_chute( username="myuser", model_name="microsoft/DialoGPT-medium", tagline="AI Tutor with structured explanations", custom_templates={ "explanation": "Explain {topic} step by step with examples", "quiz": "Create 5 questions about {topic} with explanations" } ) ``` ### 2. **Business Analysis** ```python analysis_chute = build_sglang_chute( username="myuser", model_name="microsoft/DialoGPT-medium", structured_output=True, constraints={ "format": "business_report", "sections": ["executive_summary", "analysis", "recommendations"] } ) ``` ### 3. **Creative Writing** ```python writing_chute = build_sglang_chute( username="myuser", model_name="microsoft/DialoGPT-medium", temperature=0.9, # Higher creativity top_p=0.95, enable_sampling=True ) ``` ### 4. **Code Generation** ```python code_chute = build_sglang_chute( username="myuser", model_name="microsoft/DialoGPT-medium", temperature=0.3, # Lower for more precise code structured_output=True, constraints={ "language": "python", "include_comments": True, "include_tests": True } ) ``` ## Performance Optimization ### Memory Optimization ```python # Optimize for memory efficiency chute = build_sglang_chute( username="myuser", model_name="microsoft/DialoGPT-medium", max_new_tokens=256, # Limit generation length batch_size=4, # Smaller batches gradient_checkpointing=True ) ``` ### Speed Optimization ```python # Optimize for speed chute = build_sglang_chute( username="myuser", model_name="microsoft/DialoGPT-medium", temperature=0.0, # Deterministic (faster) top_p=1.0, # No nucleus sampling enable_caching=True, # Cache intermediate results compile_model=True # JIT compilation ) ``` ## Testing Your SGLang Chute ### Python Client ```python import requests # Test basic generation response = requests.post( "https://myuser-sglang-chute.chutes.ai/generate", json={ "prompt": "Analyze the benefits of renewable energy", "max_tokens": 300, "structured": True } ) result = 
response.json() print(result["generated_text"]) ``` ### Complex Reasoning Test ```python # Test multi-step reasoning response = requests.post( "https://myuser-sglang-chute.chutes.ai/reason", json={ "problem": "Should companies adopt remote work policies?", "reasoning_steps": [ "identify_stakeholders", "analyze_benefits", "analyze_drawbacks", "consider_implementation", "provide_recommendation" ] } ) reasoning = response.json() for step in reasoning["steps"]: print(f"{step['name']}: {step['output']}") ``` ## Troubleshooting ### Common Issues **Generation too slow?** - Reduce `max_new_tokens` - Lower `temperature` for deterministic output - Disable sampling with `enable_sampling=False` **Output not structured enough?** - Increase `guidance_scale` - Enable `structured_output=True` - Add custom constraints **Memory errors?** - Reduce batch size - Use smaller model - Increase GPU VRAM requirements **Inconsistent outputs?** - Lower temperature for more deterministic results - Use seed for reproducible generation - Add stronger constraints ## Best Practices ### 1. **Template Design** ```python # Good: Clear, structured templates template = """ Task: {task} Requirements: - Be specific and detailed - Provide examples - Explain reasoning Response: """ # Bad: Vague, unstructured template = "Do {task}" ``` ### 2. **Constraint Configuration** ```python # Effective constraints constraints = { "max_length": 500, "required_sections": ["introduction", "analysis", "conclusion"], "format": "markdown", "tone": "professional" } ``` ### 3. **Prompt Engineering** ```python # Structure prompts for better results def create_analysis_prompt(topic): return f""" Analyze the topic: {topic} Please structure your response as: 1. Overview (2-3 sentences) 2. Key factors (bullet points) 3. Analysis (detailed explanation) 4. 
Conclusion (summary and implications) Analysis: """ ``` ## Next Steps - **[VLLM Template](/docs/templates/vllm)** - High-performance LLM serving - **[Custom Templates Guide](/docs/guides/custom-templates)** - Build custom templates - **[Advanced Prompting](/docs/guides/advanced-prompting)** - Prompt engineering techniques - **[Multi-Model Workflows](/docs/guides/multi-model)** - Combine multiple models --- ## SOURCE: https://chutes.ai/docs/templates/tei # TEI Template The **TEI (Text Embeddings Inference) template** provides optimized text embedding generation using Hugging Face's high-performance inference server. Perfect for semantic search, similarity detection, and RAG applications. ## What is TEI? Text Embeddings Inference (TEI) is a specialized inference server for embedding models that provides: - ⚡ **Optimized performance** with Rust-based implementation - 📊 **Batch processing** for efficient throughput - 🔄 **Automatic batching** and request queuing - 📏 **Embedding normalization** and pooling options - 🎯 **Production-ready** with health checks and metrics ## Quick Start ```python from chutes.chute import NodeSelector from chutes.chute.template.tei import build_tei_chute chute = build_tei_chute( username="myuser", model_name="sentence-transformers/all-MiniLM-L6-v2", revision="main", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=8 ) ) ``` This creates a complete TEI deployment with: - ✅ Optimized embedding inference server - ✅ OpenAI-compatible embeddings API - ✅ Automatic request batching - ✅ Built-in normalization - ✅ Auto-scaling based on demand ## Function Reference ### `build_tei_chute()` ```python def build_tei_chute( username: str, model_name: str, revision: str = "main", node_selector: NodeSelector = None, image: str | Image = None, tagline: str = "", readme: str = "", concurrency: int = 1, # TEI-specific parameters max_batch_tokens: int = 16384, max_batch_requests: int = 512, max_concurrent_requests: int = 512, pooling: str = "mean", 
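# Note: "mean" averages all token embeddings, "cls" uses only the first
# token; match the pooling mode to how the checkpoint was trained (many
# BGE-style models expect "cls", sentence-transformers models "mean").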
normalize: bool = True, trust_remote_code: bool = False, **kwargs ) -> Chute: ``` #### Required Parameters - **`username`**: Your Chutes username - **`model_name`**: HuggingFace embedding model identifier #### TEI Configuration - **`max_batch_tokens`**: Maximum tokens per batch (default: 16384) - **`max_batch_requests`**: Maximum requests per batch (default: 512) - **`max_concurrent_requests`**: Maximum concurrent requests (default: 512) - **`pooling`**: Pooling strategy - "mean", "cls", or "max" (default: "mean") - **`normalize`**: Whether to normalize embeddings (default: True) - **`trust_remote_code`**: Allow custom model code execution (default: False) ## Complete Example ```python from chutes.chute import NodeSelector from chutes.chute.template.tei import build_tei_chute # Build TEI chute for embedding generation chute = build_tei_chute( username="myuser", model_name="sentence-transformers/all-MiniLM-L6-v2", revision="main", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=8 ), tagline="High-performance text embeddings", readme=""" # Text Embeddings Service Fast and efficient text embedding generation using TEI. 
## Features - OpenAI-compatible embeddings API - Automatic batching and optimization - Normalized embeddings for similarity search - Production-ready performance ## API Endpoints - `/v1/embeddings` - Generate embeddings - `/embed` - Alternative embedding endpoint - `/health` - Health check """, # TEI optimization max_batch_tokens=32768, max_batch_requests=256, pooling="mean", normalize=True ) ``` ## API Endpoints ### OpenAI-Compatible Embeddings ```bash curl -X POST https://myuser-tei-chute.chutes.ai/v1/embeddings \ -H "Content-Type: application/json" \ -d '{ "model": "sentence-transformers/all-MiniLM-L6-v2", "input": [ "The quick brown fox jumps over the lazy dog", "Machine learning is transforming technology" ] }' ``` ### Single Text Embedding ```bash curl -X POST https://myuser-tei-chute.chutes.ai/embed \ -H "Content-Type: application/json" \ -d '{ "inputs": "This is a sample text for embedding generation" }' ``` ### Batch Processing ```bash curl -X POST https://myuser-tei-chute.chutes.ai/embed \ -H "Content-Type: application/json" \ -d '{ "inputs": [ "First document to embed", "Second document for embedding", "Third text for similarity search" ] }' ``` ## Model Recommendations ### Small & Fast Models ```python # Lightweight, fast inference NodeSelector( gpu_count=1, min_vram_gb_per_gpu=4, include=["rtx3090", "rtx4090"] ) # Recommended models: # - sentence-transformers/all-MiniLM-L6-v2 (384 dim) # - sentence-transformers/all-MiniLM-L12-v2 (384 dim) # - microsoft/codebert-base (768 dim) ``` ### Balanced Performance Models ```python # Good balance of speed and quality NodeSelector( gpu_count=1, min_vram_gb_per_gpu=8, include=["rtx4090", "a100"] ) # Recommended models: # - sentence-transformers/all-mpnet-base-v2 (768 dim) # - sentence-transformers/multi-qa-mpnet-base-dot-v1 (768 dim) # - thenlper/gte-base (768 dim) ``` ### High-Quality Models ```python # Best embedding quality NodeSelector( gpu_count=1, min_vram_gb_per_gpu=12, include=["a100", "h100"] ) # 
Recommended models: # - sentence-transformers/all-mpnet-base-v2 (768 dim) # - intfloat/e5-large-v2 (1024 dim) # - BAAI/bge-large-en-v1.5 (1024 dim) ``` ## Use Cases ### 1. **Semantic Search** ```python search_chute = build_tei_chute( username="myuser", model_name="sentence-transformers/multi-qa-mpnet-base-dot-v1", tagline="Semantic search embeddings", max_batch_tokens=32768, # Handle large documents normalize=True # Important for similarity search ) ``` ### 2. **Document Similarity** ```python similarity_chute = build_tei_chute( username="myuser", model_name="sentence-transformers/all-mpnet-base-v2", tagline="Document similarity service", pooling="mean", normalize=True ) ``` ### 3. **Code Embeddings** ```python code_chute = build_tei_chute( username="myuser", model_name="microsoft/codebert-base", tagline="Code similarity and search", max_batch_tokens=16384, # Typical code snippet length trust_remote_code=True # May be needed for code models ) ``` ### 4. **Multilingual Embeddings** ```python multilingual_chute = build_tei_chute( username="myuser", model_name="sentence-transformers/paraphrase-multilingual-mpnet-base-v2", tagline="Multilingual text embeddings", max_batch_requests=1024 # Handle diverse languages efficiently ) ``` ## Performance Optimization ### Throughput Optimization ```python # Maximize throughput for batch processing chute = build_tei_chute( username="myuser", model_name="sentence-transformers/all-MiniLM-L6-v2", max_batch_tokens=65536, # Large batches max_batch_requests=1024, # Many requests max_concurrent_requests=2048, # High concurrency concurrency=8 # Multiple chute instances ) ``` ### Latency Optimization ```python # Minimize latency for real-time applications chute = build_tei_chute( username="myuser", model_name="sentence-transformers/all-MiniLM-L6-v2", max_batch_tokens=4096, # Smaller batches max_batch_requests=32, # Fewer requests per batch max_concurrent_requests=128 # Lower concurrency ) ``` ### Memory Optimization ```python # Optimize 
for memory usage chute = build_tei_chute( username="myuser", model_name="sentence-transformers/all-MiniLM-L6-v2", max_batch_tokens=8192, # Moderate batch size max_batch_requests=256, # Moderate requests node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=6 # Conservative memory ) ) ``` ## Testing Your TEI Chute ### Python Client ```python import requests import numpy as np # Generate embeddings response = requests.post( "https://myuser-tei-chute.chutes.ai/v1/embeddings", json={ "model": "sentence-transformers/all-MiniLM-L6-v2", "input": [ "The quick brown fox", "A fast brown animal", "The weather is nice today" ] } ) result = response.json() embeddings = [item["embedding"] for item in result["data"]] # Calculate similarity emb1 = np.array(embeddings[0]) emb2 = np.array(embeddings[1]) emb3 = np.array(embeddings[2]) similarity_1_2 = np.dot(emb1, emb2) # Should be high similarity_1_3 = np.dot(emb1, emb3) # Should be low print(f"Similarity fox vs animal: {similarity_1_2:.3f}") print(f"Similarity fox vs weather: {similarity_1_3:.3f}") ``` ### OpenAI Client ```python from openai import OpenAI # Use OpenAI client with your chute client = OpenAI( api_key="dummy", # Not needed for Chutes base_url="https://myuser-tei-chute.chutes.ai/v1" ) # Generate embeddings response = client.embeddings.create( model="sentence-transformers/all-MiniLM-L6-v2", input=[ "Document for semantic search", "Query for finding similar content" ] ) for i, item in enumerate(response.data): print(f"Embedding {i}: {len(item.embedding)} dimensions") ``` ### Batch Processing Test ```python import asyncio import aiohttp import time async def test_batch_performance(): """Test batch processing performance.""" # Generate test texts texts = [f"This is test document number {i} for embedding generation." 
for i in range(100)] # Test batch processing start_time = time.time() async with aiohttp.ClientSession() as session: async with session.post( "https://myuser-tei-chute.chutes.ai/embed", json={"inputs": texts} ) as response: result = await response.json() batch_time = time.time() - start_time print(f"Batch processing:") print(f" Texts: {len(texts)}") print(f" Time: {batch_time:.2f}s") print(f" Throughput: {len(texts)/batch_time:.1f} texts/sec") # Test individual requests start_time = time.time() async with aiohttp.ClientSession() as session: tasks = [] for text in texts[:10]: # Test subset for fairness task = session.post( "https://myuser-tei-chute.chutes.ai/embed", json={"inputs": text} ) tasks.append(task) responses = await asyncio.gather(*tasks) individual_time = time.time() - start_time print(f"\nIndividual requests:") print(f" Texts: 10") print(f" Time: {individual_time:.2f}s") print(f" Throughput: {10/individual_time:.1f} texts/sec") print(f" Speedup: {(individual_time*10)/(batch_time):.1f}x") asyncio.run(test_batch_performance()) ``` ## Integration Examples ### Semantic Search with Vector Database ```python import requests import numpy as np from pinecone import Pinecone # Initialize vector database pc = Pinecone(api_key="your-api-key") index = pc.Index("semantic-search") def embed_text(text): """Generate embedding for text.""" response = requests.post( "https://myuser-tei-chute.chutes.ai/v1/embeddings", json={ "model": "sentence-transformers/all-mpnet-base-v2", "input": text } ) return response.json()["data"][0]["embedding"] def index_documents(documents): """Index documents for search.""" vectors = [] for i, doc in enumerate(documents): embedding = embed_text(doc) vectors.append({ "id": str(i), "values": embedding, "metadata": {"text": doc} }) index.upsert(vectors) def search_documents(query, top_k=5): """Search for similar documents.""" query_embedding = embed_text(query) results = index.query( vector=query_embedding, top_k=top_k, include_metadata=True ) 
return [(match.score, match.metadata["text"]) for match in results.matches] # Example usage documents = [ "Python is a programming language", "Machine learning uses algorithms", "The weather is sunny today", "Neural networks are inspired by the brain" ] index_documents(documents) results = search_documents("What is artificial intelligence?") for score, text in results: print(f"Score: {score:.3f} - {text}") ``` ### Document Clustering ```python import requests import numpy as np from sklearn.cluster import KMeans from sklearn.decomposition import PCA import matplotlib.pyplot as plt def embed_documents(documents): """Generate embeddings for multiple documents.""" response = requests.post( "https://myuser-tei-chute.chutes.ai/v1/embeddings", json={ "model": "sentence-transformers/all-mpnet-base-v2", "input": documents } ) return [item["embedding"] for item in response.json()["data"]] def cluster_documents(documents, n_clusters=3): """Cluster documents based on embeddings.""" # Generate embeddings embeddings = embed_documents(documents) embeddings_array = np.array(embeddings) # Perform clustering kmeans = KMeans(n_clusters=n_clusters, random_state=42) clusters = kmeans.fit_predict(embeddings_array) # Visualize with PCA pca = PCA(n_components=2) embeddings_2d = pca.fit_transform(embeddings_array) plt.figure(figsize=(10, 8)) scatter = plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], c=clusters, cmap='viridis') plt.colorbar(scatter) plt.title('Document Clustering') plt.xlabel('PCA Component 1') plt.ylabel('PCA Component 2') # Add document labels for i, doc in enumerate(documents): plt.annotate(f"Doc {i}", (embeddings_2d[i, 0], embeddings_2d[i, 1])) plt.show() return clusters # Example usage documents = [ "Python programming language tutorial", "JavaScript web development guide", "Machine learning with neural networks", "Deep learning and artificial intelligence", "HTML and CSS for beginners", "React framework for web apps", "Natural language processing techniques", 
"Computer vision and image recognition" ] clusters = cluster_documents(documents) # Group documents by cluster for cluster_id in range(max(clusters) + 1): print(f"\nCluster {cluster_id}:") for i, doc in enumerate(documents): if clusters[i] == cluster_id: print(f" - {doc}") ``` ## Troubleshooting ### Common Issues **Slow embedding generation?** - Increase `max_batch_tokens` for better throughput - Use a smaller/faster model - Optimize hardware with more GPU memory **Out of memory errors?** - Reduce `max_batch_tokens` - Decrease `max_batch_requests` - Use a smaller model - Increase GPU VRAM requirements **Poor embedding quality?** - Use a larger, more sophisticated model - Ensure proper text preprocessing - Check if the model matches your domain **High latency?** - Reduce batch sizes for faster response - Use a smaller/faster model - Consider multiple smaller instances ### Performance Monitoring ```python import requests import time def monitor_performance(): """Monitor TEI chute performance.""" # Test different batch sizes batch_sizes = [1, 5, 10, 25, 50] test_text = "This is a test document for performance monitoring." for batch_size in batch_sizes: texts = [test_text] * batch_size start_time = time.time() response = requests.post( "https://myuser-tei-chute.chutes.ai/embed", json={"inputs": texts} ) end_time = time.time() if response.status_code == 200: throughput = batch_size / (end_time - start_time) print(f"Batch size {batch_size}: {throughput:.1f} texts/sec") else: print(f"Batch size {batch_size}: Error {response.status_code}") monitor_performance() ``` ## Best Practices ### 1. **Model Selection** ```python # For general text similarity model_name = "sentence-transformers/all-mpnet-base-v2" # For search applications model_name = "sentence-transformers/multi-qa-mpnet-base-dot-v1" # For code similarity model_name = "microsoft/codebert-base" # For multilingual applications model_name = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2" ``` ### 2. 
**Batch Size Tuning** ```python # For real-time applications (low latency) max_batch_tokens = 4096 max_batch_requests = 32 # For bulk processing (high throughput) max_batch_tokens = 32768 max_batch_requests = 512 # For balanced performance max_batch_tokens = 16384 max_batch_requests = 256 ``` ### 3. **Text Preprocessing** ```python def preprocess_text(text): """Preprocess text for better embeddings.""" # Remove excessive whitespace text = " ".join(text.split()) # Normalize length (very long texts may be truncated) if len(text) > 5000: # Adjust based on model's max length text = text[:5000] return text.strip() # Apply preprocessing before embedding texts = [preprocess_text(text) for text in raw_texts] ``` ### 4. **Error Handling** ```python import requests from tenacity import retry, stop_after_attempt, wait_exponential @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10)) def generate_embeddings(texts): """Generate embeddings with retry logic.""" try: response = requests.post( "https://myuser-tei-chute.chutes.ai/v1/embeddings", json={ "model": "sentence-transformers/all-mpnet-base-v2", "input": texts }, timeout=30 ) response.raise_for_status() return response.json() except requests.exceptions.RequestException as e: print(f"Request failed: {e}") raise ``` ## Next Steps - **[VLLM Template](/docs/templates/vllm)** - High-performance language model serving - **[Diffusion Template](/docs/templates/diffusion)** - Image generation capabilities - **[Vector Databases Guide](/docs/guides/vector-databases)** - Integration with vector stores - **[Semantic Search Example](/docs/examples/semantic-search)** - Complete search application --- ## SOURCE: https://chutes.ai/docs/templates/vllm # VLLM Template The **VLLM template** is the most popular way to deploy large language models on Chutes. It provides a high-performance, OpenAI-compatible API server powered by [vLLM](https://docs.vllm.ai/), optimized for fast inference and high throughput. 
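Before digging into configuration, it helps to see the request/response shape the OpenAI-compatible server uses. The helper below is a minimal sketch (the model name and the mocked response are illustrative placeholders, not tied to any specific deployment):

```python
def build_chat_payload(prompt: str, model: str = "microsoft/DialoGPT-medium") -> dict:
    """Build a minimal OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 100,
        "temperature": 0.7,
    }

def extract_reply(response: dict) -> str:
    """Pull the assistant's text out of an OpenAI-compatible response."""
    return response["choices"][0]["message"]["content"]

# Mocked response illustrating the OpenAI-compatible shape returned by the server.
mock_response = {"choices": [{"message": {"role": "assistant", "content": "Hello!"}}]}
print(extract_reply(mock_response))  # Hello!
```

POST the payload to your chute's `/v1/chat/completions` endpoint with any HTTP client; the examples later on this page show both `aiohttp` and `curl` variants.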
## What is VLLM? VLLM is a fast and memory-efficient inference engine for large language models that provides: - 📈 **High throughput** serving with PagedAttention - 🧠 **Memory efficiency** with optimized attention algorithms - 🔄 **Continuous batching** for better GPU utilization - 🌐 **OpenAI-compatible API** for easy integration - ⚡ **Multi-GPU support** for large models ## Quick Start ```python from chutes.chute import NodeSelector from chutes.chute.template.vllm import build_vllm_chute chute = build_vllm_chute( username="myuser", model_name="microsoft/DialoGPT-medium", revision="main", # Required: locks model to specific version node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16 ) ) ``` That's it! This creates a complete VLLM deployment with: - ✅ Automatic model downloading and caching - ✅ OpenAI-compatible `/v1/chat/completions` endpoint - ✅ Built-in streaming support - ✅ Optimized inference settings - ✅ Auto-scaling based on demand ## Function Reference ### `build_vllm_chute()` ```python def build_vllm_chute( username: str, model_name: str, node_selector: NodeSelector, revision: str, image: str | Image = VLLM, tagline: str = "", readme: str = "", concurrency: int = 32, engine_args: Dict[str, Any] = {}) -> VLLMChute ``` #### Required Parameters **`username: str`** Your Chutes username. **`model_name: str`** HuggingFace model identifier (e.g., `"microsoft/DialoGPT-medium"`). **`node_selector: NodeSelector`** Hardware requirements specification. **`revision: str`** **Required.** Git revision/commit hash to lock the model version. Use the current `main` branch commit for reproducible deployments. ```python # Get current revision from HuggingFace revision = "cb765b56fbc11c61ac2a82ec777e3036964b975c" ``` #### Optional Parameters **`image: str | Image = VLLM`** Docker image to use. Defaults to the official Chutes VLLM image. **`tagline: str = ""`** Short description for your chute. **`readme: str = ""`** Markdown documentation for your chute. 
**`concurrency: int = 32`** Maximum concurrent requests per instance. **`engine_args: Dict[str, Any] = {}`** VLLM engine configuration options. See [Engine Arguments](#engine-arguments). ## Engine Arguments The `engine_args` parameter allows you to configure VLLM's behavior: ### Memory and Performance ```python engine_args = { # Memory utilization (0.0-1.0) "gpu_memory_utilization": 0.95, # Maximum sequence length "max_model_len": 4096, # Maximum number of sequences to process in parallel "max_num_seqs": 256, # Enable chunked prefill for long sequences "enable_chunked_prefill": True, # Maximum number of tokens in a single chunk "max_num_batched_tokens": 8192, } ``` ### Model Loading ```python engine_args = { # Tensor parallelism (automatically set based on GPU count) "tensor_parallel_size": 2, # Pipeline parallelism "pipeline_parallel_size": 1, # Data type for model weights "dtype": "auto", # or "float16", "bfloat16", "float32" # Quantization method "quantization": "awq", # or "gptq", "squeezellm", etc. 
# Trust remote code (for custom models) "trust_remote_code": True, } ``` ### Advanced Features ```python engine_args = { # Enable prefix caching "enable_prefix_caching": True, # Speculative decoding "speculative_model": "microsoft/DialoGPT-small", "num_speculative_tokens": 5, # Guided generation "guided_decoding_backend": "outlines", # Disable logging for better performance "disable_log_stats": True, "disable_log_requests": True, } ``` ## Hardware Configuration ### GPU Requirements Choose hardware based on your model size: #### Small Models (< 7B parameters) ```python node_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16, include=["l40", "a6000", "a100"] ) ``` #### Medium Models (7B - 13B parameters) ```python node_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24, include=["a100", "h100"] ) ``` #### Large Models (13B - 70B parameters) ```python node_selector = NodeSelector( gpu_count=2, min_vram_gb_per_gpu=40, include=["a100", "h100"] ) ``` #### Huge Models (70B+ parameters) ```python node_selector = NodeSelector( gpu_count=4, min_vram_gb_per_gpu=80, include=["h100"] ) ``` ### GPU Type Selection **High Performance:** ```python include=["h100", "a100"] # Latest, fastest GPUs ``` **Balanced:** ```python include=["a100", "l40", "a6000"] # Good performance/cost ratio ``` **Budget:** ```python exclude=["h100"] # Exclude most expensive GPUs ``` ## API Endpoints The VLLM template provides OpenAI-compatible endpoints: ### Chat Completions **POST `/v1/chat/completions`** ```python import aiohttp async def chat_completion(): url = "https://myuser-mychute.chutes.ai/v1/chat/completions" payload = { "model": "microsoft/DialoGPT-medium", "messages": [ {"role": "user", "content": "Hello! 
How are you?"} ], "max_tokens": 100, "temperature": 0.7, "stream": False } async with aiohttp.ClientSession() as session: async with session.post(url, json=payload) as response: result = await response.json() print(result["choices"][0]["message"]["content"]) ``` ### Streaming Chat ```python import json import aiohttp async def streaming_chat(): url = "https://myuser-mychute.chutes.ai/v1/chat/completions" payload = { "model": "microsoft/DialoGPT-medium", "messages": [ {"role": "user", "content": "Tell me a story"} ], "max_tokens": 200, "temperature": 0.8, "stream": True } async with aiohttp.ClientSession() as session: async with session.post(url, json=payload) as response: async for line in response.content: if not line.startswith(b"data: "): continue chunk = line[6:].strip() if chunk == b"[DONE]": break data = json.loads(chunk) if data.get("choices"): delta = data["choices"][0]["delta"] if "content" in delta: print(delta["content"], end="") ``` ### Text Completions **POST `/v1/completions`** ```python payload = { "model": "microsoft/DialoGPT-medium", "prompt": "The future of AI is", "max_tokens": 50, "temperature": 0.7 } ``` ### Tokenization **POST `/tokenize`** ```python payload = { "model": "microsoft/DialoGPT-medium", "text": "Hello, world!" } # Returns: {"tokens": [1, 2, 3, ...]} ``` **POST `/detokenize`** ```python payload = { "model": "microsoft/DialoGPT-medium", "tokens": [1, 2, 3] } # Returns: {"text": "Hello, world!"} ``` ## Complete Examples ### Basic Chat Model ```python from chutes.chute import NodeSelector from chutes.chute.template.vllm import build_vllm_chute chute = build_vllm_chute( username="myuser", model_name="microsoft/DialoGPT-medium", revision="main", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16 ), tagline="Conversational AI chatbot", readme=""" # My Chat Bot A conversational AI powered by DialoGPT. ## Usage Send POST requests to `/v1/chat/completions` with your messages.
""", concurrency=16 ) ``` ### High-Performance Large Model ```python chute = build_vllm_chute( username="myuser", model_name="meta-llama/Llama-2-70b-chat-hf", revision="latest-commit-hash", node_selector=NodeSelector( gpu_count=4, min_vram_gb_per_gpu=80, include=["h100", "a100"] ), engine_args={ "gpu_memory_utilization": 0.95, "max_model_len": 4096, "max_num_seqs": 128, "enable_chunked_prefill": True, "trust_remote_code": True, }, concurrency=64 ) ``` ### Code Generation Model ```python chute = build_vllm_chute( username="myuser", model_name="Phind/Phind-CodeLlama-34B-v2", revision="main", node_selector=NodeSelector( gpu_count=2, min_vram_gb_per_gpu=40 ), engine_args={ "max_model_len": 8192, # Longer context for code # Note: sampling params like temperature are set per request, not in engine_args }, tagline="Advanced code generation AI" ) ``` ### Quantized Model for Efficiency ```python chute = build_vllm_chute( username="myuser", model_name="TheBloke/Llama-2-13B-chat-AWQ", revision="main", node_selector=NodeSelector( gpu_count=1, min_vram_gb_per_gpu=16 # Much less VRAM needed ), engine_args={ "quantization": "awq", "gpu_memory_utilization": 0.9, } ) ``` ## Testing Your Deployment ### Local Testing Before deploying, test your configuration: ```python # Add to your chute file if __name__ == "__main__": import asyncio async def test(): response = await chute.chat({ "model": "your-model-name", "messages": [ {"role": "user", "content": "Hello!"} ] }) print(response) asyncio.run(test()) ``` Run locally: ```bash chutes run my_vllm_chute:chute --dev ``` ### Production Testing After deployment: ```bash curl -X POST https://myuser-mychute.chutes.ai/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "microsoft/DialoGPT-medium", "messages": [{"role": "user", "content": "Test message"}], "max_tokens": 50 }' ``` ## Performance Optimization ### Memory Optimization ```python engine_args = { # Use maximum available memory "gpu_memory_utilization": 0.95, # Enable memory-efficient
attention "enable_chunked_prefill": True, # Optimize for your typical sequence length "max_model_len": 2048, # Adjust based on your use case } ``` ### Throughput Optimization ```python engine_args = { # Increase parallel sequences "max_num_seqs": 512, # Larger batch sizes "max_num_batched_tokens": 16384, # Disable logging in production "disable_log_stats": True, "disable_log_requests": True, } ``` ### Latency Optimization ```python engine_args = { # Smaller batch sizes for lower latency "max_num_seqs": 32, # Enable prefix caching "enable_prefix_caching": True, # Use speculative decoding for faster generation "speculative_model": "smaller-model-name", "num_speculative_tokens": 5, } ``` ## Troubleshooting ### Common Issues **Out of Memory Errors** ```python # Reduce memory usage engine_args = { "gpu_memory_utilization": 0.8, # Lower from 0.95 "max_model_len": 2048, # Reduce max length "max_num_seqs": 64, # Fewer parallel sequences } ``` **Slow Model Loading** ```python # The model downloads on first startup # Check logs: chutes chutes get your-chute # Subsequent starts are fast due to caching ``` **Model Not Found** ```python # Ensure model exists and is public # Check: https://huggingface.co/microsoft/DialoGPT-medium # Use exact model name from HuggingFace ``` **Deployment Fails** ```bash # Check image build status chutes images list --name your-image # Verify configuration python -c "from my_chute import chute; print(chute.node_selector)" ``` ### Performance Issues **Low Throughput** - Increase `max_num_seqs` and `max_num_batched_tokens` - Use more GPUs with `tensor_parallel_size` - Enable `enable_chunked_prefill` **High Latency** - Reduce `max_num_seqs` for lower batching - Enable `enable_prefix_caching` - Use faster GPU types (H100 > A100 > L40) **Memory Issues** - Lower `gpu_memory_utilization` - Reduce `max_model_len` - Consider quantized models (AWQ, GPTQ) ## Best Practices ### 1. 
Model Selection - Use quantized models (AWQ/GPTQ) for better efficiency - Choose the smallest model that meets your quality requirements - Test with different model variants ### 2. Hardware Sizing - Start with minimum requirements and scale up - Monitor GPU utilization in the dashboard - Use `include`/`exclude` filters for cost optimization ### 3. Performance Tuning - Set `revision` to lock model versions - Tune `engine_args` for your specific use case - Enable logging initially, disable in production ### 4. Monitoring - Check the Chutes dashboard for metrics - Monitor request latency and throughput - Set up alerts for failures ## Advanced Features ### Custom Chat Templates ```python engine_args = { "chat_template": """ {%- for message in messages %} {%- if message['role'] == 'user' %} Human: {{ message['content'] }} {%- elif message['role'] == 'assistant' %} Assistant: {{ message['content'] }} {%- endif %} {%- endfor %} Assistant: """ } ``` ### Tool Calling ```python engine_args = { "tool_call_parser": "mistral", "enable_auto_tool_choice": True, } ``` ### Guided Generation ```python engine_args = { "guided_decoding_backend": "outlines", } # Then in your requests: { "guided_json": {"type": "object", "properties": {"name": {"type": "string"}}} } ``` ## Migration from Other Platforms ### From OpenAI Replace the base URL and use your model name: ```python # Before (OpenAI) client = OpenAI(api_key="sk-...") # After (Chutes) client = OpenAI( api_key="dummy", # Not needed for Chutes base_url="https://myuser-mychute.chutes.ai/v1" ) ``` ### From Hugging Face Transformers VLLM is much faster than transformers for serving: ```python # Before (Transformers) from transformers import AutoModelForCausalLM model = AutoModelForCausalLM.from_pretrained("model-name") # After (Chutes VLLM) chute = build_vllm_chute( username="myuser", model_name="model-name", # ... 
configuration ) ``` ## Next Steps - **[SGLang Template](/docs/templates/sglang)** - Alternative high-performance LLM serving - **[Custom Images](/docs/guides/custom-images)** - Build your own VLLM images - **[Streaming Guide](/docs/guides/streaming)** - Advanced streaming patterns - **[Examples](/docs/examples/llm-chat)** - Complete application examples --- ## SOURCE: https://chutes.ai/docs/miner-resources/overview # Mining on Chutes The goal of mining on chutes is to provide as much compute as possible, optimizing for cold start times (running new applications or applications that have been preempted). Everything is automated with kubernetes, and coordinated by the `gepetto.py` script to optimize for cost efficiency and maximize your share of compute. Incentives are based on total compute time (including bounties awarded for being first to provide inference on a cold app). You should probably run a wide variety of GPUs, from very cheap (a10, a5000, t4, etc.) to very powerful (8x h100 nodes). Never register more than one UID, since it will just reduce your total compute time and you'll compete with yourself pointlessly. Just add capacity to one miner. Incentives/weights are calculated from a 7-day sum of compute, so be patient when you start mining. We want high-quality, stable miners in it for the long haul! ## Component Overview ### Provisioning/management tools ### Ansible While not strictly necessary, we _highly_ encourage all miners to use our provided [ansible](https://github.com/ansible/ansible) scripts to provision servers. There are many nuances and requirements that are quite difficult to set up manually. _More information on using the ansible scripts in subsequent sections._ ### Wireguard Wireguard is a fast, secure VPN service, set up by the ansible provisioning, which allows your nodes to communicate when they are not all on the same internal network.
It is often the case that you'll want CPU instances on one provider (AWS, Google, etc.), and GPU instances on another (Latitude, Massed Compute, etc.), and you may have several providers for each due to inventory. By installing Wireguard, your kubernetes cluster can span any number of providers without issue. _**this is installed and configured automatically by ansible scripts**_ ### Kubernetes (K3s) The entirety of the chutes miner must run within a [kubernetes](https://kubernetes.io/) cluster. We use **K3s**, which is handled automatically by the ansible scripts. If you choose not to use K3s, you will need to modify or forgo the provided ansible scripts accordingly. _**this is installed and configured automatically by ansible scripts**_ ### Miner Components _There are many components and moving parts to the system, so before you do anything, please familiarize yourself with each!_ ### Postgres We make heavy use of SQLAlchemy/postgres throughout chutes. All servers, GPUs, deployments, etc., are tracked in postgresql, which is deployed as a statefulset with a persistent volume claim within your kubernetes cluster. _**this is installed and configured automatically when deploying via helm charts**_ ### Redis Redis is primarily used for its pubsub functionality within the miner. Events (new chute added to validator, GPU added to the system, chute removed, etc.) trigger pubsub messages within redis, which trigger the various event handlers in code. _**this is installed and configured automatically when deploying via helm charts**_ ### GraVal bootstrap Chutes uses a custom C/CUDA library for validating graphics cards: https://github.com/chutesai/graval The TL;DR is that it uses matrix multiplications seeded by device info to verify the authenticity of a GPU, including VRAM capacity tests (95% of total VRAM must be available for matrix multiplications). All traffic sent to instances on the chutes network is encrypted with keys that can only be decrypted by the GPU advertised.
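As a rough illustration of the device-seeded challenge idea described above (a toy sketch only, not the actual GraVal implementation; the function names and hashing scheme are invented for this example):

```python
import hashlib
import numpy as np

def device_seed(device_info: str) -> int:
    # Derive a deterministic seed from the advertised device properties.
    return int.from_bytes(hashlib.sha256(device_info.encode()).digest()[:8], "big")

def solve_challenge(device_info: str, n: int = 64) -> str:
    # Generate challenge matrices from the device-derived seed, multiply them,
    # and return a compact fingerprint of the result.
    rng = np.random.default_rng(device_seed(device_info))
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))
    product = a @ b
    return hashlib.sha256(np.round(product, 6).tobytes()).hexdigest()

# A verifier that knows the advertised device info can recompute the same
# fingerprint; a mismatch suggests the hardware is not what was claimed.
print(solve_challenge("NVIDIA A100-SXM4-80GB")[:16])
```

The real library runs the multiplications on the GPU itself, sized to consume most of the available VRAM, so that only genuine hardware with the advertised memory can answer; see the graval repository for the actual mechanism.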
For a detailed explanation of GraVal and other miner verification mechanisms, see the [Security Architecture](/docs/core-concepts/security-architecture) guide. When you add a new node to your kubernetes cluster, each GPU on the server must be verified with the GraVal package, so a bootstrap server is deployed to accomplish this (automatically, no need to fret). Each time a chute starts/gets deployed, it also needs to run GraVal to calculate the decryption key that will be necessary for the GPU(s) the chute is deployed on. _**this is done automatically**_ ### Registry proxy In order to keep the chute docker images somewhat private (since not all images are public), we employ a registry proxy on each miner that injects authentication via bittensor key signature. Each docker image appears to kubelet as `[validator hotkey ss58].localregistry.chutes.ai:30500/[image username]/[image name]:[image tag]`. This subdomain points to 127.0.0.1 so it always loads from the registry service proxy on each GPU server, via NodePort routing and a local-first k8s service traffic policy. The registry proxy itself is an nginx server that performs an auth subrequest to the miner API.
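To make that image naming scheme concrete, here's a small helper (illustrative only, not part of the miner codebase) that splits such a registry-proxy image reference into its components:

```python
import re

# Matches: [validator hotkey ss58].localregistry.chutes.ai:30500/[username]/[name]:[tag]
IMAGE_RE = re.compile(
    r"^(?P<hotkey>[^.]+)\.localregistry\.chutes\.ai:30500/"
    r"(?P<username>[^/]+)/(?P<name>[^:]+):(?P<tag>.+)$"
)

def parse_image_ref(ref: str) -> dict:
    """Split a registry-proxy image reference into its named parts."""
    match = IMAGE_RE.match(ref)
    if match is None:
        raise ValueError(f"not a registry-proxy image reference: {ref}")
    return match.groupdict()

parts = parse_image_ref(
    "5Fexample.localregistry.chutes.ai:30500/myuser/my-image:v1.0"
)
# parts == {"hotkey": "5Fexample", "username": "myuser", "name": "my-image", "tag": "v1.0"}
```

The hotkey component of the subdomain is what lets nginx decide which validator to proxy the authenticated request back to.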
See the nginx configmap: https://github.com/chutesai/chutes-miner/blob/main/charts/templates/registry-cm.yaml

The miner API code that injects the signatures is here: https://github.com/chutesai/chutes-miner/blob/main/api/registry/router.py

Nginx then proxies the request upstream back to the validator in question (based on the hotkey as part of the subdomain), which validates the signatures and replaces those headers with basic auth that can be used with our self-hosted registry: https://github.com/chutesai/chutes-api/blob/main/api/registry/router.py

_**this is installed and configured automatically when deploying via helm charts**_

### API

Each miner runs an API service, which does a variety of things including:

- server/inventory management
- websocket connection to the validator API
- docker image registry authentication

_**this is installed and configured automatically when deploying via helm charts**_

### Gepetto

Gepetto is the key component responsible for all chute (aka app) management. Among other things, it is responsible for actually provisioning chutes, scaling up/down chutes, attempting to claim bounties, etc. This is the main thing to optimize as a miner!

## Getting Started

### 1. Use ansible to provision servers

The first thing you'll want to do is provision your servers/kubernetes.

ALL servers must be bare metal/VM, meaning it will not work on Runpod, Vast, etc., and we do not currently support shared or dynamic IPs - the IPs must be unique, static, and provide a 1:1 port mapping.

### Important RAM note!

It is very important to have as much RAM (or very close to it) per GPU as VRAM. This means, for example, if you are using a server with 4x a40 GPUs (48GB VRAM each), the server must have >= 48 \* 4 = 192 GB of RAM! If you do not have at least as much RAM per GPU as VRAM, deployments are likely to fail and your servers will not be properly utilized.

### Important storage note!

Some providers mount the primary storage in inconvenient ways, e.g.
latitude.sh when using raid 1 mounts the volume on `/home`, hyperstack mounts under `/ephemeral`, etc. Before running the ansible scripts, be sure to login to your servers and check how the storage is allocated. If you want storage space for huggingface cache, images, etc., you'll want to be sure as much as possible is allocated under `/var/snap`. You can do this with a simple bind mount, e.g. if the main storage is under `/home`, run:

```bash
rsync -azv /var/snap/ /home/snap/
echo '/home/snap /var/snap none bind 0 0' >> /etc/fstab
mount -a
```

### Important networking note!

Before starting, you must either disable all layers of firewalls (if you like to live dangerously), or enable the following:

- allow all traffic (all ports, all protos inc. UDP) between all nodes in your inventory
- allow the kubernetes ephemeral port range on all of your GPU nodes, since the ports for chute deployments will be random, in that range, and need public accessibility - the default port range is 30000-32767
- allow access to the various nodePort values in your API from whatever machine you are managing/running chutes-miner add-node/etc., or just make it public (particularly important is the API node port, which defaults to 32000)

The primary CPU node, which the other nodes connect to as the wireguard primary, needs to have IP forwarding enabled -- if your node is in GCP, for example, there's a checkbox you need to enable for IP forwarding.

You'll need one non-GPU server (8 cores, 64gb ram minimum) responsible for running postgres, redis, gepetto, and API components (not chutes), and **_ALL_** of the GPU servers 😄 (just kidding of course, you can use as many or as few as you wish)

[The list of supported GPUs can be found here](https://github.com/chutesai/chutes-api/blob/main/api/gpu.py)

Head over to the [ansible](ansible) documentation for steps on setting up your bare metal instances. Be sure to update `inventory.yml`

### 2.
Configure prerequisites

If you set `setup_local_kubeconfig: true` in your ansible inventory, the kubeconfig file will be automatically copied to your local machine (usually to `~/.kube/config` or similar, check the playbook output). You can verify access by running:

```bash
kubectl get nodes
```

You'll need to set up a few things manually:

- Create a docker hub login to avoid getting rate-limited on pulling public images (you may not need this at all, but it can't hurt):
  - Head over to https://hub.docker.com/ and sign up, generate a new personal access token for public read-only access, then create the secret:

```bash
kubectl create secret docker-registry regcred \
  --docker-server=docker.io \
  --docker-username=[replace with your username] \
  --docker-password=[replace with your access token] \
  --docker-email=[replace with your email]
```

- **Miner Credentials**: If you set `hotkey_path` in your ansible `inventory.yml`, the secret `miner-credentials` should have been created automatically. You can verify with:

```bash
kubectl get secret miner-credentials -n chutes
```

  If not, create it manually:
  - Find the ss58Address and secretSeed from the hotkey file you'll be using for mining, e.g. `cat ~/.bittensor/wallets/default/hotkeys/hotkey`

```bash
kubectl create secret generic miner-credentials \
  --from-literal=ss58=[replace with ss58Address value] \
  --from-literal=seed=[replace with secretSeed value, removing '0x' prefix] \
  -n chutes
```

### 3. Configure your environment

Be sure to thoroughly examine [values](https://github.com/chutesai/chutes-miner/blob/main/charts/values.yaml) (or similar in the repo) and update according to your particular environment. Primary sections to update:

### a. validators

Unlike most subnets, the validators list for chutes must be explicitly configured rather than relying on the metagraph.
Due to the extreme complexity and high expense of operating a validator on this subnet, we're hoping most validators will opt to use the child hotkey functionality rather than operating their own validators. To that end, any validators you wish to support MUST be configured in the top-level validators section.

The default mainnet configuration is:

```yaml
validators:
  defaultRegistry: registry.chutes.ai
  defaultApi: https://api.chutes.ai
  supported:
    - hotkey: 5Dt7HZ7Zpw4DppPxFM7Ke3Cm7sDAWhsZXmM5ZAmE7dSVJbcQ
      registry: registry.chutes.ai
      api: https://api.chutes.ai
      socket: wss://ws.chutes.ai
```

### b. huggingface model cache

To enable faster cold-starts, the kubernetes deployments use a hostPath mount for caching huggingface models. The default is set to purge anything over 30 days old, or when > 850GB has been consumed:

```yaml
cache:
  max_age_days: 30
  max_size_gb: 850
  overrides:
```

You can override per-node settings with the overrides block there, e.g.:

```yaml
cache:
  max_age_days: 30
  max_size_gb: 850
  overrides:
    node-0: 5000
```

In this example, the default will be 850GB, and node-0 will have 5TB. If you have lots and lots of storage space, you may want to increase this or otherwise change defaults.

### c. minerApi

The defaults should do fairly nicely here, but you may want to tweak the service, namely nodePort, if you want to change ports.

```yaml
minerApi:
  ...
  service:
    nodePort: 32000
  ...
```

### d. other

Feel free to adjust redis/postgres/etc. as you wish, but it's probably not necessary.

### 4. Update gepetto with your optimized strategy

Gepetto is the most important component as a miner. It is responsible for selecting chutes to deploy, scale up, scale down, delete, etc. You'll want to thoroughly examine this code and make any changes that you think would gain you more total compute time.
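To make the cost/compute trade-off concrete, here is a toy sketch of the kind of selection logic a custom `gepetto.py` might contain. The data structures and function here are hypothetical illustrations, not the real gepetto API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Server:
    name: str
    gpu_short_ref: str   # e.g. "a6000", "h100_sxm"
    free_gpus: int
    hourly_cost: float   # per-GPU cost you pay for this server

def cheapest_fit(servers: list[Server], gpu_short_ref: str,
                 gpus_needed: int) -> Optional[Server]:
    """One axis of a gepetto-style strategy: among servers with the right GPU
    type and enough free GPUs, pick the one minimizing your hourly cost."""
    candidates = [s for s in servers
                  if s.gpu_short_ref == gpu_short_ref and s.free_gpus >= gpus_needed]
    return min(candidates, key=lambda s: s.hourly_cost, default=None)

inventory = [
    Server("gpu-0", "a6000", free_gpus=2, hourly_cost=0.45),
    Server("gpu-1", "a6000", free_gpus=4, hourly_cost=0.39),
    Server("gpu-2", "h100_sxm", free_gpus=8, hourly_cost=2.10),
]
best = cheapest_fit(inventory, "a6000", gpus_needed=2)
print(best.name)  # gpu-1
```

A real strategy would also weigh bounty claims, cold-start latency, and scale-up/scale-down signals from the validator, not just hourly cost.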
Once you are satisfied with the state of the `gepetto.py` file, you'll need to create a configmap object in kubernetes that stores your file (from inside the `chutes-miner` directory, after cloning the repo):

```bash
kubectl create configmap gepetto-code --from-file=gepetto.py -n chutes
```

Any time you wish to make further changes to gepetto, you need to re-create the configmap:

```bash
kubectl create configmap gepetto-code --from-file=gepetto.py -o yaml --dry-run=client | kubectl apply -n chutes -f -
```

You must also restart the gepetto deployment after you make changes, but this will only work AFTER you have completed the rest of the setup guide (no need to run when you initially setup your miner):

```bash
kubectl rollout restart deployment/gepetto -n chutes
```

### 5. Deploy the miner within your kubernetes cluster

First, and **_exactly one time_**, you'll want to generate passwords for postgres and redis - **_never run this more than once or things will break!_**

Execute this from the `charts` directory (commands may vary slightly based on repo structure):

```bash
helm template . --set createPasswords=true -s templates/one-time-passwords.yaml | kubectl apply -n chutes -f -
```

**Note on Charts:** The repository may split components into multiple charts (e.g., `chutes-miner`, `chutes-miner-gpu`, `chutes-monitoring`). Refer to the repository README for the exact Helm commands to install all components. Generally, you will generate your deployment manifests and apply them:

```bash
helm template . -f values.yaml > miner-charts.yaml
kubectl apply -f miner-charts.yaml -n chutes
```

Any time you change `values.yaml`, you will want to re-run the template command to get the updated charts!

### 6. Register

Register as a miner on subnet 64.

```bash
btcli subnet register --netuid 64 --wallet.name [COLDKEY] --wallet.hotkey [HOTKEY]
```

You **_should not_** announce an axon here!
All communications are done via client-side initialized socket.io connections, so public axons serve no purpose and are just a security risk.

### 7. Add your GPU nodes to inventory

The last step in enabling a GPU node in your miner is to use the `add-node` command in the `chutes-miner` CLI. This calls the miner API, triggers spinning up graval validation services, etc. This must be run exactly once for each GPU node in order for them to be usable by your miner.

Make sure you install the `chutes-miner-cli` package (you can do this on the CPU node, your laptop, wherever):

```bash
pip install chutes-miner-cli
```

Run this for each GPU node in your inventory:

```bash
chutes-miner add-node \
  --name [SERVER NAME FROM inventory.yaml] \
  --validator [VALIDATOR HOTKEY] \
  --hourly-cost [HOURLY COST] \
  --gpu-short-ref [GPU SHORT IDENTIFIER] \
  --hotkey ~/.bittensor/wallets/[COLDKEY]/hotkeys/[HOTKEY] \
  --miner-api http://[MINER API SERVER IP]:[MINER API PORT]
```

- `--name` here corresponds to the short name in your ansible inventory.yaml file, it is not the entire FQDN.
- `--validator` is the hotkey ss58 address of the validator that this server will be allocated to
- `--hourly-cost` is how much you are paying hourly per GPU on this server; part of the optimization strategy in gepetto is to minimize cost when selecting servers to deploy chutes on
- `--gpu-short-ref` is a short identifier string for the type of GPU on the server, e.g. `a6000`, `l40s`, `h100_sxm`, etc. The list of supported GPUs can be found [here](https://github.com/chutesai/chutes-api/blob/main/api/gpu.py)
- `--hotkey` is the path to the hotkey file you registered with, used to sign requests to be able to manage inventory on your system via the miner API
- `--miner-api` is the base URL to your miner API service, which will be http://[non-GPU node IP]:[minerAPI port, default 32000], i.e.
find the public/external IP address of your CPU-only node, and whatever port you configured for the API service (which is 32000 if you didn't change the default).

You can add additional GPU nodes at any time by simply updating inventory.yaml and rerunning the `site.yaml` playbook: [ansible readme](ansible#to-add-a-new-node-after-the-fact)

## Adding servers

To expand your miner's inventory, you should bootstrap them with the ansible scripts, specifically the site playbook. Info for the ansible portions [here](ansible#to-add-a-new-node-after-the-fact)

Then, run the `chutes-miner add-node ...` command above.

---

## SOURCE: https://chutes.ai/docs/miner-resources/ansible

# Node bootstrapping

To ensure the highest probability of success, you should provision your servers with `Ubuntu 22.04`, preferably with NO nvidia driver installations if possible.

### Networking note before starting!!!

Before doing anything, you should check the IP addresses used by your server provider, and make sure you do not use an overlapping network for wireguard. By default, chutes uses 192.168.0.0/20 for this purpose, but that may conflict with some providers, e.g. Nebius through Shadeform sometimes uses 192.168.x.x network space. If the networks overlap, you will have conflicting entries in your route table and the machine may basically get bricked as a result.

It's quite trivial to use a different network for wireguard, or even just a different non-overlapping range in the 192.168.x.x space, but only if you start initially with that network. Migrating after you've already set up the miner with a different wireguard network config is a bit of effort.

To use a different range, simply update these files:

1. `ansible/k3s/inventory.yml` - your hosts will need the updated `wireguard_ip` values to match
2. `ansible/k3s/group_vars/all.yml` (or similar, depending on repo structure) - usually defines the wireguard network. Check the variable `wireguard_network` or similar if exposed.
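Before committing to a range, you can check a candidate wireguard network against your provider's subnets with Python's standard `ipaddress` module (the provider subnets below are made-up examples):

```python
import ipaddress

def conflicting_subnets(wireguard_cidr: str, provider_cidrs: list[str]) -> list[str]:
    """Return any provider subnets that overlap the proposed wireguard network."""
    wg = ipaddress.ip_network(wireguard_cidr)
    return [cidr for cidr in provider_cidrs
            if wg.overlaps(ipaddress.ip_network(cidr))]

provider = ["192.168.4.0/24", "10.128.0.0/16"]

# The default chutes range collides with a provider handing out 192.168.4.x:
print(conflicting_subnets("192.168.0.0/20", provider))    # ['192.168.4.0/24']

# A different 192.168.x.x /20 avoids the conflict:
print(conflicting_subnets("192.168.128.0/20", provider))  # []
```

Run this against every subnet your providers actually use (check `ip route` on a fresh server) before writing `wireguard_ip` values into your inventory.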
I would NOT recommend changing the wireguard network if you are already running, unless you absolutely need to. And if you do, the best bet is to completely wipe the node and start over.

#### external_ip

The chutes API/validator sends traffic directly to each GPU node, and does not route through the main CPU node at all. For the system to work, each GPU node must have a publicly routable IP address that is not behind a shared IP (since it uses kubernetes nodePort services). This IP is the public IPv4, and must not be something in the private IP range like 192.168.0.0/16, 10.0.0.0/8, etc.

This public IP _must_ be dedicated, and be the same for both egress and ingress. This means, for a node to pass validation, when the validator connects to it, the IP address you advertise as a miner must match the IP address the validator sees when your node fetches a remote token, i.e. you can't use a shared IP with NAT/port-mapping if the underlying nodes route back out to the internet with some other IPs.

## 1. Install ansible (on your local system, not the miner node(s))

### Mac

If you haven't yet, setup homebrew:

```bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```

Then install ansible:

```bash
brew install ansible
```

### Ubuntu/Ubuntu (WSL)/apt-based systems

```bash
sudo apt -y update && sudo apt -y install ansible python3-pip
```

### CentOS/RHEL/Fedora

Install the epel repo if you haven't (and it's not fedora):

```bash
sudo dnf install epel-release -y
```

Install ansible:

```bash
sudo dnf install ansible -y
```

## 2.
Install ansible collections

```bash
ansible-galaxy collection install community.general
ansible-galaxy collection install kubernetes.core
```

## OPTIONAL: Performance Tweaks for Ansible

```bash
wget https://files.pythonhosted.org/packages/source/m/mitogen/mitogen-0.3.22.tar.gz
tar -xzf mitogen-0.3.22.tar.gz
```

Then in your ansible.cfg:

```
[defaults]
strategy_plugins = /path/to/mitogen-0.3.22/ansible_mitogen/plugins/strategy
strategy = mitogen_linear

... leave the rest, and add this block below ...

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=2m
```

## 3. Update inventory configuration

Clone the repository:

```bash
git clone https://github.com/chutesai/chutes-miner.git
cd chutes-miner/ansible/k3s
```

Using your favorite text editor (vim of course), edit `inventory.yml` to suit your needs. For example:

```yaml
all:
  vars:
    # List of SSH public keys, e.g. cat ~/.ssh/id_rsa.pub
    ssh_keys:
      - "ssh-rsa AAAA... user@hostname"
      - "ssh-rsa BBBB... user2@hostname2"
    # The username you want to use to login to those machines (and your public key will be added to).
    user: billybob
    # The initial username to login with, for fresh nodes that may not have your username setup.
    ansible_user: ubuntu
    # The default validator each GPU worker node will be assigned to.
    validator: 5Dt7HZ7Zpw4DppPxFM7Ke3Cm7sDAWhsZXmM5ZAmE7dSVJbcQ
    # By default, no nodes are the primary (CPU node running all the apps, wireguard, etc.). Override this flag exactly once below.
    is_primary: false
    # We assume GPU is enabled on all nodes, but of course you need to disable this for the CPU nodes below.
    gpu_enabled: true
    # The port you'll be using for the registry proxy, MUST MATCH chart/values.yaml registry.service.nodePort!
    registry_port: 30500
    # SSH sometimes just hangs without this...
    ansible_ssh_common_args: "-o ControlPath=none"
    # SSH retries...
    ansible_ssh_retries: 3
    # Ubuntu major/minor version.
    ubuntu_major: "22"
    ubuntu_minor: "04"
    # CUDA version - leave as-is unless using h200s, in which case either use 12-5 or skip_cuda: true (if provider already pre-installed drivers)
    cuda_version: "12-6"
    # NVIDIA GPU drivers - leave as-is unless using h200s, in which case it would be 555
    nvidia_version: "560"
    # Flag to skip the cuda install entirely, if the provider already has cuda 12.x+ installed (note some chutes will not work unless 12.6+)
    skip_cuda: false
    # PATH TO YOUR HOTKEY FILE
    # This is used to create the miner-credentials secret in k8s automatically
    hotkey_path: ~/.bittensor/wallets/default/hotkeys/my-hotkey
    # Setup local kubeconfig?
    # If true, it will copy the kubeconfig from the primary node to your local machine
    setup_local_kubeconfig: true
  hosts:
    # This would be the main node, which runs postgres, redis, gepetto, etc.
    chutes-miner-cpu-0:
      ansible_host: 1.0.0.0
      external_ip: 1.0.0.0
      wireguard_ip: 192.168.0.1
      gpu_enabled: false
      is_primary: true
      wireguard_mtu: 1420 # optional (default is 1380)
    # These are the GPU nodes, which actually run the chutes.
    chutes-miner-gpu-0:
      ansible_host: 1.0.0.1
      external_ip: 1.0.0.1
      wireguard_ip: 192.168.0.3
```

## 4. Run the playbook

This playbook handles wireguard setup, k3s installation, and joining nodes to the cluster.

```bash
ansible-playbook -i inventory.yml site.yml
```

## 5. Install 3rd party helm charts

This step will install the nvidia GPU operator and prometheus on your servers. You need to run this one time only (although running it again shouldn't cause any problems).

```bash
ansible-playbook -i inventory.yml extras.yml
```

## To add a new node, after the fact

First, update your `inventory.yml` with the new host configuration. Then, run the site playbook with `--limit` to target only the new node and the primary (the primary is typically needed for coordination/token generation; running against all hosts is safest but slower).
```bash
ansible-playbook -i inventory.yml site.yml --limit chutes-h200-0,chutes-miner-cpu-0
```

(Including the primary node ensures that if any coordination is needed, it is available.)

Then run extras on the new node:

```bash
ansible-playbook -i inventory.yml extras.yml --limit chutes-h200-0
```

---

## SOURCE: https://chutes.ai/docs/miner-resources/scoring

# Scoring Metrics and Weights

The system evaluates miners using four key metrics, each with an assigned weight:

1. **Compute Units (55%)**: Measures the total computational work performed, calculated as the sum of:
   - Flat sum of bounties (as compute units)
   - Compute time
   - Normalized using median performance (tokens-per-second and/or steps-per-second across miners)
   - Multiplied by compute multiplier (based on number and type of GPUs)
   - Using appropriate time measurement methods (step-based, token-based, or raw execution time)
2. **Invocation Count (25%)**: The total number of successful invocations (compute jobs) handled
3. **Unique Chute Score (15%)**: Average number of unique chutes that a miner runs simultaneously, weighted by GPU requirements
4. **Bounty Count (5%)**: The number of bounties received (not the value, just the count)

## Scoring Process

The scoring algorithm follows these steps:

### 1. Data Collection

Queries the database for raw metrics using SQL queries within a specified scoring interval (default: 7 days):

- **Compute metrics**: Uses median computation rates (step time and token time) calculated over the last 2 days to normalize compute units
- **Unique chute metrics**: Calculates GPU-weighted chute counts using the latest GPU count from chute history, with hourly snapshots over the scoring period

### 2.
Normalization Process

The system applies different normalization strategies for each metric:

**Standard Metrics (compute_units, invocation_count, bounty_count)**:

- Normalized by dividing each miner's value by the total sum across all miners

**Unique Chute Score**:

- Uses a sophisticated two-tier normalization system:
  - **Above median**: Miners with chute counts `≥` median are normalized using exponent 1.3: `(count / highest_count)^1.3`
  - **Below median**: Miners with chute counts `<` median are normalized using exponent 2.2: `(count / highest_count)^2.2`
- After initial normalization, all unique chute scores are re-normalized to sum to 1.0

### 3. Multi-UID Punishment

Penalizes miners who run multiple nodes with the same coldkey (identity):

- Ranks all miners by their preliminary scores (highest first)
- For each coldkey, only the highest-scoring hotkey receives rewards
- All other hotkeys sharing the same coldkey receive zero score

## GPU-Weighted Chute Calculation

The unique chute score uses a GPU-weighting system:

1. **Historical GPU Tracking**: Uses the latest GPU count from `chute_history` for each chute
2. **Hourly Snapshots**: Takes hourly snapshots of active chutes over the scoring period
3. **GPU Weighting**: Each chute contributes its GPU count (defaults to 1 if no history exists) to the miner's score
4. **Time Averaging**: Averages GPU-weighted chute counts across all time points in the scoring period

## Anti-Gaming Mechanisms

The code includes several safeguards against gaming the system:

1. **Multi-UID Punishment**: Prevents miners from gaining advantage by running multiple nodes with the same coldkey
2. **Median Computation Rates**: Uses median values for step/token times calculated over 2 days to resist manipulation
3. **Error Filtering**: Only counts successful invocations (no errors, completed successfully)
4. **Report Filtering**: Excludes invocations that have been reported for issues
5.
**GPU History Validation**: Uses historical GPU counts from chute history to prevent gaming through GPU count manipulation
6. **Successful Instance Filtering**: Only considers instances that have had at least one successful invocation in their lifetime
7. **Two-Tier Chute Normalization**: The unique chute score's dual-exponent system (1.3 vs 2.2) rewards miners who maintain above-median chute diversity while penalizing those below median

This scoring system aims to fairly distribute rewards based on actual computational work performed, with mechanisms to prevent gaming and ensure network health.

---

## SOURCE: https://chutes.ai/docs/miner-resources/miner-maintenance

# Miner Maintenance & Operations

This guide covers "Day 2" operations for Chutes miners: monitoring, troubleshooting, updating, and maintaining your mining infrastructure.

## Routine Maintenance

### 1. Updating Components

The Chutes ecosystem evolves rapidly. Keep your miner up to date to ensure compatibility and maximize rewards.

**Updating Charts:**

Use the provided Ansible playbooks to update your Helm charts. This pulls the latest miner and GPU agent images.

```bash
# From your ansible/k3s directory
ansible-playbook -i inventory.yml playbooks/deploy-charts.yml
```

**Updating OS & Drivers:**

Periodically update your base OS and NVIDIA drivers. **Caution:** Drain the node or set it to unschedulable in Kubernetes before rebooting to avoid slashing/penalties for dropping active chutes.

### 2. Cleaning Disk Space

HuggingFace models and Docker images can consume significant disk space. The `chutes-cacheclean` service usually handles this, but you can run manual cleanups if needed.

**Prune Docker Images:**

```bash
# On a GPU node
docker system prune -a -f --filter "until=24h"
```

**Clear HuggingFace Cache:**

Model weights are stored in the configured cache directory (default `/var/snap`). You can manually delete old models if space is critical, but this will force re-downloads for new deployments.
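For a manual cleanup, here is a minimal sketch of the age-based part of such a policy (list-only; actual deletion is left to you, and the `chutes-cacheclean` service remains the supported path). It is demonstrated against a throwaway directory standing in for the default `/var/snap` cache root:

```python
import os
import tempfile
import time
from pathlib import Path

def stale_files(root: str, max_age_days: int) -> list[Path]:
    """List files under `root` whose mtime is older than `max_age_days` —
    candidates for manual deletion when disk space is critical."""
    cutoff = time.time() - max_age_days * 86400
    return [p for p in Path(root).rglob("*")
            if p.is_file() and p.stat().st_mtime < cutoff]

# Demo on a temp dir: one fresh file, one backdated 40 days.
with tempfile.TemporaryDirectory() as cache_root:
    (Path(cache_root) / "fresh.safetensors").write_bytes(b"x")
    old = Path(cache_root) / "old.safetensors"
    old.write_bytes(b"x")
    os.utime(old, (time.time() - 40 * 86400,) * 2)
    print([p.name for p in stale_files(cache_root, 30)])  # ['old.safetensors']
```

Review the list before deleting anything: removing a model that an active chute still needs just forces a re-download on the next deployment.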
## Troubleshooting

### Common Issues

**1. Node Not Joining Cluster**

- **Check Wireguard**: Ensure the `wg0` interface is up and has the correct IP.
  - `ip addr show wg0`
  - `systemctl status wg-quick@wg0`
- **Check K3s Agent**:
  - `systemctl status k3s-agent`
  - Logs: `journalctl -u k3s-agent -f`

**2. GPU Not Detected**

- **NVIDIA SMI**: Run `nvidia-smi` on the node. If it fails, reinstall drivers.
- **K8s Detection**: Check if the node advertises GPU resources:

```bash
kubectl describe node <node-name> | grep nvidia.com/gpu
```

- **GPU Operator**: Ensure the NVIDIA GPU Operator pods are running in the `gpu-operator` namespace.

**3. "Gepetto" Not Scheduling Pods**

- **Check Logs**:

```bash
kubectl logs -l app=gepetto -n chutes -f
```

- **Check Resources**: Ensure you have enough free CPU/RAM/GPU. Gepetto won't schedule if the cluster is full.
- **Check Taints**: Ensure nodes aren't tainted unexpectedly.

### Rebooting a Node Safely

To reboot a node without impacting your miner score significantly (by failing active requests):

1. **Cordon the node** (stop new scheduling):

```bash
kubectl cordon <node-name>
```

2. **Wait for jobs to finish** (optional, but polite).
3. **Reboot the node**.
4. **Uncordon the node** once it's back online and `nvidia-smi` works:

```bash
kubectl uncordon <node-name>
```

## Monitoring

### Grafana Dashboards

Your miner installation includes Grafana (default port 30080 on the control node).

- **Compute Overview**: View total GPU usage, active chutes, and potential earnings.
- **Node Health**: Monitor CPU, RAM, and Disk usage per node.
- **Network Traffic**: Critical for ensuring you aren't bottlenecked on bandwidth (especially for image/video models).
### Logs

**Miner API Logs:**

```bash
kubectl logs -l app=miner-api -n chutes -f
```

**Instance Logs (Specific Chute):**

Find the pod name for a specific chute instance, then tail its logs:

```bash
kubectl get pods -n chutes -l chute_id=<chute-id>
kubectl logs <pod-name> -n chutes -f
```

## Security Best Practices

- **Rotate Keys**: Periodically rotate your hotkey if you suspect compromise (requires re-registering or updating miner config).
- **Firewall**: Ensure only the API port (32000) and Wireguard port (51820) are exposed externally. All internal traffic should route over Wireguard (wg0).
- **SSH Access**: Disable password authentication and use SSH keys only.

---

## SOURCE: https://chutes.ai/docs/integrations

# Integrations

Chutes integrates with popular AI libraries and frameworks to make development easier.

## Available Integrations

- **[Vercel AI SDK](/docs/integrations/vercel-ai-sdk)** - A production-ready provider for using open-source AI models hosted on Chutes.ai with the Vercel AI SDK.
- **[Sign in with Chutes](/docs/sign-in-with-chutes/overview)** - OAuth 2.0 authentication that lets users sign into your app with their Chutes account.

---

## SOURCE: https://chutes.ai/docs/integrations/vercel-ai-sdk

# Vercel AI SDK Integration

The **Chutes.ai Provider for Vercel AI SDK** allows you to use open-source AI models hosted on Chutes.ai with the Vercel AI SDK. It supports a wide range of capabilities including chat, streaming, tool calling, and multimodal generation.
## Features

- ✅ **Language Models**: Complete support for chat and text completion
- ✅ **Streaming**: Real-time Server-Sent Events (SSE) streaming
- ✅ **Tool Calling**: Full function/tool calling support
- ✅ **Multimodal**: Image, Video, Audio (TTS/STT/Music) generation
- ✅ **Chute Warmup**: Pre-warm chutes for instant response times
- ✅ **Type-Safe**: Fully typed for excellent IDE support

## Installation

Install the provider and the AI SDK:

```bash
npm install @chutes-ai/ai-sdk-provider ai
```

**Note**: For Next.js projects with TypeScript, AI SDK v5 is recommended:

```bash
npm install @chutes-ai/ai-sdk-provider ai@^5.0.0
```

## Configuration

### 1. Get API Key

Get your API key from [Chutes.ai](https://chutes.ai) and set it as an environment variable:

```bash
export CHUTES_API_KEY=your-api-key-here
```

### 2. Initialize Provider

You can initialize the provider with your API key:

```typescript
import { createChutes } from "@chutes-ai/ai-sdk-provider";

const chutes = createChutes({
  apiKey: process.env.CHUTES_API_KEY,
});
```

## Language Models

### Text Generation

Generate text using any LLM hosted on Chutes.

```typescript
import { generateText } from "ai";

const model = chutes("https://chutes-deepseek-ai-deepseek-v3.chutes.ai");

const result = await generateText({
  model,
  prompt: "Explain quantum computing in simple terms",
});

console.log(result.text);
```

### Streaming Responses

Stream responses in real-time for a better user experience.

```typescript
import { streamText } from "ai";

const result = await streamText({
  model: chutes("https://chutes-meta-llama-llama-3-1-70b-instruct.chutes.ai"),
  prompt: "Write a story about a space traveler.",
});

for await (const chunk of result.textStream) {
  process.stdout.write(chunk);
}
```

### Tool Calling

Connect LLMs to external data and functions.
```typescript
import { z } from "zod";

const result = await generateText({
  model: chutes("https://chutes-deepseek-ai-deepseek-v3.chutes.ai"),
  tools: {
    getWeather: {
      description: "Get the current weather",
      parameters: z.object({
        location: z.string().describe("City name"),
      }),
      execute: async ({ location }) => {
        return { temp: 72, condition: "Sunny", location };
      },
    },
  },
  prompt: "What is the weather in San Francisco?",
});
```

## Multimodal Capabilities

### Image Generation

Generate images using models like FLUX.

```typescript
import * as fs from "fs";

const imageModel = chutes.imageModel("flux-dev");

const result = await imageModel.doGenerate({
  prompt: "A cyberpunk city with neon lights and flying cars",
  size: "1024x1024",
});

const base64Data = result.images[0].split(",")[1];
fs.writeFileSync("city.png", Buffer.from(base64Data, "base64"));
```

### Text-to-Speech (TTS)

Convert text to speech using over 50 available voices.

```typescript
const audioModel = chutes.audioModel("your-tts-chute-id");

const result = await audioModel.textToSpeech({
  text: "Welcome to the future of AI.",
  voice: "af_bella", // American Female - Bella
});

fs.writeFileSync("output.mp3", result.audio);
```

### Speech-to-Text (STT)

Transcribe audio files.

```typescript
const audioModel = chutes.audioModel("your-stt-chute-id");

const audioBuffer = fs.readFileSync("recording.mp3");

const transcription = await audioModel.speechToText({
  audio: audioBuffer,
  language: "en",
});

console.log(transcription.text);
```

## Advanced Features

### Chute Warmup (Therm)

Pre-warm chutes to eliminate cold starts.

```typescript
// Warm up a chute
const result = await chutes.therm.warmup("your-chute-id");

if (result.isHot) {
  console.log("Chute is ready!");
} else {
  console.log("Warming up...");
}
```

### Embeddings

Generate vector embeddings for semantic search.
```typescript
import { embedMany } from "ai";

const embeddingModel = chutes.textEmbeddingModel("text-embedding-3-small");

const { embeddings } = await embedMany({
  model: embeddingModel,
  values: ["Hello world", "Machine learning is cool"],
});
```

## Troubleshooting

### Common Issues

- **404 Not Found**: Verify the chute URL is correct and the chute is deployed.
- **401 Unauthorized**: Check your `CHUTES_API_KEY`.
- **429 Rate Limit**: Implement exponential backoff or request a quota increase.

### Getting Help

- Check the [GitHub Repository](https://github.com/chutesai/ai-sdk-provider-chutes) for issues.
- Join the [Discord Community](https://discord.gg/chutes).

---

## SOURCE: https://chutes.ai/docs/sign-in-with-chutes/nextjs

# Sign in with Chutes: Next.js Guide

This guide walks you through implementing "Sign in with Chutes" OAuth in a Next.js application. By the end, your users will be able to authenticate with their Chutes account and your app can make API calls on their behalf.

## Quick Start with the Official SDK

The fastest way to add "Sign in with Chutes" to your Next.js app is using the official SDK repository with an AI coding assistant like Cursor:

**[github.com/chutesai/Sign-in-with-Chutes](https://github.com/chutesai/Sign-in-with-Chutes)**

Simply tell your AI assistant:

```
Add "Sign in with Chutes" to my Next.js app using the SDK at:
https://github.com/chutesai/Sign-in-with-Chutes
```

The AI will copy the integration files, set up routes, and configure your app automatically.

### Manual SDK Setup

Alternatively, use the setup wizard directly:

```bash
# Clone and set up
git clone https://github.com/chutesai/Sign-in-with-Chutes.git
cd Sign-in-with-Chutes
npm install

# Run the interactive setup wizard
npx tsx scripts/setup-chutes-app.ts

# Copy files from packages/nextjs/ to your project
```

The wizard will register your OAuth app and generate credentials.
--- The rest of this guide explains the implementation in detail if you want to understand how it works or customize the integration. ## Prerequisites - Next.js 13+ with App Router - A Chutes account with an API key - Node.js 18+ ## Installation Install the required dependencies: ```bash npm install ``` No additional OAuth libraries are required - this implementation uses native Web Crypto APIs and Next.js built-in features. ## OAuth App Registration ### Using the API Register your OAuth application with Chutes: ```bash curl -X POST "https://api.chutes.ai/idp/apps" \ -H "Authorization: Bearer $CHUTES_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "name": "My Next.js App", "description": "My application description", "redirect_uris": ["http://localhost:3000/api/auth/chutes/callback"], "homepage_url": "http://localhost:3000", "allowed_scopes": ["openid", "profile", "chutes:invoke"] }' ``` Save the returned `client_id` and `client_secret` for the next step. **Important**: For production, add your production callback URL to `redirect_uris`: ```json { "redirect_uris": [ "http://localhost:3000/api/auth/chutes/callback", "https://yourapp.com/api/auth/chutes/callback" ] } ``` ## Environment Variables Create a `.env.local` file in your project root: ```bash # Required - OAuth Client Credentials CHUTES_OAUTH_CLIENT_ID=cid_xxx CHUTES_OAUTH_CLIENT_SECRET=csc_xxx # Optional - Override default scopes CHUTES_OAUTH_SCOPES="openid profile chutes:invoke" # Optional - Explicitly set redirect URI (auto-detected if not set) CHUTES_OAUTH_REDIRECT_URI=https://yourapp.com/api/auth/chutes/callback # Optional - App URL for redirect URI construction NEXT_PUBLIC_APP_URL=https://yourapp.com # Optional - Override IDP base URL (rarely needed) CHUTES_IDP_BASE_URL=https://api.chutes.ai ``` ## Project Structure Your authentication implementation will consist of these files: ``` app/ ├── api/ │ └── auth/ │ └── chutes/ │ ├── login/ │ │ └── route.ts # Initiates OAuth flow │ ├── callback/ │ │ 
└── route.ts # Handles OAuth callback │ ├── logout/ │ │ └── route.ts # Clears session │ └── session/ │ └── route.ts # Returns current session lib/ ├── chutesAuth.ts # Core OAuth utilities └── serverAuth.ts # Server-side auth helpers hooks/ └── useChutesSession.ts # React hook for auth state ``` ## Core Implementation ### OAuth Utilities (`lib/chutesAuth.ts`) This file contains the core OAuth logic: ```typescript import crypto from "crypto"; export interface OAuthConfig { clientId: string; clientSecret: string; redirectUri: string; scopes: string[]; idpBaseUrl: string; } export interface TokenResponse { access_token: string; refresh_token: string; token_type: string; expires_in: number; } export interface ChutesUser { sub: string; username: string; email?: string; name?: string; } // Get OAuth configuration from environment export function getOAuthConfig(requestOrigin?: string): OAuthConfig { const clientId = process.env.CHUTES_OAUTH_CLIENT_ID; const clientSecret = process.env.CHUTES_OAUTH_CLIENT_SECRET; if (!clientId || !clientSecret) { throw new Error("Missing CHUTES_OAUTH_CLIENT_ID or CHUTES_OAUTH_CLIENT_SECRET"); } const baseUrl = requestOrigin || process.env.NEXT_PUBLIC_APP_URL || "http://localhost:3000"; const redirectUri = process.env.CHUTES_OAUTH_REDIRECT_URI || `${baseUrl}/api/auth/chutes/callback`; const scopes = (process.env.CHUTES_OAUTH_SCOPES || "openid profile chutes:invoke") .split(" "); return { clientId, clientSecret, redirectUri, scopes, idpBaseUrl: process.env.CHUTES_IDP_BASE_URL || "https://api.chutes.ai", }; } // Generate PKCE code verifier and challenge export function generatePkce(): { verifier: string; challenge: string } { const verifier = crypto.randomBytes(32).toString("base64url"); const challenge = crypto .createHash("sha256") .update(verifier) .digest("base64url"); return { verifier, challenge }; } // Generate random state for CSRF protection export function generateState(): string { return crypto.randomBytes(16).toString("hex"); } // 
Build the authorization URL export function buildAuthorizeUrl(params: { state: string; codeChallenge: string; config: OAuthConfig; }): string { const { state, codeChallenge, config } = params; const url = new URL(`${config.idpBaseUrl}/idp/authorize`); url.searchParams.set("client_id", config.clientId); url.searchParams.set("redirect_uri", config.redirectUri); url.searchParams.set("response_type", "code"); url.searchParams.set("scope", config.scopes.join(" ")); url.searchParams.set("state", state); url.searchParams.set("code_challenge", codeChallenge); url.searchParams.set("code_challenge_method", "S256"); return url.toString(); } // Exchange authorization code for tokens export async function exchangeCodeForTokens(params: { code: string; codeVerifier: string; config: OAuthConfig; }): Promise<TokenResponse> { const { code, codeVerifier, config } = params; const response = await fetch(`${config.idpBaseUrl}/idp/token`, { method: "POST", headers: { "Content-Type": "application/x-www-form-urlencoded", }, body: new URLSearchParams({ grant_type: "authorization_code", client_id: config.clientId, client_secret: config.clientSecret, code, redirect_uri: config.redirectUri, code_verifier: codeVerifier, }), }); if (!response.ok) { const error = await response.text(); throw new Error(`Token exchange failed: ${error}`); } return response.json(); } // Refresh expired tokens export async function refreshTokens(params: { refreshToken: string; config: OAuthConfig; }): Promise<TokenResponse> { const { refreshToken, config } = params; const response = await fetch(`${config.idpBaseUrl}/idp/token`, { method: "POST", headers: { "Content-Type": "application/x-www-form-urlencoded", }, body: new URLSearchParams({ grant_type: "refresh_token", client_id: config.clientId, client_secret: config.clientSecret, refresh_token: refreshToken, }), }); if (!response.ok) { throw new Error("Token refresh failed"); } return response.json(); } // Fetch user info from Chutes export async function fetchUserInfo( config: OAuthConfig,
accessToken: string ): Promise<ChutesUser> { const response = await fetch(`${config.idpBaseUrl}/idp/userinfo`, { headers: { Authorization: `Bearer ${accessToken}`, }, }); if (!response.ok) { throw new Error("Failed to fetch user info"); } return response.json(); } ``` ### Server-Side Helpers (`lib/serverAuth.ts`) Helper functions for accessing auth state on the server: ```typescript import { cookies } from "next/headers"; import type { ChutesUser } from "./chutesAuth"; const COOKIE_OPTIONS = { httpOnly: true, secure: process.env.NODE_ENV === "production", sameSite: "lax" as const, path: "/", }; // Get access token from cookies export async function getServerAccessToken(): Promise<string | null> { const cookieStore = await cookies(); return cookieStore.get("chutes_access_token")?.value || null; } // Get refresh token from cookies export async function getServerRefreshToken(): Promise<string | null> { const cookieStore = await cookies(); return cookieStore.get("chutes_refresh_token")?.value || null; } // Get cached user info from cookies export async function getServerUserInfo(): Promise<ChutesUser | null> { const cookieStore = await cookies(); const userCookie = cookieStore.get("chutes_user")?.value; if (!userCookie) return null; try { return JSON.parse(userCookie); } catch { return null; } } // Check if user is authenticated export async function isAuthenticated(): Promise<boolean> { const token = await getServerAccessToken(); return !!token; } // Set auth cookies (for use in route handlers) export function setAuthCookies( headers: Headers, tokens: { access_token: string; refresh_token: string }, user: ChutesUser ): void { const cookieOptions = `; HttpOnly; ${ process.env.NODE_ENV === "production" ?
"Secure; " : "" }SameSite=Lax; Path=/`; headers.append( "Set-Cookie", `chutes_access_token=${tokens.access_token}${cookieOptions}` ); headers.append( "Set-Cookie", `chutes_refresh_token=${tokens.refresh_token}${cookieOptions}` ); headers.append( "Set-Cookie", `chutes_user=${JSON.stringify(user)}${cookieOptions}` ); } // Clear auth cookies (for logout) export function clearAuthCookies(headers: Headers): void { const expiredOptions = "; HttpOnly; Path=/; Max-Age=0"; headers.append("Set-Cookie", `chutes_access_token=${expiredOptions}`); headers.append("Set-Cookie", `chutes_refresh_token=${expiredOptions}`); headers.append("Set-Cookie", `chutes_user=${expiredOptions}`); headers.append("Set-Cookie", `chutes_state=${expiredOptions}`); headers.append("Set-Cookie", `chutes_verifier=${expiredOptions}`); } ``` ### Login Route (`app/api/auth/chutes/login/route.ts`) Initiates the OAuth flow: ```typescript import { NextResponse } from "next/server"; import { getOAuthConfig, generatePkce, generateState, buildAuthorizeUrl, } from "@/lib/chutesAuth"; export async function GET(request: Request) { const origin = new URL(request.url).origin; const config = getOAuthConfig(origin); // Generate PKCE and state const { verifier, challenge } = generatePkce(); const state = generateState(); // Build authorization URL const authorizeUrl = buildAuthorizeUrl({ state, codeChallenge: challenge, config, }); // Create response with redirect const response = NextResponse.redirect(authorizeUrl); // Store state and verifier in cookies for callback validation const cookieOptions = `; HttpOnly; ${ process.env.NODE_ENV === "production" ? 
"Secure; " : "" }SameSite=Lax; Path=/; Max-Age=600`; response.headers.append("Set-Cookie", `chutes_state=${state}${cookieOptions}`); response.headers.append("Set-Cookie", `chutes_verifier=${verifier}${cookieOptions}`); return response; } ``` ### Callback Route (`app/api/auth/chutes/callback/route.ts`) Handles the OAuth callback: ```typescript import { NextResponse, type NextRequest } from "next/server"; import { cookies } from "next/headers"; import { getOAuthConfig, exchangeCodeForTokens, fetchUserInfo, } from "@/lib/chutesAuth"; import { setAuthCookies } from "@/lib/serverAuth"; export async function GET(request: NextRequest) { const searchParams = request.nextUrl.searchParams; const code = searchParams.get("code"); const state = searchParams.get("state"); const error = searchParams.get("error"); // Handle OAuth errors if (error) { return NextResponse.redirect( new URL(`/?error=${encodeURIComponent(error)}`, request.url) ); } // Validate required parameters if (!code || !state) { return NextResponse.redirect( new URL("/?error=missing_params", request.url) ); } // Get stored state and verifier from cookies const cookieStore = await cookies(); const storedState = cookieStore.get("chutes_state")?.value; const codeVerifier = cookieStore.get("chutes_verifier")?.value; // Validate state to prevent CSRF if (!storedState || state !== storedState) { return NextResponse.redirect( new URL("/?error=invalid_state", request.url) ); } if (!codeVerifier) { return NextResponse.redirect( new URL("/?error=missing_verifier", request.url) ); } try { const origin = new URL(request.url).origin; const config = getOAuthConfig(origin); // Exchange code for tokens const tokens = await exchangeCodeForTokens({ code, codeVerifier, config, }); // Fetch user info const user = await fetchUserInfo(config, tokens.access_token); // Create response with redirect to home const response = NextResponse.redirect(new URL("/", request.url)); // Set auth cookies setAuthCookies(response.headers, tokens, 
user); // Clear temporary cookies response.headers.append( "Set-Cookie", "chutes_state=; HttpOnly; Path=/; Max-Age=0" ); response.headers.append( "Set-Cookie", "chutes_verifier=; HttpOnly; Path=/; Max-Age=0" ); return response; } catch (error) { console.error("OAuth callback error:", error); return NextResponse.redirect( new URL("/?error=auth_failed", request.url) ); } } ``` ### Logout Route (`app/api/auth/chutes/logout/route.ts`) Clears the user's session: ```typescript import { NextResponse } from "next/server"; import { clearAuthCookies } from "@/lib/serverAuth"; export async function POST(request: Request) { const response = NextResponse.redirect(new URL("/", request.url)); clearAuthCookies(response.headers); return response; } // Also support GET for convenience export async function GET(request: Request) { return POST(request); } ``` ### Session Route (`app/api/auth/chutes/session/route.ts`) Returns the current session state: ```typescript import { NextResponse } from "next/server"; import { getServerAccessToken, getServerUserInfo, } from "@/lib/serverAuth"; export async function GET() { const token = await getServerAccessToken(); const user = await getServerUserInfo(); if (!token || !user) { return NextResponse.json({ isSignedIn: false, user: null }); } return NextResponse.json({ isSignedIn: true, user }); } ``` ### React Hook (`hooks/useChutesSession.ts`) Client-side hook for accessing auth state: ```typescript "use client"; import { useState, useEffect, useCallback } from "react"; interface ChutesUser { sub: string; username: string; email?: string; name?: string; } interface SessionState { isSignedIn: boolean; user: ChutesUser | null; loading: boolean; loginUrl: string; refresh: () => Promise<void>; logout: () => Promise<void>; } export function useChutesSession(): SessionState { const [isSignedIn, setIsSignedIn] = useState(false); const [user, setUser] = useState<ChutesUser | null>(null); const [loading, setLoading] = useState(true); const refresh = useCallback(async () => { try {
const response = await fetch("/api/auth/chutes/session"); const data = await response.json(); setIsSignedIn(data.isSignedIn); setUser(data.user); } catch (error) { console.error("Failed to fetch session:", error); setIsSignedIn(false); setUser(null); } finally { setLoading(false); } }, []); const logout = useCallback(async () => { try { await fetch("/api/auth/chutes/logout", { method: "POST" }); setIsSignedIn(false); setUser(null); } catch (error) { console.error("Logout failed:", error); } }, []); useEffect(() => { refresh(); }, [refresh]); return { isSignedIn, user, loading, loginUrl: "/api/auth/chutes/login", refresh, logout, }; } ``` ## Usage Examples ### Sign In Button Component ```tsx "use client"; import { useChutesSession } from "@/hooks/useChutesSession"; export function AuthButton() { const { isSignedIn, user, loading, loginUrl, logout } = useChutesSession(); if (loading) { return <button disabled>Loading...</button>; } if (isSignedIn && user) { return ( <div> <span>Welcome, {user.username}!</span> <button onClick={logout}>Sign out</button> </div> ); } return ( <a href={loginUrl}>Sign in with Chutes</a> ); } ``` ### Protected Server Component ```tsx import { redirect } from "next/navigation"; import { isAuthenticated, getServerUserInfo } from "@/lib/serverAuth"; export default async function DashboardPage() { const authenticated = await isAuthenticated(); if (!authenticated) { redirect("/api/auth/chutes/login"); } const user = await getServerUserInfo(); return ( <div> <h1>Dashboard</h1> <p>Welcome, {user?.username}!</p> </div>
); } ``` ### Custom Post-Login Redirect Modify the callback route to redirect to a specific page: ```typescript // In callback/route.ts const response = NextResponse.redirect(new URL("/dashboard", request.url)); ``` Or redirect to where the user was before: ```typescript // Store the return URL before login const returnTo = cookieStore.get("return_to")?.value || "/"; const response = NextResponse.redirect(new URL(returnTo, request.url)); ``` ## Advanced Usage ### Token Refresh Access tokens expire after approximately 1 hour. Implement token refresh: ```typescript import { getServerAccessToken, getServerRefreshToken, } from "@/lib/serverAuth"; import { refreshTokens, getOAuthConfig } from "@/lib/chutesAuth"; async function getValidToken(): Promise { const token = await getServerAccessToken(); if (token) { return token; } // Try to refresh const refreshToken = await getServerRefreshToken(); if (!refreshToken) { return null; } try { const config = getOAuthConfig(); const newTokens = await refreshTokens({ refreshToken, config }); // Note: You'll need to set new cookies in a route handler return newTokens.access_token; } catch { return null; } } ``` ### Middleware Protection Protect routes with Next.js middleware: ```typescript // middleware.ts import { NextResponse } from "next/server"; import type { NextRequest } from "next/server"; export function middleware(request: NextRequest) { const token = request.cookies.get("chutes_access_token"); // Protect /dashboard routes if (request.nextUrl.pathname.startsWith("/dashboard")) { if (!token) { return NextResponse.redirect( new URL("/api/auth/chutes/login", request.url) ); } } return NextResponse.next(); } export const config = { matcher: ["/dashboard/:path*"], }; ``` ### Using with Vercel AI SDK Make AI calls using the user's token for billing: ```typescript import { createChutes } from "@chutes-ai/ai-sdk-provider"; import { generateText, streamText } from "ai"; import { getServerAccessToken } from "@/lib/serverAuth"; 
export async function POST(req: Request) { const token = await getServerAccessToken(); if (!token) { return Response.json({ error: "Unauthorized" }, { status: 401 }); } // Use the user's access token instead of your API key const chutes = createChutes({ apiKey: token }); const { message } = await req.json(); const { text } = await generateText({ model: chutes("deepseek-ai/DeepSeek-V3-0324"), prompt: message, }); return Response.json({ text }); } ``` For streaming responses: ```typescript import { createChutes } from "@chutes-ai/ai-sdk-provider"; import { streamText } from "ai"; import { getServerAccessToken } from "@/lib/serverAuth"; export async function POST(req: Request) { const token = await getServerAccessToken(); if (!token) { return Response.json({ error: "Unauthorized" }, { status: 401 }); } const chutes = createChutes({ apiKey: token }); const { message } = await req.json(); const result = await streamText({ model: chutes("meta-llama/Llama-3.1-70B-Instruct"), prompt: message, }); return result.toDataStreamResponse(); } ``` ## Security Best Practices ### 1. Keep Secrets Server-Side Never expose `CHUTES_OAUTH_CLIENT_SECRET` to the client. All token operations happen in API routes. ### 2. Use HttpOnly Cookies All auth cookies are set with `httpOnly: true` to prevent XSS attacks from accessing tokens. ### 3. Validate State Parameter Always validate the `state` parameter in the callback to prevent CSRF attacks. ### 4. Use PKCE PKCE prevents authorization code interception. The implementation handles this automatically. ### 5. HTTPS in Production Cookies are set with `secure: true` in production, requiring HTTPS. ### 6. Limit Scope Requests Only request the scopes you actually need: ```bash # Good - minimal scopes CHUTES_OAUTH_SCOPES="openid profile chutes:invoke" # Avoid requesting unnecessary scopes CHUTES_OAUTH_SCOPES="openid profile chutes:invoke billing:read account:read" ``` ### 7. 
Handle Token Expiry Implement token refresh or prompt users to re-authenticate when tokens expire. ## Troubleshooting ### "Missing client credentials" Error Ensure environment variables are set correctly: ```bash echo $CHUTES_OAUTH_CLIENT_ID echo $CHUTES_OAUTH_CLIENT_SECRET ``` ### "Invalid state" Error This occurs when the state cookie is missing or doesn't match. Causes: - Cookies blocked by browser - Session expired (cookies expire after 10 minutes) - Multiple login attempts in different tabs ### "Token exchange failed" Error Check that: - `redirect_uri` matches exactly what's registered with your OAuth app - `client_secret` is correct - The authorization code hasn't expired (codes are single-use) ### Cookies Not Being Set Ensure your callback URL matches the domain where cookies are set. In development, use `http://localhost:3000` consistently. ## Next Steps - Review the [Sign in with Chutes Overview](overview) for OAuth concepts - Explore the [Vercel AI SDK Integration](/docs/integrations/vercel-ai-sdk) for AI features - Join our [Discord community](https://discord.gg/wHrXwWkCRz) for support --- ## SOURCE: https://chutes.ai/docs/sign-in-with-chutes/overview # Sign in with Chutes **Sign in with Chutes** is an OAuth 2.0 authentication system that allows users to sign into your application using their Chutes account. This enables your app to make API calls on behalf of users, with billing automatically handled through their Chutes account. ## Why Use Sign in with Chutes? 
Traditional API key authentication works well for server-side applications, but for user-facing applications, OAuth provides significant advantages: - **User-Scoped Access**: Each user authenticates with their own Chutes account - **Automatic Billing**: API usage is billed to the user's account, not yours - **Granular Permissions**: Request only the scopes your app needs - **Security**: No API keys stored in client-side code - **Trust**: Users see exactly what permissions they're granting ## Official SDK Repository The fastest way to add "Sign in with Chutes" to your application is using the official SDK repository: **[github.com/chutesai/Sign-in-with-Chutes](https://github.com/chutesai/Sign-in-with-Chutes)** This repository is designed for **vibe coding** with AI assistants like Cursor, Windsurf, or GitHub Copilot. Simply point your AI assistant to the repository, and it can: - Copy the integration files into your project - Set up the OAuth flow automatically - Configure environment variables - Add sign-in components to your UI ### Using with AI Coding Assistants When working with an AI coding assistant, you can reference the SDK repository directly: ``` Add "Sign in with Chutes" to my app using the SDK at: https://github.com/chutesai/Sign-in-with-Chutes ``` The repository includes: | Directory | Contents | | -------------------------- | ----------------------------------------------- | | `packages/nextjs/` | Copy-paste integration files for Next.js | | `scripts/` | Setup wizard and OAuth app registration scripts | | `examples/nextjs-minimal/` | Working demo application | | `docs/` | Framework-specific guides and troubleshooting | ### Manual Quick Start If you prefer a manual approach: ```bash # Clone the repository git clone https://github.com/chutesai/Sign-in-with-Chutes.git # Install dependencies and run the setup wizard cd Sign-in-with-Chutes npm install npx tsx scripts/setup-chutes-app.ts ``` The setup wizard will guide you through registering your OAuth app 
and generating credentials. ## How It Works Sign in with Chutes implements the OAuth 2.0 Authorization Code flow with PKCE (Proof Key for Code Exchange) for enhanced security. ```mermaid sequenceDiagram participant User participant App participant ChutesIDP as Chutes IDP participant ChutesAPI as Chutes API User->>App: Click "Sign in with Chutes" App->>App: Generate PKCE verifier/challenge App->>App: Generate state for CSRF protection App->>ChutesIDP: Redirect to /idp/authorize ChutesIDP->>User: Show login/consent screen User->>ChutesIDP: Authorize app ChutesIDP->>App: Redirect with authorization code App->>ChutesIDP: Exchange code for tokens ChutesIDP->>App: Return access_token, refresh_token App->>ChutesAPI: Make API calls with user's token ChutesAPI->>App: Return user-scoped data ``` ### Flow Overview 1. **User Initiates Login**: User clicks "Sign in with Chutes" in your app 2. **Authorization Request**: Your app redirects to Chutes with a PKCE challenge 3. **User Consent**: User logs in and approves the requested permissions 4. **Authorization Code**: Chutes redirects back with a temporary code 5. **Token Exchange**: Your server exchanges the code for access/refresh tokens 6. 
**API Access**: Use the access token to make API calls on behalf of the user ## Available Scopes When registering your OAuth app, you specify which permissions (scopes) your app requires: | Scope | Description | Use Case | | -------------------------- | ------------------------------- | --------------------------- | | `openid` | OpenID Connect authentication | Required for all apps | | `profile` | Access to username, email, name | User profile display | | `chutes:invoke` | Make AI API calls | Apps using Chutes AI models | | `chutes:invoke:{chute_id}` | Invoke a specific chute only | Limited access to one chute | | `account:read` | Read account information | Account dashboards | | `billing:read` | Read balance and credits | Display user's balance | **Best Practice**: Only request the scopes your application actually needs. Users are more likely to trust apps that request minimal permissions. ## Quick Start ### 1. Register Your OAuth App Register your application with Chutes to receive client credentials: ```bash curl -X POST "https://api.chutes.ai/idp/apps" \ -H "Authorization: Bearer $CHUTES_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "name": "My Application", "description": "Description of your app", "redirect_uris": ["https://yourapp.com/api/auth/callback"], "homepage_url": "https://yourapp.com", "allowed_scopes": ["openid", "profile", "chutes:invoke"] }' ``` You'll receive: - `client_id` - Your app's public identifier (e.g., `cid_xxx`) - `client_secret` - Your app's secret key (e.g., `csc_xxx`) - keep this secure! ### 2. Build the Authorization URL Redirect users to the authorization endpoint with your app details: ``` https://api.chutes.ai/idp/authorize? client_id=YOUR_CLIENT_ID& redirect_uri=https://yourapp.com/api/auth/callback& response_type=code& scope=openid+profile+chutes:invoke& state=RANDOM_STATE_VALUE& code_challenge=PKCE_CHALLENGE& code_challenge_method=S256 ``` ### 3. 
Handle the Callback After the user authorizes your app, they're redirected to your callback URL with an authorization code: ``` https://yourapp.com/api/auth/callback?code=AUTH_CODE&state=RANDOM_STATE_VALUE ``` ### 4. Exchange Code for Tokens Exchange the authorization code for access and refresh tokens: ```bash curl -X POST "https://api.chutes.ai/idp/token" \ -H "Content-Type: application/x-www-form-urlencoded" \ -d "grant_type=authorization_code" \ -d "client_id=YOUR_CLIENT_ID" \ -d "client_secret=YOUR_CLIENT_SECRET" \ -d "code=AUTH_CODE" \ -d "redirect_uri=https://yourapp.com/api/auth/callback" \ -d "code_verifier=PKCE_VERIFIER" ``` ### 5. Make Authenticated Requests Use the access token to make API calls: ```bash curl -H "Authorization: Bearer ACCESS_TOKEN" \ https://api.chutes.ai/users/me ``` ## API Endpoints | Endpoint | Method | Description | | ----------------------- | ------ | -------------------------------- | | `/idp/authorize` | GET | Start OAuth flow (user redirect) | | `/idp/token` | POST | Exchange code for tokens | | `/idp/userinfo` | GET | Get authenticated user's profile | | `/idp/token/introspect` | POST | Validate a token | | `/idp/apps` | POST | Register a new OAuth app | | `/users/me` | GET | Get detailed user information | ### OpenID Configuration For OpenID Connect discovery: ``` https://idp.chutes.ai/.well-known/openid-configuration ``` ## Security Considerations ### PKCE (Proof Key for Code Exchange) PKCE prevents authorization code interception attacks. Always generate a unique code verifier and challenge for each authorization request: 1. Generate a random `code_verifier` (43-128 characters) 2. Create the `code_challenge` as `BASE64URL(SHA256(code_verifier))` 3. Send the challenge with the authorization request 4. Send the verifier with the token exchange request ### State Parameter The `state` parameter prevents CSRF attacks: 1. Generate a random state value before redirecting 2. Store it in the user's session 3. 
Verify it matches when handling the callback ### Token Storage - **Access tokens** expire after approximately 1 hour - **Refresh tokens** can be used to obtain new access tokens - Store tokens in HttpOnly cookies to prevent XSS attacks - Never expose tokens to client-side JavaScript ### Client Secret Protection - Never expose your `client_secret` in client-side code - All token operations should happen on your server - Use environment variables for credential storage ## Token Refresh When an access token expires, use the refresh token to obtain a new one: ```bash curl -X POST "https://api.chutes.ai/idp/token" \ -H "Content-Type: application/x-www-form-urlencoded" \ -d "grant_type=refresh_token" \ -d "client_id=YOUR_CLIENT_ID" \ -d "client_secret=YOUR_CLIENT_SECRET" \ -d "refresh_token=REFRESH_TOKEN" ``` ## Framework Guides For step-by-step implementation guides, see: - **[Next.js Guide](nextjs)** - Complete implementation for Next.js applications ## Next Steps - Review the [Vercel AI SDK Integration](/docs/integrations/vercel-ai-sdk) for using authenticated tokens with AI features - Check out the [API Reference](/docs/api-reference/authentication) for detailed endpoint documentation - Join our [Discord community](https://discord.gg/wHrXwWkCRz) for support --- ## SOURCE: https://chutes.ai/docs/help/faq # Frequently Asked Questions (FAQ) Common questions and answers about Chutes SDK and platform. ## General Questions ### What is Chutes? Chutes is a serverless AI compute platform that lets you deploy and scale AI models on GPU infrastructure without managing servers. You write Python code using our SDK, and we handle the infrastructure, scaling, and deployment. **Key benefits:** - Deploy AI models in minutes, not hours - Pay only for actual compute time used - Automatic scaling from 0 to hundreds of instances - Access to latest GPU hardware (H200, MI300X, B200, etc.) - No DevOps or Kubernetes knowledge required ### How is Chutes different from other platforms? 
| Feature | Chutes | Traditional Cloud | Other AI Platforms | | -------------- | --------------- | ----------------- | ------------------ | | **Setup Time** | Minutes | Hours/Days | Hours | | **Scaling** | Automatic (0→∞) | Manual | Limited | | **Pricing** | Pay-per-use | Always-on | Subscription | | **GPU Access** | Latest hardware | Limited selection | Restricted | | **Code Style** | Simple Python | Complex configs | Platform-specific | ### Who should use Chutes? **Perfect for:** - AI/ML engineers building production applications - Startups needing scalable AI infrastructure - Researchers requiring powerful GPU compute - Companies wanting serverless AI deployment **Use cases:** - LLM chat applications - Image/video generation services - Real-time AI APIs - Batch processing workflows - Model inference at scale ### Is Chutes suitable for production? Yes! Chutes is designed for production workloads with: - 99.9% uptime SLA - Enterprise security and compliance - Confidential Compute with Trusted Execution Environments (TEE) - Global edge deployment - Automatic failover and recovery - 24/7 monitoring and support ## Getting Started ### How do I get started with Chutes? 1. **Install the SDK** ```bash pip install chutes ``` 2. **Create account and authenticate** ```bash chutes auth login ``` 3. **Deploy your first chute** ```python from chutes.chute import Chute chute = Chute(username="myuser", name="hello-world") @chute.cord(public_api_path="/hello") async def hello(): return {"message": "Hello, World!"} ``` ```bash chutes deploy ``` ### Do I need Docker experience? No! Chutes handles containerization automatically. 
However, if you need custom dependencies, you can optionally use our `Image` class: ```python from chutes.image import Image # Simple dependency installation image = ( Image(username="myuser", name="my-app", tag="1.0") .from_base("nvidia/cuda:12.4.1-runtime-ubuntu22.04") .run_command("pip install transformers torch") ) chute = Chute( username="myuser", name="my-app", image=image ) ``` ### What programming languages are supported? Currently, Chutes supports **Python only**. We're considering other languages based on user demand. **Python versions supported:** - Python 3.8+ - Recommended: Python 3.10 or 3.11 ### Can I use my existing Python code? Yes! Chutes is designed to work with existing Python codebases. You typically just need to: 1. Wrap your functions with `@chute.cord` decorators 2. Add any dependencies to an `Image` if needed 3. Deploy with `chutes deploy` ## Deployment & Usage ### How long does deployment take? - **First deployment**: 5-15 minutes (includes image building) - **Code-only updates**: 1-3 minutes - **No-code config updates**: 30 seconds ### Can I deploy multiple versions? Yes! Each deployment creates a new version: ```bash # Deploy new version chutes deploy # List versions chutes chutes versions # Rollback to previous version chutes chutes rollback --version v1.2.3 ``` ### How does scaling work? Chutes automatically scales based on traffic: - **Scale to zero**: No requests = no costs - **Auto-scaling**: Handles traffic spikes automatically - **Global load balancing**: Requests routed to optimal regions - **Cold start optimization**: Fast instance startup ```python # Configure scaling behavior chute = Chute( username="myuser", name="my-app", min_replicas=0, # Scale to zero max_replicas=100 # Scale up to 100 instances ) ``` ### Can I deploy the same model multiple times? Yes! 
You can have multiple deployments:

```python
# Production deployment
prod_chute = Chute(
    username="myuser",
    name="llm-prod",
    node_selector=NodeSelector()
)

# Development deployment
dev_chute = Chute(
    username="myuser",
    name="llm-dev",
    node_selector=NodeSelector()
)
```

### How do I handle different environments?

Use environment variables and different chute names:

```python
import os

environment = os.getenv("ENVIRONMENT", "dev")
chute_name = f"my-app-{environment}"

chute = Chute(username="myuser", name=chute_name)
```

## Performance & Optimization

### How can I optimize performance?

**Model optimization:**

```python
# Use optimized engines
from chutes.chute.template.vllm import build_vllm_chute

chute = build_vllm_chute(
    username="myuser",
    name="fast-llm",
    model_name="microsoft/DialoGPT-medium",
    engine_args={
        "gpu_memory_utilization": 0.9,
        "enable_chunked_prefill": True,
        "use_v2_block_manager": True
    }
)
```

**Hardware selection:**

```python
# Choose appropriate hardware
from chutes.chute import NodeSelector

node_selector = NodeSelector(
    gpu_count=1,
    min_vram_gb_per_gpu=24,
    include=["h100", "a100"]  # High-performance GPUs
)
```

**Caching strategies:**

```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def expensive_computation(input_hash):
    return compute_result(input_hash)
```

### What's the latency for API calls?

Typical latencies:

- **Warm instances**: 50-200ms
- **Cold start**: 5-30 seconds (depending on model size)
- **Global edge**: <100ms additional routing overhead

### How do I minimize cold starts?

```python
# Keep minimum replicas warm
chute = Chute(
    username="myuser",
    name="low-latency-app",
    min_replicas=1  # Always keep 1 instance warm
)
```

```python
# Optimize startup time
@chute.on_startup()
async def setup(self):
    # Load models efficiently
    self.model = load_model_optimized()
```

### Can I use multiple GPUs?

Yes!
Specify multiple GPUs in your node selector:

```python
# Multi-GPU setup
node_selector = NodeSelector(
    gpu_count=4,  # Use 4 GPUs
    min_vram_gb_per_gpu=40
)

# Distribute model across GPUs
@chute.on_startup()
async def setup(self):
    self.model = load_model_distributed(device_map="auto")
```

## Pricing & Billing

### How does pricing work?

Chutes uses **pay-per-use** pricing:

- **Compute**: Per GPU-second of actual usage
- **Memory**: Per GB-second of RAM usage
- **Network**: Per GB of data transfer
- **Storage**: Per GB of persistent storage

**No charges for:**

- Idle time (scaled to zero)
- Failed requests

### How can I control costs?

**Use spot instances:**

```python
node_selector = NodeSelector()
```

**Scale to zero:**

```python
chute = Chute(
    username="myuser",
    name="cost-optimized",
    min_replicas=0  # No idle costs
)
```

**Choose appropriate hardware:**

```python
# Cost-effective GPUs for development
node_selector = NodeSelector(
    include=["l40", "a6000"],  # Less expensive than H100
    exclude=["h100", "h200"]
)
```

**Monitor usage:**

```bash
# Check current usage
chutes account usage

# Set billing alerts
chutes account alerts --threshold 100
```

### Do you offer volume discounts?

Yes! We offer:

- **Startup credits**: Up to $10,000 for qualifying startups
- **Enterprise pricing**: Custom rates for large usage
- **Volume discounts**: Automatic discounts at usage tiers

Contact `support@chutes.ai` to discuss sales options.

## Features & Capabilities

### What AI frameworks are supported?

**Officially supported:**

- **PyTorch**: Full support with CUDA optimization
- **Transformers**: Hugging Face models and pipelines
- **VLLM**: High-performance LLM inference
- **SGLang**: Structured generation for LLMs
- **Diffusers**: Image/video generation models

**Community supported:**

- TensorFlow/Keras
- JAX/Flax
- ONNX Runtime
- OpenCV
- scikit-learn

### Can I use custom models?

Absolutely!
You can provide your models in several ways:

```python
# From Hugging Face Hub
model_name = "your-username/custom-model"

# From local files
image = Image().copy("./my-model/", "/opt/model/")

# From cloud storage
image = Image().run([
    "wget https://storage.example.com/model.bin -O /opt/model.bin"
])
```

### Do you support streaming responses?

Yes! Perfect for LLM chat applications:

```python
from typing import AsyncGenerator

@chute.cord(public_api_path="/stream")
async def stream_generate(self, prompt: str) -> AsyncGenerator[str, None]:
    async for token in self.model.stream_generate(prompt):
        yield f"data: {token}\n\n"
```

### Can I run background jobs?

Yes! Use the `@chute.job` decorator:

```python
@chute.job()
async def process_batch(self, batch_data: List[str]):
    results = []
    for item in batch_data:
        result = await self.process_item(item)
        results.append(result)
    return results

# Trigger job
@chute.cord(public_api_path="/submit_batch")
async def submit_batch(self, data: List[str]):
    job_id = await self.process_batch(data)
    return {"job_id": job_id}
```

### Is there a Python client library?

Yes! Use the generated client or standard HTTP:

```python
# Generated client (coming soon)
from chutes.client import ChuteClient

client = ChuteClient("https://your-chute.chutes.ai")
result = await client.predict(text="Hello world")

# Standard HTTP requests
import httpx

async with httpx.AsyncClient() as client:
    response = await client.post(
        "https://your-chute.chutes.ai/predict",
        json={"text": "Hello world"}
    )
    result = response.json()
```

## Technical Details

### What regions are available?

**Current regions:**

- **US**: us-west-2 (Oregon), us-east-1 (Virginia)
- **Europe**: eu-west-1 (Ireland), eu-central-1 (Frankfurt)
- **Asia**: ap-southeast-1 (Singapore), ap-northeast-1 (Tokyo)

**Coming soon:**

- us-central-1, eu-west-2, ap-south-1

### What GPU types are available?
| GPU       | VRAM      | Best For                 | Pricing Tier |
| --------- | --------- | ------------------------ | ------------ |
| **T4**    | 16GB      | Small models, dev        | $            |
| **V100**  | 16GB/32GB | Training, medium models  | $$           |
| **A6000** | 48GB      | Production inference     | $$$          |
| **L40**   | 48GB      | Cost-effective inference | $$$          |
| **A100**  | 40GB/80GB | Large models, training   | $$$$         |
| **H100**  | 80GB      | Latest generation        | $$$$$        |
| **H200**  | 141GB     | Massive models           | $$$$$        |

### How does networking work?

- **Public endpoints**: HTTPS with automatic SSL certificates
- **Private endpoints**: VPC peering for enterprise customers
- **Load balancing**: Automatic traffic distribution
- **CDN**: Global content delivery for static assets

### What about data persistence?

**Temporary storage** (included):

- Container filesystem
- Cleared on restart/redeploy

**Persistent storage** (optional):

```python
chute = Chute(
    username="myuser",
    name="persistent-app",
    storage_gb=100  # 100GB persistent disk
)

# Access at /opt/storage/
@chute.cord(public_api_path="/save")
async def save_data(self, data: str):
    with open("/opt/storage/data.txt", "w") as f:
        f.write(data)
```

### Can I access the underlying infrastructure?

Chutes is serverless, so direct infrastructure access isn't available. However, you get:

- **System info**: CPU, memory, GPU details via APIs
- **Metrics**: Performance monitoring and alerts
- **Logs**: Comprehensive application and system logs
- **Debug endpoints**: Custom debugging interfaces

## Troubleshooting

### My deployment is failing. What should I check?

1. **Validate configuration:**

   ```bash
   chutes chutes validate --file chute.py
   ```

2. **Check build logs:**

   ```bash
   chutes chutes logs --build-logs
   ```

3. **Verify resource availability:**

   ```bash
   chutes nodes list --available
   ```

4. **Common fixes:**
   - Reduce GPU requirements
   - Enable spot instances
   - Use more flexible node selector
   - Check dependency versions

### I'm getting out of memory errors. How do I fix this?
**Immediate fixes:**

```python
# Request more VRAM
node_selector = NodeSelector(min_vram_gb_per_gpu=48)

# Or reduce batch size
engine_args = {"max_num_batched_tokens": 1024}

# Enable memory optimization
engine_args = {"gpu_memory_utilization": 0.85}
```

See the [Troubleshooting Guide](troubleshooting) for more details.

### How do I debug performance issues?

```python
# Add performance monitoring
import time

@chute.cord(public_api_path="/predict")
async def predict(self, input_data):
    start_time = time.time()
    result = await self.model.predict(input_data)
    duration = time.time() - start_time
    self.logger.info(f"Prediction took {duration:.2f}s")
    return result

# Check resource usage
@chute.cord(public_api_path="/stats")
async def get_stats(self):
    return {
        "gpu_memory": torch.cuda.memory_allocated(),
        "cpu_percent": psutil.cpu_percent()
    }
```

## Integrations

### Can I integrate with my existing CI/CD?

Yes! Chutes works with any CI/CD system:

**GitHub Actions:**

```yaml
name: Deploy to Chutes
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - name: Install Chutes
        run: pip install chutes
      - name: Deploy
        run: chutes deploy --name my-app-prod
        env:
          CHUTES_API_KEY: ${{ secrets.CHUTES_API_KEY }}
```

### Does it work with monitoring tools?

Yes! Export metrics to your preferred tools:

```python
# Prometheus metrics
@chute.cord(public_api_path="/metrics")
async def metrics(self):
    return generate_prometheus_metrics()

# Custom webhooks
@chute.cord(public_api_path="/predict")
async def predict(self, input_data):
    result = await self.model.predict(input_data)
    # Send to monitoring
    await send_to_datadog(metric="prediction_count", value=1)
    return result
```

### Can I use it with databases?

Absolutely!
Connect to any database:

```python
# PostgreSQL example
import asyncpg

@chute.on_startup()
async def setup(self):
    self.db = await asyncpg.connect(
        host=os.getenv("DB_HOST"),
        user=os.getenv("DB_USER"),
        password=os.getenv("DB_PASSWORD"),
        database=os.getenv("DB_NAME")
    )

@chute.cord(public_api_path="/query")
async def query_data(self, query: str):
    rows = await self.db.fetch("SELECT * FROM table WHERE condition = $1", query)
    return [dict(row) for row in rows]
```

## Security & Privacy

### How secure is my data?

**Infrastructure security:**

- SOC 2 Type II compliance
- End-to-end encryption (TLS 1.3)
- Network isolation between deployments
- Regular security audits and penetration testing

**Data handling:**

- No persistent storage of request/response data
- Optional data encryption at rest
- GDPR and CCPA compliant
- Customer data never used for training

### Can I use private models?

Yes! Several options for private models:

```python
# Private Hugging Face models (requires token)
os.environ["HUGGINGFACE_HUB_TOKEN"] = "your_token"

# Upload during build
image = Image().copy("./private-model/", "/opt/model/")

# Download from private S3
image = Image().run([
    "aws s3 cp s3://private-bucket/model.bin /opt/model.bin"
]).env("AWS_ACCESS_KEY_ID", "your_key")
```

## Still have questions?

- **Community**: Join our [Discord](https://discord.gg/wHrXwWkCRz) for community support
- **Documentation**: Check our [comprehensive docs](/docs)
- **Support**: Email `support@chutes.ai` for technical assistance
- **Sales**: Contact `support@chutes.ai`

We're constantly updating this FAQ based on user feedback. If you have a question not covered here, please let us know!

---

## SOURCE: https://chutes.ai/docs/help/troubleshooting

# Troubleshooting Guide

This guide helps you diagnose and resolve common issues when developing and deploying with Chutes.
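A recurring theme in the fixes below is telling "still starting up" apart from "actually broken". A small readiness poller with exponential backoff can make that distinction explicit — sketched here with an injected `probe` callable standing in for a real HTTP GET against your chute's health endpoint, so the logic runs without a live deployment:

```python
import time
from typing import Callable

def wait_until_ready(probe: Callable[[], dict],
                     attempts: int = 5,
                     base_delay: float = 0.01) -> bool:
    """Return True once probe() reports {"status": "ready"}, else False."""
    for attempt in range(attempts):
        if probe().get("status") == "ready":
            return True
        time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    return False

# Simulate a chute that becomes ready on the third probe.
responses = iter([{"status": "loading"}, {"status": "loading"}, {"status": "ready"}])
print(wait_until_ready(lambda: next(responses)))  # True
```

If the poller still reports not-ready after the backoff window, treat it as a real failure and check the build and runtime logs rather than retrying indefinitely.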
## Deployment Issues

### Build Failures

#### Python Package Installation Errors

**Problem**: Packages fail to install during image build

```bash
ERROR: Could not find a version that satisfies the requirement torch==2.1.0
```

**Solutions**:

```python
from chutes.image import Image

# Use compatible base images
image = Image(
    username="myuser",
    name="my-image",
    tag="1.0"
).from_base("nvidia/cuda:12.4.1-runtime-ubuntu22.04")

# Specify compatible package versions (quote the specifier so the
# shell doesn't treat ">" as a redirect)
image.run_command("pip install 'torch>=2.4.0' torchvision --index-url https://download.pytorch.org/whl/cu124")

# Alternative: Use conda for complex dependencies
image.run_command("conda install pytorch torchvision pytorch-cuda=12.4 -c pytorch -c nvidia")
```

#### Docker Build Context Issues

**Problem**: Large files causing slow uploads

```bash
Uploading build context... 2.3GB
```

**Solutions**:

```python
# Create .dockerignore to exclude unnecessary files
# .dockerignore content:
"""
__pycache__/
*.pyc
.git/
.pytest_cache/
large_datasets/
*.mp4
*.avi
"""

# Or use specific file inclusion
image.add("app.py", "/app/app.py")
image.add("requirements.txt", "/app/requirements.txt")
```

#### Permission Errors

**Problem**: Permission denied during build

```bash
Permission denied: '/usr/local/bin/pip'
```

**Solutions**:

```python
# Run commands as root when needed
image.set_user("root")
image.run_command("apt-get update && apt-get install -y curl")

# Set proper ownership
image.run_command("chown -R chutes:chutes /app")

# Use USER directive correctly
image.set_user("chutes")
```

### Deployment Timeouts

**Problem**: Deployment hangs or times out

**Solutions**:

```python
# Optimize startup time
@chute.on_startup()
async def setup(self):
    # Move heavy operations to background
    asyncio.create_task(self.load_model_async())

async def load_model_async(self):
    """Load model in background to avoid startup timeout."""
    self.model = load_large_model()
    self.ready = True

@chute.cord(public_api_path="/health")
async def health_check(self):
"""Health check endpoint.""" return {"status": "ready" if hasattr(self, 'ready') else "loading"} ``` ## Runtime Errors ### Out of Memory Errors #### GPU Out of Memory **Problem**: CUDA out of memory errors ```python RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB ``` **Solutions**: ```python import torch import gc # Clear GPU cache torch.cuda.empty_cache() gc.collect() # Use gradient checkpointing model.gradient_checkpointing_enable() # Reduce batch size @chute.cord(public_api_path="/generate") async def generate(self, request: GenerateRequest): # Process in smaller batches batch_size = min(request.batch_size, 4) # Use mixed precision with torch.cuda.amp.autocast(): outputs = model.generate(**inputs) return outputs # Optimize node selector node_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24, # Increase VRAM requirement include=["a100", "h100"] ) ``` #### System RAM Issues **Problem**: System runs out of RAM ```python MemoryError: Unable to allocate array ``` **Solutions**: ```python # Increase RAM in node selector node_selector = NodeSelector( gpu_count=1, min_vram_gb_per_gpu=24 ) # Use memory-efficient data loading import torch.utils.data as data class MemoryEfficientDataset(data.Dataset): def __init__(self, file_paths): self.file_paths = file_paths def __getitem__(self, idx): # Load data on-demand instead of pre-loading return load_data(self.file_paths[idx]) ``` ### Model Loading Errors #### Missing Model Files **Problem**: Model files not found ```python FileNotFoundError: Model file not found: /models/pytorch_model.bin ``` **Solutions**: ```python from huggingface_hub import snapshot_download import os @chute.on_startup() async def setup(self): """Download model if not present.""" model_path = "/models/my-model" if not os.path.exists(model_path): # Download model during startup snapshot_download( repo_id="microsoft/DialoGPT-medium", local_dir=model_path, token=os.getenv("HF_TOKEN") # If private model ) self.model = 
load_model(model_path) ``` #### Model Compatibility Issues **Problem**: Model format incompatible with library version ```python ValueError: Unsupported model format ``` **Solutions**: ```python # Pin compatible versions image.run_command("pip install transformers==4.36.0 torch==2.1.0 safetensors==0.4.0") # Use format conversion from transformers import AutoModel import torch # Convert to compatible format model = AutoModel.from_pretrained("model-name") torch.save(model.state_dict(), "/models/converted_model.pt") ``` ## Performance Problems ### Slow Inference **Problem**: Inference takes too long **Diagnosis**: ```python import time import torch @chute.cord(public_api_path="/generate") async def generate(self, request: GenerateRequest): start_time = time.time() # Profile different stages load_time = time.time() inputs = prepare_inputs(request.text) prep_time = time.time() - load_time # Inference timing inference_start = time.time() with torch.no_grad(): outputs = self.model.generate(**inputs) inference_time = time.time() - inference_start # Post-processing timing post_start = time.time() result = postprocess_outputs(outputs) post_time = time.time() - post_start total_time = time.time() - start_time self.logger.info(f"Timing - Prep: {prep_time:.2f}s, Inference: {inference_time:.2f}s, Post: {post_time:.2f}s, Total: {total_time:.2f}s") return result ``` **Solutions**: ```python # Enable optimizations model.eval() model = torch.compile(model) # PyTorch 2.0+ optimization # Use efficient data types model = model.half() # Use FP16 # Batch processing @chute.cord(public_api_path="/batch_generate") async def batch_generate(self, requests: List[GenerateRequest]): # Process multiple requests together batch_inputs = [prepare_inputs(req.text) for req in requests] batch_outputs = self.model.generate_batch(batch_inputs) return [postprocess_outputs(output) for output in batch_outputs] ``` ### High Latency **Problem**: First request is very slow (cold start) **Solutions**: ```python 
@chute.on_startup()
async def setup(self):
    """Warm up model to reduce cold start."""
    self.model = load_model()

    # Warm-up inference
    dummy_input = "Hello world"
    _ = self.model.generate(dummy_input)

    self.logger.info("Model warmed up successfully")

# Use model caching
@chute.cord(public_api_path="/generate")
async def generate(self, request: GenerateRequest):
    # Cache compiled model
    if not hasattr(self, '_compiled_model'):
        self._compiled_model = torch.compile(self.model)
    return self._compiled_model.generate(request.text)
```

## Authentication Issues

### API Key Problems

**Problem**: Authentication failures

```bash
HTTPException: 401 Unauthorized
```

**Solutions**:

```bash
# Check API key configuration
chutes account info

# Set API key correctly
chutes auth login
# or
export CHUTES_API_KEY="your-api-key"

# Verify key is working
chutes chutes list
```

### Permission Errors

**Problem**: Insufficient permissions for operations

```bash
HTTPException: 403 Forbidden
```

**Solutions**:

```bash
# Check account permissions
chutes account info

# Contact support if you need additional permissions
# Ensure you're using the correct username in deployments
```

## Debugging Techniques

### Logging and Monitoring

```python
import logging
from chutes.chute import Chute

# Configure detailed logging
logging.basicConfig(level=logging.DEBUG)

chute = Chute(
    username="myuser",
    name="debug-app"
)

@chute.on_startup()
async def setup(self):
    self.logger.info("Application starting up")

    # Log system information
    import torch
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(i)
            self.logger.info(f"GPU {i}: {props.name} ({props.total_memory // (1024**3)}GB)")

@chute.cord(public_api_path="/debug")
async def debug_info(self):
    """Debug endpoint for system information."""
    import psutil
    import torch

    info = {
        "cpu_percent": psutil.cpu_percent(),
        "memory_percent": psutil.virtual_memory().percent,
        "gpu_memory": {}
    }

    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            allocated = torch.cuda.memory_allocated(i)
            total = torch.cuda.get_device_properties(i).total_memory
            info["gpu_memory"][f"gpu_{i}"] = {
                "allocated_gb": allocated / (1024**3),
                "total_gb": total / (1024**3),
                "utilization": (allocated / total) * 100
            }

    return info
```

### Remote Debugging

```python
# Enable remote debugging for development
import os

if os.getenv("DEBUG_MODE"):
    import debugpy
    debugpy.listen(("0.0.0.0", 5678))
    print("Waiting for debugger to attach...")
    debugpy.wait_for_client()
```

### Error Tracking

```python
import traceback
from fastapi import HTTPException

@chute.cord(public_api_path="/generate")
async def generate(self, request: GenerateRequest):
    try:
        result = self.model.generate(request.text)
        return result
    except torch.cuda.OutOfMemoryError:
        self.logger.error("GPU out of memory", exc_info=True)
        raise HTTPException(
            status_code=503,
            detail="Service temporarily unavailable due to memory constraints"
        )
    except Exception as e:
        self.logger.error(f"Unexpected error: {str(e)}", exc_info=True)
        raise HTTPException(
            status_code=500,
            detail="Internal server error"
        )
```

## Resource Issues

### Node Selection Problems

**Problem**: No available nodes matching requirements

**Solutions**:

```python
# Make node selector more flexible
node_selector = NodeSelector(
    gpu_count=1,
    min_vram_gb_per_gpu=16,  # Reduce if too restrictive
    include=["a100", "l40", "a6000"],  # Include more GPU types
    exclude=[]  # Remove exclusions
)
```

### Scaling Issues

**Problem**: Chute can't handle high load

**Solutions**:

```python
# Optimize for concurrency
node_selector = NodeSelector(
    gpu_count=2,  # Multiple GPUs for parallel processing
    min_vram_gb_per_gpu=24
)

# Implement request queuing
import asyncio
from asyncio import Semaphore

class RateLimitedChute(Chute):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.semaphore = Semaphore(5)  # Limit concurrent requests
    @chute.cord(public_api_path="/generate")
    async def generate(self, request: GenerateRequest):
        async with self.semaphore:
            return await self._generate_impl(request)
```

## Networking Problems

### Connection Issues

**Problem**: Cannot reach deployed chute

**Solutions**:

```bash
# Check chute status
chutes chutes get myuser/my-chute

# Check logs for errors
chutes chutes logs myuser/my-chute

# Test health endpoint
curl https://your-chute-url/health
```

### Timeout Issues

**Problem**: Requests timing out

**Solutions**:

```python
# Implement async processing for long-running tasks
@chute.job()
async def process_long_task(self, task_id: str, input_data: dict):
    """Background job for long-running tasks."""
    try:
        result = await long_running_process(input_data)
        # Store result in database or file system
        store_result(task_id, result)
    except Exception as e:
        self.logger.error(f"Task {task_id} failed: {e}")
        store_error(task_id, str(e))

@chute.cord(public_api_path="/start_task")
async def start_task(self, request: TaskRequest):
    """Start a background task and return task ID."""
    task_id = generate_task_id()
    await self.process_long_task(task_id, request.data)
    return {"task_id": task_id, "status": "started"}

@chute.cord(public_api_path="/task_status/{task_id}")
async def get_task_status(self, task_id: str):
    """Get status of a background task."""
    return lookup_task_status(task_id)  # storage helper, named to avoid shadowing this endpoint
```

---

## SOURCE: https://chutes.ai/docs/index

# Chutes SDK Documentation

Welcome to the complete documentation for the **Chutes SDK** - a powerful Python framework for building and deploying serverless AI applications on GPU-accelerated infrastructure.

## What is Chutes?

Chutes is a serverless AI compute platform that allows you to:

- 🚀 Deploy AI models and applications instantly!
- 💰 Pay only for GPU time you actually use
- 🔧 Build custom Docker images or use pre-built templates
- 📊 Scale automatically based on demand
- 🎯 Focus on your AI logic, not infrastructure management

## Quick Start

```bash
# Install the Chutes SDK
pip install chutes

# Register your account
chutes register

# Deploy your first chute
chutes deploy my_chute:chute
```

## Key Features

### 🎯 **Simple Decorator-Based API**

Define your AI endpoints with simple Python decorators:

```python
@chute.cord(public_api_path="/generate")
async def generate_text(self, prompt: str) -> str:
    return await self.model.generate(prompt)
```

### 🔧 **Flexible Templates**

Get started quickly with pre-built templates for popular AI frameworks:

```python
from chutes.chute.template.vllm import build_vllm_chute

chute = build_vllm_chute(
    username="myuser",
    model_name="microsoft/DialoGPT-medium",
    node_selector=NodeSelector(gpu_count=1)
)
```

### 🏗️ **Custom Image Building**

Build sophisticated Docker environments with a fluent API:

```python
image = (
    Image(username="myuser", name="custom-ai", tag="1.0")
    .from_base("nvidia/cuda:12.2-devel-ubuntu22.04")
    .with_python("3.11")
    .run_command("pip install torch transformers")
    .with_env("MODEL_PATH", "/app/models")
)
```

### ⚡ **Hardware Optimization**

Specify exactly the hardware you need:

```python
node_selector = NodeSelector(
    gpu_count=4,
    min_vram_gb_per_gpu=80,
    exclude=["old_gpus"]
)
```

## Architecture Overview

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Your Code    │     │   Chutes SDK    │     │ Chutes Platform │
│                 │     │                 │     │                 │
│  @chute.cord    │───▶│ Build & Deploy  │───▶│  GPU Clusters   │
│  def generate() │     │                 │     │                 │
│                 │     │   HTTP APIs     │     │  Auto-scaling   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```

## Security & Trust

Chutes is built on a "don't trust, verify" philosophy.
We employ advanced security measures including:

- 🔒 **End-to-End Encryption**
- 🛡️ **Trusted Execution Environments (TEEs)** using Intel TDX
- 🔍 **Cryptographic Verification** of code and models
- 🛑 **Hardware Attestation** for GPUs

Learn more about our [Security Architecture](core-concepts/security-architecture).

## Integrations

Chutes integrates with popular AI frameworks to make development easier:

- 🔗 **[Vercel AI SDK](integrations/vercel-ai-sdk)** - Use Chutes with the Vercel AI SDK for streaming, tool calling, and more
- 🔐 **[Sign in with Chutes](sign-in-with-chutes/overview)** - Add OAuth authentication to let users sign in with their Chutes account

## Community & Support

- 📖 **Documentation**: You're here!
- 💬 **Discord**: [Join our community](https://discord.gg/wHrXwWkCRz)
- 🐛 **Issues**: [GitHub Issues](https://github.com/chutesai/chutes)

---

Ready to get started? Head to the [Installation Guide](getting-started/installation) to begin your Chutes journey!

---