TEI Template
The TEI (Text Embeddings Inference) template provides optimized text embedding generation using Hugging Face's high-performance inference server. It is well suited to semantic search, similarity detection, and retrieval-augmented generation (RAG) applications.
What is TEI?
Text Embeddings Inference (TEI) is a specialized inference server for embedding models that provides:
- ⚡ Optimized performance with Rust-based implementation
- 📊 Batch processing for efficient throughput
- 🔄 Automatic batching and request queuing
- 📏 Embedding normalization and pooling options
- 🎯 Production-ready with health checks and metrics
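To make the pooling and normalization options above concrete, here is a minimal pure-Python sketch of mean pooling followed by L2 normalization. It illustrates the general technique only and is not TEI's actual implementation; the function names and toy vectors are assumptions made for this example.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in total]


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        return list(vector)
    return [v / norm for v in vector]


# Toy input: three token vectors, the last one is padding (mask 0)
tokens = [[3.0, 0.0], [3.0, 8.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)      # average of the first two vectors
embedding = l2_normalize(pooled)      # unit-length sentence embedding
print(pooled, embedding)
```

With unit-length embeddings, a plain dot product between two vectors equals their cosine similarity, which is why normalization matters for similarity search.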
Quick Start
from chutes.chute import NodeSelector
from chutes.chute.template.tei import build_tei_chute
chute = build_tei_chute(
username="myuser",
model_name="sentence-transformers/all-MiniLM-L6-v2",
revision="main",
node_selector=NodeSelector(
gpu_count=1,
min_vram_gb_per_gpu=8
)
)
This creates a complete TEI deployment with:
- ✅ Optimized embedding inference server
- ✅ OpenAI-compatible embeddings API
- ✅ Automatic request batching
- ✅ Built-in normalization
- ✅ Auto-scaling based on demand
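As a rough illustration of what "OpenAI-compatible" means for a client, the sketch below builds a request body and reads embeddings out of a response body. The helper names and the sample response are hypothetical; the response is hand-written in the OpenAI embeddings format rather than captured from a real chute.

```python
def build_embeddings_request(model_name, texts):
    """Build the JSON body for POST /v1/embeddings; a single string becomes a one-item list."""
    if isinstance(texts, str):
        texts = [texts]
    return {"model": model_name, "input": list(texts)}


def extract_embeddings(response_json):
    """Return embeddings ordered by their 'index' field so they line up with the inputs."""
    items = sorted(response_json["data"], key=lambda item: item.get("index", 0))
    return [item["embedding"] for item in items]


# Hypothetical response body, shortened to two dimensions for readability
sample_response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.0, 1.0]},
        {"object": "embedding", "index": 0, "embedding": [1.0, 0.0]},
    ],
    "model": "sentence-transformers/all-MiniLM-L6-v2",
}

payload = build_embeddings_request(
    "sentence-transformers/all-MiniLM-L6-v2",
    ["first text", "second text"],
)
vectors = extract_embeddings(sample_response)
print(payload["input"], vectors)
```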
Function Reference
def build_tei_chute(
username: str,
model_name: str,
revision: str = "main",
node_selector: NodeSelector = None,
image: str | Image = None,
tagline: str = "",
readme: str = "",
concurrency: int = 1,
# TEI-specific parameters
max_batch_tokens: int = 16384,
max_batch_requests: int = 512,
max_concurrent_requests: int = 512,
pooling: str = "mean",
normalize: bool = True,
trust_remote_code: bool = False,
**kwargs
) -> Chute:
Required Parameters
- username: Your Chutes username
- model_name: Hugging Face embedding model identifier
TEI Configuration
- max_batch_tokens: Maximum tokens per batch (default: 16384)
- max_batch_requests: Maximum requests per batch (default: 512)
- max_concurrent_requests: Maximum concurrent requests (default: 512)
- pooling: Pooling strategy - "mean", "cls", or "max" (default: "mean")
- normalize: Whether to normalize embeddings (default: True)
- trust_remote_code: Allow custom model code execution (default: False)
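To build intuition for how the two batch limits interact, the following sketch packs queued requests greedily into batches. It assumes a batch closes as soon as adding the next request would exceed either the token limit or the request limit; this is a simplified model for illustration, not TEI's real scheduler, and `pack_batches` is a hypothetical helper.

```python
def pack_batches(token_counts, max_batch_tokens=16384, max_batch_requests=512):
    """Greedily group per-request token counts into batches under both limits."""
    batches = []
    current, current_tokens = [], 0
    for count in token_counts:
        if count > max_batch_tokens:
            raise ValueError(f"request with {count} tokens exceeds max_batch_tokens")
        over_tokens = current_tokens + count > max_batch_tokens
        over_requests = len(current) >= max_batch_requests
        if current and (over_tokens or over_requests):
            batches.append(current)
            current, current_tokens = [], 0
        current.append(count)
        current_tokens += count
    if current:
        batches.append(current)
    return batches


queue = [6000, 6000, 6000, 200, 200]  # token counts of five queued requests

# Token-bound: the third 6000-token request does not fit into the first batch
token_bound = pack_batches(queue, max_batch_tokens=16384, max_batch_requests=512)

# Request-bound: at most two requests per batch, regardless of token count
request_bound = pack_batches(queue, max_batch_tokens=16384, max_batch_requests=2)

print(token_bound)
print(request_bound)
```

In this model, long documents hit the token limit first, while many short texts hit the request limit first, so both values are worth tuning for your workload.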
Complete Example
from chutes.chute import NodeSelector
from chutes.chute.template.tei import build_tei_chute
# Build TEI chute for embedding generation
chute = build_tei_chute(
username="myuser",
model_name="sentence-transformers/all-MiniLM-L6-v2",
revision="main",
node_selector=NodeSelector(
gpu_count=1,
min_vram_gb_per_gpu=8
),
tagline="High-performance text embeddings",
readme="""
# Text Embeddings Service
Fast and efficient text embedding generation using TEI.
## Features
- OpenAI-compatible embeddings API
- Automatic batching and optimization
- Normalized embeddings for similarity search
- Production-ready performance
## API Endpoints
- `/v1/embeddings` - Generate embeddings
- `/embed` - Alternative embedding endpoint
- `/health` - Health check
""",
# TEI optimization
max_batch_tokens=32768,
max_batch_requests=256,
pooling="mean",
normalize=True
)
API Endpoints
OpenAI-Compatible Embeddings
curl -X POST https://myuser-tei-chute.chutes.ai/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "sentence-transformers/all-MiniLM-L6-v2",
"input": [
"The quick brown fox jumps over the lazy dog",
"Machine learning is transforming technology"
]
}'
Single Text Embedding
curl -X POST https://myuser-tei-chute.chutes.ai/embed \
-H "Content-Type: application/json" \
-d '{
"inputs": "This is a sample text for embedding generation"
}'
Batch Processing
curl -X POST https://myuser-tei-chute.chutes.ai/embed \
-H "Content-Type: application/json" \
-d '{
"inputs": [
"First document to embed",
"Second document for embedding",
"Third text for similarity search"
]
}'
Model Recommendations
Small & Fast Models
# Lightweight, fast inference
NodeSelector(
gpu_count=1,
min_vram_gb_per_gpu=4,
include=["rtx3090", "rtx4090"]
)
# Recommended models:
# - sentence-transformers/all-MiniLM-L6-v2 (384 dim)
# - sentence-transformers/all-MiniLM-L12-v2 (384 dim)
# - microsoft/codebert-base (768 dim)
Balanced Performance Models
# Good balance of speed and quality
NodeSelector(
gpu_count=1,
min_vram_gb_per_gpu=8,
include=["rtx4090", "a100"]
)
# Recommended models:
# - sentence-transformers/all-mpnet-base-v2 (768 dim)
# - sentence-transformers/multi-qa-mpnet-base-dot-v1 (768 dim)
# - thenlper/gte-base (768 dim)
High-Quality Models
# Best embedding quality
NodeSelector(
gpu_count=1,
min_vram_gb_per_gpu=12,
include=["a100", "h100"]
)
# Recommended models:
# - sentence-transformers/all-mpnet-base-v2 (768 dim)
# - intfloat/e5-large-v2 (1024 dim)
# - BAAI/bge-large-en-v1.5 (1024 dim)
Use Cases
1. Semantic Search
search_chute = build_tei_chute(
username="myuser",
model_name="sentence-transformers/multi-qa-mpnet-base-dot-v1",
tagline="Semantic search embeddings",
max_batch_tokens=32768, # Handle large documents
normalize=True # Important for similarity search
)
2. Document Similarity
similarity_chute = build_tei_chute(
username="myuser",
model_name="sentence-transformers/all-mpnet-base-v2",
tagline="Document similarity service",
pooling="mean",
normalize=True
)
3. Code Embeddings
code_chute = build_tei_chute(
username="myuser",
model_name="microsoft/codebert-base",
tagline="Code similarity and search",
max_batch_tokens=16384, # Typical code snippet length
trust_remote_code=True # May be needed for code models
)
4. Multilingual Embeddings
multilingual_chute = build_tei_chute(
username="myuser",
model_name="sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
tagline="Multilingual text embeddings",
max_batch_requests=1024 # Handle diverse languages efficiently
)
Performance Optimization
Throughput Optimization
# Maximize throughput for batch processing
chute = build_tei_chute(
username="myuser",
model_name="sentence-transformers/all-MiniLM-L6-v2",
max_batch_tokens=65536, # Large batches
max_batch_requests=1024, # Many requests
max_concurrent_requests=2048, # High concurrency
concurrency=8 # Allow more concurrent requests per chute instance
)
Latency Optimization
# Minimize latency for real-time applications
chute = build_tei_chute(
username="myuser",
model_name="sentence-transformers/all-MiniLM-L6-v2",
max_batch_tokens=4096, # Smaller batches
max_batch_requests=32, # Fewer requests per batch
max_concurrent_requests=128 # Lower concurrency
)
Memory Optimization
# Optimize for memory usage
chute = build_tei_chute(
username="myuser",
model_name="sentence-transformers/all-MiniLM-L6-v2",
max_batch_tokens=8192, # Moderate batch size
max_batch_requests=256, # Moderate requests
node_selector=NodeSelector(
gpu_count=1,
min_vram_gb_per_gpu=6 # Conservative memory
)
)
Testing Your TEI Chute
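Before comparing similarity scores, it can help to confirm that the returned vectors look sane. The sketch below checks that all embeddings share one dimension and, assuming the chute was built with normalize=True, that each has unit length. `check_embeddings` is a hypothetical helper, and the sample vectors stand in for a real response.

```python
import math


def check_embeddings(embeddings, tolerance=1e-3):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    if not embeddings:
        return ["no embeddings returned"]
    problems = []
    expected_dim = len(embeddings[0])
    for i, vector in enumerate(embeddings):
        if len(vector) != expected_dim:
            problems.append(
                f"embedding {i} has {len(vector)} dims, expected {expected_dim}"
            )
        norm = math.sqrt(sum(v * v for v in vector))
        if abs(norm - 1.0) > tolerance:
            problems.append(f"embedding {i} has norm {norm:.3f}, expected 1.0")
    return problems


good = [[0.6, 0.8], [1.0, 0.0]]        # same dimension, unit length
bad = [[0.6, 0.8], [3.0, 4.0, 0.0]]    # wrong dimension and not normalized

print(check_embeddings(good))
print(check_embeddings(bad))
```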
Python Client
import requests
import numpy as np
# Generate embeddings
response = requests.post(
"https://myuser-tei-chute.chutes.ai/v1/embeddings",
json={
"model": "sentence-transformers/all-MiniLM-L6-v2",
"input": [
"The quick brown fox",
"A fast brown animal",
"The weather is nice today"
]
}
)
result = response.json()
embeddings = [item["embedding"] for item in result["data"]]
# Calculate similarity (the dot product equals cosine similarity when embeddings are normalized)
emb1 = np.array(embeddings[0])
emb2 = np.array(embeddings[1])
emb3 = np.array(embeddings[2])
similarity_1_2 = np.dot(emb1, emb2) # Should be high
similarity_1_3 = np.dot(emb1, emb3) # Should be low
print(f"Similarity fox vs animal: {similarity_1_2:.3f}")
print(f"Similarity fox vs weather: {similarity_1_3:.3f}")
OpenAI Client
from openai import OpenAI
# Use OpenAI client with your chute
client = OpenAI(
api_key="dummy", # Not needed for Chutes
base_url="https://myuser-tei-chute.chutes.ai/v1"
)
# Generate embeddings
response = client.embeddings.create(
model="sentence-transformers/all-MiniLM-L6-v2",
input=[
"Document for semantic search",
"Query for finding similar content"
]
)
for i, item in enumerate(response.data):
    print(f"Embedding {i}: {len(item.embedding)} dimensions")
Batch Processing Test
import asyncio
import aiohttp
import time
EMBED_URL = "https://myuser-tei-chute.chutes.ai/embed"

async def embed_one(session, text):
    """Send a single text and read the body so the connection is released."""
    async with session.post(EMBED_URL, json={"inputs": text}) as response:
        return await response.json()

async def test_batch_performance():
    """Test batch processing performance."""
    # Generate test texts
    texts = [f"This is test document number {i} for embedding generation."
             for i in range(100)]
    # Test batch processing
    start_time = time.time()
    async with aiohttp.ClientSession() as session:
        async with session.post(EMBED_URL, json={"inputs": texts}) as response:
            result = await response.json()
    batch_time = time.time() - start_time
    print("Batch processing:")
    print(f"  Texts: {len(texts)}")
    print(f"  Time: {batch_time:.2f}s")
    print(f"  Throughput: {len(texts)/batch_time:.1f} texts/sec")
    # Test individual requests
    start_time = time.time()
    async with aiohttp.ClientSession() as session:
        # Test subset for fairness
        tasks = [embed_one(session, text) for text in texts[:10]]
        responses = await asyncio.gather(*tasks)
    individual_time = time.time() - start_time
    print("\nIndividual requests:")
    print("  Texts: 10")
    print(f"  Time: {individual_time:.2f}s")
    print(f"  Throughput: {10/individual_time:.1f} texts/sec")
    # Per-text speedup: (individual_time / 10) / (batch_time / 100)
    print(f"  Speedup: {(individual_time*10)/(batch_time):.1f}x")

asyncio.run(test_batch_performance())
Integration Examples
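Before wiring up a vector database, the ranking step itself can be sketched in memory. The example below scores documents against a query by cosine similarity and returns the top matches. The two-dimensional vectors are toy values standing in for embeddings from your chute, and `top_k` is a hypothetical helper, not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, document_vectors, k=3):
    """Return (score, document_index) pairs for the k most similar documents."""
    scored = [
        (cosine_similarity(query_vector, vector), index)
        for index, vector in enumerate(document_vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings
document_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vector = [1.0, 0.1]

results = top_k(query_vector, document_vectors, k=2)
for score, index in results:
    print(f"doc {index}: {score:.3f}")
```

A linear scan like this is fine for a few thousand documents; a vector database becomes worthwhile once the collection outgrows that.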
Semantic Search with Vector Database
import requests
import numpy as np
from pinecone import Pinecone
# Initialize vector database
pc = Pinecone(api_key="your-api-key")
index = pc.Index("semantic-search")
def embed_text(text):
    """Generate embedding for text."""
    response = requests.post(
        "https://myuser-tei-chute.chutes.ai/v1/embeddings",
        json={
            "model": "sentence-transformers/all-mpnet-base-v2",
            "input": text
        },
        timeout=30
    )
    response.raise_for_status()  # Fail fast on HTTP errors
    return response.json()["data"][0]["embedding"]

def index_documents(documents):
    """Index documents for search."""
    vectors = []
    for i, doc in enumerate(documents):
        embedding = embed_text(doc)
        vectors.append({
            "id": str(i),
            "values": embedding,
            "metadata": {"text": doc}
        })
    index.upsert(vectors)

def search_documents(query, top_k=5):
    """Search for similar documents."""
    query_embedding = embed_text(query)
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )
    return [(match.score, match.metadata["text"])
            for match in results.matches]

# Example usage
documents = [
    "Python is a programming language",
    "Machine learning uses algorithms",
    "The weather is sunny today",
    "Neural networks are inspired by the brain"
]
index_documents(documents)
results = search_documents("What is artificial intelligence?")
for score, text in results:
    print(f"Score: {score:.3f} - {text}")
Document Clustering
import requests
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
def embed_documents(documents):
    """Generate embeddings for multiple documents."""
    response = requests.post(
        "https://myuser-tei-chute.chutes.ai/v1/embeddings",
        json={
            "model": "sentence-transformers/all-mpnet-base-v2",
            "input": documents
        },
        timeout=30
    )
    response.raise_for_status()  # Fail fast on HTTP errors
    return [item["embedding"] for item in response.json()["data"]]

def cluster_documents(documents, n_clusters=3):
    """Cluster documents based on embeddings."""
    # Generate embeddings
    embeddings = embed_documents(documents)
    embeddings_array = np.array(embeddings)
    # Perform clustering (n_init set explicitly for consistent results across scikit-learn versions)
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    clusters = kmeans.fit_predict(embeddings_array)
    # Visualize with PCA
    pca = PCA(n_components=2)
    embeddings_2d = pca.fit_transform(embeddings_array)
    plt.figure(figsize=(10, 8))
    scatter = plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1],
                          c=clusters, cmap='viridis')
    plt.colorbar(scatter)
    plt.title('Document Clustering')
    plt.xlabel('PCA Component 1')
    plt.ylabel('PCA Component 2')
    # Add document labels
    for i, doc in enumerate(documents):
        plt.annotate(f"Doc {i}", (embeddings_2d[i, 0], embeddings_2d[i, 1]))
    plt.show()
    return clusters

# Example usage
documents = [
    "Python programming language tutorial",
    "JavaScript web development guide",
    "Machine learning with neural networks",
    "Deep learning and artificial intelligence",
    "HTML and CSS for beginners",
    "React framework for web apps",
    "Natural language processing techniques",
    "Computer vision and image recognition"
]
clusters = cluster_documents(documents)
# Group documents by cluster
for cluster_id in range(max(clusters) + 1):
    print(f"\nCluster {cluster_id}:")
    for i, doc in enumerate(documents):
        if clusters[i] == cluster_id:
            print(f"  - {doc}")
Troubleshooting
Common Issues
Slow embedding generation?
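- Increase the batch limits (max_batch_tokens, max_batch_requests) for better throughput
- Use a smaller/faster model
- Select hardware with more GPU memory
Out of memory errors?
- Reduce max_batch_tokens
- Decrease max_batch_requests
- Use a smaller model
- Raise min_vram_gb_per_gpu in the NodeSelector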
Poor embedding quality?
- Use a larger, more sophisticated model
- Ensure proper text preprocessing
- Check if the model matches your domain
High latency?
- Reduce batch sizes for faster response
- Use a smaller/faster model
- Consider multiple smaller instances
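One way to act on the batch-size advice above from the client side is to split a large input list into smaller requests. The sketch below shows the chunking logic only; `embed_in_chunks` and `fake_embed` are hypothetical names, and `fake_embed` stands in for a real call to your chute so the example runs offline.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_chunks(texts, embed_fn, chunk_size=32):
    """Call embed_fn once per chunk and concatenate the results in input order."""
    embeddings = []
    for chunk in chunked(texts, chunk_size):
        embeddings.extend(embed_fn(chunk))
    return embeddings


def fake_embed(chunk):
    """Stand-in for a request to the chute: one 1-dimensional vector per text."""
    return [[float(len(text))] for text in chunk]


texts = [f"document {i}" for i in range(70)]
vectors = embed_in_chunks(texts, fake_embed, chunk_size=32)
print(len(vectors), [len(c) for c in chunked(texts, 32)])
```

In practice `embed_fn` would wrap a POST to the embeddings endpoint. Smaller chunks lower peak memory and per-request latency at the cost of more round trips.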
Performance Monitoring
import requests
import time
def monitor_performance():
    """Monitor TEI chute performance."""
    # Test different batch sizes
    batch_sizes = [1, 5, 10, 25, 50]
    test_text = "This is a test document for performance monitoring."
    for batch_size in batch_sizes:
        texts = [test_text] * batch_size
        start_time = time.time()
        response = requests.post(
            "https://myuser-tei-chute.chutes.ai/embed",
            json={"inputs": texts}
        )
        end_time = time.time()
        if response.status_code == 200:
            throughput = batch_size / (end_time - start_time)
            print(f"Batch size {batch_size}: {throughput:.1f} texts/sec")
        else:
            print(f"Batch size {batch_size}: Error {response.status_code}")

monitor_performance()
Best Practices
1. Model Selection
# For general text similarity
model_name = "sentence-transformers/all-mpnet-base-v2"
# For search applications
model_name = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
# For code similarity
model_name = "microsoft/codebert-base"
# For multilingual applications
model_name = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
2. Batch Size Tuning
# For real-time applications (low latency)
max_batch_tokens = 4096
max_batch_requests = 32
# For bulk processing (high throughput)
max_batch_tokens = 32768
max_batch_requests = 512
# For balanced performance
max_batch_tokens = 16384
max_batch_requests = 256
3. Text Preprocessing
def preprocess_text(text):
    """Preprocess text for better embeddings."""
    # Collapse runs of whitespace into single spaces
    text = " ".join(text.split())
    # Cap length by characters; models limit input by tokens,
    # so adjust the cap to the model's max length
    if len(text) > 5000:
        text = text[:5000]
    return text.strip()

# Apply preprocessing before embedding
raw_texts = ["  Example   document \n with uneven   spacing  "]  # Placeholder input
texts = [preprocess_text(text) for text in raw_texts]
4. Error Handling
import requests
from tenacity import retry, stop_after_attempt, wait_exponential
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
def generate_embeddings(texts):
    """Generate embeddings with retry logic."""
    try:
        response = requests.post(
            "https://myuser-tei-chute.chutes.ai/v1/embeddings",
            json={
                "model": "sentence-transformers/all-mpnet-base-v2",
                "input": texts
            },
            timeout=30
        )
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        raise  # Re-raise so tenacity can retry
Next Steps
- VLLM Template - High-performance language model serving
- Diffusion Template - Image generation capabilities
- Vector Databases Guide - Integration with vector stores
- Semantic Search Example - Complete search application