# AudioDojo

> One chute, 12 models, 13 endpoints — covering text-to-speech, voice cloning, voice design, transcription, denoising, source separation, speaker verification, VAD, and language detection. Including Kokoro-82M, Qwen3-TTS 1.7B, Whisper large-v3-turbo, NVIDIA C...

This file is intended for AI coding agents wiring an application to this specific Chutes model.

## Important URLs

- Model page: https://chutes.ai/app/chute/vonkaiser-audiodojo
- Model llms.txt: https://chutes.ai/app/chute/vonkaiser-audiodojo/llms.txt
- Global Chutes llms.txt: https://chutes.ai/llms.txt
- Full Chutes docs export: https://chutes.ai/llms-full.txt
- OpenAPI schema: https://api.chutes.ai/openapi.json

## Model Identifiers

- Name: `AudioDojo`
- Chute ID: `d6c66e0a-3e04-595b-825c-f3a65e50b2c7`
- Slug: `vonkaiser-audiodojo`
- Owner: `vonkaiser`

## Authentication

- Use `Authorization: Bearer $CHUTES_API_KEY` for inference calls.
- Send JSON request bodies with `Content-Type: application/json` unless the endpoint documentation says otherwise.
- The request body is FLAT — send the request fields at the top level (no `input_args`/`args` wrapper).
- This model is served on its own host (`https://vonkaiser-audiodojo.chutes.ai`). Call the endpoint paths listed below directly.
- For base64 media fields, send raw base64 strings in API requests. The web playground may use temporary Blob upload references internally for large files before proxying to Chutes.
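
GNU `base64` wraps its output at 76 columns by default, and a wrapped string would break the JSON body. A minimal sketch of a helper that emits one unbroken base64 string for the `*_b64` fields; `b64_file` is a hypothetical name, and the code assumes `base64` and `tr` are available on the machine.

```bash
# Hypothetical helper: print FILE as a single-line base64 string.
# Reading from stdin keeps the call portable across GNU and BSD/macOS base64.
b64_file() {
  base64 < "$1" | tr -d '\n'
}
```

For example, `b64_file reference.wav` prints a value that can be placed directly into `ref_audio_b64` or `audio_b64`.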

## Playground Notes

- Text-to-speech endpoints return audio directly; speech-to-text and analysis endpoints return JSON.
- Reference-audio and STT inputs accept common audio formats as base64; large uploads use temporary Blob storage automatically.
- For cloning, keep the reference sample clean and no longer than 6 seconds; longer clips are rejected unless `trim_sample` is enabled.
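
Because some endpoints return audio and others return JSON, a client has to choose the output file type per endpoint. A small sketch of that mapping, based on the "Output content type" listed for each endpoint in this file; `response_ext` is a hypothetical helper name.

```bash
# Hypothetical helper: map an endpoint path to the file extension of its
# response. Audio-producing paths are matched first; the remaining
# /stt/* and /audio/* analysis paths return JSON.
response_ext() {
  case "$1" in
    /speak|/tts/*|/audio/denoise|/audio/separate) echo wav ;;
    /stt/*|/audio/*) echo json ;;
    *) return 1 ;;
  esac
}
```

A caller could then write the response to `"output.$(response_ext /audio/vad)"` rather than hard-coding `.wav`.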

## Endpoints

### 1. POST /speak

- Base URL: `https://vonkaiser-audiodojo.chutes.ai`
- Output content type: `audio/wav`
- Playground note: Type your text, tweak the voice settings if you like, then press Run to hear the result.

Request fields:

- `text` (string, required)
- `speed` (number, optional) - Voice output speed. Default: 1.
- `voice` (string, optional) - Voice selection for text-to-speech. Default: "af_heart". Options: af_heart, af_alloy, af_aoede, af_bella, af_jessica, af_kore, af_nicole, af_nova, af_river, af_sarah, af_sky, am_adam, am_echo, am_eric, am_fenrir, am_liam, am_michael, am_onyx, am_puck, am_santa, bf_alice, bf_emma, bf_isabella, bf_lily, bm_daniel, bm_fable, bm_george, bm_lewis, ef_dora, em_alex, em_santa, ff_siwis, hf_alpha, hf_beta, hm_omega, hm_psi, if_sara, im_nicola, jf_alpha, jf_gongitsune, jf_nezumi, jf_tebukuro, jm_kumo, pf_dora, pm_alex, pm_santa, zf_xiaobei, zf_xiaoni, zf_xiaoxiao, zf_xiaoyi, zm_yunjian, zm_yunxi, zm_yunxia, zm_yunyang.

Example call:

```bash
curl -X POST "https://vonkaiser-audiodojo.chutes.ai/speak" \
  -H "Authorization: Bearer $CHUTES_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "text": "Your text here",
  "speed": 1,
  "voice": "af_heart"
}' \
  --output output.wav
```

Playground presets:

- Warm Kokoro narration

### 2. POST /tts/kokoro

- Base URL: `https://vonkaiser-audiodojo.chutes.ai`
- Output content type: `audio/wav`
- Playground note: Type your text, tweak the voice settings if you like, then press Run to hear the result.

Request fields:

- `text` (string, required)
- `speed` (number, optional) - Playback speed multiplier (0.1–3.0). Default: 1.
- `voice` (string, optional) - Built-in voice ID. See KokoroVoice enum for the full list. Default: "af_heart". Options: af_heart, af_alloy, af_aoede, af_bella, af_jessica, af_kore, af_nicole, af_nova, af_river, af_sarah, af_sky, am_adam, am_echo, am_eric, am_fenrir, am_liam, am_michael, am_onyx, am_puck, am_santa, bf_alice, bf_emma, bf_isabella, bf_lily, bm_daniel, bm_fable, bm_george, bm_lewis, ef_dora, em_alex, em_santa, ff_siwis, hf_alpha, hf_beta, hm_omega, hm_psi, if_sara, im_nicola, jf_alpha, jf_gongitsune, jf_nezumi, jf_tebukuro, jm_kumo, pf_dora, pm_alex, pm_santa, zf_xiaobei, zf_xiaoni, zf_xiaoxiao, zf_xiaoyi, zm_yunjian, zm_yunxi, zm_yunxia, zm_yunyang.

Example call:

```bash
curl -X POST "https://vonkaiser-audiodojo.chutes.ai/tts/kokoro" \
  -H "Authorization: Bearer $CHUTES_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "text": "Your text here",
  "speed": 1,
  "voice": "af_heart"
}' \
  --output output.wav
```
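
The `speed` field is documented with a range of 0.1 to 3.0, so a client can reject out-of-range values before sending the request. A minimal sketch, assuming `awk` is available; `valid_speed` is a hypothetical helper, and how the server itself treats out-of-range values is not stated here.

```bash
# Hypothetical helper: succeed only if the argument is a plain decimal
# number within the documented 0.1 to 3.0 range for `speed`.
valid_speed() {
  awk -v s="$1" 'BEGIN {
    ok = (s ~ /^[0-9]+(\.[0-9]+)?$/ && s + 0 >= 0.1 && s + 0 <= 3.0)
    exit !ok
  }'
}
```

Usage: `valid_speed "$speed" || echo "speed must be between 0.1 and 3.0" >&2`.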

Playground presets:

- Warm Kokoro narration

### 3. POST /tts/qwen3clone

- Base URL: `https://vonkaiser-audiodojo.chutes.ai`
- Output content type: `audio/wav`
- Playground note: Upload a short, clean voice sample, type the new text, then press Run.

Request fields:

- `text` (string, required)
- `language` (string, optional) - Target language name (e.g. 'English', 'Chinese'). Default: "English".
- `ref_text` (string, optional) - Transcript of the reference audio. If omitted, Whisper auto-transcribes it. Default: "".
- `trim_sample` (boolean, optional) - If `true`, automatically trim the reference audio to its first 6s. If `false` and the audio exceeds 6s, the request is rejected with an error. Default: false.
- `auto_denoise` (boolean, optional) - If `true`, run MetricGAN+ speech enhancement on the reference audio before cloning. Improves clone quality when the reference recording is noisy. Default: false.
- `high_fidelity` (boolean, optional) - Controls the voice cloning mode. `false` (default): speaker-embedding-only mode (xvec). The model extracts a speaker embedding from the reference audio and conditions generation purely on that vector, which gives clean output with no risk of reference audio bleed-through. Recommended for most use cases. `true`: in-context learning (ICL) mode. The reference audio is prepended as acoustic context so the model can mimic fine-grained prosody and timbre directly. Higher voice similarity is possible, but the last word or syllable of the reference clip may bleed into the start of the generated output if the reference does not end on clean silence. Default: false.
- `ref_audio_b64` (string, required) - Base64-encoded reference audio for voice cloning (max 6s).

Example call:

```bash
# NOTE: replace the ref_audio_b64 placeholder(s) with real base64 data first.
curl -X POST "https://vonkaiser-audiodojo.chutes.ai/tts/qwen3clone" \
  -H "Authorization: Bearer $CHUTES_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "text": "Your text here",
  "language": "English",
  "ref_text": "",
  "trim_sample": false,
  "auto_denoise": false,
  "high_fidelity": false,
  "ref_audio_b64": "<base64-encoded audio>"
}' \
  --output output.wav
```
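
A base64-encoded reference clip can be too long to pass inline on the command line, so it is often easier to write the request body to a file and send it with `--data-binary @file`. The following is a sketch under stated assumptions: `build_clone_payload` and the file names are hypothetical, the text is assumed to contain no characters that need JSON escaping, and `trim_sample` is set to `true` so that clips longer than 6s are trimmed rather than rejected.

```bash
# Hypothetical helper: write a /tts/qwen3clone request body to OUT_JSON.
# Assumes TEXT has no quotes, backslashes, or newlines; use a JSON tool
# for arbitrary text. Base64 output never needs JSON escaping.
build_clone_payload() {
  local ref_audio="$1" text="$2" out_json="$3"
  {
    printf '{"text": "%s", "language": "English", "trim_sample": true, "ref_audio_b64": "' "$text"
    base64 < "$ref_audio" | tr -d '\n'
    printf '"}\n'
  } > "$out_json"
}

# Usage (not executed here):
#   build_clone_payload reference.wav "Your text here" payload.json
#   curl -X POST "https://vonkaiser-audiodojo.chutes.ai/tts/qwen3clone" \
#     -H "Authorization: Bearer $CHUTES_API_KEY" \
#     -H "Content-Type: application/json" \
#     --data-binary @payload.json \
#     --output output.wav
```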

### 4. POST /tts/qwen3custom

- Base URL: `https://vonkaiser-audiodojo.chutes.ai`
- Output content type: `audio/wav`
- Playground note: Type your text, tweak the voice settings if you like, then press Run to hear the result.

Request fields:

- `text` (string, required)
- `speaker` (string, optional) - Preset speaker identity: Vivian, Serena, Uncle Fu, Dylan, Eric, Ryan, Aiden, Ono Anna, Sohee. Default: null, in which case Ryan is used.
- `instruct` (string, optional) - Free-text delivery style, e.g. 'Very happy', 'calm husky', 'whispering'. Leave blank for neutral. Default: null.
- `language` (string, optional) - Target language name. Default: "English".

Example call:

```bash
curl -X POST "https://vonkaiser-audiodojo.chutes.ai/tts/qwen3custom" \
  -H "Authorization: Bearer $CHUTES_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "text": "Your text here",
  "language": "English"
}' \
  --output output.wav
```

Playground presets:

- Expressive custom speaker

### 5. POST /tts/qwen3design

- Base URL: `https://vonkaiser-audiodojo.chutes.ai`
- Output content type: `audio/wav`
- Playground note: Describe the voice you want, type the text to speak, then press Run to hear it.

Request fields:

- `text` (string, required)
- `instruct` (string, required) - Voice description, e.g. 'A warm, confident male narrator with a slight British accent'.
- `language` (string, optional) - Target language name. Default: "English".

Example call:

```bash
curl -X POST "https://vonkaiser-audiodojo.chutes.ai/tts/qwen3design" \
  -H "Authorization: Bearer $CHUTES_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "text": "Your text here",
  "instruct": "A warm, confident male narrator with a slight British accent",
  "language": "English"
}' \
  --output output.wav
```

Playground presets:

- Designed voice

### 6. POST /stt/whisper

- Base URL: `https://vonkaiser-audiodojo.chutes.ai`
- Output content type: `application/json`
- Playground note: Upload or record the audio to transcribe, then press Run to get the transcript.

Request fields:

- `language` (string, optional) - BCP-47 language code, e.g. 'en', 'fr', 'zh'. Auto-detected if omitted. Default: null.
- `audio_b64` (string, required) - Base64-encoded audio file (any common format).
- `return_timestamps` (boolean, optional) - Include word-level timestamp chunks in the response. Default: true.

Example call:

```bash
# NOTE: replace the audio_b64 placeholder(s) with real base64 data first.
curl -X POST "https://vonkaiser-audiodojo.chutes.ai/stt/whisper" \
  -H "Authorization: Bearer $CHUTES_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "audio_b64": "<base64-encoded audio>",
  "return_timestamps": true
}'
```

Playground presets:

- English transcription

### 7. POST /stt/canary

- Base URL: `https://vonkaiser-audiodojo.chutes.ai`
- Output content type: `application/json`
- Playground note: Upload or record the audio to transcribe, then press Run to get the transcript.

Request fields:

- `audio_b64` (string, required) - Base64-encoded audio file.
- `max_new_tokens` (integer, optional) - Maximum tokens to generate for the transcript. Default: 256.

Example call:

```bash
# NOTE: replace the audio_b64 placeholder(s) with real base64 data first.
curl -X POST "https://vonkaiser-audiodojo.chutes.ai/stt/canary" \
  -H "Authorization: Bearer $CHUTES_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "audio_b64": "<base64-encoded audio>",
  "max_new_tokens": 256
}'
```

Playground presets:

- English transcription

### 8. POST /stt/parakeet

- Base URL: `https://vonkaiser-audiodojo.chutes.ai`
- Output content type: `application/json`
- Playground note: Upload or record the audio to transcribe, then press Run to get the transcript.

Request fields:

- `audio_b64` (string, required) - Base64-encoded audio file.
- `timestamps` (boolean, optional) - Include word and segment timestamps in the response. Default: true.

Example call:

```bash
# NOTE: replace the audio_b64 placeholder(s) with real base64 data first.
curl -X POST "https://vonkaiser-audiodojo.chutes.ai/stt/parakeet" \
  -H "Authorization: Bearer $CHUTES_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "audio_b64": "<base64-encoded audio>",
  "timestamps": true
}'
```

Playground presets:

- English transcription

### 9. POST /audio/vad

- Base URL: `https://vonkaiser-audiodojo.chutes.ai`
- Output content type: `application/json`
- Playground note: Upload the clip to analyze; results come back as JSON in the output panel.

Request fields:

- `audio_b64` (string, required) - Base64-encoded audio file (any common format).

Example call:

```bash
# NOTE: replace the audio_b64 placeholder(s) with real base64 data first.
curl -X POST "https://vonkaiser-audiodojo.chutes.ai/audio/vad" \
  -H "Authorization: Bearer $CHUTES_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "audio_b64": "<base64-encoded audio>"
}'
```

### 10. POST /audio/diarize

- Base URL: `https://vonkaiser-audiodojo.chutes.ai`
- Output content type: `application/json`
- Playground note: Upload the clip to analyze; results come back as JSON in the output panel.

Request fields:

- `audio_b64` (string, required) - Base64-encoded audio file (any common format).

Example call:

```bash
# NOTE: replace the audio_b64 placeholder(s) with real base64 data first.
curl -X POST "https://vonkaiser-audiodojo.chutes.ai/audio/diarize" \
  -H "Authorization: Bearer $CHUTES_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "audio_b64": "<base64-encoded audio>"
}'
```

### 11. POST /audio/verifyspeaker

- Base URL: `https://vonkaiser-audiodojo.chutes.ai`
- Output content type: `application/json`
- Playground note: Upload the clip to analyze; results come back as JSON in the output panel.

Request fields:

- `audio_a_b64` (string, required) - Base64-encoded first audio clip.
- `audio_b_b64` (string, required) - Base64-encoded second audio clip.

Example call:

```bash
# NOTE: replace the audio_a_b64, audio_b_b64 placeholder(s) with real base64 data first.
curl -X POST "https://vonkaiser-audiodojo.chutes.ai/audio/verifyspeaker" \
  -H "Authorization: Bearer $CHUTES_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "audio_a_b64": "<base64-encoded audio>",
  "audio_b_b64": "<base64-encoded audio>"
}'
```

### 12. POST /audio/denoise

- Base URL: `https://vonkaiser-audiodojo.chutes.ai`
- Output content type: `audio/wav`
- Playground note: Upload the recording to clean up; the processed audio appears in the output panel.

Request fields:

- `audio_b64` (string, required) - Base64-encoded audio file (any common format).

Example call:

```bash
# NOTE: replace the audio_b64 placeholder(s) with real base64 data first.
curl -X POST "https://vonkaiser-audiodojo.chutes.ai/audio/denoise" \
  -H "Authorization: Bearer $CHUTES_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "audio_b64": "<base64-encoded audio>"
}' \
  --output output.wav
```

### 13. POST /audio/separate

- Base URL: `https://vonkaiser-audiodojo.chutes.ai`
- Output content type: `audio/wav`
- Playground note: Upload the recording to clean up; the processed audio appears in the output panel.

Request fields:

- `audio_b64` (string, required) - Base64-encoded audio file (any common format).

Example call:

```bash
# NOTE: replace the audio_b64 placeholder(s) with real base64 data first.
curl -X POST "https://vonkaiser-audiodojo.chutes.ai/audio/separate" \
  -H "Authorization: Bearer $CHUTES_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "audio_b64": "<base64-encoded audio>"
}' \
  --output output.wav
```

### 14. POST /audio/detectlanguage

- Base URL: `https://vonkaiser-audiodojo.chutes.ai`
- Output content type: `application/json`
- Playground note: Upload the clip to analyze; results come back as JSON in the output panel.

Request fields:

- `audio_b64` (string, required) - Base64-encoded audio file (any common format).

Example call:

```bash
# NOTE: replace the audio_b64 placeholder(s) with real base64 data first.
curl -X POST "https://vonkaiser-audiodojo.chutes.ai/audio/detectlanguage" \
  -H "Authorization: Bearer $CHUTES_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "audio_b64": "<base64-encoded audio>"
}'
```

## Agent Integration Checklist

- Pick the endpoint path that matches the desired task.
- Set `CHUTES_API_KEY` in the server-side environment only.
- Validate required fields before sending requests.
- For media models, keep file upload, base64 conversion, and output preview/download handling explicit in the app UI.
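
The checklist above can be sketched as a small wrapper that validates local inputs before any network call and checks the HTTP status afterwards. `chutes_post` is a hypothetical name, and treating only HTTP 200 as success is an assumption; error responses are still written to the output file, so the status check prevents an error body from being mistaken for audio.

```bash
# Hypothetical wrapper: POST a JSON payload file to an AudioDojo endpoint.
# Returns 2 for bad local input (no request is sent), 1 for a failed request.
chutes_post() {
  local endpoint="$1" payload="$2" out="$3" status
  if [ -z "${CHUTES_API_KEY:-}" ]; then
    echo "CHUTES_API_KEY is not set" >&2
    return 2
  fi
  if [ ! -s "$payload" ]; then
    echo "payload file missing or empty: $payload" >&2
    return 2
  fi
  # -o saves the body; -w prints only the status code to stdout.
  status=$(curl -sS -X POST "https://vonkaiser-audiodojo.chutes.ai$endpoint" \
    -H "Authorization: Bearer $CHUTES_API_KEY" \
    -H "Content-Type: application/json" \
    --data-binary @"$payload" \
    -o "$out" -w '%{http_code}') || return 1
  if [ "$status" != "200" ]; then
    echo "request to $endpoint failed with HTTP $status; body saved to $out" >&2
    return 1
  fi
}
```

Usage: `chutes_post /tts/kokoro payload.json output.wav`, where `payload.json` holds the flat request body for that endpoint.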