Sogni Voice API Documentation

REST API for audio transcription and text-to-speech synthesis

Base URL: https://voice.sogni.ai

Authentication (Optional)

API key authentication is optional and can be enabled for production deployments. When it is enabled, every API endpoint (Transcription, Kokoro TTS, Pocket TTS, and Qwen3 TTS) requires an API key. The only exceptions are the health check (GET /health) and the auth status check (GET /auth/status).

Check Auth Status

GET /auth/status

Check if authentication is enabled on this server.

Response

{
  "authEnabled": true
}
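A client can call this endpoint once at startup to find out whether it needs a key. The following is a minimal sketch using only the Python standard library. The function names are illustrative and not part of the API, and treating a missing authEnabled field as false is an assumption.

```python
import json
import urllib.request

BASE_URL = "https://voice.sogni.ai"


def parse_auth_status(body: bytes) -> bool:
    """Return True if an /auth/status response body reports that auth is enabled."""
    payload = json.loads(body)
    # Assumption: a missing field is treated as "authentication not enabled".
    return bool(payload.get("authEnabled", False))


def fetch_auth_status(base_url: str = BASE_URL, timeout: float = 10.0) -> bool:
    """Call GET /auth/status and report whether an API key is required."""
    with urllib.request.urlopen(f"{base_url}/auth/status", timeout=timeout) as resp:
        return parse_auth_status(resp.read())
```

Calling fetch_auth_status() performs the network request; parse_auth_status() can be used on its own with a body that was fetched some other way.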

Authenticating Requests

When authentication is enabled, include your API key using one of these methods:

Option 1: X-API-Key Header (recommended)

curl -X POST https://voice.sogni.ai/tts \
  -H "X-API-Key: your_api_key_here" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world"}'

Option 2: Authorization Bearer Header

curl -X POST https://voice.sogni.ai/tts \
  -H "Authorization: Bearer your_api_key_here" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world"}'
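Both options carry the same key, so client code can hide the choice behind one helper. The sketch below shows one possible shape; auth_headers and its scheme argument are hypothetical names chosen for illustration.

```python
def auth_headers(api_key: str, scheme: str = "x-api-key") -> dict:
    """Build the authentication header for one of the two documented options."""
    if not api_key:
        raise ValueError("api_key must not be empty")
    if scheme == "x-api-key":
        # Option 1: X-API-Key header
        return {"X-API-Key": api_key}
    if scheme == "bearer":
        # Option 2: Authorization Bearer header
        return {"Authorization": f"Bearer {api_key}"}
    raise ValueError(f"unknown scheme: {scheme!r}")
```

The returned dictionary can be merged with other headers such as Content-Type before the request is sent.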

JavaScript Example

const API_KEY = 'your_api_key_here';

const response = await fetch('https://voice.sogni.ai/tts', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'X-API-Key': API_KEY
  },
  body: JSON.stringify({
    text: 'Hello, world!'
  })
});

Python Example

import requests

API_KEY = 'your_api_key_here'

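response = requests.post(
    'https://voice.sogni.ai/tts',
    headers={'X-API-Key': API_KEY},
    json={'text': 'Hello, world!'}
)
# Raise on 401 and other HTTP errors instead of saving an error body as audio
response.raise_for_status()

with open('output.wav', 'wb') as f:
    f.write(response.content)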

Public Endpoints

These endpoints are always accessible without authentication:

GET /health - Health check
GET /auth/status - Check whether authentication is enabled

Error Response

When authentication fails, the API returns a 401 Unauthorized response:

{
  "statusCode": 401,
  "error": "Unauthorized",
  "message": "Missing API key. Provide X-API-Key header or Authorization: Bearer <key>"
}
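Client code usually wants to surface the message field rather than the raw body. A minimal sketch, assuming the 401 body has the JSON shape shown above; auth_error_message is a hypothetical helper name, and the "Unauthorized" fallback for unparseable bodies is an assumption.

```python
import json
from typing import Optional


def auth_error_message(status_code: int, body: bytes) -> Optional[str]:
    """Return the server's message for a 401 response, or None for other statuses."""
    if status_code != 401:
        return None
    try:
        return json.loads(body).get("message", "Unauthorized")
    except (ValueError, AttributeError):
        # Body was not JSON, or was JSON but not an object: use a generic message.
        return "Unauthorized"
```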

1. STT (Speech-to-Text) - Parakeet

Upload an audio file and receive a text transcript.

POST /transcribe

Request

Send audio as multipart/form-data.

Parameter | Type | Description
file * | File | Audio file to transcribe (supports common formats: mp3, wav, webm, m4a, etc.)
timestamps (optional) | string | Set to "true" to include sentence-level timings with start/end times for each segment
wordTimestamps (optional) | string | Set to "true" to include word-level timings with start/end times for each word (overrides timestamps)
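The two timing flags are form fields whose value is the string "true", and wordTimestamps takes precedence. The sketch below builds the extra form fields accordingly; transcribe_fields is a hypothetical helper, not part of the API.

```python
def transcribe_fields(timestamps: bool = False, word_timestamps: bool = False) -> dict:
    """Build the optional multipart form fields for POST /transcribe."""
    if word_timestamps:
        # wordTimestamps overrides timestamps, so sending it alone is enough.
        return {"wordTimestamps": "true"}
    if timestamps:
        return {"timestamps": "true"}
    return {}
```

With the requests library, the result can be passed as the data argument next to files={'file': f}.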

Response (default)

{
  "success": true,
  "transcript": "The transcribed text appears here.",
  "filename": "recording.mp3"
}

Response (with sentence timestamps)

When timestamps=true, the response includes sentence-level timing data for subtitle generation:

{
  "success": true,
  "timestamps": [
    { "start": 0.00, "end": 2.34, "text": "Hello and welcome" },
    { "start": 2.34, "end": 5.67, "text": "to our presentation today" },
    { "start": 5.67, "end": 8.90, "text": "we will cover several topics" }
  ]
}
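Since this response is aimed at subtitle generation, one common next step is writing an SRT file. The following is a sketch of that conversion, assuming start and end are seconds as the example values suggest; srt_time and to_srt are illustrative names.

```python
def srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list) -> str:
    """Turn a list of {start, end, text} objects into SRT subtitle text."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        timing = f"{srt_time(seg['start'])} --> {srt_time(seg['end'])}"
        cues.append(f"{index}\n{timing}\n{seg['text']}\n")
    # Cues are separated by a blank line.
    return "\n".join(cues)
```

Passing the "timestamps" array from the response to to_srt() yields text that can be saved with an .srt extension.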

Response (with word timestamps)

When wordTimestamps=true, the response includes word-level timing data for precise subtitle synchronization:

{
  "success": true,
  "timestamps": [
    { "start": 0.00, "end": 0.48, "text": "Hello" },
    { "start": 0.48, "end": 0.72, "text": "and" },
    { "start": 0.72, "end": 1.20, "text": "welcome" },
    { "start": 1.20, "end": 1.44, "text": "to" },
    { "start": 1.44, "end": 1.68, "text": "our" },
    { "start": 1.68, "end": 2.34, "text": "presentation" }
  ]
}
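Word-level entries are usually too short to display one at a time, so a client may merge them into small cues. The sketch below groups a fixed number of words per cue; group_words and max_words are hypothetical names, and splitting by word count is just one simple policy.

```python
def group_words(words: list, max_words: int = 3) -> list:
    """Merge word-level {start, end, text} objects into cues of up to max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        cues.append({
            "start": chunk[0]["start"],  # start of the first word in the cue
            "end": chunk[-1]["end"],     # end of the last word in the cue
            "text": " ".join(w["text"] for w in chunk),
        })
    return cues
```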

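Each timestamp object contains:

start - Start time of the segment, in seconds
end - End time of the segment, in seconds
text - Transcribed text of the segment (a sentence or a single word)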

cURL Examples

# Basic transcription
curl -X POST https://voice.sogni.ai/transcribe \
  -F "file=@/path/to/audio.mp3"

# With sentence-level timings
curl -X POST https://voice.sogni.ai/transcribe \
  -F "file=@/path/to/audio.mp3" \
  -F "timestamps=true"

# With word-level timings
curl -X POST https://voice.sogni.ai/transcribe \
  -F "file=@/path/to/audio.mp3" \
  -F "wordTimestamps=true"

JavaScript Example

// audioFile is a File or Blob, for example from an <input type="file"> element
const formData = new FormData();
formData.append('file', audioFile);

const response = await fetch('https://voice.sogni.ai/transcribe', {
  method: 'POST',
  body: formData
});

const data = await response.json();
console.log(data.transcript);

Python Example

import requests

with open('audio.mp3', 'rb') as f:
    response = requests.post(
        'https://voice.sogni.ai/transcribe',
        files={'file': f}
    )

data = response.json()
print(data['transcript'])

2. TTS (Text-to-Speech) - Kokoro

Convert text to spoken audio using Kokoro TTS. Returns a WAV audio file by default.

POST /tts

Request

Send JSON with the text and optional parameters.

Parameter | Type | Default | Description
text * | string | - | Text to convert to speech (1-10,000 characters)
voice (optional) | string | af_heart | Voice to use (see /tts/voices for available voices)
speed (optional) | number | 1.0 | Speech speed multiplier (0.5 to 2.0)
format (optional) | string | wav | Output format: "wav" or "opus" (returns an audio file), or "buffer" (returns base64 JSON)
timestamps (optional) | boolean | false | Include word-level timestamps for subtitle generation (forces JSON response)
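The limits in this table can be checked on the client before a request is sent. A minimal sketch follows; build_tts_payload is a hypothetical helper, and counting characters with len() is an assumption about how the server measures text length.

```python
ALLOWED_FORMATS = {"wav", "opus", "buffer"}


def build_tts_payload(text: str, voice: str = "af_heart", speed: float = 1.0,
                      fmt: str = "wav", timestamps: bool = False) -> dict:
    """Validate the documented limits and return the JSON body for POST /tts."""
    # Assumption: the 1-10,000 character limit matches Python's len().
    if not 1 <= len(text) <= 10_000:
        raise ValueError("text must be 1-10,000 characters")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {sorted(ALLOWED_FORMATS)}")
    return {"text": text, "voice": voice, "speed": speed,
            "format": fmt, "timestamps": timestamps}
```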

Response (format: wav)

Returns binary WAV audio data with Content-Type: audio/wav

Response (format: buffer)

{
  "success": true,
  "audio": "UklGRiQAAABXQVZFZm10IBAA...",  // base64 encoded WAV
  "voice": "af_heart",
  "speed": 1.0,
  "format": "wav"
}
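With format "buffer" the audio arrives as a base64 string inside JSON, so the client has to decode it before saving or playing it. A sketch of that step, assuming the response shape shown above; audio_from_buffer_response is an illustrative name.

```python
import base64


def audio_from_buffer_response(payload: dict) -> bytes:
    """Decode the base64 "audio" field of a format=buffer response into raw bytes."""
    if not payload.get("success"):
        raise ValueError("response does not report success")
    return base64.b64decode(payload["audio"])
```

The returned bytes can be written to a file opened in binary mode, for example output.wav.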

cURL Example

curl -X POST https://voice.sogni.ai/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, world!", "voice": "af_heart", "speed": 1.0}' \
  --output output.wav

JavaScript Example

const response = await fetch('https://voice.sogni.ai/tts', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    text: 'Hello, world!',
    voice: 'af_heart',
    speed: 1.0
  })
});

const audioBlob = await response.blob();
const audioUrl = URL.createObjectURL(audioBlob);
const audio = new Audio(audioUrl);
audio.play();

Python Example

import requests

response = requests.post(
    'https://voice.sogni.ai/tts',
    json={'text': 'Hello, world!', 'voice': 'af_heart', 'speed': 1.0}
)
# Raise on HTTP errors instead of saving an error body as audio
response.raise_for_status()

with open('output.wav', 'wb') as f:
    f.write(response.content)

List Available Voices

GET /tts/voices

Response

{
  "voices": ["af_heart", "af_bella", "am_adam", ...],
  "default": "af_heart"
}
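A client can use this listing to fall back to the server's default when a requested voice is not installed. A minimal sketch; choose_voice is a hypothetical helper, and the listing passed in is the parsed JSON response.

```python
def choose_voice(listing: dict, preferred: str) -> str:
    """Return the preferred voice if the server lists it, otherwise the default."""
    if preferred in listing.get("voices", []):
        return preferred
    return listing["default"]
```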

3. TTS (Text-to-Speech) - Pocket

Kyutai Pocket TTS is a lightweight, 100M-parameter TTS model that runs on CPU only, supports English only, has roughly 200 ms latency, and supports voice cloning. These endpoints are available only when POCKET_TTS_ENABLED=true is set on the server.

Generate Speech

POST /pocket-tts

Parameter | Type | Default | Description
text * | string | - | Text to convert to speech (1-10,000 characters)
voice (optional) | string | alba | Built-in voice: alba, marius, javert, jean, fantine, cosette, eponine, azelma
format (optional) | string | wav | Output format: "wav" or "opus" (returns an audio file), or "buffer" (returns base64 JSON)

cURL Example

curl -X POST https://voice.sogni.ai/pocket-tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, world!", "voice": "alba"}' \
  --output output.wav

List Voices

GET /pocket-tts/voices

Response

{
  "voices": ["alba", "marius", "javert", "jean", "fantine", "cosette", "eponine", "azelma"],
  "clones": ["my_clone"],
  "default": "alba"
}

Create Voice Clone

POST /pocket-tts/voices/clone

Upload a reference audio file to create a voice clone. No transcript needed.

Parameter | Type | Description
audio * | File | Reference audio file (WAV, MP3, OGG)
cloneId (optional) | string | Custom name for the clone (alphanumeric, underscore, hyphen)
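The cloneId character rule can be checked locally before uploading audio. The sketch below reads "alphanumeric" as ASCII letters and digits, which is an assumption, and it does not enforce any length limit because none is documented; is_valid_clone_id is an illustrative name.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits only.
CLONE_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_clone_id(clone_id: str) -> bool:
    """Check that a clone name uses only letters, digits, underscores, and hyphens."""
    return CLONE_ID_PATTERN.fullmatch(clone_id) is not None
```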

cURL Example

curl -X POST https://voice.sogni.ai/pocket-tts/voices/clone \
  -F "audio=@/path/to/reference.wav" \
  -F "cloneId=my_voice"

Response

{
  "success": true,
  "cloneId": "my_voice",
  "message": "Voice clone created successfully"
}

Generate with Cloned Voice

POST /pocket-tts/voices/clone/{cloneId}/generate

cURL Example

curl -X POST https://voice.sogni.ai/pocket-tts/voices/clone/my_voice/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello from my cloned voice!"}' \
  --output output.wav
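The {cloneId} placeholder in these paths is replaced with the clone's name. A small sketch of building such URLs safely in Python follows; clone_url is a hypothetical helper, and percent-encoding the name is a defensive choice rather than a documented requirement.

```python
from urllib.parse import quote

BASE_URL = "https://voice.sogni.ai"


def clone_url(clone_id: str, action: str = "", base_url: str = BASE_URL) -> str:
    """Build a /pocket-tts/voices/clone/{cloneId} URL, optionally with an action suffix."""
    # Percent-encode the name so unexpected characters cannot alter the path.
    url = f"{base_url}/pocket-tts/voices/clone/{quote(clone_id, safe='')}"
    return f"{url}/{action}" if action else url
```

For example, clone_url("my_voice", "generate") gives the URL used in the cURL example above.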

Delete Voice Clone

DELETE /pocket-tts/voices/clone/{cloneId}

cURL Example

curl -X DELETE https://voice.sogni.ai/pocket-tts/voices/clone/my_voice

Download Voice Clone

GET /pocket-tts/voices/clone/{cloneId}/download

Download a voice clone as a ZIP file containing the reference audio and metadata. Useful for backup or transferring clones between servers.

Response

Returns a ZIP file (Content-Type: application/zip) containing the reference audio and the clone's metadata.

cURL Example

curl https://voice.sogni.ai/pocket-tts/voices/clone/my_voice/download \
  --output my_voice.zip

JavaScript Example

const response = await fetch(
  'https://voice.sogni.ai/pocket-tts/voices/clone/my_voice/download'
);

const blob = await response.blob();
const url = URL.createObjectURL(blob);
const a = document.createElement('a');
a.href = url;
a.download = 'my_voice.zip';
a.click();

Python Example

import requests

response = requests.get(
    'https://voice.sogni.ai/pocket-tts/voices/clone/my_voice/download'
)
# Raise on HTTP errors instead of saving an error body as a ZIP file
response.raise_for_status()

with open('my_voice.zip', 'wb') as f:
    f.write(response.content)

Import Voice Clone

POST /pocket-tts/voices/clone/import

Import a previously exported voice clone from a ZIP file. The ZIP must contain a reference.wav file.

Parameter | Type | Description
file * | File | ZIP file containing the voice clone
cloneId (optional) | string | Custom name for the imported clone. If omitted, uses the name from metadata.
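Because the import requires a reference.wav file inside the archive, a client can check for it before uploading. The following is a minimal sketch using the standard zipfile module. zip_has_reference is an illustrative name, and accepting reference.wav in any folder of the archive is an assumption, since the exact ZIP layout is not described here.

```python
import io
import zipfile


def zip_has_reference(data: bytes) -> bool:
    """Return True if the bytes are a ZIP archive containing a reference.wav entry."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return False
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        # Assumption: reference.wav may sit at the root or inside a folder.
        return any(name.rsplit("/", 1)[-1] == "reference.wav"
                   for name in archive.namelist())
```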

cURL Example

curl -X POST https://voice.sogni.ai/pocket-tts/voices/clone/import \
  -F "file=@my_voice.zip" \
  -F "cloneId=restored_voice"

Response

{
  "success": true,
  "cloneId": "restored_voice",
  "message": "Voice clone imported successfully"
}

JavaScript Example

const formData = new FormData();
formData.append('file', zipFile);
formData.append('cloneId', 'my_imported_voice');

const response = await fetch('https://voice.sogni.ai/pocket-tts/voices/clone/import', {
  method: 'POST',
  body: formData
});

const data = await response.json();
console.log('Imported clone:', data.cloneId);

Python Example

import requests

with open('my_voice.zip', 'rb') as f:
    response = requests.post(
        'https://voice.sogni.ai/pocket-tts/voices/clone/import',
        files={'file': f},
        data={'cloneId': 'restored_voice'}
    )

data = response.json()
print(f"Imported clone: {data['cloneId']}")

4. TTS (Text-to-Speech) - Qwen3

Qwen3-TTS is a multilingual TTS model with voice cloning, emotion control, and voice design capabilities. These endpoints are available only when QWEN_TTS_ENABLED=true is set on the server. The model supports 11 languages, including English, Chinese, Japanese, Korean, French, German, and Spanish.

Generate Speech

POST /qwen-tts

Parameter | Type | Default | Description
text * | string | - | Text to convert to speech (1-10,000 characters)
voice (optional) | string | Chelsie | Voice to use (see /qwen-tts/voices for available voices)
language (optional) | string | English | Language for synthesis
format (optional) | string | wav | Output format: "wav", "opus", or "buffer" (base64 JSON)

cURL Example

curl -X POST https://voice.sogni.ai/qwen-tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, world!", "voice": "Chelsie"}' \
  --output output.wav

Custom Voice (Emotion/Style Control)

POST /qwen-tts/custom-voice

Generate speech with emotion and style instructions. Requires the CustomVoice model variant.

Parameter | Type | Description
text * | string | Text to convert to speech
speaker (optional) | string | Speaker voice to use (default: Chelsie)
instruct * | string | Emotion/style instruction (e.g., "Very happy and excited", "Speak slowly with a calm tone")
format (optional) | string | Output format: "wav", "opus", or "buffer"

cURL Example

curl -X POST https://voice.sogni.ai/qwen-tts/custom-voice \
  -H "Content-Type: application/json" \
  -d '{"text": "I am so excited!", "speaker": "Chelsie", "instruct": "Very happy and enthusiastic"}' \
  --output excited.wav

Voice Design (Create Voice from Description)

POST /qwen-tts/voice-design

Generate speech using a voice created from a text description. Requires the VoiceDesign model variant.

Parameter | Type | Description
text * | string | Text to convert to speech
instruct * | string | Voice description (e.g., "A deep male voice with a warm, calm tone")
format (optional) | string | Output format: "wav", "opus", or "buffer"

cURL Example

curl -X POST https://voice.sogni.ai/qwen-tts/voice-design \
  -H "Content-Type: application/json" \
  -d '{"text": "Welcome to our service.", "instruct": "A deep male voice with calm, professional tone"}' \
  --output designed_voice.wav

List Voices

GET /qwen-tts/voices

Response

{
  "voices": ["Chelsie", "Ethan", "Serena", "Vivian", "Ryan", "Aiden", "Eric", "Dylan"],
  "clones": ["my_clone"],
  "default": "Chelsie",
  "defaultLanguage": "English",
  "modelVariants": {
    "base": "base-0.6b",
    "customVoice": "custom-voice"
  },
  "features": ["voice_cloning", "custom_voice"]
}
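Because several Qwen3 endpoints depend on which model variants are loaded, a client may want to inspect the features array before calling them. The sketch below is one way to do that. The mapping from feature names to endpoints is an assumption inferred from the names in the example response, and available_endpoints is a hypothetical helper.

```python
# Assumption: these feature names gate these endpoints; inferred from naming only.
FEATURE_ENDPOINTS = {
    "voice_cloning": "/qwen-tts/voices/clone",
    "custom_voice": "/qwen-tts/custom-voice",
}


def available_endpoints(listing: dict) -> list:
    """Return the feature-gated endpoints that the voices listing reports as enabled."""
    features = listing.get("features", [])
    return [path for name, path in FEATURE_ENDPOINTS.items() if name in features]
```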

Create Voice Clone

POST /qwen-tts/voices/clone

Upload a reference audio file with its transcript to create a voice clone. Requires the Base model variant.

Parameter | Type | Description
audio * | File | Reference audio file (3-10 seconds, WAV/MP3/OGG)
transcript * | string | Exact text spoken in the reference audio
cloneId (optional) | string | Custom name for the clone (alphanumeric, underscore, hyphen)
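The 3-10 second requirement can be checked before upload. The following sketch measures a WAV file with the standard wave module. It handles uncompressed WAV only (not MP3 or OGG), treats the range as inclusive (an assumption), and uses illustrative function names.

```python
import io
import wave


def wav_duration_seconds(data: bytes) -> float:
    """Return the duration of WAV audio, computed as frame count / frame rate."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_valid_reference_length(data: bytes, min_s: float = 3.0, max_s: float = 10.0) -> bool:
    """Check that the reference audio falls inside the documented 3-10 second range."""
    # Assumption: both ends of the range are inclusive.
    return min_s <= wav_duration_seconds(data) <= max_s
```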

cURL Example

curl -X POST https://voice.sogni.ai/qwen-tts/voices/clone \
  -F "audio=@/path/to/reference.wav" \
  -F "transcript=Hello, this is my voice sample." \
  -F "cloneId=my_voice"

Response

{
  "success": true,
  "cloneId": "my_voice",
  "message": "Voice clone created successfully"
}

Generate with Cloned Voice

POST /qwen-tts/voices/clone/{cloneId}/generate

Parameter | Type | Description
text * | string | Text to convert to speech
language (optional) | string | Language for synthesis (default: English)
format (optional) | string | Output format: "wav", "opus", or "buffer"

cURL Example

curl -X POST https://voice.sogni.ai/qwen-tts/voices/clone/my_voice/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello from my cloned voice!"}' \
  --output cloned_output.wav

Rename Voice Clone

PATCH /qwen-tts/voices/clone/{cloneId}

Parameter | Type | Description
newCloneId * | string | New name for the voice clone

cURL Example

curl -X PATCH https://voice.sogni.ai/qwen-tts/voices/clone/my_voice \
  -H "Content-Type: application/json" \
  -d '{"newCloneId": "renamed_voice"}'

Delete Voice Clone

DELETE /qwen-tts/voices/clone/{cloneId}

cURL Example

curl -X DELETE https://voice.sogni.ai/qwen-tts/voices/clone/my_voice

Response

{
  "success": true,
  "cloneId": "my_voice",
  "message": "Voice clone deleted successfully"
}

5. Health Check

Check if the API server is running and healthy.

GET /health

Response

{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00.000Z",
  "uptime": 3600
}
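A monitoring script can reduce this response to a pass/fail flag plus the server's reported time. A minimal sketch, assuming the response shape shown above; parse_health is an illustrative name. The uptime field is left alone because its unit is not stated here.

```python
from datetime import datetime


def parse_health(payload: dict) -> tuple:
    """Return (is_healthy, server_time) from a parsed /health response."""
    healthy = payload.get("status") == "healthy"
    # Replace the trailing "Z" so fromisoformat() also works on Python versions before 3.11.
    server_time = datetime.fromisoformat(payload["timestamp"].replace("Z", "+00:00"))
    return healthy, server_time
```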

cURL Example

curl https://voice.sogni.ai/health