API Documentation

Complete reference for the TTSFM Text-to-Speech API. Free, simple, and powerful.

Overview

The TTSFM API provides a modern, OpenAI-compatible interface for text-to-speech generation. It supports multiple voices, audio formats, and includes advanced features like text length validation and intelligent auto-combine functionality.

Base URL: http://ttsapi.site (endpoint paths below are absolute, e.g. /api/voices, /v1/audio/speech)

Key Features

  • 🎀 11 different voice options - Choose from alloy, echo, nova, and more
  • 🎡 Multiple audio formats - MP3, WAV, OPUS, AAC, FLAC, PCM support
  • πŸ€– OpenAI compatibility - Drop-in replacement for OpenAI's TTS API
  • ✨ Auto-combine feature - Automatically handles long text (>1000 chars) by splitting and combining audio
  • πŸ“Š Text length validation - Smart validation with configurable limits
  • πŸ“ˆ Real-time monitoring - Status endpoints and health checks
New in v3.3.4: Runtime images now ship with ffmpeg so MP3 auto-combine succeeds immediately, and the default long-text limit is trimmed to 1000 characters for predictable playback.

Operational Notes

  • Requests above 1000 characters are automatically split when auto_combine is enabled; disable validation to manage chunking yourself.
  • MP3 requests return MP3; OPUS, AAC, FLAC, and PCM requests are served as WAV for reliable playback.
  • Audio comes from the third-party openai.fm service; availability may change without noticeβ€”add graceful fallbacks.
  • The Docker image bundles ffmpeg so combined MP3 responses work immediately without extra setup.
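Since upstream availability can change without notice, a simple retry wrapper is one way to add the graceful fallback the notes recommend. This is an illustrative sketch using only the standard library (the helper name and backoff policy are ours, not part of TTSFM):

```python
import time
import urllib.error
import urllib.request

def fetch_with_retry(req, attempts=3, backoff=1.0):
    """Fetch a URL or Request, retrying with exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(backoff * (2 ** attempt))
```

On the final failed attempt the original exception propagates, so callers can fall back to cached audio or a different provider.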

Authentication

Currently, the API supports optional API key authentication. If configured, include your API key in the request headers.

Authorization: Bearer YOUR_API_KEY
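In Python, the header can be attached like this (sketch; `YOUR_API_KEY` is a placeholder and the helper name is ours):

```python
import urllib.request

# Placeholder key; only required if the server has API key auth configured.
API_KEY = "YOUR_API_KEY"

def authed_request(url):
    """Build a request carrying the optional Bearer token."""
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {API_KEY}"}
    )

req = authed_request("http://ttsapi.site/api/voices")
# urllib.request.urlopen(req)  # uncomment to perform the call
```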

Text Length Validation

TTSFM includes built-in text length validation to ensure compatibility with TTS models. The default maximum length is 1000 characters, but this can be customized.

Important: Text exceeding the maximum length will be rejected unless validation is disabled or the text is split into chunks.

Validation Options

  • max_length: Maximum allowed characters (default: 1000)
  • validate_length: Enable/disable validation (default: true)
  • preserve_words: Avoid splitting words when chunking (default: true)
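If you disable validation and manage chunking yourself, the `preserve_words` behaviour can be approximated client-side. The sketch below is illustrative only, not the library's internal splitter:

```python
def split_text(text, max_length=1000, preserve_words=True):
    """Split text into chunks of at most max_length characters.

    With preserve_words=True, each cut happens at the last space before
    the limit, so no word is broken across two chunks.
    """
    chunks = []
    while len(text) > max_length:
        cut = max_length
        if preserve_words:
            space = text.rfind(" ", 0, max_length)
            if space > 0:
                cut = space
        chunks.append(text[:cut].rstrip())
        text = text[cut:].lstrip()
    if text:
        chunks.append(text)
    return chunks
```

Each resulting chunk can then be submitted to the generation endpoints individually.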

API Endpoints

GET /api/voices

Get list of available voices.

Response Example (truncated):
{
  "voices": [
    {
      "id": "alloy",
      "name": "Alloy",
      "description": "Alloy voice"
    },
    {
      "id": "echo",
      "name": "Echo", 
      "description": "Echo voice"
    }
  ],
  "count": 6
}

GET /api/formats

Get available audio formats for speech generation.

Available Formats

The API accepts requests for all six formats, but only two encodings are actually produced:

  • mp3 - Returns actual MP3 format
  • opus, aac, flac, pcm - Mapped to WAV format
  • wav - Returns WAV format
Note: When you request opus, aac, flac, or pcm, you'll receive WAV audio data.
Response Example:
{
  "formats": [
    {
      "id": "mp3",
      "name": "MP3",
      "mime_type": "audio/mp3",
      "description": "MP3 audio format"
    },
    {
      "id": "opus", 
      "name": "Opus",
      "mime_type": "audio/wav",
      "description": "Returns WAV format"
    },
    {
      "id": "aac",
      "name": "AAC", 
      "mime_type": "audio/wav",
      "description": "Returns WAV format"
    },
    {
      "id": "flac",
      "name": "FLAC",
      "mime_type": "audio/wav", 
      "description": "Returns WAV format"
    },
    {
      "id": "wav",
      "name": "WAV",
      "mime_type": "audio/wav",
      "description": "WAV audio format"
    },
    {
      "id": "pcm",
      "name": "PCM",
      "mime_type": "audio/wav",
      "description": "Returns WAV format"
    }
  ],
  "count": 6
}
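Given the mapping above, a client can predict the encoding and MIME type it will actually receive. The helper names here are ours, not part of the API:

```python
def effective_format(requested):
    """Map a requested format id to the encoding actually returned."""
    return "mp3" if requested.lower() == "mp3" else "wav"

def mime_for(requested):
    """MIME type matching the formats table above."""
    return "audio/mp3" if requested.lower() == "mp3" else "audio/wav"
```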

POST /api/validate-text

Validate text length and get splitting suggestions.

Request Body:
{
  "text": "Your text to validate",
  "max_length": 1000
}
Response Example:
{
  "text_length": 5000,
  "max_length": 1000,
  "is_valid": false,
  "needs_splitting": true,
  "suggested_chunks": 2,
  "chunk_preview": [
    "First chunk preview...",
    "Second chunk preview..."
  ]
}
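A minimal Python call to this endpoint might look like the following sketch (standard library only; the builder function is ours, and the live call is left commented out):

```python
import json
import urllib.request

def build_validate_request(text, max_length=1000):
    """Build the POST /api/validate-text request described above."""
    body = json.dumps({"text": text, "max_length": max_length}).encode("utf-8")
    return urllib.request.Request(
        "http://ttsapi.site/api/validate-text",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_validate_request("Your text to validate")
# result = json.load(urllib.request.urlopen(req))
# if result["needs_splitting"]:
#     print(f"Suggested chunks: {result['suggested_chunks']}")
```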

POST /api/generate

Generate speech from text.

Request Body:
{
  "text": "Hello, world!",
  "voice": "alloy",
  "format": "mp3",
  "instructions": "Speak cheerfully",
  "max_length": 1000,
  "validate_length": true
}
Parameters:
  • text (required): Text to convert to speech
  • voice (optional): Voice ID (default: "alloy")
  • format (optional): Audio format (default: "mp3")
  • instructions (optional): Voice modulation instructions
  • max_length (optional): Maximum text length (default: 1000)
  • validate_length (optional): Enable validation (default: true)
Response:

Returns audio file with appropriate Content-Type header.
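Putting the parameters together, a request can be sketched like this with the standard library (helper names are ours; uncomment the last line to hit the live API and write the audio bytes to disk):

```python
import json
import urllib.request

def build_generate_payload(text, voice="alloy", fmt="mp3", instructions=None,
                           max_length=1000, validate_length=True):
    """Assemble the /api/generate request body from the parameters above."""
    payload = {
        "text": text,
        "voice": voice,
        "format": fmt,
        "max_length": max_length,
        "validate_length": validate_length,
    }
    if instructions is not None:
        payload["instructions"] = instructions
    return payload

def generate_speech(payload, out_path):
    """POST to /api/generate and save the returned audio to out_path."""
    req = urllib.request.Request(
        "http://ttsapi.site/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path

payload = build_generate_payload("Hello, world!", instructions="Speak cheerfully")
# generate_speech(payload, "hello.mp3")
```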

Python Package

Long Text Support

The TTSFM Python package includes built-in long text splitting functionality for developers who need fine-grained control:

from ttsfm import TTSClient, Voice, AudioFormat

# Create client
client = TTSClient()

# Generate speech from long text (automatically splits into separate files)
responses = client.generate_speech_long_text(
    text="Very long text that exceeds 1000 characters...",
    voice=Voice.ALLOY,
    response_format=AudioFormat.MP3,
    max_length=2000,
    preserve_words=True
)

# Save each chunk as separate files
for i, response in enumerate(responses, 1):
    response.save_to_file(f"part_{i:03d}.mp3")
Developer Features:
  • Manual Splitting: Full control over text chunking for advanced use cases
  • Word Preservation: Maintains word boundaries for natural speech
  • Separate Files: Each chunk saved as individual audio file
  • CLI Support: Use `--split-long-text` flag for command-line usage
Note: For web users, the auto-combine feature in `/v1/audio/speech` is recommended as it automatically handles long text and returns a single seamless audio file.

POST /api/generate-combined

Generate a single combined audio file from long text. Automatically splits text into chunks, generates speech for each chunk, and combines them into one seamless audio file.

Request Body:
{
  "text": "Very long text that exceeds the limit...",
  "voice": "alloy",
  "format": "mp3",
  "instructions": "Optional voice instructions",
  "max_length": 1000,
  "preserve_words": true
}
Response:

Returns a single audio file containing all chunks combined seamlessly.

Response Headers:
  • X-Chunks-Combined: Number of chunks that were combined
  • X-Original-Text-Length: Original text length in characters
  • X-Audio-Size: Final audio file size in bytes

POST /v1/audio/speech

Enhanced OpenAI-compatible endpoint with auto-combine feature. Automatically handles long text by splitting and combining audio chunks when needed.

Request Body:
{
  "model": "gpt-4o-mini-tts",
  "input": "Text of any length...",
  "voice": "alloy",
  "response_format": "mp3",
  "instructions": "Optional voice instructions",
  "speed": 1.0,
  "auto_combine": true,
  "max_length": 1000
}
Enhanced Parameters:
  • auto_combine (boolean, default: true):
    • true: Automatically split long text and combine audio chunks into a single file
    • false: Return error if text exceeds max_length (standard OpenAI behavior)
  • max_length (integer, default: 1000): Maximum characters per chunk when splitting
Response Headers:
  • X-Auto-Combine: Whether auto-combine was enabled (true/false)
  • X-Chunks-Combined: Number of audio chunks combined (1 for short text)
  • X-Original-Text-Length: Original text length (for long text processing)
  • X-Audio-Format: Audio format of the response
  • X-Audio-Size: Audio file size in bytes
Examples:
# Short text (works normally)
curl -X POST http://ttsapi.site/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini-tts",
    "input": "Hello world!",
    "voice": "alloy"
  }'

# Long text with auto-combine (default)
curl -X POST http://ttsapi.site/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini-tts",
    "input": "Very long text...",
    "voice": "alloy",
    "auto_combine": true
  }'

# Long text without auto-combine (will error)
curl -X POST http://ttsapi.site/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini-tts",
    "input": "Very long text...",
    "voice": "alloy",
    "auto_combine": false
  }'
Audio Combination: Uses advanced audio processing (PyDub) when available, with intelligent fallbacks for different environments. Supports all audio formats.
Use Cases:
  • Long Articles: Convert blog posts or articles to single audio files
  • Audiobooks: Generate chapters as single audio files
  • Podcasts: Create podcast episodes from scripts
  • Educational Content: Convert learning materials to audio
Example Usage (Python, via /api/generate-combined):
# Python example
import requests

response = requests.post(
    "http://ttsapi.site/api/generate-combined",
    json={
        "text": "Your very long text content here...",
        "voice": "nova",
        "format": "mp3",
        "max_length": 2000
    }
)

if response.status_code == 200:
    with open("combined_audio.mp3", "wb") as f:
        f.write(response.content)

    chunks = response.headers.get('X-Chunks-Combined')
    print(f"Combined {chunks} chunks into single file")

WebSocket Streaming

Real-time audio streaming for enhanced user experience. Get audio chunks as they're generated instead of waiting for the complete file.

WebSocket streaming provides lower perceived latency and real-time progress tracking for TTS generation.

Connection

// JavaScript WebSocket client
const client = new WebSocketTTSClient({
    socketUrl: 'http://ttsapi.site',
    debug: true
});

// Connection events
client.onConnect = () => console.log('Connected');
client.onDisconnect = () => console.log('Disconnected');

Streaming TTS Generation

// Generate speech with real-time streaming
const result = await client.generateSpeech('Hello, WebSocket world!', {
    voice: 'alloy',
    format: 'mp3',
    chunkSize: 1024,  // Characters per chunk
    
    // Progress callback
    onProgress: (progress) => {
        console.log(`Progress: ${progress.progress}%`);
        console.log(`Chunks: ${progress.chunksCompleted}/${progress.totalChunks}`);
    },
    
    // Receive audio chunks in real-time
    onChunk: (chunk) => {
        console.log(`Received chunk ${chunk.chunkIndex + 1}`);
        // Process or play audio chunk immediately
        processAudioChunk(chunk.audioData);
    },
    
    // Completion callback
    onComplete: (result) => {
        console.log('Streaming complete!');
        // result.audioData contains the complete audio
    }
});

WebSocket Events

Client β†’ Server Events
  • generate_stream - Start TTS generation. Payload: {text, voice, format, chunk_size}
  • cancel_stream - Cancel active stream. Payload: {request_id}

Server β†’ Client Events
  • stream_started - Stream initiated. Payload: {request_id, timestamp}
  • audio_chunk - Audio chunk ready. Payload: {request_id, chunk_index, audio_data, duration}
  • stream_progress - Progress update. Payload: {progress, chunks_completed, total_chunks}
  • stream_complete - Generation complete. Payload: {request_id, total_chunks, status}
  • stream_error - Error occurred. Payload: {request_id, error, timestamp}

Benefits

  • Real-time feedback: Users see progress as audio generates
  • Lower latency: First audio chunk arrives quickly
  • Cancellable: Stop generation mid-stream if needed
  • Efficient: Process chunks as they arrive

Example: Streaming Audio Player

// Create a streaming audio player
const audioChunks = [];
let isPlaying = false;

const streamingPlayer = await client.generateSpeech(longText, {
    voice: 'nova',
    format: 'mp3',
    
    onChunk: (chunk) => {
        // Store chunk
        audioChunks.push(chunk.audioData);
        
        // Start playing after first chunk
        if (!isPlaying && audioChunks.length >= 3) {
            startStreamingPlayback(audioChunks);
            isPlaying = true;
        }
    },
    
    onComplete: (result) => {
        // Ensure all chunks are played
        finishPlayback(result.audioData);
    }
});

Try It Out!

Experience WebSocket streaming in action at the WebSocket Demo or enable streaming mode in the Playground.