Text-to-speech (TTS)
Introduction to Hume’s TTS API, including its features, usage limits, and key concepts for integration.
Octave TTS is the first text-to-speech system built on LLM intelligence. Unlike conventional TTS that merely “reads” words, Octave is a “speech-language model” that understands what words mean in context, unlocking a new level of expressiveness and nuance.
At its core, Octave is a state-of-the-art large language model (LLM) that Hume AI trained to understand and synthesize speech. This speech-language model can predict the tune, rhythm, and timbre of speech, knowing when to whisper secrets, shout triumphantly, or calmly explain a fact. This combined approach lets Octave interpret plot twists, emotional cues, and character traits within a script or prompt, then transform that understanding into lifelike speech.
You retain full ownership of any audio content you generate using Octave. For complete details on ownership rights, please see Hume’s Terms of Use.
Quickstart
Our quickstart guides show how to integrate Octave TTS using the TypeScript SDK, the Python SDK, or the command-line tool. Each guide walks you through API integration and demonstrates text-to-speech synthesis.
- Integrate Octave TTS into web and Node.js applications using our TypeScript SDK.
- Use our Python SDK to integrate Octave TTS into your Python applications.
- Get started synthesizing text-to-speech with our command-line tool.
Using the Text-to-Speech API
Use the Octave TTS API to generate expressive, human-like speech from text. Optionally include voice prompts or acting instructions to control how the speech is delivered, and receive synthesized audio in your chosen format. The API supports synchronous and streaming response types:
- Synchronous responses: Receive the complete audio result in a single HTTP response.
- Streaming responses: Receive audio in real-time chunks as it’s generated.
Authentication is required for all API requests. Learn more in the Authentication Strategies guide.
Glossary
The following terms are fundamental to using the TTS API:
Voice Selection
You can control which voice is used in one of two ways:
- Dynamic generation – Leave the `voice` field blank and provide a `description` to guide generation.
- Predefined voice – Specify a voice by `name` or `id` from your Custom Voices or Hume’s Voice Library.
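The two options above can be sketched as a small helper that builds one utterance for a request body. This is a minimal sketch, not the SDK's API: `build_utterance` is a hypothetical helper, and the `text` key and the nested `{"name": ...}` / `{"id": ...}` shape of `voice` are assumptions to check against the API Reference.

```python
from typing import Optional


def build_utterance(text: str,
                    description: Optional[str] = None,
                    voice_name: Optional[str] = None,
                    voice_id: Optional[str] = None) -> dict:
    """Build one utterance dict (hypothetical helper; field shapes are assumed)."""
    if voice_name and voice_id:
        raise ValueError("specify a voice by name or by id, not both")
    utterance = {"text": text}  # "text" key is an assumption
    if description:
        utterance["description"] = description
    if voice_name:
        utterance["voice"] = {"name": voice_name}  # assumed nested shape
    elif voice_id:
        utterance["voice"] = {"id": voice_id}      # assumed nested shape
    return utterance


# Dynamic generation: no voice, only a description to guide generation.
dynamic = build_utterance("Welcome aboard.",
                          description="A warm, calm flight attendant")

# Predefined voice: reference a saved voice by name (placeholder name).
predefined = build_utterance("Welcome aboard.", voice_name="My Narrator")
```

Omitting the `voice` key entirely is how this sketch models "leaving the field blank".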
For more on creating, saving, and managing voices, see the Voices Guide.
Request and response workflow
An API request contains one or more Utterances. The response includes one or more Generations (controlled by the `num_generations` parameter). Each Generation contains a `snippets` array, which groups the resulting Snippet(s) corresponding to the input Utterances.
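To make the Generation and Snippet relationship concrete, the sketch below walks a hand-written sample response. Only `snippets` and `num_generations` are named in this guide. The `generations` and `audio` keys, the base64 audio field on each Generation, and the list-of-groups nesting of `snippets` are assumptions for illustration.

```python
import base64

# Hand-written sample, not real API output. Keys other than "snippets" are assumed.
sample_response = {
    "generations": [
        {
            "audio": base64.b64encode(b"fake-audio-bytes").decode("ascii"),
            # Assumed nesting: one inner list of Snippets per input Utterance.
            "snippets": [[{"text": "Hello."}], [{"text": "Goodbye."}]],
        }
    ]
}


def summarize(response: dict) -> list:
    """Return (audio_bytes, snippet_count) for each Generation in the response."""
    results = []
    for generation in response["generations"]:
        audio = base64.b64decode(generation["audio"])
        snippet_count = sum(len(group) for group in generation["snippets"])
        results.append((audio, snippet_count))
    return results


for audio, count in summarize(sample_response):
    print(f"{len(audio)} audio bytes across {count} snippet(s)")
```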
Segmentation
By default, Octave automatically segments each Generation into multiple Snippets to optimize for natural speech flow. These snippets represent coherent segments of audio output, which may not directly map to the boundaries of the input Utterances.
To disable this behavior and enforce a strict 1:1 mapping (one Snippet per Utterance group), set `split_utterances` to `false` in your request. This ensures that each input utterance results in a single, unsegmented audio unit in the response.
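A request body that turns segmentation off might be assembled as follows. This is a sketch under assumptions: `build_request_body` is a hypothetical helper, and the top-level `utterances` key and the per-utterance `text` key are not confirmed by this guide.

```python
import json


def build_request_body(utterances, num_generations=1, split_utterances=True):
    """Assemble a TTS request body (hypothetical helper; 'utterances' key is assumed)."""
    body = {
        "utterances": list(utterances),
        "num_generations": num_generations,
    }
    # Segmentation is on by default, so the flag is only sent when disabling it.
    if not split_utterances:
        body["split_utterances"] = False
    return body


body = build_request_body(
    [{"text": "First line."}, {"text": "Second line."}],
    split_utterances=False,
)
print(json.dumps(body, indent=2))
```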
For a complete list of all request parameters, response fields, endpoint formats, and technical specifications, consult the API Reference.
Response Types
Octave TTS supports two response types: synchronous and streaming. Choose the one that best fits your latency requirements and how you plan to handle audio output.
Synchronous Requests
Synchronous endpoints return the complete audio result after the request is fully processed. These are best for use cases where latency is less important and you want the full result before playback or storage.
- JSON response (`/v0/tts`): Returns a JSON object containing the full audio as a base64-encoded string. Best for previewing short clips or handling audio immediately in your application.
- File response (`/v0/tts/file`): Returns audio as a downloadable file (e.g., `audio/mpeg`). Best for saving audio or serving it from your backend.
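As a rough illustration of calling the two synchronous endpoints with only the Python standard library, the sketch below prepares a POST request and shows how a file response could be saved. The base URL `https://api.hume.ai`, the `X-Hume-Api-Key` header, and the request body shape are assumptions here. Confirm them in the API Reference and the Authentication Strategies guide.

```python
import json
import urllib.request

BASE_URL = "https://api.hume.ai"  # assumed base URL


def make_tts_request(api_key: str, body: dict, path: str = "/v0/tts"):
    """Prepare (but do not send) a POST request for a TTS endpoint."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Hume-Api-Key": api_key,  # assumed auth header name
        },
        method="POST",
    )


def synthesize_to_file(api_key: str, body: dict, out_path: str) -> None:
    """Call the file endpoint and write the returned audio bytes to disk."""
    request = make_tts_request(api_key, body, path="/v0/tts/file")
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as f:
        f.write(response.read())


# Building the request does not touch the network; only urlopen() sends it.
request = make_tts_request("YOUR_API_KEY", {"utterances": [{"text": "Hello."}]})
print(request.get_method(), request.full_url)
```

The JSON endpoint would be called the same way with the default path, followed by base64-decoding the audio string from the parsed response.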
Streaming Requests
Streaming endpoints return audio in real time as it’s generated, allowing playback to begin sooner and reducing perceived latency. These responses are delivered over an HTTP connection using chunked transfer encoding.
- Streamed JSON response (`/v0/tts/stream/json`): Returns a stream of JSON objects, each containing a chunk of the synthesized audio (base64-encoded) and associated metadata. Best for real-time applications that need audio and metadata together.
- Streamed file response (`/v0/tts/stream/file`): Returns a continuous stream of raw audio chunks (e.g., `audio/mpeg`). Best for media pipelines or players that support HTTP audio streaming.
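Consuming the streamed JSON response means reassembling JSON objects from network chunks that can split at arbitrary byte boundaries. The sketch below assumes the objects are newline-delimited and that each carries its base64 audio in an `audio` field. Neither detail is stated in this guide, so treat both as assumptions to verify.

```python
import base64
import json
from typing import Iterable, Iterator


def iter_audio(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield decoded audio bytes from newline-delimited JSON arriving in raw chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # A chunk may contain zero, one, or several complete lines.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield base64.b64decode(json.loads(line)["audio"])  # assumed field
    # Handle a final object that was not terminated by a newline.
    if buffer.strip():
        yield base64.b64decode(json.loads(buffer)["audio"])


# Simulated stream: two JSON objects split across three network chunks.
messages = [{"audio": base64.b64encode(b"abc").decode("ascii")},
            {"audio": base64.b64encode(b"def").decode("ascii")}]
wire = ("\n".join(json.dumps(m) for m in messages) + "\n").encode("utf-8")
simulated_chunks = [wire[:10], wire[10:25], wire[25:]]

for audio in iter_audio(simulated_chunks):
    print(len(audio), "bytes ready for playback")
```

In a real client, `chunks` would come from reading the HTTP response incrementally, so playback can begin before synthesis finishes.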
Ultra Low Latency Streaming: Instant Mode
For the lowest possible latency, use instant mode—a specialized streaming mode designed for real-time applications that need audio playback to start immediately.
In standard streaming mode, the server introduces a short lead time before sending the first audio chunk. This delay gives the model a brief head start before streaming begins, helping ensure smooth and consistent delivery.
When instant mode is enabled:
- No lead time is introduced. The server begins streaming generated audio as soon as it’s ready.
- Audio is streamed in smaller sub-snippet chunks (~1 second each).
- First audio is typically ready in ~200ms, depending on system load and input complexity.
The structure of streamed responses remains unchanged—each chunk includes the same metadata fields as standard streaming. The only difference is in the timing and size of the audio payloads.
To use instant mode:
- Set `instant_mode` to `true` in your request.
- Specify a predefined `voice` by `name` or `id` (dynamic voice generation is not supported).
- Ensure only a single generation is requested (`num_generations` must be `1` or omitted).
This mode incurs a 10% higher cost per request due to additional compute requirements.
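The instant mode requirements listed above can be checked on the client before a request is sent. This is a sketch: `check_instant_mode` is a hypothetical helper, and the `utterances` key and nested `voice` shape are assumptions about the request body.

```python
def check_instant_mode(body: dict) -> list:
    """Return a list of problems that would prevent instant mode (empty if none)."""
    problems = []
    if body.get("instant_mode") is not True:
        problems.append("instant_mode must be set to true")
    # Every utterance needs a predefined voice, referenced by name or id.
    for index, utterance in enumerate(body.get("utterances", [])):
        voice = utterance.get("voice") or {}
        if not (voice.get("name") or voice.get("id")):
            problems.append(f"utterance {index}: a predefined voice is required")
    # num_generations must be 1 or omitted.
    if body.get("num_generations", 1) != 1:
        problems.append("num_generations must be 1 or omitted")
    return problems


valid_body = {
    "instant_mode": True,
    "utterances": [{"text": "Hi there.", "voice": {"name": "My Narrator"}}],
}
invalid_body = {
    "instant_mode": True,
    "num_generations": 2,
    "utterances": [{"text": "Hi there.", "description": "calm and friendly"}],
}

print(check_instant_mode(valid_body))
print(check_instant_mode(invalid_body))
```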
API limits
- Request rate limit: 100 requests per minute
- Maximum text length: 5,000 characters
- Maximum description length: 1,000 characters
- Maximum generations per request: 5
- Supported audio formats: `MP3`, `WAV`, `PCM`
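These limits can also be enforced on the client so that oversized requests fail fast. The sketch below is illustrative: `check_limits` is a hypothetical helper, the request body shape is assumed, and applying the text and description limits per utterance (rather than per request) is an assumption.

```python
MAX_TEXT_CHARS = 5000
MAX_DESCRIPTION_CHARS = 1000
MAX_GENERATIONS = 5
SUPPORTED_FORMATS = {"mp3", "wav", "pcm"}


def check_limits(body: dict, audio_format: str = "mp3") -> list:
    """Return a list of limit violations for a request body (empty if none)."""
    problems = []
    for index, utterance in enumerate(body.get("utterances", [])):
        # Assumption: limits are applied to each utterance individually.
        if len(utterance.get("text", "")) > MAX_TEXT_CHARS:
            problems.append(f"utterance {index}: text exceeds {MAX_TEXT_CHARS} characters")
        if len(utterance.get("description") or "") > MAX_DESCRIPTION_CHARS:
            problems.append(
                f"utterance {index}: description exceeds {MAX_DESCRIPTION_CHARS} characters")
    if not 1 <= body.get("num_generations", 1) <= MAX_GENERATIONS:
        problems.append(f"num_generations must be between 1 and {MAX_GENERATIONS}")
    if audio_format.lower() not in SUPPORTED_FORMATS:
        problems.append(f"unsupported audio format: {audio_format}")
    return problems


print(check_limits({"utterances": [{"text": "x" * 5001}]}))
```

The request rate limit is not covered here, since it depends on how requests are spread over time rather than on any single request body.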