Text-to-Speech (TTS)
Octave 2 (preview) and EVI 4-mini are live! Expanded language support and lower latency for faster, more natural responses. Learn more.
Octave TTS is the first text-to-speech system built on LLM intelligence. Octave understands the text it speaks, both emotionally and semantically. It knows when to whisper secrets, when to shout in triumph, and when to calmly state facts. It produces industry-leading voice quality and expressiveness at real-time speeds. Create any voice you can imagine on Octave through prompting, or use Octave to create a state-of-the-art clone of your own voice.
You retain full ownership of any audio content you generate using Octave. For complete details on ownership rights, please see Hume’s Terms of Use.
Features
Key capabilities
- Industry-leading expression: Octave uses LLM intelligence to recognize nuanced meaning, adapting pronunciation, pitch, tempo, and emphasis to match each word’s emotional intent.
- Real-time speeds: Octave 2 (preview) generates high-quality audio with latencies as low as ~100ms (not including network transit), suitable for conversational and interactive applications.
- Design any voice you can imagine: From describing a “patient, empathetic counselor” to requesting a “dramatic medieval knight,” Octave instantly creates a fitting voice. See Voice Design.
- State-of-the-art Voice Cloning: Octave can create a high-quality voice clone using as little as 15 seconds of audio. See Voice Cloning.
- Long-form versatility: Perfect for audiobooks, podcasts, or voiceover work, Octave preserves emotional consistency across chapters or scene changes—even when characters shift from joy to despair. (See TTS Projects)
Octave versions
Quickstart
Accelerate your project setup with our quickstart guides, designed to integrate Octave TTS into your TypeScript, Python, or .NET applications. Each guide walks you through API integration and demonstrates text-to-speech synthesis, helping you get up and running quickly.
Integrate Octave TTS into web and Node.js applications using our TypeScript SDK.
Use our Python SDK to integrate Octave TTS into your Python applications.
Use our .NET SDK to integrate Octave TTS into your .NET applications.
Get started synthesizing text-to-speech with our command-line tool.
Glossary
Streaming and non-streaming
The TTS API supports both streaming and non-streaming (synchronous) responses.
Streaming endpoints return audio as it is generated so playback can begin quickly, while non-streaming endpoints return the full result after processing completes.
Unidirectional streaming (HTTP)
- Streamed JSON → `/v0/tts/stream/json`: Emits a sequence of JSON objects, each including base64-encoded audio and metadata.
- Streamed file → `/v0/tts/stream/file`: Sends a continuous stream of raw audio bytes (for example, `audio/mpeg`).
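For example, a stream of chunk objects from `/v0/tts/stream/json` can be reassembled into playable audio by decoding each chunk’s base64 payload. The sketch below assumes newline-delimited JSON and an `audio` field name—both are illustrative assumptions; check the API reference for the exact response schema:

```python
import base64
import json

def decode_stream_chunks(lines):
    """Reassemble raw audio bytes from a stream of JSON chunk objects.

    Assumes each line is one JSON object with a base64-encoded `audio`
    field, as /v0/tts/stream/json is described above. The exact field
    name is an assumption; verify it against the API reference.
    """
    audio = bytearray()
    for line in lines:
        if not line.strip():
            continue  # skip keep-alive blank lines, if any
        chunk = json.loads(line)
        audio.extend(base64.b64decode(chunk["audio"]))
    return bytes(audio)

# Simulated stream: two chunks whose audio decodes to b"hello " and b"world".
sample = [
    json.dumps({"audio": base64.b64encode(b"hello ").decode()}),
    json.dumps({"audio": base64.b64encode(b"world").decode()}),
]
print(decode_stream_chunks(sample))  # b'hello world'
```

In a real integration the decoded bytes would be fed to an audio player incrementally rather than concatenated, which is what makes streaming feel fast.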
Bidirectional streaming (WebSocket)
- WebSocket streaming → `/v0/tts/stream/input`: Send text incrementally and receive audio continuously over the same connection.
Non-streaming (HTTP)
- Synchronous JSON → `/v0/tts`: Returns a JSON payload with the entire audio as a base64 string.
- Synchronous file → `/v0/tts/file`: Returns a downloadable audio file such as `audio/mpeg`.
Choosing a response type
- Use streaming for user-facing playback and lower perceived latency.
- Use streamed JSON when you need per-chunk metadata with the audio.
- Use streamed file when your player expects a continuous HTTP audio stream.
- Use WebSocket streaming to send input progressively and receive continuous audio.
- Use non-streaming for simple request–response flows or when you prefer a single completed file.
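As a minimal sketch of the simplest flow—synchronous JSON via `/v0/tts`—the helpers below assemble a request body and decode the audio from a response. The base URL, the `utterances` body shape, and the `generations`/`audio` response fields are assumptions for illustration; consult the API reference for the exact schema:

```python
import base64

# Assumption: base URL shown for illustration; check the API reference.
HUME_API = "https://api.hume.ai"

def build_tts_request(text, voice_name=None):
    """Assemble the URL and JSON body for a synchronous /v0/tts call.

    The `utterances` body shape is an illustrative assumption.
    """
    utterance = {"text": text}
    if voice_name:
        utterance["voice"] = {"name": voice_name}
    return f"{HUME_API}/v0/tts", {"utterances": [utterance]}

def extract_audio(response_json):
    """Decode the base64 audio from a /v0/tts JSON response.

    The `generations`/`audio` field names are illustrative assumptions.
    """
    return base64.b64decode(response_json["generations"][0]["audio"])

# "Ava" is a hypothetical voice name used only for this example.
url, body = build_tts_request("Hello, world.", voice_name="Ava")
# url == "https://api.hume.ai/v0/tts"
```

The body would then be POSTed with your API key, and `extract_audio` applied to the parsed JSON response before writing the bytes to a file or player.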
Ultra low latency streaming: instant mode
Instant mode is a low-latency streaming mode designed for real-time applications where audio playback should begin as quickly as possible. Unlike standard streaming—which introduces a brief lead time before the first audio chunk is sent—instant mode begins streaming audio as soon as generation starts. Instant mode is enabled by default.
How instant mode works
- No lead time is introduced—the server streams audio as soon as it’s available.
- Audio is delivered in smaller sub-snippet chunks (~1 second each).
- First audio is typically ready within ~200ms, depending on system load and input complexity.

Instant mode does not change the format of streamed responses—each chunk includes the same metadata; however, chunks in instant mode are smaller and begin arriving sooner.
Enabling/disabling instant mode
- Use the `instant_mode` field to explicitly enable or disable instant mode.
- Specify a predefined `voice` by `name` or `id`—this is required when using instant mode.
- Set `num_generations` to `1` or omit it.
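The constraints above can be captured in a small helper that builds and checks an instant-mode request body. The `instant_mode`, `voice`, and `num_generations` fields follow this page; the `utterances` shape is an illustrative assumption:

```python
def build_instant_request(text, voice_name):
    """Build a request body that satisfies the instant-mode requirements:
    a predefined voice is specified, and num_generations is omitted
    (equivalent to 1). The utterance shape is an illustrative assumption.
    """
    return {
        "utterances": [{"text": text, "voice": {"name": voice_name}}],
        "instant_mode": True,
        # num_generations omitted: a single generation is produced.
    }

def check_instant_mode(body):
    """Verify the instant-mode constraints described above."""
    if not body.get("instant_mode"):
        return True  # nothing to check when instant mode is off
    has_voice = all("voice" in u for u in body.get("utterances", []))
    return has_voice and body.get("num_generations", 1) == 1
```

For example, a body with `instant_mode` enabled but `num_generations` set to `3`, or with no predefined voice, would fail this check and should be sent with instant mode disabled instead.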
When to disable instant mode
- For voice design workflows—where no predefined voice is specified—disable instant mode to enable dynamic voice generation.
- When generating multiple candidates in a single request (`num_generations > 1`), disable instant mode to support comparative or exploratory generation.
Developer tools
Hume provides a suite of developer tools for integrating TTS.
See our API reference for TTS streaming and non-streaming endpoints.
Open source SDKs for streaming and non-streaming. Stream audio, handle files, and integrate quickly.
A command-line tool that allows direct interaction with Hume’s TTS API, ideal for testing, automation, and rapid prototyping.
Run Hume’s TTS MCP server to expose TTS tools to compatible clients.
Open source examples you can copy, run, and adapt to get started quickly.
API limits
The following limits apply to Hume’s Text-to-Speech API.