Text-to-speech (TTS)

Introduction to Hume's TTS API, including its features, usage limits, and key concepts for integration.

Octave TTS is the first text-to-speech system built on LLM intelligence. Unlike conventional TTS that merely “reads” words, Octave is a “speech-language model” that understands what words mean in context, unlocking a new level of expressiveness and nuance.

At its core, Octave is a state-of-the-art large language model (LLM) that Hume AI trained to understand and synthesize speech. This speech-language model predicts the tune, rhythm, and timbre of speech, knowing when to whisper secrets, shout triumphantly, or calmly explain a fact. This combined approach lets Octave interpret plot twists, emotional cues, and character traits within a script or prompt, then transform that understanding into lifelike speech.

You retain full ownership of any audio content you generate using Octave. For complete details on ownership rights, please see Hume’s Terms of Use.

Features

Key capabilities

Context-aware expression

Because Octave’s LLM recognizes nuanced meanings, it adapts pitch, tempo, and emphasis to match each word’s emotional intent.

Design any voice you can imagine

From describing a “patient, empathetic counselor” to requesting a “dramatic medieval knight,” Octave instantly creates a fitting voice. See Prompting.

Nuanced expression control

Want a sentence spoken in a particular way, with the right emphasis? Octave can emulate any emotions or styles you describe, from “righteous indignation” to “hurried whispering.” See Acting Instructions.

Long-form versatility

Perfect for audiobooks, podcasts, or voiceover work, Octave preserves emotional consistency across chapters or scene changes—even when characters shift from joy to despair.

Quickstart

Accelerate your project setup with our comprehensive quickstart guides, designed to integrate Octave TTS into your TypeScript or Python applications. Each guide walks you through API integration and demonstrates text-to-speech synthesis, helping you get up and running quickly.

Glossary

| Term | Definition |
| --- | --- |
| Utterance | A unit of input for Octave. Contains text, voice, description, speed, and trailing_silence. |
| Generation | The total generated audio output, referenced by generation_id. |
| Snippet | A segment of the total generated audio output, referenced by snippet_id. |
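To make the glossary concrete, here is a sketch of a single utterance as a JSON-style payload. The field names follow the glossary above; the specific values (voice name, speed, silence duration) are illustrative assumptions, not defaults from the API.

```python
import json

# One utterance: the unit of input Octave consumes.
# Field names match the glossary; the values are illustrative.
utterance = {
    "text": "The treasure is buried under the old oak tree.",
    "voice": {"name": "Narrator"},        # assumed predefined voice name
    "description": "hurried whispering",  # style / acting instruction
    "speed": 1.0,                         # 1.0 = normal speaking rate
    "trailing_silence": 0.5,              # seconds of silence after the utterance
}

# A request typically carries a list of utterances.
request_body = {"utterances": [utterance]}
print(json.dumps(request_body, indent=2))
```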

Streaming and non-streaming

The TTS API supports both streaming and non-streaming (synchronous) responses.

Streaming endpoints return audio as it is generated so playback can begin quickly, while non-streaming endpoints return the full result after processing completes.

| Mode | Direction | Endpoints | Typical use cases |
| --- | --- | --- | --- |
| Streaming (HTTP) | Output only | /v0/tts/stream/json, /v0/tts/stream/file | Real-time playback, low perceived latency, pipelines that process chunks. |
| Streaming (WebSocket) | Input & output | /v0/tts/stream/input | Interactive UIs that send text incrementally and receive continuous audio. |
| Non-streaming | Single response | /v0/tts, /v0/tts/file | Simple integrations, saving files, predictable end-to-end timing. |

Unidirectional streaming (HTTP)

  • Streamed JSON: /v0/tts/stream/json
    Emits a sequence of JSON objects, each including base64-encoded audio and metadata.

  • Streamed file: /v0/tts/stream/file
    Sends a continuous stream of raw audio bytes (for example, audio/mpeg).
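As a sketch of consuming the streamed JSON endpoint: each emitted object carries base64-encoded audio plus metadata, so a client decodes the stream chunk by chunk. The chunk below is fabricated for illustration; real responses may carry different or additional fields.

```python
import base64
import json

def decode_chunk(line: str) -> tuple[bytes, dict]:
    """Decode one streamed JSON object into raw audio bytes and its metadata."""
    obj = json.loads(line)
    audio = base64.b64decode(obj["audio"])
    metadata = {k: v for k, v in obj.items() if k != "audio"}
    return audio, metadata

# Fabricated chunk standing in for one object of the stream.
sample = json.dumps({
    "audio": base64.b64encode(b"\xff\xfb\x90\x00").decode(),  # fake audio bytes
    "generation_id": "gen_123",
    "snippet_id": "snip_001",
})
audio, meta = decode_chunk(sample)
```

A player would append each decoded `audio` segment to its buffer as it arrives, which is what enables playback before the full generation completes.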

Bidirectional streaming (WebSocket)

  • WebSocket streaming: /v0/tts/stream/input
    Send text incrementally and receive audio continuously over the same connection.
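A minimal sketch of the bidirectional pattern: the client serializes small text messages as it has them and reads audio messages back over the same socket. The message shape here is a hypothetical illustration, not the exact wire schema; consult the API reference for the real fields.

```python
import json

def text_message(text: str) -> str:
    # Hypothetical shape of an input message sent over the socket.
    return json.dumps({"text": text})

# Text can be sent incrementally, phrase by phrase, while audio
# messages stream back continuously on the same connection.
pending = ["Once upon a time, ", "a knight set out at dawn."]
frames = [text_message(t) for t in pending]
```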

Non-streaming (HTTP)

  • Synchronous JSON: /v0/tts
    Returns a JSON payload with the entire audio as a base64-encoded string.

  • Synchronous file: /v0/tts/file
    Returns a downloadable audio file, such as audio/mpeg.
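For the synchronous JSON endpoint, the whole clip arrives in one response. Here is a sketch of unpacking it and saving the result; the response object below is fabricated, and its field names are assumptions based on the glossary.

```python
import base64
import tempfile
from pathlib import Path

# Fabricated response in the shape suggested by the glossary:
# one generation, whole audio as a single base64 string.
response = {
    "generations": [{
        "generation_id": "gen_123",
        "audio": base64.b64encode(b"RIFF....WAVEdata").decode(),
    }]
}

audio_bytes = base64.b64decode(response["generations"][0]["audio"])
out_path = Path(tempfile.gettempdir()) / "output.wav"
out_path.write_bytes(audio_bytes)  # save the completed clip
```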

Choosing a response type

  • Use streaming for user-facing playback and lower perceived latency.
  • Use streamed JSON when you need per-chunk metadata with the audio.
  • Use streamed file when your player expects a continuous HTTP audio stream.
  • Use WebSocket streaming to send input progressively and receive continuous audio.
  • Use non-streaming for simple request–response flows or when you prefer a single completed file.
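The guidance above can be condensed into a small selection helper. The endpoint paths come from the table earlier in this section; the decision criteria mirror the bullets above, and the function itself is an illustrative sketch, not part of any SDK.

```python
def choose_endpoint(needs_playback: bool,
                    incremental_input: bool = False,
                    per_chunk_metadata: bool = False) -> str:
    """Map the decision criteria above to a TTS endpoint path."""
    if incremental_input:
        return "/v0/tts/stream/input"    # WebSocket: send text progressively
    if needs_playback:
        if per_chunk_metadata:
            return "/v0/tts/stream/json" # streamed JSON with per-chunk metadata
        return "/v0/tts/stream/file"     # continuous HTTP audio stream
    return "/v0/tts"                     # simple request–response flow
```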

Ultra low latency streaming: instant mode

Instant mode is a low-latency streaming mode designed for real-time applications where audio playback should begin as quickly as possible. Unlike standard streaming—which introduces a brief lead time before the first audio chunk is sent—instant mode begins streaming audio as soon as generation starts. Instant mode is enabled by default.

How instant mode works

  • No lead time is introduced—the server streams audio as soon as it’s available.
  • Audio is delivered in smaller sub-snippet chunks (~1 second each).
  • First audio is typically ready within ~200ms, depending on system load and input complexity.

Instant mode does not change the format of streamed responses: each chunk includes the same metadata. However, chunks in instant mode are smaller and begin to arrive more quickly.

Enabling/disabling instant mode

  • Use the instant_mode field to explicitly enable or disable instant mode.
  • Specify a predefined voice by name or id—this is required when using instant mode.
  • Set num_generations to 1 or omit it.
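The constraints above can be checked client-side before a request is sent. This helper is an illustrative sketch assuming a flat request dict, not part of Hume's SDK.

```python
def check_instant_mode(request: dict) -> None:
    """Raise if a request uses instant mode without meeting its requirements."""
    if not request.get("instant_mode", True):  # instant mode is on by default
        return
    voice = request.get("voice") or {}
    if not (voice.get("name") or voice.get("id")):
        raise ValueError("instant mode requires a predefined voice (name or id)")
    if request.get("num_generations", 1) != 1:
        raise ValueError("instant mode requires num_generations == 1 (or omitted)")

check_instant_mode({"voice": {"name": "Narrator"}})                # ok: defaults
check_instant_mode({"instant_mode": False, "num_generations": 3})  # ok: disabled
```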

When to disable instant mode

  • For voice design workflows—where no predefined voice is specified—disable instant mode to enable dynamic voice generation.
  • When generating multiple candidates in a single request (num_generations > 1), disable instant mode to support comparative or exploratory generation.

Developer tools

Hume provides a suite of developer tools for integrating TTS.

API limits

The following limits apply to Hume’s Text-to-Speech API.

| Limit | Value |
| --- | --- |
| Request rate limit (HTTP) | Defined by your subscription tier |
| Maximum text length | 5,000 characters per utterance |
| Maximum description length | 1,000 characters per utterance |
| Maximum generations per request | 5 |
| Supported audio formats | MP3, WAV, PCM |
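The per-utterance and per-request limits can be enforced before a request leaves the client. A sketch, assuming utterances shaped like the glossary entries:

```python
MAX_TEXT = 5000          # characters per utterance
MAX_DESCRIPTION = 1000   # characters per utterance
MAX_GENERATIONS = 5      # candidates per request

def validate_request(utterances: list[dict], num_generations: int = 1) -> list[str]:
    """Return a list of limit violations; an empty list means within limits."""
    problems = []
    if num_generations > MAX_GENERATIONS:
        problems.append(f"num_generations {num_generations} exceeds {MAX_GENERATIONS}")
    for i, utt in enumerate(utterances):
        if len(utt.get("text", "")) > MAX_TEXT:
            problems.append(f"utterance {i}: text exceeds {MAX_TEXT} characters")
        if len(utt.get("description", "")) > MAX_DESCRIPTION:
            problems.append(f"utterance {i}: description exceeds {MAX_DESCRIPTION} characters")
    return problems

assert validate_request([{"text": "Hello"}]) == []
```

Note that the HTTP request rate limit depends on your subscription tier, so it cannot be checked statically like the limits above.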