Text-to-speech (TTS)

Introduction to Hume's TTS API, including its features, usage limits, and key concepts for integration.

Octave TTS is the first text-to-speech system built on LLM intelligence. Unlike conventional TTS that merely “reads” words, Octave is a “speech-language model” that understands what words mean in context, unlocking a new level of expressiveness and nuance.

At its core, Octave is a state-of-the-art large language model (LLM) that Hume AI trained to understand and synthesize speech. This speech-language model predicts the tune, rhythm, and timbre of speech, knowing when to whisper secrets, shout triumphantly, or calmly explain a fact. This combined approach lets Octave interpret plot twists, emotional cues, and character traits within a script or prompt, then transform that understanding into lifelike speech.

You retain full ownership of any audio content you generate using Octave. For complete details on ownership rights, please see Hume’s Terms of Use.

Features

Key capabilities

Context-aware expression

Because Octave’s LLM recognizes nuanced meanings, it adapts pitch, tempo, and emphasis to match each word’s emotional intent.

Design any voice you can imagine

From describing a “patient, empathetic counselor” to requesting a “dramatic medieval knight,” Octave instantly creates a fitting voice. See Prompting.

Nuanced expression control

Want a sentence spoken in a particular way, with the right emphasis? Octave can emulate any emotions or styles you describe, from “righteous indignation” to “hurried whispering.” See Acting Instructions.

Long-form versatility

Perfect for audiobooks, podcasts, or voiceover work, Octave preserves emotional consistency across chapters or scene changes—even when characters shift from joy to despair.

Quickstart

Accelerate your project setup with our comprehensive quickstart guides, designed to integrate Octave TTS into your TypeScript or Python applications. Each guide walks you through API integration and demonstrates text-to-speech synthesis, helping you get up and running quickly.

Glossary

| Term | Definition |
| --- | --- |
| Utterance | A unit of input for Octave. Contains text, voice, description, speed, and trailing_silence. |
| Generation | The total generated audio output, referenced by generation_id. |
| Snippet | A segment of the total generated audio output, referenced by snippet_id. |
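To make the glossary concrete, here is a sketch of a single utterance as a JSON-style payload. The field names follow the glossary above; the specific values (voice name, speed, silence duration) are illustrative assumptions, not defaults from the API.

```python
import json

# One utterance: the unit of input Octave consumes.
# Field names match the glossary; the values are illustrative.
utterance = {
    "text": "The treasure is buried under the old oak tree.",
    "voice": {"name": "Narrator"},        # assumed predefined voice name
    "description": "hurried whispering",  # style / acting instruction
    "speed": 1.0,                         # 1.0 = normal speaking rate
    "trailing_silence": 0.5,              # seconds of silence after the utterance
}

# A request typically carries a list of utterances.
request_body = {"utterances": [utterance]}
print(json.dumps(request_body, indent=2))
```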

Streaming and non-streaming

The TTS API supports both streaming and non-streaming (synchronous) responses.

Streaming endpoints return audio as it is generated so playback can begin quickly, while non-streaming endpoints return the full result after processing completes.

| Mode | Direction | Endpoints | Typical use cases |
| --- | --- | --- | --- |
| Streaming (HTTP) | Output only | /v0/tts/stream/json, /v0/tts/stream/file | Real-time playback, low perceived latency, pipelines that process chunks. |
| Streaming (WebSocket) | Input & output | /v0/tts/stream/input | Interactive UIs that send text incrementally and receive continuous audio. |
| Non-streaming | Single response | /v0/tts, /v0/tts/file | Simple integrations, saving files, predictable end-to-end timing. |

Unidirectional streaming (HTTP)

  • Streamed JSON: /v0/tts/stream/json
    Emits a sequence of JSON objects, each including base64-encoded audio and metadata.

  • Streamed file: /v0/tts/stream/file
    Sends a continuous stream of raw audio bytes (for example, audio/mpeg).
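As a sketch of consuming the streamed JSON endpoint: each emitted object carries base64-encoded audio plus metadata, so a client decodes the stream chunk by chunk. The chunk below is fabricated for illustration; real responses may carry different or additional fields.

```python
import base64
import json

def decode_chunk(line: str) -> tuple[bytes, dict]:
    """Decode one streamed JSON object into raw audio bytes and its metadata."""
    obj = json.loads(line)
    audio = base64.b64decode(obj["audio"])
    metadata = {k: v for k, v in obj.items() if k != "audio"}
    return audio, metadata

# Fabricated chunk standing in for one object of the stream.
sample = json.dumps({
    "audio": base64.b64encode(b"\xff\xfb\x90\x00").decode(),  # fake audio bytes
    "generation_id": "gen_123",
    "snippet_id": "snip_001",
})
audio, meta = decode_chunk(sample)
```

A player would append each decoded `audio` segment to its buffer as it arrives, which is what enables playback before the full generation completes.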

Bidirectional streaming (WebSocket)

  • WebSocket streaming: /v0/tts/stream/input
    Send text incrementally and receive audio continuously over the same connection.
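A minimal sketch of the bidirectional pattern: the client serializes small text messages as it has them and reads audio messages back over the same socket. The message shape here is a hypothetical illustration, not the exact wire schema; consult the API reference for the real fields.

```python
import json

def text_message(text: str) -> str:
    # Hypothetical shape of an input message sent over the socket.
    return json.dumps({"text": text})

# Text can be sent incrementally, phrase by phrase, while audio
# messages stream back continuously on the same connection.
pending = ["Once upon a time, ", "a knight set out at dawn."]
frames = [text_message(t) for t in pending]
```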

Non-streaming (HTTP)

  • Synchronous JSON: /v0/tts
    Returns a JSON payload with the entire audio as a base64-encoded string.

  • Synchronous file: /v0/tts/file
    Returns a downloadable audio file, such as audio/mpeg.
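For the synchronous JSON endpoint, the whole clip arrives in one response. Here is a sketch of unpacking it and saving the result; the response object below is fabricated, and its field names are assumptions based on the glossary.

```python
import base64
import tempfile
from pathlib import Path

# Fabricated response in the shape suggested by the glossary:
# one generation, whole audio as a single base64 string.
response = {
    "generations": [{
        "generation_id": "gen_123",
        "audio": base64.b64encode(b"RIFF....WAVEdata").decode(),
    }]
}

audio_bytes = base64.b64decode(response["generations"][0]["audio"])
out_path = Path(tempfile.gettempdir()) / "output.wav"
out_path.write_bytes(audio_bytes)  # save the completed clip
```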

Choosing a response type

  • Use streaming for user-facing playback and lower perceived latency.
  • Use streamed JSON when you need per-chunk metadata with the audio.
  • Use streamed file when your player expects a continuous HTTP audio stream.
  • Use WebSocket streaming to send input progressively and receive continuous audio.
  • Use non-streaming for simple request–response flows or when you prefer a single completed file.
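The guidance above can be condensed into a small selection helper. The endpoint paths come from the table earlier in this section; the decision criteria mirror the bullets above, and the function itself is an illustrative sketch, not part of any SDK.

```python
def choose_endpoint(needs_playback: bool,
                    incremental_input: bool = False,
                    per_chunk_metadata: bool = False) -> str:
    """Map the decision criteria above to a TTS endpoint path."""
    if incremental_input:
        return "/v0/tts/stream/input"    # WebSocket: send text progressively
    if needs_playback:
        if per_chunk_metadata:
            return "/v0/tts/stream/json" # streamed JSON with per-chunk metadata
        return "/v0/tts/stream/file"     # continuous HTTP audio stream
    return "/v0/tts"                     # simple request–response flow
```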

Ultra low latency streaming: instant mode

Instant mode is a low-latency streaming mode designed for real-time applications where audio playback should begin as quickly as possible. Unlike standard streaming—which introduces a brief lead time before the first audio chunk is sent—instant mode begins streaming audio as soon as generation starts. Instant mode is enabled by default.

How instant mode works

  • No lead time is introduced—the server streams audio as soon as it’s available.
  • Audio is delivered in smaller sub-snippet chunks (~1 second each).
  • First audio is typically ready within ~200ms, depending on system load and input complexity.

Instant mode does not change the format of streamed responses: each chunk includes the same metadata. However, chunks in instant mode are smaller and begin to arrive more quickly.

Enabling/disabling instant mode

  • Use the instant_mode field to explicitly enable or disable instant mode.
  • Specify a predefined voice by name or id—this is required when using instant mode.
  • Set num_generations to 1 or omit it.
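The constraints above can be checked client-side before a request is sent. This helper is an illustrative sketch assuming a flat request dict, not part of Hume's SDK.

```python
def check_instant_mode(request: dict) -> None:
    """Raise if a request uses instant mode without meeting its requirements."""
    if not request.get("instant_mode", True):  # instant mode is on by default
        return
    voice = request.get("voice") or {}
    if not (voice.get("name") or voice.get("id")):
        raise ValueError("instant mode requires a predefined voice (name or id)")
    if request.get("num_generations", 1) != 1:
        raise ValueError("instant mode requires num_generations == 1 (or omitted)")

check_instant_mode({"voice": {"name": "Narrator"}})                # ok: defaults
check_instant_mode({"instant_mode": False, "num_generations": 3})  # ok: disabled
```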

When to disable instant mode

  • For voice design workflows—where no predefined voice is specified—disable instant mode to enable dynamic voice generation.
  • When generating multiple candidates in a single request (num_generations > 1), disable instant mode to support comparative or exploratory generation.

Developer tools

Hume provides a suite of developer tools for integrating TTS.

API limits

The following limits apply to Hume’s Text-to-Speech API.

| Limit | Value |
| --- | --- |
| Request rate limit (HTTP) | Defined by your subscription tier |
| Maximum text length | 5,000 characters per utterance |
| Maximum description length | 1,000 characters per utterance |
| Maximum generations per request | 5 |
| Supported audio formats | MP3, WAV, PCM |
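The per-utterance and per-request limits can be enforced before a request leaves the client. A sketch, assuming utterances shaped like the glossary entries:

```python
MAX_TEXT = 5000          # characters per utterance
MAX_DESCRIPTION = 1000   # characters per utterance
MAX_GENERATIONS = 5      # candidates per request

def validate_request(utterances: list[dict], num_generations: int = 1) -> list[str]:
    """Return a list of limit violations; an empty list means within limits."""
    problems = []
    if num_generations > MAX_GENERATIONS:
        problems.append(f"num_generations {num_generations} exceeds {MAX_GENERATIONS}")
    for i, utt in enumerate(utterances):
        if len(utt.get("text", "")) > MAX_TEXT:
            problems.append(f"utterance {i}: text exceeds {MAX_TEXT} characters")
        if len(utt.get("description", "")) > MAX_DESCRIPTION:
            problems.append(f"utterance {i}: description exceeds {MAX_DESCRIPTION} characters")
    return problems

assert validate_request([{"text": "Hello"}]) == []
```

Note that the HTTP request rate limit depends on your subscription tier, so it cannot be checked statically like the limits above.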