Text-to-speech (TTS)

Introduction to Hume’s TTS API, including its features, usage limits, and key concepts for integration.

OCTAVE TTS is the first text-to-speech system built on LLM intelligence. Unlike conventional TTS that merely “reads” words, OCTAVE is a “speech-language model” that understands what words mean in context, unlocking a new level of expressiveness and nuance.

At its core, OCTAVE is a state-of-the-art large language model (LLM) that Hume AI trained to understand and synthesize speech. This speech-language model predicts the tune, rhythm, and timbre of speech, knowing when to whisper secrets, shout triumphantly, or calmly explain a fact. Combining language understanding with speech synthesis lets OCTAVE interpret plot twists, emotional cues, and character traits within a script or prompt, then transform that understanding into lifelike speech.

Features

Key capabilities

Context-aware expression

Because OCTAVE’s LLM recognizes nuanced meanings, it adapts pitch, tempo, and emphasis to match each word’s emotional intent.

Any voice you can imagine

From describing a “patient, empathetic counselor” to requesting a “dramatic medieval knight,” OCTAVE instantly creates a fitting voice.

Expression control through instruction following

Want a sentence spoken in a particular way with the right emphasis? OCTAVE can emulate any emotion or style you describe, from “righteous indignation” to “hurried whispering.”

Long-form versatility

Well suited to audiobooks, podcasts, and voiceover work, OCTAVE preserves emotional consistency across chapters and scene changes, even when characters shift from joy to despair.

Developer tools

REST API

A RESTful API that enables text-to-speech (TTS) integration with OCTAVE. Use this API to synthesize speech, customize voice parameters, and create and store reusable voice profiles.

TypeScript SDK

A strongly-typed library that streamlines OCTAVE TTS integration in TypeScript and JavaScript applications.

Python SDK

A wrapper for OCTAVE’s TTS services that simplifies voice synthesis in Python applications.

CLI

A command-line tool that allows direct interaction with OCTAVE’s TTS API, ideal for testing, automation, and rapid prototyping. See the CLI quickstart guide.

Open source examples

Example repositories provide a starting point for developers and demonstrate OCTAVE’s capabilities.

Using text-to-speech

The TTS API provides a RESTful interface for generating expressive speech from text. You send text to synthesize along with optional voice specifications and descriptions, and receive audio in your chosen format.

All requests to the API require authentication; see the authentication documentation for the supported strategies. The examples below pass an API key in the X-Hume-Api-Key header.

Basic speech synthesis

At its simplest, you can generate speech by sending text with an optional voice description. The description helps shape the voice’s characteristics and expression:

curl -X POST "https://api.hume.ai/v0/tts" \
  -H "Content-Type: application/json" \
  -H "X-Hume-Api-Key: $HUME_API_KEY" \
  -d '{
    "utterances": [
      {
        "text": "Welcome to my application!",
        "description": "A friendly and professional voice"
      }
    ]
  }'
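
The same request can also be issued from Python with only the standard library. The following is a minimal sketch, assuming the endpoint and header shown in the curl example; the helper names build_tts_request and synthesize are illustrative and are not part of any Hume SDK.

```python
import json
import os
import urllib.request

TTS_URL = "https://api.hume.ai/v0/tts"  # endpoint taken from the curl example


def build_tts_request(text, description=None):
    """Build the JSON body for a single-utterance request (hypothetical helper)."""
    utterance = {"text": text}
    if description is not None:
        utterance["description"] = description
    return {"utterances": [utterance]}


def synthesize(body, api_key=None):
    """POST the body to the TTS endpoint and return the parsed JSON response."""
    api_key = api_key or os.environ["HUME_API_KEY"]
    request = urllib.request.Request(
        TTS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Hume-Api-Key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    body = build_tts_request(
        "Welcome to my application!", "A friendly and professional voice"
    )
    print(json.dumps(body, indent=2))
    # result = synthesize(body)  # uncomment to call the API (needs HUME_API_KEY)
```

Keeping the body construction separate from the network call makes the payload easy to inspect or test without sending a request.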

Using saved voices

If you have saved voices in your library, reference one by name or ID rather than having the model generate a new voice. You can still provide a description to adjust how the voice performs the text:

curl -X POST "https://api.hume.ai/v0/tts" \
  -H "Content-Type: application/json" \
  -H "X-Hume-Api-Key: $HUME_API_KEY" \
  -d '{
    "utterances": [
      {
        "text": "Welcome to my application!",
        "voice": {
          "name": "MyCustomVoice"
        }
      }
    ]
  }'

Advanced options

Speech consistency

For longer content or multiple requests, use the context parameter to maintain consistent speech style. You can provide previous utterances or a generation ID as context:

curl -X POST "https://api.hume.ai/v0/tts" \
  -H "Content-Type: application/json" \
  -H "X-Hume-Api-Key: $HUME_API_KEY" \
  -d '{
    "utterances": [
      {
        "text": "The next chapter begins."
      }
    ],
    "context": {
      "generation_id": "91f7c154-frz2-4617-8edb-e9n1pg6sce64"
    }
  }'
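
For a sequence of requests, one way to keep the style consistent is to feed the generation_id from each response into the context of the next request. The sketch below assumes the response shape shown under "Response format"; with_context, synthesize_in_sequence, and the send callable are hypothetical names introduced for illustration.

```python
import copy


def with_context(body, generation_id):
    """Return a copy of a request body with context.generation_id set."""
    updated = copy.deepcopy(body)
    updated["context"] = {"generation_id": generation_id}
    return updated


def synthesize_in_sequence(texts, send):
    """Send one request per text, passing each generation_id to the next request.

    `send` is any callable that takes a request body and returns the parsed
    JSON response (hypothetical; for example a thin wrapper around an HTTP POST).
    """
    responses = []
    previous_id = None
    for text in texts:
        body = {"utterances": [{"text": text}]}
        if previous_id is not None:
            body = with_context(body, previous_id)
        response = send(body)
        # Use the first generation of this response as context for the next one.
        previous_id = response["generations"][0]["generation_id"]
        responses.append(response)
    return responses
```

Passing send as a parameter keeps the chaining logic independent of how the HTTP request is made.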

Multiple generations

Request up to 5 variations of synthesized speech by setting num_generations. This is useful when you want to explore different interpretations of your voice description:

curl -X POST "https://api.hume.ai/v0/tts" \
  -H "Content-Type: application/json" \
  -H "X-Hume-Api-Key: $HUME_API_KEY" \
  -d '{
    "utterances": [
      {
        "text": "Welcome to my application!"
      }
    ],
    "num_generations": 3
  }'

Response format

The API returns a JSON response containing:

{
  "generations": [
    {
      "audio": "//PExAAspDoWXjDNQgk3HJJJZNbaEPZMmTJk7QIECBA...",
      "duration": 1.589667,
      "encoding": {
        "format": "mp3",
        "sample_rate": 24000
      },
      "file_size": 26496,
      "generation_id": "41f7c154-fbb2-4372-8ecc-e6b7bf6ace01",
      "snippets": [
        {
          "audio": "//PExAAspDoWXjDNQgk3HJJJZNbaEPZMmTJk7QIECBA...",
          "id": "37a108c4-5de7-4507-8a54-0521f5cb0383",
          "text": "Welcome to my application!"
        }
      ]
    }
  ],
  "request_id": "7903e4a7-6642-491a-aa96-c6b359dd1042707439"
}

Each response includes:

  • Base64-encoded audio in your specified format (MP3, WAV, or PCM).
  • A unique generation_id for saving voices or maintaining consistency.
  • Audio metadata including duration, file size, and audio encoding.
  • A list of snippets (snippets): the utterances segmented into natural-sounding units.
  • A request_id for tracking and debugging.
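
Because the audio arrives base64-encoded, it has to be decoded before it can be played or stored. The following sketch writes each generation to disk, assuming the response fields shown in the example above; save_generations and its file-naming scheme are illustrative choices, not part of the API.

```python
import base64
from pathlib import Path


def save_generations(response, out_dir="."):
    """Decode each generation's base64 audio and write it to a file.

    File names combine the generation_id with the reported encoding format
    as the extension (an illustrative naming scheme).
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for generation in response["generations"]:
        audio_bytes = base64.b64decode(generation["audio"])
        extension = generation["encoding"]["format"]
        target = out_path / f"{generation['generation_id']}.{extension}"
        target.write_bytes(audio_bytes)
        written.append(target)
    return written
```

When num_generations is greater than 1, this produces one file per variation, which makes it easy to compare them side by side.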

Quickstart

Accelerate your project setup with our comprehensive quickstart guides, designed to integrate OCTAVE TTS into your TypeScript or Python applications. Each guide walks you through API integration and demonstrates text-to-speech synthesis, helping you get up and running quickly.

API limits

  • Request rate limit: 50 requests per minute
  • Maximum text length: 5,000 characters
  • Maximum description length: 1,000 characters
  • Maximum generations per request: 5
  • Supported audio formats: MP3, WAV, PCM
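
Checking a request against these limits before sending it can avoid a rejected call. The sketch below is one possible client-side check; it assumes the text and description limits apply to each utterance individually, which the list above does not state, and validate_request is a hypothetical helper.

```python
MAX_TEXT_CHARS = 5000         # "Maximum text length" from the limits above
MAX_DESCRIPTION_CHARS = 1000  # "Maximum description length"
MAX_GENERATIONS = 5           # "Maximum generations per request"


def validate_request(body):
    """Return a list of human-readable limit violations (empty if none).

    Assumption: text and description limits are checked per utterance.
    """
    problems = []
    for index, utterance in enumerate(body.get("utterances", [])):
        text_length = len(utterance.get("text", ""))
        if text_length > MAX_TEXT_CHARS:
            problems.append(
                f"utterance {index}: text has {text_length} characters "
                f"(max {MAX_TEXT_CHARS})"
            )
        description_length = len(utterance.get("description", ""))
        if description_length > MAX_DESCRIPTION_CHARS:
            problems.append(
                f"utterance {index}: description has {description_length} "
                f"characters (max {MAX_DESCRIPTION_CHARS})"
            )
    num_generations = body.get("num_generations", 1)
    if not 1 <= num_generations <= MAX_GENERATIONS:
        problems.append(
            f"num_generations is {num_generations} (allowed 1 to {MAX_GENERATIONS})"
        )
    return problems
```

The function returns every violation it finds rather than stopping at the first, so a caller can report them all at once.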
