Text-to-Speech (Streamed JSON)

Streams synthesized speech using the specified voice. If no voice is provided, a novel voice will be generated dynamically. Optionally, additional context can be included to influence the speech's style and prosody. The response is a stream of JSON objects including audio encoded in base64.

Authentication

X-Hume-Api-Key (string)

API Key authentication via header

Request

This endpoint expects an object.

utterances (list of objects, required)

A list of Utterances to be converted to speech output.

An Utterance is a unit of input for Octave, and includes input text, an optional description to serve as the prompt for how the speech should be delivered, an optional voice specification, and additional controls to guide delivery for speed and trailing_silence.
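As a rough illustration of the Utterance shape described above, the sketch below builds one Utterance as a Python dict. The field names (`text`, `description`, `voice`, `speed`, `trailing_silence`) come from the description; the helper name `make_utterance` is hypothetical, and treating `speed` and `trailing_silence` as numbers is an assumption.

```python
from typing import Any, Optional


def make_utterance(
    text: str,
    description: Optional[str] = None,
    voice: Optional[dict[str, Any]] = None,
    speed: Optional[float] = None,             # numeric type is an assumption
    trailing_silence: Optional[float] = None,  # numeric type is an assumption
) -> dict[str, Any]:
    """Build one Utterance dict, leaving out optional fields that were not set."""
    if not text:
        raise ValueError("text is required")
    candidate = {
        "text": text,
        "description": description,
        "voice": voice,
        "speed": speed,
        "trailing_silence": trailing_silence,
    }
    # Drop unset fields so the request only carries what the caller chose.
    return {key: value for key, value in candidate.items() if value is not None}
```

Omitting unset fields, rather than sending them as null, is a design choice of this sketch, not something the reference requires.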

context (object or null, optional)

Utterances to use as context for generating consistent speech style and prosody across multiple requests. These will not be converted to speech output.

format (object, optional)

Specifies the output audio file format.

include_timestamp_types (list of enums, optional)

The set of timestamp types to include in the response. Only supported for Octave 2 requests.

Allowed values:

num_generations (integer, optional, 1-5, defaults to 1)

Number of audio generations to produce from the input utterances.

Using num_generations enables faster processing than issuing multiple sequential requests. Additionally, specifying num_generations allows prosody continuation across all generations without repeating context, ensuring each generation sounds slightly different while maintaining contextual consistency.
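A minimal sketch of a request body that asks for several generations in one call, checking the documented 1-5 range before anything is sent. The helper name `build_body` is hypothetical; only the `utterances` and `num_generations` fields come from this reference.

```python
from typing import Any


def build_body(utterances: list[dict[str, Any]], num_generations: int = 1) -> dict[str, Any]:
    """Return a request body, rejecting num_generations values outside 1-5."""
    if isinstance(num_generations, bool) or not isinstance(num_generations, int):
        raise TypeError("num_generations must be an integer")
    if not 1 <= num_generations <= 5:
        raise ValueError("num_generations must be between 1 and 5")
    if not utterances:
        raise ValueError("at least one utterance is required")
    return {"utterances": list(utterances), "num_generations": num_generations}
```

Validating on the client only saves a round trip; the service remains the authority on what it accepts.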

split_utterances (boolean, optional, defaults to true)

Controls how audio output is segmented in the response.

When enabled (true), input utterances are automatically split into natural-sounding speech segments.
When disabled (false), the response maintains a strict one-to-one mapping between input utterances and output snippets.

This setting affects how the snippets array is structured in the response, which may matter for applications that need to track the relationship between input text and generated audio segments. When setting this to false, avoid utterances with long text, as this can result in distorted output.
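One way to act on the advice about long text is to check utterance lengths before disabling splitting. The sketch below is only an illustration: the reference gives no numeric limit, so `max_chars` is a caller-chosen, hypothetical threshold, and both helper names are made up for this example.

```python
from typing import Any


def long_utterances(utterances: list[dict[str, Any]], max_chars: int) -> list[int]:
    """Return the indexes of utterances whose text is longer than max_chars."""
    return [
        index
        for index, utterance in enumerate(utterances)
        if len(utterance.get("text", "")) > max_chars
    ]


def body_without_splitting(utterances: list[dict[str, Any]], max_chars: int) -> dict[str, Any]:
    """Build a body with split_utterances disabled, refusing overly long text."""
    too_long = long_utterances(utterances, max_chars)
    if too_long:
        raise ValueError(f"utterances at indexes {too_long} exceed {max_chars} characters")
    return {"utterances": list(utterances), "split_utterances": False}
```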

strip_headers (boolean, optional, defaults to false)

If enabled, the audio for all the chunks of a generation, once concatenated, constitutes a single audio file. If disabled, each chunk’s audio is its own audio file with its own headers (if applicable).
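With strip_headers enabled, reassembling a generation could look like the sketch below. It assumes each chunk object carries its base64 payload under an `audio` key and that the chunks passed in belong to one generation and are already in order; neither detail is spelled out in this section.

```python
import base64
from typing import Any, Iterable


def join_audio_chunks(chunks: Iterable[dict[str, Any]]) -> bytes:
    """Decode each chunk's base64 audio and concatenate the bytes in order."""
    # "audio" is an assumed field name for the base64-encoded payload.
    return b"".join(base64.b64decode(chunk["audio"]) for chunk in chunks)
```

The returned bytes can then be written to a single file whose extension matches the requested format.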

version (enum, optional)

Selects the Octave model version used to synthesize speech for this request. If you omit this field, Hume automatically routes the request to the most appropriate model. Setting a specific version ensures stable and repeatable behavior across requests.

Use 2 to opt into the latest Octave capabilities. When you specify version 2, you must also provide a voice. Requests that set version: 2 without a voice will be rejected.

For a comparison of Octave versions, see the Octave versions section in the TTS overview (/docs/text-to-speech-tts/overview#octave-versions).
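The version 2 rule can be checked before sending a request. The sketch below assumes the rule applies to every utterance; the reference does not say whether one voice can cover several utterances, so treat this as a stricter-than-necessary guess. The helper name is hypothetical, and the version value is compared as text because the reference does not state whether the enum is sent as a string or a number.

```python
from typing import Any


def utterances_missing_voice(body: dict[str, Any]) -> list[int]:
    """For version 2 bodies, return indexes of utterances that have no voice."""
    if str(body.get("version")) != "2":
        return []  # the voice requirement is only documented for version 2
    return [
        index
        for index, utterance in enumerate(body.get("utterances", []))
        if not utterance.get("voice")
    ]
```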

Allowed values:

instant_mode (boolean, optional, defaults to true)

Response

Successful Response

OctaveOutputTimestamp (object)

SnippetAudioChunk (object)
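Because the stream can carry both object kinds, a consumer may want to separate them. The sketch below takes already-parsed objects and treats any object with an `audio` key as an audio chunk; that key, and using it as the discriminator, are assumptions, since this section does not list the fields of either object.

```python
from typing import Any, Iterable


def partition_objects(
    objects: Iterable[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Split stream objects into (audio chunks, everything else)."""
    audio_chunks: list[dict[str, Any]] = []
    others: list[dict[str, Any]] = []
    for obj in objects:
        # Assumed rule: only audio chunks carry an "audio" key.
        (audio_chunks if "audio" in obj else others).append(obj)
    return audio_chunks, others
```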

curl -X POST https://api.hume.ai/v0/tts/stream/json \
  -H "X-Hume-Api-Key: <apiKey>" \
  -H "Content-Type: application/json" \
  -d '{
    "utterances": [
      {
        "text": "Beauty is no quality in things themselves: It exists merely in the mind which contemplates them.",
        "voice": {
          "name": "Male English Actor",
          "provider": "HUME_AI"
        }
      }
    ]
  }'
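The same request can be sketched in Python with only the standard library. The URL and headers mirror the curl example; the function names are hypothetical, and `stream_objects` assumes the response delivers one JSON object per line, which this page does not state.

```python
import json
import urllib.request
from typing import Any, Iterator

TTS_STREAM_URL = "https://api.hume.ai/v0/tts/stream/json"


def build_request(api_key: str, body: dict[str, Any]) -> urllib.request.Request:
    """Prepare the POST request without sending it."""
    return urllib.request.Request(
        TTS_STREAM_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"X-Hume-Api-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def stream_objects(request: urllib.request.Request) -> Iterator[dict[str, Any]]:
    """Send the request and yield each JSON object as it arrives."""
    with urllib.request.urlopen(request) as response:
        for raw_line in response:  # assumes one JSON object per line
            line = raw_line.strip()
            if line:
                yield json.loads(line)
```

A caller would pass the result of `build_request` to `stream_objects` and handle each yielded object as it arrives, rather than waiting for the full response.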
