Text-to-speech (Streamed JSON)

Streams synthesized speech using the specified voice. If no voice is provided, a novel voice will be generated dynamically. Optionally, additional context can be included to influence the speech’s style and prosody.

The response is a stream of JSON objects, each containing a chunk of audio encoded in base64.

Headers

X-Hume-Api-Key string Required

Request

This endpoint expects an object.
utterances list of objects Required

A list of Utterances to be converted to speech output.

An Utterance is a unit of input for Octave, and includes input text, an optional description to serve as the prompt for how the speech should be delivered, an optional voice specification, and additional controls to guide delivery for speed and trailing_silence.
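
As a rough illustration of the description above, an utterance could be assembled as a plain dictionary before being sent. This is a minimal sketch, not the official client: the helper name make_utterance is hypothetical, the key names text, description, and voice are assumed from the prose (only speed and trailing_silence are named explicitly), and the shape of the voice object is not specified here, so it is passed through untouched.

```python
from typing import Any, Dict, Optional


def make_utterance(
    text: str,
    description: Optional[str] = None,
    voice: Optional[Dict[str, Any]] = None,  # shape not specified in this section
    speed: Optional[float] = None,
    trailing_silence: Optional[float] = None,
) -> Dict[str, Any]:
    """Build one utterance dict, omitting optional fields that were not set."""
    if not text:
        raise ValueError("text is required")
    utterance: Dict[str, Any] = {"text": text}
    optional = {
        "description": description,
        "voice": voice,
        "speed": speed,
        "trailing_silence": trailing_silence,
    }
    # Leave unset fields out of the payload rather than sending nulls.
    utterance.update({k: v for k, v in optional.items() if v is not None})
    return utterance


# Request body with a single utterance (key names are assumptions, see above).
body = {"utterances": [make_utterance("Hello there.", description="warm, unhurried")]}
```

Omitting unset fields keeps the request body minimal and leaves every optional control at whatever default the service applies.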

context object Optional

Utterances to use as context for generating consistent speech style and prosody across multiple requests. These will not be converted to speech output.

format object Optional

Specifies the output audio file format.

num_generations integer Optional >=1 <=5 Defaults to 1

Number of generations of the audio to produce.

split_utterances boolean Optional Defaults to true

Controls how audio output is segmented in the response.

  • When enabled (true), input utterances are automatically split into natural-sounding speech segments.

  • When disabled (false), the response maintains a strict one-to-one mapping between input utterances and output snippets.

This setting affects how the snippets array is structured in the response, which may be important for applications that need to track the relationship between input text and generated audio segments. When set to false, avoid including utterances with long text, as this can result in distorted output.

instant_mode boolean Optional Defaults to false

Enables ultra-low latency streaming, significantly reducing the time until the first audio chunk is received. Recommended for real-time applications requiring immediate audio playback. For further details, see our documentation on instant mode.

  • Dynamic voice generation is not supported with this mode; a predefined voice must be specified in your request.
  • This mode is only supported for streaming endpoints (e.g., /v0/tts/stream/json, /v0/tts/stream/file).
  • Ensure only a single generation is requested (num_generations must be 1 or omitted).
  • With instant_mode enabled, requests incur a 10% higher cost due to increased compute and resource requirements.
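
The constraints listed above can be checked on the client before a request is sent. The following is a minimal sketch under stated assumptions: the function name instant_mode_problems is hypothetical, the voice is assumed to live under a voice key on each utterance, and flagging every utterance that lacks a voice is a conservative reading of "a predefined voice must be specified in your request".

```python
from typing import Any, Dict, List


def instant_mode_problems(body: Dict[str, Any]) -> List[str]:
    """Return human-readable problems with an instant_mode request body."""
    problems: List[str] = []
    if not body.get("instant_mode", False):
        return problems  # constraints only apply when instant_mode is enabled
    # Only a single generation is allowed; an omitted value defaults to 1.
    if body.get("num_generations", 1) != 1:
        problems.append("num_generations must be 1 or omitted")
    # Dynamic voice generation is not supported, so a voice must be given.
    # Assumption: the voice is specified per utterance under a "voice" key.
    for i, utterance in enumerate(body.get("utterances", [])):
        if not utterance.get("voice"):
            problems.append(f"utterances[{i}] has no voice")
    return problems
```

An empty list means the body passes these local checks; the server remains the authority on whether the request is valid.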

Response

Successful Response

audio string

The generated audio output chunk in the requested format.

chunk_index integer

The index of the audio chunk in the snippet.

generation_id string

The generation ID of the parent snippet that this chunk corresponds to.

is_last_chunk boolean

Whether this is the last chunk streamed back from the decoder for the parent snippet.

snippet_id string

The ID of the parent snippet that this chunk corresponds to.

text string

The text of the parent snippet that this chunk corresponds to.

utterance_index integer Optional

The index of the utterance in the request that the parent snippet of this chunk corresponds to.
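
To show how the response fields fit together, the sketch below decodes streamed chunks and accumulates the audio bytes per generation. It rests on assumptions not stated in this section: that each JSON object arrives on its own line, and that the chunks of a generation arrive in playback order. The function name collect_audio is hypothetical. Whether the concatenated bytes form a playable file depends on the requested format.

```python
import base64
import json
from collections import defaultdict
from typing import Dict, Iterable


def collect_audio(lines: Iterable[str]) -> Dict[str, bytes]:
    """Group decoded audio bytes by generation_id, in arrival order."""
    buffers = defaultdict(bytearray)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank keep-alive lines
        chunk = json.loads(line)
        # The audio field is base64 text; decode it to raw bytes.
        buffers[chunk["generation_id"]] += base64.b64decode(chunk["audio"])
    return {gen_id: bytes(buf) for gen_id, buf in buffers.items()}
```

With num_generations greater than 1, keying on generation_id keeps the audio of different generations separate; snippet_id and utterance_index could be used in the same way to map audio back to the input text.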

Errors