TTS Conversion Stream File

Authentication

X-Hume-Api-Key string
API Key authentication via header

Query parameters

access_token string Optional Defaults to
Access token used for authenticating the client. If not provided, an `api_key` must be provided to authenticate. The access token is generated using both an API key and a Secret key, which provides an additional layer of security compared to using just an API key. For more details, refer to the [Authentication Strategies Guide](/docs/introduction/api-key#authentication-strategies).
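
The two authentication options above can be sketched as a small helper. This is a minimal illustration, not an official client: the function name `build_auth` is hypothetical, and preferring the access token when both credentials are supplied is an assumption, since the reference does not state a precedence.

```python
from typing import Dict, Optional, Tuple


def build_auth(api_key: Optional[str] = None,
               access_token: Optional[str] = None) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (headers, query_params) carrying one credential.

    Assumption: if both credentials are given, the access token wins.
    """
    if access_token:
        # Access tokens travel as the `access_token` query parameter.
        return {}, {"access_token": access_token}
    if api_key:
        # API keys travel in the `X-Hume-Api-Key` header.
        return {"X-Hume-Api-Key": api_key}, {}
    raise ValueError("provide either an access token or an API key")


# Placeholder credential, for illustration only.
headers, params = build_auth(api_key="my-api-key")
```

The returned dictionaries can be passed to whichever HTTP client you use.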

Request

This endpoint expects a multipart form containing an optional file.
context[generation_id] string Optional format: "uuid4"
The ID of a prior TTS generation to use as context for generating consistent speech style and prosody across multiple requests. Including context may increase audio generation times.
context[utterances][<n>][description] string or null Optional <=1000 characters
Natural language instructions describing how the synthesized speech should sound, including but not limited to tone, intonation, pacing, and accent.

**This field behaves differently depending on whether a voice is specified**:
- **Voice specified**: the description will serve as acting directions for delivery. Keep directions concise (100 characters or fewer) for best results. See our guide on [acting instructions](/docs/text-to-speech-tts/acting-instructions).
- **Voice not specified**: the description will serve as a voice prompt for generating a voice. See our [prompting guide](/docs/text-to-speech-tts/prompting) for design tips.

context[utterances][<n>][speed] double Optional 0.25-3
Speed multiplier for the synthesized speech. Extreme values below 0.75 and above 1.5 may sometimes cause instability in the generated output.
context[utterances][<n>][text] string Optional <=5000 characters
The input text to be synthesized into speech.
context[utterances][<n>][trailing_silence] double Optional 0-5

Duration of trailing silence (in seconds) to add to this utterance.

context[utterances][<n>][voice][id] string Optional

The unique ID associated with the Voice.

context[utterances][<n>][voice][name] string Optional

The name of a Voice.

context[utterances][<n>][voice][provider] enum Optional
Specifies the source provider associated with the chosen voice.
- **`HUME_AI`**: Select voices from Hume's [Voice Library](https://app.hume.ai/tts/voice-library), containing a variety of preset, shared voices.
- **`CUSTOM_VOICE`**: Select from voices you've personally generated and saved in your account.

If no provider is explicitly set, the default provider is `CUSTOM_VOICE`. When using voices from Hume's **Voice Library**, you must explicitly set the provider to `HUME_AI`.

Preset voices from Hume's **Voice Library** are accessible by all users. In contrast, your custom voices are private and accessible only via requests authenticated with your API key.
Allowed values: `HUME_AI`, `CUSTOM_VOICE`
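
The provider default described above can be expressed as a short helper. This is a sketch only: `effective_provider` and both example voice names are hypothetical, and the check mirrors the documented default rather than any server-side logic.

```python
from typing import Dict

ALLOWED_PROVIDERS = {"HUME_AI", "CUSTOM_VOICE"}


def effective_provider(voice: Dict[str, str]) -> str:
    """Return the provider that applies to a voice reference."""
    # Documented default: an omitted provider means CUSTOM_VOICE.
    provider = voice.get("provider", "CUSTOM_VOICE")
    if provider not in ALLOWED_PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    return provider


# Hypothetical voice names, for illustration only.
library_voice = {"name": "Example Library Voice", "provider": "HUME_AI"}
saved_voice = {"name": "My Saved Voice"}
```

Because the default is `CUSTOM_VOICE`, a Voice Library voice referenced without `provider` would be looked up among your own saved voices, which is why the explicit `HUME_AI` value matters.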
expand_description boolean Optional
If enabled, enhances the provided description prompt to improve voice generation quality.
filter_generations boolean Optional

If enabled, additional generations will be made, and the best `num_generations` of them will be returned.

format[type] "mp3" Optional
Format for the output audio.
include_timestamp_types[<n>] enum Optional
The set of timestamp types to include in the response. Only supported for Octave 2 requests.
Allowed values:
instant_mode boolean Optional
Enables ultra-low latency streaming, significantly reducing the time until the first audio chunk is received. Recommended for real-time applications requiring immediate audio playback. For further details, see our documentation on [instant mode](/docs/text-to-speech-tts/overview#ultra-low-latency-streaming-instant-mode).
- A [voice](/reference/text-to-speech-tts/synthesize-json-streaming#request.body.utterances.voice) must be specified when instant mode is enabled. Dynamic voice generation is not supported with this mode.
- Instant mode is only supported for streaming endpoints (e.g., [/v0/tts/stream/json](/reference/text-to-speech-tts/synthesize-json-streaming), [/v0/tts/stream/file](/reference/text-to-speech-tts/synthesize-file-streaming)).
- Ensure only a single generation is requested ([num_generations](/reference/text-to-speech-tts/synthesize-json-streaming#request.body.num_generations) must be `1` or omitted).

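
A client-side pre-check for the instant mode constraints might look like the sketch below. The function name `instant_mode_problems` and the dictionary shape of `request` are assumptions for illustration, and the streaming-endpoint rule is not checked here because it depends on the URL rather than the request fields.

```python
from typing import Any, Dict, List


def instant_mode_problems(request: Dict[str, Any]) -> List[str]:
    """List ways a request would conflict with the instant mode rules."""
    problems: List[str] = []
    if not request.get("instant_mode"):
        return problems  # rules only apply when instant mode is enabled
    for i, utterance in enumerate(request.get("utterances", [])):
        voice = utterance.get("voice") or {}
        # A voice must be specified; dynamic voice generation is unsupported.
        if not (voice.get("id") or voice.get("name")):
            problems.append(f"utterances[{i}] has no voice")
    # num_generations must be 1 or omitted.
    if request.get("num_generations", 1) != 1:
        problems.append("num_generations must be 1 or omitted")
    return problems
```

An empty list means none of the checked rules are violated.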
model "octave" Optional
The TTS model to use for speech generations.
multi_speaker boolean Optional
If enabled, consecutive utterances with different voices will be generated with compounding context that takes into account the previous utterances.
no_binary boolean Optional
If enabled, no binary websocket messages will be sent to the client.
num_generations integer Optional 1-5
Number of audio generations to produce from the input utterances. Using `num_generations` enables faster processing than issuing multiple sequential requests. Additionally, specifying `num_generations` allows prosody continuation across all generations without repeating context, ensuring each generation sounds slightly different while maintaining contextual consistency.
split_utterances boolean Optional
Controls how audio output is segmented in the response.
- When **enabled** (`true`), input utterances are automatically split into natural-sounding speech segments.
- When **disabled** (`false`), the response maintains a strict one-to-one mapping between input utterances and output snippets.

This setting affects how the `snippets` array is structured in the response, which may be important for applications that need to track the relationship between input text and generated audio segments. When setting to `false`, avoid including utterances with long `text`, as this can result in distorted output.
strip_headers boolean Optional

If enabled, the audio for all the chunks of a generation, once concatenated together, will constitute a single audio file. Otherwise, if disabled, each chunk’s audio will be its own audio file, each with its own headers (if applicable).
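
The effect of `strip_headers` on client-side handling can be sketched as follows. The helper `assemble_audio` is hypothetical, and it assumes every chunk passed in belongs to a single generation and has already been read as raw bytes.

```python
from typing import Iterable, List


def assemble_audio(chunks: Iterable[bytes], strip_headers: bool) -> List[bytes]:
    """Return the audio files represented by one generation's chunks."""
    chunk_list = list(chunks)
    if strip_headers:
        # Chunks concatenate into a single audio file.
        return [b"".join(chunk_list)]
    # Each chunk is its own audio file, with its own headers if applicable.
    return chunk_list
```

With `strip_headers` enabled the result holds one file to write or play; otherwise each element should be treated as a separate file.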

utterances[<n>][audio] file Optional
Reference audio to use for voice conversion for this utterance.
utterances[<n>][description] string or null Optional <=1000 characters
Natural language instructions describing how the synthesized speech should sound, including but not limited to tone, intonation, pacing, and accent.

**This field behaves differently depending on whether a voice is specified**:
- **Voice specified**: the description will serve as acting directions for delivery. Keep directions concise (100 characters or fewer) for best results. See our guide on [acting instructions](/docs/text-to-speech-tts/acting-instructions).
- **Voice not specified**: the description will serve as a voice prompt for generating a voice. See our [prompting guide](/docs/text-to-speech-tts/prompting) for design tips.

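
The two roles a description can play, and the two length figures mentioned above, can be captured in a small sketch. The names `description_role` and `is_concise_direction` are hypothetical, and treating a voice as "specified" when it has an `id` or a `name` is an assumption.

```python
from typing import Any, Dict, Optional

MAX_DESCRIPTION_CHARS = 1000       # documented limit
RECOMMENDED_DIRECTION_CHARS = 100  # documented guidance when a voice is set


def description_role(utterance: Dict[str, Any]) -> Optional[str]:
    """Say how the description of an utterance would be interpreted."""
    description = utterance.get("description")
    if description is None:
        return None
    if len(description) > MAX_DESCRIPTION_CHARS:
        raise ValueError("description exceeds 1000 characters")
    voice = utterance.get("voice") or {}
    if voice.get("id") or voice.get("name"):
        return "acting_directions"  # voice specified
    return "voice_prompt"           # voice not specified


def is_concise_direction(description: str) -> bool:
    """Check the 100-character guidance for acting directions."""
    return len(description) <= RECOMMENDED_DIRECTION_CHARS
```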
utterances[<n>][speed] double Optional 0.25-3
Speed multiplier for the synthesized speech. Extreme values below 0.75 and above 1.5 may sometimes cause instability in the generated output.
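
The speed limits above split into a hard range and a softer stability band. A minimal sketch of a client-side check, with the hypothetical name `check_speed` and illustrative return strings:

```python
def check_speed(speed: float) -> str:
    """Classify a speed multiplier against the documented limits."""
    if not 0.25 <= speed <= 3.0:
        raise ValueError("speed must be between 0.25 and 3")
    if speed < 0.75 or speed > 1.5:
        # Allowed, but the reference warns of possible instability.
        return "allowed, but may be unstable"
    return "ok"
```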
utterances[<n>][text] string Optional <=5000 characters
The input text to be synthesized into speech.
utterances[<n>][trailing_silence] double Optional 0-5

Duration of trailing silence (in seconds) to add to this utterance.

utterances[<n>][voice][id] string Optional

The unique ID associated with the Voice.

utterances[<n>][voice][name] string Optional

The name of a Voice.

utterances[<n>][voice][provider] enum Optional
Specifies the source provider associated with the chosen voice.
- **`HUME_AI`**: Select voices from Hume's [Voice Library](https://app.hume.ai/tts/voice-library), containing a variety of preset, shared voices.
- **`CUSTOM_VOICE`**: Select from voices you've personally generated and saved in your account.

If no provider is explicitly set, the default provider is `CUSTOM_VOICE`. When using voices from Hume's **Voice Library**, you must explicitly set the provider to `HUME_AI`.

Preset voices from Hume's **Voice Library** are accessible by all users. In contrast, your custom voices are private and accessible only via requests authenticated with your API key.
Allowed values: `HUME_AI`, `CUSTOM_VOICE`
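
The bracketed field names in this section, such as `utterances[<n>][voice][id]`, suggest nested data flattened into multipart form fields. A minimal Python sketch of that flattening follows. The helper name `flatten_form`, the zero-based reading of `<n>`, the lowercase `true`/`false` encoding of booleans, and the example voice name are all assumptions, not documented behavior.

```python
from typing import Any, Dict


def flatten_form(value: Any, prefix: str = "") -> Dict[str, str]:
    """Flatten nested dicts and lists into bracketed form field names."""
    fields: Dict[str, str] = {}
    if isinstance(value, dict):
        for key, item in value.items():
            name = f"{prefix}[{key}]" if prefix else str(key)
            fields.update(flatten_form(item, name))
    elif isinstance(value, list):
        for index, item in enumerate(value):  # assumption: <n> is zero-based
            fields.update(flatten_form(item, f"{prefix}[{index}]"))
    elif isinstance(value, bool):
        # Assumption: booleans are sent as lowercase strings.
        fields[prefix] = "true" if value else "false"
    elif value is not None:
        fields[prefix] = str(value)
    return fields


request = {
    "utterances": [
        {
            "text": "Hello there.",
            "voice": {"name": "My Saved Voice", "provider": "CUSTOM_VOICE"},
        },
    ],
    "strip_headers": True,
}
fields = flatten_form(request)
```

File fields such as `utterances[<n>][audio]` are not covered by this helper and would be attached separately as file parts by your HTTP client.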
version enum Optional
Selects the Octave model version used to synthesize speech for this request. If you omit this field, Hume automatically routes the request to the most appropriate model. Setting a specific version ensures stable and repeatable behavior across requests. Use `2` to opt into the latest Octave capabilities. When you specify version `2`, you must also provide a `voice`. Requests that set `version: 2` without a voice will be rejected. For a comparison of Octave versions, see the [Octave versions](/docs/text-to-speech-tts/overview#octave-versions) section in the TTS overview.
Allowed values:
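
The rule that version `2` requires a voice can be checked before sending a request. The sketch below is illustrative only: `version_2_problems` is a hypothetical name, and applying the requirement to every utterance is an assumption, since the reference does not say at which level the voice is checked.

```python
from typing import Any, Dict, List


def version_2_problems(request: Dict[str, Any]) -> List[str]:
    """List utterances that lack a voice when version 2 is requested."""
    if str(request.get("version", "")) != "2":
        return []  # version omitted or not 2: this rule does not apply
    problems: List[str] = []
    for i, utterance in enumerate(request.get("utterances", [])):
        voice = utterance.get("voice") or {}
        if not (voice.get("id") or voice.get("name")):
            problems.append(f"utterances[{i}] needs a voice when version is 2")
    return problems
```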

Response

Successful Response

Errors