TTS Conversion Stream File

Authentication

X-Hume-Api-Key string

API Key authentication via header

Query parameters

access_token string Optional

Access token used for authenticating the client. If not provided, an api_key must be provided to authenticate.

The access token is generated using both an API key and a Secret key, which provides an additional layer of security compared to using just an API key.

For more details, refer to the Authentication Strategies Guide.
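The choice between the two credentials can be sketched as follows. This is a minimal illustration, not Hume's client code: the helper name `build_auth` is hypothetical, and preferring the access token when both credentials are present is an assumption rather than documented behavior.

```python
from typing import Dict, Optional, Tuple


def build_auth(
    api_key: Optional[str] = None,
    access_token: Optional[str] = None,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (headers, query_params) carrying exactly one credential."""
    if access_token:
        # Assumption: when both are supplied, the access token wins.
        return {}, {"access_token": access_token}
    if api_key:
        # The API key travels in the X-Hume-Api-Key header.
        return {"X-Hume-Api-Key": api_key}, {}
    raise ValueError("provide either an access token or an API key")
```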

Request

This endpoint expects a multipart form containing an optional file.

context[generation_id] string Optional format: "uuid4"

The ID of a prior TTS generation to use as context for generating consistent speech style and prosody across multiple requests. Including context may increase audio generation times.
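As a rough sketch, a client might validate the ID locally before placing it in the form. The helper name `context_fields` is hypothetical, and the local version check only mirrors the "uuid4" format noted above; the server performs its own validation.

```python
import uuid
from typing import List, Tuple


def context_fields(generation_id: str) -> List[Tuple[str, str]]:
    """Build the form field that points at a prior generation."""
    parsed = uuid.UUID(generation_id)  # raises ValueError if malformed
    if parsed.version != 4:
        raise ValueError("generation_id must be a version 4 UUID")
    return [("context[generation_id]", str(parsed))]
```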

context[utterances][<n>][description] string or null Optional <=1000 characters

Natural language instructions describing how the synthesized speech should sound, including but not limited to tone, intonation, pacing, and accent.

This field behaves differently depending on whether a voice is specified:

Voice specified: the description will serve as acting directions for delivery. Keep directions concise—100 characters or fewer—for best results. See our guide on acting instructions.
Voice not specified: the description will serve as a voice prompt for generating a voice. See our prompting guide for design tips.

context[utterances][<n>][speed] double Optional 0.25-3

Speed multiplier for the synthesized speech. Values below 0.75 or above 1.5 may sometimes cause instability in the generated output.

context[utterances][<n>][text] string Optional <=5000 characters

The input text to be synthesized into speech.

context[utterances][<n>][trailing_silence] double Optional 0-5

Duration of trailing silence (in seconds) to add to this utterance.

context[utterances][<n>][voice][id] string Optional

The unique ID associated with the Voice.

context[utterances][<n>][voice][name] string Optional

The name of a Voice.

context[utterances][<n>][voice][provider] enum Optional

Specifies the source provider associated with the chosen voice.

HUME_AI: Select voices from Hume’s Voice Library, containing a variety of preset, shared voices.
CUSTOM_VOICE: Select from voices you’ve personally generated and saved in your account.

If no provider is explicitly set, the default provider is CUSTOM_VOICE. When using voices from Hume’s Voice Library, you must explicitly set the provider to HUME_AI.

Preset voices from Hume’s Voice Library are accessible by all users. In contrast, your custom voices are private and accessible only via requests authenticated with your API key.

Allowed values: HUME_AI, CUSTOM_VOICE

expand_description boolean Optional

If enabled, enhances the provided description prompt to improve voice generation quality.

filter_generations boolean Optional

If enabled, additional generations are produced and only the best num_generations of them are returned.

format[type] "mp3" Optional

Format for the output audio.

include_timestamp_types[<n>] enum Optional

The set of timestamp types to include in the response. Only supported for Octave 2 requests.

Allowed values:

instant_mode boolean Optional

Enables ultra-low latency streaming, significantly reducing the time until the first audio chunk is received. Recommended for real-time applications requiring immediate audio playback. For further details, see our documentation on instant mode.

A voice must be specified when instant mode is enabled. Dynamic voice generation is not supported with this mode.
Instant mode is only supported for streaming endpoints (e.g., /v0/tts/stream/json, /v0/tts/stream/file).
Ensure only a single generation is requested (num_generations must be 1 or omitted).
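The voice and num_generations constraints above can be checked on the client before a request is sent. The sketch below assumes the request is held as a plain dictionary that mirrors the form field names. The function name `instant_mode_problems` is hypothetical, and requiring a voice on every utterance is one reading of the voice rule.

```python
from typing import Any, Dict, List


def instant_mode_problems(request: Dict[str, Any]) -> List[str]:
    """List the documented instant-mode constraints that a request violates."""
    if not request.get("instant_mode"):
        return []
    problems: List[str] = []
    utterances = request.get("utterances") or []
    # Assumption: each utterance must name a voice by id or by name.
    for n, utterance in enumerate(utterances):
        voice = utterance.get("voice") or {}
        if not (voice.get("id") or voice.get("name")):
            problems.append(f"utterances[{n}] has no voice")
    # num_generations must be 1 or omitted.
    if request.get("num_generations", 1) != 1:
        problems.append("num_generations must be 1 or omitted")
    return problems
```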

model "octave" Optional

The TTS model to use for speech generations.

multi_speaker boolean Optional

If enabled, consecutive utterances with different voices will be generated with compounding context that takes into account the previous utterances.

no_binary boolean Optional

If enabled, no binary WebSocket messages will be sent to the client.

num_generations integer Optional 1-5

Number of audio generations to produce from the input utterances.

Using num_generations enables faster processing than issuing multiple sequential requests. Additionally, specifying num_generations allows prosody continuation across all generations without repeating context, ensuring each generation sounds slightly different while maintaining contextual consistency.

split_utterances boolean Optional

Controls how audio output is segmented in the response.

When enabled (true), input utterances are automatically split into natural-sounding speech segments.
When disabled (false), the response maintains a strict one-to-one mapping between input utterances and output snippets.

This setting affects how the snippets array is structured in the response, which may be important for applications that need to track the relationship between input text and generated audio segments. When setting to false, avoid including utterances with long text, as this can result in distorted output.

strip_headers boolean Optional

If enabled, the audio for all the chunks of a generation, once concatenated together, will constitute a single audio file. Otherwise, if disabled, each chunk’s audio will be its own audio file, each with its own headers (if applicable).
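The practical difference shows up when saving audio. The sketch below assumes the caller has already obtained each chunk's audio as bytes. How chunks are delimited in the response is not covered here, and both helper names are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List


def save_as_single_file(chunks: Iterable[bytes], path: Path) -> int:
    """strip_headers enabled: concatenated chunks form one audio file."""
    total = 0
    with path.open("wb") as out:
        for chunk in chunks:
            out.write(chunk)
            total += len(chunk)
    return total  # number of bytes written


def save_as_separate_files(chunks: Iterable[bytes], directory: Path) -> List[Path]:
    """strip_headers disabled: each chunk is its own audio file."""
    paths: List[Path] = []
    for i, chunk in enumerate(chunks):
        target = directory / f"chunk_{i:04d}.mp3"
        target.write_bytes(chunk)
        paths.append(target)
    return paths
```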

utterances[<n>][audio] file Optional

Reference audio to use for voice conversion for this utterance.

utterances[<n>][description] string or null Optional <=1000 characters

Natural language instructions describing how the synthesized speech should sound, including but not limited to tone, intonation, pacing, and accent.

This field behaves differently depending on whether a voice is specified:

Voice specified: the description will serve as acting directions for delivery. Keep directions concise—100 characters or fewer—for best results. See our guide on acting instructions.
Voice not specified: the description will serve as a voice prompt for generating a voice. See our prompting guide for design tips.

utterances[<n>][speed] double Optional 0.25-3

Speed multiplier for the synthesized speech. Values below 0.75 or above 1.5 may sometimes cause instability in the generated output.

utterances[<n>][text] string Optional <=5000 characters

The input text to be synthesized into speech.

utterances[<n>][trailing_silence] double Optional 0-5

Duration of trailing silence (in seconds) to add to this utterance.
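The documented limits for description, speed, text, and trailing_silence lend themselves to a client-side check. This is a minimal sketch under the assumption that an utterance is held as a dictionary keyed by the field names above. The function name `utterance_problems` is hypothetical, and the server remains the authority on what is accepted.

```python
from typing import Any, Dict, List


def utterance_problems(utterance: Dict[str, Any]) -> List[str]:
    """Compare one utterance against the limits listed in this reference."""
    problems: List[str] = []
    description = utterance.get("description")
    if description is not None and len(description) > 1000:
        problems.append("description exceeds 1000 characters")
    text = utterance.get("text")
    if text is not None and len(text) > 5000:
        problems.append("text exceeds 5000 characters")
    speed = utterance.get("speed")
    if speed is not None:
        if not 0.25 <= speed <= 3:
            problems.append("speed must be between 0.25 and 3")
        elif not 0.75 <= speed <= 1.5:
            # Inside the allowed range, but flagged as potentially unstable.
            problems.append("speed outside 0.75-1.5 may produce unstable output")
    silence = utterance.get("trailing_silence")
    if silence is not None and not 0 <= silence <= 5:
        problems.append("trailing_silence must be between 0 and 5 seconds")
    return problems
```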

utterances[<n>][voice][id] string Optional

The unique ID associated with the Voice.

utterances[<n>][voice][name] string Optional

The name of a Voice.

utterances[<n>][voice][provider] enum Optional

Specifies the source provider associated with the chosen voice.

HUME_AI: Select voices from Hume’s Voice Library, containing a variety of preset, shared voices.
CUSTOM_VOICE: Select from voices you’ve personally generated and saved in your account.

If no provider is explicitly set, the default provider is CUSTOM_VOICE. When using voices from Hume’s Voice Library, you must explicitly set the provider to HUME_AI.

Preset voices from Hume’s Voice Library are accessible by all users. In contrast, your custom voices are private and accessible only via requests authenticated with your API key.

Allowed values: HUME_AI, CUSTOM_VOICE
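The bracketed field names can be generated from nested data rather than typed by hand. The sketch below makes several assumptions: `<n>` is treated as a zero-based index, numbers are sent as their plain string form, `flatten_utterances` is a hypothetical helper, and "Narrator" is a placeholder voice name. Because the default provider is CUSTOM_VOICE, the example sets HUME_AI explicitly.

```python
from typing import Any, Dict, List, Tuple


def flatten_utterances(
    utterances: List[Dict[str, Any]],
    prefix: str = "utterances",
) -> List[Tuple[str, str]]:
    """Turn nested utterance dictionaries into (field name, value) pairs."""
    fields: List[Tuple[str, str]] = []
    for n, utterance in enumerate(utterances):
        for key in ("text", "description", "speed", "trailing_silence"):
            if utterance.get(key) is not None:
                fields.append((f"{prefix}[{n}][{key}]", str(utterance[key])))
        voice = utterance.get("voice") or {}
        for key in ("id", "name", "provider"):
            if voice.get(key) is not None:
                fields.append((f"{prefix}[{n}][voice][{key}]", str(voice[key])))
    return fields


# "Narrator" is a placeholder name, not a known Voice Library entry.
example = flatten_utterances(
    [{"text": "Hello", "speed": 1.2,
      "voice": {"name": "Narrator", "provider": "HUME_AI"}}]
)
```

Passing `prefix="context[utterances]"` produces the corresponding context field names.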

version enum Optional

Selects the Octave model version used to synthesize speech for this request. If you omit this field, Hume automatically routes the request to the most appropriate model. Setting a specific version ensures stable and repeatable behavior across requests.

Use 2 to opt into the latest Octave capabilities. When you specify version 2, you must also provide a voice. Requests that set version: 2 without a voice will be rejected.

For a comparison of Octave versions, see the Octave versions section in the TTS overview.

Allowed values:

Response

Successful Response

curl -X POST https://api.hume.ai/v0/tts/stream/file/multipart \
  -H "X-Hume-Api-Key: <apiKey>" \
  -H "Content-Type: multipart/form-data" \
  -F "utterances[<n>][audio]=@<file1>"
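A Python counterpart to the curl request above might look like the following sketch, which uses only the standard library and builds the request without sending it. The helpers `encode_multipart` and `build_request` are hypothetical. Using index 0 for `<n>` and `application/octet-stream` as the part type for the audio are assumptions.

```python
import uuid
from typing import Iterable, List, Tuple
from urllib.request import Request

URL = "https://api.hume.ai/v0/tts/stream/file/multipart"


def encode_multipart(
    fields: Iterable[Tuple[str, str]],
    files: Iterable[Tuple[str, str, bytes]],
) -> Tuple[bytes, str]:
    """Encode text fields and (name, filename, data) files as multipart/form-data."""
    boundary = uuid.uuid4().hex
    lines: List[bytes] = []
    for name, value in fields:
        lines += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            value.encode("utf-8"),
        ]
    for name, filename, data in files:
        lines += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"; '
            f'filename="{filename}"'.encode(),
            b"Content-Type: application/octet-stream",  # assumed part type
            b"",
            data,
        ]
    lines += [f"--{boundary}--".encode(), b""]
    return b"\r\n".join(lines), f"multipart/form-data; boundary={boundary}"


def build_request(api_key: str, audio_name: str, audio: bytes) -> Request:
    """Prepare (but do not send) a POST with one reference audio file."""
    body, content_type = encode_multipart(
        fields=[],
        files=[("utterances[0][audio]", audio_name, audio)],  # index 0 assumed
    )
    headers = {"X-Hume-Api-Key": api_key, "Content-Type": content_type}
    return Request(URL, data=body, headers=headers, method="POST")
```

The prepared request can be passed to `urllib.request.urlopen`, and the response read in chunks to receive the streamed audio.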
