Text-to-Speech (TTS)

POST https://api.hume.ai/v0/tts/stream/file
```python
from hume import HumeClient
from hume.tts import PostedUtterance, PostedUtteranceVoiceWithName

client = HumeClient(
    api_key="YOUR_API_KEY",
)
client.tts.synthesize_file_streaming(
    utterances=[
        PostedUtterance(
            text="Beauty is no quality in things themselves: It exists merely in the mind which contemplates them.",
            voice=PostedUtteranceVoiceWithName(
                name="Male English Actor",
                provider="HUME_AI",
            ),
        )
    ],
)
```
Streams synthesized speech using the specified voice. If no voice is provided, a novel voice will be generated dynamically. Optionally, additional context can be included to influence the speech's style and prosody.

Authentication

X-Hume-Api-Key (string)
API Key authentication via header

Request

This endpoint expects an object.
utterances (list of objects, Required)
A list of **Utterances** to be converted to speech output. An **Utterance** is a unit of input for [Octave](/docs/text-to-speech-tts/overview), and includes input `text`, an optional `description` to serve as the prompt for how the speech should be delivered, an optional `voice` specification, and additional controls to guide delivery for `speed` and `trailing_silence`.
context (object or null, Optional)
Utterances to use as context for generating consistent speech style and prosody across multiple requests. These will not be converted to speech output.
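For illustration, a raw request body carrying prior utterances as context might look like the sketch below. It assumes the context object wraps an `utterances` list (mirroring the SDK's `PostedContextWithUtterances`); the texts and voice name are placeholders.

```python
# Sketch of a raw request body passing prior utterances as context.
# Assumption: the context object wraps an "utterances" list; only the
# top-level "utterances" are synthesized, the context ones are not.
body = {
    "utterances": [
        {
            "text": "And yet the mind lingers on that thought.",
            "voice": {"name": "Male English Actor", "provider": "HUME_AI"},
        }
    ],
    "context": {
        "utterances": [
            {
                "text": "Beauty is no quality in things themselves.",
                "description": "calm, reflective",
            }
        ]
    },
}
```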
format (object, Optional)
Specifies the output audio file format.
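As a hedged illustration, the format object selects the container for the streamed audio. The `"mp3"` type below is an assumed example value, not confirmed by this page; consult the format schema for the supported set.

```python
# Illustrative request body selecting an output format.
# Assumption: format is an object with a "type" field; "mp3" is an example.
body = {
    "utterances": [
        {"text": "Hello.", "voice": {"name": "Male English Actor", "provider": "HUME_AI"}}
    ],
    "format": {"type": "mp3"},
}
```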
include_timestamp_types (list of enums, Optional)
The set of timestamp types to include in the response.
Allowed values:
num_generations (integer, Optional, >=1 and <=5, defaults to 1)
Number of generations of the audio to produce.
split_utterances (boolean, Optional, defaults to true)
Controls how audio output is segmented in the response.

  • When **enabled** (`true`), input utterances are automatically split into natural-sounding speech segments.
  • When **disabled** (`false`), the response maintains a strict one-to-one mapping between input utterances and output snippets.

This setting affects how the `snippets` array is structured in the response, which may be important for applications that need to track the relationship between input text and generated audio segments. When setting to `false`, avoid including utterances with long `text`, as this can result in distorted output.
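The one-to-one mapping with splitting disabled can be sketched as a raw request body; the field names follow this reference, and the snippet-count expectation is the property the flag guarantees.

```python
# Request body with utterance splitting disabled: each input utterance
# should correspond to exactly one output snippet, so keep texts short.
body = {
    "utterances": [
        {"text": "First short line.", "voice": {"name": "Male English Actor", "provider": "HUME_AI"}},
        {"text": "Second short line.", "voice": {"name": "Male English Actor", "provider": "HUME_AI"}},
    ],
    "split_utterances": False,
}
# With split_utterances=False, expect one snippet per input utterance.
expected_snippet_count = len(body["utterances"])
```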
strip_headers (boolean, Optional, defaults to false)
If enabled, the audio for all the chunks of a generation, once concatenated together, will constitute a single audio file. Otherwise, if disabled, each chunk's audio will be its own audio file, each with its own headers (if applicable).
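A minimal consumer-side sketch, assuming the stream yields raw byte chunks: with `strip_headers` enabled, the chunks of one generation concatenate byte-for-byte into a single file, so writing them in order is sufficient.

```python
import io

# With strip_headers=True, the chunks of a generation are header-free and
# can simply be appended to one output to form a single audio file.
def write_generation(chunks, out):
    for chunk in chunks:  # chunks: iterable of bytes from the stream
        out.write(chunk)

buf = io.BytesIO()
write_generation([b"\x00\x01", b"\x02\x03"], buf)
```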
version (enum, Optional)
The version of the Octave Model to use. 1 for the legacy model, 2 for the new model.
Allowed values: 1, 2
instant_mode (boolean, Optional, defaults to true)
Enables ultra-low latency streaming, significantly reducing the time until the first audio chunk is received. Recommended for real-time applications requiring immediate audio playback. For further details, see our documentation on [instant mode](/docs/text-to-speech-tts/overview#ultra-low-latency-streaming-instant-mode).

  • A [voice](/reference/text-to-speech-tts/synthesize-json-streaming#request.body.utterances.voice) must be specified when instant mode is enabled. Dynamic voice generation is not supported with this mode.
  • Instant mode is only supported for streaming endpoints (e.g., [/v0/tts/stream/json](/reference/text-to-speech-tts/synthesize-json-streaming), [/v0/tts/stream/file](/reference/text-to-speech-tts/synthesize-file-streaming)).
  • Ensure only a single generation is requested ([num_generations](/reference/text-to-speech-tts/synthesize-json-streaming#request.body.num_generations) must be `1` or omitted).
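The constraints above can be mirrored client-side before sending a request. The validator below is a hypothetical helper, not part of the API or SDK.

```python
# Hypothetical pre-flight check for the documented instant-mode constraints:
# an explicit voice on every utterance and at most a single generation.
def validate_instant_mode(body):
    if not body.get("instant_mode", True):
        return True  # constraints only apply when instant mode is on
    if body.get("num_generations", 1) != 1:
        return False
    return all("voice" in u for u in body["utterances"])

body = {
    "utterances": [
        {"text": "Hello.", "voice": {"name": "Male English Actor", "provider": "HUME_AI"}}
    ],
    "instant_mode": True,
}
```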

Response

OK

Errors
