Text-to-speech (Streamed JSON)
Streams synthesized speech using the specified voice. If no voice is provided, a novel voice will be generated dynamically. Optionally, additional context can be included to influence the speech’s style and prosody.
The response is a stream of JSON objects, each containing audio encoded as a base64 string.
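As a rough illustration of consuming such a stream, the sketch below reads text chunks, yields each complete JSON object, and decodes its base64 audio. The key name `audio` and the helper names are assumptions made for this example, not confirmed by this page. The framing of the stream (for example, newline-delimited objects) is also not stated here, so the parser tolerates objects that are split or joined across chunks.

```python
import base64
import json
from typing import Iterable, Iterator


def iter_json_objects(chunks: Iterable[str]) -> Iterator[dict]:
    """Yield JSON objects from text chunks that may split or join objects."""
    decoder = json.JSONDecoder()
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            buffer = buffer.lstrip()
            if not buffer:
                break
            try:
                obj, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                break  # object is incomplete; wait for the next chunk
            yield obj
            buffer = buffer[end:]


def decode_audio(obj: dict, field: str = "audio") -> bytes:
    """Decode the base64 payload. `field` is an assumed key name."""
    return base64.b64decode(obj[field])
```

In practice the chunks would come from whatever HTTP client reads the response body; the parsing logic stays the same.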
Headers
Request
Utterances to be converted to speech output.
Utterances to use as context for generating consistent speech style and prosody across multiple requests. These will not be converted to speech output.
Specifies the output audio file format.
Number of generations of the audio to produce.
Controls how audio output is segmented in the response.
- When enabled (true), input utterances are automatically split into natural-sounding speech segments.
- When disabled (false), the response maintains a strict one-to-one mapping between input utterances and output snippets.
This setting affects how the snippets array is structured in the response, which may be important for applications that need to track the relationship between input text and generated audio segments. When setting this to false, avoid including utterances with long text, as this can result in distorted output.
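One way to follow the advice about long text when splitting is disabled is to break the text into shorter utterances on the client side before sending the request. The sketch below is a minimal example of that idea. The function name `split_long_text` and the 200-character default are hypothetical, since this page does not state a length limit.

```python
import re
from typing import List


def split_long_text(text: str, max_chars: int = 200) -> List[str]:
    """Pack whole sentences into pieces of at most `max_chars` characters.

    `max_chars` is an illustrative threshold, not a documented limit.
    A single sentence longer than `max_chars` is kept whole.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pieces: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            pieces.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces
```

Each returned piece can then be sent as its own utterance, so the one-to-one mapping between utterances and snippets still holds for the shorter pieces.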
Response
Successful Response
The segmented audio output in the requested format, encoded as a base64 string.
The generation ID this snippet corresponds to.
A unique ID associated with this snippet.
The text for this snippet.
The index of the utterance in the request this snippet corresponds to.
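To show how the generation ID and utterance index can tie audio back to the request, the sketch below groups decoded snippets by those two values. The key names `generation_id`, `utterance_index`, and `audio` are assumptions for illustration, as this page lists only the field descriptions. The audio is kept as a list of byte strings per group rather than concatenated, because joining encoded audio byte-for-byte may not yield a valid file in every format.

```python
import base64
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_snippets(
    snippets: Iterable[dict],
) -> Dict[Tuple[str, int], List[bytes]]:
    """Group decoded audio by (generation ID, utterance index).

    All key names used here are assumed, not taken from this page.
    Snippets keep their arrival order within each group.
    """
    grouped: Dict[Tuple[str, int], List[bytes]] = defaultdict(list)
    for snippet in snippets:
        key = (snippet["generation_id"], snippet["utterance_index"])
        grouped[key].append(base64.b64decode(snippet["audio"]))
    return dict(grouped)
```

With splitting disabled, each group would be expected to hold a single snippet; with splitting enabled, one utterance may map to several.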