Octave Continuation Guide
Guide to maintaining coherent speech across multiple utterances and generations.
A key feature that sets Octave apart from traditional TTS systems is that the model understands what it’s saying. This linguistic comprehension powers Octave’s sophisticated continuation capabilities, allowing for coherent audio content across multiple segments. This guide explains how to use Octave’s continuation features to create natural-sounding audio content that spans multiple utterances while preserving flow and emotional continuity.
What is continuation?
With Octave, continuation refers to maintaining contextual awareness between utterances to generate coherent, natural-sounding speech. An utterance is a unit of input that includes text to be synthesized and an optional description of how the speech should sound.
Continuation allows you to use previous utterances as context to inform new speech generation. You can implement continuation in two ways:
- Multiple utterances in a single request: When you provide multiple utterances in one request, Octave automatically uses each utterance as context for the next one in the chain.
- Using previous context: You can provide context as either:
  - A context utterance
  - A generation ID from a previous output
When using continuation, Octave will generate speech for all utterances in the `utterances` array of your request. However, any utterances provided within the `context` parameter serve only as reference and will not produce additional audio. This distinction allows you to build upon previous speech without duplicating audio output.
The main reason to split your text input into multiple utterances, rather than providing all the text in a single utterance, is to give distinct acting instructions to different parts of the text.
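The two ways of implementing continuation can be sketched as request bodies. This is a minimal illustration rather than a definitive client: the `utterances` array and `context` parameter come from this guide, while the field names `text`, `description`, and `generation_id`, the helper `build_request`, and all sample sentences are assumptions made for the example. No network call is made; the code only assembles and prints JSON.

```python
import json
from typing import Any, Dict, List, Optional

Utterance = Dict[str, Any]


def build_request(utterances: List[Utterance],
                  context: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Assemble a request body (hypothetical helper, not part of any SDK)."""
    body: Dict[str, Any] = {"utterances": utterances}  # these produce audio
    if context is not None:
        body["context"] = context  # reference only, no additional audio
    return body


# Way 1: several utterances in one request, each with its own acting
# instructions. Each utterance serves as context for the next one.
chained = build_request([
    {"text": "The door creaked open.", "description": "hushed, tense whisper"},
    {"text": "Surprise! Happy birthday!", "description": "loud, joyful shout"},
])

# Way 2a: a context utterance supplied as reference for the new utterance.
from_context_utterance = build_request(
    [{"text": "And that is how the story ends."}],
    context={"utterances": [{"text": "Let me tell you how it all began.",
                             "description": "warm, unhurried storyteller"}]},
)

# Way 2b: a generation ID from previous output (placeholder, not a real ID).
from_generation_id = build_request(
    [{"text": "And that is how the story ends."}],
    context={"generation_id": "<generation-id-from-previous-output>"},
)

print(json.dumps(from_context_utterance, indent=2))
```

Keeping the context under its own key mirrors the distinction described above: only entries in the `utterances` array are synthesized, so reusing earlier speech as context does not duplicate audio.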
Key aspects of continuation
Narrative coherence
When creating longer audio that exceeds a single utterance (such as audiobooks or educational materials), continuation ensures your audience experiences a cohesive narrative without awkward shifts in delivery, pacing, or emotional tone. The speech maintains appropriate energy levels and emotional progression, resulting in a more authentic listening experience where each new segment builds naturally from what came before.
Try these examples to experience how Octave maintains narrative coherence, delivering the same phrase with completely different emotional tones based on the context of the preceding utterance:
With positive context (excited interpretation)
With negative context (disappointed interpretation)
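As a rough sketch of the two cases named above, the same phrase can be paired with different context utterances. The sentences are invented for illustration and are not the guide's original samples; the field name `text` and the helper `phrase_after` are likewise assumptions.

```python
from typing import Any, Dict

PHRASE = "I can't believe this is happening."


def phrase_after(context_text: str) -> Dict[str, Any]:
    """Request body that synthesizes PHRASE with `context_text` as context."""
    return {
        "utterances": [{"text": PHRASE}],  # the only part that yields audio
        "context": {"utterances": [{"text": context_text}]},  # reference only
    }


# Positive context, intended to cue an excited reading of PHRASE.
excited = phrase_after("We did it! We actually won the championship!")

# Negative context, intended to cue a disappointed reading of PHRASE.
disappointed = phrase_after(
    "The trip is cancelled, and the tickets are non-refundable."
)
```

The two bodies differ only in their context, so any difference in delivery between the resulting audio would come from the preceding utterance rather than from the synthesized text.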
Linguistic context
Continuation also provides linguistic context for proper pronunciation, particularly with homographs—words that are spelled the same but pronounced differently based on meaning. For example, Octave can correctly differentiate between:
- “Take a bow.” (/bau/) vs. “Take a bow and arrow.” (/bō/)
- “Play the bass guitar.” (/bās/) vs. “Go bass fishing.” (/bas/)
- “I read the book yesterday.” (/red/) vs. “I will read the book tomorrow.” (/rēd/)
Try these examples to see how Octave intelligently distinguishes between different pronunciations of the word “bow” based on contextual understanding:
With /bau/ pronunciation
With /bō/ pronunciation
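One way to set up such a comparison is to keep the utterance fixed and vary only the context. The sketch below is illustrative: the context sentences, the utterance text, the field name `text`, and the helper `bow_request` are assumptions, and the pronunciation each context actually produces is up to the model.

```python
from typing import Any, Dict

# Hypothetical context sentences, keyed by the pronunciation they aim to cue.
CONTEXT_FOR_SENSE = {
    "/bau/": "The crowd is still applauding your performance.",
    "/bō/": "The archery range is open, and the targets are set up.",
}


def bow_request(sense: str) -> Dict[str, Any]:
    """Request body whose context hints at one sense of the word "bow"."""
    return {
        "utterances": [{"text": "Go ahead and take a bow."}],
        "context": {"utterances": [{"text": CONTEXT_FOR_SENSE[sense]}]},
    }


# One request body per intended pronunciation.
requests_by_sense = {sense: bow_request(sense) for sense in CONTEXT_FOR_SENSE}
```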
Consistent voice
When continuing from an utterance, Octave intelligently handles voice consistency:
- If you don’t specify a voice for a new utterance, Octave automatically continues using the same voice from the previous utterance.
- You only need to specify a voice when you want to change from the currently established voice.
- This applies to both generated voices (from descriptions) and saved voices from the voice library.
For more information on specifying a voice in your request, see our voices guide.
Below are sample requests that show how you can continue with the same voice:
Multiple utterances in a single request
Continuing from previous generation using context
Changing voices mid-conversation
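The three cases listed above might look like the following request bodies. This is a sketch under stated assumptions: the `voice` object with a `name` key, the `generation_id` field, the voice names, and the helper `voice_names` are all hypothetical; the voices guide is the place to confirm the exact shape.

```python
from typing import Any, Dict, List, Optional


def voice_names(body: Dict[str, Any]) -> List[Optional[str]]:
    """Voice named on each utterance, or None where the voice is omitted."""
    return [u.get("voice", {}).get("name") for u in body["utterances"]]


# Multiple utterances in a single request: the voice is set once, and the
# second utterance omits it so that the established voice carries over.
same_voice = {
    "utterances": [
        {"text": "Welcome back to the show.", "voice": {"name": "narrator-voice"}},
        {"text": "Today we have a special guest."},
    ]
}

# Continuing from a previous generation: no voice is given, so the voice of
# the referenced generation is expected to continue (placeholder ID).
continued = {
    "utterances": [{"text": "Let's pick up where we left off."}],
    "context": {"generation_id": "<generation-id-from-previous-output>"},
}

# Changing voices mid-conversation: a voice is specified only because it
# differs from the currently established one.
changed_voice = {
    "utterances": [
        {"text": "Did you hear that noise?", "voice": {"name": "character-a"}},
        {"text": "It's only the wind.", "voice": {"name": "character-b"}},
    ]
}

for label, body in [("same", same_voice), ("continued", continued),
                    ("changed", changed_voice)]:
    print(label, voice_names(body))
```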
This intelligent handling of voice consistency saves development effort and ensures a seamless listening experience, making it easier to create dynamic, multi-character narratives without redundant voice specifications.