Octave Continuation Guide

Guide to maintaining coherent speech across multiple utterances and generations.

A key feature that sets Octave apart from traditional TTS systems is that the model understands what it’s saying. This linguistic comprehension powers Octave’s sophisticated continuation capabilities, allowing for coherent audio content across multiple segments. This guide explains how to use Octave’s continuation features to create natural-sounding audio content that spans multiple utterances while preserving flow and emotional continuity.

What is continuation?

With Octave, continuation refers to maintaining contextual awareness between utterances to generate coherent, natural-sounding speech. An utterance is a unit of input that includes text to be synthesized and an optional description of how the speech should sound.

Continuation allows you to use previous utterances as context to inform new speech generation. You can implement continuation in two ways:

  1. Multiple utterances in a single request: When you provide multiple utterances in one request, Octave automatically uses each utterance as context for the next one in the chain.
  2. Using previous context: You can provide context as either the generation_id of a previous generation or a set of context utterances.

When using continuation, Octave will generate speech for all utterances in the utterances array of your request. However, any utterances provided within the context parameter serve only as reference and will not produce additional audio. This distinction allows you to build upon previous speech without duplicating audio output.
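The difference between the utterances array and the context parameter can be sketched as a small payload builder. This is a minimal sketch, assuming that context accepts an utterances array with the same shape as the top-level one (the paragraph above implies this, but confirm it against the API reference). The helper names json_escape and build_context_payload are hypothetical and not part of the API.

```shell
#!/bin/sh
# Hypothetical helper: escape backslashes and double quotes for a JSON string.
# Sketch only: newlines and control characters are not handled.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Hypothetical helper: build a request body in which the first argument is
# reference-only context and the second argument is the text to synthesize.
# Assumption: "context" accepts an "utterances" array.
build_context_payload() {
  context_text=$(json_escape "$1")
  new_text=$(json_escape "$2")
  printf '{"utterances":[{"text":"%s"}],"context":{"utterances":[{"text":"%s"}]}}' \
    "$new_text" "$context_text"
}

payload=$(build_context_payload \
  "Our proposal has been accepted with full funding!" \
  "I can't believe it!")
printf '%s\n' "$payload"

# To send it (requires HUME_API_KEY and network access):
# curl "https://api.hume.ai/v0/tts" \
#   -H "X-Hume-Api-Key: $HUME_API_KEY" \
#   --json "$payload"
```

Under that assumption, only "I can't believe it!" would be synthesized; the context sentence would shape its delivery without producing audio of its own.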

The primary reason to split your text input into multiple utterances, rather than providing all of the text in a single utterance, is to give distinct acting instructions for different parts of the text.

Key aspects of continuation

Narrative coherence

When creating longer audio that exceeds a single utterance (such as audiobooks or educational materials), continuation ensures your audience experiences a cohesive narrative without awkward shifts in delivery, pacing, or emotional tone. The speech maintains appropriate energy levels and emotional progression, resulting in a more authentic listening experience where each new segment builds naturally from what came before.

Try these examples to experience how Octave maintains narrative coherence, delivering the same phrase with completely different emotional tones based on the context of the preceding utterance:

With positive context (excited interpretation)

curl "https://api.hume.ai/v0/tts" \
  -H "X-Hume-Api-Key: $HUME_API_KEY" \
  --json '{
    "utterances": [
      { "text": "Our proposal has been accepted with full funding for the next three years!" },
      { "text": "I can'\''t believe it!" }
    ]
  }'

With negative context (disappointed interpretation)

curl "https://api.hume.ai/v0/tts" \
  -H "X-Hume-Api-Key: $HUME_API_KEY" \
  --json '{
    "utterances": [
      { "text": "After all our preparation... They'\''ve decided to cancel the entire project..." },
      { "text": "I can'\''t believe it!" }
    ]
  }'

Linguistic context

Continuation also provides linguistic context for proper pronunciation, particularly with homographs—words that are spelled the same but pronounced differently based on meaning. For example, Octave can correctly differentiate between:

  • “Take a bow.” (/bau/) vs. “Take a bow and arrow.” (/bō/)
  • “Play the bass guitar.” (/bās/) vs. “Go bass fishing.” (/bas/)
  • “I read the book yesterday.” (/red/) vs. “I will read the book tomorrow.” (/rēd/)

Try these examples to see how Octave intelligently distinguishes between different pronunciations of the word “bow” based on contextual understanding:

With /bau/ pronunciation

curl "https://api.hume.ai/v0/tts" \
  -H "X-Hume-Api-Key: $HUME_API_KEY" \
  --json '{
    "utterances": [
      { "text": "What a fantastic performance!" },
      { "text": "Now take a bow." }
    ]
  }'

With /bō/ pronunciation

curl "https://api.hume.ai/v0/tts" \
  -H "X-Hume-Api-Key: $HUME_API_KEY" \
  --json '{
    "utterances": [
      { "text": "First take a quiver of arrows." },
      { "text": "Now take a bow." }
    ]
  }'

Consistent voice

When continuing from an utterance, Octave intelligently handles voice consistency:

  • If you don’t specify a voice for a new utterance, Octave automatically continues using the same voice from the previous utterance.
  • You only need to specify a voice when you want to change from the currently established voice.
  • This applies to both generated voices (from descriptions) and saved voices from the voice library.

For more information on specifying a voice in your request, see our voices guide.

The sample requests below show how to continue with the same voice and how to switch voices mid-conversation:

Multiple utterances in a single request

curl "https://api.hume.ai/v0/tts" \
  -H "X-Hume-Api-Key: $HUME_API_KEY" \
  --json '{
    "utterances": [
      {
        "text": "Gather around everyone... can I have your attention? Great! Today we'\''ll be learning about supermassive black holes at the center of galaxies.",
        "description": "Speaking slightly louder at first to be heard in the busy museum, then transitioning to an enthusiastic but measured teaching tone once attention is gained.",
        "voice": {
          "name": "Donovan Sinclair",
          "provider": "HUME_AI"
        }
      },
      {
        "text": "I'\''ve arranged for the museum guide to explain their special exhibit on black holes. I think you'\''ll find it really helpful for the concepts we'\''ve been covering in class.",
        "description": "Speaking more quietly now with a casual, informative tone, showing enthusiasm appropriate for the educational context."
      }
    ]
  }'

Continuing from previous generation using context

# First request: capture the generation_id from the response
GENERATION_ID=$(curl "https://api.hume.ai/v0/tts" \
  -H "X-Hume-Api-Key: $HUME_API_KEY" \
  --json '{
    "utterances": [
      {
        "text": "Gather around everyone... can I have your attention? Great! Today we'\''ll be learning about supermassive black holes at the center of galaxies.",
        "description": "Speaking slightly louder at first to be heard in the busy museum, then transitioning to an enthusiastic but measured teaching tone once attention is gained.",
        "voice": {
          "name": "Donovan Sinclair",
          "provider": "HUME_AI"
        }
      }
    ]
  }' | jq -r '.generations[0].generation_id')

# Second request: pass the generation_id from the first request as context.
# No voice is specified, so the voice from the first generation carries over.
curl "https://api.hume.ai/v0/tts" \
  -H "X-Hume-Api-Key: $HUME_API_KEY" \
  --json '{
    "utterances": [
      {
        "text": "I'\''ve arranged for the museum guide to explain their special exhibit on black holes. I think you'\''ll find it really helpful for the concepts we'\''ve been covering in class.",
        "description": "Speaking more quietly now with a casual, informative tone, showing enthusiasm appropriate for the educational context."
      }
    ],
    "context": {
      "generation_id": "'"$GENERATION_ID"'"
    }
  }'
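When a script generates many segments in sequence, the two-request pattern above extends naturally to a loop that threads each generation_id into the next request. The following is a minimal sketch, assuming the response shape used above (.generations[0].generation_id) and that jq is installed. The names json_escape, build_payload, and synthesize_chain are hypothetical helpers, not part of the API, and the sketch discards everything in each response except the ID.

```shell
#!/bin/sh
# Hypothetical helper: escape backslashes and double quotes for a JSON string.
# Sketch only: newlines and control characters are not handled.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Build a request body for one segment. With an empty second argument the
# request carries no context; otherwise it continues from that generation.
build_payload() {
  text=$(json_escape "$1")
  if [ -n "$2" ]; then
    printf '{"utterances":[{"text":"%s"}],"context":{"generation_id":"%s"}}' \
      "$text" "$(json_escape "$2")"
  else
    printf '{"utterances":[{"text":"%s"}]}' "$text"
  fi
}

# Send each argument as its own request, continuing from the previous one.
# Requires HUME_API_KEY, curl 7.82 or later (for --json), and jq.
# Defined here but not invoked, so loading this file makes no network calls.
synthesize_chain() {
  generation_id=""
  for segment in "$@"; do
    generation_id=$(curl -sS "https://api.hume.ai/v0/tts" \
      -H "X-Hume-Api-Key: $HUME_API_KEY" \
      --json "$(build_payload "$segment" "$generation_id")" \
      | jq -r '.generations[0].generation_id')
    printf 'generated %s\n' "$generation_id"
  done
}

# Usage: synthesize_chain "First segment." "Second segment."
```

Keeping payload construction separate from the network call makes the JSON easy to inspect before anything is sent. A real script would also save the audio from each response and stop if a request fails.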

Changing voices mid-conversation

curl "https://api.hume.ai/v0/tts" \
  -H "X-Hume-Api-Key: $HUME_API_KEY" \
  --json '{
    "utterances": [
      {
        "text": "Gather around everyone... can I have your attention? Great! Today we'\''ll be learning about supermassive black holes at the center of galaxies.",
        "description": "Speaking slightly louder at first to be heard in the busy museum, then transitioning to an enthusiastic but measured teaching tone once attention is gained.",
        "voice": {
          "name": "Donovan Sinclair",
          "provider": "HUME_AI"
        }
      },
      {
        "text": "I'\''ve arranged for the museum guide to explain their special exhibit on black holes. I think you'\''ll find it really helpful for the concepts we'\''ve been covering in class.",
        "description": "Speaking more quietly now with a casual, informative tone, showing enthusiasm appropriate for the educational context."
      },
      {
        "text": "Thanks, Professor. Hi everyone, I'\''m Vince from the astronomy department here at the museum. Welcome to our black hole visualization exhibit.",
        "description": "Speaking in a friendly, welcoming tone with the natural projection needed in a museum space. Confident but conversational, as if addressing a group of university students.",
        "voice": {
          "name": "Vince Douglas",
          "provider": "HUME_AI"
        }
      },
      {
        "text": "It'\''s quite fascinating how we can detect something we can'\''t directly observe. Black holes don'\''t emit light, but we can study their effects on nearby stars and gas.",
        "description": "Speaking with genuine interest and enthusiasm, using a clear educational tone that emphasizes key points."
      }
    ]
  }'

This intelligent handling of voice consistency saves development effort and ensures a seamless listening experience, making it easier to create dynamic, multi-character narratives without redundant voice specifications.