Continuation Guide

Guide to maintaining coherent speech across multiple utterances and generations.

Octave supports continuation across generations. It carries context from earlier output into the next generation, keeping long-form speech coherent across multiple utterances so delivery stays natural, consistent, and emotionally continuous.

Ways to continue

1. Chain utterances in one request

  • Put multiple items in the utterances array.
  • Each utterance continues only from the immediately preceding utterance in the same request.

2. Continue from a previous call

Pass context in the context field using one of:

  • generation_id: the ID of a previous generation to continue from.
  • Context utterances: supply reference utterances that guide delivery.
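The context-utterances option can be sketched as a request body. This is a minimal sketch, not a confirmed schema: it assumes that context accepts an utterances array whose items are shaped like the top-level ones (text, description), and build_context_payload is a hypothetical helper name.

```shell
# Hypothetical helper: print a request body whose "context" holds a
# reference utterance instead of a generation_id.
# Assumption: "context.utterances" items use the same fields as the
# top-level "utterances" items.
build_context_payload() {
  cat <<'EOF'
{
  "utterances": [
    {
      "text": "I can't believe it!",
      "voice": {
        "name": "Ava Song",
        "provider": "HUME_AI"
      }
    }
  ],
  "context": {
    "utterances": [
      {
        "text": "After all our preparation... They've decided to cancel the entire project...",
        "description": "disappointed, quiet"
      }
    ]
  }
}
EOF
}
```

With a curl version that supports --json, the body can likely be sent by piping it in: build_context_payload | curl "https://api.hume.ai/v0/tts/stream/json" -H "X-Hume-Api-Key: $HUME_API_KEY" --json @-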

Aspects of continuation

Narrative coherence

For long-form audio such as audiobooks, continuation keeps the narrative cohesive across utterances. It prevents abrupt shifts in delivery, pacing, and emotion, carries energy and emotional progression forward, and lets each segment build naturally on the last for a more authentic listen.
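For long-form text such as a chapter, the utterances array can be generated rather than written by hand. The sketch below is one possible approach, not part of the API: lines_to_utterances is a hypothetical helper that turns each non-empty input line into one utterance. It escapes only backslashes and double quotes, so it assumes the text contains no tabs or other control characters.

```shell
# Hypothetical helper: read text on stdin and print a request body with
# one utterance per non-empty line.
lines_to_utterances() {
  # Drop blank lines, then escape backslashes and double quotes for JSON.
  sed -e '/^[[:space:]]*$/d' -e 's/\\/\\\\/g' -e 's/"/\\"/g' |
  awk 'BEGIN { printf "{\"utterances\":[" }
       { printf "%s{\"text\":\"%s\"}", (NR > 1 ? "," : ""), $0 }
       END { print "]}" }'
}
```

Because each utterance continues only from the one before it, the order of the input lines is the order of continuation.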

In the examples below, the same line is delivered with different emotions based on the context set by the preceding utterance.

With positive context (excited interpretation)

curl "https://api.hume.ai/v0/tts/stream/json" \
  -H "X-Hume-Api-Key: $HUME_API_KEY" \
  --json '{
    "utterances": [
      {
        "text": "Our proposal has been accepted with full funding for the next three years!",
        "voice": {
          "name": "Ava Song",
          "provider": "HUME_AI"
        }
      },
      {
        "text": "I can'\''t believe it!",
        "voice": {
          "name": "Ava Song",
          "provider": "HUME_AI"
        }
      }
    ]
  }'

With negative context (disappointed interpretation)

curl "https://api.hume.ai/v0/tts/stream/json" \
  -H "X-Hume-Api-Key: $HUME_API_KEY" \
  --json '{
    "utterances": [
      {
        "text": "After all our preparation... They'\''ve decided to cancel the entire project...",
        "voice": {
          "name": "Ava Song",
          "provider": "HUME_AI"
        }
      },
      {
        "text": "I can'\''t believe it!",
        "voice": {
          "name": "Ava Song",
          "provider": "HUME_AI"
        }
      }
    ]
  }'

Linguistic context

Continuation also provides linguistic context for proper pronunciation, particularly with homographs (words that are spelled the same but pronounced differently based on meaning). For example, Octave can correctly differentiate between:

  • “Take a bow.” (/bau/) vs. “Take a bow and arrow.” (/bō/)
  • “Play the bass guitar.” (/bās/) vs. “Go bass fishing.” (/bas/)
  • “I read the book yesterday.” (/red/) vs. “I will read the book tomorrow.” (/rēd/)

Try these examples to see how Octave intelligently distinguishes between different pronunciations of the word “bow” based on contextual understanding:

With /bau/ pronunciation

curl "https://api.hume.ai/v0/tts/stream/json" \
  -H "X-Hume-Api-Key: $HUME_API_KEY" \
  --json '{
    "utterances": [
      {
        "text": "What a fantastic performance!",
        "voice": {
          "name": "Ava Song",
          "provider": "HUME_AI"
        }
      },
      {
        "text": "Now take a bow.",
        "voice": {
          "name": "Ava Song",
          "provider": "HUME_AI"
        }
      }
    ]
  }'

With /bō/ pronunciation

curl "https://api.hume.ai/v0/tts/stream/json" \
  -H "X-Hume-Api-Key: $HUME_API_KEY" \
  --json '{
    "utterances": [
      {
        "text": "First take a quiver of arrows.",
        "voice": {
          "name": "Ava Song",
          "provider": "HUME_AI"
        }
      },
      {
        "text": "Now take a bow.",
        "voice": {
          "name": "Ava Song",
          "provider": "HUME_AI"
        }
      }
    ]
  }'

Consistent voice

When continuing from an utterance, Octave intelligently handles voice consistency:

  • Octave automatically continues using the same voice from the previous utterance.
  • You only need to specify a voice when you want to change from the currently established one.

For more information on specifying a voice in your request, see our voices guide.

The sample requests below show how to continue with the same voice and how to change voices partway through:

Multiple utterances in a single request

curl "https://api.hume.ai/v0/tts/stream/json" \
  -H "X-Hume-Api-Key: $HUME_API_KEY" \
  --json '{
    "utterances": [
      {
        "text": "Gather around everyone! May I have your attention? Today we'\''ll be learning about supermassive black holes at the center of galaxies.",
        "description": "projecting in a large museum auditorium, enthusiastic, joyful, ostentatious",
        "voice": {
          "name": "Donovan Sinclair",
          "provider": "HUME_AI"
        },
        "speed": 1.3
      },
      {
        "text": "I'\''ve arranged for the museum guide to explain their special exhibit on black holes! I think you'\''ll find it really helpful for the concepts we'\''ve been covering in class!",
        "description": "pedagogical, enthusiastic, hinting",
        "speed": 1.3
      }
    ]
  }'

Continuing from previous generation using context

# First request - capture the generation_id
GENERATION_ID=$(curl "https://api.hume.ai/v0/tts/stream/json" \
  -H "X-Hume-Api-Key: $HUME_API_KEY" \
  --json '{
    "utterances": [
      {
        "text": "Gather around everyone! May I have your attention? Today we'\''ll be learning about supermassive black holes at the center of galaxies.",
        "description": "projecting in a large museum auditorium, enthusiastic, joyful, ostentatious",
        "voice": {
          "name": "Donovan Sinclair",
          "provider": "HUME_AI"
        },
        "speed": 1.3
      }
    ]
  }' | jq -r '.generations[0].generation_id')

# Second request using the generation_id from the first request
curl "https://api.hume.ai/v0/tts/stream/json" \
  -H "X-Hume-Api-Key: $HUME_API_KEY" \
  --json '{
    "utterances": [
      {
        "text": "I'\''ve arranged for the museum guide to explain their special exhibit on black holes. I think you'\''ll find it really helpful for the concepts we'\''ve been covering in class.",
        "description": "pedagogical, enthusiastic, hinting",
        "voice": {
          "name": "Donovan Sinclair",
          "provider": "HUME_AI"
        },
        "speed": 1.3
      }
    ],
    "context": {
      "generation_id": "'"$GENERATION_ID"'"
    }
  }'
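When a generation_id is passed from one request to the next through a shell variable, it may help to check that the ID was actually captured before sending the follow-up. The sketch below is one possible guard, not part of the API: require_generation_id and build_continuation_payload are hypothetical helper names, and the text argument is assumed to contain no characters that need JSON escaping.

```shell
# Hypothetical helper: reject an empty or "null" generation ID.
# (jq -r prints the string "null" when the selected field is missing.)
require_generation_id() {
  case "$1" in
    ""|null)
      echo "No generation_id captured from the first request" >&2
      return 1
      ;;
  esac
}

# Hypothetical helper: print a request body that continues from
# generation $1 with text $2. The text is inserted as-is, so it must not
# contain double quotes, backslashes, or control characters.
build_continuation_payload() {
  require_generation_id "$1" || return 1
  printf '{"utterances":[{"text":"%s"}],"context":{"generation_id":"%s"}}\n' "$2" "$1"
}
```

With a curl version that supports --json, the result can likely be piped straight into the request: build_continuation_payload "$GENERATION_ID" "Welcome back." | curl "https://api.hume.ai/v0/tts/stream/json" -H "X-Hume-Api-Key: $HUME_API_KEY" --json @-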

Changing voices mid-conversation

curl "https://api.hume.ai/v0/tts/stream/json" \
  -H "X-Hume-Api-Key: $HUME_API_KEY" \
  --json '{
    "utterances": [
      {
        "text": "Gather around everyone! May I have your attention? Today we'\''ll be learning about supermassive black holes at the center of galaxies.",
        "description": "projecting in a large museum auditorium, enthusiastic, joyful, ostentatious",
        "voice": {
          "name": "Donovan Sinclair",
          "provider": "HUME_AI"
        },
        "speed": 1.3
      },
      {
        "text": "I'\''ve arranged for the museum guide to explain their special exhibit on black holes. I think you'\''ll find it really helpful for the concepts we'\''ve been covering in class.",
        "description": "pedagogical, enthusiastic, hinting",
        "speed": 1.3
      },
      {
        "text": "Thank you, Professor! Hello, everyone! I'\''m Vince from the astronomy department here at the museum. Welcome to our black hole visualization exhibit!",
        "description": "projecting in a large museum auditorium, professional, academic, welcoming, enthusiastic",
        "voice": {
          "name": "Vince Douglas",
          "provider": "HUME_AI"
        }
      },
      {
        "text": "It'\''s quite fascinating how we can detect something we can'\''t directly observe. Black holes don'\''t emit light, but we can study their effects on nearby stars and gas.",
        "description": "expressing awe, enthusiastic, emphatic, passionate"
      }
    ]
  }'

This intelligent handling of voice consistency saves development effort and ensures a seamless listening experience, making it easier to create dynamic, multi-character narratives without redundant voice specifications.

Notes and constraints

  • Continuation is scoped to the immediately preceding utterance only. It does not skip back to earlier utterances or generations.
  • Only items in utterances are synthesized. Items in context are reference-only.
  • Context utterances add latency because Octave must first generate the speech tokens it will continue from.
  • Octave supports multi-speaker continuation. You can keep the current voice or continue from speech generated with a different voice.