Timestamps Guide
Learn how to use word- and phoneme-level timestamps in Octave 2 (preview) for precise audio alignment and post-processing.
Timestamps are only available in Octave 2 (preview). Use the `version: 2` parameter in your TTS request to enable them.
Octave 2 (preview) provides word- and phoneme-level timestamps in streaming responses. These timestamps let you:
- Align audio with text for captions or highlighting words in real time.
- Synchronize multimodal experiences (e.g., animated avatars, lip syncing).
- Post-process speech (e.g., cutting or looping audio segments).
Timestamp support currently varies by language; full support for the remaining languages is coming soon.
How timestamps are delivered
When you request audio with Octave 2 (preview), you’ll receive multiple message types over the stream:
- `audio` chunks: contain base64-encoded audio data and metadata.
- `timestamp` events: mark the boundaries of a word or phoneme with start and end times in milliseconds.
Each timestamp event includes:
- `type`: `"word"` or `"phoneme"`
- `text`: the word or phonetic symbol
- `time.begin` and `time.end`: start and end offsets in milliseconds, relative to the audio
Each timestamp event looks like the following (an illustrative shape based on the fields above; exact field names may vary):
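```json
{
  "type": "word",
  "text": "planet",
  "time": { "begin": 120, "end": 480 }
}
```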
Using timestamps in practice
Each timestamp event carries the unit type (“word” or “phoneme”), the text or phoneme symbol, and millisecond offsets (begin and end). These arrive interleaved with audio chunks in the stream.
The example below shows how to isolate the timestamp messages from the audio stream and print them as they arrive.
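Here is a minimal Python sketch, assuming a WebSocket endpoint that emits one JSON message per event. The URL, the request fields other than `version`, and the exact message envelope (e.g., the `data` field on audio chunks) are illustrative assumptions, not the literal API; consult the API reference for the real details.

```python
import asyncio
import base64
import json

import websockets  # pip install websockets

# Illustrative endpoint; substitute the real streaming TTS URL and auth.
TTS_URL = "wss://api.example.com/tts/stream"

async def stream_with_timestamps(text: str) -> None:
    async with websockets.connect(TTS_URL) as ws:
        # version: 2 selects Octave 2 (preview), which enables timestamps.
        await ws.send(json.dumps({"text": text, "version": 2}))

        audio = bytearray()
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("type") == "audio":
                # Audio chunks carry base64-encoded audio data
                # ("data" is an assumed field name).
                audio.extend(base64.b64decode(msg["data"]))
            elif msg.get("type") in ("word", "phoneme"):
                # Timestamp events arrive interleaved with audio chunks;
                # print each one as it streams in.
                t = msg["time"]
                print(f'{msg["type"]:>8}  {msg["text"]!r:<16} '
                      f'{t["begin"]:>6}-{t["end"]} ms')

asyncio.run(stream_with_timestamps("Hello from Octave 2!"))
```

If you plan to post-process the timing data, collect the timestamp events into a list as they arrive instead of printing them; the captioning sketch under Best practices below assumes you have done so.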
Best practices
- Use both levels: Phoneme timestamps help with lip syncing; word timestamps are better for captions (see the captioning sketch after this list).
- Expect overlap: Word timestamps span multiple phonemes; use `time.begin`/`time.end` for alignment.
- Stream incrementally: Timestamps interleave with audio chunks; process them as they arrive.
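As one post-processing example, here is a minimal sketch that turns collected word events (in the illustrative shape shown earlier) into SRT caption cues:

```python
def ms_to_srt(ms: int) -> str:
    # Format a millisecond offset as an SRT timecode: HH:MM:SS,mmm
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def words_to_srt(words: list[dict]) -> str:
    # One cue per word for simplicity; group words into phrases
    # for production captions.
    cues = []
    for i, w in enumerate(words, start=1):
        begin, end = w["time"]["begin"], w["time"]["end"]
        cues.append(f"{i}\n{ms_to_srt(begin)} --> {ms_to_srt(end)}\n{w['text']}\n")
    return "\n".join(cues)

print(words_to_srt([
    {"type": "word", "text": "Hello", "time": {"begin": 0, "end": 420}},
    {"type": "word", "text": "world", "time": {"begin": 430, "end": 900}},
]))
```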