Timestamps Guide
Octave 2 supports streaming word- and phoneme-level timestamps in TTS responses. These timestamps enable developers to:
- Align audio with text for real-time captions or word highlighting.
- Synchronize multimodal outputs such as animated avatars or lip-syncing.
- Post-process speech by cutting, looping, or segmenting audio with precision.
Requesting timestamps
Timestamps are only returned when you specify them in your request. Use include_timestamp_types to request timestamps by passing an array of supported types: "word" and "phoneme".
How you specify timestamps differs between HTTP and WebSocket endpoints:
- HTTP: Include the include_timestamp_types field in your request body.
- WebSocket: Set the include_timestamp_types query parameter on the handshake request. Timestamps will then be streamed alongside TTS output audio for the duration of the session.
Receiving timestamps
When you request timestamps, you'll receive OctaveOutputTimestamp objects, containing the timestamp data, over the stream. OctaveOutputTimestamp objects arrive interleaved with SnippetAudioChunk objects.
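Because OctaveOutputTimestamp and SnippetAudioChunk objects arrive interleaved, a receiver typically dispatches on each message's type as it comes off the stream. A minimal sketch, assuming each streamed message is a JSON object with a "type" discriminator; the field names and type values here are hypothetical, not taken from this guide:

```python
import json

def handle_stream_message(raw: str, audio_chunks: list, timestamps: list) -> None:
    """Route one streamed JSON message by its type.

    Assumes a "type" discriminator field; the exact field names on
    OctaveOutputTimestamp / SnippetAudioChunk are hypothetical here.
    """
    msg = json.loads(raw)
    if msg.get("type") == "timestamp":
        timestamps.append(msg)     # OctaveOutputTimestamp data
    elif msg.get("type") == "audio":
        audio_chunks.append(msg)   # SnippetAudioChunk data
    # Unknown message types are ignored so new types don't break the client.

# Timestamp messages arrive interleaved with audio chunks:
stream = [
    '{"type": "audio", "index": 0}',
    '{"type": "timestamp", "text": "Hello", "time": {"begin": 0.0, "end": 0.4}}',
    '{"type": "audio", "index": 1}',
]
audio, ts = [], []
for raw in stream:
    handle_stream_message(raw, audio, ts)
print(len(audio), len(ts))  # 2 1
```

Collecting the two message kinds into separate lists like this makes it easy to align each timestamp against the audio buffer afterward, e.g. for caption rendering.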
Phoneme standard
Phoneme-level timestamps use IPA (International Phonetic Alphabet) symbols. For some languages, we use IPA-compatible extensions consistent with the eSpeak NG phoneme inventory and language dictionaries.