# Timestamps Guide
Octave 2 supports word- and phoneme-level timestamps in TTS responses. These timestamps enable developers to:
- Align audio with text for real-time captions or word highlighting.
- Synchronize multimodal outputs such as animated avatars or lip-syncing.
- Post-process speech by cutting, looping, or segmenting audio with precision.
## Requesting timestamps
Timestamps are returned only when you request them. Use the `include_timestamp_types` field to pass an array of the supported types, `"word"` and `"phoneme"`, and specify `"version": "2"` in your request body to ensure timestamp support.
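As a minimal sketch, a request body combining these fields might look like the following. The `utterances` field and its shape are illustrative assumptions, not part of this guide; only `include_timestamp_types` and `version` are described above. Check the API reference for the exact schema.

```python
import json

# Illustrative request body. The "utterances" shape is a hypothetical
# placeholder; "version" and "include_timestamp_types" are the fields
# this guide describes.
payload = {
    "version": "2",
    "utterances": [{"text": "Hello, world!"}],  # hypothetical field
    "include_timestamp_types": ["word", "phoneme"],
}

print(json.dumps(payload, indent=2))
```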
How you specify timestamps differs between HTTP and WebSocket endpoints:
- **HTTP**: Include the `include_timestamp_types` field in your request body.
- **WebSocket**: Set the `include_timestamp_types` query parameter on the handshake request. Timestamps will then be streamed alongside the TTS output audio for the duration of the session.
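For the WebSocket case, building the handshake URL might look like this sketch. The base URL is a hypothetical placeholder, and repeating the query parameter once per type is an assumed encoding convention; confirm both against the API reference.

```python
from urllib.parse import urlencode

# Hypothetical handshake endpoint -- substitute the real WebSocket URL
# from the API reference.
BASE_URL = "wss://api.example.com/v0/tts/stream"

# Assumed convention: repeat the include_timestamp_types parameter once
# per requested timestamp type.
params = urlencode(
    [("include_timestamp_types", "word"), ("include_timestamp_types", "phoneme")]
)
handshake_url = f"{BASE_URL}?{params}"
print(handshake_url)
```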
## Receiving timestamps
When you request timestamps, `OctaveOutputTimestamp` objects containing the timestamp data are delivered over the stream, interleaved with `SnippetAudioChunk` objects.
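Because the two object types arrive interleaved, a consumer typically dispatches on each message's type as it reads the stream. The sketch below assumes each message is a JSON object with a `"type"` discriminator field; that envelope and the sample field names are illustrative assumptions, not the real schema.

```python
import json

def split_stream(messages):
    """Separate an interleaved stream into audio chunks and timestamps.

    Assumes a hypothetical envelope where each JSON message carries a
    "type" field; consult the API reference for the actual message schema.
    """
    audio_chunks, timestamps = [], []
    for raw in messages:
        msg = json.loads(raw)
        if msg.get("type") == "timestamp":
            timestamps.append(msg)
        else:
            audio_chunks.append(msg)
    return audio_chunks, timestamps

# Simulated interleaved stream (shapes are illustrative only).
stream = [
    '{"type": "audio", "data": "base64..."}',
    '{"type": "timestamp", "text": "Hello", "time": {"begin": 0.0, "end": 0.32}}',
    '{"type": "audio", "data": "base64..."}',
]
audio, ts = split_stream(stream)
print(len(audio), len(ts))  # → 2 1
```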
## Phoneme standard
Phoneme-level timestamps use IPA (International Phonetic Alphabet) symbols. For some languages, we use IPA-compatible extensions consistent with the eSpeak NG phoneme inventory and language dictionaries.

