Timestamps Guide

A guide to using timestamps for audio generated by Octave 2.

Octave 2 supports streaming word- and phoneme-level timestamps in TTS responses. These timestamps enable developers to:

  • Align audio with text for real-time captions or word highlighting.
  • Synchronize multimodal outputs such as animated avatars or lip-syncing.
  • Post-process speech by cutting, looping, or segmenting audio with precision.

Requesting timestamps

Timestamps are only returned when you specify them in your request.

Set include_timestamp_types to an array containing one or both of the supported types: "word" and "phoneme".

How you specify timestamps differs between HTTP and WebSocket endpoints:

  • HTTP: Include the include_timestamp_types field in your request body.

    curl "https://api.hume.ai/v0/tts/stream/json" \
      -H "X-Hume-Api-Key: $HUME_API_KEY" \
      --json '{
        "version": "2",
        "include_timestamp_types": ["word", "phoneme"],
        "utterances": [
          {
            "voice": { "id": "5bb7de05-c8fe-426a-8fcc-ba4fc4ce9f9c" },
            "text": "My friend told me about this amazing place!"
          }
        ]
      }'

  • WebSocket: Set the include_timestamp_types query parameter of the handshake request. Timestamps are then streamed alongside the TTS output audio for the duration of the session.
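
The HTTP request from the first bullet can also be assembled programmatically. The following is a minimal Python sketch using only the standard library. The helper name build_tts_request is hypothetical and not part of any Hume SDK; the URL, header, and body fields are copied from the curl example. It builds the request without sending it and rejects unsupported timestamp types up front.

```python
import json
import os
import urllib.request

# Endpoint and field names are taken from the curl example above.
TTS_STREAM_URL = "https://api.hume.ai/v0/tts/stream/json"
SUPPORTED_TIMESTAMP_TYPES = {"word", "phoneme"}


def build_tts_request(text, voice_id, timestamp_types, api_key):
    """Build (but do not send) a TTS request that asks for timestamps.

    Hypothetical helper for illustration; not part of any Hume SDK.
    """
    unsupported = set(timestamp_types) - SUPPORTED_TIMESTAMP_TYPES
    if unsupported:
        raise ValueError(f"unsupported timestamp types: {sorted(unsupported)}")

    body = {
        "version": "2",
        "include_timestamp_types": list(timestamp_types),
        "utterances": [{"voice": {"id": voice_id}, "text": text}],
    }
    return urllib.request.Request(
        TTS_STREAM_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "X-Hume-Api-Key": api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_tts_request(
    text="My friend told me about this amazing place!",
    voice_id="5bb7de05-c8fe-426a-8fcc-ba4fc4ce9f9c",
    timestamp_types=["word", "phoneme"],
    api_key=os.environ.get("HUME_API_KEY", ""),
)
```

To send the request, pass it to urllib.request.urlopen and read the streamed response body.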

Receiving timestamps

When you request timestamps, the stream delivers OctaveOutputTimestamp objects containing the timestamp data. These objects arrive interleaved with SnippetAudioChunk objects.

{
  "type": "timestamp",
  "request_id": "dc995e9e-5379-48d7-a62a-81593300395e2570671",
  "generation_id": "eae965b6-7c20-4b56-b703-bbea7e50793f",
  "snippet_id": "72837cf5-fdb4-4232-9cf0-0b5b8bd2579a",
  "timestamp": {
    "type": "phoneme",
    "text": "m",
    "time": {
      "begin": 60,
      "end": 80
    }
  }
}
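
As a sketch of how a client might consume these messages, the Python below separates timestamp messages from everything else on the stream and looks up which word covers a given playback position, as needed for captions or word highlighting. It assumes that each message arrives as one JSON document (for example, one per line) and that begin and end are millisecond offsets into the audio; neither is stated in this guide. The names Timestamp, collect_timestamps, and active_at are hypothetical, and the sample messages are illustrative.

```python
import json
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Timestamp:
    kind: str   # "word" or "phoneme"
    text: str
    begin: int  # assumed to be milliseconds
    end: int    # assumed to be milliseconds


def collect_timestamps(messages, kind):
    """Return timestamps of one kind, sorted by start time.

    `messages` is an iterable of JSON strings. Any message whose top-level
    "type" is not "timestamp" (such as an audio chunk) is skipped.
    """
    found = []
    for raw in messages:
        message = json.loads(raw)
        if message.get("type") != "timestamp":
            continue
        ts = message["timestamp"]
        if ts["type"] == kind:
            found.append(
                Timestamp(kind, ts["text"], ts["time"]["begin"], ts["time"]["end"])
            )
    return sorted(found, key=lambda t: t.begin)


def active_at(timestamps, position_ms):
    """Return the timestamp covering position_ms, or None if in a gap."""
    starts = [t.begin for t in timestamps]
    index = bisect_right(starts, position_ms) - 1
    if index >= 0 and position_ms < timestamps[index].end:
        return timestamps[index]
    return None


# Illustrative messages. The first mirrors the example above with IDs omitted;
# the second is a placeholder for any non-timestamp message; the word timing
# in the third is made up for the demonstration.
stream = [
    '{"type": "timestamp", "timestamp": {"type": "phoneme", "text": "m",'
    ' "time": {"begin": 60, "end": 80}}}',
    '{"type": "placeholder"}',
    '{"type": "timestamp", "timestamp": {"type": "word", "text": "My",'
    ' "time": {"begin": 60, "end": 180}}}',
]

words = collect_timestamps(stream, "word")
current = active_at(words, 100)
```

With the sample messages, current is the "My" word timestamp, while a position before 60 or at or after 180 returns None. Sorting once and using a binary search keeps each lookup cheap when it runs on every playback tick.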

Phoneme standard

Phoneme-level timestamps use IPA (International Phonetic Alphabet) symbols. For some languages, we use IPA-compatible extensions consistent with the eSpeak NG phoneme inventory and language dictionaries.

Resources