Timestamps Guide

Learn how to use word- and phoneme-level timestamps in Octave 2 (preview) for precise audio alignment and post-processing.

Timestamps are only available in Octave 2 (preview). Set "version": "2" in your TTS request to enable them.

Octave 2 (preview) provides word- and phoneme-level timestamps in streaming responses. These timestamps let you:

  • Align audio with text for captions or highlighting words in real time.
  • Synchronize multimodal experiences (e.g., animated avatars, lip syncing).
  • Post-process speech (e.g., cutting or looping audio segments).

Timestamp support varies by language. Languages that currently lack complete timestamp support will gain it soon.

How timestamps are delivered

When you request audio with Octave 2 (preview), you’ll receive multiple message types over the stream:

  • audio chunks: contain base64-encoded audio data and metadata.
  • timestamp events: boundaries for a word or phoneme with start/end times in milliseconds.

Each timestamp event includes:

  • type: "word" or "phoneme"
  • text: the word or phonetic symbol
  • time.begin and time.end: start and end offsets in milliseconds relative to the audio

Example timestamp events:

// phoneme-level timestamp
{
  "type": "timestamp",
  "request_id": "...",
  "generation_id": "...",
  "snippet_id": "...",
  "timestamp": {
    "type": "phoneme",
    "text": "s",
    "time": {
      "begin": 240,
      "end": 260
    }
  }
}

// word-level timestamp
{
  "type": "timestamp",
  "request_id": "...",
  "generation_id": "...",
  "snippet_id": "...",
  "timestamp": {
    "type": "word",
    "text": "GUESS",
    "time": { "begin": 20, "end": 260 }
  }
}
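
Because audio chunks and timestamp events share one stream, a consumer typically branches on the message type. Below is a minimal Python sketch of that demuxing. It assumes each message has already been parsed into a dict, and that audio chunks arrive as {"type": "audio", "audio": "<base64>"}; those audio field names are illustrative, so check the API reference for the exact shape.

import base64

def handle_message(message: dict) -> None:
    """Route one parsed stream message to audio or timestamp handling."""
    if message.get("type") == "timestamp":
        ts = message["timestamp"]
        # ts["type"] is "word" or "phoneme"; times are in milliseconds.
        print(f'{ts["type"]:>8} {ts["text"]!r}: '
              f'{ts["time"]["begin"]}-{ts["time"]["end"]} ms')
    elif message.get("type") == "audio":  # assumed field names, see note above
        pcm = base64.b64decode(message["audio"])
        # ...append pcm to your playback buffer...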

Using timestamps in practice

Each timestamp event carries the unit type (“word” or “phoneme”), the text or phoneme symbol, and millisecond offsets (begin and end). These arrive interleaved with audio chunks in the stream.

The examples below show how to isolate the timestamp messages from the audio stream and display them as they arrive.

# Stream raw events; show only timestamp payloads
curl -N "https://api.hume.ai/v0/tts/stream/json" \
  -H "X-Hume-Api-Key: $HUME_API_KEY" \
  --json '{
    "utterances": [
      {
        "text": "My friend told me about this amazing place in the West Village.",
        "voice": {
          "name": "Ava Song",
          "provider": "HUME_AI"
        }
      }
    ],
    "version": "2"
  }' \
  | jq -rc 'select(.type=="timestamp") | .timestamp'
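
The same pipeline in Python, using the requests library. This sketch assumes the endpoint emits newline-delimited JSON messages, which is what the jq -rc filter above relies on.

import json
import os

import requests

url = "https://api.hume.ai/v0/tts/stream/json"
headers = {"X-Hume-Api-Key": os.environ["HUME_API_KEY"]}
payload = {
    "utterances": [
        {
            "text": "My friend told me about this amazing place in the West Village.",
            "voice": {"name": "Ava Song", "provider": "HUME_AI"},
        }
    ],
    "version": "2",
}

with requests.post(url, headers=headers, json=payload, stream=True) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if not line:
            continue  # skip keep-alive blank lines
        message = json.loads(line)
        if message.get("type") == "timestamp":
            print(message["timestamp"])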

Best practices

  • Use both levels: Phoneme timestamps help with lip syncing; word timestamps are better for captions (see the lookup sketch after this list).
  • Expect overlap: A word timestamp spans its constituent phonemes; use time.begin/time.end for alignment.
  • Stream incrementally: Timestamps interleave with audio chunks; process them as they arrive.
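
For caption highlighting, the core operation is mapping a playback position to the word being spoken. A minimal sketch, assuming you have collected the word-level timestamp payloads (the timestamp objects from the stream) sorted by time.begin:

from bisect import bisect_right

def current_word(words: list[dict], position_ms: int) -> str | None:
    """Return the word being spoken at position_ms, or None between words."""
    begins = [w["time"]["begin"] for w in words]
    # Find the last word whose begin time is at or before the position.
    i = bisect_right(begins, position_ms) - 1
    if i >= 0 and position_ms < words[i]["time"]["end"]:
        return words[i]["text"]
    return None

Given the word-level example above ({"text": "GUESS", "time": {"begin": 20, "end": 260}}), current_word(words, 120) returns "GUESS", and any position at or past 260 ms falls outside that word.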