Timestamps Guide

Learn how to use word- and phoneme-level timestamps in Octave 2 (preview) for precise audio alignment and post-processing.

Timestamps are only available in Octave 2 (preview). Set "version": "2" in your TTS request to enable them.

Octave 2 (preview) provides word- and phoneme-level timestamps in streaming responses. These timestamps let you:

  • Align audio with text for captions or highlighting words in real time.
  • Synchronize multimodal experiences (e.g., animated avatars, lip syncing).
  • Post-process speech (e.g., cutting or looping audio segments).

Timestamp support varies by language. Languages that currently lack complete timestamp support will gain it soon.

How timestamps are delivered

When you request audio with Octave 2 (preview), you’ll receive multiple message types over the stream:

  • audio chunks: contain base64-encoded audio data and metadata.
  • timestamp events: boundaries for a word or phoneme with start/end times in milliseconds.

Each timestamp event includes:

  • type: "word" or "phoneme"
  • text: the word or phonetic symbol
  • time.begin and time.end: start and end offsets in milliseconds relative to the audio

Example timestamp events:

// phoneme-level timestamp
{
  "type": "timestamp",
  "request_id": "...",
  "generation_id": "...",
  "snippet_id": "...",
  "timestamp": {
    "type": "phoneme",
    "text": "s",
    "time": {
      "begin": 240,
      "end": 260
    }
  }
}

// word-level timestamp
{
  "type": "timestamp",
  "request_id": "...",
  "generation_id": "...",
  "snippet_id": "...",
  "timestamp": {
    "type": "word",
    "text": "GUESS",
    "time": { "begin": 20, "end": 260 }
  }
}
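
Because audio chunks and timestamp events share one stream, a consumer typically branches on the message type. Below is a minimal Python sketch of that demuxing. It assumes each message has already been parsed into a dict, and that audio chunks arrive as {"type": "audio", "audio": "<base64>"}; those audio field names are illustrative, so check the API reference for the exact shape.

import base64

def handle_message(message: dict) -> None:
    """Route one parsed stream message to audio or timestamp handling."""
    if message.get("type") == "timestamp":
        ts = message["timestamp"]
        # ts["type"] is "word" or "phoneme"; times are in milliseconds.
        print(f'{ts["type"]:>8} {ts["text"]!r}: '
              f'{ts["time"]["begin"]}-{ts["time"]["end"]} ms')
    elif message.get("type") == "audio":  # assumed field names, see note above
        pcm = base64.b64decode(message["audio"])
        # ...append pcm to your playback buffer...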

Using timestamps in practice

Each timestamp event carries the unit type (“word” or “phoneme”), the text or phoneme symbol, and millisecond offsets (begin and end). These arrive interleaved with audio chunks in the stream.

The examples below show how to isolate the timestamp messages from the audio stream and display them as they arrive.

# Stream raw events; show only timestamp payloads
curl -N "https://api.hume.ai/v0/tts/stream/json" \
  -H "X-Hume-Api-Key: $HUME_API_KEY" \
  --json '{
    "utterances": [
      {
        "text": "My friend told me about this amazing place in the West Village.",
        "voice": {
          "name": "Ava Song",
          "provider": "HUME_AI"
        }
      }
    ],
    "version": "2"
  }' \
  | jq -rc 'select(.type=="timestamp") | .timestamp'
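
The same pipeline in Python, using the requests library. This sketch assumes the endpoint emits newline-delimited JSON messages, which is what the jq -rc filter above relies on.

import json
import os

import requests

url = "https://api.hume.ai/v0/tts/stream/json"
headers = {"X-Hume-Api-Key": os.environ["HUME_API_KEY"]}
payload = {
    "utterances": [
        {
            "text": "My friend told me about this amazing place in the West Village.",
            "voice": {"name": "Ava Song", "provider": "HUME_AI"},
        }
    ],
    "version": "2",
}

with requests.post(url, headers=headers, json=payload, stream=True) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if not line:
            continue  # skip keep-alive blank lines
        message = json.loads(line)
        if message.get("type") == "timestamp":
            print(message["timestamp"])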

Best practices

  • Use both levels: Phoneme timestamps help with lip syncing; word timestamps are better for captions (see the lookup sketch after this list).
  • Expect overlap: A word timestamp spans its constituent phonemes; use time.begin/time.end for alignment.
  • Stream incrementally: Timestamps interleave with audio chunks; process them as they arrive.
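
For caption highlighting, the core operation is mapping a playback position to the word being spoken. A minimal sketch, assuming you have collected the word-level timestamp payloads (the timestamp objects from the stream) sorted by time.begin:

from bisect import bisect_right

def current_word(words: list[dict], position_ms: int) -> str | None:
    """Return the word being spoken at position_ms, or None between words."""
    begins = [w["time"]["begin"] for w in words]
    # Find the last word whose begin time is at or before the position.
    i = bisect_right(begins, position_ms) - 1
    if i >= 0 and position_ms < words[i]["time"]["end"]:
        return words[i]["text"]
    return None

Given the word-level example above ({"text": "GUESS", "time": {"begin": 20, "end": 260}}), current_word(words, 120) returns "GUESS", and any position at or past 260 ms falls outside that word.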