Speech prosody
The speech prosody model measures 48 dimensions of emotional expression from the non-linguistic qualities of speech, specifically
how something is said rather than what is said. It analyzes pitch, pace, intensity, and other vocal characteristics
to capture emotional nuances in audio and video. Recommended input filetypes: .wav, .mp3, .mp4.
Job configuration
Batch API
The following parameters are available when configuring the prosody model for Batch API jobs.
Streaming API
The prosody model is not configurable in the Streaming API. Enable it by passing an empty object:
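A minimal sketch of what that payload might look like, assuming model settings are nested under a `models` key (the exact message shape here is an assumption for illustration, not the verbatim schema):

```json
{
  "models": {
    "prosody": {}
  }
}
```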
Example job configuration
The example job configuration above applies to the Batch API. In the Streaming API, the prosody model uses default settings and does not accept job configuration parameters.
Output
Each prediction includes:
- Time interval: the `begin` and `end` timestamps in seconds
- Emotion scores: scores for each of the 48 expressions
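As an illustration, a single prediction might be shaped along these lines (the field names and emotion values here are assumptions for illustration, not the exact response schema):

```json
{
  "time": { "begin": 0.5, "end": 2.1 },
  "emotions": [
    { "name": "Amusement", "score": 0.31 },
    { "name": "Calmness", "score": 0.12 }
  ]
}
```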
Granularity
The granularity parameter controls how speech is segmented before predictions are generated. This parameter is only
available in the Batch API.
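For example, a Batch job configuration selecting a granularity could look like the following sketch (the `granularity` field name and the `"utterance"` value are assumptions based on the description above, not confirmed parameter values):

```json
{
  "models": {
    "prosody": {
      "granularity": "utterance"
    }
  }
}
```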
Sliding window
The window parameter provides an alternative to granularity-based segmentation. Instead of splitting audio at natural
speech boundaries, it analyzes the audio in fixed-length, overlapping windows.
- length: Duration of each window in seconds (minimum 0.5).
- step: How far to advance between windows in seconds (minimum 0.5). A step smaller than the length creates overlapping windows.
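To make the overlap behavior concrete, here is a small Python sketch (not part of the API) that computes the window boundaries this kind of segmentation would produce for a given clip:

```python
def sliding_windows(duration, length, step):
    """Compute (start, end) boundaries, in seconds, of fixed-length
    windows advanced by `step` over a clip of `duration` seconds."""
    if length < 0.5 or step < 0.5:
        raise ValueError("length and step must each be at least 0.5 seconds")
    windows = []
    start = 0.0
    while start < duration:
        # The final window is clipped to the end of the audio.
        windows.append((start, min(start + length, duration)))
        start += step
    return windows

# A 4-second clip with 2-second windows advanced 1 second at a time
# yields overlapping windows:
print(sliding_windows(4.0, 2.0, 1.0))
# → [(0.0, 2.0), (1.0, 3.0), (2.0, 4.0), (3.0, 4.0)]
```

With `step` equal to `length`, the same function produces back-to-back, non-overlapping windows instead.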
Expressions
The speech prosody model measures the following 48 expressions. These are the same expressions measured by the facial expression and vocal burst models.

