Speech prosody
The speech prosody model measures 48 dimensions of emotional expression from the non-linguistic qualities of speech, specifically
how something is said rather than what is said. It analyzes pitch, pace, intensity, and other vocal characteristics
to capture emotional nuances in audio and video. Recommended input filetypes: .wav, .mp3, .mp4.
Job configuration
Batch API
The following parameters are available when configuring the prosody model for Batch API jobs.
Streaming API
The prosody model is not configurable in the Streaming API. Enable it by passing an empty object:
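A minimal sketch of what that payload might look like, assuming model settings are nested under a `models` key (the exact message shape here is an assumption for illustration, not the verbatim schema):

```json
{
  "models": {
    "prosody": {}
  }
}
```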
Example job configuration
The example job configuration above applies to the Batch API. In the Streaming API, the prosody model uses default settings and does not accept job configuration parameters.
Output
Each prediction includes:
- Time interval: the `begin` and `end` timestamps in seconds
- Emotion scores: scores for each of the 48 expressions
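As an illustration, a single prediction might be shaped along these lines (the field names and emotion values here are assumptions for illustration, not the exact response schema):

```json
{
  "time": { "begin": 0.5, "end": 2.1 },
  "emotions": [
    { "name": "Amusement", "score": 0.31 },
    { "name": "Calmness", "score": 0.12 }
  ]
}
```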
Granularity
The granularity parameter controls how speech is segmented before predictions are generated. This parameter is only
available in the Batch API.
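For example, a Batch job configuration selecting a granularity could look like the following sketch (the `granularity` field name and the `"utterance"` value are assumptions based on the description above, not confirmed parameter values):

```json
{
  "models": {
    "prosody": {
      "granularity": "utterance"
    }
  }
}
```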
Sliding window
The window parameter provides an alternative to granularity-based segmentation. Instead of splitting audio at natural
speech boundaries, it analyzes the audio in fixed-length, overlapping windows.
- length: Duration of each window in seconds (minimum 0.5).
- step: How far to advance between windows in seconds (minimum 0.5). A step smaller than the length creates overlapping windows.
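To make the overlap behavior concrete, here is a small Python sketch (not part of the API) that computes the window boundaries this kind of segmentation would produce for a given clip:

```python
def sliding_windows(duration, length, step):
    """Compute (start, end) boundaries, in seconds, of fixed-length
    windows advanced by `step` over a clip of `duration` seconds."""
    if length < 0.5 or step < 0.5:
        raise ValueError("length and step must each be at least 0.5 seconds")
    windows = []
    start = 0.0
    while start < duration:
        # The final window is clipped to the end of the audio.
        windows.append((start, min(start + length, duration)))
        start += step
    return windows

# A 4-second clip with 2-second windows advanced 1 second at a time
# yields overlapping windows:
print(sliding_windows(4.0, 2.0, 1.0))
# → [(0.0, 2.0), (1.0, 3.0), (2.0, 4.0), (3.0, 4.0)]
```

With `step` equal to `length`, the same function produces back-to-back, non-overlapping windows instead.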
Expressions
The speech prosody model measures the following 48 expressions. These are the same expressions measured by the facial expression and vocal burst models.

