Speech prosody

Measure emotional expression from the tone, rhythm, and timbre of speech.

The Expression Measurement API is being sunset.

  • May 14, 2026 - Last day to create new jobs through the Playground
  • June 14, 2026 - Last day to use the API and download job results

The speech prosody model measures 48 dimensions of emotional expression from the non-linguistic qualities of speech: how something is said rather than what is said. It analyzes pitch, pace, intensity, and other vocal characteristics to capture emotional nuances in audio and video. Recommended input file types: .wav, .mp3, .mp4.

Job configuration

Batch API

The following parameters are available when configuring the prosody model for Batch API jobs.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| granularity | string | utterance | Level at which predictions are generated. One of word, sentence, utterance, or conversational_turn. |
| identify_speakers | boolean | false | When enabled, identifies and labels different speakers in the audio (speaker diarization). |
| window | object | — | Sliding window job configuration with length (seconds, min 0.5) and step (seconds, min 0.5). Useful for analyzing long audio at regular intervals instead of natural speech boundaries. |
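The constraints in the table can be checked client-side before a job is submitted. The sketch below assumes a hypothetical helper, `build_prosody_config`, that is not part of the Hume SDK. It assembles the `prosody` object, rejects unknown granularity values and windows shorter than 0.5 seconds, and enforces the documented rule that `conversational_turn` requires `identify_speakers`. Requiring both `length` and `step` is an assumption, and the sketch does not decide how the API treats `granularity` and `window` when both are set.

```python
from typing import Optional

# Granularity values listed in the Batch API parameter table.
VALID_GRANULARITIES = {"word", "sentence", "utterance", "conversational_turn"}
MIN_WINDOW_SECONDS = 0.5  # documented minimum for window.length and window.step


def build_prosody_config(
    granularity: Optional[str] = None,
    identify_speakers: bool = False,
    window_length: Optional[float] = None,
    window_step: Optional[float] = None,
) -> dict:
    """Build a `prosody` model config for a Batch API job (hypothetical helper)."""
    config: dict = {}
    if granularity is not None:
        if granularity not in VALID_GRANULARITIES:
            raise ValueError(f"unknown granularity: {granularity!r}")
        # Documented rule: conversational_turn needs speaker diarization.
        if granularity == "conversational_turn" and not identify_speakers:
            raise ValueError("conversational_turn requires identify_speakers=True")
        config["granularity"] = granularity
    if identify_speakers:
        config["identify_speakers"] = True
    if window_length is not None or window_step is not None:
        # Assumption: both length and step are supplied together.
        if window_length is None or window_step is None:
            raise ValueError("window needs both length and step")
        if window_length < MIN_WINDOW_SECONDS or window_step < MIN_WINDOW_SECONDS:
            raise ValueError("window length and step must be at least 0.5 seconds")
        config["window"] = {"length": window_length, "step": window_step}
    return config
```

For example, `build_prosody_config("sentence", identify_speakers=True)` produces the same `prosody` object used in the example job configuration.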

Streaming API

The prosody model is not configurable in the Streaming API. Enable it by passing an empty object:

import asyncio

from hume import AsyncHumeClient
from hume.expression_measurement.stream.stream.types import Config

async def main():
    client = AsyncHumeClient(api_key="<YOUR_API_KEY>")
    # An empty prosody config enables the model with its default settings
    async with client.expression_measurement.stream.connect(
        options={"config": Config(prosody={})}
    ) as socket:
        result = await socket.send_file("audio.mp3")
        print(result)

asyncio.run(main())

Example job configuration

curl -X POST "https://api.hume.ai/v0/batch/jobs" \
  -H "X-Hume-Api-Key: <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "models": {
      "prosody": {
        "granularity": "sentence",
        "identify_speakers": true
      }
    },
    "urls": ["https://example.com/audio.mp3"]
  }'

The example job configuration above applies to the Batch API. In the Streaming API, the prosody model uses default settings and does not accept job configuration parameters.
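The curl request above can also be built from Python's standard library. The sketch below uses a hypothetical helper, `make_batch_job_request`, that constructs, but does not send, a `urllib.request.Request` with the same URL, headers, and JSON body as the curl example. Using `urllib` rather than the Hume SDK is a choice made only to keep the example dependency-free.

```python
import json
import urllib.request

API_URL = "https://api.hume.ai/v0/batch/jobs"


def make_batch_job_request(api_key: str, urls: list, prosody_config: dict) -> urllib.request.Request:
    """Build (but do not send) a Batch API job request (hypothetical helper)."""
    body = {"models": {"prosody": prosody_config}, "urls": urls}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"X-Hume-Api-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


req = make_batch_job_request(
    "<YOUR_API_KEY>",
    ["https://example.com/audio.mp3"],
    {"granularity": "sentence", "identify_speakers": True},
)
# To submit the job, pass `req` to urllib.request.urlopen(); the response
# schema is not described on this page.
```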

Output

Each prediction includes:

  • Time interval: the begin and end timestamps in seconds
  • Emotion scores: scores for each of the 48 expressions
{
  "grouped_predictions": [
    {
      "id": "unknown",
      "predictions": [
        {
          "text": "I'm so happy to see you",
          "time": {
            "begin": 0.32,
            "end": 1.84
          },
          "confidence": 0.95,
          "speaker_confidence": null,
          "emotions": [
            { "name": "Admiration", "score": 0.107 },
            { "name": "Joy", "score": 0.482 },
            ...
          ]
        }
      ]
    }
  ]
}
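A common first step with this output is to find the strongest expressions in each time interval. The sketch below assumes the prosody results have already been extracted into a dict shaped like the excerpt above (`grouped_predictions` → `predictions` → `time` and `emotions`). Where this object sits in the full job response is not shown here, and `top_emotions` is a hypothetical helper. The `sample` data reuses the example's values, truncated to the two emotions it lists.

```python
from typing import Iterator, Tuple


def top_emotions(result: dict, k: int = 3) -> Iterator[Tuple[float, float, list]]:
    """Yield (begin, end, top-k (name, score) pairs) for each prediction.

    Assumes the structure of the example output: grouped_predictions ->
    predictions, each with a time interval and a list of {name, score} emotions.
    """
    for group in result["grouped_predictions"]:
        for pred in group["predictions"]:
            # Highest score first
            ranked = sorted(pred["emotions"], key=lambda e: e["score"], reverse=True)
            top = [(e["name"], e["score"]) for e in ranked[:k]]
            yield pred["time"]["begin"], pred["time"]["end"], top


sample = {"grouped_predictions": [{"id": "unknown", "predictions": [{
    "text": "I'm so happy to see you",
    "time": {"begin": 0.32, "end": 1.84},
    "emotions": [{"name": "Admiration", "score": 0.107}, {"name": "Joy", "score": 0.482}],
}]}]}

for begin, end, top in top_emotions(sample, k=1):
    print(f"{begin:.2f}-{end:.2f}s {top}")  # prints: 0.32-1.84s [('Joy', 0.482)]
```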

Granularity

The granularity parameter controls how speech is segmented before predictions are generated. This parameter is only available in the Batch API.

| Value | Description |
| --- | --- |
| word | One prediction per transcribed word. Provides the most detailed temporal resolution. |
| sentence | One prediction per sentence, as determined by natural speech pauses and punctuation. |
| utterance | One prediction per utterance, a continuous segment of speech separated by pauses. This is the default. |
| conversational_turn | One prediction per speaker turn. Requires identify_speakers to be enabled. |

Sliding window

The window parameter provides an alternative to granularity-based segmentation. Instead of splitting audio at natural speech boundaries, it analyzes the audio in fixed-length windows, which overlap whenever the step is shorter than the window length.

curl -X POST "https://api.hume.ai/v0/batch/jobs" \
  -H "X-Hume-Api-Key: <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "models": {
      "prosody": {
        "window": {
          "length": 4.0,
          "step": 1.0
        }
      }
    },
    "urls": ["https://example.com/audio.mp3"]
  }'
  • length: Duration of each window in seconds (minimum 0.5).
  • step: How far to advance between windows in seconds (minimum 0.5). A step smaller than the length creates overlapping windows.
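To see how length and step interact, the sketch below lists the window intervals for a clip of a given duration. It is illustrative only and assumes that windows start at 0 seconds and that only windows fitting entirely inside the audio are kept. How the API handles a final partial window is not documented here, and `window_intervals` is a hypothetical helper.

```python
def window_intervals(duration: float, length: float, step: float) -> list:
    """Return (begin, end) pairs for fixed-length windows over `duration` seconds.

    Assumptions: windows start at 0 and a window is kept only if it ends
    at or before `duration`.
    """
    if length < 0.5 or step < 0.5:
        raise ValueError("length and step must be at least 0.5 seconds")
    intervals = []
    i = 0
    # Compute each start from the index to avoid accumulating float error
    while i * step + length <= duration:
        begin = i * step
        intervals.append((begin, begin + length))
        i += 1
    return intervals


print(window_intervals(duration=7.0, length=4.0, step=1.0))
# [(0.0, 4.0), (1.0, 5.0), (2.0, 6.0), (3.0, 7.0)]
```

With the example's length of 4.0 and step of 1.0, each second of audio (away from the edges) falls inside four windows, so consecutive predictions share most of their input.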

Expressions

The speech prosody model measures the following 48 expressions. These are the same expressions measured by the facial expression and vocal burst models.

Admiration, Adoration, Aesthetic Appreciation, Amusement, Anger, Anxiety, Awe, Awkwardness, Boredom, Calmness, Concentration, Contemplation, Confusion, Contempt, Contentment, Craving, Desire, Determination, Disappointment, Disgust, Distress, Doubt, Ecstasy, Embarrassment, Empathic Pain, Entrancement, Envy, Excitement, Fear, Guilt, Horror, Interest, Joy, Love, Nostalgia, Pain, Pride, Realization, Relief, Romance, Sadness, Satisfaction, Shame, Surprise (negative), Surprise (positive), Sympathy, Tiredness, Triumph
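For downstream analysis it is often convenient to turn each prediction's `emotions` list into a fixed-order vector. The sketch below hard-codes the 48 expression names listed above and maps a list of `{"name", "score"}` entries onto that order. `EXPRESSIONS` and `to_vector` are hypothetical names, and the helper raises `KeyError` if a prediction lacks any of the 48 names, on the assumption that each prediction scores all of them.

```python
# The 48 expression names listed above.
EXPRESSIONS = [
    "Admiration", "Adoration", "Aesthetic Appreciation", "Amusement",
    "Anger", "Anxiety", "Awe", "Awkwardness",
    "Boredom", "Calmness", "Concentration", "Contemplation",
    "Confusion", "Contempt", "Contentment", "Craving",
    "Desire", "Determination", "Disappointment", "Disgust",
    "Distress", "Doubt", "Ecstasy", "Embarrassment",
    "Empathic Pain", "Entrancement", "Envy", "Excitement",
    "Fear", "Guilt", "Horror", "Interest",
    "Joy", "Love", "Nostalgia", "Pain",
    "Pride", "Realization", "Relief", "Romance",
    "Sadness", "Satisfaction", "Shame", "Surprise (negative)",
    "Surprise (positive)", "Sympathy", "Tiredness", "Triumph",
]


def to_vector(emotions: list) -> list:
    """Map a prediction's [{"name": ..., "score": ...}] list to scores in EXPRESSIONS order.

    Raises KeyError if any of the 48 names is missing from the prediction.
    """
    scores = {e["name"]: e["score"] for e in emotions}
    return [scores[name] for name in EXPRESSIONS]
```

A fixed order means vectors from different predictions, or from the facial expression and vocal burst models, which share these expressions, can be compared position by position.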