The Expression Measurement API is being sunset.
Our models capture the widest-ever range of facial, speech, vocal, and language modulations with distinct emotional meanings. We label each of their outputs with emotion terms like “amusement” and “doubt,” not because they always correspond to those emotional experiences (they cannot, given that outputs often differ from one modality to another), but because scientific studies show that these kinds of labels are the most precise language we have for describing expressions.
Our models generate JSON or CSV output files with values typically ranging from 0 to 1 for each output in different segments of the input file (though values out of the 0-1 range are possible). Higher values indicate greater intensity of facial movements or vocal modulations that are most strongly associated with the emotion label corresponding to the output.
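As a minimal sketch of reading such scores, the snippet below ranks the outputs for one segment by intensity. The list-of-objects shape with `name` and `score` keys and the sample values are assumptions made for illustration; check the actual output file for the real field names.

```python
from typing import Dict, List


def top_emotions(emotions: List[Dict], k: int = 3) -> List[Dict]:
    """Return the k highest-scoring outputs, strongest first."""
    return sorted(emotions, key=lambda e: e["score"], reverse=True)[:k]


# Hypothetical scores for a single segment; the values are illustrative only.
segment = [
    {"name": "Amusement", "score": 0.71},
    {"name": "Doubt", "score": 0.12},
    {"name": "Joy", "score": 0.55},
    {"name": "Calmness", "score": 1.03},  # scores can fall outside the 0-1 range
]

for item in top_emotions(segment, k=2):
    print(f'{item["name"]}: {item["score"]:.2f}')
```

The sketch deliberately does not clamp scores to the 0-1 range, since out-of-range values are possible and still carry ordering information.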
A given expression will contain a blend of various emotions, and our models identify features that are associated with each emotional dimension. The score for each dimension is proportional to the likelihood that a human would perceive that emotion in the expression.
Specifically, the scores reflect the likelihood that an average human perceiver would use that emotion dimension to describe a given expression. The models were trained on human intensity ratings gathered using the methods described in this paper: Deep learning reveals what vocal bursts express in different cultures.
While our models measure nuanced expressions that people most typically describe with emotion labels, it’s important to remember that they are not a direct readout of what someone is experiencing. Emotional experience is subjective, and its expression is multimodal and context-dependent. Moreover, at any given time, our facial expression outputs might be quite different from our vocal expression outputs. Therefore, it’s important to follow best practices when interpreting outputs.
There are many different ways to use our platform. That said, successful research and applications of our models generally follow four steps: exploration, prediction, improvement, and testing.
As you build expression-related signals, metrics, analyses, models, or feedback into an application, remember to use scientific best practices and follow the ethics guidelines of thehumeinitiative.org.
Our speech prosody model measures the tune, rhythm, and timbre of speech, whereas our language model measures the tone of the words being spoken. With either model, you can annotate emotional expressions at several levels of granularity, ranging from individual words to entire conversational turns. Note that, independent of granularity, our language model still takes into account up to 50 previous tokens (words or sub-words) of speech; otherwise, it would not be able to capture how the meaning of the words is affected by context.
Word: At the word level, our model provides a separate output for each word, offering the most granular insight into emotional expression during speech.
Sentence: At the sentence level of granularity, we annotate the emotional tone of each spoken sentence with our prosody and language models.
Utterance: Utterance-level granularity is between word- and sentence-level. It takes into account natural pauses or breaks in speech, providing more rapidly updated measures of emotional expression within a flowing conversation. For text inputs, utterance-level granularity will produce results identical to sentence-level granularity.
Conversational Turn: Conversational turn-level analysis is the coarsest level of granularity. It produces a single output for each turn; that is, the full sequence of words and sentences spoken uninterrupted by each person. This approach provides a higher-level view of the emotional dynamics in a multi-participant dialogue. For text inputs, specifying conversational turn-level granularity for our Language model will produce results for the entire passage.
Remember, each level of granularity has its unique advantages, and choosing the right one depends on the requirements of your specific application.
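To make the levels concrete for text inputs, here is a rough illustration of how many outputs each granularity would yield for a short passage. The splitting rules are naive regular expressions chosen for this sketch, not the segmentation the models actually perform, and the granularity names used as strings are assumptions.

```python
import re
from typing import List


def segment_text(text: str, granularity: str) -> List[str]:
    """Naively split text into the units that would each receive one output."""
    text = text.strip()
    if granularity == "word":
        # One unit per word; apostrophes and hyphens stay inside the word.
        return re.findall(r"[\w'-]+", text)
    if granularity in ("sentence", "utterance"):
        # For text, utterance-level results match sentence-level results.
        return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    if granularity == "conversational_turn":
        # For text, the whole passage is treated as a single unit.
        return [text]
    raise ValueError(f"unknown granularity: {granularity!r}")


passage = "I can't wait. Are you coming?"
for level in ("word", "sentence", "utterance", "conversational_turn"):
    print(level, len(segment_text(passage, level)))
```

Finer granularity gives more outputs to aggregate, while coarser granularity gives fewer outputs that each summarize more context.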
State-of-the-art face detection and identification algorithms still occasionally make errors. For instance, our algorithm sometimes detects faces in shadows or reflections. Other times, our algorithm falsely attributes a new identity to someone who has already been in the video, sometimes due to changes in lighting or occlusion. These errors can result in additional face IDs. We are still working to fine-tune our algorithm to minimize errors in the contexts that our customers care about.
Our vocal burst model detects vocalizations such as laughs, screams, sighs, gasps, “mms,” “uhs,” and “mhms.” Natural speech generally contains a few vocal bursts every minute, but scripted speech has fewer vocal bursts. If no vocal bursts are detected, it may be because there are no vocal bursts in the file. However, if you hear vocal bursts that aren’t being detected by the algorithm, note that we are also in the process of improving our vocal burst detection algorithm, so please stay tuned for updates.
This error indicates that our transcription service had difficulty identifying the language spoken in your audio file, or that the audio quality was too low. We prioritize quality and accuracy, so if the service cannot transcribe the audio with confidence, our models won’t be able to process it further.
By default, we use an automated language detection method for our Speech Prosody, Language, and NER models. However, if you know what language is being spoken in your media samples, you can specify it via its BCP-47 tag and potentially obtain more accurate results.
If you see the message above, there are a few steps you can take to resolve the issue:
See the full list of languages supported by the Expression Measurement API here.
You may specify any of the following BCP-47 tags for transcription: zh, da, nl, en, en-AU, en-IN, en-NZ, en-GB, fr, fr-CA, de, hi, hi-Latn, id, it, ja, ko, no, pl, pt, pt-BR, pt-PT, ru, es, es-419, sv, ta, tr, or uk.
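Before submitting a job, it can help to check a tag against this list locally. The following is a minimal sketch: the tag set is copied from the list above, while the helper names and the casing normalization are assumptions of this example rather than part of the API.

```python
SUPPORTED_TRANSCRIPTION_TAGS = frozenset({
    "zh", "da", "nl", "en", "en-AU", "en-IN", "en-NZ", "en-GB", "fr", "fr-CA",
    "de", "hi", "hi-Latn", "id", "it", "ja", "ko", "no", "pl", "pt", "pt-BR",
    "pt-PT", "ru", "es", "es-419", "sv", "ta", "tr", "uk",
})


def normalize_tag(tag: str) -> str:
    """Apply conventional BCP-47 casing: language lower, script Title, region UPPER."""
    parts = tag.strip().replace("_", "-").split("-")
    normalized = [parts[0].lower()]
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            normalized.append(part.title())   # script subtag, e.g. Latn
        elif len(part) == 2 and part.isalpha():
            normalized.append(part.upper())   # region subtag, e.g. GB
        else:
            normalized.append(part)           # numeric region, e.g. 419
    return "-".join(normalized)


def is_supported_for_transcription(tag: str) -> bool:
    """Return True if the tag, after normalization, is in the list above."""
    return normalize_tag(tag) in SUPPORTED_TRANSCRIPTION_TAGS


print(is_supported_for_transcription("en-gb"))
print(is_supported_for_transcription("xx"))
```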
We support over 50 languages. Among these, 21 languages have additional support for transcription.
English is a primary language, and will yield more accurate predictions than inputs in other supported languages. Currently, our NER model only supports the English language.
The Expression Measurement API works with any operating system and programming language that supports HTTP client libraries for making web requests.
If you’re using Hume’s Python SDK specifically, Expression Measurement is compatible with:
Python versions 3.9, 3.10, 3.11, 3.12, and 3.13.
For more information, please visit the Python SDK on GitHub.
Custom Models become essential when raw embeddings from Hume’s expression measurement models require further tailoring for specific applications. Here are scenarios where Custom Models offer significant advantages:
In labeling, regression involves assigning continuous numerical values, while classification involves categorizing data into discrete labels. During training, regression models learn to predict numerical values, whereas classification models learn to categorize data points into predefined classes.
Classification use cases
Regression use cases
In essence, regression models in emotional expression analysis assign continuous values representing intensities or degrees, while classification models categorize expressions into distinct states or reactions.
Our custom model pipeline is designed to accommodate a wide range of data types, including audio, videos, and text, automatically integrating multimodal patterns of expression and language. However, not all datasets are created equal. For best results, we recommend using a dataset that meets certain standards:
Dataset size
Ideally, use a dataset consisting of a minimum of 20 files, but more data is always better for model performance.
Media type consistency
All files within a dataset should be of the same media type (video, audio, image, text, etc.).
It’s generally wise to maintain a consistent naming convention and file format for your dataset. At minimum, ensure files have appropriate extensions, such as .wav, .mp3, .aif, .mov, or .mp4.
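A quick local check along these lines can catch mixed media types before upload. This is a sketch only: the extension-to-type mapping covers just the extensions named above, and both the mapping and the function names are assumptions of this example.

```python
from pathlib import PurePath
from typing import Dict, Iterable, List

# Only the extensions mentioned above; extend this mapping for your own data.
MEDIA_TYPE_BY_EXTENSION = {
    ".wav": "audio", ".mp3": "audio", ".aif": "audio",
    ".mov": "video", ".mp4": "video",
}


def group_by_media_type(filenames: Iterable[str]) -> Dict[str, List[str]]:
    """Group file names by media type inferred from their extension."""
    groups: Dict[str, List[str]] = {}
    for name in filenames:
        extension = PurePath(name).suffix.lower()
        media_type = MEDIA_TYPE_BY_EXTENSION.get(extension, "unknown")
        groups.setdefault(media_type, []).append(name)
    return groups


def is_consistent(filenames: Iterable[str]) -> bool:
    """True when every file maps to the same known media type."""
    groups = group_by_media_type(filenames)
    return len(groups) == 1 and "unknown" not in groups


print(group_by_media_type(["a.wav", "b.MP3", "c.mp4", "notes"]))
```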
Classification vs regression tasks
Depending on your model’s objective (classification or regression), you can use different labeling approaches.
Label consistency
We recommend that your labels follow a consistent format; e.g., do not mix integers and strings. Furthermore, be sure to check for any typos in your labels, as these will be treated as separate classes, e.g., “happy” vs. “hapy.”
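Both problems can be screened for before training. The sketch below reports mixed label types and pairs of string labels that look like near-duplicates, using the standard library's `difflib.SequenceMatcher`. The 0.8 similarity threshold is an arbitrary assumption, and flagged pairs should be reviewed by hand, since genuinely distinct labels can also be similar.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Iterable, List, Set, Tuple


def label_types(labels: Iterable[object]) -> Set[str]:
    """Names of the Python types present; more than one means mixed formats."""
    return {type(label).__name__ for label in labels}


def possible_typos(labels: Iterable[object], threshold: float = 0.8) -> List[Tuple[str, str]]:
    """Pairs of distinct string labels whose similarity meets the threshold."""
    unique = sorted({label for label in labels if isinstance(label, str)})
    return [
        (a, b)
        for a, b in combinations(unique, 2)
        if SequenceMatcher(None, a, b).ratio() >= threshold
    ]


labels = ["happy", "hapy", "sad", "happy"]
print(label_types(labels))
print(possible_typos(labels))
```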
Class imbalance
If possible, it helps to have a balanced distribution of labels in your dataset. For example, if you have 50 files and two classes, the best case is to have 25 samples per class. Generally, you need at least 10 samples per class to train a useful model, but more data per class is always better.
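The balance check can be automated with a simple count. This sketch uses the 10-samples-per-class guideline from the paragraph above; the report structure and the imbalance ratio (largest class divided by smallest) are choices made for this example.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable

MIN_SAMPLES_PER_CLASS = 10  # guideline stated above


def class_report(labels: Iterable[Hashable]) -> Dict[str, object]:
    """Summarize class counts and flag classes below the minimum."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("no labels provided")
    return {
        "counts": dict(counts),
        "underrepresented": sorted(
            label for label, n in counts.items() if n < MIN_SAMPLES_PER_CLASS
        ),
        # 1.0 means perfectly balanced; larger values mean more imbalance.
        "imbalance_ratio": max(counts.values()) / min(counts.values()),
    }


balanced = ["calm"] * 25 + ["tense"] * 25
skewed = ["calm"] * 45 + ["tense"] * 5
print(class_report(balanced))
print(class_report(skewed))
```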
Streaming connections disconnect after one minute of inactivity. Build reconnect logic into your application to handle disconnections gracefully. See the Next.js streaming example for a sample implementation.
WebSocket connections can experience disruptions due to network issues. Implement error handling to gracefully manage connection timeouts, drops, and intermittent issues. Build reconnection logic to automatically resume communication when a connection is lost. For a full list of error responses, see the API errors page.
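One common way to structure reconnection is retry with exponential backoff. The sketch below is library-agnostic: `session` stands for a hypothetical coroutine of your own that opens the connection and does the streaming work, and the retry limits and delays are assumed defaults. WebSocket client libraries often raise their own exception types for closed connections, so pass those in through `retry_on`.

```python
import asyncio
import random
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def run_with_reconnect(
    session: Callable[[], Awaitable[T]],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, asyncio.TimeoutError, OSError),
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Run `session`, retrying with exponential backoff when the connection fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await session()
        except retry_on:
            if attempt == max_attempts:
                raise  # give up and surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # A little jitter avoids many clients reconnecting in lockstep.
            await sleep(delay + random.uniform(0, delay / 10))
    raise RuntimeError("max_attempts must be at least 1")


# Usage with a stand-in session that drops twice before succeeding.
attempts = {"count": 0}


async def flaky_session() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("connection dropped")
    return "done"


async def no_wait(_: float) -> None:
    return None


print(asyncio.run(run_with_reconnect(flaky_session, sleep=no_wait)))
```

Capping the number of attempts keeps a persistent outage from turning into an endless retry loop; errors that are not connection-related are not caught and propagate immediately.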
No. Keep the WebSocket connection open and reuse it for multiple requests until your application is done using the service. To prevent context from leaking across unrelated inputs, use the reset_stream parameter between requests.
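As a rough sketch of this pattern, the generator below builds the messages for several unrelated inputs that share one connection, setting `reset_stream` on the first message of each new input. Only the `reset_stream` name comes from the text above; the `data` field, the overall message shape, and the choice to send the flag with the first payload of each input are assumptions to verify against the API reference.

```python
import json
from typing import Iterable, Iterator, List


def build_messages(inputs: Iterable[List[str]]) -> Iterator[str]:
    """Yield JSON messages, resetting context at the start of each input."""
    for chunks in inputs:
        for index, chunk in enumerate(chunks):
            message = {
                "data": chunk,                # hypothetical payload field
                "reset_stream": index == 0,   # reset only when a new input begins
            }
            yield json.dumps(message)


# Two unrelated inputs sent over the same open connection.
unrelated_inputs = [["first part", "second part"], ["a different conversation"]]
for raw in build_messages(unrelated_inputs):
    print(raw)
```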