Expression Measurement API FAQ

How do I interpret my results?

Our models capture the widest-ever range of facial, speech, vocal, and language modulations with distinct emotional meanings. We label each of their outputs with emotion terms like “amusement” and “doubt,” not because they always correspond to those emotional experiences (they cannot, given that outputs often differ from one modality to another), but because scientific studies show that these kinds of labels are the most precise language we have for describing expressions.

Our models generate JSON or CSV output files with values that typically range from 0 to 1 for each output in different segments of the input file (though values outside the 0-1 range are possible). Higher values indicate greater intensity of the facial movements or vocal modulations most strongly associated with the emotion label corresponding to the output.
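As a minimal sketch of reading these scores, assume one segment's output has already been parsed into a list of name/score pairs. The field names `name` and `score`, the labels, and the values below are assumptions for illustration, not a specification of the output schema. The helper returns the highest-scoring labels for the segment:

```python
from typing import Dict, List, Tuple, Union

Score = Dict[str, Union[str, float]]

# Hypothetical parsed scores for one segment; labels and values are made up.
segment_scores: List[Score] = [
    {"name": "Amusement", "score": 0.62},
    {"name": "Doubt", "score": 0.08},
    {"name": "Interest", "score": 0.41},
    {"name": "Calmness", "score": 0.27},
]


def top_expressions(scores: List[Score], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k (label, score) pairs with the highest scores."""
    ranked = sorted(scores, key=lambda item: item["score"], reverse=True)
    return [(item["name"], item["score"]) for item in ranked[:k]]


print(top_expressions(segment_scores, k=2))
```

Because a given expression is usually a blend, looking at the top few labels together tends to be more informative than reading only the single highest score.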

A given expression will contain a blend of various emotions, and our models identify features that are associated with each emotional dimension. The score for each dimension is proportional to the likelihood that a human would perceive that emotion in the expression.

Specifically, the scores reflect the likelihood that an average human perceiver would use that emotion dimension to describe a given expression. The models were trained on human intensity ratings gathered using the methods described in this paper: Deep learning reveals what vocal bursts express in different cultures.

While our models measure nuanced expressions that people most typically describe with emotion labels, it’s important to remember that they are not a direct readout of what someone is experiencing. Emotional experience is subjective, and its expression is multimodal and context-dependent. Moreover, at any given time, our facial expression outputs might be quite different from our vocal expression outputs. Therefore, it’s important to follow best practices when interpreting outputs.

What can I do with my outputs?

There are many different ways to use our platform. That said, successful research and applications of our models generally follow four steps: exploration, prediction, improvement, and testing.

Exploration: Researchers and developers generally begin by exploring patterns in their data.

Are there apparent differences across participants or users in a study?
Do patterns in expression vary systematically over time?
Are there different patterns in expression associated with different stages of research or different product experiences?
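One way to start answering questions like these is to average a single expression score per participant. The sketch below assumes you have already flattened your outputs into tuples of participant, time, label, and score. That record layout, the labels, and the values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical flattened records: (participant_id, time_in_seconds, label, score)
records = [
    ("p1", 0.0, "Interest", 0.25),
    ("p1", 5.0, "Interest", 0.75),
    ("p2", 0.0, "Interest", 0.125),
    ("p2", 5.0, "Interest", 0.375),
    ("p2", 5.0, "Boredom", 0.5),
]


def mean_score_by_participant(rows, label):
    """Average the scores for one label, grouped by participant."""
    grouped = defaultdict(list)
    for participant, _time, row_label, score in rows:
        if row_label == label:
            grouped[participant].append(score)
    return {participant: mean(values) for participant, values in grouped.items()}


print(mean_score_by_participant(records, "Interest"))
```

The same grouping works for time windows or product stages: replace the participant key with whichever variable you want to compare.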

Prediction: A great way to evaluate and start building on our APIs is to use them to predict metrics that you already know are important.

Are key outcomes like mental health or customer satisfaction better predicted by language and expression than by language alone?
If patterns in expression predict important outcomes, how do these patterns in expression vary over time and reveal critical moments for a user or participant?

Improvement: The goal is often to use measures of expression to directly improve how the application works.

Sometimes, being able to predict an important metric is enough to make a decision. For example, if you can predict whether two people will get along based on their expressions and language, then your application can pair them up.
More formally, you can apply statistics or machine learning to the data you gather to improve how the application works.
You can incorporate our API outputs into an out-of-the-box large language model, simply by converting them into text (e.g., “The user sounds calm but a little frustrated”) and feeding them in as prompts.
You can use expressions to teach an AI model. For example, if your application involves a large language model, such as an AI tutor, you can use measures of expression that predict student performance and well-being to directly fine-tune the AI to improve over time.
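The text-conversion approach described above can be sketched as follows. The function name, the threshold, the labels, and the sentence template are all assumptions chosen for illustration; you would tune the wording to your own application and language model.

```python
def describe_expressions(scores, top_k=2, threshold=0.3):
    """Turn a {label: score} mapping into one sentence suitable for a prompt."""
    # Keep only labels whose score clears the (arbitrary) threshold.
    strong = [(label, value) for label, value in scores.items() if value >= threshold]
    strong.sort(key=lambda pair: pair[1], reverse=True)
    names = [label.lower() for label, _ in strong[:top_k]]
    if not names:
        return "The user's expression shows no strong signals."
    return "The user's expression shows " + " and ".join(names) + "."


# Hypothetical scores for one utterance.
utterance_scores = {"Calmness": 0.7, "Annoyance": 0.4, "Joy": 0.1}
prompt = describe_expressions(utterance_scores) + "\nUser message: Can we try that again?"
print(prompt)
```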

Testing: After you’ve incorporated measures of expression into your application, they can be part of every A/B test you perform. You can now monitor the effects of changes to your application not just on engagement and retention, but also on how much users laugh or sigh in frustration, or show signs of interest or boredom.
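As one illustration of including an expression measure in an A/B test, the sketch below compares a per-user average score between two variants with an exact permutation test on the difference in means. The choice of test is ours, not something the platform prescribes, and the data values are hypothetical. Enumerating every split is only practical for small samples.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference in means."""
    pooled = list(group_a) + list(group_b)
    observed = abs(mean(group_b) - mean(group_a))
    indices = range(len(pooled))
    extreme = total = 0
    # Try every way of assigning len(group_a) pooled values to variant A.
    for chosen in combinations(indices, len(group_a)):
        chosen_set = set(chosen)
        resampled_a = [pooled[i] for i in chosen]
        resampled_b = [pooled[i] for i in indices if i not in chosen_set]
        difference = abs(mean(resampled_b) - mean(resampled_a))
        if difference >= observed - 1e-12:  # small tolerance for float error
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical per-user mean scores for one expression under each variant.
variant_a = [0.10, 0.20, 0.15]
variant_b = [0.50, 0.60, 0.55]
print(permutation_p_value(variant_a, variant_b))
```

A small p-value suggests the variants differ on that expression measure more than random assignment alone would explain. For larger samples, a sampled permutation test or a standard statistics library is a more practical choice.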

As you build expression-related signals, metrics, analyses, models, or feedback into an application, remember to use scientific best practices and follow the ethics guidelines of thehumeinitiative.org.

How granular are the outputs of our speech prosody and language models?

Our speech prosody model measures the tune, rhythm, and timbre of speech, whereas our language model measures the tone of the words being spoken. With either model, you can annotate emotional expressions at several levels of granularity, ranging from individual words to entire conversational turns. Note that regardless of granularity, our language model still takes into account up to 50 previous tokens (words or sub-words) of speech; otherwise, it would not be able to capture how the meaning of the words is affected by context.

Word: At the word level, our model provides a separate output for each word, offering the most granular insight into emotional expression during speech.

Sentence: At the sentence level of granularity, we annotate the emotional tone of each spoken sentence with our prosody and language models.

Utterance: Utterance-level granularity is between word- and sentence-level. It takes into account natural pauses or breaks in speech, providing more rapidly updated measures of emotional expression within a flowing conversation. For text inputs, utterance-level granularity will produce results identical to sentence-level granularity.

Conversational Turn: Conversational turn-level analysis is the coarsest level of granularity. It produces a single output for each turn; that is, the full sequence of words and sentences spoken uninterrupted by each person. This approach provides a higher-level view of the emotional dynamics in a multi-participant dialogue. For text inputs, specifying conversational turn-level granularity for our Language model will produce results for the entire passage.

Remember, each level of granularity has its unique advantages, and choosing the right one depends on the requirements of your specific application.

Why am I seeing more face identifiers than the number of people in the video?

State-of-the-art face detection and identification algorithms still occasionally make errors. For instance, our algorithm sometimes detects faces in shadows or reflections. Other times, our algorithm falsely attributes a new identity to someone who has already been in the video, sometimes due to changes in lighting or occlusion. These errors can result in additional face IDs. We are still working to fine-tune our algorithm to minimize errors in the contexts that our customers care about.
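If spurious face IDs get in the way of your analysis, one simple post-processing heuristic is to ignore IDs that appear in only a small fraction of frames. This is a sketch under our own assumptions (the per-frame list layout, the ID strings, and the cutoff are hypothetical), and it does not merge two IDs that belong to the same person.

```python
from collections import Counter


def stable_face_ids(frame_face_ids, min_fraction=0.1):
    """Return face IDs seen in at least min_fraction of all frames."""
    total_frames = len(frame_face_ids)
    if total_frames == 0:
        return []
    # Count each ID at most once per frame.
    counts = Counter(face_id for ids in frame_face_ids for face_id in set(ids))
    return sorted(
        face_id for face_id, seen in counts.items() if seen / total_frames >= min_fraction
    )


# Hypothetical detections: 20 frames, with one brief false detection ("face_2").
frames = [["face_0", "face_1"]] * 19 + [["face_0", "face_2"]]
print(stable_face_ids(frames))
```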

Why don't I see any vocal bursts in my file?

Our vocal burst model detects vocalizations such as laughs, screams, sighs, gasps, “mms,” “uhs,” and “mhms.” Natural speech generally contains a few vocal bursts every minute, but scripted speech has fewer vocal bursts. If no vocal bursts are detected, it may be because there are no vocal bursts in the file. However, if you hear vocal bursts that aren’t being detected by the algorithm, note that we are also in the process of improving our vocal burst detection algorithm, so please stay tuned for updates.

Why am I getting the "Transcript confidence below threshold value" error?

We’ve documented this issue thoroughly in our API errors page.

You can specify any of the following language tags:

zh, da, nl, en, en-AU, en-IN, en-NZ, en-GB, fr, fr-CA, de, hi, hi-Latn, id, it, ja, ko, no, pl, pt, pt-BR, pt-PT, ru, es, es-419, sv, ta, tr, or uk.
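Before submitting a job, you may want to check a tag against the list above. The helper below is a hypothetical client-side check, not part of any SDK; it matches case-insensitively and returns the tag as spelled in the list.

```python
# Tags copied from the list above.
SUPPORTED_TAGS = [
    "zh", "da", "nl", "en", "en-AU", "en-IN", "en-NZ", "en-GB", "fr", "fr-CA",
    "de", "hi", "hi-Latn", "id", "it", "ja", "ko", "no", "pl", "pt", "pt-BR",
    "pt-PT", "ru", "es", "es-419", "sv", "ta", "tr", "uk",
]
_BY_LOWERCASE = {tag.lower(): tag for tag in SUPPORTED_TAGS}


def resolve_language_tag(tag: str) -> str:
    """Return the listed spelling of a tag, or raise ValueError if it is not listed."""
    key = tag.strip().lower()
    if key not in _BY_LOWERCASE:
        raise ValueError(f"Unsupported language tag: {tag!r}")
    return _BY_LOWERCASE[key]


print(resolve_language_tag("en-gb"))
```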

Which languages are supported?

We support over 50 languages. Among these, 20 languages have additional support for transcription.

Language Tag	Language	Text	Transcription
ar	Arabic
bg	Bulgarian
ca	Catalan
cs	Czech
da	Danish
de	German
el	Greek
en	English*
es	Spanish
et	Estonian
fa	Farsi
fi	Finnish
fr	French
fr-ca	French (Canada)
gl	Galician
gu	Gujarati
he	Hebrew
hi	Hindi
hr	Croatian
hu	Hungarian
hy	Armenian
id	Indonesian
it	Italian
ja	Japanese
ka	Georgian
ko	Korean
ku	Kurdish
lt	Lithuanian
lv	Latvian
mk	FYRO Macedonian
mn	Mongolian
mr	Marathi
ms	Malay
my	Burmese
nb	Norwegian (Bokmål)
nl	Dutch
pl	Polish
pt	Portuguese
pt-br	Portuguese (Brazil)
ro	Romanian
ru	Russian
sk	Slovak
sl	Slovenian
sq	Albanian
sr	Serbian
sv	Swedish
th	Thai
tr	Turkish
uk	Ukrainian
ur	Urdu
vi	Vietnamese
zh-cn	Chinese
zh-tw	Chinese (Taiwan)

*English is a primary language and will yield more accurate predictions than inputs in other supported languages. Currently, our NER model supports only English.

Which programming languages and operating systems support the Expression Measurement API?

The Expression Measurement API works with any operating system and programming language that supports HTTP client libraries for making web requests.

If you’re using Hume’s Python SDK specifically, Expression Measurement is compatible with:

Python versions: 3.9, 3.10, 3.11, and 3.12
Operating systems: macOS, Linux, and Windows

For more information, please visit the Python SDK on GitHub.

When should I use Custom Models?

Custom Models become essential when raw embeddings from Hume’s expression measurement models require further tailoring for specific applications. Here are scenarios where Custom Models offer significant advantages:

Specialized contexts: In environments with unique characteristics or requirements, Custom Models enable the creation of context-specific labels, ensuring more relevant and accurate insights. If your project demands a particular set of labels that are not covered by Hume’s emotional expression labels, Custom Models enable you to create and apply these labels, ensuring that the analysis aligns with your specific objectives.
Iterative model improvement: In evolving fields or scenarios where data and requirements change over time, Custom Models offer the flexibility to iteratively improve and adapt the model with new data and labels.

What is Regression vs. Classification in Custom Model labeling and training?

In labeling, regression involves assigning continuous numerical values, while classification involves categorizing data into discrete labels. During training, regression models learn to predict numerical values, whereas classification models learn to categorize data points into predefined classes.

Classification use cases

Emotion Categorization: Classification excels in distinguishing distinct emotional states, like identifying happiness, sadness, or surprise based on linguistic or physical expression cues.
Binary Emotional Analysis: Useful in binary scenarios such as detecting presence or absence of specific emotional reactions, like engagement or disengagement in a learning environment.
Multi-Emotional Identification: Perfect for classifying a range of emotions in complex scenarios, like understanding varied customer reactions from satisfied to dissatisfied based on their verbal and non-verbal feedback.

Regression use cases

Intensity Measurement: Regression is apt for quantifying the intensity or degree of emotional responses, such as assessing the level of stress or joy from vocal or facial cues.
Emotional Progression Tracking: Ideal for monitoring the fluctuation of emotional states over time, like tracking the development of engagement or anxiety in therapy sessions.

In essence, regression models in emotional expression analysis assign continuous values representing intensities or degrees, while classification models categorize expressions into distinct states or reactions.
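To make the distinction concrete, the sketch below labels the same files both ways: first with continuous intensity values (regression targets), then with discrete categories derived from them (classification labels). The file names, values, and cut points are hypothetical.

```python
def to_class(intensity, low=0.33, high=0.66):
    """Map a continuous intensity to one of three discrete labels."""
    if intensity < low:
        return "low"
    if intensity < high:
        return "medium"
    return "high"


# Regression: each file gets a continuous numerical target.
regression_targets = {"clip_01.wav": 0.12, "clip_02.wav": 0.58, "clip_03.wav": 0.91}

# Classification: each file gets one label from a fixed set.
classification_labels = {name: to_class(value) for name, value in regression_targets.items()}
print(classification_labels)
```

Binning discards information about degree, so prefer regression targets when the intensity itself is what you want to predict.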

What are guidelines for building datasets for Custom Models?

Our custom model pipeline is designed to accommodate a wide range of data types, including audio, videos, and text, automatically integrating multimodal patterns of expression and language. However, not all datasets are created equal. For best results, we recommend using a dataset that meets certain standards:

Dataset size

Ideally, use a dataset consisting of a minimum of 20 files, but more data is always better for model performance.

Media type consistency

All files within a dataset should be of the same media type (video, audio, image, text, etc.).

It’s generally wise to maintain a consistent naming convention and file format for your dataset. At minimum, ensure files have appropriate extensions, such as .wav, .mp3, .aif, .mov, or .mp4.
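A quick pre-upload check for media type consistency might look like the sketch below. The extension-to-type mapping covers only the extensions named above and is our own assumption; extend it to match the files you actually use.

```python
from pathlib import PurePath

# Mapping built from the example extensions mentioned above (an assumption).
EXTENSION_TO_MEDIA = {
    ".wav": "audio",
    ".mp3": "audio",
    ".aif": "audio",
    ".mov": "video",
    ".mp4": "video",
}


def media_types(filenames):
    """Return the set of media types found, using 'unknown' for unmapped extensions."""
    return {
        EXTENSION_TO_MEDIA.get(PurePath(name).suffix.lower(), "unknown")
        for name in filenames
    }


def is_consistent(filenames):
    """True when every file maps to the same known media type."""
    kinds = media_types(filenames)
    return len(kinds) == 1 and "unknown" not in kinds


print(is_consistent(["session_01.wav", "session_02.MP3"]))
```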

Classification vs regression tasks

Depending on your model’s objective (classification or regression), you can use different labeling approaches.

Classification labels: use either strings or integers as labels (e.g., “confused,” “focused”). We limit the number of categorical labels to 50, and you must have at least two (binary).
Regression targets: use either integers or decimals as targets. A model trained on a regression task will predict a continuous numerical value.

Label consistency

We recommend that your labels follow a consistent format; e.g., do not mix integers and strings. Furthermore, be sure to check for any typos in your labels, as these will be treated as separate classes, e.g., “happy” vs. “hapy.”

Class imbalance

If possible, it helps to have a balanced distribution of labels in your dataset. For example, if you have 50 files and two classes, the best case is to have 25 samples per class. Generally, you need at least 10 samples per class to train a useful model, but more data per class is always better.
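The guidelines above for classification datasets can be turned into a simple pre-flight check. This is a sketch with a hypothetical function name; the default limits come from the numbers stated in this section (at least 20 files, 2 to 50 classes, at least 10 samples per class).

```python
from collections import Counter


def check_classification_labels(
    labels, min_files=20, min_classes=2, max_classes=50, min_per_class=10
):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if len(labels) < min_files:
        problems.append(f"only {len(labels)} files; at least {min_files} recommended")
    # Mixed label types (e.g., integers and strings) break label consistency.
    kinds = sorted({type(label).__name__ for label in labels})
    if len(kinds) > 1:
        problems.append(f"labels mix types: {', '.join(kinds)}")
    counts = Counter(labels)
    if not min_classes <= len(counts) <= max_classes:
        problems.append(f"{len(counts)} classes; expected {min_classes} to {max_classes}")
    # Rare classes are often typos, such as "hapy" for "happy".
    for label, count in sorted(counts.items(), key=lambda item: str(item[0])):
        if count < min_per_class:
            problems.append(f"class {label!r} has only {count} sample(s)")
    return problems


labels = ["happy"] * 12 + ["sad"] * 11 + ["hapy"]
print(check_classification_labels(labels))
```

In this example the misspelled label shows up as a class with a single sample, which is usually the quickest way to spot typos before training.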