How granular are the outputs of our speech prosody and language models?

We offer several granularity options for the speech prosody and language models.

Our speech prosody model measures the tune, rhythm, and timbre of speech, whereas our language model measures the tone of the words being spoken. With either model, you can annotate emotional expressions at several levels of granularity, ranging from individual words to entire conversational turns. Regardless of granularity, our language model still takes into account up to 50 previous tokens (words or sub-words) of speech; otherwise, it would not be able to capture how the meaning of the words is affected by context.
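The context window described above can be pictured with a minimal sketch. It only illustrates the idea of "up to 50 previous tokens"; the function name `context_for` and the placeholder token strings are hypothetical, and the real model's tokenizer and internals are not shown here.

```python
from typing import List

# Taken from the text above: the language model looks at up to 50 previous tokens.
MAX_CONTEXT_TOKENS = 50


def context_for(tokens: List[str], index: int,
                max_context: int = MAX_CONTEXT_TOKENS) -> List[str]:
    """Return the tokens that precede tokens[index], capped at max_context."""
    start = max(0, index - max_context)
    return tokens[start:index]


# Placeholder tokens standing in for words or sub-words (illustrative only).
tokens = [f"tok{i}" for i in range(120)]

early = context_for(tokens, 3)    # fewer than 50 tokens are available
late = context_for(tokens, 100)   # the window is capped at 50 tokens
```

Early in a passage the context is simply whatever has been said so far; later, only the most recent 50 tokens are kept, whichever granularity the output is reported at.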


Word

At the word level, our models provide a separate output for each word, offering the most granular insight into emotional expression during speech.


Sentence

At the sentence level of granularity, we annotate the emotional tone of each spoken sentence with our prosody and language models.


Utterance

Utterance-level granularity is between word- and sentence-level. It takes into account natural pauses or breaks in speech, providing more rapidly updated measures of emotional expression within a flowing conversation. For text inputs, utterance-level granularity will produce results identical to sentence-level granularity.

Conversational Turn

Conversational turn-level analysis is the coarsest level of granularity. It produces a single output for each turn; that is, the full sequence of words and sentences spoken uninterrupted by each person. This approach provides a higher-level view of the emotional dynamics in a multi-participant dialogue. For text inputs, specifying conversational turn-level granularity for our language model will produce results for the entire passage.
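To make the notion of a turn concrete, here is a small sketch that merges consecutive words from the same speaker into one turn. It assumes the input is already a list of (speaker, word) pairs; that input shape and the name `group_into_turns` are hypothetical and do not reflect how our models segment audio internally.

```python
from itertools import groupby
from typing import List, Tuple


def group_into_turns(words: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge consecutive (speaker, word) pairs from the same speaker into turns."""
    turns = []
    # groupby yields one run per stretch of uninterrupted speech by a speaker.
    for speaker, run in groupby(words, key=lambda pair: pair[0]):
        turns.append((speaker, " ".join(word for _, word in run)))
    return turns


# Illustrative transcript: speaker A, then B, then A again.
transcript = [
    ("A", "How"), ("A", "are"), ("A", "you?"),
    ("B", "Fine."), ("B", "You?"),
    ("A", "Good."),
]

turns = group_into_turns(transcript)
```

In this example the six words collapse into three turns, and turn-level granularity would report one output for each of them.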


Each level of granularity has its own advantages, and the right choice depends on the requirements of your specific application.
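As a rough sketch of how a granularity might be selected per model, the snippet below builds a request configuration as a plain dictionary. The field names (`models`, `prosody`, `language`, `granularity`) and the value strings are assumptions made for illustration; check the API reference for the exact schema before relying on them.

```python
import json

# Assumed value strings for the four levels described above (not confirmed here).
GRANULARITIES = ("word", "sentence", "utterance", "conversational_turn")


def build_models_config(prosody_granularity: str, language_granularity: str) -> dict:
    """Build a hypothetical request config with one granularity per model."""
    for value in (prosody_granularity, language_granularity):
        if value not in GRANULARITIES:
            raise ValueError(f"unknown granularity: {value!r}")
    return {
        "models": {
            "prosody": {"granularity": prosody_granularity},
            "language": {"granularity": language_granularity},
        }
    }


# Example: utterance-level prosody alongside word-level language output.
config = build_models_config("utterance", "word")
print(json.dumps(config, indent=2))
```

Validating the value before sending a request keeps a typo from silently falling back to a default level.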