Vocal Burst

Non-linguistic vocal utterances, including sighs, laughs, oohs, ahhs, umms, and shrieks (to name but a few), are a particularly powerful and understudied modality of expressive behavior. Recent studies reveal that they reliably convey over 24 distinct dimensions of meaning that are extremely well-preserved across most cultures (Cowen et al., 2019; Brooks et al., 2022; Tzirakis et al., 2022).

Non-linguistic vocal utterances differ acoustically from the emotional intonation of speech (prosody) and need to be modeled separately.

Our Vocal Burst Expression Model generates 48 outputs encompassing the 24+ dimensions of meaning that people distinguish in vocal bursts. These 48 outputs also encompass other, alternative conceptualizations for the sake of interpretation and alignment across our different models. As with every model, the labels for each dimension are proxies for how people tend to label the underlying patterns of behavior. They should not be treated as direct inferences of emotional experience.
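As a rough illustration of working with these outputs, the sketch below ranks a set of per-label scores and keeps the strongest few. It assumes the outputs are available as a plain mapping from label name to numeric score. The function name `top_expressions`, the label names, and the score values are all hypothetical and are not taken from the model's actual output format.

```python
from typing import Dict, List, Tuple


def top_expressions(scores: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (label, score) pairs, highest first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical scores for one detected vocal burst.
# Labels and values are invented for illustration only.
example_scores = {
    "Amusement": 0.71,
    "Relief": 0.12,
    "Surprise": 0.33,
    "Tiredness": 0.05,
}

top_two = top_expressions(example_scores, k=2)
print(top_two)
```

A high score in a ranking like this indicates only that listeners would tend to apply that label to the sound. It is not a direct reading of what the speaker feels.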

Our Vocal Burst Description Model provides a more descriptive and categorical view of nonverbal vocal expressions (“gasp,” “mhm,” etc.) intended for use cases such as audio captioning. It generates 67 descriptors, including 30 call types (“sigh,” “laugh,” “shriek,” etc.) and 37 common onomatopoeia transliterations of vocal bursts (“hmm,” “ha,” “mhm,” etc.).

Our vocal burst models are packaged with non-linguistic vocal utterance detection and work on both audio files and videos.