Empathic Voice Interface FAQ

We’ve compiled a list of frequently asked questions from our developer community. If your question isn’t listed, we invite you to join the discussion on our Discord.

Our API is based on our own empathic LLM (eLLM) and can blend in responses from any external LLM API. Please visit our configuration guide for up-to-date information on Hume’s default configuration options.

When sending messages through EVI’s WebSocket, you can specify your own language_model_api_key in the SessionSettings message. Please visit our API reference for more information here.
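
For example, providing your own key might look like the Session Settings sketch below. The full schema includes other fields not shown here; see the API reference for the authoritative shape:

  session_settings
  {
    "type": "session_settings",
    // other session settings fields omitted for brevity
    "language_model_api_key": "<YOUR_LLM_PROVIDER_API_KEY>"
  }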

No - supplemental LLM costs are not added on top of EVI’s pricing. Hume covers the cost of supplemental LLMs while we make optimizations that will make language generation much cheaper for our customers, so you pay a single, consistent price per minute regardless of which supplemental LLM you choose. Developers can select any supported LLM at no additional charge, making it easy to switch between models based on your needs.

These outputs reflect our prosody model’s confidence that the speaker is expressing the label in their tone of voice and language. Our prosody model is derived from extensive perceptual studies of emotional expressions with millions of participants. The model is trained to pick up on vocal modulations and patterns in language that people reliably interpret as expressing specific emotions. Importantly, the labels do not imply that the person is experiencing the emotions.

  1. Expression labels: These labels (like “amusement”) represent categories of emotional expression that most people perceive in vocal and linguistic patterns. They are not based on explicit definitions of emotions, but rather on common interpretations of expressive cues.

  2. Expression measures: These numbers indicate the model’s confidence that a given expression would be interpreted as belonging to a specific category by human observers. They represent the likelihood of a particular interpretation of expressions, not the presence or intensity of a specific emotion.
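
As an illustration, a score like the one below (the numbers are made up) should be read as “most listeners would likely perceive amusement in this speech,” not “the speaker is 78% amused”:

  {
    "prosody": {
      "scores": {
        "Amusement": 0.78,
        "Calmness": 0.12
        // ... additional scores
      }
    }
  }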

For more details, see our prosody model documentation and the foundational research by Cowen and Keltner (2017).

At the word level, prosody measurements are highly dependent on context. Our internal testing shows that they are more stable at the sentence level.

Today we only support English; however, we plan to support other languages very soon. Feel free to join the conversation on our feedback board - upvote existing language requests, leave comments, or add new languages.

EVI currently supports 8 base voices - Ito, Kora, Dacher, Aura, Finn, Whimsy, Stella, and Sunny - with plans to introduce more in the future. In the meantime, you can craft your own unique voice by adjusting the attributes of any base option.

Visit the playground to try out the base voices and experiment with voice modulation, and learn more about voice customization in our detailed guide.

If you are interested in creating a custom voice for your use case, please submit a sales inquiry. Our team can train custom TTS models for enterprise customers.

Our empathic large language model (eLLM) is a multimodal language model that takes into account both expression measures and language. The eLLM generates a language response and guides text-to-speech (TTS) prosody.

Hume’s eLLM does not depend on other LLMs and can therefore generate an initial response much faster than existing LLM services. However, Hume’s Empathic Voice Interface (EVI) can integrate other frontier LLMs into its longer responses, and these supplemental LLMs are configurable by developers.

EVI 2 uses Hume’s speech-language model (SLM), which processes both audio and text input to generate expressive speech output. This model handles both voice and text generation by default with EVI 2. However, many developers want to use specific frontier LLMs, or their own custom LLM. To enable this, EVI supports supplemental LLMs, and the process is as follows:

  1. EVI transcribes the user’s audio, and EVI’s prosody model extracts expression measures from the audio.
  2. The transcribed user message and expression measures (converted to a text format) are sent to the supplemental LLM.
  3. The supplemental LLM generates a text response and sends it back to EVI.
  4. EVI 2’s speech-language model voices this text, adjusting its tone, expressiveness, speaking rate, and other characteristics based on the text content. This is not just text-to-speech - it takes into account the previous turns, the user’s speech, and the expressive context to generate the right voice. One good analogy: think of EVI 2 as a skilled actor “acting out” the text from the supplemental LLM, rather than just a robot producing speech for each word.

This system makes EVI 2 interoperable with any LLM, allowing developers to leverage powerful LLMs for text generation while maintaining EVI’s expressive voice capabilities.
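
As a rough sketch, developers select the supplemental LLM through the language model settings of their EVI configuration. The field names and model identifier below are assumptions for illustration; see the configuration guide for the exact schema and the list of supported models:

  language_model
  {
    // illustrative values - consult the configuration guide for supported providers and models
    "model_provider": "ANTHROPIC",
    "model_resource": "claude-3-5-sonnet-latest",
    "temperature": 0.7
  }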

EVI supports features that are common across multiple LLM providers, including:

  • Temperature (available for all models)
  • Prompt caching (used for Anthropic and OpenAI models without requiring action from EVI developers)
  • Tool use (available for Anthropic, OpenAI, and Google models)
  • System prompts (available for all models)

Model-specific features like OpenAI’s logprobs and structured output, or Anthropic’s model response prefill, are currently not supported to maintain consistency across LLM providers.
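
As one example of these shared features, a system prompt can be supplied at runtime in a Session Settings message. This sketch assumes the field is named system_prompt; the API reference has the authoritative schema:

  session_settings
  {
    "type": "session_settings",
    // assumed field name for the runtime system prompt
    "system_prompt": "You are a concise, friendly voice assistant."
  }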

The landscape of large language models (LLMs) and their providers is constantly evolving, affecting which supplemental LLM is fastest with EVI.

The key factor influencing perceived latency using EVI is the time to first token (TTFT), with lower TTFT being better. The model and provider combination with the smallest TTFT will be the fastest.

Artificial Analysis offers a useful dashboard for comparing model and provider latencies.

Notably, there is a tradeoff between speed and quality: larger models are generally easier to prompt but slower to respond. We recommend testing various supplemental LLM options when implementing EVI.

Hume has trained its own expressive text-to-speech (TTS) model, which allows EVI to generate speech with more prosodic and expressive nuance than other models. This TTS model is specifically designed for use within an EVI chat session, allowing EVI to generate speech from a given text input. We do not have a dedicated endpoint for TTS.

To perform TTS within an EVI chat session, you can follow the steps below:

  1. Establish initial connection: Make the initial handshake request to establish the WebSocket connection.
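
    As a sketch, the handshake is a WebSocket upgrade request to EVI’s chat endpoint. The URL and query parameters below are assumptions for illustration; see the API reference for the authoritative endpoint and authentication options:

    wss://api.hume.ai/v0/evi/chat?api_key=<YOUR_API_KEY>&config_id=<YOUR_CONFIG_ID>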

  2. Send text for synthesis: Send an Assistant Input message with the text you want to synthesize into speech:

    assistant_input
    {
      "type": "assistant_input",
      "text": "Text to be synthesized."
    }
  3. Receive synthesized speech: After sending an assistant_input message, you will receive an Assistant Message and Audio Output for each sentence of the provided text.

    The assistant_message contains the text and expression measurement predictions, while the audio_output message contains the synthesized, emotional audio. See the sample messages below:

    assistant_message
    {
      "type": "assistant_message",
      "id": "g8ee90fa2c1648f3a32qrea6d179ee44",
      "message": {
        "role": "assistant",
        "content": "Text to be synthesized."
      },
      "models": {
        "prosody": {
          "scores": {
            "Admiration": 0.0309600830078125,
            "Adoration": 0.0018177032470703125
            // ... additional scores
          }
        }
      },
      "from_text": true
    }
    audio_output
    {
      "type": "audio_output",
      "id": "g8ee90fa2c1648f3a32qrea6d179ee44",
      "data": "<base64 encoded audio>"
    }
  4. End of Response: Once all the text has been synthesized into speech, you will receive an Assistant End message indicating the end of the response:

    assistant_end
    {
      "type": "assistant_end"
    }

Before implementing this in code, you can test it out by going to our Portal. Start a call in the EVI Playground, then send an Assistant Input message with the text you want to synthesize.

Yes, EVI supports pausing its responses. See our guide on pausing EVI’s responses for more details.
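
As a sketch, pausing and resuming are done with dedicated control messages over the WebSocket. The message type names below are assumptions; the guide above is authoritative:

  pause_assistant_message
  {
    "type": "pause_assistant_message"
  }
  resume_assistant_message
  {
    "type": "resume_assistant_message"
  }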

Yes, with EVI, you can easily preserve context when reconnecting or continue a chat right where you left off. For more details, see our guide to resuming chats.
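
As an illustration, context is typically preserved by passing the previous chat group’s ID when reopening the WebSocket connection. The query parameter name below is an assumption; see the guide for the exact parameter:

  wss://api.hume.ai/v0/evi/chat?api_key=<YOUR_API_KEY>&resumed_chat_group_id=<PREVIOUS_CHAT_GROUP_ID>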

Yes, you can listen to your past conversations with EVI using our audio reconstruction feature. This feature allows you to fetch and play back conversations as single audio files. See our full guide for audio reconstruction here.

The audio reconstruction feature is currently in public beta. While we work to resolve bugs as they are identified and to reduce the error rate, reconstructions may fail for some chats. Additionally, this feature is not available for accounts with the no data retention option enabled.

