Empathic Voice Interface FAQ
We’ve compiled a list of frequently asked questions from our developer community. If your question isn’t listed, we invite you to join the discussion on our Discord.
Is EVI multilingual?
Today, EVI 2 supports English, while EVI 3 supports both English and Spanish. We’re actively expanding language support for EVI 3, with French, German, Italian, and Japanese coming soon!
We’d love your input—join the conversation on our feedback board, where you can upvote requested languages, leave comments, or suggest new ones.
How many concurrent connections does EVI support?
By default, EVI supports up to 5 concurrent connections for testing and development. For production environments requiring higher capacity, you can request an increased limit by filling out our request form. EVI is designed to scale seamlessly, and we can support deployments with thousands of concurrent users.
What language model does EVI use?
Our API is based on our own empathic speech-language model and can blend in responses from any external LLM API. Please visit our configuration guide for up-to-date information on Hume’s default configuration options.
Does my choice of supplemental LLM affect EVI's pricing?
No. The cost of supplemental LLMs is not added to EVI’s pricing; Hume covers LLM costs while we make optimizations that will make language generation much cheaper for our customers. You pay a single, consistent price per minute regardless of which supplemental LLM you choose, and you can switch between any of the supported models without additional charges.
How can I use my own API key for the LLM provider?
When sending messages through EVI’s WebSocket, you can specify your own language_model_api_key in the SessionSettings message. Check out our API reference for more information.
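For example, here is a minimal sketch of sending that message in Python. The WebSocket URL and the api_key query parameter shown are assumptions; check the API reference for the exact connection details.

```python
import asyncio
import json

import websockets

# Assumed WebSocket endpoint and api_key query parameter -- check the EVI
# API reference for the exact connection details.
EVI_URL = "wss://api.hume.ai/v0/evi/chat?api_key=YOUR_HUME_API_KEY"

async def set_llm_api_key() -> None:
    async with websockets.connect(EVI_URL) as socket:
        # Send session settings early in the chat so that subsequent
        # responses are generated with your own LLM provider credentials.
        session_settings = {
            "type": "session_settings",
            "language_model_api_key": "YOUR_LLM_PROVIDER_API_KEY",
        }
        await socket.send(json.dumps(session_settings))

asyncio.run(set_llm_api_key())
```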
What do EVI's expression labels and measures mean?
These outputs reflect our prosody model’s confidence that the speaker is expressing the label in their tone of voice and language. Our prosody model is derived from extensive perceptual studies of emotional expressions with millions of participants. The model is trained to pick up on vocal modulations and patterns in language that people reliably interpret as expressing specific emotions. Importantly, the labels do not imply that the person is experiencing the emotions.
- Expression labels: Labels like “amusement” represent categories of emotional expression that most people perceive in vocal and linguistic patterns. They are not based on explicit definitions of emotions, but rather on common interpretations of expressive cues.
- Expression measures: These numbers indicate the model’s confidence that a given expression would be interpreted as belonging to a specific category by human observers. They represent the likelihood of a particular interpretation of expressions, not the presence or intensity of a specific emotion.
For more details, see our prosody model documentation and the foundational research by Cowen and Keltner (2017).
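As a small illustration of how these measures can be consumed, the sketch below takes a dictionary of expression scores and lists the interpretations the model is most confident about. The values shown are made up, and the exact shape of the message that carries the scores may differ.

```python
# Illustrative prosody scores for one utterance (values are made up).
# Each value is the model's confidence that listeners would interpret the
# expression as belonging to that category -- not what the speaker feels.
prosody_scores = {
    "amusement": 0.71,
    "interest": 0.43,
    "calmness": 0.22,
    "doubt": 0.08,
}

def top_expressions(scores: dict[str, float], n: int = 3) -> list[tuple[str, float]]:
    """Return the n expression labels the model is most confident about."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]

for label, score in top_expressions(prosody_scores):
    print(f"{label}: {score:.2f}")
```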
Why is prosody (tone-of-voice) measured at the sentence level?
At the word level, prosody measurements are highly dependent on context. Our internal testing shows that they are more stable at the sentence level.
Can EVI integrate with my existing systems?
Yes! EVI supports webhooks and function calling to connect with your databases, APIs, and business logic. This allows you to build voice interfaces that can access real-time information and take actions within your existing infrastructure.
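As a rough illustration of the function-calling side, the sketch below handles a hypothetical tool call received over EVI’s WebSocket and builds a response to send back. The message field names (name, parameters, tool_call_id) and the get_order_status tool are assumptions for illustration only; see the tool use documentation for the exact schema.

```python
import json

def handle_tool_call(message: dict) -> str:
    """Build a tool response for a hypothetical tool call message from EVI."""
    parameters = json.loads(message["parameters"])
    if message["name"] == "get_order_status":
        # Call into your own API, database, or business logic here.
        result = {"order_id": parameters["order_id"], "status": "shipped"}
    else:
        result = {"error": f"unknown tool: {message['name']}"}
    # The returned payload would be sent back to EVI over the WebSocket so
    # it can voice the result to the user.
    return json.dumps({
        "type": "tool_response",
        "tool_call_id": message["tool_call_id"],
        "content": json.dumps(result),
    })
```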
How does Hume’s speech-language model work?
Our speech-language model is a multimodal language model that takes into account both expression measures and language. The speech-language model generates a language response and guides text-to-speech (TTS) prosody.
Why is EVI so much faster than other LLMs?
Hume’s speech-language model is not contingent on other LLMs and is therefore able to generate an initial response much faster than existing LLM services. However, Hume’s Empathic Voice Interface (EVI) is able to integrate other frontier LLMs into its longer responses which are configurable by developers.
How does EVI work with supplemental language models?
EVI uses Hume’s speech-language model (SLM) that processes both audio and text input to generate expressive speech output. This model is used for both voice and text generation by default. However, many developers want to use specific frontier LLMs, or their own custom LLM. To enable this, we support supplemental LLMs with EVI, where the process is as follows:
- EVI transcribes user audio and EVI’s prosody model extracts expression measures from the audio
- The transcribed user message and expression measures (converted to a text format) are sent to the supplemental LLM
- The supplemental LLM generates a text response and sends it back to EVI.
- EVI’s speech-language model voices this text, adjusting its tone, expressiveness, speaking rate, and other characteristics based on the text content. This is not just text-to-speech - it takes into account the previous turns, the user’s speech, and the expressive context to generate the right voice. One good analogy: think of EVI as a skilled actor “acting out” the text from the supplemental LLM, rather than just a robot producing speech for each word.
This system makes EVI interoperable with any LLM, allowing developers to leverage powerful LLMs for text generation while maintaining EVI’s expressive voice capabilities.
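In practice, the supplemental LLM is selected in your EVI configuration. Below is a rough sketch of what creating such a configuration could look like; the REST endpoint, header name, and language_model field names are assumptions based on the configuration guide, so verify them against the API reference.

```python
import requests

# Sketch of creating an EVI configuration that selects a supplemental LLM.
# The endpoint path, header name, and language_model fields are assumptions;
# verify them against the configuration guide and API reference.
response = requests.post(
    "https://api.hume.ai/v0/evi/configs",
    headers={"X-Hume-Api-Key": "YOUR_HUME_API_KEY"},
    json={
        "name": "my-supplemental-llm-config",
        "language_model": {
            "model_provider": "ANTHROPIC",
            "model_resource": "claude-3-5-sonnet-latest",
        },
    },
)
print(response.status_code, response.json())
```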
Which LLM-specific features are supported with supplemental models?
EVI supports features that are common across multiple LLM providers, including:
- Temperature (available for all models)
- Prompt caching (used for Anthropic and OpenAI models without requiring action from EVI developers)
- Tool use (available for Anthropic, OpenAI, and Google models)
- System prompts (available for all models)
Model-specific features like OpenAI’s logprobs and structured output, or Anthropic’s model response prefill, are currently not supported to maintain consistency across LLM providers.
Which supplemental LLM for EVI has the lowest latency?
The landscape of large language models (LLMs) and their providers is constantly evolving, affecting which supplemental LLM is fastest with EVI.
The key factor influencing perceived latency when using EVI is the time to first token (TTFT); lower TTFT is better. The model and provider combination with the smallest TTFT will be the fastest.
Artificial Analysis offers a useful dashboard for comparing model and provider latencies.
Notably, there’s a tradeoff between speed and quality. Larger, slower models are easier to prompt. We recommend testing various supplemental LLM options when implementing EVI.
Does EVI support TTS?
Yes! To perform TTS within an EVI chat session, you can follow the steps below:
- Establish initial connection: Make the initial handshake request to establish the WebSocket connection.
- Send text for synthesis: Send an Assistant Input (assistant_input) message with the text you want to synthesize into speech.
- Receive synthesized speech: After sending an assistant_input message, you will receive an Assistant Message (assistant_message) and Audio Output (audio_output) for each sentence of the provided text. The assistant_message contains the text and expression measurement predictions, while the audio_output message contains the synthesized, emotional audio.
- End of Response: Once all the text has been synthesized into speech, you will receive an Assistant End (assistant_end) message indicating the end of the response.
Before implementing this in code, you can test it out by going to our Portal. Start a call in the EVI Playground, then send an Assistant Message with the text you want to synthesize.
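If you do move to code, here is a minimal sketch of the flow in Python. The WebSocket URL, the api_key query parameter, and the exact field paths inside each received message are assumptions here; check the API reference for the precise connection details and message schemas.

```python
import asyncio
import json

import websockets

# Assumed WebSocket endpoint and api_key query parameter -- check the EVI
# API reference for the exact connection details.
EVI_URL = "wss://api.hume.ai/v0/evi/chat?api_key=YOUR_HUME_API_KEY"

async def synthesize(text: str) -> None:
    async with websockets.connect(EVI_URL) as socket:
        # Send the text to synthesize as an assistant_input message.
        await socket.send(json.dumps({"type": "assistant_input", "text": text}))

        # Collect assistant_message / audio_output pairs until the
        # assistant_end message signals the end of the response.
        async for raw in socket:
            message = json.loads(raw)
            if message["type"] == "assistant_message":
                print("Text:", message["message"]["content"])
            elif message["type"] == "audio_output":
                print("Received a chunk of synthesized audio.")
            elif message["type"] == "assistant_end":
                break

asyncio.run(synthesize("Hello! This sentence will be spoken by EVI."))
```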
Is it possible to pause EVI responses within a chat?
Yes, EVI’s responses can be paused within a chat. See our guide on pausing EVI’s responses for more details.
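As a quick sketch, pausing and resuming are controlled by messages sent over the chat WebSocket. The message type names below are assumptions; the linked guide has the exact messages.

```python
import json

# Assumed message type names for pausing and resuming EVI's responses --
# see the pausing guide for the exact messages.
pause_message = json.dumps({"type": "pause_assistant_message"})
resume_message = json.dumps({"type": "resume_assistant_message"})

# Within an open chat you would send these over the WebSocket, for example:
#   await socket.send(pause_message)   # EVI stops responding but keeps listening
#   await socket.send(resume_message)  # EVI resumes responding
```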
Can I access the transcripts for past conversations with EVI?
Yes! EVI provides full transcripts, expression measurements, and conversation analytics through our Chat history API. These tools help you monitor performance, improve your implementation, understand user satisfaction, and gain insights from interactions at scale.
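As a rough sketch of pulling a past chat’s events with the Chat history API, the snippet below lists the events of a single chat. The endpoint path and header name are assumptions here, so verify them against the API reference.

```python
import requests

# Assumed REST route and header for listing the events (transcript turns,
# expression measures, and so on) of a past chat -- verify against the
# Chat history API reference.
CHAT_ID = "your-chat-id"
response = requests.get(
    f"https://api.hume.ai/v0/evi/chats/{CHAT_ID}/events",
    headers={"X-Hume-Api-Key": "YOUR_HUME_API_KEY"},
)
print(response.status_code)
print(response.json())
```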
Can I access the audio of past conversations with EVI?
Yes, you can listen to your past conversations with EVI using our audio reconstruction feature. This feature allows you to fetch and play back conversations as single audio files. See our full guide for audio reconstruction here.
The audio reconstruction feature is currently in public beta. While we work to resolve bugs as they are identified and reduce the error rate, reconstructions may fail for some chats. Additionally, this feature is not available for accounts with the no data retention option enabled.
Can EVI remember past conversations with the same user?
Yes! With EVI you can easily preserve context across Chats, allowing you to pick up right where you left off. For more details, see our guide to resuming chats.
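As a minimal sketch, resuming typically means passing the identifier of the earlier chat group when opening the new WebSocket connection. The resumed_chat_group_id query parameter name is an assumption here, so check the resuming chats guide for the exact parameter and where to find the group ID.

```python
import asyncio

import websockets

# Assumed query parameter for resuming a previous chat group -- check the
# resuming chats guide for the exact name and where to find the group ID.
CHAT_GROUP_ID = "your-previous-chat-group-id"
EVI_URL = (
    "wss://api.hume.ai/v0/evi/chat"
    f"?api_key=YOUR_HUME_API_KEY&resumed_chat_group_id={CHAT_GROUP_ID}"
)

async def resume_chat() -> None:
    async with websockets.connect(EVI_URL) as socket:
        # EVI now has the context of the earlier chats in this group, so
        # the conversation picks up where it left off.
        ...

asyncio.run(resume_chat())
```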