Differences between EVI versions

Today we support two versions of EVI. This guide details how to specify an EVI version in your configuration, as well as the differences between the two versions—from capabilities to pricing.

Specifying an EVI version

In the Platform UI you can simply select EVI 2 in the “Choose EVI version” step of the create configuration flow. If using the API to create your configuration, see our API reference for how to specify the EVI version in your configuration.
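As a concrete illustration, the snippet below sketches creating a config over HTTP with the version pinned. The endpoint path, the `evi_version` field, and the `X-Hume-Api-Key` header are assumptions based on this guide; consult the API reference for the authoritative schema.

```python
import json
import os
import urllib.request

# Hypothetical sketch: the endpoint path, payload fields, and auth header
# are assumptions, not verified against the API reference.
HUME_CONFIGS_URL = "https://api.hume.ai/v0/evi/configs"

payload = {
    "name": "my-evi2-config",  # any config name you choose
    "evi_version": "2",        # "1" or "2"; selects the EVI version
}

def create_config(api_key: str) -> bytes:
    """POST the config payload and return the raw response body."""
    req = urllib.request.Request(
        HUME_CONFIGS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "X-Hume-Api-Key": api_key,  # assumed header name
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Only attempt the request if an API key is available in the environment.
if os.environ.get("HUME_API_KEY"):
    print(create_config(os.environ["HUME_API_KEY"]))
```

The same payload shape applies when using the SDKs; only the transport differs.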

Feature comparison between versions

This table provides a comprehensive comparison of features between EVI 1 and EVI 2, highlighting the new capabilities introduced in the latest version.

| Feature | EVI 1 | EVI 2 |
| --- | --- | --- |
| Voice quality | Similar to best TTS solutions | Significantly improved naturalness, clarity, and expressiveness |
| Response latency | ~900–2000 ms | ~500–800 ms (about 2x faster) |
| Emotional intelligence | Empathic responses informed by expression measures | End-to-end understanding of voice augmented with emotional intelligence training |
| Base voices | 3 core voice options (Kora, Dacher, Ito) | 5 new high-quality base voice options with expressive personalities (8 total) |
| Voice customizability | Supported: select base voices and adjust voice parameters | Supported: extensive customization via parameter adjustments (e.g., pitch, huskiness, nasality) |
| In-conversation voice prompting | Not supported | Supported (e.g., “speak faster”, “sound more excited”, change accents) |
| Multimodal processing | Transcription augmented with high-dimensional voice measures | Fully integrated voice and language processing within a single model, along with transcripts and expression measures |
| Supplemental LLMs | Supported | Supported |
| Tool use and web search | Supported | Supported |
| Custom language model (CLM) | Supported | Supported |
| Configuration options | Extensive support | Extensive support (same options as EVI 1) |
| TypeScript SDK support | Supported | Supported |
| Python SDK support | Supported | Supported |
| Multilingual support | English only | Expanded support for multiple languages |
| Cost | $0.102 per minute | $0.0714 per minute (30% reduction) |

EVI 2

The Empathic Voice Interface 2 (EVI 2) introduces a new architecture that seamlessly integrates voice and language processing. This multimodal approach allows EVI 2 to understand and generate both language and voice, dramatically enhancing key features over EVI 1 while also enabling new capabilities.

EVI 2 can converse rapidly and fluently with users, understand a user’s tone of voice, generate any tone of voice, and even handle niche requests like rapping, changing its style, or speeding up its speech. The model excels at emulating a wide range of personalities, including their accents and speaking styles, and at maintaining personalities that are fun and engaging to interact with. Ultimately, EVI 2 can emulate a personality well suited to each application and user.

In addition, EVI 2 allows developers to create custom voices using a new voice modulation method. Developers can adjust EVI 2’s base voices along a number of continuous scales, including gender, nasality, and pitch. This first-of-its-kind feature enables voices that are unique to an application or even a single user. Further, it does not rely on voice cloning, which currently poses more risks than any other capability of this technology.

The EVI 2 API is currently in beta. We are still making ongoing improvements to the model. In the coming weeks and months, EVI 2 will sound better, speak more languages, follow more complex instructions, and use a wider range of tools.

Key improvements

Improved voice quality

EVI 2 uses an advanced voice generation model connected to our eLLM, which can process and generate both text and audio. This results in more natural-sounding speech with better word emphasis, higher expressiveness, and more consistent vocal output.

Faster responses

The integrated architecture of EVI 2 reduces end-to-end latency by roughly 40% relative to EVI 1, now averaging around 500 ms. This significant speed improvement enables more responsive, human-like conversations.

Enhanced emotional intelligence

By processing voice and language in the same model, EVI 2 can better understand the emotional context of user inputs and generate more empathic responses, both in terms of content and vocal tone.

Custom voices and personality

EVI 2 offers new control over the AI’s voice characteristics. Developers can adjust various parameters to tailor EVI 2’s voice to their specific application needs. EVI 2 also supports in-conversation voice prompting, allowing users to dynamically modify EVI’s speaking style (e.g., “speak faster”, “sound excited”) during interactions.

Cost-effectiveness

Despite its advanced capabilities, EVI 2 is 30% more cost-effective than its predecessor, with pricing reduced from $0.1020 to $0.0714 per minute.

Beyond these improvements, EVI 2 also exhibits promising emerging capabilities including speech output in multiple languages. We will make these improvements available to developers as we scale up and improve the model.

We provide the same suite of tools to integrate and customize EVI 2 for your application as we do for EVI 1, and existing EVI developers can easily switch to the new system.

Frequently asked questions

Can I update my existing EVI 1 configuration to EVI 2?

Yes, for most configs you can update to EVI 2 simply by setting evi_version to 2 in the configuration. If your config uses a supplemental LLM, we recommend selecting the same supplemental LLM for EVI 2; this ensures stability and support for the same tools.
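The upgrade described above amounts to flipping one field. The sketch below assumes a config represented as a plain dict with an `evi_version` field; the exact schema and the endpoint for publishing a new config version are in the API reference:

```python
# Minimal sketch of the upgrade; the config shape is an assumption.
def upgrade_to_evi2(config: dict) -> dict:
    """Return a copy of an EVI config with evi_version set to "2".

    Leaves any supplemental LLM setting unchanged, as recommended above.
    """
    upgraded = dict(config)
    upgraded["evi_version"] = "2"
    return upgraded

# Hypothetical existing EVI 1 config with a supplemental LLM.
legacy = {
    "name": "support-agent",
    "evi_version": "1",
    "language_model": {"model_provider": "OPEN_AI", "model_resource": "gpt-4o"},
}
```

Submitting the upgraded dict through the configs API would then create the EVI 2 version of the config.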

Is EVI 2 ready for production use?

EVI 2 is a beta API and still in active development as of September 2024. Our team has released the API so developers can experiment with EVI 2.

Our developer platform team is available to assist with integration challenges and ensure smooth deployment across applications; join our Discord for assistance.

For companies interested in using EVI 2 in production, please contact our partnerships team.

How does EVI 2’s multimodal processing work?

EVI 2’s multimodal processing integrates voice and text in a single voice-language foundation model. This allows EVI 2 to understand and generate both language and voice in the same latent space, resulting in more coherent and contextually aware responses. EVI 2’s integrated voice-language architecture also offers unprecedented control over both the AI’s personality and voice characteristics. Further, it allows prompting the model to change its speaking style or to follow a personality.

What safety measures are in place for EVI 2?

Hume has implemented several key safety measures for EVI 2:

  • Architectural safeguards: EVI 2’s core architecture prevents unauthorized voice cloning by representing voice characteristics as abstract semantic tokens rather than raw audio data. This allows personality imitation without enabling direct voice replication; EVI 2 is incapable of voice cloning without access to its code. By controlling EVI 2’s identity-related voice characteristics at the architecture level, we force the model to adopt one identity at a time, maintaining a consistent vocal register across sessions. We believe voice cloning currently poses more risks than any other capability of voice AI, which is why these architectural safeguards are a core feature of EVI 2.

  • Customizable language control: Developers can use their own LLMs or modify the supplemental LLM, enabling custom content filtering and safeguards tailored to their specific use cases.

  • Continuous testing: Our team regularly red-teams and tests EVI 2 to identify and address potential vulnerabilities.

  • Usage monitoring: We actively monitor API usage, classify major use cases, and can swiftly intervene if we detect misuse.

  • Clear guidelines: Our terms of use and the Hume Initiative guidelines prohibit malicious applications of our technology.

These measures ensure responsible deployment while providing developers the necessary control and transparency for their specific applications.

Will EVI 2 support languages other than English?

Yes, we plan to make EVI multilingual and support other languages in Q1 2025, in the following order:

  1. Multiple English accents (e.g. Australian, British)
  2. Common European languages (Spanish, German, Italian, French, Portuguese)
  3. Additional languages based on customer demand (including Arabic, Japanese, Korean, Hindi, Dutch, Swedish, Turkish, Russian, Mandarin)

EVI 2’s ability to learn new languages efficiently with minimal data will facilitate this expansion to more languages.

How robust is EVI 2’s speech recognition?

EVI 2’s speech recognition is robust across a wide range of scenarios, remaining highly accurate across varied accents, breathing patterns, and individual speaking styles. Performance may degrade in environments with significant background noise or when multiple speakers overlap; we continue to work on improving performance in challenging acoustic environments.
