Speech-to-Speech (EVI)

Hume’s Empathic Voice Interface (EVI) is an advanced, real-time emotionally intelligent voice AI. EVI measures users’ nuanced vocal modulations and responds to them using a speech-language model, which guides language and speech generation.

By processing the tune, rhythm, and timbre of speech, EVI unlocks a variety of new capabilities, like knowing when to speak and generating more empathic language with the right tone of voice.

These features enable smoother and more satisfying voice-based interactions between humans and AI, opening new possibilities for personal AI, customer service, accessibility, robotics, immersive gaming, VR experiences, and much more.

EVI features

Version comparison

Feature	EVI 3	EVI 4-mini
Languages supported	English	English, Japanese, Korean, Spanish, French, Portuguese, Italian, German, Russian, Hindi, Arabic
Quick responses	Available	Unavailable
Supplemental LLM	Optional	Required

Basic capabilities

Feature	Description
Transcription (ASR)	Fast and accurate ASR returns a full transcript of the conversation, with Hume’s expression measures tied to each sentence.
Text response (LLM)	Rapid language generation with our speech-language model, optionally supplemented with configurable partner APIs (Anthropic, OpenAI, Google, Fireworks, and more).
Voice response (TTS)	Streamed speech generation via our speech-language model.
Low latency response	Immediate response provided by the fastest models running together on one service.

Empathic AI Features

Feature	Description
Responds at the right time	Uses your tone of voice for state-of-the-art end-of-turn detection — the true bottleneck to responding rapidly without interrupting you.
Understands users’ prosody	Provides streaming measurements of the tune, rhythm, and timbre of the user’s speech using Hume’s prosody model, integrated with our speech-language model.
Forms its own natural tone of voice	Guided by the users’ prosody and language, our model responds with an empathic, naturalistic tone of voice, matching the users’ nuanced “vibe” (calmness, interest, excitement, etc.). It responds to frustration with an apologetic tone, to sadness with sympathy, and more.
Responds to expression	Powered by our empathic large language model (speech-language model), EVI crafts responses that are not just intelligent but attuned to what the user is expressing with their voice.
Always interruptible	Stops rapidly whenever users interject, listens, and responds with the right context based on where it left off.
Multi-lingual	EVI 4-mini supports English, Japanese, Korean, Spanish, French, Portuguese, Italian, German, Russian, Hindi, Arabic.

Quickstart

Kickstart your integration with our quickstart guides for Next.js, TypeScript, and Python. Each guide walks you through integrating the EVI API, capturing user audio, and playing back EVI’s response so you can get up and running quickly.

Next.js Quickstart

Build web applications using our React client SDK in Next.js.

TypeScript Quickstart

Develop server-side or frontend applications using our TypeScript SDK.

Python Quickstart

Create integrations in Python using our Python SDK.

Building with EVI

EVI chat sessions run over a real-time WebSocket connection, enabling fluid, interactive dialogue. Users speak naturally while EVI analyzes their vocal expression and responds with emotionally intelligent speech.

Authentication

REST endpoints support the API key authentication strategy. specify your API key in the X-HUME-API-KEY header of your request.

The EVI WebSocket endpoint supports both the API key and Token authentication strategies, specify your API key or Access token in the query parameters of your request.

Configuration

Before starting a session, you’ll need a voice and a configuration.

Design a voice, clone an existing one, or select one from Hume’s extensive Voice Library.
Build an EVI configuration to define system behavior, voice selection, and other settings.

Connection

The EVI Playground is the easiest way to test your configuration. It lets you speak directly with EVI using your selected voice and settings, without writing any code.

To begin a conversation, connect using the EVI WebSocket URL start streaming the user’s audio input, via audio_input messages. EVI responds in real time with a sequence of structured messages:

user_message: Message containing a transcript of the user’s message along with their vocal expression measures
assistant_message: Message containing EVI’s response content.
audio_output: EVI’s response audio corresponding with the assistant_message
assistant_end: Message denoting the end of EVI’s response.

Developer tools

Hume provides a suite of developer tools to integrate and customize EVI.

WebSocket API Reference

Connect with EVI via WebSocket, including message formats and response types.

REST API Reference

Manage EVI configurations and access your chat history.

SDKs

Use official SDKs to streamline integration in Python and web-based projects.

Sample code

Browse example projects demonstrating EVI integration in different frameworks.

API limits

The following limits apply to Hume’s Speech-to-Speech (EVI) API.

Limit	Value
Concurrent sessions	Defined by your subscription tier
Maximum session duration	30 minutes
Maximum message size (WebSocket)	16 MB
Request rate limit (HTTP)	100 requests/second

The EVI API supports thousands of concurrent sessions. To increase limits:

Upgrade your account to Business or Enterprise.
Submit the Sales & Partnerships form.