This quickstart guide outlines the process of implementing the Empathic Voice Interface (EVI).
This tutorial utilizes Hume’s TypeScript SDK to consume the Empathic Voice Interface, and can be broken down into five key components: authentication, establishing a secure WebSocket connection, capturing audio input, playing back audio output, and handling interruptions. To see this code fully implemented within a frontend web application, visit the GitHub repo here: hume-evi-typescript-example.
Authenticate
In order to make an authenticated connection we will first need to generate an access token. Doing so will require your API key and client secret. These keys can be obtained by logging into the portal and visiting the API keys page.
In the sample code below, the API key and client secret have been saved to environment variables. Avoid hard coding these values in your project to prevent them from being leaked.
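Below is a minimal sketch of generating the token with the client credentials grant. The endpoint URL, environment variable names, and response shape are assumptions based on a typical client credentials exchange; check Hume’s API reference for the current values.

```typescript
// Minimal sketch: exchange the API key and client secret for an access
// token via the client credentials grant. The endpoint URL, env var
// names, and response shape are assumptions; see Hume's API reference.
async function fetchAccessToken(): Promise<string> {
  const apiKey = process.env.HUME_API_KEY;
  const clientSecret = process.env.HUME_CLIENT_SECRET;
  if (!apiKey || !clientSecret) {
    throw new Error('Missing HUME_API_KEY or HUME_CLIENT_SECRET');
  }

  // Send the credentials as an HTTP Basic auth header rather than
  // hard coding them anywhere in the client bundle.
  const credentials = Buffer.from(`${apiKey}:${clientSecret}`).toString('base64');

  const response = await fetch('https://api.hume.ai/oauth2-cc/token', {
    method: 'POST',
    headers: { Authorization: `Basic ${credentials}` },
    body: new URLSearchParams({ grant_type: 'client_credentials' }),
  });
  if (!response.ok) {
    throw new Error(`Token request failed: ${response.status}`);
  }

  const { access_token } = (await response.json()) as { access_token: string };
  return access_token;
}
```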
Connect
After fetching our access token we can now use it to instantiate the `VoiceClient`. First we define our configuration for the interface. Then we use the client’s `create` method to instantiate it with the specified configuration. Once the client has been instantiated we can set up the WebSocket handlers and call the client’s `connect` method to establish a secure WebSocket connection.
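A minimal sketch of that flow is below, reusing the `fetchAccessToken` sketch above. `VoiceClient`, `create`, and `connect` come from the description above; the import path, configuration shape, and event names are assumptions and may differ in the current SDK.

```typescript
// Import path, configuration shape, and event names are assumptions;
// adjust them to match the installed version of Hume's voice SDK.
import { VoiceClient } from '@humeai/voice';

async function connectToEvi() {
  const accessToken = await fetchAccessToken();

  // Define the configuration for the interface and instantiate the
  // client with it via the static create method.
  const client = VoiceClient.create({
    auth: { type: 'accessToken', value: accessToken },
    hostname: 'api.hume.ai',
  });

  // Set up the WebSocket handlers before connecting.
  client.on('open', () => console.log('WebSocket connection opened'));
  client.on('message', (message) => console.log('Message received', message));
  client.on('error', (error) => console.error('WebSocket error', error));
  client.on('close', () => console.log('WebSocket connection closed'));

  // Establish the secure WebSocket connection.
  client.connect();
  return client;
}
```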
Audio input
There are a few key steps to capturing audio. We’ll need to handle user permissions to access the microphone, use the Media Stream API to capture audio, use the MediaRecorder API to record the audio, convert the recorded Blob to an ArrayBuffer (binary), and finally send the ArrayBuffer through the WebSocket.
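A condensed sketch of those steps follows. It uses the standard browser APIs named above, while the `sendAudio` method name and the 100 ms timeslice are assumptions for illustration.

```typescript
// Sketch of the capture pipeline. `sendAudio` stands in for however your
// SDK version sends binary audio over the WebSocket (name assumed).
async function startAudioCapture(
  client: { sendAudio: (audio: ArrayBuffer) => void },
): Promise<MediaRecorder> {
  // Ask the user for microphone permission via the Media Stream API.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

  // Record the stream with the MediaRecorder API; webm is one of the
  // accepted input formats listed below.
  const recorder = new MediaRecorder(stream, { mimeType: 'audio/webm' });

  recorder.ondataavailable = async (event: BlobEvent) => {
    if (event.data.size === 0) return;
    // Convert the recorded Blob to an ArrayBuffer (binary)...
    const arrayBuffer = await event.data.arrayBuffer();
    // ...and send it through the WebSocket.
    client.sendAudio(arrayBuffer);
  };

  // Emit a chunk of recorded audio roughly every 100 ms (illustrative).
  recorder.start(100);
  return recorder;
}
```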
Accepted audio formats include: `mp3`, `wav`, `aac`, `ogg`, `flac`, `webm`, `avr`, `cdda`, `cvs/vms`, `aiff`, `au`, `amr`, `mp2`, `mp4`, `ac3`, `avi`, `wmv`, `mpeg`, and `ircam`.
Audio output
The response will comprise multiple messages, detailed as follows:
- `user_message`: This message encapsulates the transcription of the audio input. Additionally, it includes expression measurement predictions related to the speaker’s vocal prosody.
- `assistant_message`: For every sentence within the response, an `AssistantMessage` is dispatched. This message not only relays the content of the response but also features predictions regarding the expressive qualities of the generated audio response.
- `audio_output`: Accompanying each `AssistantMessage`, an `AudioOutput` message will be provided. This contains the actual audio (binary) response corresponding to an `AssistantMessage`.
- `assistant_end`: Signifying the conclusion of the response to the audio input, an `AssistantEnd` message is delivered as the final piece of the communication.
Here we will focus on playing the received audio output. To play the audio output from the response we need to define our logic for converting the received binary to a `Blob` and creating an `HTMLAudioElement` to play the audio. We then need to update the client’s on message WebSocket event handler to invoke that logic whenever an `audio_output` message is received. To manage playback of the incoming audio, we’ll implement a queue and play the clips back sequentially.
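A minimal playback sketch is below. The `audio/wav` MIME type is an assumption about the payload format, and `enqueueAudio` is a hypothetical helper to call from the message handler with the decoded binary of each `audio_output` message.

```typescript
// Minimal playback sketch: queue incoming clips and play them back
// sequentially so responses don't overlap. The audio/wav MIME type is
// an assumption; match it to the audio format of your session.
const audioQueue: Blob[] = [];
let currentAudio: HTMLAudioElement | null = null;

function playNext(): void {
  const clip = audioQueue.shift();
  if (!clip) {
    currentAudio = null;
    return;
  }
  // Convert the Blob to an object URL and play it with an HTMLAudioElement.
  const audio = new Audio(URL.createObjectURL(clip));
  audio.onended = () => {
    URL.revokeObjectURL(audio.src); // free the object URL once played
    playNext();
  };
  currentAudio = audio;
  void audio.play();
}

// Hypothetical helper: call from the client's message handler with the
// decoded binary payload of each audio_output message.
function enqueueAudio(binary: ArrayBuffer): void {
  audioQueue.push(new Blob([binary], { type: 'audio/wav' }));
  if (!currentAudio) playNext();
}
```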
Interrupt
Interruptibility is a distinguishing feature of the Empathic Voice Interface. If an audio input is sent through the WebSocket while response messages for a previous audio input are still being received, the response to the previous audio input will stop being sent. Additionally, the interface will send back a `user_interruption` message and begin responding to the new audio input.
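Building on the playback sketch above, handling an interruption can be as simple as stopping the current clip and clearing the queue when a `user_interruption` message arrives; the message shape here is an assumption.

```typescript
// Sketch: on a user_interruption message, stop the clip currently
// playing and drop any queued audio so the new response starts cleanly.
// `currentAudio` and `audioQueue` come from the playback sketch above;
// the message shape is an assumption.
client.on('message', (message) => {
  if (message.type === 'user_interruption') {
    currentAudio?.pause();   // halt the clip that is playing
    currentAudio = null;     // allow the next enqueue to restart playback
    audioQueue.length = 0;   // discard stale queued clips
  }
});
```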