EVI TypeScript Quickstart
A quickstart guide for implementing the Empathic Voice Interface (EVI) with TypeScript.
This tutorial provides step-by-step instructions for implementing EVI using Hume’s TypeScript SDK, and is broken down into five key components:
- Authentication: Authenticate your application with EVI using your credentials.
- Connecting to EVI: Set up a secure WebSocket connection to interact with EVI.
- Capturing & recording audio: Capture audio input and prepare it for processing.
- Audio playback: Play back the processed audio output to the user.
- Interruption: Manage and handle interruptions during the chat.
To see the full implementation within a frontend web application, visit our API examples repository on GitHub: hume-evi-typescript-example.
Authenticate
To establish an authenticated connection, we first need to instantiate the Hume client with our API key and Secret key. These keys can be obtained by logging into the portal and visiting the API keys page.
In the sample code below, the API key and Secret key have been saved to environment variables. Avoid hardcoding these values in your project to prevent them from being leaked.
When using our TypeScript SDK, the Access Token necessary to establish an authenticated connection with EVI is fetched and applied under the hood after the Hume client is instantiated with your credentials.
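A minimal sketch of instantiating the client is shown below. The HUME_API_KEY and HUME_SECRET_KEY environment variable names are placeholders; use whichever names your project defines.

```typescript
import { HumeClient } from "hume";

// Instantiate the Hume client with credentials read from environment
// variables rather than hardcoded strings.
const client = new HumeClient({
  apiKey: process.env.HUME_API_KEY!,
  secretKey: process.env.HUME_SECRET_KEY!,
});
```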
Connect
With the Hume client instantiated with our credentials, we can now establish an authenticated WebSocket connection with EVI and define our WebSocket event handlers. For now we will include placeholder event handlers to be updated in later steps.
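A sketch of what this can look like with the TypeScript SDK follows; it assumes the chat connection is exposed under client.empathicVoice.chat, and the handlers below are placeholders for now.

```typescript
// Establish the authenticated WebSocket connection; the access token is
// fetched under the hood using the credentials supplied above.
const socket = await client.empathicVoice.chat.connect();

// Placeholder event handlers, filled in over the following steps.
socket.on("open", () => {
  console.log("WebSocket connection opened");
});

socket.on("message", (message) => {
  // Audio playback and interruption handling are added in later steps.
});

socket.on("error", (error) => {
  console.error("WebSocket error:", error);
});

socket.on("close", () => {
  console.log("WebSocket connection closed");
});
```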
Audio input
To capture audio and send it through the socket as an audio input, several steps are necessary. First, we need to handle user permissions
to access the microphone. Next, we’ll use the Media Stream API to capture the audio, and the MediaRecorder API to record the captured audio.
We then base64 encode the recorded audio Blob, and finally send the encoded audio through the WebSocket using the sendAudioInput
method.
Accepted audio formats include: mp3, wav, aac, ogg, flac, webm, avr, cdda, cvs/vms, aiff, au, amr, mp2, mp4, ac3, avi, wmv, mpeg, ircam.
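Putting these steps together, a minimal sketch might look like the following. It assumes the socket from the previous step is in scope; the 100 ms chunk interval, the audio/webm MIME type, and the shape of the sendAudioInput payload are assumptions to adapt to your setup.

```typescript
// Convert a recorded audio Blob into a base64-encoded string.
function convertBlobToBase64(blob: Blob): Promise<string> {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onloadend = () => {
      // Strip the "data:<mime>;base64," prefix from the data URL.
      resolve((reader.result as string).split(",")[1]);
    };
    reader.onerror = reject;
    reader.readAsDataURL(blob);
  });
}

// Prompt for microphone access, record the captured stream in short
// chunks, and send each base64-encoded chunk through the WebSocket.
async function captureAudio(): Promise<void> {
  // Request microphone permissions and capture the audio MediaStream.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

  // Record the captured stream; audio/webm is one of the accepted formats.
  const recorder = new MediaRecorder(stream, { mimeType: "audio/webm" });

  recorder.ondataavailable = async ({ data }) => {
    if (data.size > 0) {
      const encodedAudioData = await convertBlobToBase64(data);
      // Send the encoded audio to EVI (payload shape assumed here).
      socket.sendAudioInput({ data: encodedAudioData });
    }
  };

  // Emit a chunk of recorded audio every 100 ms.
  recorder.start(100);
}
```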
Audio output
The response will comprise multiple messages, detailed as follows:
- user_message: This message encapsulates the transcription of the audio input. Additionally, it includes expression measurement predictions related to the speaker’s vocal prosody.
- assistant_message: For every sentence within the response, an AssistantMessage is dispatched. This message not only relays the content of the response but also features predictions regarding the expressive qualities of the generated audio response.
- audio_output: Accompanying each AssistantMessage, an AudioOutput message will be provided. This contains the actual audio (binary) response corresponding to an AssistantMessage.
- assistant_end: Signifying the conclusion of the response to the audio input, an AssistantEnd message is delivered as the final piece of the communication.
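In the message handler, these can be distinguished by the message’s type field. A sketch of the dispatch structure, building on the placeholder handler from the connection step:

```typescript
socket.on("message", (message) => {
  switch (message.type) {
    case "user_message":
      // Transcription of the audio input, with prosody predictions.
      break;
    case "assistant_message":
      // Text of the response, with expression predictions for the audio.
      break;
    case "audio_output":
      // Audio corresponding to an assistant_message (played back below).
      break;
    case "assistant_end":
      // Final message for the response to this audio input.
      break;
  }
});
```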
Here we’ll focus on playing the received audio output. To play the audio output from the response, we need to define our logic for converting the received binary to a Blob and creating an HTMLAudioElement to play the audio.
We then need to update the client’s message WebSocket event handler to invoke this logic when an audio output is received. To manage playback of the incoming audio, we’ll implement a queue and play each clip back sequentially, as sketched below.
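The following sketch assumes the audio arrives base64-encoded in the message’s data field; if your transport delivers raw binary, construct the Blob from it directly. The audio/webm content type is an assumption matching the capture step above.

```typescript
// Queue of audio clips awaiting playback, plus the clip currently playing.
const audioQueue: Blob[] = [];
let isPlaying = false;
let currentAudio: HTMLAudioElement | null = null;

// Convert a base64-encoded audio chunk into a Blob.
function convertBase64ToBlob(base64: string, contentType = "audio/webm"): Blob {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return new Blob([bytes], { type: contentType });
}

// Play the next clip in the queue, then continue with the one after it.
function playNextAudio(): void {
  if (isPlaying || audioQueue.length === 0) return;
  isPlaying = true;

  const blob = audioQueue.shift()!;
  currentAudio = new Audio(URL.createObjectURL(blob));
  currentAudio.onended = () => {
    isPlaying = false;
    playNextAudio();
  };
  currentAudio.play();
}

// Invoked from the audio_output case of the message handler: enqueue the
// clip and start playback if nothing is currently playing.
function handleAudioOutput(base64EncodedAudio: string): void {
  audioQueue.push(convertBase64ToBlob(base64EncodedAudio));
  playNextAudio();
}
```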
Interrupt
Interruptibility is a distinguishing feature of the Empathic Voice Interface. If an audio input is sent
through the WebSocket while receiving response messages for a previous audio input, the response to
the previous audio input will stop being sent. Additionally, the interface will send back a user_interruption message and begin responding to the new audio input.
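On the client, one way to handle this is to add a user_interruption case to the message handler that stops the clip currently playing and clears the queue, using the playback state from the previous step.

```typescript
// Stop the clip currently playing and discard any queued audio so the
// interface can respond to the new audio input immediately.
function stopAudioPlayback(): void {
  if (currentAudio) {
    currentAudio.pause();
    currentAudio = null;
  }
  isPlaying = false;
  audioQueue.length = 0;
}

// In the message handler from earlier:
//   case "user_interruption":
//     stopAudioPlayback();
//     break;
```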