Empathic Voice Interface (EVI)

Quickstart

This quickstart guide outlines the process of implementing the Empathic Voice Interface (EVI).

Select a language below to get started:

This tutorial uses Hume’s TypeScript SDK to consume the Empathic Voice Interface, and can be broken down into five key components: authentication, establishing a secure WebSocket connection, capturing the audio input, playing back the audio output, and handling interruptions. To see this code fully implemented within a frontend web application, visit the GitHub repo here: hume-evi-typescript-example.

Authenticate

In order to make an authenticated connection, we will first need to generate an access token. Doing so requires your API key and client secret. These keys can be obtained by logging into the portal and visiting the API keys page.

In the sample code below, the API key and client secret have been saved to environment variables. Avoid hard coding these values in your project to prevent them from being leaked.

TypeScript
import { fetchAccessToken } from '@humeai/voice';

// initialize access token to make authenticated requests.
let accessToken;

// fetch and set the access token
(async () => {
  accessToken = await fetchAccessToken(
    import.meta.env.HUME_API_KEY,
    import.meta.env.HUME_CLIENT_SECRET
  );
})();
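
You may also want to fail fast when either key is missing. The small guard below is an optional sketch (the variable names simply mirror the environment variables used in the snippet above) and can run before fetching the token:

TypeScript
// optional: verify the environment variables are present before authenticating
const apiKey = import.meta.env.HUME_API_KEY;
const clientSecret = import.meta.env.HUME_CLIENT_SECRET;

if (!apiKey || !clientSecret) {
  throw new Error('Missing HUME_API_KEY or HUME_CLIENT_SECRET environment variable');
}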

Connect

After fetching our access token, we can use it to instantiate the VoiceClient. First, we define our configuration for the interface. Then we pass that configuration to the VoiceClient.create method to instantiate the client. Once the client has been instantiated, we can set up the WebSocket event handlers and call the client’s connect method to establish a secure WebSocket connection.

TypeScript
import { VoiceClient, createSocketConfig } from '@humeai/voice';

// define EVI configuration
const config = createSocketConfig({
  auth: { type: 'accessToken', value: accessToken },
});

// instantiate client with configuration
const client = VoiceClient.create(config);

// set up event handlers for WebSocket
client.on('open', () => {
  console.log('WebSocket connection opened');
});
client.on('message', (message) => {
  console.log(message);
});
client.on('close', () => {
  console.log('WebSocket connection closed');
});

// establish an authenticated WebSocket connection
client.connect()
  .then(() => {
    /* handle success */
  })
  .catch(() => {
    /* handle error */
  });

Audio input

There are a few key steps to capturing audio. We’ll need to handle user permissions to access the microphone, use the Media Stream API to capture the audio, use the MediaRecorder API to record it, convert the recorded Blob to an ArrayBuffer (binary), and finally send the ArrayBuffer through the WebSocket.

TypeScript
import { getAudioStream, checkForAudioTracks } from '@humeai/voice';

// mimeType used for recording; referenced later when converting audio output for playback
let mimeType;

// define function for capturing audio
async function captureAudio() {
  // prompt for permission and get the audio stream if granted
  const audioStream = await getAudioStream();
  // ensure there is only one audio track in the media stream
  checkForAudioTracks(audioStream);
  // get the supported mimeType
  const result = getSupportedMimeType();
  mimeType = result.mimeType;
  // create a MediaRecorder to record the audio stream
  const rec = new MediaRecorder(audioStream, { mimeType });
  // event handler for when recorded audio data is available
  rec.ondataavailable = async ({ data }) => {
    if (data.size > 0 && client.readyState === WebSocket.OPEN) {
      // convert the recorded Blob to an ArrayBuffer (binary)
      const buffer = await data.arrayBuffer();
      // send the buffer (audio binary) through the WebSocket
      client.sendAudio(buffer);
    }
  };
  // start recording, emitting audio data every 100ms
  rec.start(100);
}

// update the WebSocket open event handler to capture audio
client.on('open', async () => {
  console.log('WebSocket connection opened');
  await captureAudio();
});

Accepted audio formats include: mp3, wav, aac, ogg, flac, webm, avr, cdda, cvs/vms, aiff, au, amr, mp2, mp4, ac3, avi, wmv, mpeg, ircam.
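
The captureAudio function above calls getSupportedMimeType, which is not defined in this excerpt; depending on your SDK version it may be exported by '@humeai/voice'. Otherwise, a minimal sketch of such a helper, built on the browser’s MediaRecorder.isTypeSupported, might look like the following (the candidate list and return shape are assumptions chosen to match the usage above):

TypeScript
// sketch: pick a recording mimeType the current browser supports
function getSupportedMimeType(): { mimeType: string } {
  // candidate types in order of preference (an assumption; adjust as needed)
  const candidates = ['audio/webm', 'audio/mp4', 'audio/ogg', 'audio/wav'];
  const mimeType = candidates.find((type) => MediaRecorder.isTypeSupported(type));
  if (!mimeType) {
    throw new Error('No supported audio mimeType found for MediaRecorder');
  }
  return { mimeType };
}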

Audio output

The response will comprise multiple messages, detailed as follows:

  1. user_message: This message encapsulates the transcription of the audio input. Additionally, it includes expression measurement predictions related to the speaker’s vocal prosody.
  2. assistant_message: For every sentence within the response, an AssistantMessage is dispatched. This message not only relays the content of the response but also features predictions regarding the expressive qualities of the generated audio response.
  3. audio_output: Accompanying each AssistantMessage, an AudioOutput message will be provided. This contains the actual audio (binary) response corresponding to an AssistantMessage.
  4. assistant_end: Signifying the conclusion of the response to the audio input, an AssistantEnd message is delivered as the final piece of the communication.
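
The list above maps directly to a switch over message.type. The sketch below simply logs each message as it arrives so you can inspect the payloads (the exact payload fields vary by message type, so consult the API reference before reaching into them); the playback logic that follows builds on the audio_output case.

TypeScript
// sketch: inspect incoming messages by type before wiring up audio playback
client.on('message', (message) => {
  switch (message.type) {
    case 'user_message':
      // transcription of the audio input, plus prosody predictions
      console.log('user_message', message);
      break;
    case 'assistant_message':
      // one per sentence of the response, with expression predictions
      console.log('assistant_message', message);
      break;
    case 'audio_output':
      // audio (binary) corresponding to an assistant_message
      console.log('audio_output received');
      break;
    case 'assistant_end':
      // marks the end of the response
      console.log('assistant_end', message);
      break;
  }
});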

Here we will focus on playing the received audio output. To play the audio output from the response, we need to define our logic for converting the received binary to a Blob and creating an HTMLAudioElement to play the audio. We then need to update the client’s message WebSocket event handler to invoke this playback logic whenever audio output is received. To manage playback of the incoming audio, we’ll implement a queue and play the audio back sequentially.

TypeScript
import { base64ToBlob } from '@humeai/voice';

// queue to hold the audio blobs
let audioQueue: Blob[] = [];
// flag to check if audio is currently being played
let isPlaying = false;
// the current audio element being played
let currentAudio: HTMLAudioElement | null = null;

// function to play the next audio blob from the queue
function playAudio(): void {
  if (audioQueue.length > 0 && !isPlaying) {
    // update isPlaying state
    isPlaying = true;
    // pull next audio output from the queue
    const audioBlob = audioQueue.shift();
    if (audioBlob) {
      const audioUrl = URL.createObjectURL(audioBlob);
      // convert the Blob to an HTMLAudioElement for playback
      currentAudio = new Audio(audioUrl);
      // play audio
      currentAudio.play();
      // callback for when audio finishes playing
      currentAudio.onended = () => {
        // update isPlaying state
        isPlaying = false;
        // attempt to play next audio output from queue
        if (audioQueue.length) playAudio();
      };
    }
  }
}

// update the client's WebSocket message event handler.
client.on('message', (message) => {
  switch (message.type) {
    case 'audio_output': {
      // extract audio output from the message
      const audioOutput = message.data;
      // convert audio output binary (base64) to a Blob
      const blob = base64ToBlob(audioOutput, mimeType);
      // add Blob to the playback queue
      audioQueue.push(blob);
      // attempt to play next audio output from the queue
      if (audioQueue.length <= 1) {
        playAudio();
      }
      break;
    }
  }
});

Interrupt

Interruptibility is a distinguishing feature of the Empathic Voice Interface. If audio input is sent through the WebSocket while response messages for a previous audio input are still being received, the interface stops sending the response to the previous input. Additionally, the interface will send back a user_interruption message and begin responding to the new audio input.

TypeScript
// function for stopping the current audio and clearing the queue
function stopAudio(): void {
  // pause the current audio, if any
  currentAudio?.pause();
  currentAudio = null;
  // reset playback state and clear the queue
  isPlaying = false;
  audioQueue.length = 0;
}

// update client's WebSocket message event handler
client.on('message', (message) => {
  switch (message.type) {
    case 'audio_output': {
      // extract audio output from the message
      const audioOutput = message.data;
      // convert audio output binary (base64) to a Blob
      const blob = base64ToBlob(audioOutput, mimeType);
      // add Blob to the playback queue
      audioQueue.push(blob);
      // play next audio output from the audio playback queue
      if (audioQueue.length <= 1) {
        playAudio();
      }
      break;
    }
    case 'user_interruption':
      // stop audio playback and clear audio queue
      stopAudio();
      break;
  }
});