Empathic Voice Interface (EVI)

Quickstart

This quickstart guide outlines the process of implementing the Empathic Voice Interface (EVI).

This tutorial utilizes Hume’s TypeScript SDK to consume the Empathic Voice Interface, and can be broken down into five key components: authentication, establishing a secure WebSocket connection, capturing the audio input, playing back the audio output, and handling interruptions. To see this code fully implemented within a frontend web application, visit the GitHub repo here: hume-evi-typescript-example.

Authenticate

To establish an authenticated connection, we first need to instantiate the Hume client with our API key and client secret. These keys can be obtained by logging into the portal and visiting the API keys page.

In the sample code below, the API key and client secret have been saved to environment variables. Avoid hard coding these values in your project to prevent them from being leaked.

TypeScript
import { Hume, HumeClient } from 'hume';

// instantiate the Hume client and authenticate
const client = new HumeClient({
  apiKey: import.meta.env.HUME_API_KEY,
  clientSecret: import.meta.env.HUME_CLIENT_SECRET,
});

When using our TypeScript SDK, the access token necessary to establish an authenticated connection with EVI is fetched and applied under the hood after the Hume client is instantiated with your credentials.
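
Note that import.meta.env is available when using a bundler such as Vite; in a plain Node environment the same credentials would typically be read from process.env instead. A minimal sketch under that assumption, reusing the variable names from above:

TypeScript
import { HumeClient } from 'hume';

// assumes HUME_API_KEY and HUME_CLIENT_SECRET are set in the shell,
// or loaded beforehand with a tool such as dotenv
const client = new HumeClient({
  apiKey: process.env.HUME_API_KEY ?? '',
  clientSecret: process.env.HUME_CLIENT_SECRET ?? '',
});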

Connect

Now that the Hume client is instantiated with our credentials, we can establish an authenticated WebSocket connection with EVI and define our WebSocket event handlers. For now we will include placeholder event handlers, to be updated in later steps.

TypeScript
import { Hume, HumeClient } from 'hume';

// instantiate the Hume client and authenticate
const client = new HumeClient({
  apiKey: import.meta.env.HUME_API_KEY,
  clientSecret: import.meta.env.HUME_CLIENT_SECRET,
});

// instantiates WebSocket and establishes an authenticated connection
const socket = await client.empathicVoice.chat.connect({
  onOpen: () => {
    console.log('WebSocket connection opened');
  },
  onMessage: (message) => {
    console.log(message);
  },
  onError: (error) => {
    console.error(error);
  },
  onClose: () => {
    console.log('WebSocket connection closed');
  }
});

Audio input

To capture audio and send it through the socket as an audio input, several steps are necessary. First, we need to handle user permissions to access the microphone. Next, we’ll use the Media Stream API to capture the audio, and the MediaRecorder API to record the captured audio. We then base64 encode the recorded audio Blob, and finally send the encoded audio through the WebSocket using the sendAudioInput method.

TypeScript
import {
  convertBlobToBase64,
  ensureSingleValidAudioTrack,
  getAudioStream,
} from 'hume';

// the recorder responsible for recording the audio stream to be prepared as the audio input
let recorder: MediaRecorder | null = null;
// the stream of audio captured from the user's microphone
let audioStream: MediaStream | null = null;

// define function for capturing audio
async function captureAudio(): Promise<void> {
  // prompts user for permission to capture audio, obtains media stream upon approval
  audioStream = await getAudioStream();
  // ensure there is only one audio track in the stream
  ensureSingleValidAudioTrack(audioStream);
  // instantiate the media recorder
  // (`mimeType` is the browser-supported mime type determined in the audio output step below)
  recorder = new MediaRecorder(audioStream, { mimeType });
  // callback for when recorded chunk is available to be processed
  recorder.ondataavailable = async ({ data }) => {
    // IF size of data is smaller than 1 byte then do nothing
    if (data.size < 1) return;
    // base64 encode audio data
    const encodedAudioData = await convertBlobToBase64(data);
    // define the audio_input message JSON
    const audioInput: Omit<Hume.empathicVoice.AudioInput, 'type'> = {
      data: encodedAudioData,
    };
    // send audio_input message
    socket?.sendAudioInput(audioInput);
  };
  // capture audio input at a rate of 100ms (recommended)
  const timeSlice = 100;
  recorder.start(timeSlice);
}

// define a WebSocket open event handler to capture audio
async function handleWebSocketOpenEvent(): Promise<void> {
  // place logic here which you would like invoked when the socket opens
  console.log('WebSocket connection opened');
  await captureAudio();
}

Accepted audio formats include: mp3, wav, aac, ogg, flac, webm, avr, cdda, cvs/vms, aiff, au, amr, mp2, mp4, ac3, avi, wmv, mpeg, ircam.

Audio output

The response will comprise multiple messages, detailed as follows (a minimal sketch for handling the transcript messages appears after this list):

  1. user_message: This message encapsulates the transcription of the audio input. Additionally, it includes expression measurement predictions related to the speaker’s vocal prosody.
  2. assistant_message: For every sentence within the response, an AssistantMessage is dispatched. This message not only relays the content of the response but also features predictions regarding the expressive qualities of the generated audio response.
  3. audio_output: Accompanying each AssistantMessage, an AudioOutput message will be provided. This contains the actual audio (binary) response corresponding to an AssistantMessage.
  4. assistant_end: Signifying the conclusion of the response to the audio input, an AssistantEnd message is delivered as the final piece of the communication.
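
For example, a handler concerned only with the transcript could log the text of each user_message and assistant_message and note when the response is complete. The sketch below assumes the transcript text is exposed on each message's message.content field, as in the SDK's UserMessage and AssistantMessage types:

TypeScript
import { Hume } from 'hume';

// log transcript messages as they arrive (audio playback is handled separately below)
function logTranscript(message: Hume.empathicVoice.SubscribeEvent): void {
  switch (message.type) {
    case 'user_message':
    case 'assistant_message':
      // the role ('user' or 'assistant') and the transcribed or generated text
      console.log(`${message.message.role}: ${message.message.content}`);
      break;
    case 'assistant_end':
      // the interface has finished responding to the current audio input
      console.log('assistant finished responding');
      break;
  }
}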

Here we will focus on playing the received audio output. To play the audio output from the response, we need to define our logic for converting the received binary to a Blob and creating an HTMLAudioElement to play the audio. We then need to update the client’s onMessage WebSocket event handler to invoke that logic when an audio output is received. To manage playback for the incoming audio, we’ll implement a queue and play the audio back sequentially.

TypeScript
import {
  convertBase64ToBlob,
  getBrowserSupportedMimeType,
  MimeType,
} from 'hume';

// audio playback queue
const audioQueue: Blob[] = [];
// flag which denotes whether audio is currently playing or not
let isPlaying = false;
// the current audio element to be played
let currentAudio: HTMLAudioElement | null = null;
// mime type supported by the browser the application is running in
const mimeType: MimeType = (() => {
  const result = getBrowserSupportedMimeType();
  return result.success ? result.mimeType : MimeType.WEBM;
})();

// play the audio within the playback queue, converting each Blob into playable HTMLAudioElements
function playAudio(): void {
  // IF there is nothing in the audioQueue OR audio is currently playing then do nothing
  if (!audioQueue.length || isPlaying) return;
  // update isPlaying state
  isPlaying = true;
  // pull next audio output from the queue
  const audioBlob = audioQueue.shift();
  // IF audioBlob is unexpectedly undefined then do nothing
  if (!audioBlob) return;
  // converts Blob to AudioElement for playback
  const audioUrl = URL.createObjectURL(audioBlob);
  currentAudio = new Audio(audioUrl);
  // play audio
  currentAudio.play();
  // callback for when audio finishes playing
  currentAudio.onended = () => {
    // update isPlaying state
    isPlaying = false;
    // attempt to pull next audio output from queue
    if (audioQueue.length) playAudio();
  };
}

// define a WebSocket message event handler to play audio output
function handleWebSocketMessageEvent(
  message: Hume.empathicVoice.SubscribeEvent
): void {
  // place logic here which you would like to invoke when receiving a message through the socket
  switch (message.type) {
    // add received audio to the playback queue, and play next audio output
    case 'audio_output':
      // convert base64 encoded audio to a Blob
      const audioOutput = message.data;
      const blob = convertBase64ToBlob(audioOutput, mimeType);
      // add audio Blob to audioQueue
      audioQueue.push(blob);
      // play the next audio output
      if (audioQueue.length === 1) playAudio();
      break;
  }
}

Interrupt

Interruptibility is a distinguishing feature of the Empathic Voice Interface. If an audio input is sent through the WebSocket while response messages for a previous audio input are still being received, the response to the previous audio input will stop being sent. Additionally, the interface will send back a user_interruption message and begin responding to the new audio input.

TypeScript
// function for stopping the audio and clearing the queue
function stopAudio(): void {
  // stop the audio playback
  currentAudio?.pause();
  currentAudio = null;
  // update audio playback state
  isPlaying = false;
  // clear the audioQueue
  audioQueue.length = 0;
}

// update WebSocket message event handler to handle interruption
function handleWebSocketMessageEvent(
  message: Hume.empathicVoice.SubscribeEvent
): void {
  // place logic here which you would like to invoke when receiving a message through the socket
  switch (message.type) {
    // add received audio to the playback queue, and play next audio output
    case 'audio_output':
      // convert base64 encoded audio to a Blob
      const audioOutput = message.data;
      const blob = convertBase64ToBlob(audioOutput, mimeType);
      // add audio Blob to audioQueue
      audioQueue.push(blob);
      // play the next audio output
      if (audioQueue.length === 1) playAudio();
      break;
    // stop audio playback, clear audio playback queue, and update audio playback state on interrupt
    case 'user_interruption':
      stopAudio();
      break;
  }
}
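
With these handlers defined, the placeholder handlers from the Connect step can be swapped out so that audio capture starts when the socket opens and incoming messages are routed to the playback and interruption logic above. A minimal sketch, reusing the function names from the previous steps:

TypeScript
// connect to EVI, wiring in the handlers defined in the steps above
const socket = await client.empathicVoice.chat.connect({
  onOpen: handleWebSocketOpenEvent,
  onMessage: handleWebSocketMessageEvent,
  onError: (error) => {
    console.error(error);
  },
  onClose: () => {
    console.log('WebSocket connection closed');
  },
});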