EVI TypeScript Quickstart

A quickstart guide for implementing the Empathic Voice Interface (EVI) with TypeScript.

This tutorial provides step-by-step instructions for implementing EVI using Hume’s TypeScript SDK, and is broken down into five key components:

  1. Authentication: Authenticate your application with EVI using your credentials.
  2. Connecting to EVI: Set up a secure WebSocket connection to interact with EVI.
  3. Capturing & recording audio: Capture audio input and prepare it for processing.
  4. Audio playback: Play back the processed audio output to the user.
  5. Interruption: Manage and handle interruptions during the chat.

To see the full implementation within a frontend web application, visit our API examples repository on GitHub: hume-evi-typescript-example.

1. Authenticate

In order to establish an authenticated connection, we first need to instantiate the Hume client with our API key and Secret key. These keys can be obtained by logging into the portal and visiting the API keys page.

In the sample code below, the API key and Secret key have been saved to environment variables. Avoid hardcoding these values in your project to prevent them from being leaked.
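
For example, with a bundler that exposes these variables on import.meta.env, the environment file might look like the sketch below. The values are placeholders, the variable names match the snippets in this guide, and how environment variables are exposed depends on your build tooling:

.env
HUME_API_KEY=<YOUR_API_KEY>
HUME_SECRET_KEY=<YOUR_SECRET_KEY>
HUME_CONFIG_ID=<YOUR_CONFIG_ID>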

TypeScript
import { Hume, HumeClient } from 'hume';

// instantiate the Hume client and authenticate
const client = new HumeClient({
  apiKey: import.meta.env.HUME_API_KEY || '',
  secretKey: import.meta.env.HUME_SECRET_KEY || '',
});

When using our TypeScript SDK, the Access Token necessary to establish an authenticated connection with EVI is fetched and applied under the hood after the Hume client is instantiated with your credentials.
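
If you would rather mint the access token yourself, the hume package also exports a fetchAccessToken utility. The snippet below is a minimal sketch of that approach; environment variable access mirrors the snippets above, and the exact signature may vary by SDK version.

TypeScript
import { fetchAccessToken } from 'hume';

// sketch: exchange the API key and Secret key for a short-lived access token
const accessToken = await fetchAccessToken({
  apiKey: String(import.meta.env.HUME_API_KEY),
  secretKey: String(import.meta.env.HUME_SECRET_KEY),
});

// the token can then be handed to a client that should not hold the Secret key
console.log('Access token fetched:', Boolean(accessToken));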

2. Connect

With the Hume client instantiated with our credentials, we can now establish an authenticated WebSocket connection with EVI and define our WebSocket event handlers. For now we will include placeholder event handlers to be updated in later steps.

TypeScript
import { Hume, HumeClient } from 'hume';

// instantiate the Hume client and authenticate
const client = new HumeClient({
  apiKey: import.meta.env.HUME_API_KEY || '',
  secretKey: import.meta.env.HUME_SECRET_KEY || '',
});

// instantiate the WebSocket and establish an authenticated connection
const socket = await client.empathicVoice.chat.connect({
  configId: import.meta.env.HUME_CONFIG_ID || null,
});

// define handler functions and assign them to the corresponding WebSocket event handlers
socket.on('open', handleWebSocketOpenEvent);
socket.on('message', handleWebSocketMessageEvent);
socket.on('error', handleWebSocketErrorEvent);
socket.on('close', handleWebSocketCloseEvent);
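
The four handlers referenced above are defined over the next steps. If you want the snippet to compile in the meantime, you can start with placeholder implementations along these lines (a minimal sketch; the parameter types shown are assumptions to be refined as the handlers are fleshed out):

TypeScript
// placeholder handlers, replaced by full implementations in the steps below
async function handleWebSocketOpenEvent(): Promise<void> {
  console.log('WebSocket connection opened');
}

function handleWebSocketMessageEvent(message: Hume.empathicVoice.SubscribeEvent): void {
  console.log('Received message of type:', message.type);
}

function handleWebSocketErrorEvent(error: Error): void {
  console.error('WebSocket error:', error);
}

function handleWebSocketCloseEvent(): void {
  console.log('WebSocket connection closed');
}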

3. Audio input

To capture audio and send it through the socket as an audio input, several steps are necessary. First, we need to handle user permissions to access the microphone. Next, we'll use the MediaStream API to capture the audio and the MediaRecorder API to record it. We then base64 encode the recorded audio Blob, and finally send the encoded audio through the WebSocket using the sendAudioInput method.

TypeScript
import {
  convertBlobToBase64,
  ensureSingleValidAudioTrack,
  getAudioStream,
  getBrowserSupportedMimeType,
  MimeType,
} from 'hume';

// the recorder responsible for recording the audio stream to be prepared as the audio input
let recorder: MediaRecorder | null = null;

// the stream of audio captured from the user's microphone
let audioStream: MediaStream | null = null;

// mime type supported by the browser the application is running in
const mimeType: MimeType = (() => {
  const result = getBrowserSupportedMimeType();
  return result.success ? result.mimeType : MimeType.WEBM;
})();

// define function for capturing audio
async function captureAudio(): Promise<void> {
  // prompt the user for permission to capture audio, obtaining a media stream upon approval
  audioStream = await getAudioStream();

  // ensure there is only one audio track in the stream
  ensureSingleValidAudioTrack(audioStream);

  // instantiate the media recorder
  recorder = new MediaRecorder(audioStream, { mimeType });

  // callback for when a recorded chunk is available to be processed
  recorder.ondataavailable = async ({ data }) => {
    // IF the size of data is smaller than 1 byte then do nothing
    if (data.size < 1) return;

    // base64 encode audio data
    const encodedAudioData = await convertBlobToBase64(data);

    // define the audio_input message JSON
    const audioInput: Omit<Hume.empathicVoice.AudioInput, 'type'> = {
      data: encodedAudioData,
    };

    // send audio_input message
    socket?.sendAudioInput(audioInput);
  };

  // capture audio input at a rate of 100ms (recommended for web)
  const timeSlice = 100;
  recorder.start(timeSlice);
}

// define a WebSocket open event handler to capture audio
async function handleWebSocketOpenEvent(): Promise<void> {
  // place logic here which you would like invoked when the socket opens
  console.log('WebSocket connection opened');
  await captureAudio();
}

Accepted audio formats include: mp3, wav, aac, ogg, flac, webm, avr, cdda, cvs/vms, aiff, au, amr, mp2, mp4, ac3, avi, wmv, mpeg, ircam.
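
When the chat ends, you will likely also want to stop recording and release the microphone. The following sketch uses only the standard MediaRecorder and MediaStream APIs; the stopCapture name is ours, not part of the SDK:

TypeScript
// sketch: stop capturing audio and release the microphone
function stopCapture(): void {
  // stop the recorder if it is still running
  if (recorder && recorder.state !== 'inactive') recorder.stop();
  recorder = null;

  // stop every track on the stream to release the microphone
  audioStream?.getTracks().forEach((track) => track.stop());
  audioStream = null;
}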

4. Audio output

The response will comprise multiple messages, detailed as follows:

  1. user_message: This message encapsulates the transcription of the audio input. Additionally, it includes expression measurement predictions related to the speaker’s vocal prosody.
  2. assistant_message: For every sentence within the response, an AssistantMessage is dispatched. This message not only relays the content of the response but also features predictions regarding the expressive qualities of the generated audio response.
  3. audio_output: Accompanying each AssistantMessage, an AudioOutput message will be provided. This contains the actual audio (binary) response corresponding to an AssistantMessage.
  4. assistant_end: Signifying the conclusion of the response to the audio input, an AssistantEnd message is delivered as the final piece of the communication.
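
Before turning to playback, it can help to see these messages in action. The sketch below simply logs the transcript content carried by user_message and assistant_message and notes when assistant_end arrives; the field access shown reflects the SDK's SubscribeEvent types, but treat it as illustrative rather than definitive:

TypeScript
// sketch: log transcripts and the end of each response as messages arrive
function logTranscript(message: Hume.empathicVoice.SubscribeEvent): void {
  switch (message.type) {
    // transcription of the user's audio input
    case 'user_message':
      console.log(`User: ${message.message.content}`);
      break;

    // one message per sentence of the assistant's response
    case 'assistant_message':
      console.log(`Assistant: ${message.message.content}`);
      break;

    // the assistant has finished responding to the audio input
    case 'assistant_end':
      console.log('Assistant finished responding');
      break;
  }
}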

Here we’ll focus on playing the received audio output. To play the audio output from the response, we need to define our logic for converting the received base64-encoded audio to a Blob and creating an HTMLAudioElement to play the audio.

We then need to update the client’s message WebSocket event handler to invoke this playback logic when an audio_output message is received. To manage playback of the incoming audio, we’ll implement a queue and play each chunk back sequentially.

TypeScript
// convertBase64ToBlob is another utility exported by the hume package
import { convertBase64ToBlob } from 'hume';

// audio playback queue
const audioQueue: Blob[] = [];

// flag which denotes whether audio is currently playing or not
let isPlaying = false;

// the current audio element to be played
let currentAudio: HTMLAudioElement | null = null;

// play the audio within the playback queue, converting each Blob into a playable HTMLAudioElement
function playAudio(): void {
  // IF there is nothing in the audioQueue OR audio is currently playing then do nothing
  if (!audioQueue.length || isPlaying) return;

  // update isPlaying state
  isPlaying = true;

  // pull the next audio output from the queue
  const audioBlob = audioQueue.shift();

  // IF audioBlob is unexpectedly undefined then do nothing
  if (!audioBlob) return;

  // convert the Blob to an audio element for playback
  const audioUrl = URL.createObjectURL(audioBlob);
  currentAudio = new Audio(audioUrl);

  // play audio
  currentAudio.play();

  // callback for when audio finishes playing
  currentAudio.onended = () => {
    // update isPlaying state
    isPlaying = false;

    // attempt to pull the next audio output from the queue
    if (audioQueue.length) playAudio();
  };
}

// define a WebSocket message event handler to play audio output
function handleWebSocketMessageEvent(
  message: Hume.empathicVoice.SubscribeEvent
): void {
  // place logic here which you would like to invoke when receiving a message through the socket
  switch (message.type) {
    // add received audio to the playback queue, and play the next audio output
    case 'audio_output': {
      // convert base64 encoded audio to a Blob
      const audioOutput = message.data;
      const blob = convertBase64ToBlob(audioOutput);

      // add audio Blob to audioQueue
      audioQueue.push(blob);

      // play the next audio output
      if (audioQueue.length === 1) playAudio();
      break;
    }
  }
}

5. Interrupt

Interruptibility is a distinguishing feature of the Empathic Voice Interface. If an audio input is sent through the WebSocket while response messages for a previous audio input are still being received, the response to the previous audio input will stop being sent. Additionally, the interface will send back a user_interruption message and begin responding to the new audio input.

TypeScript
// function for stopping the audio and clearing the queue
function stopAudio(): void {
  // stop the audio playback
  currentAudio?.pause();
  currentAudio = null;

  // update audio playback state
  isPlaying = false;

  // clear the audioQueue
  audioQueue.length = 0;
}

// update the WebSocket message event handler to handle interruption
function handleWebSocketMessageEvent(
  message: Hume.empathicVoice.SubscribeEvent
): void {
  // place logic here which you would like to invoke when receiving a message through the socket
  switch (message.type) {
    // add received audio to the playback queue, and play the next audio output
    case 'audio_output': {
      // convert base64 encoded audio to a Blob
      const audioOutput = message.data;
      const blob = convertBase64ToBlob(audioOutput);

      // add audio Blob to audioQueue
      audioQueue.push(blob);

      // play the next audio output
      if (audioQueue.length === 1) playAudio();
      break;
    }

    // stop audio playback, clear the playback queue, and update playback state on interrupt
    case 'user_interruption':
      stopAudio();
      break;
  }
}
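
Finally, the error and close handlers registered in step 2 were left as placeholders. One reasonable approach (a sketch, not prescribed by the SDK) is to log errors and clean up capture and playback when the socket closes, reusing stopAudio from this step and the stopCapture sketch from the audio input step:

TypeScript
// define a WebSocket error event handler
function handleWebSocketErrorEvent(error: Error): void {
  // place logic here which you would like invoked when the socket errors
  console.error('WebSocket error:', error);
}

// define a WebSocket close event handler to clean up capture and playback
function handleWebSocketCloseEvent(): void {
  // place logic here which you would like invoked when the socket closes
  console.log('WebSocket connection closed');

  // stop audio playback and clear the queue
  stopAudio();

  // stop capturing audio and release the microphone (see the sketch at the end of step 3)
  stopCapture();
}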