TTS NodeJS Quickstart Guide
Step-by-step guide for integrating the TTS API using Hume’s TypeScript SDK.
This guide shows how to use Hume’s Text-to-Speech API with Hume’s TypeScript SDK in applications that run in a NodeJS-compatible runtime. It assumes your system has FFmpeg available.
It demonstrates:
- Using an existing voice.
- Creating a new voice via a prompt.
- Continuing from previous speech.
- Providing “acting instructions” to modulate the voice.
- Generating speech from live input.
The complete code for the example in this guide is available on GitHub.
Environment Setup
Create a new project and install the required packages:
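For example, using npm (this tooling is one reasonable setup rather than the only one; tsx and @types/node are assumptions for running TypeScript directly):

```sh
npm init -y
npm install hume dotenv
npm install --save-dev typescript tsx @types/node
```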
Authenticating the HumeClient
You must authenticate to use the Hume TTS API. Your API key can be retrieved from the Hume AI platform.
This example uses dotenv. Place your API key in a .env file at the root of your project.
First, use your API key to instantiate the HumeClient, importing as necessary.
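A minimal sketch, assuming your key is stored as HUME_API_KEY in the .env file:

```ts
import "dotenv/config"; // loads variables from .env into process.env
import { HumeClient } from "hume";

const apiKey = process.env.HUME_API_KEY;
if (!apiKey) {
  throw new Error("Set HUME_API_KEY in your .env file");
}

const hume = new HumeClient({ apiKey });
```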
Next, define a helper for playing back audio with ffplay.
Playing audio
The startAudioPlayer function creates an FFmpeg process that plays audio from stdin. It returns an object with a stdin stream for writing audio data and a stop method to cleanly terminate playback.
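A sketch of such a helper, assuming ffplay is available on your PATH (the extraArgs parameter is a convenience added here so the same helper can describe headerless PCM later in this guide):

```ts
import { spawn } from "node:child_process";
import type { Writable } from "node:stream";

export interface AudioPlayer {
  stdin: Writable;
  stop: () => Promise<void>;
}

// Spawns ffplay (part of FFmpeg) reading audio from stdin. For headerless
// formats, pass flags such as ["-f", "s16le", "-ar", "48000", "-ac", "1"]
// (these values are assumptions - match them to the audio you request).
export const startAudioPlayer = (extraArgs: string[] = []): AudioPlayer => {
  const proc = spawn(
    "ffplay",
    ["-nodisp", "-autoexit", "-loglevel", "quiet", ...extraArgs, "-i", "-"],
    { stdio: ["pipe", "ignore", "ignore"] },
  );
  return {
    stdin: proc.stdin!,
    stop: () =>
      new Promise<void>((resolve) => {
        proc.once("close", () => resolve());
        // With -autoexit, ffplay exits once stdin closes and playback finishes.
        if (!proc.stdin!.writableEnded) {
          proc.stdin!.end();
        }
      }),
  };
};
```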
Using a pre-existing voice
Use this method if you want to synthesize speech with a high-quality voice from Hume’s Voice Library, or specify provider: 'CUSTOM_VOICE' to use a voice that you created previously via the Hume Platform or the API.
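A sketch of a request, assuming the SDK's hume.tts.synthesizeJson method and response shape (generations containing base64 audio), and using an example Voice Library voice name; substitute any voice you like:

```ts
const speech = await hume.tts.synthesizeJson({
  utterances: [
    {
      text: "Welcome to my application!",
      // provider: "HUME_AI" selects a Voice Library voice by name;
      // use "CUSTOM_VOICE" for voices you created yourself.
      voice: { name: "Male English Actor", provider: "HUME_AI" },
    },
  ],
});

// Each generation's audio is returned as a base64-encoded string.
const player = startAudioPlayer();
player.stdin.write(Buffer.from(speech.generations[0].audio, "base64"));
await player.stop();
```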
Create a new voice via a prompt
The Voice Creation API allows you to create custom voices programmatically, via prompting. There are two steps to creating a voice:
1. Send a description of the voice, along with sample text that is characteristic of the voice, to the standard tts endpoint without specifying a voice and with instant_mode disabled.
2. Take the generationId from one of the resulting audio samples, and use it to create a new voice with the Voice Creation API.
Here, we arbitrarily select the second sample. In a real application, you would likely allow the end user to listen to the samples and make a selection.
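A sketch of both steps, assuming the camel-cased request fields (numGenerations) and the hume.tts.voices.create method exposed by the SDK; the voice prompt and name are just examples:

```ts
// Step 1: generate candidate audio from a voice description, with no voice
// specified. Instant mode is left disabled (it requires an existing voice).
const samples = await hume.tts.synthesizeJson({
  utterances: [
    {
      description:
        "A warm, unhurried narrator with quiet confidence and a hint of dry humor.",
      text: "Take a deep breath, and let's begin at the beginning.",
    },
  ],
  numGenerations: 2,
});

// In a real application, play each sample back and let the user choose.
const { generationId } = samples.generations[1];

// Step 2: save the chosen generation as a reusable custom voice.
await hume.tts.voices.create({
  name: "unhurried-narrator",
  generationId,
});
```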
Continuing previous speech
You can make new speech sound like a natural continuation from previous speech by providing the generationId
of the previous audio in the context
parameter. This helps maintain consistency in tone, pacing, and emotional state.
Additionally, you can provide “acting instructions” using the description
field alongside an existing voice. When you specify both a voice and a description, the description
modulates the voice’s tone, emotion, and delivery style while maintaining the core voice characteristics.
This code continues from the snippet above that created a new voice, and continues the speech from where the selected generation left off.
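A sketch of such a request, reusing the generationId and voice name from the voice-creation snippet; the context and description fields shown here follow the SDK's camel-cased request shape, which is an assumption to verify against your SDK version:

```ts
const continuation = await hume.tts.synthesizeJson({
  utterances: [
    {
      text: "Now, where were we? Ah yes: the beginning.",
      // The voice created in the previous step.
      voice: { name: "unhurried-narrator", provider: "CUSTOM_VOICE" },
      // Acting instructions: modulate delivery without changing the voice.
      description: "whispering, as if sharing a secret",
    },
  ],
  // Continue from the generation selected above.
  context: { generationId },
});

const player = startAudioPlayer();
player.stdin.write(Buffer.from(continuation.generations[0].audio, "base64"));
await player.stop();
```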
Generating speech from live input
If you need to generate speech from text that is being produced in real time, you can use the bidirectional streaming WebSocket endpoint at /v0/tts/stream/input.
Support for connecting to the WebSocket directly is coming soon to the TypeScript SDK. For the time being, this example shows how you can implement a simple WebSocket client yourself and still use types provided by the SDK for type safety.
First, install the ws package:
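For example, with npm (the @types/ws dev dependency is optional but useful for TypeScript):

```sh
npm install ws
npm install --save-dev @types/ws
```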
Then, use the ws library to connect to the WebSocket. Specify the following query parameters:
- no_binary=true - to receive audio as base64 text rather than binary, to simplify parsing
- instant_mode=true - to receive audio snippets as soon as they are generated, rather than waiting for the full utterance to be complete
- format_type=pcm - to receive raw PCM audio without WAV headers, which is easier to pipe directly to an audio player
- api_key=your_api_key - to authenticate the request
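A sketch of the connection, assuming the public api.hume.ai host:

```ts
import WebSocket from "ws";

const url = new URL("wss://api.hume.ai/v0/tts/stream/input");
url.searchParams.set("no_binary", "true");
url.searchParams.set("instant_mode", "true");
url.searchParams.set("format_type", "pcm");
url.searchParams.set("api_key", process.env.HUME_API_KEY!);

const socket = new WebSocket(url);

// Wait for the connection to open before sending any messages.
await new Promise<void>((resolve, reject) => {
  socket.once("open", () => resolve());
  socket.once("error", reject);
});
```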
We wrap the WebSocket in a StreamingTtsClient that provides an async iterator interface for consuming audio snippets as they arrive.
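One way to build such a wrapper is an async generator fed by a small message queue. This sketch assumes that, with no_binary enabled, each incoming message is JSON and audio snippets carry a base64 audio field; check the SDK's subscribe-message types for the authoritative shape:

```ts
class StreamingTtsClient {
  private queue: { audio: string }[] = [];
  private notify: (() => void) | null = null;
  private closed = false;

  constructor(private readonly socket: WebSocket) {
    socket.on("message", (data) => {
      const message = JSON.parse(data.toString());
      // Only queue messages that actually contain audio.
      if (typeof message.audio === "string") {
        this.queue.push(message);
        this.notify?.();
      }
    });
    socket.on("close", () => {
      this.closed = true;
      this.notify?.();
    });
  }

  // Send a JSON message (e.g. a PublishTts payload) over the socket.
  send(message: unknown): void {
    this.socket.send(JSON.stringify(message));
  }

  // Yields audio snippets as they arrive, until the server closes the socket.
  async *[Symbol.asyncIterator]() {
    while (true) {
      if (this.queue.length > 0) {
        yield this.queue.shift()!;
      } else if (this.closed) {
        return;
      } else {
        await new Promise<void>((resolve) => (this.notify = resolve));
        this.notify = null;
      }
    }
  }
}

const client = new StreamingTtsClient(socket);
```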
The PublishTts type from the TypeScript SDK describes the format of messages supported by the WebSocket. You can specify text and voice to send text to be spoken.
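For example (the Hume.tts.PublishTts type path is an assumption; import it from wherever your SDK version exports its TTS types):

```ts
import type { Hume } from "hume";

const message: Hume.tts.PublishTts = {
  text: "Hello! This text will be spoken once enough context is buffered. ",
  voice: { name: "unhurried-narrator", provider: "CUSTOM_VOICE" },
};

client.send(message);
```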
The WebSocket will buffer the text you send it by default, as having more context typically improves correctness and expressiveness. Audio will be produced when the buffer is full. However, you can send a message with flush: true to tell the server to start generating audio for the text you have sent so far. When you are done generating speech, send a message with close: true, and the server will end the connection once it is finished with the text you have given it previously.
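For example:

```ts
// Ask the server to start generating audio for the buffered text now.
client.send({ flush: true });

// When no more text is coming, let the server finish and close the connection.
client.send({ close: true });
```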
The WebSocket produces a stream of independent audio snippets, rather than a continuous stream of sometimes-silent audio.
In the example project, we use ffplay as the audio player, which expects a continuous stream. To take care of this, we use the SilenceFiller helper provided by the Hume TypeScript SDK, which creates a continuous stream from independent PCM audio snippets.
This is not always needed. In some settings you can initialize a new audio player for each audio chunk. If you are using an audio player that writes directly to your system’s audio output (such as aplay on Linux or afplay on macOS), you can typically write each chunk directly when it is ready to play, without worrying about filling gaps with silence.
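Putting the pieces together might look like the sketch below. The SilenceFiller import path and constructor, and the PCM format flags passed to ffplay, are assumptions to verify against the SDK documentation:

```ts
import { SilenceFiller } from "hume";

// ffplay needs to be told how to interpret headerless PCM; these flags are
// assumptions - match them to the audio format the API returns.
const player = startAudioPlayer(["-f", "s16le", "-ar", "48000", "-ac", "1"]);

// SilenceFiller emits silence whenever no snippet is available, so ffplay
// always receives a continuous stream.
const silenceFiller = new SilenceFiller();
silenceFiller.pipe(player.stdin);

for await (const snippet of client) {
  silenceFiller.write(Buffer.from(snippet.audio, "base64"));
}

silenceFiller.end();
await player.stop();
```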
Running the Example
To run the example:
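If you followed the setup above, running the entry point with tsx is one option (the filename is an assumption; use whatever your project's entry point is):

```sh
npx tsx index.ts
```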