TTS NodeJS Quickstart Guide

Step-by-step guide for integrating the TTS API using Hume’s TypeScript SDK.

This guide shows how to use Hume’s Text-to-Speech API with Hume’s TypeScript SDK in applications that run in a NodeJS-compatible runtime. It assumes your system has FFmpeg available, since the examples play audio with FFmpeg’s ffplay command.
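If FFmpeg is not already installed, you can usually get it from your system’s package manager. For example (adjust for your platform):

$brew install ffmpeg         # macOS (Homebrew)
$sudo apt-get install ffmpeg # Debian/Ubuntu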

It demonstrates:

  1. Using an existing voice.
  2. Creating a new voice via a prompt.
  3. Continuing from previous speech.
  4. Providing “acting instructions” to modulate the voice.
  5. Generating speech from live input.

The complete code for the example in this guide is available on GitHub.

Environment Setup

Create a new project and install the required packages:

$npm init -y
$npm install hume dotenv
$npm install --save-dev typescript @types/node
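The final step of this guide runs the example with ts-node. npx will download it on demand, but you can also pin it as a dev dependency now:

$npm install --save-dev ts-node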

Authenticating the HumeClient

You must authenticate to use the Hume TTS API. Your API key can be retrieved from the Hume AI platform.

This example uses dotenv. Place your API key in a .env file at the root of your project.

.env
HUME_API_KEY=your_api_key_here

First, use your API key to instantiate the HumeClient, importing as necessary.

// index.ts
import { HumeClient } from "hume"
import dotenv from "dotenv"

dotenv.config()

const hume = new HumeClient({
  apiKey: process.env.HUME_API_KEY!
})

Next, define a helper for playing back audio with ffplay.

Playing audio

// audio_player.ts
import { spawn } from 'child_process';

export const startAudioPlayer = () => {
  const ffplay = spawn('ffplay', ['-nodisp', '-autoexit', '-'], {
    stdio: ['pipe', 'ignore', 'ignore']
  });

  return {
    stdin: ffplay.stdin,
    stop: async () => {
      ffplay.stdin.end();
      await new Promise(resolve => ffplay.on('close', resolve));
    }
  };
};

The startAudioPlayer function spawns an ffplay process (part of FFmpeg) that plays audio from stdin. It returns an object with a stdin stream for writing audio data and a stop method to cleanly terminate playback.
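For example (a minimal sketch, where audioBuffer is a hypothetical Buffer of encoded audio such as wav or mp3):

const player = startAudioPlayer();
player.stdin.write(audioBuffer); // hypothetical Buffer of encoded audio
await player.stop();             // waits for ffplay to finish playing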

Using a pre-existing voice

Use this method if you want to synthesize speech with a high-quality voice from Hume’s Voice Library, or specify provider: 'CUSTOM_VOICE' to use a voice that you created previously via the Hume Platform or the API.

const utterance = {
  text: "Dogs became domesticated between 23,000 and 30,000 years ago.",
  voice: { name: 'Ava Song', provider: 'HUME_AI' as const }
}

const stream = await hume.tts.synthesizeJsonStreaming({
  utterances: [utterance],
  // With `stripHeaders: true`, only the first audio chunk will contain
  // headers in container formats (wav, mp3). This allows you to start a
  // single audio player and stream all audio chunks to it without artifacts.
  stripHeaders: true
})

const audioPlayer = startAudioPlayer()
for await (const snippet of stream) {
  const buffer = Buffer.from(snippet.audio, "base64")
  audioPlayer.stdin.write(buffer)
}
await audioPlayer.stop()
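To use a voice you created yourself instead, only the provider changes. A sketch, where "My Custom Voice" is a hypothetical placeholder for one of your own voice names:

const customUtterance = {
  text: "Dogs became domesticated between 23,000 and 30,000 years ago.",
  // "My Custom Voice" is a placeholder; substitute a voice you have created.
  voice: { name: 'My Custom Voice', provider: 'CUSTOM_VOICE' as const }
}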

Creating a new voice via a prompt

The Voice Creation API allows you to create custom voices programmatically via prompting. There are two steps to creating a voice:

  1. Send a description of the voice, along with sample text that is characteristic of the voice, to the standard tts endpoint without specifying a voice, with instant_mode disabled.
  2. Take the generationId from one of the resulting audio samples, and use it to create a new voice with the Voice Creation API.

Here, we arbitrarily select the second sample. In a real application, you would likely allow the end user to listen to the samples and make a selection.

// Create voice options for user selection
const result = await hume.tts.synthesizeJson({
  utterances: [{
    description: "Crisp, upper-class British accent with impeccably articulated consonants and perfectly placed vowels. Authoritative and theatrical, as if giving a lecture.",
    text: "The science of speech. That's my profession; also my hobby. Happy is the man who can make a living by his hobby!"
  }],
  numGenerations: 2,
  stripHeaders: true,
})

const audioPlayer = startAudioPlayer()
let sampleNumber = 1;
for (const generation of result.generations) {
  const buffer = Buffer.from(generation.audio, "base64")
  audioPlayer.stdin.write(buffer)
  console.log(`Playing option ${sampleNumber}...`)
  sampleNumber++;
}
await audioPlayer.stop()

// Select the second voice option for this example
const selectedGenerationId = result.generations[1].generationId

const voiceName = `higgins-${Date.now()}`;
await hume.tts.voices.create({
  name: voiceName,
  generationId: selectedGenerationId,
})

console.log(`Created voice: ${voiceName}`)

Continuing previous speech

You can make new speech sound like a natural continuation from previous speech by providing the generationId of the previous audio in the context parameter. This helps maintain consistency in tone, pacing, and emotional state.

Additionally, you can provide “acting instructions” using the description field alongside an existing voice. When you specify both a voice and a description, the description modulates the voice’s tone, emotion, and delivery style while maintaining the core voice characteristics.

This code picks up from the snippet above that created a new voice, and continues the speech from where the selected generation left off.

const audioPlayer = startAudioPlayer()
const stream = await hume.tts.synthesizeJsonStreaming({
  utterances: [{
    voice: { name: voiceName },
    text: "YOU can spot an Irishman or a Yorkshireman by his brogue. I can place any man within six miles. I can place him within two miles in London. Sometimes within two streets.",
    description: "Bragging about his abilities"
  }],
  context: {
    generationId: selectedGenerationId
  },
  stripHeaders: true
})

for await (const snippet of stream) {
  const buffer = Buffer.from(snippet.audio, "base64")
  audioPlayer.stdin.write(buffer)
}
await audioPlayer.stop()

Generating speech from live input

If you need to generate speech from text that is being produced in real-time, you can use the bidirectional streaming WebSocket endpoint at /v0/tts/stream/input.

Support for connecting to the WebSocket directly is coming soon to the TypeScript SDK. For the time being, this example shows how you can implement a simple WebSocket client yourself and still use types provided by the SDK for type safety.

First, install the ws package:

$npm install ws

Then, use the ws library to connect to the WebSocket. Specify the following query parameters:

  • no_binary=true - to receive audio as base64 text rather than binary, to simplify parsing
  • instant_mode=true - to receive audio snippets as soon as they are generated, rather than waiting for the full utterance to be complete
  • format_type=pcm - to receive raw PCM audio without WAV headers, which is easier to pipe directly to an audio player
  • api_key=your_api_key - to authenticate the request

We wrap the WebSocket in a StreamingTtsClient that provides an async iterator interface for consuming audio snippets as they arrive.

// streaming.ts
import WebSocket from "ws";
import { SnippetAudioChunk } from "hume/serialization/resources/tts/types/SnippetAudioChunk";
import { PublishTts } from "hume/api/resources/tts";

export class StreamingTtsClient {
  private constructor(
    private readonly ws: WebSocket,
    private readonly queue: Queue<string>
  ) { }

  static async connect(apiKey: string): Promise<StreamingTtsClient> {
    if (!apiKey) throw new Error("HUME_API_KEY is not set");

    const url = `wss://api.hume.ai/v0/tts/stream/input?api_key=${apiKey}&no_binary=true&instant_mode=true&strip_headers=true&format_type=pcm`;
    const ws = new WebSocket(url);
    const queue = new Queue<string>();

    ws.onmessage = (event) => {
      queue.push(event.data.toString())
    };
    ws.onclose = (_event) => {
      queue.end();
    };
    ws.onerror = (_error) => {
      queue.end();
    };

    await new Promise<void>((resolve, reject) => {
      ws.onopen = () => {
        resolve();
      };
      ws.onerror = (e) => {
        reject(e);
      };
    });

    return new StreamingTtsClient(ws, queue);
  }

  send(message: PublishTts) {
    if (this.ws.readyState !== WebSocket.OPEN) throw new Error("WebSocket not connected.");
    this.ws.send(JSON.stringify(message));
  }

  disconnect() {
    this.ws.close();
  }

  async *[Symbol.asyncIterator]() {
    for await (const item of this.queue) {
      yield SnippetAudioChunk.parseOrThrow(JSON.parse(item), {
        unrecognizedObjectKeys: "passthrough",
      });
    }
  }
}

// Resolves a promise with T, or null to indicate the stream ended.
type Resolver<T> = (value: T | null) => void;

class Queue<T> {
  private pushed: T[] = [];
  // If non-null, there is a consumer waiting for data, and
  // calling `waiting` with a chunk will resolve a promise that
  // sends the data to the consumer.
  private waiting: Resolver<T> | null = null;
  private ended = false;

  push(x: T) {
    if (this.ended) return;
    if (this.waiting) {
      const w = this.waiting;
      this.waiting = null;
      w(x);
    }
    else this.pushed.push(x);
  }
  end() {
    if (this.ended) return;
    this.ended = true;
    if (this.waiting) { this.waiting(null); this.waiting = null; }
  }
  async *[Symbol.asyncIterator]() {
    while (true) {
      if (this.pushed.length) yield this.pushed.shift()!;
      // If the stream has ended and the buffer is drained, stop iterating
      // rather than waiting for data that will never arrive.
      else if (this.ended) break;
      else {
        const x = await new Promise<T | null>(r => (this.waiting = r));
        if (x === null) break;
        yield x;
      }
    }
  }
}

The PublishTts type from the TypeScript SDK describes the format of messages supported by the WebSocket. You can specify text and voice to send text to be spoken.
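For instance, a minimal input message typed against PublishTts might look like this (a sketch; the text is arbitrary):

import { PublishTts } from "hume/api/resources/tts";

const message: PublishTts = {
  text: "Hello from the input stream.",
  voice: { name: "Ava Song", provider: "HUME_AI" },
};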

By default, the WebSocket buffers the text you send, since more context typically improves correctness and expressiveness; audio is produced once the buffer fills. You can send a message with flush: true to tell the server to start generating audio for the text you have sent so far. When you are done generating speech, send a message with close: true, and the server will end the connection once it has finished with the text you sent previously.

const stream = await StreamingTtsClient.connect(process.env.HUME_API_KEY!);

// Helper methods for flushing and closing the stream
const sendFlush = () => stream.send({ flush: true });
const sendClose = () => stream.send({ close: true });

const voice = { name: "Ava Song", provider: "HUME_AI" } as const;
const sendInput = async () => {
  stream.send({ text: "Hello world.", voice });
  sendFlush();
  console.log('Waiting 8 seconds...')
  await new Promise(r => setTimeout(r, 8000));
  stream.send({ text: "Goodbye, world.", voice });
  sendFlush();
  sendClose();
};

The WebSocket produces a stream of independent audio snippets, rather than a continuous stream of sometimes-silent audio.

In the example project, we use ffplay as the audio player, which expects a continuous stream. To bridge the gap, we use the SilenceFiller helper provided by the Hume TypeScript SDK, which creates a continuous stream from independent PCM audio snippets.

This is not always needed. In some settings you can initialize a new audio player for each audio chunk. If you are using an audio player that writes directly to your system’s audio output (such as aplay on Linux or afplay on macOS), you can typically write each chunk directly when it is ready to play, without worrying about filling gaps with silence (see the sketch after the example below).

const player = startAudioPlayer();
const silenceFiller = new SilenceFiller();

// Pipe silence filler output to audio player stdin
silenceFiller.pipe(player.stdin);

// Handle pipe errors
silenceFiller.on('error', (err) => {
  console.error("SilenceFiller error:", err);
});

const handleMessages = async () => {
  for await (const chunk of stream) {
    const buf = Buffer.from(chunk.audio, "base64");
    silenceFiller.writeAudio(buf);
  }

  await silenceFiller.endStream();
  await player.stop();
};

// Trigger both sending input and receiving audio at the same time.
await Promise.all([handleMessages(), sendInput()]);
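If you take the per-chunk route mentioned above instead of SilenceFiller, a minimal sketch on Linux might pipe each snippet straight to aplay. This assumes aplay is available and that the stream is 48 kHz, signed 16-bit, mono PCM; confirm the exact PCM format in the API reference:

import { spawn } from "child_process";

// Play one raw PCM chunk with aplay and resolve when playback finishes.
// Assumed format: 48 kHz, signed 16-bit little-endian, mono.
const playPcmChunk = (chunk: Buffer) =>
  new Promise<void>((resolve) => {
    const aplay = spawn("aplay", ["-t", "raw", "-f", "S16_LE", "-r", "48000", "-c", "1"], {
      stdio: ["pipe", "ignore", "ignore"],
    });
    aplay.on("close", () => resolve());
    aplay.stdin.end(chunk);
  });

for await (const chunk of stream) {
  await playPcmChunk(Buffer.from(chunk.audio, "base64"));
}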

Running the Example

To run the example:

$npx ts-node index.ts