TTS .NET Quickstart Guide
Step-by-step guide for integrating the TTS API using Hume’s .NET SDK.
This guide shows how to get started with Hume’s Text-to-Speech capabilities using Hume’s .NET SDK. It demonstrates:
- Converting text to speech with a new voice.
- Saving a voice to your voice library for future use.
- Giving “acting instructions” to modulate the voice.
- Generating multiple variations of the same text at once.
- Providing context to maintain consistency across multiple generations.
The complete code for the example in this guide is available on GitHub.
Environment Setup
Create a new .NET project and install the required packages:
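A minimal setup using the dotnet CLI (the NuGet package manager in Visual Studio works equally well). The package name below is an assumption; check NuGet for Hume’s official .NET SDK package:

```sh
# Create a new console project for the quickstart.
dotnet new console -n HumeTtsQuickstart
cd HumeTtsQuickstart

# Package name is an assumption; search nuget.org for Hume's official .NET SDK.
dotnet add package Hume
```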
Authenticating the HumeApiClient
You must authenticate to use the Hume TTS API. Your API key can be retrieved from the Hume AI platform.
This example uses environment variables. Set your API key as an environment variable:
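For example, in a bash-compatible shell (the variable name `HUME_API_KEY` is an assumption; use whatever name your code reads):

```sh
export HUME_API_KEY=<your-api-key>
```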
Then create a new file `Program.cs` and use your API key to instantiate the `HumeApiClient`.
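A minimal sketch of `Program.cs`, assuming the client constructor accepts the API key directly and that the key is in the `HUME_API_KEY` environment variable (both assumptions; consult the SDK’s README for the exact namespace and constructor):

```csharp
using Hume; // namespace is an assumption; adjust to match the SDK

// Read the API key from the environment (variable name is an assumption).
var apiKey = Environment.GetEnvironmentVariable("HUME_API_KEY")
    ?? throw new InvalidOperationException("HUME_API_KEY is not set.");

// Constructor signature is an assumption; the SDK may use an options object instead.
var client = new HumeApiClient(apiKey);
```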
Helper function
Define a function to aid in writing generated audio to a temporary file:
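One possible helper, which base64-decodes a generation’s audio and writes it to a uniquely named file in the system temp directory (the helper name and parameters are ours, not part of the SDK):

```csharp
// Decodes base64-encoded audio and writes it to a temporary .wav file, returning the path.
static string WriteAudioToTempFile(string base64Audio, string name)
{
    var path = Path.Combine(Path.GetTempPath(), $"hume-tts-{name}.wav");
    File.WriteAllBytes(path, Convert.FromBase64String(base64Audio));
    Console.WriteLine($"Wrote {path}");
    return path;
}
```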
Calling Text-to-Speech
To use Hume TTS, call `client.Tts.SynthesizeJsonAsync` with a `SynthesizeJsonRequest` containing a list of utterances. Inside each utterance, put the `Text` to speak, and optionally provide a `Description` of how the voice speaking the text should sound. If you don’t provide a description, Hume will examine `Text` and attempt to determine an appropriate voice.
The base64-encoded bytes of an audio file containing your speech will be present at `.Generations[0].Audio` in the returned object. By default, there will only be a single variation in the `.Generations` array, and the audio will be in `wav` format.

The `.Generations[0].GenerationId` field contains an ID you can use to refer to this specific generation of speech in future requests.
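A sketch of a first request. Only `SynthesizeJsonAsync`, `SynthesizeJsonRequest`, `Text`, `Description`, `Generations`, `Audio`, and `GenerationId` are named in this guide; the utterance type and list property names below are assumptions about the SDK’s request shape:

```csharp
var response = await client.Tts.SynthesizeJsonAsync(new SynthesizeJsonRequest
{
    // "Utterances" and "PostedUtterance" are assumed names for the request's utterance list.
    Utterances = new List<PostedUtterance>
    {
        new PostedUtterance
        {
            Text = "Welcome to my application!",
            Description = "A warm, friendly customer-support agent."
        }
    }
});

// One variation by default, returned as base64-encoded wav audio.
var generation = response.Generations[0];
WriteAudioToTempFile(generation.Audio, generation.GenerationId);
```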
Saving voices
Use `client.Tts.Voices.CreateAsync` to save the voice of a generated piece of audio to your voice library for future use:
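A sketch, assuming a request type along these lines (only `Voices.CreateAsync` and `GenerationId` come from this guide; the request type and `Name` property are assumptions):

```csharp
// Save the voice from the generation above under a reusable name.
await client.Tts.Voices.CreateAsync(new CreateVoiceRequest
{
    Name = "my-new-voice",                  // name to look the voice up by later
    GenerationId = generation.GenerationId  // which generated audio to save
});
```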
Continuity
Inside an utterance, specify the name or ID of a saved voice to generate more speech with that voice. To generate speech that is meant to follow previously generated speech, specify `Context` with the `GenerationId` of that speech. You can also set `NumGenerations` (up to 5) to generate multiple variations of the same speech at the same time.
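A sketch combining all three: a saved voice, a `Context` carrying the previous `GenerationId`, and `NumGenerations` for multiple variations. The `Voice` and `Context` wrapper types are assumptions:

```csharp
var continuation = await client.Tts.SynthesizeJsonAsync(new SynthesizeJsonRequest
{
    Utterances = new List<PostedUtterance>
    {
        new PostedUtterance
        {
            Text = "And that is how the story ends.",
            // Reference the voice saved earlier by name (type name is an assumption).
            Voice = new PostedUtteranceVoiceWithName { Name = "my-new-voice" }
        }
    },
    // Continue from the earlier generation (wrapper type is an assumption).
    Context = new PostedContextWithGenerationId { GenerationId = generation.GenerationId },
    NumGenerations = 2 // up to 5 variations per request
});

// Each variation arrives as its own entry in Generations.
foreach (var variation in continuation.Generations)
{
    WriteAudioToTempFile(variation.Audio, variation.GenerationId);
}
```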
Acting Instructions
If you specify both `Voice` and `Description`, the `Description` field behaves as “acting instructions”: it keeps the character of the specified `Voice`, but modulates the delivery to match the `Description`.
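A sketch using the same assumed request shape as above, with both a saved `Voice` and a `Description` acting as delivery instructions:

```csharp
var acted = await client.Tts.SynthesizeJsonAsync(new SynthesizeJsonRequest
{
    Utterances = new List<PostedUtterance>
    {
        new PostedUtterance
        {
            Text = "I can't believe you kept this a secret the whole time!",
            Voice = new PostedUtteranceVoiceWithName { Name = "my-new-voice" },
            // With a Voice present, Description modulates delivery rather than defining the voice.
            Description = "Whispering conspiratorially, half-laughing, delighted."
        }
    }
});

WriteAudioToTempFile(acted.Generations[0].Audio, acted.Generations[0].GenerationId);
```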
Streaming speech
You can stream utterances using the `SynthesizeJsonStreamingAsync` method. This lets you process audio chunks as they become available, rather than waiting for the entire speech generation to complete.
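A sketch, assuming `SynthesizeJsonStreamingAsync` returns an async stream of snippets whose `Audio` property carries base64-encoded bytes (the return shape and snippet property names are assumptions):

```csharp
var chunkIndex = 0;
await foreach (var snippet in client.Tts.SynthesizeJsonStreamingAsync(new SynthesizeJsonRequest
{
    Utterances = new List<PostedUtterance>
    {
        new PostedUtterance { Text = "This speech arrives in chunks as it is generated." }
    }
}))
{
    // Write each chunk to its own file as soon as it arrives.
    var path = Path.Combine(Path.GetTempPath(), $"hume-tts-chunk-{chunkIndex++}.wav");
    File.WriteAllBytes(path, Convert.FromBase64String(snippet.Audio));
    Console.WriteLine($"Wrote {path}");
}
```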
You can either write these chunks to files as we’ve done above, or play them in real-time with an audio player. Below is an example of real-time playback using a pipe-based streaming audio player:
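One way to do this is to pipe decoded audio into an external player such as ffplay (part of ffmpeg). This is a sketch, not part of the SDK: it assumes ffplay is on your PATH and that the streamed chunks concatenate into a format ffplay can decode from stdin; for raw PCM output you would add the appropriate `-f`/`-ar` flags.

```csharp
using System.Diagnostics;

// Launch ffplay reading audio from stdin (pipe:0). -nodisp hides the video window,
// -autoexit quits once the stream ends.
var player = Process.Start(new ProcessStartInfo
{
    FileName = "ffplay",
    Arguments = "-autoexit -nodisp -loglevel error -i pipe:0",
    RedirectStandardInput = true,
    UseShellExecute = false
})!;

await foreach (var snippet in client.Tts.SynthesizeJsonStreamingAsync(new SynthesizeJsonRequest
{
    Utterances = new List<PostedUtterance>
    {
        new PostedUtterance { Text = "Playing back audio as soon as each chunk arrives." }
    }
}))
{
    // Decode each chunk and push it straight into the player's stdin.
    var bytes = Convert.FromBase64String(snippet.Audio);
    await player.StandardInput.BaseStream.WriteAsync(bytes);
    await player.StandardInput.BaseStream.FlushAsync();
}

player.StandardInput.Close(); // signal end-of-stream so ffplay can exit
await player.WaitForExitAsync();
```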