TTS Python Quickstart Guide
Step-by-step guide for integrating the TTS API using Hume’s Python SDK.
This guide shows how to get started with Hume’s Text-to-Speech (TTS) capabilities using Hume’s Python SDK. It demonstrates:
- Converting text to speech with a new voice.
- Saving a voice to your voice library for future use.
- Giving “acting instructions” to modulate the voice.
- Generating multiple variations of the same text at once.
- Providing context to maintain consistency across multiple generations.
The complete code for the example in this guide is available on GitHub.
Environment Setup
Set up a Python virtual environment, using either uv or venv, and install the required packages, including the Hume Python SDK and python-dotenv.
Authenticating the HumeClient
You must authenticate to use the Hume TTS API. Your API key can be retrieved from the Hume AI platform.
This example uses python-dotenv. Place your API key in a .env file at the root of your project.
Then create a new file app.py and use your API key to instantiate the AsyncHumeClient.
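As a minimal sketch of this step, the code below reads the key from the environment and builds the client. The variable name HUME_API_KEY and the helper names get_api_key and make_client are assumptions made for illustration; use whatever name you put in your .env file.

```python
import os


def get_api_key(env_var: str = "HUME_API_KEY") -> str:
    """Return the API key from the environment, failing loudly if it is missing."""
    # HUME_API_KEY is an assumed variable name; match it to your .env file.
    api_key = os.getenv(env_var)
    if not api_key:
        raise RuntimeError(f"{env_var} not found in environment variables.")
    return api_key


def make_client():
    """Load .env (if present) and instantiate the async client."""
    # Imported here so get_api_key() stays usable without the third-party packages.
    from dotenv import load_dotenv
    from hume import AsyncHumeClient

    load_dotenv()  # copies KEY=value pairs from a .env file into os.environ
    return AsyncHumeClient(api_key=get_api_key())
```

In app.py you would then write something like hume = make_client() once, and reuse that client for every request.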
Helper function
Define a function to aid in writing generated audio to a temporary file:
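One possible shape for such a helper, using only the standard library, is sketched here. The function name write_audio_to_temp_file, the directory prefix, and the default .wav suffix are illustrative choices rather than anything the SDK prescribes.

```python
import base64
import tempfile
from pathlib import Path

# One fresh temporary directory per run, so files from different runs don't collide.
OUTPUT_DIR = Path(tempfile.mkdtemp(prefix="hume-audio-"))


def write_audio_to_temp_file(base64_audio: str, name: str, suffix: str = ".wav") -> Path:
    """Decode base64-encoded audio and write it to OUTPUT_DIR/<name><suffix>."""
    file_path = OUTPUT_DIR / f"{name}{suffix}"
    file_path.write_bytes(base64.b64decode(base64_audio))
    print("Wrote", file_path)
    return file_path
```

The write is synchronous for simplicity; in a heavily concurrent program you might prefer an async file library instead.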
Calling Text-to-Speech
To use Hume TTS, call hume.tts.synthesize_json with a list of utterances. Inside each utterance, put the text to speak, and optionally provide a description of how the voice speaking the text should sound. If you don’t provide a description, Hume will examine the text and attempt to determine an appropriate voice.
The base64-encoded bytes of an audio file with your speech will be present at .generations[0].audio in the returned object. By default, there will only be a single variation in the .generations array, and the audio will be in wav format.
The .generations[0].generation_id field will contain an ID you can use to refer to this specific generation of speech in future requests.
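The request and the two response fields described above can be sketched as follows. The wrapper name synthesize_once is hypothetical, and the sketch assumes the SDK exposes a PostedUtterance model in hume.tts and that hume is the AsyncHumeClient created earlier.

```python
import base64
from typing import Optional, Tuple


async def synthesize_once(hume, text: str, description: Optional[str] = None) -> Tuple[bytes, str]:
    """Synthesize a single utterance and return (audio_bytes, generation_id)."""
    # Imported lazily so this module loads even where the SDK is not installed.
    from hume.tts import PostedUtterance

    fields = {"text": text}
    if description is not None:
        fields["description"] = description  # left out entirely when not provided

    response = await hume.tts.synthesize_json(utterances=[PostedUtterance(**fields)])

    # With default settings there is a single generation holding wav audio.
    generation = response.generations[0]
    return base64.b64decode(generation.audio), generation.generation_id
```

A call might look like audio, generation_id = await synthesize_once(hume, "Take an arrow from the quiver.", "A refined, British aristocrat"), made from inside an async function.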
Saving voices
Use hume.tts.voices.create to save the voice of a generated piece of audio to your voice library for future use.
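A small wrapper around that call might look like the sketch below. The name save_voice is hypothetical, and the sketch assumes the call takes the new voice's name together with the generation_id of the audio whose voice you want to keep.

```python
async def save_voice(hume, generation_id: str, name: str):
    """Save the voice behind `generation_id` to the voice library under `name`."""
    # Returns whatever the SDK returns for the newly created voice.
    return await hume.tts.voices.create(name=name, generation_id=generation_id)
```

Adding a timestamp or other unique suffix to the name is one way to avoid clashing with a voice you saved on an earlier run.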
Continuity
Inside an utterance, specify the name or ID of a voice to generate more speech from that voice.
To generate speech that is meant to follow previously generated speech, specify context with the generation_id of that speech.
You can specify a number up to 5 in num_generations to generate multiple variations of the same speech at the same time.
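The three ideas in this section (reusing a saved voice, passing context, and requesting several variations) can be combined in one request, as in this sketch. The function name continue_speech is hypothetical; the sketch assumes the SDK provides PostedUtteranceVoiceWithName and PostedContextWithGenerationId models in hume.tts, and the lower bound of 1 in the range check is an assumption.

```python
async def continue_speech(hume, text: str, voice_name: str,
                          previous_generation_id: str, num_generations: int = 2):
    """Generate `num_generations` variations of `text` that follow earlier speech."""
    # The guide states an upper limit of 5; the lower bound of 1 is assumed here.
    if not 1 <= num_generations <= 5:
        raise ValueError("num_generations must be between 1 and 5")

    # Imported lazily so this module loads even where the SDK is not installed.
    from hume.tts import (
        PostedContextWithGenerationId,
        PostedUtterance,
        PostedUtteranceVoiceWithName,
    )

    response = await hume.tts.synthesize_json(
        utterances=[
            PostedUtterance(
                text=text,
                # Reuse a voice previously saved to the voice library.
                voice=PostedUtteranceVoiceWithName(name=voice_name),
            )
        ],
        # Tell Hume which earlier generation this speech should follow.
        context=PostedContextWithGenerationId(generation_id=previous_generation_id),
        num_generations=num_generations,
    )
    return response.generations  # one entry per requested variation
```

Each returned generation carries its own audio and generation_id, so you can listen to the variations and continue from whichever one you prefer.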
Acting Instructions
If you specify both voice and description, the description field will behave as “acting instructions”. The generated speech keeps the character of the specified voice, modulated to match the description.
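To illustrate, the sketch below sends one utterance that names a saved voice and also carries a description, which Hume then treats as acting instructions. The function name speak_with_direction is hypothetical, and the sketch assumes the PostedUtterance and PostedUtteranceVoiceWithName models from hume.tts.

```python
async def speak_with_direction(hume, text: str, voice_name: str, acting_instructions: str):
    """Speak `text` in a saved voice, steered by `acting_instructions`."""
    # Imported lazily so this module loads even where the SDK is not installed.
    from hume.tts import PostedUtterance, PostedUtteranceVoiceWithName

    response = await hume.tts.synthesize_json(
        utterances=[
            PostedUtterance(
                text=text,
                voice=PostedUtteranceVoiceWithName(name=voice_name),
                # Because `voice` is also set, this acts as acting instructions
                # rather than as a description of a brand-new voice.
                description=acting_instructions,
            )
        ]
    )
    return response.generations[0]
```

For example, passing acting_instructions="Murmured softly, with a heavy dramatic flair" should keep the saved voice recognizable while changing its delivery.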