Hume MCP Server

Use Hume AI’s Octave TTS with your favorite MCP clients like Claude Desktop, Cursor, and Windsurf.

The Hume MCP Server implements the Model Context Protocol (MCP) for Hume AI’s Octave Text-To-Speech, allowing you to use MCP clients like Claude Desktop, Cursor, and Windsurf to collaborate with AI assistants on your voice projects.

What for?

If you hope to narrate a large source text, such as a book, play, or long-form video, there’s a lot more to the project than just converting the text to speech. You have to

  • Design voices
  • Break the text into pieces
  • Assign each line of dialogue to a voice
  • Separate acting instructions from spoken text

LLMs can perform some of these tasks and help you keep these efforts organized. MCP is an industry protocol that lets you easily give an AI assistant the ability to use tools like Octave TTS on your behalf.

Available Tools

The Hume MCP Server exposes the following tools to compatible MCP clients:

  • tts: Synthesize (and play) speech from text. This is the primary tool for generating speech, with optional voice selection, acting instructions, and playback control.
  • play_previous_audio: Replay previously generated audio by referencing its generation ID. Useful for comparing different versions or revisiting earlier speech samples.
  • list_voices: List all available voices in your account’s library, including both custom voices and Hume-provided preset voices.
  • save_voice: Save a generated voice to your library for reuse in future TTS requests, allowing you to build a collection of customized voices.
  • delete_voice: Remove a voice from your custom voice library when it’s no longer needed.
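
Under the hood, MCP clients invoke these tools with ordinary JSON-RPC tools/call requests. As a rough sketch (the argument names below are illustrative guesses based on parameters mentioned elsewhere on this page, not the server’s exact schema), a tts call might look like:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "tts",
    "arguments": {
      "text": "Once upon a time, in a quiet village by the sea...",
      "voiceName": "Narrator",
      "description": "Warm and unhurried, as if reading to a child."
    }
  }
}

Your MCP client constructs requests like this for you; you only describe what you want in natural language.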

Quickstart

To get started with the Hume MCP Server, you’ll need to configure your MCP client application to use it:

Add the following to the mcpServers property in your claude_desktop_config.json configuration file.

Claude Desktop Configuration

{
  "mcpServers": {
    "hume": {
      "command": "npx",
      "args": [
        "@humeai/mcp-server"
      ],
      "env": {
        "HUME_API_KEY": "<your_hume_api_key>"
      }
    }
  }
}

Prerequisites

Before you can use the Hume MCP Server, you’ll need:

  • An account and API Key from Hume AI
  • Node.js installed on your system
  • (optional) A command-line audio player
    • ffplay from FFmpeg is recommended, but the server will attempt to detect and use any of several common players

The MCP server calls Hume APIs on your behalf and will use credits from your account, incurring costs just as if you were making the API calls directly or doing Text-to-Speech through the web interface.

Source Code

The Hume MCP Server is open source. You can view and contribute to the source code in the GitHub repository.

Prompt Examples

Here are some example prompts to help you get started with the Hume MCP Server.

These examples assume that the assistant has the ability to read and write files. This is usually already the case for MCP clients like Cursor that are attached to an editor. For standalone chat apps like Claude Desktop, you can give the assistant filesystem access through the Filesystem MCP Server (https://github.com/modelcontextprotocol/servers/tree/main/src/filesystem).
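
For example, a Claude Desktop configuration that exposes both servers might look like the sketch below. The filesystem package name and the allowed directory are illustrative; check the repository linked above for the exact invocation.

{
  "mcpServers": {
    "hume": {
      "command": "npx",
      "args": ["@humeai/mcp-server"],
      "env": { "HUME_API_KEY": "<your_hume_api_key>" }
    },
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/your/projects"]
    }
  }
}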

Ask the assistant to create a voice with specific characteristics:

Create a warm, friendly female voice with a slight Irish accent that
would be good for narrating a children's story.
Produce a good voice description and sample text by asking
me questions about my desired voice qualities.
Then, give me several options and iterate based on
my feedback.

Have the assistant read content to you:

I have the text of a blog post that I'd like to listen to in my
Downloads folder. Can you read it to me in an appropriate voice?

This comprehensive prompt helps the assistant break down an audiobook chapter into segments and design appropriate voices:

<Goal>
Narrate the audiobook chapter in my text with high quality AI-generated
speech according to my artistic vision.
</Goal>
<Steps>
1. Break the text down into segments
2. Design and save a base voice for the narrator.
3. Design *variants* of the narrator voice for each character.
4. Convert the text of each segment to speech.
</Steps>
<Segmentation>
* Every line of quoted dialogue should be its own segment
* Quotation marks should be removed from segments that are solely
dialogue.
* Use the following formatting for segments
## Segment 1
voice_name: ...
text: ...
description: ...
## Segment 2
voice_name: ...
text: ...
(no description)
</Segmentation>
<ToolCalls>
ALWAYS stop to collect feedback and ask for confirmation before performing
a 'tts' tool call.
</ToolCalls>
<VoiceDesign>
* Descriptions for a new voice should be 2 sentences MAX. Sample text
should be 2 sentences MAX.
* Don't use source text for the sample text -- invent new text that is
stylized to reflect the character and emotion of the desired voice.
* To generate a variant, ALWAYS specify the base voice as `voiceName`.
* Descriptions should be VERY short and describe one or two voice
qualities (masculinity, pitch, pace) that should vary from the base
voice.
</VoiceDesign>
<Narration>
* ALWAYS use continuation and voiceName.
* Never send acting instructions "description" unless it is provided in
the script.
</Narration>
Let's get started with step 1!

This prompt explains how to create distinct character voices through a technique called “variant chaining”:

To make it sound like the narrator is "doing a voice" you have to create
a voice with more distance from the base narrator voice than you can get
by generating a single iteration of providing acting instructions to
modulate the voice. You can do this through "variant chaining".
* Start with the base voice.
* Pick one or two qualities of the voice that are different than the base
voice to emphasize in the acting instructions and source text.
* Create and save {variant_voice}_0.
* Create new acting instructions and source text, then use them to create
and save {variant_voice}_1 using {variant_voice}_0 as a base.
* Repeat until the results are satisfactory.
Oftentimes 2 variants are enough for a character of the same gender. You
might need 3 or more variants emphasizing masculinity for a character of
the opposite gender.
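
In tool-call terms, one link in that chain is a tts generation followed by a save_voice. The sketch below is only an approximation: the argument names (voiceName, description, text, generationId, name) and the voice names are illustrative, and the generation ID placeholder stands in for whatever the preceding tts call returned.

[
  {
    "name": "tts",
    "arguments": {
      "voiceName": "narrator",
      "description": "Higher pitch, brighter, noticeably faster pace.",
      "text": "Oh, you will not believe what I found behind the old mill!"
    }
  },
  {
    "name": "save_voice",
    "arguments": {
      "generationId": "<generation id returned by the tts call above>",
      "name": "village_girl_0"
    }
  }
]

Repeating the same pair with village_girl_0 as the base produces village_girl_1, and so on until the voice is distinct enough.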

Command Line Options

The Hume MCP Server accepts several command line options to customize its behavior:

Options:
--workdir, -w <path> Set working directory for audio files (default: system temp)
--(no-)embedded-audio-mode Enable/disable embedded audio mode (default: false)
--(no-)instant-mode Enable/disable instant mode (default: false) (incurs 10% additional cost)
--help, -h Show help message
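
These flags are passed to the server through the "args" array of your MCP client configuration. For example (the working directory path below is just a placeholder):

{
  "mcpServers": {
    "hume": {
      "command": "npx",
      "args": [
        "@humeai/mcp-server",
        "--workdir", "/path/to/audio-workdir"
      ],
      "env": {
        "HUME_API_KEY": "<your_hume_api_key>"
      }
    }
  }
}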

Environment Variables

You can configure the behavior of the Hume MCP Server using these environment variables:

  • HUME_API_KEY: Your Hume AI API key (required). You can obtain this from the Hume AI Platform.
  • WORKDIR: Working directory for audio files (default: OS temp directory + “/hume-tts”). This is where generated audio files will be stored.
  • EMBEDDED_AUDIO_MODE: Enable/disable embedded audio mode (default: false; set to ‘true’ to enable). Embedded audio files are a new addition to the MCP specification and most MCP client applications do not yet support them. This option can be useful if you are designing an MCP client specifically to work with Hume.
  • INSTANT_MODE: Enable/disable instant mode (default: false; set to ‘true’ to enable). Instant mode allows for faster TTS generation but incurs a 10% additional cost. This setting overrides the default instant_mode parameter sent to the TTS API.
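
Like the command-line flags, these variables can be set in the "env" block of the server entry in your MCP client configuration, for example (the working directory path is a placeholder):

"env": {
  "HUME_API_KEY": "<your_hume_api_key>",
  "WORKDIR": "/path/to/audio-workdir",
  "INSTANT_MODE": "true"
}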

Default API Parameters

The MCP Server applies several default parameters to API requests for convenience:

tts

  • strip_headers: true
    Headers and non-speech text are automatically removed from the input.
  • format.type: "wav"
    All audio is generated in WAV format for best compatibility with audio players.
  • instant_mode: true
    Instant mode is enabled by default for the TTS API (the API default is false) for faster synthesis. This default can be overridden by setting the global instant mode option through the command line flag or environment variable.

list_voices

  • page_size: 100
    Returns up to 100 voices per request (the API default is 10) to minimize pagination needs.
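
Putting these defaults together, the request body the server sends to the Hume TTS API for a plain tts call is roughly equivalent to the following. The utterances wrapper reflects the general shape of the public TTS API and is shown here as an approximation; only strip_headers, format.type, and instant_mode are the documented defaults above.

{
  "utterances": [
    { "text": "Hello from the Hume MCP Server." }
  ],
  "format": { "type": "wav" },
  "strip_headers": true,
  "instant_mode": true
}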