Roleplays: How we wrote evals for the Hume MCP Server

How we stopped YOLOing our MCP tool descriptions with role-play-based evals

By Richard Marmorstein on June 10, 2025

We recently released the Hume AI MCP Server. This means you can use Hume right from inside your AI chat, using client applications — like Cursor or Claude Desktop — that implement the Model Context Protocol (MCP). You can ask the AI to interactively help you with tasks like "can you help me narrate my audiobook?" or "can you help me design a voice for my video game character?" While it was previously possible to copy and paste between your favorite AI assistant and Hume, using Hume through MCP streamlines the process and can save you a lot of pointing and clicking.

We aren’t just chasing the MCP hype — this is an important product surface for us. We want to give users a quality experience using Hume from these tools. This blog post is about how we are approaching the problem of making Hume not just available to the MCP world, but great there.

In development we noticed that, with early versions of our server, the assistant would fail. A lot. It failed to use the right voice, it failed to correctly format the text, it failed to correctly continue the audio from a previous generation. You could get reasonable behavior if you were painfully explicit about what you wanted, basically outlining exactly the tool calls you expected to be made. Users shouldn't have to be this explicit.

We put on our prompt engineer hats. At first, we just YOLOed it. We tweaked the tool descriptions and field descriptions. Things improved somewhat, though we quickly realized that if we were serious about building a production-grade experience, we were going to need some evals.

The problem of evals: how do you capture long interactions?

There isn't much on the Internet about writing evals for MCP servers. Most literature on evals seems to assume that you are evaluating either a model you have trained, or a more traditional question/answer style of prompt.

Of course, an MCP server is still just a prompt at the end of the day; it includes your tool descriptions, schemas, and fields. You can evaluate it as you would any other prompt: send (MCP tools + chat transcript) to the LLM, examine what comes back, and score it as good or bad. One of the industry's favorite MCP servers right now is that provided by Stripe. You can take a look at their evals on GitHub. These evals look just like any other sort of evals and are able to assert the correctness of basic behaviors: essentially "if the user asks to create a customer, a customer should be created".
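For a concrete picture, a single-response eval of this kind can be quite small. Here's a minimal sketch in TypeScript using the Anthropic SDK; `humeTools` is a hypothetical array standing in for your MCP server's tool definitions, converted to the `{ name, description, input_schema }` shape the messages API expects:

```ts
import Anthropic from "@anthropic-ai/sdk";

// Hypothetical: your MCP server's tool definitions, converted to the
// shape the messages API expects.
declare const humeTools: Anthropic.Messages.Tool[];

const client = new Anthropic();

// One eval: given a single user message, does the assistant reach for tts?
async function evalSimpleTtsRequest(): Promise<boolean> {
  const response = await client.messages.create({
    model: "claude-3-5-sonnet-latest", // any tool-capable model
    max_tokens: 1024,
    tools: humeTools,
    messages: [
      { role: "user", content: "Please read this sentence aloud: Hello world." },
    ],
  });

  // Score: pass if the assistant issued a tts tool call.
  return response.content.some(
    (block) => block.type === "tool_use" && block.name === "tts",
  );
}
```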

How do you move beyond this?

For Hume's MCP server, the tasks we care about, like "can you help me narrate this blog post?", are tricky to evaluate in this way, because there are several legitimate behaviors the LLM might choose after the initial request. It could just start reading the blog post without a voice specified. It could look up which voices are available and present the user with options. It could stop to ask the user for preferences around the voice's qualities. Writing evals that capture collaborative problem-solving over the course of multiple interactions is a bit more involved.

The banality of evals: write out long interactions the hard way

One potential approach here is what I would call, for lack of a better name, single-response, postponement-tolerant, manually-extended evals. Essentially, you do this:

  • Send a chat transcript, get a single response back, and score it.
  • Score it favorably (or at least don't penalize) if the assistant decides to extend the conversation in a reasonable way, even if it doesn't attempt to produce the "solution" yet.
  • To capture how the assistant behaves after a longer back-and-forth, define additional evals that extend the original transcript with more interactions.

For example, you could specify an interaction like this:

(1) -> Hey, read me this blog post please <PASTED TEXT>
    <- Sure thing, in what kind of voice?
(2) -> The finest voice you have on offer.
    <- ToolCall(list_voices)
(3) -> ToolResponse([..., INSPIRING_WOMAN, ...])
    <- How about INSPIRING_WOMAN
(4) -> Sure, just read the post
    <- ToolCall(tts, {text}, {voiceName})
(5) -> ToolResponse(success)

This would constitute five evals. For the first eval, you'd send over just the first message ("hey, read me this blog post please"); for the last, you'd send over the entire transcript.

You'd define different scorers at different stages. After (1), you want your assistant to score well on "did the assistant either collect additional requirements in a reasonable way, or send a correct tool call to read an appropriate amount of text". After (5), you probably want something more like "did the assistant correctly send a tool call to read the text, picking up where it left off in (4), using the same voice."
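In code, a suite of these manually-extended evals ends up being a pile of transcript prefixes, each paired with a scorer for that stage. The shape below is a hypothetical sketch (the types and the scorer logic are illustrative, not tied to any particular eval framework):

```ts
// Hypothetical shapes for "manually-extended" evals: each eval replays a
// hand-written transcript prefix and scores only the assistant's next turn.
type Message = { role: "user" | "assistant" | "tool"; content: string };
type ToolCall = { name: string; input: Record<string, unknown> };
type AssistantTurn = { text: string; toolCalls: ToolCall[] };

interface TranscriptEval {
  name: string;
  transcript: Message[];                  // the conversation so far
  score: (turn: AssistantTurn) => number; // 1 = good, 0 = bad
}

const evals: TranscriptEval[] = [
  {
    name: "initial request",
    transcript: [
      { role: "user", content: "Hey, read me this blog post please <PASTED TEXT>" },
    ],
    // Postponement-tolerant: either a reasonable clarifying question or a
    // correct tts call counts as success. (A real scorer would also check
    // that the tts call reads an appropriate amount of text.)
    score: (turn) =>
      turn.toolCalls.some((call) => call.name === "tts") ||
      /voice|prefer/i.test(turn.text)
        ? 1
        : 0,
  },
  // ...four more evals, each extending the transcript by another exchange
  // (mocked tool calls and responses included), with a scorer appropriate
  // to that stage.
];
```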

The strength of this approach is that these are just good old traditional evals, compatible with the existing eval frameworks; there's a lot of nifty tooling out there that provides reporting, CI integrations, and so on.

The main problem with this approach is how labor-intensive it is. You have to write out both parts of an entire conversation, mock tool calls, mock tool responses, and define scorers for each assistant turn. You also have to carefully examine the responses you actually get and iterate quite a bit on your scorers to protect against inappropriately penalizing reasonable behavior.

My experience writing evals this way was that it felt like I was spending only 20% of my effort on fruitful, creative eval design and iterating on the prompts. The other 80% went to getting everything defined, hooked up, and fixing stupid problems with the scorers.

All that is necessary for evals to triumph: have AI role-play interactions

Another way to evaluate an MCP server — the method we currently use — is to offload the task of extending the conversation to LLMs. The way this works:

  • Give one LLM a user scenario to "role-play"
  • Have a second LLM act as the assistant, give it your MCP tools, and generate several rounds of interaction between it and the role player
  • Score the assistant throughout the entire transcript (using, perhaps, a third LLM)

Defining one of these role-play-based evals looks like this:

```jsx
const scenario = {
  roleplay: {
    name: "AI Playwright",
    tools: {
      ...(await getHumeMcpTools({
        descriptions,
        handler,
        displayUse: mockDisplayUse,
        displayResult: mockDisplayResult,
      })),
    },
    initialMessage:
      `Can you help me perform the following scene from my play with AI voices?\n\n${dialogueContent}`,
    roleplayerPrompt: `You are roleplaying a user who has written a play and wants to create audio files of the dialogue being read aloud by two distinct engaging AI voices. 

    Be unsatisfied with the initial voices the AI chooses and provide feedback. Become satisfied after a couple iterations.

    ${commonInstructions}
    
    End the roleplay when the entire text of your play has been read aloud correctly by each character.
    `,
  },
  criteria: {
    ...commonCriteria,
    ...voiceDesignCriteria,
    diarization:
      "The agent should use save_voice to create a voice for each character, and each utterance dialogue belonging to that character should always be voiced with that voice.",
    one_speaker_per_request:
      "All utterances within a single tts call should be spoken by the same character. The agent should not mix voices within a single tts call.",
    only_speech:
      "The 'text' passed to the tts tool should contain only the text meant to be spoken. It should be stripped of any stage directions, or speaker names, or section titles",
  },
  maxTurns: 35,
};

```
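Under the hood, the harness that runs one of these scenarios is just a loop: the assistant under test takes a turn with the MCP tools, the role-player responds, and when the role-play ends (or maxTurns is hit) a judge model grades the whole transcript. The sketch below uses hypothetical helpers (callAssistant, callRoleplayer, judgeTranscript) rather than our exact open-source implementation:

```ts
type Message = { role: "user" | "assistant" | "tool"; content: string };

// Hypothetical helpers, assumed for this sketch:
declare function callAssistant(
  transcript: Message[],
  tools: Record<string, unknown>,
): Promise<{ messages: Message[] }>; // assistant text, tool calls, tool results
declare function callRoleplayer(
  transcript: Message[],
  roleplayerPrompt: string,
): Promise<{ text: string; endsRoleplay: boolean }>;
declare function judgeTranscript(
  transcript: Message[],
  criteria: Record<string, string>,
): Promise<Record<string, { score: number | "n/a"; reason: string }>>;

async function runRoleplay(scenario: {
  roleplay: {
    initialMessage: string;
    roleplayerPrompt: string;
    tools: Record<string, unknown>;
  };
  criteria: Record<string, string>;
  maxTurns: number;
}) {
  const transcript: Message[] = [
    { role: "user", content: scenario.roleplay.initialMessage },
  ];

  for (let turn = 0; turn < scenario.maxTurns; turn++) {
    // The assistant under test sees the transcript plus the MCP tools.
    const assistantTurn = await callAssistant(transcript, scenario.roleplay.tools);
    transcript.push(...assistantTurn.messages);

    // The role-player is a second model acting as the user.
    const userTurn = await callRoleplayer(transcript, scenario.roleplay.roleplayerPrompt);
    if (userTurn.endsRoleplay) break;
    transcript.push({ role: "user", content: userTurn.text });
  }

  // A third model grades the entire transcript against each criterion.
  const scores = await judgeTranscript(transcript, scenario.criteria);
  return { transcript, scores };
}
```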

When we run the role-play, we get output like this:


Result
======
success

Transcript
==========
-> "Can you help me perform the following scene from my play with AI voices?

{TEXT}

<- "I'll help you perform this scene using AI voices. I'll first create voices for the MENTOR and APPRENTICE characters, and then I'll generate the audio for the scene.

Let's start by creating voices:

1. For the MENTOR (an elderly wise teacher):"
<- ToolCall(tts, { "utterances": [...] })
-> ToolResponse([ { "type": "text", "text": "Wrote audio to ..." }
])
<- "Now, I'll save this voice:"
<- ToolCall(save_voice, { "generationId": "...", "name": "mentor_voice"})
-> ToolResponse([ { "type": "text", "text": "Voice saved with name: {\"gener...d9d273\",\"name\":\"mentor_voice\"}" }
])
<- "2. For the APPRENTICE (a young, enthusiastic student):"
<- ToolCall(tts, { "utterances": [...] })
-> ToolResponse([ { "type": "text", "text": "Wrote audio to /tmp/hume-tts/4...e6d-41f4-ab64-f6e184cdd24a.wav" }
])
<- "Saving the apprentice voice:"
<- ToolCall(save_voice, { "generationId": "4fca5e53-be6d-41f4-ab64-f6e184cdd24a", "name": "apprentice_voice" })
-> ToolResponse([ { "type": "text", "text": "Voice saved with name: {\"gener...4a\",\"name\":\"apprentice_voice\"}" }
])
<- "Now, I'll generate the audio for the entire scene:"
<- ToolCall(tts, {
  "utterances": [...]
})
-> "Sounds gr8! Audio seems perfect. The voices capture the mentor-apprentice dynamic beautifully. No need 4 adjustments. Can we wrap this up? "

<- "I'm glad you're pleased-> "Nope, all done! Goal achieved. Voices r great, scene recorded. Thx! "

...

Scores
======
...

incremental_playback: 0.30
  The agent sent too many utterances in a single tts call when performing the scene. They sent 10 utterances at once instead of breaking them into smaller chunks

continuation_used_properly: n/a
  The agent did not need to use continuation parameters as they weren't reading from a continuous source text

unpronounceable_text: 1.00
  The agent properly stripped stage directions and formatting from the spoken dialogue, leaving only natural speech in the tts calls

voice_design_collaborative: 0.00
  The agent created and saved both voices without waiting for user feedback on how they sounded

voice_design_well_done: 1.00
  The agent created excellent voice descriptions with appropriate sample text. For example, the mentor voice sample demonstrated wisdom and contemplation, while the apprentice voice sample showed enthusiasm and curiosity. The descriptions included age, speaking style, and emotional qualities

diarization: 1.00
  The agent properly created and consistently used separate voices for the Mentor and Apprentice characters throughout the scene

one_speaker_per_request: 0.00
  The agent included both mentor_voice and apprentice_voice utterances in a single tts request

This transcript illustrates a role-play that went badly. The assistant — annoyingly — designed some voices, but rather than stopping for feedback after creating them, it just assumed they were great and moved on to narrating the whole scene. This is a good sign we should adjust our prompts!
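The per-criterion scores and explanations above come from the judging step. One way to implement a `judgeTranscript` helper like the one sketched earlier is to hand the rendered transcript and the criteria map to a third model and ask for JSON back; the prompt and parsing here are illustrative only:

```ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

type Message = { role: "user" | "assistant" | "tool"; content: string };

async function judgeTranscript(
  transcript: Message[],
  criteria: Record<string, string>,
): Promise<Record<string, { score: number | "n/a"; reason: string }>> {
  // Render the transcript as plain text for the judge.
  const rendered = transcript.map((m) => `${m.role}: ${m.content}`).join("\n");

  const response = await client.messages.create({
    model: "claude-3-5-sonnet-latest",
    max_tokens: 2048,
    messages: [
      {
        role: "user",
        content: [
          "Grade the assistant in the transcript below against each criterion.",
          'Reply with only JSON mapping each criterion name to {"score": <0 to 1>, "reason": "<one sentence>"}.',
          'Use "n/a" as the score when a criterion does not apply.',
          "",
          `Criteria: ${JSON.stringify(criteria, null, 2)}`,
          "",
          `Transcript:\n${rendered}`,
        ].join("\n"),
      },
    ],
  });

  // Real code would want more defensive parsing (e.g. stripping code fences).
  const text = response.content
    .flatMap((block) => (block.type === "text" ? [block.text] : []))
    .join("");
  return JSON.parse(text);
}
```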

To evaluate a change to our prompts, we run a suite of several role-plays several times, and aggregate the results. We take a look at any scores that seem to have decreased, and spot-check the transcripts of role-plays that seem to score particularly low.
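Assuming the `runRoleplay` sketch above, the suite runner can stay simple: run each scenario a few times, average each criterion, and skip criteria the judge marked "n/a":

```ts
type Scenario = Parameters<typeof runRoleplay>[0];

// Hypothetical suite runner: repeat each scenario, average per-criterion
// scores, and compare the averages across prompt revisions.
async function runSuite(scenarios: Scenario[], runsPerScenario = 3) {
  const totals: Record<string, { sum: number; count: number }> = {};

  for (const scenario of scenarios) {
    for (let run = 0; run < runsPerScenario; run++) {
      const { scores } = await runRoleplay(scenario);
      for (const [criterion, { score }] of Object.entries(scores)) {
        if (score === "n/a") continue; // criterion didn't apply this run
        const totalsForCriterion = (totals[criterion] ??= { sum: 0, count: 0 });
        totalsForCriterion.sum += score;
        totalsForCriterion.count += 1;
      }
    }
  }

  // Mean score per criterion across all runs of all scenarios.
  return Object.fromEntries(
    Object.entries(totals).map(([criterion, t]) => [criterion, t.sum / t.count]),
  );
}
```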

This approach feels more direct. If you are using the MCP server and the assistant misbehaves in some way, it is very straightforward to add a new role-play scenario that resembles the task you were trying to accomplish, and a new evaluation criterion that captures the misbehavior. You can think about your evaluations in terms of an entire interaction; you don't have to break the interaction down and define what good performance looks like at each stage.

There are weaknesses, of course.

  • Tooling: many eval frameworks (such as OpenAI’s evals framework) assume a single response and aren't compatible with scoring an entire transcript. You'll have to write some reporting logic yourself, or find an eval framework that lets you instrument an arbitrary function for scoring.
  • Cost: each scenario involves dozens of LLM calls.
  • Predictability: Since there are two LLMs interacting, one role-play session can look wildly different than another with the same prompt.

This last point is worth some elaboration. A lot of the existing practice surrounding evals comes out of ML research. In research, you care about publishing benchmarks that are objective, that focus on narrow, well-defined behaviors, that other researchers can reproduce, and that meet standards of statistical significance. You are writing your evals for the community, not just for yourself.

If you're writing evals because you want to make your MCP server better for your users, you want accuracy, of course, but you don't require research-grade unassailability. You want your evals to be a good, practical indicator of whether a change to your prompt is likely to result in better behavior for assistants using your tool, and what areas of your prompt might be worth paying attention to next. Looking through the transcript of a role-play and forming a qualitative sense of what goes right and what goes wrong can be just as useful a source of insights as comparing the quantitative scores.

For us, the role-play approach was the shortest path to actionable insights and putting our prompt tweaks on some objective basis. Assistants equipped with our MCP server now seem to be much more likely to use the tools in the right way, although they’re still not perfect, and our evals still have huge room for improvement and iteration. Our MCP server and our evals are open source, so if this approach sounds useful to you, you should definitely check it out!
