
Over the last year, we’ve seen a clear shift across frontier labs. As models become more capable, teams are moving toward voice-first applications: assistants, devices, creative tools, and enterprise systems where speech is the primary interface. That shift is underway at places like Amazon, Google, Meta, OpenAI, and Anthropic.
What’s changed isn’t just where voice is used, but what it represents. Voice is no longer a feature layered on top of an intelligent system. It’s becoming a foundational modality through which models reason, interact, and are judged by users.
Building and harnessing a voice model with these more general capabilities is a new technical challenge. The solution turns out to be less about clever architectures and more about data, evaluations, and control. Voice carries layers of information that don’t exist in text: emotion, prosody, timing, accent, audio quality, and conversational context. None of these are binary, and none are easy to measure with standard benchmarks.
When we speak with research teams, a consistent pattern emerges: roughly 80% of their effort goes into curating the right datasets and pipelines to train and evaluate their models. That is a challenge Hume is uniquely positioned to help address.
How Hume Got Here
Hume was early to voice AI. We built our own speech-language models, shipped them to real users, and lived through the failure modes firsthand. That work forced us to build infrastructure that didn’t exist: ways to collect, label, evaluate, and iteratively improve voice systems at scale.
Over time, it became clear that this infrastructure was valuable well beyond our own model training needs.
Today, we make it available to frontier labs and AI-first enterprises building voice models worthy of driving the core interface of an application. We focus on one thing: voice and emotion. Nothing else.
What We Provide
Expression Understanding & Voice Modulation
Voice models need to understand and generate the right tune, rhythm, and timbre of speech, not just words. We provide datasets designed for expressive control in text-to-speech and speech-to-speech systems: anxious, calm, excited, whispering, scratchy, confident, and more. Our corpus is tagged with 200+ emotions and 400+ voice characteristics, enabling precise curation rather than brute-force scale.
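To make that curation step concrete, here is a minimal sketch of how a corpus tagged this way could be filtered for targeted training data. The record fields and tag names are illustrative assumptions, not Hume's actual schema.

```python
# Hypothetical sketch: filtering an expression-tagged corpus for targeted TTS curation.
# The Utterance schema and tag names are illustrative, not Hume's actual format.
from dataclasses import dataclass, field


@dataclass
class Utterance:
    audio_path: str
    transcript: str
    emotions: set[str] = field(default_factory=set)      # e.g. {"anxious", "calm"}
    voice_traits: set[str] = field(default_factory=set)  # e.g. {"whispering", "scratchy"}


def select(corpus: list[Utterance], emotions: set[str], traits: set[str]) -> list[Utterance]:
    """Return utterances carrying every requested emotion and voice-characteristic tag."""
    return [u for u in corpus if emotions <= u.emotions and traits <= u.voice_traits]


corpus = [
    Utterance("a.wav", "I can help with that.", {"calm", "confident"}, {"warm"}),
    Utterance("b.wav", "Wait, what was that?", {"anxious"}, {"whispering"}),
]
calm_confident = select(corpus, emotions={"calm", "confident"}, traits=set())
print(len(calm_confident))  # -> 1
```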
Voice Design
Distinct from modulation, voice design enables models to create voices from descriptions. This is critical for gaming, entertainment, audiobooks, narration, and brand identity, where teams want voices that are consistent, distinctive, and controllable without recording and cloning actors. Some teams need a handful of preset voices; others need thousands. We support both.
Industry-Specific Conversational Data
Voice models behave differently depending on context. Medical conversations aren’t customer service calls. Education isn’t entertainment. We provide well-labeled, human-verified conversational data for domains where nuance matters most, including medical audio, education and tutoring, user and market research interviews, and customer service.
Reliability & Safety for Voice
Voice models have unique failure modes. Common triggers include alphanumeric sequences like booking reference numbers, cross-lingual speech (for example, Singlish or Malay-English mixes), and sudden mispronunciations mid-sentence. Our approach combines an active science team with feedback from real-world usage, addressing these issues through targeted pre- and post-training datasets. One result: after targeted training, our latest model achieved a 0% error rate on phone number reproduction, which is surprisingly rare among frontier models.
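As an illustration of how this kind of failure can be measured, the sketch below computes an alphanumeric reproduction error rate by comparing reference strings against transcripts of a model's spoken output. The helper names and example data are hypothetical; this is not our internal harness.

```python
# Illustrative check of digit-sequence reproduction, assuming transcripts of the model's
# spoken output are already available (e.g. from an ASR pass over the generated audio).
import re


def digits(text: str) -> str:
    """Strip everything but digits so formatting differences (dashes, spaces) compare cleanly."""
    return re.sub(r"\D", "", text)


def reproduction_error_rate(references: list[str], transcripts: list[str]) -> float:
    """Fraction of samples where the spoken digit sequence does not match the reference."""
    errors = sum(digits(r) != digits(t) for r, t in zip(references, transcripts))
    return errors / len(references)


refs = ["Your booking reference is 4 8 2 9 1 7.", "Call 555-0134."]
hyps = ["Your booking reference is 482917.", "Call 555 0134."]
print(reproduction_error_rate(refs, hyps))  # -> 0.0
```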
Annotation Pipelines for Speech Data
Frontier labs already generate and store vast amounts of speech data from real-world usage of their applications. What’s often missing is the ability to quickly turn that raw audio into usable signals. Hume specializes in collecting high-quality annotations and human ratings, such as naturalness, emotion, and quality. Because our workflows and platform are already in place, teams can move from question to insight, with results returned in as little as four hours for hundreds of thousands of audio samples.
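As a rough picture of what those signals look like once collected, here is a minimal sketch that averages per-rater scores into per-sample summaries. The field names are illustrative assumptions, not our actual annotation schema.

```python
# Minimal sketch: aggregating human ratings on speech samples into per-sample signals.
# Field names (sample_id, naturalness, etc.) are illustrative, not a Hume schema.
from collections import defaultdict
from statistics import mean

ratings = [
    {"sample_id": "utt_001", "rater": "r1", "naturalness": 4, "emotion_match": 5, "audio_quality": 4},
    {"sample_id": "utt_001", "rater": "r2", "naturalness": 5, "emotion_match": 4, "audio_quality": 4},
    {"sample_id": "utt_002", "rater": "r1", "naturalness": 2, "emotion_match": 3, "audio_quality": 5},
]


def aggregate(rows, dims=("naturalness", "emotion_match", "audio_quality")):
    """Average each rating dimension across raters, per sample (a simple MOS-style summary)."""
    by_sample = defaultdict(list)
    for row in rows:
        by_sample[row["sample_id"]].append(row)
    return {
        sid: {dim: mean(r[dim] for r in sample_rows) for dim in dims}
        for sid, sample_rows in by_sample.items()
    }


print(aggregate(ratings)["utt_001"]["naturalness"])  # -> 4.5
```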
Evaluation & RL “Gym” for Voice Models
Improving a voice model requires knowing what better sounds like. Hume’s evaluation and reinforcement learning voice gym provides structured prompts across real-world use cases—such as customer service, medical interactions, and education—which allow labs to collect the right human preference data. These signals power evaluation, fine-tuning, and reinforcement learning, helping models learn which behaviors to reinforce and which to avoid.
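For a concrete picture, the sketch below shows one common way pairwise preference data of this kind can be structured and scored with a Bradley-Terry style objective for reward modeling. The records and reward values are placeholders, not outputs from our gym.

```python
# Hedged sketch: pairwise preferences over two spoken responses to the same prompt,
# scored with a Bradley-Terry style negative log-likelihood for reward modeling.
import math

preferences = [
    # prompt_id, audio the rater preferred, audio the rater rejected
    {"prompt_id": "cs_0012", "chosen": "resp_a.wav", "rejected": "resp_b.wav"},
    {"prompt_id": "med_0044", "chosen": "resp_c.wav", "rejected": "resp_d.wav"},
]

# Placeholder rewards a model might assign to each response (normally from a learned network).
reward = {"resp_a.wav": 1.2, "resp_b.wav": 0.4, "resp_c.wav": 0.1, "resp_d.wav": 0.9}


def bradley_terry_loss(prefs, reward_by_id):
    """Average negative log-likelihood that the chosen response outranks the rejected one."""
    total = 0.0
    for p in prefs:
        margin = reward_by_id[p["chosen"]] - reward_by_id[p["rejected"]]
        total += -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
    return total / len(prefs)


print(round(bradley_terry_loss(preferences, reward), 3))
```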
Why Teams Choose Hume
We only do voice and emotion. Our work is grounded in affective science, with psychologically controlled data collection and labeling. That focus lets us move fast, delivering research-grade datasets in days, not months.
Our infrastructure scales, with a proprietary query system that searches across emotional, acoustic, and contextual dimensions. Teams can choose fully human-labeled data or automated pipelines that reach up to 95% accuracy on their own, with human verification layered on to finalize results. This increases iteration speed and reduces operational overhead.
As an independent research partner, Hume works with frontier labs and enterprises to help them move faster by supplying the parts of voice systems that are hardest to do well internally.
As AI becomes more conversational and more human-facing, voice isn’t a feature—it’s the primary interface. And emotion is the difference. That’s the layer we’re building.

Why Voice Needs Its Own Evaluation & RL Infrastructure
Voice is shaped by emotion, noise, and human perception, which change both what speech means and how it fails. Speech is continuous and contextual, judged by people in the moment rather than against binary correctness. As voice becomes a primary AI interface, text-based evaluation and RL break down. Hume provides the data and research infrastructure to build voice AI that is realistic, expressive, and reliable.
Meaning shifts with tone: The same sentence can sound anxious, confident, sarcastic, or calm.
Real audio is messy: Accents, code-switching, noise, and imperfect microphones.
Evaluation is subjective: “Better” is often preference, not correctness.
Failures are different: Voice models mispronounce, derail, or generate unrelated speech.


