Model Architecture

Leapfrog years of R&D. Access the architectures behind leading voice models.

Don't solve every problem from scratch. At Hume, we believe the best voice AI is built through collective effort.

TADA

Low latency, low hallucination, open source

Deploy TADA (Text and Audio Dual Alignment), a TTS system that models text and audio in a single synchronized stream, reducing token-level hallucinations and improving latency.
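
To make the dual-alignment idea concrete, here is a toy sketch; the token strings and latent names are illustrative assumptions, not TADA's actual vocabulary or codec:

```python
# Illustrative only: token strings and latent names are made up;
# they are not TADA's actual vocabulary or acoustic codec.

# Conventional LLM-based TTS: a short text prompt followed by a much
# longer, fixed-frame-rate run of acoustic tokens, asynchronous with
# the text that produced it.
conventional = ["Hi", " there", "."] + [f"<audio_{i}>" for i in range(90)]

# TADA: one acoustic latent per text token in a single synchronized
# stream, so sequences stay text-length and the transcript comes free.
tada = [("Hi", "<latent_0>"), (" there", "<latent_1>"), (".", "<latent_2>")]
```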

Zero hallucinations

Zero content hallucinations across 1,000+ test samples.

5x faster

5x faster than similar-grade LLM-based TTS systems.

Long-form audio

2,048 tokens cover ~700 seconds of audio with TADA vs. ~70 seconds in conventional systems; the arithmetic is worked through in the sketch after this list.

Free transcript

Get a transcript alongside audio with no added latency.

MLX support

Run locally on Apple Silicon with optimized MLX inference.

Open source

Fully open weights and architecture for research and production.
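
A quick check of the long-form numbers quoted above, using this page's own figures (the hour-long example is an illustrative assumption):

```python
# Seconds of audio per token, from the figures quoted above.
tada_sec_per_token = 700 / 2048   # ~0.34 s of audio per token
conv_sec_per_token = 70 / 2048    # ~0.034 s per token (~29 Hz frames)

# Token budget for an hour of narration (the hour is an illustrative
# assumption, not a figure from this page):
hour_s = 3600
print(round(hour_s / tada_sec_per_token))  # ~10,500 tokens with TADA
print(round(hour_s / conv_sec_per_token))  # ~105,000 tokens conventionally
```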

EVI

Empathic speech-to-speech with contextual understanding

Access EVI (Empathic Voice Interface), a speech-to-speech system that understands user speech prosody, generates language natively, and supports customizable voices.

Emotion instruction

Emotion instruction following and unparalleled naturalness.

Voice design

Voice cloning and voice design support.

Natural turn-taking

Interruptibility and backchanneling.

Tool use

Tool use and dynamic variables for agentic workflows; see the sketch after this list.

Context injection

Context injection and external LLM compatibility.

Multilingual

Native language generation across a growing list of languages.
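
To show where tool use plugs into a session, here is a hypothetical client-side handler; the event types and field names ("tool_call", "tool_response", "get_weather") are assumptions for illustration, not EVI's published message schema:

```python
import json

# Hypothetical tool registry; "get_weather" and its signature are
# assumptions for illustration, not part of EVI's published schema.
TOOLS = {
    "get_weather": lambda city: f"Sunny and 22°C in {city}",
}

def handle_event(event: dict) -> dict | None:
    """Respond to a tool-call event from a speech-to-speech session."""
    if event.get("type") != "tool_call":
        return None
    args = json.loads(event["arguments"])
    result = TOOLS[event["name"]](**args)
    # Return the tool result so the model can speak it back to the user.
    return {"type": "tool_response", "tool_call_id": event["id"], "content": result}

# Example event (shape assumed for illustration):
print(handle_event({
    "type": "tool_call",
    "id": "call_1",
    "name": "get_weather",
    "arguments": json.dumps({"city": "Tokyo"}),
}))
```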

Octave

Low latency TTS with voice design and expression modulation

Use Octave (Omni Capable Text and Voice Engine), an LLM-based TTS system with voice design, voice modulation, voice cloning, voice conversion, and more.

Multispeaker

Multispeaker and multilingual synthesis in a single model.

Voice design

Infinite voices through natural language voice descriptions; see the sketch after this list.

Creator platform

Purpose-built for audiobooks and podcasts.

Voice cloning

Clone any voice from a short audio sample.

Expression modulation

Fine-grained control over emotion and delivery style.

Low latency

Streaming output with fast time-to-first-byte for real-time use.
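
As an example of how voice design and expression modulation might be expressed, here is a hypothetical request payload; every field name is an illustrative assumption, not Octave's actual API surface:

```python
# Hypothetical payload: field names are illustrative assumptions,
# not Octave's actual API surface.
request = {
    "text": "Welcome back. Chapter two begins on a cold winter morning.",
    # Voice design: a natural-language description instead of a fixed
    # catalog voice.
    "voice_description": (
        "a warm, unhurried narrator in her fifties with a slight "
        "British accent, reading an audiobook"
    ),
    # Expression modulation: per-utterance delivery instructions.
    "delivery": "hushed and conspiratorial, as if not to wake someone",
    # Low latency: stream audio as it is generated.
    "stream": True,
}
```

The design choice this illustrates is that a free-text description replaces selection from a fixed voice catalog, which is what makes "infinite voices" possible.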

From Our Lab

Peer-reviewed insights

arXiv·Feb 2026

TADA: A Generative Framework for Speech Modeling via Text-Acoustic Dual Alignment (Under Review)

Trung Dang, Sharath Rao, Ananya Gupta and 6 more

Modern Text-to-Speech (TTS) systems increasingly leverage Large Language Model (LLM) architectures to achieve scalable, high-fidelity, zero-shot generation. However, these systems typically rely on fixed-frame-rate acoustic tokenization, resulting in speech sequences that are significantly longer than, and asynchronous with, their corresponding text. Beyond computational inefficiency, this sequence length disparity often triggers hallucinations in TTS and amplifies the modality gap in spoken language modeling (SLM). In this paper, we propose a novel tokenization scheme that establishes one-to-one synchronization between continuous acoustic features and text tokens, enabling unified, single-stream modeling within an LLM. We demonstrate that these synchronous tokens maintain high-fidelity audio reconstruction and can be effectively modeled in a latent space by a large language model with a flow matching head. Moreover, the ability to seamlessly toggle speech modality within the context enables text-only guidance, a technique that blends logits from text-only and text-speech modes to flexibly bridge the gap toward text-only LLM intelligence. Experimental results indicate that our approach achieves performance competitive with state-of-the-art TTS and SLM systems while virtually eliminating content hallucinations and preserving linguistic integrity, all at a significantly reduced inference cost.
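
The text-only guidance step lends itself to a short sketch; the linear blend and the weight value are assumptions, since the abstract does not give the exact rule:

```python
import numpy as np

def text_only_guidance(speech_mode_logits: np.ndarray,
                       text_mode_logits: np.ndarray,
                       alpha: float = 0.3) -> np.ndarray:
    """Blend next-token logits from the text-speech and text-only modes.

    The linear blend and alpha=0.3 are illustrative assumptions; the
    paper's exact guidance rule is not specified in the abstract.
    """
    return (1.0 - alpha) * speech_mode_logits + alpha * text_mode_logits

# Toy example over a 4-token vocabulary:
blended = text_only_guidance(np.array([2.0, 0.5, 0.1, -1.0]),
                             np.array([1.0, 1.5, 0.2, -0.5]))
probs = np.exp(blended) / np.exp(blended).sum()  # softmax over the blend
```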

Frontiers in Psychology·May 2024

How emotion is experienced and expressed in multiple cultures: a large-scale experiment across North America, Europe, and Japan

Alan Cowen, Jeffrey Brooks, Gautam Prasad and 13 more

Core to understanding emotion are subjective experiences and their expression in facial behavior. Past studies have largely focused on six emotions and prototypical facial poses, reflecting limitations in scale and narrow assumptions about the variety of emotions and their patterns of expression.

iScience·Feb 2024

Deep learning reveals what facial expressions mean to people in different cultures

Jeffrey Brooks, Lauren Kim, Michael Opara and 10 more

Cross-cultural studies of the meaning of facial expressions have largely focused on judgments of small sets of stereotypical images by small numbers of people. Here, we used large-scale data collection and machine learning to map what facial expressions convey in six countries.

Get Started with Hume Today

Build, train, and evaluate your voice AI models with us. Reach out to get started.

Stay in the loop

Get the latest on empathic AI research, product updates, and company news.

Join the community

Connect with other developers, share projects, and get help from the team.

Join our Discord