
The future of voice AI hinges on speech that sounds natural and expressive, is generated quickly, and is free of quirks like hallucinated words or skipped content. Today's LLM-based TTS systems are forced to choose between speed, quality, and reliability because of a fundamental mismatch between how text and audio are represented inside language models.
TADA (Text-Acoustic Dual Alignment) resolves that mismatch with a novel tokenization schema that synchronizes text and speech one-to-one. The result: the fastest LLM-based TTS system available, with competitive voice quality, virtually zero content hallucinations, and a footprint light enough for on-device deployment.
Hume AI is open-sourcing TADA to accelerate progress toward efficient, reliable voice generation. Code and pre-trained models are available now.
Approach
For every second of spoken audio, the acoustic signal carries far more information than the corresponding text. A second of audio might correspond to only 2–3 text tokens but 12.5–25 acoustic frames. This mismatch means LLM-based TTS systems must manage sequences where audio tokens vastly outnumber text tokens, leading to longer context windows, higher memory consumption, slower inference, and more opportunities for the model to lose track of what it's supposed to say.
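To put rough numbers on that, here is a back-of-the-envelope comparison for one minute of speech, using the per-second rates quoted above (the exact values depend on the text tokenizer and audio codec):

```python
# Illustrative sequence-length comparison for one minute of speech,
# using the approximate per-second rates quoted above.
seconds = 60
text_tokens_per_sec = 2.5     # roughly 2-3 text tokens per second of speech
audio_frames_per_sec = 25.0   # roughly 12.5-25 acoustic frames per second

text_len = int(seconds * text_tokens_per_sec)    # ~150 tokens
audio_len = int(seconds * audio_frames_per_sec)  # ~1500 frames

print(f"text tokens per minute:     {text_len}")
print(f"acoustic frames per minute: {audio_len}")
print(f"audio-to-text ratio:        {audio_len / text_len:.0f}x")
```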
Most existing systems address this by reducing audio frame rates or introducing intermediate "semantic" tokens between text and audio. Both approaches introduce their own tradeoffs: degraded expressiveness, added complexity, or both.
TADA takes a different path. Instead of compressing audio into fewer fixed-rate frames of discrete audio tokens, we align audio representations directly to text tokens — one continuous acoustic vector per text token. This creates a single, synchronized stream where text and speech move in lockstep through the language model.
For input audio, an encoder paired with an aligner extracts acoustic features from the audio segment corresponding to each text token. For output audio, the LLM's final hidden state serves as a conditioning vector for a flow-matching head, which generates acoustic features that are then decoded into audio and fed back into the model.
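As a rough sketch of the output path this describes (the function and component names below are placeholders for illustration, not the released API):

```python
import numpy as np

def generate_speech(llm_step, flow_sample, decode, text_tokens):
    """Illustrative TADA-style synchronized decoding loop.

    llm_step, flow_sample, and decode stand in for the language model,
    the flow-matching head, and the acoustic decoder described above;
    they are placeholders, not the released interfaces.
    """
    audio_chunks = []
    prev_acoustic = None  # acoustic vector fed back from the previous step
    for token in text_tokens:
        # One LLM step consumes one text token plus the previous step's
        # acoustic vector and yields a final hidden state.
        hidden = llm_step(token, prev_acoustic)
        # The hidden state conditions the flow-matching head, which samples
        # exactly one continuous acoustic vector for this token.
        acoustic = flow_sample(hidden)
        # Decode the acoustic vector to a waveform chunk and feed it back.
        audio_chunks.append(decode(acoustic))
        prev_acoustic = acoustic
    return np.concatenate(audio_chunks)
```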
Since each LLM step corresponds to exactly one text token and one audio frame, TADA generates speech faster and with less computational effort. And because the architecture enforces a strict one-to-one mapping between text and audio, the model cannot skip or hallucinate content by construction.
Evaluation
Speed
TADA generates speech at a real-time factor (RTF) of 0.09, meaning it needs about 0.09 seconds of compute to produce one second of audio. That is more than 5x faster than similar-grade LLM-based TTS systems. This speed is possible because TADA operates at just 2–3 frames (tokens) per second of audio, compared to 12.5–75 tokens per second in other approaches.
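For reference, RTF is wall-clock generation time divided by the duration of the audio produced. A minimal way to measure it is sketched below; this is illustrative rather than our benchmarking harness, and the 24 kHz sample rate is an assumption:

```python
import time

def real_time_factor(synthesize, text, sample_rate=24_000):
    # RTF = generation time / audio duration; values below 1.0 are faster
    # than real time. `synthesize` is any TTS callable returning a waveform.
    start = time.perf_counter()
    waveform = synthesize(text)
    elapsed = time.perf_counter() - start
    return elapsed / (len(waveform) / sample_rate)

# An RTF of 0.09 corresponds to ~11x real time: a 60-second clip in ~5.4 seconds.
```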
Hallucination
Our model was trained on large-scale, in-the-wild data without post-training, yet achieves the same reliability as models trained on smaller curated datasets. We measured hallucination rate by flagging any sample with a character error rate (CER) above 0.15, a threshold that captures unintelligible speech, skipped text, and inserted content. Across the 1,000+ test samples from LibriTTS-R, TADA produced zero hallucinations.
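In outline, the flagging works like this. The sketch below computes CER with the open-source jiwer package and leaves the ASR model that transcribes the synthesized audio abstract; both choices are for illustration only and are not part of the release:

```python
from jiwer import cer  # character error rate between reference text and ASR transcript

CER_THRESHOLD = 0.15  # the hallucination threshold quoted above

def is_hallucination(reference_text, transcript):
    # A synthesized sample is flagged when the ASR transcript of the generated
    # audio diverges from the input text by more than the CER threshold,
    # which catches unintelligible speech, skipped text, and inserted content.
    return cer(reference_text, transcript) > CER_THRESHOLD
```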
Voice Quality
In human evaluation on expressive, long-form speech (EARS dataset), TADA scored 4.18/5.0 on speaker similarity and 3.78/5.0 on naturalness, placing second overall — ahead of several systems trained on significantly more data.
Potential Applications
On-device deployment: TADA is lightweight enough to run on mobile phones and edge devices without requiring cloud inference. For device manufacturers and app developers building voice interfaces, this means lower latency, better privacy, and no API dependency.
Long-form and conversational speech: TADA's synchronous tokenization is dramatically more context-efficient than existing approaches. Where a conventional system exhausts a 2048-token context window in about 70 seconds of audio, TADA can accommodate roughly 700 seconds in the same budget. This opens the door to long-form narration, extended dialogue, and multi-turn voice interactions.
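The arithmetic behind those figures, using illustrative mid-range token rates from the sections above:

```python
context_window = 2048  # tokens

conventional_rate = 30.0  # frame-level tokens per second of audio (12.5-75 range)
tada_rate = 2.9           # text-synchronized tokens per second of speech (~2-3)

print(context_window / conventional_rate)  # ~68 seconds of audio before the window fills
print(context_window / tada_rate)          # ~706 seconds, roughly a 10x longer budget
```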
Production reliability: Zero hallucinations in our tests suggest fewer edge cases to catch, fewer customer complaints, and less post-processing overhead in the product. This makes TADA well-suited for deploying voice in regulated or sensitive environments like healthcare, finance, and education.
Limitations and Future Work
Long-form degradation: While the model supports more than 10 minutes of context, we observed occasional speaker drift during long generations. Our online rejection sampling strategy reduces this significantly, but it is not fully resolved; resetting the context is an interim workaround.
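For context, an online rejection-sampling loop of this kind generally has the following shape. This is a sketch under assumptions: the speaker-embedding model, the 0.80 similarity threshold, and the chunk granularity are illustrative, not the released implementation:

```python
import numpy as np

def generate_with_rejection(generate_chunk, embed_speaker, reference_audio,
                            threshold=0.80, max_attempts=3):
    # Re-sample a generated chunk whenever its speaker embedding drifts too far
    # from the reference voice; keep the best attempt as a fallback.
    ref = embed_speaker(reference_audio)
    best_chunk, best_sim = None, -1.0
    for _ in range(max_attempts):
        chunk = generate_chunk()  # one stochastic generation of the next chunk
        emb = embed_speaker(chunk)
        sim = float(np.dot(ref, emb) / (np.linalg.norm(ref) * np.linalg.norm(emb)))
        if sim >= threshold:
            return chunk          # accept: speaker is still consistent
        if sim > best_sim:
            best_chunk, best_sim = chunk, sim
    return best_chunk
```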
The modality gap: When the model generates text alongside speech, language quality drops relative to text-only mode. We introduce Speech Free Guidance (SFG), a technique that blends logits from text-only and text-speech inference modes to help close this gap, but more work is required.
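One plausible reading of "blends logits" is a simple interpolation between the two inference modes. The sketch below is that reading only; the exact SFG rule and the blend weight are defined in the paper, not here:

```python
def speech_free_guidance(logits_with_speech, logits_text_only, alpha=0.5):
    # Hypothetical linear blend of next-token logits from the text-speech and
    # text-only modes; the released SFG formulation may differ.
    return (1.0 - alpha) * logits_with_speech + alpha * logits_text_only
```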
Use-cases: The model is only pre-trained on speech continuation; further fine-tuning is required for assistant scenarios. Get in touch to inquire about Hume's extensive library of fine-tuning data.
Scale: The current release covers English and seven additional languages, so there's clear room to expand. We're training larger models with broader language coverage using Hume AI data.
We're releasing TADA because we believe this architecture opens a productive direction for the field, and we want to accelerate progress. We invite researchers and developers to build on this work — whether that means extending the tokenizer to new modalities, solving the long-context problem, or adapting the framework for new applications.
Get Started
TADA is available now under an open-source license. We're releasing 1B and 3B parameter Llama-based models and the full audio tokenizer and decoder.
1B (English): huggingface.co/HumeAI/tada-1b
3B (multilingual): huggingface.co/HumeAI/tada-3b-ml
Demo: huggingface.co/spaces/HumeAI/tada
GitHub: github.com/HumeAI/tada
arXiv: https://arxiv.org/abs/2602.23068
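Loading and inference entry points live in the GitHub repo; as a minimal starting point, the checkpoints can be fetched with the standard Hugging Face Hub client using the repo IDs listed above:

```python
from huggingface_hub import snapshot_download

# Download the released checkpoints locally; see the GitHub repo for the
# actual loading and inference code.
path_1b = snapshot_download(repo_id="HumeAI/tada-1b")     # English
path_3b = snapshot_download(repo_id="HumeAI/tada-3b-ml")  # multilingual
```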
Hume builds voice AI research infrastructure for frontier labs and AI-first enterprises. If you're working on voice models and need high-quality training data, evaluation systems, or reinforcement learning infrastructure, get in touch at hello@hume.ai.


