Octave 2: next-generation multilingual voice AI
Published on October 1, 2025
Today we’re launching Octave 2, the second generation of our frontier voice AI model for text-to-speech. We just made a preview of Octave 2 available on our platform and through our API.
Octave 2:
- More deeply understands the emotional tone of speech.
- Extends our text-to-speech system to 11 languages.
- Is 40% faster and more efficient, generating audio in under 200ms.
- Offers new first-of-their-kind features for a speech-language model, including voice conversion and direct phoneme editing.
- Pronounces uncommon words, repeated words, numbers, and symbols more reliably.
- Is half the price of Octave 1.
A speech-language model is a state-of-the-art AI model trained to understand and synthesize both language and speech. Unlike traditional TTS models, it understands how the script informs the tune, rhythm, and timbre of a performance, inferring when to whisper secrets, shout triumphantly, or calmly explain a fact. Understanding these aspects of speech also allows it to reproduce the personality, and not just the vocal timbre, of any speaker.
With Octave 2, we’ve taken these capabilities a step further.
Hyperrealistic voice AI in 11 languages
Octave 2 extends our next-generation voice AI to 11 languages: Arabic, English, French, German, Hindi, Italian, Japanese, Korean, Portuguese, Russian, and Spanish.
Example Japanese-language generation from Octave 2:
Example Korean-language generation from Octave 2:
Both of these voices were created with instant cloning, each using a 15-second audio recording of a native speaker's voice. When used to generate speech in a different language, Octave 2 predicts the speaker's accent. For instance, this is an English-language sample generated using the Japanese voice.
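To make the workflow concrete, here's a minimal sketch of what an instant-clone-then-generate flow could look like over HTTP. The base URL, endpoint paths, field names, and response shape below are illustrative placeholders, not our documented API.

```python
# A hypothetical instant-clone + cross-language generation flow over HTTP.
# The base URL, endpoint paths, field names, and response shape are
# illustrative placeholders, not our documented API.
import requests

API_KEY = "your-api-key"                  # assumed key-based auth header
BASE_URL = "https://api.example.com/v0"   # placeholder base URL

# 1. Create an instant voice clone from ~15 seconds of reference audio.
with open("native_speaker_ja.wav", "rb") as f:
    clone_resp = requests.post(
        f"{BASE_URL}/voices/instant-clone",
        headers={"X-API-Key": API_KEY},
        files={"audio": f},
        data={"name": "ja-narrator"},
    )
voice_id = clone_resp.json()["voice_id"]  # assumed response field

# 2. Generate English speech with the Japanese-cloned voice; Octave 2
#    predicts the speaker's accent in the target language.
tts_resp = requests.post(
    f"{BASE_URL}/tts",
    headers={"X-API-Key": API_KEY},
    json={
        "voice_id": voice_id,
        "language": "en",
        "text": "Welcome back. Shall we pick up where we left off?",
    },
)
with open("english_sample.wav", "wb") as out:
    out.write(tts_resp.content)
```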
Octave 2 is becoming proficient in other languages, too; we will be announcing support for at least 20 languages in the coming months.
High quality at low latency
Octave 2 is the fastest and most efficient model of its kind, returning responses in under 200ms.
This was achieved without trading quality for latency. Instead, we deployed Octave 2 on some of the world’s most advanced chips for LLM inference. Working closely with SambaNova, we developed a new inference stack specific to Octave 2’s new speech-language model architecture.
Octave 2 isn’t just fast; it’s also efficient. We’re offering it at half the price of Octave 1, and with dedicated deployments the price can drop to under a cent per minute of audio. This efficiency allows Octave 2 to power large-scale applications in entertainment, gaming, customer service, and more.
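If you want to verify time-to-first-audio in your own environment, one simple check is to time the gap between sending a streaming request and receiving the first audio chunk. As above, the streaming endpoint and response format in this sketch are assumptions for illustration.

```python
# A hypothetical time-to-first-audio check against a streaming TTS endpoint.
# The endpoint path and chunked-audio response format are assumptions.
import time
import requests

API_KEY = "your-api-key"
BASE_URL = "https://api.example.com/v0"   # placeholder base URL

start = time.perf_counter()
with requests.post(
    f"{BASE_URL}/tts/stream",
    headers={"X-API-Key": API_KEY},
    json={"voice_id": "ja-narrator", "language": "en", "text": "Hello there."},
    stream=True,
) as resp:
    for chunk in resp.iter_content(chunk_size=4096):
        if chunk:  # first non-empty chunk ~= time to first audio
            elapsed_ms = (time.perf_counter() - start) * 1000
            print(f"time to first audio: {elapsed_ms:.0f} ms")
            break
```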
Voice conversion
With Octave 2, we’ve been working on two novel capabilities for a speech-language model: realistic voice conversion and direct phoneme editing.
With voice conversion, Octave 2 can exchange one voice for another while preserving the phonetic qualities and timing of the spoken utterance. This is ideal for use cases that require one actor to stand in for another, such as dubbing in a new language with the original actor’s voice, or making precise human touch-ups to AI voiceovers.
For instance, when we prompt the model with the following speech:
And the following target voice:
The model generates the following converted speech:
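In an integration, a voice conversion call would roughly take a source recording and a target voice and return the converted audio. The sketch below shows that shape; the endpoint and field names are illustrative assumptions, since voice conversion isn't yet available on our platform.

```python
# A hypothetical voice conversion request: swap the voice of an existing
# recording while keeping its phonetics and timing. The endpoint and field
# names are illustrative assumptions.
import requests

API_KEY = "your-api-key"
BASE_URL = "https://api.example.com/v0"   # placeholder base URL

with open("source_performance.wav", "rb") as src:
    resp = requests.post(
        f"{BASE_URL}/voice-conversion",
        headers={"X-API-Key": API_KEY},
        files={"source_audio": src},
        data={"target_voice_id": "ja-narrator"},  # the voice to convert into
    )

with open("converted_performance.wav", "wb") as out:
    out.write(resp.content)
```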
Phoneme editing
We're also exploring a new phoneme editing capability, where minute adjustments can be made to the timing and pronunciation of speech. This enables support for custom pronunciation of names, manipulation of word emphasis, and more.
For instance, take this classic film quote, recreated with voice conversion:
Using phoneme editing, we can alter the pronunciation of words in the original quote. Here's an example:
We've created a new word, "leviaso," out of the phonemes present in the original quote. This kind of granular phoneme replication and editing would have proven difficult, if not impossible, with text input alone.
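When phoneme editing reaches the API, a request will need to carry phoneme-level instructions alongside the text or audio. The payload below is a hypothetical illustration of that idea, with an assumed schema and IPA-style notation; it is not the final interface.

```python
# A hypothetical phoneme-editing request: adjust the pronunciation and timing
# of one word at the phoneme level. The payload schema and IPA-style notation
# are assumptions, not the final interface.
import requests

API_KEY = "your-api-key"
BASE_URL = "https://api.example.com/v0"   # placeholder base URL

payload = {
    "voice_id": "ja-narrator",
    "text": "The results were remarkable.",
    "phoneme_edits": [
        {
            # Replace the phonemes of one word with a custom pronunciation,
            # in the spirit of the invented word described above.
            "target_word": "remarkable",
            "phonemes": ["l", "ɛ", "v", "i", "ɑ", "s", "oʊ"],  # assumed IPA notation
            "duration_scale": 1.2,  # stretch the edited span slightly
        }
    ],
}

resp = requests.post(
    f"{BASE_URL}/tts/phoneme-edit",
    headers={"X-API-Key": API_KEY},
    json=payload,
)
with open("edited_quote.wav", "wb") as out:
    out.write(resp.content)
```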
Voice conversion and phoneme editing will be available soon on our platform.
Build conversational experiences with EVI 4 mini
Finally, we're launching EVI 4 mini, which brings all of the capabilities of Octave 2 to our speech-to-speech API. Now, you can build faster, smoother interactive experiences in 11 languages. For example, we built a translator app using EVI 4 mini with just a few voice samples and a prompt.
EVI 4 mini doesn't yet generate language on its own, so you'll need to pair it with an external LLM through our API until we launch the full version.
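Until then, the division of labor is simple: EVI 4 mini handles speech in and speech out, and your LLM of choice generates the reply text. The sketch below shows that loop with placeholder names; EviMiniSession and my_llm_complete are hypothetical, not real SDK classes.

```python
# A conceptual sketch of pairing EVI 4 mini with an external LLM.
# EviMiniSession and my_llm_complete are hypothetical placeholders, not real
# SDK names; the point is the division of labor described above.

def my_llm_complete(transcript: str) -> str:
    """Stand-in for a call to whichever external LLM you pair with EVI 4 mini."""
    return f"(reply to: {transcript})"

class EviMiniSession:
    """Placeholder for a speech-to-speech session."""

    def listen(self) -> str:
        # In a real integration this would return the user's transcribed turn.
        return input("user> ")

    def speak(self, text: str) -> None:
        # In a real integration this would synthesize and play the reply.
        print(f"evi> {text}")

session = EviMiniSession()
while True:
    user_turn = session.listen()        # EVI 4 mini: speech in -> text
    reply = my_llm_complete(user_turn)  # external LLM: text -> reply text
    session.speak(reply)                # EVI 4 mini: reply text -> speech out
```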
Access Octave 2 and EVI 4 mini
Today, we're rolling out access to Octave 2 on our text-to-speech playground and API, and to EVI 4 mini on our speech-to-speech playground and API.
Soon we'll be releasing more evaluations, more languages, and access to voice conversion and phoneme editing.
In the meantime, we're excited to see what you create!