Octave 2: next-generation multilingual voice AI
Published on October 1, 2025
Today we’re launching Octave 2, the second generation of our frontier voice AI model for text-to-speech. We just made a preview of Octave 2 available on our platform and through our API.
Octave 2:
- More deeply understands the emotional tone of speech.
- Extends our text-to-speech system to 11 languages.
- Is 40% faster and more efficient, generating audio in under 200ms.
- Offers new first-of-their-kind features for a speech-language model, including voice conversion and direct phoneme editing.
- Pronounces uncommon words, repeated words, numbers, and symbols more reliably.
- Is half the price of Octave 1.
A speech-language model is a state-of-the-art AI model trained to understand and synthesize both language and speech. Unlike traditional TTS models, it understands how the script informs the tune, rhythm, and timbre of a performance, inferring when to whisper secrets, shout triumphantly, or calmly explain a fact. Understanding these aspects of speech also allows it to reproduce the personality, and not just the vocal timbre, of any speaker.
With Octave 2, we’ve taken these capabilities a step further.
Hyperrealistic voice AI in 11 languages
Octave 2 extends our next-generation voice AI to 11 languages: Arabic, English, French, German, Hindi, Italian, Japanese, Korean, Portuguese, Russian, and Spanish.
Example Japanese-language generation from Octave 2:
Example Korean-language generation from Octave 2:
Both of these voices were created with instant cloning, each using a 15-second audio recording of a native speaker's voice. When used to generate speech in a different language, Octave 2 predicts the speaker's accent. For instance, this is an English-language sample generated using the Japanese voice.
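To make the workflow concrete, here's a minimal sketch of what an instant-clone-then-generate flow could look like over HTTP. The base URL, endpoint paths, field names, and response shape below are illustrative placeholders, not our documented API.

```python
# A hypothetical instant-clone + cross-language generation flow over HTTP.
# The base URL, endpoint paths, field names, and response shape are
# illustrative placeholders, not our documented API.
import requests

API_KEY = "your-api-key"                  # assumed key-based auth header
BASE_URL = "https://api.example.com/v0"   # placeholder base URL

# 1. Create an instant voice clone from ~15 seconds of reference audio.
with open("native_speaker_ja.wav", "rb") as f:
    clone_resp = requests.post(
        f"{BASE_URL}/voices/instant-clone",
        headers={"X-API-Key": API_KEY},
        files={"audio": f},
        data={"name": "ja-narrator"},
    )
voice_id = clone_resp.json()["voice_id"]  # assumed response field

# 2. Generate English speech with the Japanese-cloned voice; Octave 2
#    predicts the speaker's accent in the target language.
tts_resp = requests.post(
    f"{BASE_URL}/tts",
    headers={"X-API-Key": API_KEY},
    json={
        "voice_id": voice_id,
        "language": "en",
        "text": "Welcome back. Shall we pick up where we left off?",
    },
)
with open("english_sample.wav", "wb") as out:
    out.write(tts_resp.content)
```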
Octave 2 is becoming proficient in other languages, too; we will be announcing support for at least 20 languages in the coming months.
High quality at low latency
Octave 2 is the fastest and most efficient model of its kind, returning responses in under 200ms.
This was achieved without trading quality for latency. Instead, we deployed Octave 2 on some of the world’s most advanced chips for LLM inference. Working closely with SambaNova, we developed a new inference stack specific to Octave 2’s new speech-language model architecture.
Octave 2 isn’t just fast; it’s also efficient. We’re offering it at half the price of Octave 1, and with dedicated deployments the price can drop to under a cent per minute of audio. This efficiency allows Octave 2 to power large-scale applications in entertainment, gaming, customer service, and more.
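If you want to verify time-to-first-audio in your own environment, one simple check is to time the gap between sending a streaming request and receiving the first audio chunk. As above, the streaming endpoint and response format in this sketch are assumptions for illustration.

```python
# A hypothetical time-to-first-audio check against a streaming TTS endpoint.
# The endpoint path and chunked-audio response format are assumptions.
import time
import requests

API_KEY = "your-api-key"
BASE_URL = "https://api.example.com/v0"   # placeholder base URL

start = time.perf_counter()
with requests.post(
    f"{BASE_URL}/tts/stream",
    headers={"X-API-Key": API_KEY},
    json={"voice_id": "ja-narrator", "language": "en", "text": "Hello there."},
    stream=True,
) as resp:
    for chunk in resp.iter_content(chunk_size=4096):
        if chunk:  # first non-empty chunk ~= time to first audio
            elapsed_ms = (time.perf_counter() - start) * 1000
            print(f"time to first audio: {elapsed_ms:.0f} ms")
            break
```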
Voice conversion
With Octave 2, we’ve been working on two novel capabilities for a speech-language model: realistic voice conversion and direct phoneme editing.
With voice conversion, Octave 2 can exchange one voice for another while preserving the phonetic qualities and timing of the spoken utterance. This is ideal for use cases that require one actor to stand in for another, such as dubbing in a new language with the original actor’s voice, or making precise human touch-ups to AI voiceovers.
For instance, when we prompt the model with the following speech:
And the following target voice:
The model generates the following converted speech:
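In an integration, a voice conversion call would roughly take a source recording and a target voice and return the converted audio. The sketch below shows that shape; the endpoint and field names are illustrative assumptions, since voice conversion isn't yet available on our platform.

```python
# A hypothetical voice conversion request: swap the voice of an existing
# recording while keeping its phonetics and timing. The endpoint and field
# names are illustrative assumptions.
import requests

API_KEY = "your-api-key"
BASE_URL = "https://api.example.com/v0"   # placeholder base URL

with open("source_performance.wav", "rb") as src:
    resp = requests.post(
        f"{BASE_URL}/voice-conversion",
        headers={"X-API-Key": API_KEY},
        files={"source_audio": src},
        data={"target_voice_id": "ja-narrator"},  # the voice to convert into
    )

with open("converted_performance.wav", "wb") as out:
    out.write(resp.content)
```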
Phoneme editing
We're also exploring a new phoneme editing capability, where minute adjustments can be made to the timing and pronunciation of speech. This enables support for custom pronunciation of names, manipulation of word emphasis, and more.
For instance, take this classic film quote, recreated with voice conversion:
Using phoneme editing, we can alter the pronunciation of words in the original quote. Here's an example:
We've created a new word, "leviaso," out of the phonemes present in the original quote. This kind of granular phoneme replication and editing would have proven difficult, if not impossible, with text input alone.
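When phoneme editing reaches the API, a request will need to carry phoneme-level instructions alongside the text or audio. The payload below is a hypothetical illustration of that idea, with an assumed schema and IPA-style notation; it is not the final interface.

```python
# A hypothetical phoneme-editing request: adjust the pronunciation and timing
# of one word at the phoneme level. The payload schema and IPA-style notation
# are assumptions, not the final interface.
import requests

API_KEY = "your-api-key"
BASE_URL = "https://api.example.com/v0"   # placeholder base URL

payload = {
    "voice_id": "ja-narrator",
    "text": "The results were remarkable.",
    "phoneme_edits": [
        {
            # Replace the phonemes of one word with a custom pronunciation,
            # in the spirit of the invented word described above.
            "target_word": "remarkable",
            "phonemes": ["l", "ɛ", "v", "i", "ɑ", "s", "oʊ"],  # assumed IPA notation
            "duration_scale": 1.2,  # stretch the edited span slightly
        }
    ],
}

resp = requests.post(
    f"{BASE_URL}/tts/phoneme-edit",
    headers={"X-API-Key": API_KEY},
    json=payload,
)
with open("edited_quote.wav", "wb") as out:
    out.write(resp.content)
```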
Voice conversion and phoneme editing will be available soon on our platform.
Build conversational experiences with EVI 4 mini
Finally, we're launching EVI 4 mini, which brings all of the capabilities of Octave 2 to our speech-to-speech API. Now, you can build faster, smoother interactive experiences in 11 languages. For example, we built a translator app using EVI 4 mini with just a few voice samples and a prompt.
EVI 4 mini doesn't yet generate language on its own, so you'll need to pair it with an external LLM through our API until we launch the full version.
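Until then, the division of labor is simple: EVI 4 mini handles speech in and speech out, and your LLM of choice generates the reply text. The sketch below shows that loop with placeholder names; EviMiniSession and my_llm_complete are hypothetical, not real SDK classes.

```python
# A conceptual sketch of pairing EVI 4 mini with an external LLM.
# EviMiniSession and my_llm_complete are hypothetical placeholders, not real
# SDK names; the point is the division of labor described above.

def my_llm_complete(transcript: str) -> str:
    """Stand-in for a call to whichever external LLM you pair with EVI 4 mini."""
    return f"(reply to: {transcript})"

class EviMiniSession:
    """Placeholder for a speech-to-speech session."""

    def listen(self) -> str:
        # In a real integration this would return the user's transcribed turn.
        return input("user> ")

    def speak(self, text: str) -> None:
        # In a real integration this would synthesize and play the reply.
        print(f"evi> {text}")

session = EviMiniSession()
while True:
    user_turn = session.listen()        # EVI 4 mini: speech in -> text
    reply = my_llm_complete(user_turn)  # external LLM: text -> reply text
    session.speak(reply)                # EVI 4 mini: reply text -> speech out
```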
Access Octave 2 and EVI 4 mini
Today, we're rolling out access to Octave 2 on our text-to-speech playground and API, and to EVI 4 mini on our speech-to-speech playground and API.
Soon we'll be releasing more evaluations, more languages, and access to voice conversion and phoneme editing.
In the meantime, we're excited to see what you create!