Announcing EVI 3 API: The most customizable speech-to-speech model
Published on July 17, 2025
The latest version of our empathic voice interface, EVI 3, is now available through our API. EVI 3 is the first speech-language model that can speak expressively with any voice, real or designed, without fine-tuning. In addition, we’re introducing hyperrealistic voice cloning and new LLM integrations with Claude 4, Gemini 2.5, Kimi K2, and more.
Benefits of a speech-to-speech model
As a speech-language model, EVI 3 is higher quality and faster than TTS systems for conversational AI use cases. It is higher quality because pretraining on trillions of tokens of text and then millions of hours of speech allows it to learn how language interacts with voice characteristics and tones. For instance, it knows which words to emphasize, what makes people laugh, and how accents and other voice characteristics interact with vocabulary. This is quite unlike traditional TTS systems, which lack a meaningful understanding of language.
EVI 3 is also much faster than systems with separate LLM and TTS models. By generating language and speech with the same intelligence, it can take advantage of the considerable overlap in the processing required to think and speak.
Unique features of EVI 3
Unlike previous speech-language models (e.g., EVI 2, GPT-4o, Gemini), EVI 3 is able to speak expressively with any voice, real or designed, without fine-tuning. This accomplishment is made possible by research breakthroughs we’ve made at Hume related to speech tokens, fine-tuning, and multimodal reward modeling.
This means developers can now use any of the 200K+ voices they’ve designed on our platform in speech-to-speech conversation with EVI 3. We’re also introducing voice cloning on our platform for the first time. With just 30 seconds of audio, or even less, EVI 3 captures not just the timbre and accent of a voice but its rhythm, tone, and even aspects of the speaker’s personality.
EVI 3 is also uniquely designed for maximum interoperability with other LLMs. Through our platform, you can select any popular LLM and its responses will be merged seamlessly into EVI 3’s quicker responses as they become available. You can integrate your own custom LLM or RAG system with one line of code.
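To make that concrete, here is a minimal sketch of registering an EVI configuration that points at your own language model. The field names (`evi_version`, `custom_llm_url`) and values are illustrative assumptions rather than the exact schema; see our EVI documentation for the real config format.

```python
# Illustrative sketch: creating an EVI config that points at your own LLM.
# Field names ("evi_version", "custom_llm_url") are examples, not the
# documented schema -- consult the EVI docs for the exact fields.
import requests

config = {
    "name": "my-evi3-config",
    "evi_version": "3",
    "custom_llm_url": "wss://example.com/my-llm",  # your LLM or RAG endpoint
}

resp = requests.post(
    "https://api.hume.ai/v0/evi/configs",          # EVI configs endpoint
    headers={"X-Hume-Api-Key": "YOUR_API_KEY"},
    json=config,
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # the returned config id is used when opening a chat session
```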

Pricing
EVI 3 is also exceedingly efficient, and we are excited to make it available at very low prices at scale. Pricing starts at $0.0X/min, and we can offer prices well below $0.02/min for the highest-volume applications (talk to sales if you have a large-scale application in mind).
Getting started with EVI 3
You can try out EVI 3 using our public demo, configure and test EVI 3 on our playground, and find guides on integrating EVI 3 in our documentation.
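If you prefer to start from code, here is a minimal sketch of opening an EVI chat session over WebSocket and sending a single text turn. The query parameters and message fields shown are simplified assumptions; real applications stream microphone audio, and the exact schema is in our docs.

```python
# Minimal sketch of an EVI chat session over WebSocket.
# Query params and message fields are simplified; see the docs for the
# full protocol (real apps stream audio rather than sending text turns).
import asyncio
import json

import websockets

async def main():
    url = (
        "wss://api.hume.ai/v0/evi/chat"
        "?api_key=YOUR_API_KEY&config_id=YOUR_EVI3_CONFIG_ID"  # assumed params
    )
    async with websockets.connect(url) as ws:
        # Send one text input as the user's turn.
        await ws.send(json.dumps({"type": "user_input", "text": "Hello, EVI!"}))
        # Print the assistant's text as response messages arrive.
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("type") == "assistant_message":
                print(msg.get("message", {}).get("content"))

asyncio.run(main())
```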
We’re excited to see what you build. But if you’re still unclear on what to use EVI 3 for, or where this is all going, read on.
The applications of EVI 3 and the future of AI
At Hume, we were early to recognize that AI technology will rely on three different kinds of models working in tandem: interfaces, agents, and workers. AI interfaces like EVI 3 are the models that communicate with users, powering AI assistants, VR characters, tutors, and more. Interfaces need to communicate with AI agents, which perform parallel tasks requiring back-and-forth communication with other, slower systems such as search engines and databases, and with AI workers, which solve problems through long chains of reasoning.

Given their special role, models used for AI interfaces need to satisfy three requirements.
- First, as a core capability, they need to engage in speech-to-speech interaction, which should be empathic, satisfying, and customizable.
- Second, they need to respond at conversational latency, approximately 300 ms, which requires a comparably small model.
- Finally, they need to be able to communicate efficiently with other models, passing along essential information quickly.

At the intersection of these three requirements, recent advances have allowed speech-language models to dominate systems that use separate models for language and text-to-speech. In a speech-language model (SLM), the same intelligence handles language and voice. This is necessary for AI to sound human and to minimize latency.

A few months ago, we launched Octave, an SLM that surpassed traditional text-to-speech models in quality and latency.
EVI 3
EVI 3 is our latest SLM. It inherits Octave’s state-of-the-art speech output and fulfills the other requirements of AI interfaces: native LLM intelligence and efficient communication with other systems. We’ve structured the responses of EVI 3 to take advantage of its ability to offer intelligent low-latency responses while simultaneously passing essential context to larger, more intelligent models and other tools that run in parallel. EVI 3 seamlessly integrates the responses of other systems once available.
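The pattern looks roughly like this. The sketch below is a conceptual illustration, not Hume code: a fast interface model answers within conversational latency while a larger model runs in parallel and takes over once its output is ready.

```python
# Conceptual sketch (not Hume code) of the fast-start / hand-off pattern:
# a small, fast interface model begins the reply immediately while a larger
# model works in parallel; its output is merged in as soon as it arrives.
import asyncio

async def fast_interface_reply(user_turn: str) -> str:
    await asyncio.sleep(0.3)   # ~conversational latency
    return "Sure, let me check that for you..."

async def large_llm_reply(user_turn: str) -> str:
    await asyncio.sleep(2.0)   # slower, more capable model
    return "Here is the detailed answer."

async def respond(user_turn: str) -> None:
    slow = asyncio.create_task(large_llm_reply(user_turn))  # start in parallel
    print(await fast_interface_reply(user_turn))            # speak immediately
    print(await slow)                                       # merge when ready

asyncio.run(respond("What's the weather in Paris?"))
```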
This enables us to offer a number of new model capabilities to developers for the first time.
- EVI 3 is fully promptable. The same prompt controls both EVI 3’s voice and language, which you can learn more about here.
- EVI 3 is interoperable with other LLMs. You can select from a number of popular LLMs within the EVI 3 API to use as supplementary LLMs, or you can bring your own. When an external LLM is used, EVI 3 generates the initial response to the user, then lets the other LLM take over and control the remainder of the response as soon as it can.
- EVI 3 can seamlessly integrate new context that you pass in while it is speaking (see the sketch after this list).
- EVI 3 can be used with any voice, including the 200K+ voices users have designed on our platform. (Using new RL approaches, we’ve trained EVI 3 to generate high-quality conversational voice outputs from any reference voice, without fine-tuning.)
- As part of this launch, we’re also releasing our voice cloning feature. EVI 3 brings LLM intelligence to voice cloning: with less than 30 seconds of audio, it generates hyperrealistic voices, drawing on a deeper implicit understanding of human psychology than traditional text-to-speech models have.
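For example, injecting fresh context while EVI 3 is speaking can be as simple as sending one more message over the open WebSocket. The message shape below is an illustrative assumption modeled on EVI’s session protocol; check the docs for the exact fields.

```python
# Sketch of injecting new context into a live EVI session mid-turn.
# The "session_settings"/"context" message shape is an assumption modeled
# on EVI's session protocol -- see the docs for the exact fields.
import json

def make_context_update(text: str) -> str:
    return json.dumps({
        "type": "session_settings",
        "context": {"text": text, "type": "temporary"},  # assumed fields
    })

# e.g., after a database lookup finishes while EVI is still speaking:
# await ws.send(make_context_update("Order #1234 shipped this morning."))
```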
We hope that EVI 3 helps you build the future of AI.