Product Updates

Comparing the world’s first voice-to-voice AI models

By Jeremy Hadfield on Sep 11, 2024


Imagine if you could speak naturally to any product or app—four times faster than typing—and it could talk back. Imagine if, based not just on what you said but also how you said it, it did what you wanted it to do. That’s what voice-to-voice foundation models, the latest major breakthrough in AI, will enable for many, if not most, products and services in the coming months and years.

The world’s first working voice-to-voice models are Hume AI's Empathic Voice Interface 2 (EVI 2) and OpenAI's GPT-4o Advanced Voice Mode (GPT-4o-voice). EVI 2 is publicly available, both as an app and as an API that developers can build on. GPT-4o-voice, on the other hand, has so far been previewed only to a small number of ChatGPT users. Here we explore the similarities, differences, and potential applications of these systems.

What are voice-to-voice AI models?

Voice-to-voice AI models apply the same principles as large language models (LLMs), but they directly process audio of the human voice instead of text. Whereas large language models are trained on millions of pages of text, voice-to-voice models are trained on millions of hours of recorded voice data. These models enable users to speak with AI through voice alone. 

In many ways, these new voice-to-voice models bring to fruition what legacy technologies like Siri and Alexa had long promised. Siri and Alexa were presented as general-purpose voice understanding systems, with the capability to fulfill arbitrary voice queries. Unfortunately, Siri and Alexa were not actually powered by general-purpose voice AI models, but by traditional computer programs that generated fixed responses to a hardcoded set of keywords.

As general-purpose systems that can fulfill arbitrary voice queries, voice-to-voice models make possible, for the first time, the things people always wished Siri and Alexa could do. Since these kinds of voice assistants were first launched over a decade ago, many have forgotten what made them so exciting to begin with. Voice is how humans interact with each other, our most natural modality for communication. Consider:

  • The average person speaks at 150 words per minute but types at only 40. Voice makes interacting with computers - especially for input - much faster.

  • Speech recognition accuracy has improved more than fivefold since 2012, now rivaling or exceeding human transcription.

  • Voice-to-voice models have the potential to democratize computing for 773 million illiterate adults worldwide.

  • For 2.2 billion people with visual impairments, voice-to-voice models are not just convenient - they can become their primary gateway to digital interaction.

Voice-to-voice models will allow billions more people to use state-of-the-art technology with seamless communication. Within a decade, our current interfaces may feel as outdated as command-line interfaces in a GUI world.

Comparing EVI 2 and GPT-4o-voice

Similarities

EVI 2 and GPT-4o-voice have many capabilities in common. Both are multimodal language models that can process both audio and language and output both voice and language. As a result, they can both converse rapidly and fluently with users with sub-second response times, understand a user’s tone of voice, generate any tone of voice, and even respond to some more niche requests like changing their speaking rate or rapping. Voice-to-voice models overcome the inherent limitations of traditional stitched-together systems that rely on separate steps for transcription, language modeling, and text-to-speech.

Differences

Although only EVI 2 is publicly available, we can begin to speculate as to what differentiates the two models by comparing EVI 2 to early demos of GPT-4o-voice. 

EVI 2 is optimized for emotional intelligence. EVI 2 excels at anticipating and adapting to your preferences, a capability made possible by its specialized training for emotional intelligence. It leverages Hume's research on human expression to interpret subtle emotional cues in the user's voice, then uses those cues to craft more empathic responses that better support the user's well-being. In contrast, while GPT-4o-voice is capable of interpreting tone and responding with an emotional tone of voice, it doesn't seem to have the same depth of focus on emotional intelligence as EVI 2, and does not appear to be trained to promote the user's well-being.

EVI 2 is trained to maintain compelling personalities. Hume’s speech-language model is trained to maintain characters and personalities that are fun and interesting to interact with. On the other hand, GPT-4o-voice is currently restricted to a small set of prototypical “AI assistant” personalities. 

EVI 2 is customizable. Where GPT-4o-voice has four known preset voices with relatively static personalities, EVI 2 can emulate an infinite number of personalities, including accents and speaking styles, with flexible prompting and voice modulation tools. We developed a novel voice modulation approach that allows anyone to adjust EVI 2's 7 (and counting) base voices along a number of continuous scales, including gender, nasality, pitch, and more.
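
To make the idea concrete, here is a rough sketch of what adjusting a base voice along continuous scales might look like. The field names and value ranges are illustrative assumptions, not the actual EVI configuration schema; see the EVI 2 documentation for the real format.

```python
# Illustrative sketch only: field names and ranges are assumptions,
# not the real EVI configuration schema.
voice_settings = {
    "base_voice": "<one of EVI 2's base voices>",  # placeholder voice ID
    "modulation": {
        # Hypothetical continuous scales, normalized here to [-1, 1]
        "gender": -0.3,    # slightly lower, more masculine timbre
        "nasality": 0.2,   # a touch more nasal than the base voice
        "pitch": 0.4,      # noticeably higher pitch
    },
}
```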

EVI 2 is designed for developers. Available through our API, EVI 2 is built for developers: its voice and personality are customizable and can be tailored to specific apps and users. In contrast, GPT-4o-voice is a component of a consumer product, the ChatGPT app. So far, GPT-4o-voice appears to have been designed for ChatGPT, with a particular “AI assistant” personality tailored to that app. OpenAI has not announced plans to release an API for GPT-4o-voice. As a result, EVI 2 is the only currently available voice-to-voice model API.
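
As a rough sketch of what building on the API can look like, the snippet below opens a hypothetical WebSocket session. The endpoint URL, auth scheme, and message format are assumptions for illustration; the EVI 2 documentation linked below has the real details.

```python
# A minimal connection sketch. The URL, auth query parameter, and message
# schema are assumptions for illustration; consult the EVI 2 docs for specifics.
import asyncio
import json

import websockets  # pip install websockets

async def chat(api_key: str) -> None:
    url = f"wss://api.hume.ai/v0/evi/chat?api_key={api_key}"  # assumed endpoint
    async with websockets.connect(url) as ws:
        # Hypothetical session settings: a system prompt shaping the personality.
        await ws.send(json.dumps({
            "type": "session_settings",
            "system_prompt": "You are a warm, concise in-app assistant.",
        }))
        # A real client would stream microphone audio chunks up and play back
        # the audio the model returns; here we just log incoming event types.
        async for message in ws:
            print(json.loads(message).get("type"))

asyncio.run(chat("YOUR_HUME_API_KEY"))
```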

EVI 2 can be used with any LLM. While EVI 2 generates its own language, it is designed to be flexible and interoperable with other LLMs, including supplemental LLMs from OpenAI, Anthropic, Google, Meta, or any other provider. It also supports custom language options, allowing developers to bring their own LLMs or generate fixed responses. This flexibility lets developers leverage the strengths of different LLMs while still benefiting from EVI's voice and empathic AI capabilities. In contrast, GPT-4o-voice is tightly integrated with OpenAI's ecosystem. Even if GPT-4o-voice were made available as an API, it would only be well-suited for applications where GPT-4o’s responses are consistently preferred over those of other LLMs like Gemini or Claude.
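
One way to picture the “bring your own LLM” option is as a thin handler that receives what the user said (plus expression cues) and returns the text EVI should speak. The function below is a hypothetical sketch of that pattern; the names and payload fields are illustrative, not Hume's actual custom language model interface.

```python
# Hypothetical sketch of the bring-your-own-LLM pattern: EVI handles speech in
# and out, while your code decides what to say next. Names and fields are
# illustrative, not the actual custom language model interface.
def generate_reply(user_text: str, expression_cues: dict[str, float]) -> str:
    """Return the assistant's next utterance for EVI to voice."""
    # Escalate when the user's voice signals strong frustration.
    if expression_cues.get("frustration", 0.0) > 0.7:
        return "I hear this has been frustrating. Let me connect you with a person."
    # Otherwise route the text to whichever LLM you prefer, or a fixed script.
    return call_your_llm(user_text)

def call_your_llm(prompt: str) -> str:
    # Placeholder: swap in a real client call to OpenAI, Anthropic, Google, etc.
    return f"(model response to: {prompt})"
```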

GPT-4o-voice supports more languages. OpenAI’s voice offering supports input and output in a wide range of languages. The full list of supported languages is not public, but demos suggest it is extensive. EVI 2’s architecture likewise allows for voice and text generation in any language, but the small model is currently fluent only in English, Spanish, French, German, and Polish. Many more languages will be added soon.

The use cases for EVI 2

Voice-to-voice models are set to transform a wide range of products and services over the coming months. 

Customer service. For customer-facing businesses, voice-to-voice models can provide 24/7 support with unprecedented empathy and understanding. This is crucial: businesses lose an estimated $75 billion annually due to poor customer service (source), and 81% of customers prefer self-service options according to Harvard Business Review (source). Businesses are also unable to answer all incoming calls, which results in significant missed revenue - small and medium-sized businesses miss between 22% and 62% of all incoming calls (source). In the automotive industry, for example, a single missed call can represent a $220 lost opportunity, with the average automotive business losing $49,000 in revenue per year due to unanswered calls (source). Using voice AI models can allow businesses to earn millions more in revenue.

A more efficient interface for virtually any application. Voice-to-voice models can significantly boost productivity by allowing hands-free, natural language interactions with complex systems. Because voice is 4x faster than typing and lets applications perform any action, not just the ones presented on a specific UI page, it may unlock an order-of-magnitude increase in productivity. Any application can add a voice interface to accelerate interactions and improve accessibility for millions of users.

Mental health, education, and personal development. The ability of voice-to-voice models to understand context and emotional cues opens up possibilities in specific fields like mental health, education, and personal development. The market for AI-powered mental health apps is forecast to reach $8 billion by 2025 (source), showcasing the immense potential for personalized, empathic AI services at scale. 

These are just a few examples of use cases for EVI 2. By allowing any application to add a customizable voice interface, EVI 2 opens up countless new uses for voice AI.

Looking forward: the future of EVI

Currently, EVI 2 is available only in one model size: EVI-2-small. We are still making improvements to this model. In the coming weeks, it will become more reliable, learn more languages, follow more complex instructions, and use a wider range of tools. We’re also fine-tuning EVI-2-large, an upgraded voice-to-voice model we will be announcing soon. 

While maintaining or exceeding EVI-2-small’s voice capabilities, EVI-2-large will be more responsive to prompts and excel at complex reasoning. For now, if your application relies on complex reasoning or tool use, we recommend configuring the EVI API to use EVI 2 in conjunction with an external LLM, as sketched below.
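
As a rough illustration of that setup, the configuration below pairs EVI 2’s voice with an external LLM for reasoning and tool use. The field names are assumptions for illustration, not the actual EVI API schema; the EVI 2 documentation describes the real configuration options.

```python
# Hypothetical configuration sketch: field names are assumptions, not the
# actual EVI API schema. The point is the division of labor described above:
# EVI 2 handles the voice, while an external LLM handles complex reasoning
# and tool use.
evi_config = {
    "voice_model": "evi-2-small",
    "supplemental_language_model": {
        "provider": "<your preferred LLM provider>",
        "model": "<a strong reasoning model>",
    },
    "tools": [
        # Your tool definitions (name, description, parameters) go here.
    ],
}
```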

EVI 2 represents a critical step forward in our mission to optimize AI for human well-being. We focused on making its voice and personality highly adaptable to give the model more ways to meet users’ preferences and needs. EVI 2’s default personalities already reflect this optimization for user satisfaction: an AI optimized for well-being ends up with a particularly pleasant and fun personality because it is more deeply aligned with your goals.

Our ongoing research focuses on automatically optimizing for individual users’ preferences, with methods to fine-tune the model to generate responses that align with ongoing signs of happiness and satisfaction during everyday use of an application.

Voice-to-voice AI models represent a transformative leap in human-computer interaction. We can’t wait to try the delightful user experiences developers build with the EVI 2 API. 

Resources

EVI 2 Documentation 

EVI 2 Pricing

Developer Platform

Hume Discord
