Introducing Hume’s Empathic Voice Interface (EVI) API
Published on Apr 18, 2024
Integrate emotionally intelligent voice experiences into any application with our EVI API.
Introducing our voice AI – the Empathic Voice Interface
Last month, we released a demo of our Empathic Voice Interface (EVI). Today, the first emotionally intelligent voice AI API is finally here.
EVI does a lot more than stitch together transcription, LLMs, and text-to-speech. With a new empathic LLM (eLLM) that processes your tone of voice, EVI unlocks new capabilities like knowing when to speak, generating more empathic language, and intelligently modulating its own tune, rhythm, and timbre.
EVI is the first voice AI that really sounds like it understands you. By adapting its tone of voice, it emulates the way humans convey meaning beyond words, unlocking more efficient, smooth, and satisfying AI interactions.
Accessing EVI: integrating emotionally intelligent voice AI into your applications
The main way to work with EVI is through a WebSocket connection that sends audio and receives responses in real time. This enables fluid, bidirectional dialogue where users speak, EVI listens and analyzes their voice, and EVI generates emotionally intelligent responses. You start a conversation by connecting to the WebSocket and streaming the user’s voice input to EVI.
As the user speaks to EVI, the client can also send EVI text to speak aloud, which EVI intelligently integrates into the conversation.
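For illustration, here is a minimal sketch of an EVI session in Python using the `websockets` library. The endpoint URL, authentication query parameter, and message types (`audio_input`, `assistant_input`) are assumptions made for this example, not a definitive schema; see the EVI documentation for the exact message format.

```python
# Minimal sketch of a real-time EVI session over WebSocket.
# Endpoint, auth parameter, and message field names are assumptions for illustration.
import asyncio
import base64
import json
import os

import websockets  # pip install websockets

EVI_URL = "wss://api.hume.ai/v0/evi/chat"  # assumed endpoint
API_KEY = os.environ["HUME_API_KEY"]


async def run_session(audio_chunks):
    async with websockets.connect(f"{EVI_URL}?api_key={API_KEY}") as ws:
        # Stream the user's voice to EVI as base64-encoded audio chunks.
        for chunk in audio_chunks:
            await ws.send(json.dumps({
                "type": "audio_input",                     # assumed message type
                "data": base64.b64encode(chunk).decode(),
            }))

        # The client can also hand EVI text to speak aloud mid-conversation.
        await ws.send(json.dumps({
            "type": "assistant_input",                     # assumed message type
            "text": "Let me check on that for you.",
        }))

        # Receive EVI's responses (transcripts, expression measures, audio) as they stream in.
        async for raw in ws:
            message = json.loads(raw)
            print(message.get("type"), flush=True)

# asyncio.run(run_session(chunks_from_microphone()))  # chunks_from_microphone is a placeholder
```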
See our documentation for more information on how to integrate EVI into your application. A great way to get started is on our platform, which allows developers to interactively configure custom system prompts and voices.
Empathic AI (eLLM) features
- Responds at the right time: Uses your tone of voice for state-of-the-art end-of-turn detection — the true bottleneck to responding rapidly without interrupting you.
- Understands users’ prosody: Provides streaming measurements of the tune, rhythm, and timbre of the user’s speech using Hume’s prosody model, integrated with our eLLM (a sketch of reading these measurements follows this list).
- Forms its own natural tone of voice: Guided by the user’s prosody and language, our model responds with an empathic, naturalistic tone of voice, matching the user’s nuanced “vibe” (calmness, interest, excitement, etc.). It responds to frustration with an apologetic tone, to sadness with sympathy, and more.
- Responds to expression: Powered by our empathic large language model (eLLM), EVI crafts responses that are not just intelligent but attuned to what the user is expressing with their voice.
- Always interruptible: Stops rapidly whenever users interject, listens, and responds with the right context based on where it left off.
- Aligned with well-being: Trained on human reactions to optimize for positive expressions like happiness and satisfaction. EVI continuously learns from users’ reactions.
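As a rough illustration of the prosody measurements mentioned above, the sketch below pulls expression scores out of an incoming WebSocket message. The message type and nesting (`user_message`, `models.prosody.scores`) are assumptions for this example; check the EVI documentation for the actual payload structure.

```python
# Minimal sketch of reading prosody measurements from EVI's streamed messages.
# The message type and field layout below are assumptions for illustration.
import json


def handle_message(raw: str) -> None:
    message = json.loads(raw)
    if message.get("type") == "user_message":          # assumed message type
        scores = (
            message.get("models", {})
                   .get("prosody", {})
                   .get("scores", {})                  # assumed field layout
        )
        # Print the strongest expression dimensions detected in the user's tone of voice.
        top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]
        for name, score in top:
            print(f"{name}: {score:.2f}")
```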
Configurability: customizing your voice AI API
With the general release of EVI, we’re also releasing our Configuration API, which enables developers to customize their EVI: the system prompt, the LLM, the tools EVI can use, the context available during the conversation, and more. You can configure EVI through either the API or the UI. The configurable elements are listed below, followed by a sketch of a sample configuration request.
- System prompt: Customize EVI’s personality, response style, and the content of its speech through prompt engineering. Use our guidelines for prompting EVI to improve performance, or try out our sample prompts on the voice playground.
- Inject other LLM responses into our model: Hume’s empathic large language model (eLLM) always generates the first response to a query, but you can configure other LLMs to formulate longer responses.
- Integrate another LLM API: We currently support Fireworks’ Mixtral 8x7B, all OpenAI models, and all Anthropic models.
- Bring your own LLM or generate text another way: Rather than using our supported LLMs, connect our WebSocket to your own server with your own tool or text generation, allowing you to determine exactly how EVI responds in the conversation.
- TTS: Use just EVI’s expressive voice by sending our API text to be spoken aloud.
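As a sample, here is a minimal sketch of creating a configuration through the Configuration API with Python’s `requests` library. The endpoint path, header name, and body fields (`name`, `prompt`, `language_model`) are assumptions made for this example; consult the Configuration API reference for the exact request format.

```python
# Minimal sketch of creating an EVI configuration via the Configuration API.
# Endpoint, auth header, and body fields are assumptions for illustration.
import os

import requests  # pip install requests

API_KEY = os.environ["HUME_API_KEY"]

config = {
    "name": "support-agent",                          # assumed field: config name
    "prompt": {                                       # assumed field: system prompt
        "text": "You are a calm, concise support agent. Apologize when the "
                "caller sounds frustrated and confirm next steps out loud.",
    },
    "language_model": {                               # assumed field: supplemental LLM
        "model_provider": "ANTHROPIC",
        "model_resource": "claude-3-haiku-20240307",
    },
}

response = requests.post(
    "https://api.hume.ai/v0/evi/configs",             # assumed endpoint
    headers={"X-Hume-Api-Key": API_KEY},              # assumed auth header
    json=config,
)
response.raise_for_status()
print(response.json())  # the returned config can then be referenced when connecting to EVI
```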
We plan to add more configuration options soon, allowing EVI to use tools, change its speaking style, and more. Join our Discord for product updates and technical support.