
Introducing a new evaluation for creative ability in Large Language Models

Published on Feb 9, 2024

Introducing HumE-1 (Human Evaluation 1), our new evaluation for large language models (LLMs) that uses human ratings to assess their ability to perform creative tasks in the ways that matter to us: evoking the feelings we want to feel.

Large Language Models (LLMs) are already being used to write books and articles, assist legal professionals and healthcare practitioners, and provide mental health support. As the capabilities of LLMs grow, so will their use cases, and they will have an increasing impact on our lives.

However, existing benchmarks and evaluation approaches for LLMs fail to capture how they affect our satisfaction and well-being, which depend in large part on subjective qualities of generated content such as interestingness, conciseness, aesthetic quality, and humor.

At Hume, we’re directly optimizing LLMs for these kinds of qualities by learning from listeners’ expressive reactions to language. To track our progress, we needed to develop a new kind of evaluation that could capture how well models can produce content that is interesting, engaging, funny, moving, and more; that is, how well they can perform creative tasks in the ways that matter to us.

These capabilities can be thought of as facets of emotional intelligence, but they are required for almost all tasks, even basic ones that might not initially seem emotional. For example, LLMs should deliver information in a manner that is interesting and not boring. They should produce “good” writing; that is, text that is aesthetically pleasing to read. They should also be able to generate content that evokes more specific feelings appropriate to the use case in which they are embedded, such as humor in response to a joke or horror in response to a horror story.

Today, we’re introducing HumE-1 (Human Evaluation 1), in which we evaluate each model for its ability to write motivational quotes, interesting facts, funny jokes, beautiful haikus, charming limericks, scary horror stories, appetizing descriptions of food, and persuasive arguments to donate to charity. We selected these tasks for their ability to tap into a range of different emotional dimensions of human experience.

We designed our evaluation with prompts that we consider to be “honest” and “naturalistic” in the sense that the LLM is given a straightforward task and told how it will be evaluated. (We believe honest and naturalistic prompts are preferable over highly engineered prompts because they are more likely to reflect how the LLM will perform when given a real request.) 

Specifically, for each task, we prompted the LLM by telling it that it was participating in a competition among LLMs that would be judged by human raters along specific dimensions. Then, we conducted our evaluation by running surveys with real human participants who actually judged the responses according to the specified dimensions. Participants rated each model response on a scale from 1-5 in terms of whether they found it motivating, interesting, funny, beautiful, charming, scary, appetizing, or persuasive.
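
To make this setup concrete, the eight tasks and the dimension on which raters scored each response can be summarized as a simple mapping. This is a minimal sketch for illustration only; the identifier names below are not part of HumE-1 itself.

```python
# Illustrative summary of the HumE-1 tasks and the dimension on which human
# raters scored each response (1-5 scale). Identifier names are illustrative.
HUME1_TASKS = {
    "motivational_quote": "motivating",
    "interesting_fact":   "interesting",
    "funny_joke":         "funny",
    "haiku":              "beautiful",
    "limerick":           "charming",
    "horror_story":       "scary",
    "food_description":   "appetizing",
    "charity_argument":   "persuasive",
}
```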

Soon, we’ll be releasing our own models, but given the recent release of Gemini Ultra, we were curious how the most advanced LLMs would perform. We’ve tested GPT-4 Turbo, Gemini Ultra, and LLaMA 70b, and, for comparison, a few smaller language models like RedPajama-INCITE-Chat-3B. (As a caveat, it’s worth noting that we collected Gemini Ultra responses through Google’s chat interface, which may have access to auxiliary tools such as web search.)

We are excited to share this first round of results today. In short: Gemini Ultra performed best, with GPT-4 Turbo not far behind, but both models still have significant room for improvement.

[Chart: HumE-1 scores]

More Detailed Results

We found that, on average, models performed worst at telling funny jokes, and best at writing appetizing descriptions of food.

[Chart: Average ratings by task type]

[Chart: Average ratings, aggregated by task]

However, we found some interesting differences in how individual models performed.

Overall, performance scaled with model size, with the 3B-parameter model receiving the lowest ratings across all categories. The highest overall ratings went to GPT-4 Turbo and Gemini Ultra, with some interesting outliers: LLaMA 70b was the best overall at writing persuasive arguments and scary stories, though only by a small margin.

[Chart: Average ratings by model]

[Chart: Average ratings for each task, aggregated by model]

Methods

We collected responses from GPT-4 Turbo and GPT-3.5 using the OpenAI API; responses from LLaMA 70b and RedPajama-INCITE-Chat-3B using the Together API; and responses from Gemini Ultra using Google’s chat interface.
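
For the API-based models, response collection can be sketched roughly as follows using the OpenAI Python SDK; the model name, default parameters, and helper function are illustrative assumptions rather than the exact settings we used.

```python
# Minimal sketch of collecting a model response via the OpenAI API.
# The model name and default parameters are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def collect_response(prompt: str, model: str = "gpt-4-turbo-preview") -> str:
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content
```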

Each prompt followed the same format, instructing models to produce responses that would subsequently be rated by human judges.

“You are participating in a competition among a variety of large language models (LLMs). 

The competition involves a series of prompts. For each prompt, you will provide a response that will subsequently be evaluated by human raters. 

The human raters will be instructed to evaluate each response in terms of a stated objective.

Your goal is for your responses to be rated more positively on average than those of every other LLM participating in the competition. 

Here is your prompt: Write an original, one-sentence motivational quote about working hard. Your response will be evaluated by raters out of 5 stars in terms of whether they find it genuinely motivating. Your goal is to achieve the highest possible score. Please provide your response to the prompt.”
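
As a minimal sketch of how such a prompt can be assembled from the shared preamble and a task-specific instruction, consider the following; only the motivational-quote wording above comes from the evaluation itself, and the helper function is illustrative.

```python
# Sketch of assembling a HumE-1-style prompt from the shared preamble and a
# task-specific instruction. The helper is illustrative; only the
# motivational-quote wording is taken from the example above.
PREAMBLE = (
    "You are participating in a competition among a variety of large "
    "language models (LLMs). The competition involves a series of prompts. "
    "For each prompt, you will provide a response that will subsequently be "
    "evaluated by human raters. The human raters will be instructed to "
    "evaluate each response in terms of a stated objective. Your goal is for "
    "your responses to be rated more positively on average than those of "
    "every other LLM participating in the competition."
)


def build_prompt(task_instruction: str, dimension: str) -> str:
    return (
        f"{PREAMBLE}\n\n"
        f"Here is your prompt: {task_instruction} "
        f"Your response will be evaluated by raters out of 5 stars in terms "
        f"of whether they find it genuinely {dimension}. Your goal is to "
        f"achieve the highest possible score. Please provide your response "
        f"to the prompt."
    )


prompt = build_prompt(
    "Write an original, one-sentence motivational quote about working hard.",
    "motivating",
)
```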

Human ratings were collected on Prolific. We told participants that they were acting as judges in a writing competition. Ratings were collected on a scale from 1-5, with the rating dimension matched to the intent of each task. For example, horror stories were rated on a scale from 1 = “not scary at all” to 5 = “extremely scary”.
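
Below is a minimal sketch of how ratings like these can be aggregated into the averages reported above, assuming a tidy table with one row per judgment; the column names, example values, and use of pandas are illustrative rather than a description of our actual analysis pipeline.

```python
# Illustrative aggregation of 1-5 human ratings, assuming one row per
# (model, task, rating) judgment. Column names and data are illustrative.
import pandas as pd

ratings = pd.DataFrame({
    "model":  ["GPT-4 Turbo", "GPT-4 Turbo", "Gemini Ultra", "LLaMA 70b"],
    "task":   ["funny_joke", "haiku", "funny_joke", "horror_story"],
    "rating": [3, 4, 4, 5],  # 1-5 scale from human raters
})

avg_by_task = ratings.groupby("task")["rating"].mean()    # hardest vs. easiest tasks
avg_by_model = ratings.groupby("model")["rating"].mean()  # overall model comparison
avg_by_model_task = ratings.groupby(["model", "task"])["rating"].mean()
```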

Conclusions

LLMs are increasingly being used for innovative, creative applications, but a current gap in our evaluations of LLMs is how well they perform tasks that require a sense of what moves people.

We’re building a new generation of LLMs that are trained on fine-grained emotional expression data to directly optimize AI to support human wellness and well-being. We’re excited to provide a new way to understand how well these models perform against the state of the art.

To stay up to date on our new developments, follow us on Twitter/X @hume_ai and sign up for platform access here.
