Evaluating Large Language Models: A Necessity in the Age of Advanced AI

September 26, 2023

Large Language Models (LLMs) are not your typical models; they are advanced, intricate, and capable of understanding and generating human-like text. They are revolutionizing the field of Natural Language Processing (NLP) and are being employed in a myriad of applications, ranging from content creation to customer support. However, with the rapid advancements in this field, it becomes imperative to have robust, reliable, and standardized methods to evaluate these models.

The Importance of Evaluation

Evaluating LLMs is crucial for understanding their capabilities and limitations and for ensuring that they perform well in real-world applications. It is not just about measuring the accuracy or quality of the generated text but also about understanding how well the model comprehends and responds to diverse and complex prompts. A well-evaluated model can be a valuable asset, providing reliable and coherent outputs, while an unevaluated one can lead to inaccuracies and misunderstandings.

The Need for Standardization

Having a standardized, quantitative, and repeatable method for evaluating LLMs is essential. Standardization in evaluation methods ensures that models are assessed on a common ground, allowing for meaningful comparisons between different models. It provides a clear and objective measure of a model’s performance, enabling developers and users to make informed decisions when selecting a model for a specific task.

A Multifaceted Approach

Here I present three different ways of evaluating LLMs: the gut check, standard metrics, and creating an evaluation dataset specific to the target task. Each of these methods has its own merits and can provide valuable insights into the model’s performance. The gut check is quick and gives an initial sense of the model’s relevance. Standard metrics, recognized by the community, offer a more objective and quantitative measure of the model’s capabilities. However, they may not tell the whole story, and relying solely on them might cause one to overlook models that could be better suited for specific tasks. Creating an evaluation dataset for the target task, although effort-intensive, provides the most accurate and reliable results, ensuring that the model performs well in the intended application.

Gut Check

First, there is the gut check: you give the same prompt to multiple models and see what they produce. If the output is sensible and reasonably close to what you expect, the model passes. This is quick and easy, especially if you have access to a hosted model. The method is meant to give a ballpark answer to one question: is this model even relevant? Ideally the prompt is an example of, or at least related to, the type of task the model will do on your project, but it doesn't have to be. So if you want to write stories, ask the model to write a children's/funny/insert-genre story and see if it produces any semblance of a story or if it produces gobbledygook. It helps to have a couple of prompts that relate to your target task but vary in requested length, content, or some other characteristic.
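A minimal sketch of the gut check, assuming each model can be wrapped as a plain function from prompt to text. The names `gut_check`, `story_model`, and `broken_model` are made up for illustration; in practice each function would call whatever hosted or local model you have access to.

```python
from typing import Callable, Dict, List

# Assumption: a "model" is any callable that maps a prompt string to generated text.
ModelFn = Callable[[str], str]


def gut_check(models: Dict[str, ModelFn], prompts: List[str]) -> Dict[str, Dict[str, str]]:
    """Send every prompt to every model and collect the outputs per prompt."""
    results: Dict[str, Dict[str, str]] = {}
    for prompt in prompts:
        results[prompt] = {name: generate(prompt) for name, generate in models.items()}
    return results


def print_side_by_side(results: Dict[str, Dict[str, str]]) -> None:
    """Print the outputs grouped by prompt so a human can eyeball them."""
    for prompt, outputs in results.items():
        print(f"PROMPT: {prompt}")
        for name, text in outputs.items():
            print(f"  [{name}] {text}")
        print()


# Hypothetical stand-ins so the sketch runs without any model access.
def story_model(prompt: str) -> str:
    return "Once upon a time, a small fox found a lantern in the woods."


def broken_model(prompt: str) -> str:
    return "asdf qwer zxcv"


models = {"story_model": story_model, "broken_model": broken_model}
# Prompts related to the target task, varying in requested length.
prompts = [
    "Write a two-sentence children's story about a fox.",
    "Write a one-paragraph funny story about a fox.",
]

results = gut_check(models, prompts)
print_side_by_side(results)
```

The judgment itself stays manual here: the code only lines the outputs up so you can read them next to each other and decide whether a model is worth a closer look.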

Standard Metrics 

The second way is to use standard metrics that are recognized by the community and are published on public leaderboards or as part of research paper results. Hugging Face hosts one of the most popular leaderboards; its "About" tab describes the metrics. These metrics do not tell the whole story, however. Additionally, as Natural Language Understanding and Generative AI evolve, the metrics that the community uses evolve too. For example, GLUE metrics used to be the standard and are still used by some people. In my view, these types of metrics are useful for picking an initial list of LLMs to try out, but I would never calculate them myself (unless I created a new model from scratch).

Looking at leaderboards gives me some candidate LLMs to try out for my project. If an LLM scores really low across a bunch of tasks, it's probably not worth even trying out. Instead, I pick out some that have mediocre or high scores and test them with the gut check (the first method above). The models I pick don't have to have the highest scores, because the benchmark tasks usually aren't representative of what you actually want an LLM to do. A good analogy is the ACT college admissions test. People who score higher on the ACT tend to be better at academics than people who score lower. The person who scores a perfect 36 might be the smartest person, but maybe they just got lucky and someone with a slightly lower score is smarter. Finally, the test does not evaluate many other kinds of abilities, such as driving a car or painting a picture, so you need other ways to evaluate people in addition to the ACT. Similarly, relying solely on standard metrics for LLMs might cause you to miss models that don't have the highest scores but might be better for your target task.
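The shortlisting step can be sketched as a simple filter. Everything below is an assumption for illustration: the model names, task names, and scores are placeholders rather than real leaderboard numbers, and the cutoff of 0.5 is arbitrary. The point is only that the filter drops models that are weak across the board instead of keeping just the top scorer.

```python
from statistics import mean
from typing import Dict, List


def shortlist(scores: Dict[str, Dict[str, float]], min_average: float) -> List[str]:
    """Keep models whose average score across tasks meets the cutoff, best first."""
    averages = {model: mean(tasks.values()) for model, tasks in scores.items()}
    kept = [model for model, avg in averages.items() if avg >= min_average]
    return sorted(kept, key=lambda model: averages[model], reverse=True)


# Placeholder scores, not taken from any real leaderboard.
leaderboard = {
    "model_a": {"task_1": 0.71, "task_2": 0.64},
    "model_b": {"task_1": 0.30, "task_2": 0.25},
    "model_c": {"task_1": 0.55, "task_2": 0.60},
}

candidates = shortlist(leaderboard, min_average=0.5)
print(candidates)
```

Every model that survives the cutoff then goes through the gut check, including the ones with merely mediocre scores.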

Evaluation Dataset 

The third way takes the most effort but evaluates models on your actual target task. It gives you the best results and can be codified to run automatically. I would do this after narrowing the field down to one or just a few models using the two approaches above.

  • First, you need to come up with various sets of inputs for the model (or models) that represent your target task: you may have a number of different prompts, you may have different values for model parameters you want to try out (e.g. different values for temperature), and you may have various pre- and post-processing steps set up in your pipeline.
  • For each input, you also need an example output that matches your desired outputs from the model. Most of the time, creating this evaluation dataset is a very manual process. However, you may be able to automatically generate parts of it, for example using an LLM to generate examples of questions and answers.
  • Once you have an evaluation dataset, you need a way to determine whether the generated response matches the response in your evaluation dataset. For instance, if you're going to use a model for classification, then you need to create or generate a dataset where the prompt is whatever input you're trying to classify, and the output is the class that each prompt belongs to (the true label). Then run those prompts one at a time through your model and see if the predicted class matches the true label. This is a straightforward example because there is a finite set of classes and you can include them in the prompt to indicate to the model that it should output one of the class names (or respond with "I don't know"). It gets more complicated if your target task is something less easy to describe, such as stories or summaries of articles. In that case you might have to get creative about how you judge the generated responses, possibly by using another LLM to compare each response to the output from the evaluation dataset and asking that model if the two outputs "match."
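The classification case described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the class names, the three examples, the prompt wording, and `keyword_model` (a stand-in for a real LLM call) are all hypothetical. The `matches` parameter is where a looser comparison, such as asking another LLM whether two outputs "match," could be plugged in for tasks like summaries.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Example:
    text: str   # the input to classify
    label: str  # the true class


def build_prompt(text: str, classes: Sequence[str]) -> str:
    """List the allowed classes in the prompt and offer an explicit way out."""
    options = ", ".join(classes)
    return (
        f"Classify the following text as one of: {options}. "
        f"If none fit, respond with \"I don't know\".\n\nText: {text}\nClass:"
    )


def exact_match(predicted: str, expected: str) -> bool:
    """Case-insensitive comparison of the predicted class with the true label."""
    return predicted.strip().lower() == expected.strip().lower()


def evaluate(
    model: Callable[[str], str],
    dataset: List[Example],
    classes: Sequence[str],
    matches: Callable[[str, str], bool] = exact_match,
) -> float:
    """Run each example through the model one at a time and return accuracy."""
    if not dataset:
        raise ValueError("evaluation dataset is empty")
    correct = 0
    for example in dataset:
        predicted = model(build_prompt(example.text, classes))
        if matches(predicted, example.label):
            correct += 1
    return correct / len(dataset)


# Hypothetical stand-in for an LLM: a keyword rule, so the sketch runs offline.
def keyword_model(prompt: str) -> str:
    text = prompt.split("Text:", 1)[1].lower()
    if "refund" in text:
        return "billing"
    if "crash" in text:
        return "bug"
    return "I don't know"


classes = ["billing", "bug", "account"]
dataset = [
    Example("I want a refund for last month", "billing"),
    Example("The app crashes on startup", "bug"),
    Example("How do I change my avatar?", "account"),
]

accuracy = evaluate(keyword_model, dataset, classes)
print(f"accuracy: {accuracy:.2f}")
```

Because the whole loop is a function of the model, the dataset, and the matcher, it can be rerun automatically for each candidate model, prompt variant, or temperature setting you want to compare.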


In conclusion, evaluating Large Language Models is not just important but essential in leveraging their capabilities effectively. It is crucial to approach this evaluation with a standardized and multifaceted methodology, considering the unique and advanced nature of these models. By doing so, we can harness the full potential of LLMs, ensuring their reliability and effectiveness in various applications, and paving the way for more innovations in the field of Natural Language Processing.

