"It Wrote That": An Overview of Language Models

March 4, 2020
Aug 11, 2022

Our previous post in our series on Language Models showed an example of a post generated by a state-of-the-art deep learning model, GPT-2. Now, let's take some time to dive deeper into the world of language models.

Introduction to Language Models

Language models capture the intricate relationships between words and exploit those relationships to perform tasks like text generation. Most language models take a sequence of words (or characters, subwords, groups of words, etc.) and try to predict a missing word in that sequence. The sequence can be a phrase, sentence, paragraph, or longer text such as a blog post. By repeatedly predicting the next word in a sequence, language models can generate new text. For example (each example begins with a human-written prompt, which the model then continues):

This blog post is part of a series on the use of a simple data structure called a linked list in R.

Today, I am going to share with you my favorite recipe from my new cookbook, the Ultimate Low-Carb Diet Cookbook

Once upon a time there was a rabbit. It was a normal rabbit. There was no evil. The rabbit was a normal rabbit.

Sport scores from last night’s match between Real Madrid and Bayer Leverkusen.

Bayer Leverkusen 1 - 0 Real Madrid

Manuel Neuer - 7.5/10
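The "repeatedly predicting the next word" loop described above can be sketched with a toy bigram (word-pair count) model. This is an illustration of the generation loop only, not of GPT-2 itself, which uses a neural network rather than raw counts:

```python
from collections import Counter, defaultdict

# Toy training text; a real model would learn from billions of words.
corpus = "the rabbit was a normal rabbit . the rabbit was happy .".split()

# Count which word follows which (a "bigram" language model).
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def generate(prompt_word, length=8):
    """Repeatedly predict the most likely next word, as described above."""
    words = [prompt_word]
    for _ in range(length):
        counts = following[words[-1]]
        if not counts:
            break  # no known continuation
        words.append(counts.most_common(1)[0][0])
    return " ".join(words)

print(generate("the"))  # the rabbit was a normal rabbit was a normal
```

Notice that the output quickly falls into a repetitive loop, much like the rabbit example above; such repetition is exactly the "memory" problem discussed in the next section.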

A History and Major Challenge of Language Models

Language models face a major challenge that does not arise in many other types of prediction problems: the need to deal with sequential data. The model must relate the word currently being predicted to words earlier in the sequence. The ability to relate different words in the sequence, referred to as a model's "memory," is critical for generating long, coherent, non-repeating text.

Early language models effectively had no memory. These early models primarily relied on word counts. Over time, new models were developed incorporating neural networks. These new models were much more capable, but they also struggled with the memory challenge. Eventually, new neural network architectures came along, such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, each able to account for longer distances between words in a sequence. This distance, however, was still too short to generate reasonable text the length of a blog post.

The most recent architectural developments have drastically improved the memory of state-of-the-art models. The transformer architecture, used by OpenAI's Generative Pretrained Transformer 2 (GPT-2) and other recent models such as Google AI's Bidirectional Encoder Representations from Transformers (BERT), can relate words that are much farther apart in a sequence than previous types of models could. This longer memory is what allows transformer-based models to produce longer passages of coherent text.
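The mechanism behind the transformer's long memory is self-attention: each word's vector is updated as a weighted blend of every other word's vector, so distant words can influence each other directly. Below is a minimal, unparameterized sketch of that idea; real transformers add learned query/key/value projections, multiple attention heads, and many stacked layers:

```python
import numpy as np

def self_attention(X):
    """Simplified self-attention over a (words x dimensions) matrix.

    Each position's output is a weighted average of all positions'
    vectors, so the first word can directly affect the last one --
    the source of the transformer's long "memory."
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)            # relevance of word j to word i
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ X                        # blend vectors by relevance

# Four "word" vectors of dimension 3.
X = np.random.default_rng(0).normal(size=(4, 3))
out = self_attention(X)
print(out.shape)  # one context-aware vector per word: (4, 3)
```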

How Language Models Work

Language models involve two main steps: 

  1. Represent the words of the input text as lists of numbers called vectors. Computers cannot directly understand words; they only understand numbers. Therefore, this step is necessary to convert the input text into something a computer can understand.
  2. Use these vectors to produce new output text. Sometimes the new text is answers to questions about a passage; sometimes it is a summary of a block of text; sometimes it is a translation of the original words; and sometimes it is a brand new description or story building on the original input.

During the first step, older language models did little more than count the occurrences of words or word combinations in the input text. While this is useful for certain tasks related to information retrieval, it fails to capture the relationships between word meanings, making it less suitable for more advanced applications such as text generation. For example, word vectors for "dog" and for "cat" from an older model tell you nothing about how similar or dissimilar dogs and cats are.
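A toy illustration of that limitation: in a count-based scheme, each word occupies its own vocabulary slot (a "one-hot" vector), so every pair of distinct words looks equally unrelated:

```python
import numpy as np

# Count-based representation: one dimension per vocabulary word.
vocab = {"dog": 0, "cat": 1, "car": 2}

def one_hot(word):
    """Vector with a 1 in the word's own slot and 0 everywhere else."""
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

# The dot product between any two different words is always zero,
# so "dog" is no more similar to "cat" than it is to "car".
print(one_hot("dog") @ one_hot("cat"))  # 0.0
print(one_hot("dog") @ one_hot("car"))  # 0.0
```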

Newer language models use word embeddings during this step. Rather than relying solely on the number of times a word appears in the text, embeddings incorporate contextual information learned from the training data. An embedding is simply a collection of real-valued vectors, one per word in the vocabulary, learned in ways that capture relationships between words. For example, embeddings can encode that "man," "king," and "sir" relate to one another in much the same way that "woman," "queen," and "madam" do. The next step then exploits this relationship information.
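This relational structure is often shown with vector arithmetic, as in the well-known "king - man + woman ≈ queen" analogy. The sketch below uses tiny hand-picked 2-D vectors purely for illustration; real embeddings have hundreds of dimensions and are learned from data, not chosen by hand:

```python
import numpy as np

# Hand-picked, illustrative "embeddings" (NOT learned values).
# The second dimension loosely encodes a man/woman contrast.
emb = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.1]),
    "queen": np.array([3.0, 1.1]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land closest to queen.
target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)  # queen
```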

During the second step, input vectors are fed through a predictor, and the model produces output text. In older models, the word vectors were combined into a single fixed-length sequence vector, which discarded word order and further compounded the memory issue. In current models, word vectors are fed through the model sequentially, allowing the model to take word positions into account. These current models have mechanisms that either carry information forward to be used in future word predictions (in the case of RNNs and LSTMs) or look backwards to incorporate prior words in the current prediction (in the case of Transformers).
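The "carry information forward" mechanism of RNNs can be sketched as a single recurrence: a hidden state is updated once per word, so after the loop it summarizes everything seen so far. This is only the core recurrence; real RNNs learn these weights from data and add output layers (and LSTMs add gating on top):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # vector size

# Randomly initialized weights; a trained model would learn these.
W_h = rng.normal(scale=0.1, size=(d, d))  # mixes the carried-forward state
W_x = rng.normal(scale=0.1, size=(d, d))  # mixes the current word vector

def rnn_step(h, x):
    """One recurrent step: blend the old memory with the new word."""
    return np.tanh(W_h @ h + W_x @ x)

word_vectors = rng.normal(size=(5, d))  # five word vectors, fed in order
h = np.zeros(d)                         # empty "memory" before any words
for x in word_vectors:
    h = rnn_step(h, x)                  # h now summarizes the prefix

print(h.shape)  # (4,) -- one state vector carrying the whole history
```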

What Can Language Models Do For You

Not restricted to generating blog posts, language models can be used to summarize text, answer questions about text, and translate text between languages. New uses of these large-scale language models are being continuously discovered, and while all of these applications are exciting, creating large-scale language models is difficult. Training them requires large amounts of data and computing power, including multiple GPUs.

As a practitioner, you can save time and resources by using existing models rather than building one from scratch. Many are available to download for free from OpenAI, Google, and Facebook, or from repositories that collect many models in one place. Trained on large, diverse datasets, these models are generic enough to apply to many projects. The GPT-2 model we used to generate our post was unmodified from OpenAI's release.

Still, there are times when you might need to make them more applicable to your specific task or problem. By applying transfer learning or fine-tuning techniques to pre-trained models, you can benefit from the training the creators put into them while adapting them to your needs. Best of all, these techniques require much smaller task-specific datasets and far less computing power. See our webapp for examples of models fine-tuned to tell knock-knock jokes and generate Dr. Seuss stories, or check out this Google Colab notebook to get started fine-tuning your own model!

Read the first blog post in this series to see OpenAI's language model in action.