Perplexity

Requirements

It is important to note that perplexity can only be calculated using autoregressive language models.

Perplexity is a metric commonly used in natural language processing to evaluate the quality of language models, particularly in the context of text generation. Unlike metrics such as BLEU or BERTScore, perplexity doesn't directly measure the quality of generated text by comparing it with reference texts. Instead, perplexity assesses the "confidence" or "surprise" of a language model in predicting the next word in a sequence of words.

Perplexity quantifies how well a language model can predict the next word in a sequence of words. It is calculated based on the probability distribution of words generated by the model. A high perplexity indicates that the language model is not confident in its text generation — that is, the model is "perplexed" — whereas a low perplexity indicates that the language model is confident in its generation.

Note: Perplexity can only be calculated for autoregressive language models

Autoregressive language models, such as GPT and LLaMA, are language models that work by generating one token at a time, based on a set number of preceding tokens.

To generate an output, the model analyzes the words that come before the current position and calculates the likelihood of each candidate word being the next one. In the simplest (greedy) case, it then picks the word with the highest probability as the next part of the sentence. After that, it repeats the entire process, using the newly selected word as part of the context for the next prediction.

For example, if the preceding generated text is "Hello, today is a great day to" and our context length is set to 6 preceding words, our model's output distribution might look like the following:

P("walk" | "today is a great day to") = 46%
P("run" | "today is a great day to") = 44%
P("sleep" | "today is a great day to") = 9.9%
P("goodbye" | "today is a great day to") = 0.1%
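The example above can be sketched in a few lines of Python. This is only an illustration: the helper names (`last_n_words`, `greedy_pick`, `sample_pick`) are made up for this sketch, the probabilities are the hypothetical ones listed above, and splitting on whitespace stands in for a real tokenizer.

```python
import random

# Hypothetical next-word distribution, copied from the example above.
NEXT_WORD_PROBS = {"walk": 0.46, "run": 0.44, "sleep": 0.099, "goodbye": 0.001}


def last_n_words(text, n):
    """Keep only the n most recent whitespace-separated words as context."""
    return text.split()[-n:]


def greedy_pick(distribution):
    """Return the word with the highest probability."""
    return max(distribution, key=distribution.get)


def sample_pick(distribution, rng):
    """Draw one word at random, weighted by its probability."""
    words = list(distribution)
    weights = [distribution[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]


context = last_n_words("Hello, today is a great day to", 6)
print(context)                       # ['today', 'is', 'a', 'great', 'day', 'to']
print(greedy_pick(NEXT_WORD_PROBS))  # walk
print(sample_pick(NEXT_WORD_PROBS, random.Random(0)))
```

Greedy selection always returns "walk" here, while sampling returns "run" almost as often, which is one reason generated text can differ between runs.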

For a deep dive into autoregressive language models like GPT and LLaMA, feel free to refer to this publication on the theory behind their inner workings.

Implementation Details

Mathematically, perplexity is defined as the exponential of the average negative log-likelihood of a sequence (don't worry if this doesn't make sense, we'll break it down!).

\[ \begin{align*} \text{Perplexity} &= \exp{\left( -\frac{1}{t} \sum_{i=1}^{t} \log p_\theta (x_i | x_\text{context}) \right)} \end{align*} \]

where \(t\) is the number of tokens in the sequence, \(x_i\) represents the \(i\)-th token that is generated, \(x_\text{context}\) represents the preceding tokens that the current generated token is conditioned on, and \(\log\) is the natural logarithm.
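One way to read this definition is to rewrite it without the logarithm. As a sketch, assuming natural logarithms and writing \(p_i\) as shorthand (introduced here only for illustration) for \(p_\theta (x_i | x_\text{context})\):

\[ \begin{align*} \text{Perplexity} &= \exp{\left( -\frac{1}{t} \sum_{i=1}^{t} \log p_i \right)} \\ &= \exp{\left( \log \prod_{i=1}^{t} p_i^{-1/t} \right)} \\ &= \left( \prod_{i=1}^{t} p_i \right)^{-1/t} \end{align*} \]

In other words, perplexity is the reciprocal of the geometric mean of the probabilities that the model assigned to the correct tokens.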

In pseudocode, the algorithm works as such:

Perplexity Pseudocode

import math

total_log_likelihood = 0.0
word_count = 0

for example in test_dataset:
    # The correct next word to be predicted
    reference_word = example.reference
    # Predicted output distribution over the vocabulary (word -> probability)
    predicted_distribution = model.predict(example.context)

    # Add the log-likelihood that the model assigns to the correct next word.
    total_log_likelihood += math.log(predicted_distribution[reference_word])
    word_count += 1

# Perplexity is the exponential of the negative log-likelihood averaged over
# the number of words.
perplexity = math.exp(-total_log_likelihood / word_count)

This implementation is simplified and ignores details such as tokenizing the test dataset.
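For readers who want something they can run, here is a minimal self-contained sketch of the same loop. The lookup table `TOY_TABLE`, the `toy_predict` function, and the `(context, reference_word)` tuple layout are all assumptions made for illustration; a real model would return a distribution over its full vocabulary.

```python
import math

# Hypothetical model: a lookup table from a context to a next-word distribution.
TOY_TABLE = {
    ("the",): {"cat": 0.5, "dog": 0.5},
    ("cat",): {"sat": 0.25, "ran": 0.75},
}


def toy_predict(context):
    """Return the next-word distribution for a context (assumed to be in the table)."""
    return TOY_TABLE[tuple(context)]


def perplexity(examples, predict):
    """Compute perplexity over (context, reference_word) pairs."""
    total_log_likelihood = 0.0
    word_count = 0
    for context, reference_word in examples:
        probability = predict(context).get(reference_word, 0.0)
        if probability == 0.0:
            # log(0) is undefined; the model ruled out the correct word entirely.
            return math.inf
        total_log_likelihood += math.log(probability)
        word_count += 1
    if word_count == 0:
        raise ValueError("perplexity is undefined for an empty dataset")
    return math.exp(-total_log_likelihood / word_count)


examples = [(["the"], "cat"), (["cat"], "sat")]
result = perplexity(examples, toy_predict)
print(result)  # approximately 2.828, i.e. sqrt(8)
```

The explicit check for a zero probability is a design choice of this sketch: it reports infinite perplexity instead of raising an error from `math.log`.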

Interpretation

Ideally, a lower perplexity is desirable. Since perplexity is a relative metric with no upper bound (its minimum is 1, reached only when the model assigns probability 1 to every correct token), it is best used to compare similar models and configurations on a specific task and dataset to look for improvements in model performance.
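A useful reference point when reading a perplexity score: a model that spreads its probability uniformly over a vocabulary of V words has a perplexity of exactly V. The sketch below checks this numerically along with the two extremes; `perplexity_from_probs` is a hypothetical helper that takes the probabilities a model assigned to the correct tokens.

```python
import math


def perplexity_from_probs(probs):
    """Perplexity from the probabilities assigned to the correct tokens."""
    if any(p == 0.0 for p in probs):
        return math.inf  # a single impossible token makes perplexity unbounded
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))


# A uniform guess over V words gives a perplexity of V (up to rounding).
for vocab_size in (2, 100, 50_000):
    uniform_probs = [1.0 / vocab_size] * 10
    print(vocab_size, perplexity_from_probs(uniform_probs))

perfect = perplexity_from_probs([1.0, 1.0, 1.0])
worst = perplexity_from_probs([0.9, 0.0])
print(perfect)  # 1.0
print(worst)    # inf
```

Under this reading, a perplexity of 20 means the model is, on average, about as uncertain as if it were choosing uniformly among 20 words.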

For context, the perplexity score of GPT-2 calculated on the wikitext dataset is 16.4541. The perplexity of LLaMA-2 7B calculated on its diverse test dataset is between 5.0 and 6.0. It is not uncommon to see your language model score somewhere between these values on text-generation tasks.

Example

Assume we have a sentence-completion language model that uses a context length of 4 words and has a very limited vocabulary. Let's calculate the perplexity of this hypothetical model.

Ground Truth	Context
`The curious cat explored the mysterious, overgrown garden.`	`The curious cat explored...`

Step 1. Generate output distributions

Forward Pass 1:
model([the, curious, cat, explored]) -> {the: 45%, a: 25%, with: 10%, ...}*
* In this notation, the model outputs, for each word in its vocabulary, the probability that it will be the next word in the sentence. That is, the model believes there is a 45% chance that the next word will be "the", a 25% chance that the next word will be "a", and so on.

Forward Pass 2:
model([curious, cat, explored, the]) -> {mysterious: 20%, dark: 15%, colorful: 10%, ...}

Forward Pass 3:
model([cat, explored, the, mysterious]) -> {overgrown: 70%, and: 12%, underground: 2%, ...}

Forward Pass 4:
model([explored, the, mysterious, overgrown]) -> {jungle: 60%, cave: 30%, garden: 5%, ...}
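The four forward passes follow a sliding-window pattern: each context is the 4 words immediately before the word being predicted. A minimal sketch of how those (context, target) pairs could be built is shown below. The function name `sliding_windows` is made up for this sketch, and lowercasing, dropping punctuation, and splitting on whitespace are simplifying assumptions in place of a real tokenizer.

```python
def sliding_windows(tokens, context_length):
    """Pair each token with the context_length tokens that precede it."""
    return [
        (tokens[i - context_length:i], tokens[i])
        for i in range(context_length, len(tokens))
    ]


# Ground-truth sentence, lowercased and with punctuation removed.
sentence = "the curious cat explored the mysterious overgrown garden"
pairs = sliding_windows(sentence.split(), 4)

for context, target in pairs:
    print(context, "->", target)
```

This yields four pairs whose targets are "the", "mysterious", "overgrown", and "garden", matching the four forward passes.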

Step 2. Calculate Perplexity

Our total log likelihood, using the natural logarithm and the probabilities assigned to the ground-truth words ("the", "mysterious", "overgrown", "garden"), is:

\[ \log(0.45) + \log(0.2) + \log(0.7) + \log(0.05) \approx -5.760 \]

Thus, our perplexity is given by:

\[ \begin{align*} \text{Perplexity} &\approx \exp{\left( \frac{5.760}{4} \right)} \\ &\approx 4.221 \end{align*} \]

Note that in reality, the assigned output probabilities may be much lower due to models having extremely large vocabularies. So, a perplexity of 4.221 may not be realistic and representative of an actual language model.
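The arithmetic for this example can be checked with a short script. It takes the probabilities that the hypothetical model assigned to the four ground-truth words (0.45, 0.2, 0.7, 0.05) and uses the natural logarithm, so that it matches the exponential in the definition. The second calculation is an extra sanity check added here: any logarithm base gives the same perplexity as long as the exponentiation uses the same base.

```python
import math

# Probabilities assigned to the ground-truth words:
# "the", "mysterious", "overgrown", "garden".
probs = [0.45, 0.20, 0.70, 0.05]

# Natural log paired with exp, as in the definition of perplexity.
total_log_likelihood = sum(math.log(p) for p in probs)
perplexity = math.exp(-total_log_likelihood / len(probs))

# Base-10 logs must be paired with a base-10 exponentiation.
total_log10_likelihood = sum(math.log10(p) for p in probs)
perplexity_base10 = 10 ** (-total_log10_likelihood / len(probs))

print(round(total_log_likelihood, 3))  # -5.76
print(round(perplexity, 3))            # 4.221
print(round(perplexity_base10, 3))     # 4.221
```

Mixing bases, for example summing base-10 logs and then applying `exp`, would give a different and incorrect value.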

Limitations and Biases

While perplexity is a valuable metric for comparing and fine-tuning language models, it has its limitations and should always be used in conjunction with other evaluation metrics and methods.

Perplexity is sensitive to the size and diversity of datasets. Since each dataset has its own distribution of words and semantic nuances, a model that achieves low perplexity on one dataset may underperform on another. Thus, it is important to gauge perplexity on a diverse dataset with different types of texts.
Perplexity is hard to interpret in absolute terms. Unlike metrics such as BLEU or BERTScore, perplexity only provides a measure of "confidence" rather than a measure of semantic or textual similarity. While a model with lower perplexity is desirable, the score gives little insight into the semantic quality of the generated texts.
Perplexity does not capture the diversity of language usage (e.g. emotions and expressions). In many workflows such as text generation, language diversity is an important quality. A model with low perplexity may still generate repetitive or uninteresting text, which isn't necessarily desirable.
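The dataset sensitivity described above can be illustrated with a deliberately tiny model. The sketch below is not an autoregressive neural model: it is a hypothetical unigram model with add-one smoothing, and all three text snippets are invented for this illustration. It is scored once on text resembling its training data and once on text from a different domain.

```python
import math
from collections import Counter


def train_unigram(tokens, vocab):
    """Unigram probabilities with add-one smoothing over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {word: (counts[word] + 1) / total for word in vocab}


def perplexity(model, tokens):
    """Exponential of the average negative log-likelihood of the tokens."""
    return math.exp(-sum(math.log(model[token]) for token in tokens) / len(tokens))


train_tokens = "the cat sat on the mat the cat slept".split()
in_domain = "the cat sat on the mat".split()
out_of_domain = "stocks fell on the news".split()

# Shared vocabulary so that every evaluated word has a nonzero probability.
vocab = set(train_tokens) | set(in_domain) | set(out_of_domain)
model = train_unigram(train_tokens, vocab)

in_domain_ppl = perplexity(model, in_domain)
out_of_domain_ppl = perplexity(model, out_of_domain)
print(in_domain_ppl, out_of_domain_ppl)
```

The same model scores a noticeably lower perplexity on the in-domain text than on the out-of-domain text, which is why scores from different datasets should not be compared directly.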

Limitations aside, perplexity remains a widely used and valuable metric for many applications in natural language processing. It is a useful metric to have in most NLP workflows, and it provides a different perspective on the quality of your language model than textual-similarity metrics do.