ROUGE-N#
ROUGE vs. Recall
Complementary to BLEU, ROUGE-N can be thought of as an analog to recall for text comparisons.
ROUGE-N (Recall-Oriented Understudy for Gisting Evaluation), a metric within the broader ROUGE metric collection, is a vital metric in the field of natural language processing and text evaluation. It assesses the quality of a candidate text by measuring the overlap of n-grams between the candidate text and reference texts, and ranges between 0 and 1. A score of 0 indicates no overlap between candidate and reference texts, whereas a perfect score of 1 indicates complete overlap. ROUGE-N provides insight into the ability of a system to capture essential content and linguistic nuances, making it an important and versatile tool in many NLP workflows. As the name implies, it is a recall-based metric, complementing the precision-based BLEU score.
Implementation Details#
Formally, ROUGE-N is an n-gram recall between a candidate text and a set of reference texts. That is, ROUGE-N divides the number of overlapping n-grams between the candidate and reference texts by the total number of n-grams in the reference texts.
Mathematically, we define ROUGE-N as follows:

\[
\text{ROUGE-N} = \frac{\sum_{S \in \text{References}} \sum_{\text{n-gram} \in S} \text{Match}(\text{n-gram})}{\sum_{S \in \text{References}} \sum_{\text{n-gram} \in S} \text{Count}(\text{n-gram})}
\]
where \(\text{Match}(\text{n-gram})\) is the maximum number of times an n-gram co-occurs in the candidate text and the set of reference texts.
It is clear that ROUGE-N is analogous to a recall-based measure, since the denominator of the equation is the total number of n-grams on the reference side.
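The formula above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and lowercasing; the function names are illustrative, not from any particular library:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n):
    """ROUGE-N: clipped n-gram matches summed over all references,
    divided by the total number of reference n-grams."""
    cand = ngram_counts(candidate.lower().split(), n)
    match = total = 0
    for reference in references:
        ref = ngram_counts(reference.lower().split(), n)
        total += sum(ref.values())
        # Match(n-gram): an n-gram matches at most as many times as it
        # occurs in the candidate (clipping).
        match += sum(min(count, cand[gram]) for gram, count in ref.items())
    return match / total if total else 0.0
```

Production implementations, such as the `rouge-score` package, typically add stemming and more careful tokenization, so their scores may differ slightly from this sketch.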
Multiple References: Taking the Max#
We usually want to compare a candidate text against multiple reference texts, as no single reference text can capture all semantic nuances. Though the above formula is sufficient for calculating ROUGE-N across multiple references, another proposed way of calculating it, highlighted in the original ROUGE paper, is as follows:
We compute the pairwise ROUGE-N between a candidate text \(s\) and every reference \(r_i\) in the reference set, then take the maximum of the pairwise ROUGE-N scores as the final multiple-reference ROUGE-N score. That is,

\[
\text{ROUGE-N}_\text{multi} = \max_i \text{ROUGE-N}(r_i, s)
\]
The decision to use the classic \(\text{ROUGE-N}\) or \(\text{ROUGE-N}_\text{multi}\) is up to the user.
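Under the same whitespace-tokenization and lowercasing assumptions as before, the max-over-references variant can be sketched as follows (names are illustrative):

```python
from collections import Counter

def pairwise_rouge_n(cand_tokens, ref_tokens, n):
    """ROUGE-N between one candidate and one reference (clipped matches)."""
    cand = Counter(tuple(cand_tokens[i:i + n])
                   for i in range(len(cand_tokens) - n + 1))
    ref = Counter(tuple(ref_tokens[i:i + n])
                  for i in range(len(ref_tokens) - n + 1))
    total = sum(ref.values())
    match = sum(min(count, cand[gram]) for gram, count in ref.items())
    return match / total if total else 0.0

def rouge_n_multi(candidate, references, n):
    """Score the candidate against each reference separately, keep the best."""
    cand_tokens = candidate.lower().split()
    return max(pairwise_rouge_n(cand_tokens, r.lower().split(), n)
               for r in references)
```

Note that because the classic formula pools n-gram counts across all references while this variant scores each reference separately, the two can disagree for the same inputs.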
Interpretation#
When using ROUGE-N, it is important to consider the metric for multiple values of N (the n-gram length). Smaller values of N, as in ROUGE-1, place more focus on capturing the presence of keywords or content terms in the candidate text. Larger values of N, as in ROUGE-3, place more focus on syntactic structure and the replication of linguistic patterns. In other words, ROUGE-1 and other small values of N suit tasks where the primary concern is whether the candidate text contains the essential vocabulary, while ROUGE-3 and larger values of N suit tasks where sentence structure and fluency are important.
Generally speaking, a higher ROUGE-N score is desirable. However, scores vary across tasks and values of N, so it is a good idea to benchmark your model's ROUGE-N score against other models trained on the same data, or against previous iterations of your model.
Examples#
ROUGE-1 (Unigrams)#
Assume we have the following candidate and reference texts:
Reference #1: "A fast brown dog jumps over a sleeping fox"
Reference #2: "A quick brown dog jumps over the fox"
Candidate: "The quick brown fox jumps over the lazy dog"
Step 1: Tokenization & n-Grams
Splitting our candidate and reference texts into unigrams, we get the following:
Reference #1: ["A", "fast", "brown", "dog", "jumps", "over", "a", "sleeping", "fox"]
Reference #2: ["A", "quick", "brown", "dog", "jumps", "over", "the", "fox"]
Candidate: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
Step 2: Calculate ROUGE
Recall that our ROUGE-N formula is: \(\frac{\text{# of overlapping n-grams}}{\text{# of n-grams in the references}}\)
There are 5 overlapping unigrams with the first reference and 7 with the second (matching case-insensitively), and 9 total unigrams in the first reference and 8 in the second. Thus our calculated ROUGE-1 score is \(\frac{12}{17} \approx 0.706\).
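The arithmetic above can be double-checked with a few lines of Python (whitespace tokenization and lowercasing assumed):

```python
from collections import Counter

cand = Counter("The quick brown fox jumps over the lazy dog".lower().split())
ref1 = Counter("A fast brown dog jumps over a sleeping fox".lower().split())
ref2 = Counter("A quick brown dog jumps over the fox".lower().split())

# Clipped overlap with each reference, summed across references.
match = sum(min(c, cand[w]) for w, c in ref1.items()) \
      + sum(min(c, cand[w]) for w, c in ref2.items())
total = sum(ref1.values()) + sum(ref2.values())
print(match, total, round(match / total, 3))  # 12 17 0.706
```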
ROUGE-2 (Bigrams)#
Assume we have the same candidate and reference texts as before:
Reference #1: "A fast brown dog jumps over a sleeping fox"
Reference #2: "A quick brown dog jumps over the fox"
Candidate: "The quick brown fox jumps over the lazy dog"
Step 1: Tokenization & n-Grams
Splitting our candidate and reference texts into bigrams, we get the following:
Reference #1: ["A fast", "fast brown", "brown dog", "dog jumps", "jumps over", "over a", "a sleeping", "sleeping fox"]
Reference #2: ["A quick", "quick brown", "brown dog", "dog jumps", "jumps over", "over the", "the fox"]
Candidate: ["The quick", "quick brown", "brown fox", "fox jumps", "jumps over", "over the", "the lazy", "lazy dog"]
Step 2: Calculate ROUGE
Recall that our ROUGE-N formula is: \(\frac{\text{# of overlapping n-grams}}{\text{# of n-grams in the references}}\)
There is 1 overlapping bigram with the first reference and 3 with the second, and 8 total bigrams in the first reference and 7 in the second. Thus our calculated ROUGE-2 score is \(\frac{4}{15} \approx 0.267\).
Note that our ROUGE-2 score is significantly lower than our ROUGE-1 score on the same candidate and reference texts. It is always important to consider multiple values of N when using ROUGE-N, as a single value of N does not give a holistic view of candidate text quality.
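The same check works for bigrams, again assuming whitespace tokenization and lowercasing:

```python
from collections import Counter

def bigram_counts(text):
    """Count the bigrams in a lowercased, whitespace-tokenized string."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

cand = bigram_counts("The quick brown fox jumps over the lazy dog")
ref1 = bigram_counts("A fast brown dog jumps over a sleeping fox")
ref2 = bigram_counts("A quick brown dog jumps over the fox")

# Clipped overlap with each reference, summed across references.
match = sum(min(c, cand[g]) for g, c in ref1.items()) \
      + sum(min(c, cand[g]) for g, c in ref2.items())
total = sum(ref1.values()) + sum(ref2.values())
print(match, total, round(match / total, 3))  # 4 15 0.267
```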
Limitations and Biases#
ROUGE-N, like any other n-gram-based metric, suffers from the following limitations:

Unlike BERTScore, ROUGE-N does not consider order, context, or semantics when calculating a score. Since it relies only on overlapping n-grams, it cannot tell when a synonym is being used, or whether the placement of two matching n-grams changes the meaning of the overall sentence. As a result, the metric may not be a faithful representation of the quality of the text, but rather of the "likeness" of the n-grams in two sentences. Take, for example, the scores of "This is an example of text" and "Is an example of text this". Both ROUGE-1 and ROUGE-2 would give this pair a perfect or near-perfect score, but the second sentence makes absolutely no sense!
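This order-insensitivity is easy to demonstrate with a small pairwise scorer (an illustrative sketch, assuming whitespace tokenization and lowercasing):

```python
from collections import Counter

def rouge_n_pair(candidate, reference, n):
    """Pairwise ROUGE-N with clipped n-gram matches."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    cand = Counter(tuple(cand_tokens[i:i + n])
                   for i in range(len(cand_tokens) - n + 1))
    ref = Counter(tuple(ref_tokens[i:i + n])
                  for i in range(len(ref_tokens) - n + 1))
    match = sum(min(count, cand[gram]) for gram, count in ref.items())
    return match / sum(ref.values())

reference = "This is an example of text"
scrambled = "Is an example of text this"
print(rouge_n_pair(scrambled, reference, 1))  # 1.0: every unigram matches
print(rouge_n_pair(scrambled, reference, 2))  # 0.8: 4 of 5 reference bigrams match
```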

ROUGE-N cannot capture global coherence. For a long paragraph, too large a value of N would not return a meaningful score, while a reasonable value like N = 3 cannot capture the flow of the text: the score might look good even though the paragraph does not flow smoothly at all. This is a weakness of all n-gram-based metrics, as they are limited to short context windows.
That being said, ROUGE-N has some advantages over embeddings-based metrics. First, it is very simple and cheap to compute: it can score large corpora efficiently with no specialized hardware. ROUGE-N is also relatively easy to interpret. The value of N can be adjusted to control the granularity of the measurement, and higher scores indicate greater overlap with the reference text. Moreover, ROUGE is very widely used in NLP, which lets engineers benchmark their models against others on most open-source NLP datasets. Lastly, it can be used alongside other n-gram-based metrics like BLEU to provide a more holistic view of test results: since BLEU provides a precision-oriented score and ROUGE provides a recall-oriented score, using both makes it easier to pinpoint potential failure cases.