WER, CER, and MER
Word Error Rate (WER), Character Error Rate (CER), and Match Error Rate (MER) are essential metrics used in the evaluation of speech recognition and natural language processing systems. At a high level, each quantifies how far a candidate text deviates from a reference text, with zero being a perfect score. While word and character error rates have no upper bound, match error rate is always between 0 and 1. Each of these metrics has its own nuances and reveals different errors within texts.
Example
To see an example of WER, CER, and MER, check out LibriSpeech on app.kolena.com/try.
Substitutions, Deletions, and Insertions
The building blocks of each metric include substitution, deletion, and insertion errors. These errors reveal different failures in candidate texts, and are aggregated to calculate the word, character, and match error rate.
Substitution Errors
Substitutions occur when a candidate text contains a word or sequence of words that is different from the corresponding word or sequence of words in the reference text. Substitutions can be counted on a word, character, or sentence level depending on the application.
Example:
Reference: Amidst the emerald meadow, butterflies whispered secrets in the breeze.
Candidate: Amidst the emerald shadow, butterflies whistled secrets on the breeze.
Word-level Substitutions:
Amidst the emerald **shadow,** butterflies **whistled** secrets **on** the breeze.
Character-level Substitutions:
Amidst the emerald shadow, butterflies whistled secrets on the breeze.
In the above example, there are 3 word-level substitutions and 9 character-level substitutions.
Deletion Errors
Deletions occur when a candidate text is missing a word or sequence of words from the reference text.
Example:
Reference: Amidst the emerald meadow, butterflies whispered secrets in the breeze.
Candidate: Amidst the emerald meadow, butterflies whispered.
Word-level Deletions:
Amidst the emerald meadow, butterflies whispered **secrets in the breeze**.
Character-level Deletions:
Amidst the emerald meadow, butterflies whispered **secrets in the breeze**.
In the above example, there are 4 word-level deletions and 18 character-level deletions.
Insertion Errors
Insertions occur when a candidate text contains an extra word or sequence of words that is not present in the reference text.
Example:
Reference: Amidst the emerald meadow, butterflies whispered secrets in the breeze.
Candidate: Amidst the emerald meadow, butterflies whispered ethereal secrets in the breeze.
Word-level Insertions:
Amidst the emerald meadow, butterflies whispered **ethereal** secrets in the breeze.
Character-level Insertions:
Amidst the emerald meadow, butterflies whispered **ethereal** secrets in the breeze.
In the above example, there is 1 word-level insertion and 8 character-level insertions.
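The word-level counts in the three examples above can be reproduced with a short sketch. It assumes a minimum-edit-distance (Levenshtein) alignment over lowercased, punctuation-stripped words; `tokenize` and `count_errors` are hypothetical helper names, and a production tool may normalize or break ties differently.

```python
import string


def tokenize(text):
    """Lowercase, strip punctuation, and split on whitespace (an assumed normalization)."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def count_errors(reference, candidate):
    """Return (substitutions, deletions, insertions) from one minimum-edit alignment."""
    ref, cand = tokenize(reference), tokenize(candidate)
    n, m = len(ref), len(cand)
    # dist[i][j] = minimum number of edits turning ref[:i] into cand[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = dist[i - 1][j - 1] + (ref[i - 1] != cand[j - 1])
            dist[i][j] = min(diagonal, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Walk back from the bottom-right corner, classifying each step.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == cand[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1  # correct word, no error
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            subs, i, j = subs + 1, i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels, i = dels + 1, i - 1
        else:
            ins, j = ins + 1, j - 1
    return subs, dels, ins


REFERENCE = "Amidst the emerald meadow, butterflies whispered secrets in the breeze."
print(count_errors(REFERENCE, "Amidst the emerald shadow, butterflies whistled secrets on the breeze."))  # (3, 0, 0)
print(count_errors(REFERENCE, "Amidst the emerald meadow, butterflies whispered."))  # (0, 4, 0)
print(count_errors(REFERENCE, "Amidst the emerald meadow, butterflies whispered ethereal secrets in the breeze."))  # (0, 0, 1)
```

Stripping punctuation matters here: without it, `whispered.` in the second candidate would not match `whispered` in the reference.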
Word Error Rate
Word Error Rate is a fundamental metric that measures the accuracy of a candidate text by considering three types of errors — substitutions, deletions, and insertions. Word-level errors surface mispredicted words, and it can be useful to visualize common word-level failures to flesh out weaknesses in a model.
Formally, it is defined as the rate of word-level errors in a candidate text.
Example
Let's calculate the word error rate between the following reference and candidate texts:
Reference | Candidate |
---|---|
The bard sang ancient melodies of nature, transforming tranquil meadows into sonnets for enhanced soulful grace. | The poetic bard echoed ancient melodies, transcending meadows into sonnets for enhanced soulful grace. |
Step 1. Count Errors
Highlighting the substitution, deletion, and insertion errors, we can count each type of error:
The **poetic** (insertion) bard **echoed** (substitution) ancient melodies **of nature,** (deletion) **transcending** (substitution) **tranquil** (deletion) meadows into sonnets for enhanced soulful grace.
In our candidate text, we have 2 substitutions, 3 deletions, and 1 insertion.
Step 2. Calculate WER
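With each error counted, we can calculate our WER. Using the formula

$$
\text{WER} = \frac{S + D + I}{N}
$$

where $S$, $D$, and $I$ are the numbers of word-level substitutions, deletions, and insertions, and $N$ is the number of words in the reference text (16 here), we arrive at a WER of $(2 + 3 + 1) / 16 = 0.375$ for our candidate text.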
It is important to note that WER is not bounded above by 1. If we had a reference of "hello" and a candidate of "bye bye", the candidate would contain one substitution and one insertion, so our WER would be 2.0: we have 2 errors in the candidate and only 1 word in the reference. Generally speaking, we want our WER to be as close to 0 as possible.
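As a minimal sketch of this calculation, the snippet below computes WER as the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The helper names `words`, `edit_distance`, and `wer` are illustrative, and the normalization (lowercasing, stripping punctuation) is an assumption rather than a fixed part of the metric.

```python
import string


def words(text):
    """Lowercase, strip punctuation, and split on whitespace (an assumed normalization)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).split()


def edit_distance(ref, cand):
    """Minimum number of substitutions, deletions, and insertions between two sequences."""
    # previous[j] = edits needed to turn the reference prefix seen so far into cand[:j]
    previous = list(range(len(cand) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, c in enumerate(cand, start=1):
            current.append(min(previous[j - 1] + (r != c),  # match or substitution
                               previous[j] + 1,             # deletion
                               current[j - 1] + 1))         # insertion
        previous = current
    return previous[-1]


def wer(reference, candidate):
    """Word error rate; raises ZeroDivisionError for an empty reference."""
    ref = words(reference)
    return edit_distance(ref, words(candidate)) / len(ref)


reference = ("The bard sang ancient melodies of nature, transforming tranquil "
             "meadows into sonnets for enhanced soulful grace.")
candidate = ("The poetic bard echoed ancient melodies, transcending meadows "
             "into sonnets for enhanced soulful grace.")
print(wer(reference, candidate))  # 0.375
print(wer("hello", "bye bye"))    # 2.0
```

The second call shows the unbounded case: two errors against a one-word reference give a WER above 1.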
Character Error Rate
Character Error Rate is another metric that measures the accuracy of a candidate text through substitutions, deletions, and insertions. Unlike word-level errors, character-level errors are useful in surfacing mispronunciations and erroneous phonemes. CER is defined as the rate of character-level errors in a candidate text.
Example
Let's calculate the character error rate using the same reference and candidate texts as the previous example:
Reference | Candidate |
---|---|
The bard sang ancient melodies of nature, transforming tranquil meadows into sonnets for enhanced soulful grace. | The poetic bard echoed ancient melodies, transcending meadows into sonnets for enhanced soulful grace. |
Step 1. Count Errors
Highlighting the substitution, deletion, and insertion errors, we can count each type of error:
The **poetic** (insertion) bard **echoed** (substitution) ancient melodies **of** (deletion) **nature,** (deletion) trans**cending** (substitution) **tranquil** (deletion) meadows into sonnets for enhanced soulful grace.
In our candidate text, we have 13 substitutions, 16 deletions, and 6 insertions.
Step 2. Calculate CER
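With each error counted, we can calculate our CER. Using the formula

$$
\text{CER} = \frac{S + D + I}{N}
$$

where $S$, $D$, and $I$ are the numbers of character-level substitutions, deletions, and insertions, and $N$ is the number of characters in the reference text (110 here, counting spaces but not punctuation), we arrive at a CER of $(13 + 16 + 6) / 110 \approx 0.318$ for our candidate text. Note that the CER is lower than the WER calculated in the previous example. Although the errors are similar between the two calculations, the character-level substitution only replaces the characters that follow `trans` (`cending`), whereas the word-level substitution replaces the entire word `transforming`. This suggests that our model could be weak at recognizing the specific phonemes coming after trans-. However, this would be hard to confirm without more data.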
It is valuable to use CER alongside WER in speech recognition and NLP tasks, as each metric can surface different types of errors. A model with a high WER but low CER can indicate that the model is mainly mispredicting specific phonemes rather than entire words, whereas a WER and CER that are both similarly high can indicate poor ability to make predictions at the word level.
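To make the contrast between the two metrics concrete, here is a small sketch that applies one edit-distance routine to both words and characters. The function names are illustrative, and the character counts come from a minimum-edit alignment over the raw strings (spaces included), so they can differ from a hand-highlighted count such as the one in the example above.

```python
def edit_distance(ref, cand):
    """Minimum number of substitutions, deletions, and insertions between two sequences."""
    previous = list(range(len(cand) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, c in enumerate(cand, start=1):
            current.append(min(previous[j - 1] + (r != c),  # match or substitution
                               previous[j] + 1,             # deletion
                               current[j - 1] + 1))         # insertion
        previous = current
    return previous[-1]


def wer(reference, candidate):
    """Word error rate over whitespace-separated tokens."""
    ref = reference.split()
    return edit_distance(ref, candidate.split()) / len(ref)


def cer(reference, candidate):
    """Character error rate over raw characters, spaces included (an assumption)."""
    return edit_distance(reference, candidate) / len(reference)


reference = "transforming tranquil meadows"
candidate = "transcending tranquil meadows"
print(round(wer(reference, candidate), 3))  # 0.333 (1 of 3 words is wrong)
print(round(cer(reference, candidate), 3))  # 0.138 (4 of 29 characters are wrong)
```

Here one mispredicted word costs a third of the WER budget but only a small fraction of the CER, which is the pattern of a high WER with a low CER described above.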
Match Error Rate
While WER and CER focus on errors, Match Error Rate takes a slightly different approach by placing more emphasis on correct matches. Similar to WER, it is calculated using word-level substitutions, deletions, and insertions.
Example
Let's calculate the match error rate using the same reference and candidate texts as the previous examples:
Reference | Candidate |
---|---|
The bard sang ancient melodies of nature, transforming tranquil meadows into sonnets for enhanced soulful grace. | The poetic bard echoed ancient melodies, transcending meadows into sonnets for enhanced soulful grace. |
Step 1. Count Errors
Highlighting the substitution, deletion, and insertion errors, we can count each type of error:
The **poetic** (insertion) bard **echoed** (substitution) ancient melodies **of nature,** (deletion) **transcending** (substitution) **tranquil** (deletion) meadows into sonnets for enhanced soulful grace.
In our candidate text, we have 2 substitutions, 3 deletions, and 1 insertion.
Step 2. Calculate MER
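With each error counted, we can calculate our MER. Using the formula

$$
\text{MER} = \frac{S + D + I}{H + S + D + I}
$$

where $S$, $D$, and $I$ are the numbers of word-level substitutions, deletions, and insertions, and $H$ is the number of correctly matched words (11 here), we arrive at a MER of $(2 + 3 + 1) / (11 + 2 + 3 + 1) \approx 0.353$ for our candidate text. This is roughly in line with what we had for CER and WER.
In general, all three metrics are similar, yet each reveals slightly different errors within the candidate text.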
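The sketch below shows one way to compute MER as errors divided by errors plus correct matches (hits). It assumes a minimum-error word alignment in which ties are broken in favor of more hits; other tools may break ties differently, which can change the result for some inputs. The names `words`, `align`, and `mer` are hypothetical.

```python
import string


def words(text):
    """Lowercase, strip punctuation, and split on whitespace (an assumed normalization)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).split()


def align(ref, cand):
    """Return (errors, hits) for a minimum-error alignment, preferring more hits on ties."""
    table = [[(0, 0)] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0)  # only deletions
    for j in range(1, len(cand) + 1):
        table[0][j] = (j, 0)  # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(cand) + 1):
            e, h = table[i - 1][j - 1]
            diagonal = (e, h + 1) if ref[i - 1] == cand[j - 1] else (e + 1, h)
            deletion = (table[i - 1][j][0] + 1, table[i - 1][j][1])
            insertion = (table[i][j - 1][0] + 1, table[i][j - 1][1])
            # Fewest errors first, then most hits.
            table[i][j] = min(diagonal, deletion, insertion, key=lambda eh: (eh[0], -eh[1]))
    return table[-1][-1]


def mer(reference, candidate):
    """Match error rate: errors / (hits + errors), always between 0 and 1."""
    errors, hits = align(words(reference), words(candidate))
    total = errors + hits
    return errors / total if total else 0.0


reference = ("The bard sang ancient melodies of nature, transforming tranquil "
             "meadows into sonnets for enhanced soulful grace.")
candidate = ("The poetic bard echoed ancient melodies, transcending meadows "
             "into sonnets for enhanced soulful grace.")
print(round(mer(reference, candidate), 3))  # 0.353
print(mer("hello", "bye bye"))              # 1.0
```

The `hello` / `bye bye` pair, which produces a WER of 2.0, yields a MER of 1.0 because insertions also enlarge the denominator.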
Limitations and Biases
As seen in their formulas, WER, CER, and MER only count exact matches between words or characters and give no consideration to alternate spellings. For example, a candidate text would be penalized if it contained the word "gray" while the reference text contained the word "grey". Though the two spellings are perfectly acceptable and do not change a sentence's meaning, these metrics treat the difference as an error. Although these types of false errors can be mitigated through a rule-based error calculation, doing so adds extra complexity and provides no guarantee of mitigating all false errors.
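One possible shape for such a rule-based step is a normalization pass applied to both texts before any errors are counted. The sketch below is illustrative only: `SPELLING_VARIANTS` is a hypothetical, hand-maintained mapping, and any real list of variants would be incomplete, which is the limitation described above.

```python
import string

# Hypothetical mapping from spelling variants to one canonical form.
SPELLING_VARIANTS = {"grey": "gray", "colour": "color"}


def normalize(text):
    """Lowercase, strip punctuation, and map known spelling variants to a canonical form."""
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    return [SPELLING_VARIANTS.get(token, token) for token in tokens]


reference = normalize("The grey cat slept.")
candidate = normalize("The gray cat slept.")
print(reference == candidate)  # True: no error is counted after normalization
```

Any variant missing from the mapping would still be scored as a substitution.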