There's more...

Evaluating the quality of generated text works similarly to evaluating labels. The Bilingual Evaluation Understudy (BLEU) score is a popular metric for comparing a generated translation of a piece of text to a reference translation. It ranges from 0 to 1: the closer the generated text is to the reference, the higher the score, with 1 being the score of a perfect match. BLEU compares the n-grams of the candidate text with the n-grams of the reference translation and counts the matches, which are position-independent. The matching is also modified so that a candidate cannot inflate its score simply by repeating a few reasonable words: each candidate n-gram is only credited as many times as it appears in the reference. This technique is referred to as modified n-gram precision.

For example, let's say we have the following reference text and two generated texts:

  • Reference text: I am feeling very enthusiastic.
  • Generated text 1: I am feeling very very enthusiastic enthusiastic.
  • Generated text 2: I feel enthusiastic.

In this example, we can see that generated text 2 is the better rendering of the reference text, even though generated text 1 might achieve a higher precision, since every one of its words appears in the reference. Modified n-gram precision deals with this by clipping the count of each candidate word at the number of times it occurs in the reference, so the repeated words in generated text 1 are not rewarded. The BLEU score is fast and easy to calculate. However, there are a few challenges with this metric: it considers neither the meaning of the sentences nor their structure, so it does not always map well to human judgment.
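
To make this concrete, here is a minimal sketch (assuming NLTK is installed; the helper function, variable names, and smoothing choice are illustrative rather than part of the recipe) that computes the modified unigram precision for the two generated texts by hand and then scores them with NLTK's sentence_bleu:

  from collections import Counter

  from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

  reference = "I am feeling very enthusiastic".split()
  generated_1 = "I am feeling very very enthusiastic enthusiastic".split()
  generated_2 = "I feel enthusiastic".split()

  def modified_unigram_precision(reference, candidate):
      # Clip each candidate word's count at its count in the reference,
      # so repeating a matching word does not increase the score.
      ref_counts = Counter(reference)
      candidate_counts = Counter(candidate)
      clipped = sum(min(count, ref_counts[word])
                    for word, count in candidate_counts.items())
      return clipped / len(candidate)

  print(modified_unigram_precision(reference, generated_1))  # 5/7 after clipping
  print(modified_unigram_precision(reference, generated_2))  # 2/3

  # sentence_bleu combines modified precisions for 1- to 4-grams with a
  # brevity penalty; smoothing avoids zero scores on such short sentences.
  smooth = SmoothingFunction().method1
  print(sentence_bleu([reference], generated_1, smoothing_function=smooth))
  print(sentence_bleu([reference], generated_2, smoothing_function=smooth))

Clipping reduces generated text 1's unigram precision from 7/7 to 5/7, because the repeated "very" and "enthusiastic" are each credited only once; sentence_bleu then combines the modified precisions for 1- to 4-grams with a brevity penalty that discourages overly short candidates.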

There are a few other metrics that we can use to evaluate the quality of a generated translation compared to one or more reference translations, such as Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and Metric for Evaluation of Translation with Explicit Ordering (METEOR).
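
As a rough sketch of how these metrics can be computed (assuming the third-party rouge-score package is installed, NLTK's WordNet data has been downloaded, and noting that exact call signatures vary slightly between library versions):

  from rouge_score import rouge_scorer
  from nltk.translate.meteor_score import meteor_score

  reference = "I am feeling very enthusiastic"
  generated = "I feel enthusiastic"

  # ROUGE-1 and ROUGE-L, reported as precision/recall/F1 against the reference.
  scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
  print(scorer.score(reference, generated))

  # METEOR expects tokenized input and needs the WordNet corpus:
  # nltk.download("wordnet")
  print(meteor_score([reference.split()], generated.split()))

ROUGE is recall-oriented and was originally designed for summarization, while METEOR also matches stems and synonyms and penalizes scrambled word order, which tends to correlate better with human judgment than BLEU.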
