Text Summarization Evaluation - BLEU vs ROUGE

傲寒 2021-01-31 15:54

With the results of two different summary systems (sys1 and sys2) and the same reference summaries, I evaluated them with both BLEU and ROUGE. The problem is: All ROUGE scores o

3 Answers
  •  傲寒 (OP) 2021-01-31 16:21

    ROUGE and BLEU are both sets of metrics applicable to the task of text summarization. BLEU was originally designed for machine translation, but it applies equally well to summarization.

    It is best to understand the concepts through an example. First, we need a candidate summary (the machine-generated summary), like this:

    the cat was found under the bed

    And the gold-standard summary (usually created by a human):

    the cat was under the bed

    Let's compute precision and recall for the unigram (single-word) case, using individual words as the units of comparison.

    The machine-generated summary has 7 words (mlsw = 7), the gold-standard summary has 6 words (gssw = 6), and the number of overlapping words is 6 (ow = 6).

    The recall for the machine-generated summary is ow / gssw = 6/6 = 1.00.
    The precision for the machine-generated summary is ow / mlsw = 6/7 ≈ 0.86.
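
    A minimal Python sketch of this calculation (whitespace tokenization only, with "clipped" counts so a candidate word is not credited more often than it appears in the reference; the function name is just illustrative):

        from collections import Counter

        def unigram_precision_recall(candidate, reference):
            """Clipped unigram overlap, as in the worked example above."""
            cand_counts = Counter(candidate.split())
            ref_counts = Counter(reference.split())
            # Each candidate word counts at most as often as it appears in the
            # reference (the "clipping" used by both ROUGE-1 and BLEU-1).
            overlap = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
            precision = overlap / sum(cand_counts.values())   # ow / mlsw
            recall = overlap / sum(ref_counts.values())       # ow / gssw
            return precision, recall

        candidate = "the cat was found under the bed"
        reference = "the cat was under the bed"
        p, r = unigram_precision_recall(candidate, reference)
        print(f"precision = {p:.2f}, recall = {r:.2f}")   # precision = 0.86, recall = 1.00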

    Similarly, we can compute precision and recall over bigrams, trigrams, and longer n-grams; see the sketch below.
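
    The same clipped-overlap idea extends to n-grams; a small, hypothetical extension of the snippet above (it reuses Counter, candidate, and reference from there):

        def ngrams(tokens, n):
            """All contiguous n-grams of a token list."""
            return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

        def ngram_precision_recall(candidate, reference, n):
            """Clipped n-gram overlap, generalizing the unigram case."""
            cand = Counter(ngrams(candidate.split(), n))
            ref = Counter(ngrams(reference.split(), n))
            overlap = sum(min(c, ref[g]) for g, c in cand.items())
            return overlap / sum(cand.values()), overlap / sum(ref.values())

        print(ngram_precision_recall(candidate, reference, 2))   # bigrams: 4/6 ≈ 0.67, 4/5 = 0.80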

    ROUGE uses both recall and precision, and also reports the F1 score, which is the harmonic mean of the two.
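
    For the example pair above, the ROUGE-1 F1 would come out as:

        def f1(precision, recall):
            """Harmonic mean of precision and recall, as ROUGE reports it."""
            if precision + recall == 0:
                return 0.0
            return 2 * precision * recall / (precision + recall)

        print(f"ROUGE-1 F1 = {f1(p, r):.2f}")   # 2 * 0.86 * 1.00 / 1.86 ≈ 0.92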

    BLEU is also built on precision, but it combines the n-gram precisions with a geometric mean and applies a brevity penalty, which stands in for recall by penalizing candidates that are shorter than the reference.
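
    A toy, single-reference version of that recipe (no smoothing, reusing ngram_precision_recall from above; production implementations also handle multiple references and smoothing) might look like:

        import math

        def toy_bleu(candidate, reference, max_n=4):
            """Geometric mean of clipped n-gram precisions times a brevity penalty."""
            precisions = [ngram_precision_recall(candidate, reference, n)[0]
                          for n in range(1, max_n + 1)]
            if min(precisions) == 0:
                return 0.0   # real BLEU smooths zero counts instead of returning 0
            geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
            c_len, r_len = len(candidate.split()), len(reference.split())
            bp = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)   # brevity penalty
            return bp * geo_mean

        print(f"BLEU-2 = {toy_bleu(candidate, reference, max_n=2):.2f}")   # ≈ 0.76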

    Subtle differences, but it is important to note that both rest on n-gram precision and recall.
