With the results of two different summary systems (sys1 and sys2) and the same reference summaries, I evaluated them with both BLEU and ROUGE. The problem is: All ROUGE scores o
Both ROUGE and BLEU are based on n-gram to measure the similar between the summaries of systems and the summaries of human. So why there are differences in results of evaluation like that? And what's the main different of ROUGE vs BLEU to explain this issue?
There exist both the ROUGE-n precision and the ROUGE-n precision recall. the original ROUGE implementation from the paper that introduced ROUGE {3} computes both, as well as the resulting F1-score.
From http://text-analytics101.rxnlp.com/2017/01/how-rouge-works-for-evaluation-of.html (mirror):
ROUGE recall:
ROUGE precision:
(The original ROUGE implementation from the paper that introduced ROUGE {1} may perform a few more things such as stemming.)
The ROUGE-n precision and recall are easy to interpret, unlike BLEU (see Interpreting ROUGE scores).
The difference between the ROUGE-n precision and BLEU is that BLEU introduces a brevity penalty term, and also compute the n-gram match for several size of n-grams (unlike the ROUGE-n, where there is only one chosen n-gram size). Stack Overflow does not support LaTeX so I won't go into more formulas to compare against BLEU. {2} explains BLEU clearly.
References: