News clustering

深忆病人 2021-01-30 18:34

How do Google News and Techmeme cluster similar news items? Are there any well-known algorithms used to achieve this?

Appreciate your help.

Thanks

3 answers
  •  梦如初夏
    2021-01-30 19:19

    The algorithmic basis is agglomerative clustering or something similar, with a number of heuristics layered on top. For example, the vector space is surely composed of words and phrases (word n-grams). Limiting the search to a strict time window is also very important. Identifying names and weighting the title and paragraph headings more heavily are also key parts.
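To make the ideas above concrete, here is a minimal sketch in plain Python. It is not how Google News or Techmeme actually work; it only illustrates average-link agglomerative clustering over word unigrams and bigrams, with title terms weighted more heavily and merges restricted to a time window. The function names, the `title_weight` of 3, the similarity `threshold` of 0.3, and the two-day `window` are all assumptions chosen for illustration. Name identification is left out.

```python
import math
from collections import Counter
from datetime import datetime, timedelta


def ngram_vector(title, body, title_weight=3.0):
    """Bag of word unigrams and bigrams; title terms count title_weight times."""
    vec = Counter()
    for text, weight in ((title, title_weight), (body, 1.0)):
        words = text.lower().split()
        grams = words + [" ".join(pair) for pair in zip(words, words[1:])]
        for gram in grams:
            vec[gram] += weight
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_news(articles, threshold=0.3, window=timedelta(days=2)):
    """Cluster (title, body, published) tuples; returns lists of article indices."""
    vecs = [ngram_vector(title, body) for title, body, _ in articles]
    times = [published for _, _, published in articles]
    clusters = [[i] for i in range(len(articles))]

    def similarity(c1, c2):
        # Heuristic: never merge clusters whose articles span more than `window`.
        stamps = [times[i] for i in c1 + c2]
        if max(stamps) - min(stamps) > window:
            return 0.0
        # Average linkage: mean pairwise similarity across the two clusters.
        sims = [cosine(vecs[i], vecs[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best, pair = threshold, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = similarity(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if pair is None:  # no remaining pair is similar enough
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]
```

The pairwise search is quadratic per merge, so this is only suitable for small batches; a production system would likely use TF-IDF weighting and an index to limit the candidate pairs.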

    On a tangentially related note, if you are interested in finding near-duplicate articles, there are a number of easier-to-implement approaches, such as the one described here
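The link in the answer did not survive, so the approach it pointed to is unknown. As one commonly used and simple option (an assumption, not necessarily the linked method), near-duplicates can be detected by comparing sets of word shingles with Jaccard similarity. The function names, the shingle size `k=3`, and the `threshold` of 0.5 below are illustrative choices.

```python
def shingles(text, k=3):
    """Set of overlapping k-word sequences (shingles) from the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, k=3, threshold=0.5):
    """True when the two texts share at least `threshold` of their shingles."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold
```

For example, "the quick brown fox jumps over the lazy dog" and the same sentence ending in "cat" share 6 of 8 distinct shingles, giving a similarity of 0.75. For large collections, the exact comparison is usually replaced by an approximation such as MinHash.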
