Building or Finding a “relevant terms” suggestion feature

谎友^ 2021-02-02 03:12

Given a few words of input, I want a utility that returns a diverse set of relevant terms, phrases, or concepts. A caveat is that it would need to have a large grap

3 answers
  •  -上瘾入骨i
    2021-02-02 03:51

    Peter Norvig (director of research at Google) spoke about how they approach this, specifically mentioning Google Sets, in a Facebook Tech Talk. The idea is that a relatively simple algorithm on a huge dataset (e.g. the entire web) is much better than a complicated algorithm on a small dataset.

    You could look at Google's n-gram collection as a starting point. From it you'd start to see which concepts tend to be grouped together. Norvig hinted that internally Google has up to 7-grams for use in things like Google Translate.

    If you're more ambitious, you could download all of Wikipedia's articles in the language you desire and create your own n-gram database.

    The problem is even more complicated if you just have a single word; check out this recent thesis for more details on word sense disambiguation.

    It's not an easy problem, but it is a useful one, as you mentioned. In the end, I think you'll find that a really successful implementation will have a relatively simple algorithm and a whole lot of data.
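
To make the "simple algorithm, lots of data" idea from the answer above a bit more concrete, here is a minimal sketch in Python. It is not how Google Sets or Google's n-gram collection work internally; it only assumes you have some corpus as a list of strings (for example, text extracted from Wikipedia articles) and counts which words share an n-gram window. The names `tokenize`, `build_cooccurrence`, and `suggest` are made up for this illustration.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_cooccurrence(docs, n=3):
    """Count how often two distinct words fall inside the same n-gram window.

    Windows overlap, so words that sit close together are counted more than once.
    Documents shorter than n tokens contribute nothing.
    """
    cooc = defaultdict(Counter)
    for doc in docs:
        tokens = tokenize(doc)
        for i in range(len(tokens) - n + 1):
            window = set(tokens[i:i + n])
            for a in window:
                for b in window:
                    if a != b:
                        cooc[a][b] += 1
    return cooc


def suggest(cooc, query, k=5, stopwords=frozenset()):
    """Return up to k terms that co-occur most often with the query words."""
    query_terms = set(tokenize(query))
    scores = Counter()
    for term in query_terms:
        for other, count in cooc.get(term, {}).items():
            if other not in query_terms and other not in stopwords:
                scores[other] += count
    return [term for term, _ in scores.most_common(k)]


if __name__ == "__main__":
    # Toy corpus; a real one would be many orders of magnitude larger.
    corpus = [
        "apple banana cherry",
        "apple banana date",
    ]
    table = build_cooccurrence(corpus, n=3)
    print(suggest(table, "apple", k=3))
```

On a toy corpus this is only a demonstration. With raw counts, frequent function words will dominate the suggestions, so in practice you would pass a stopword set or swap the raw counts for a normalized score. Ambiguous single-word queries are not handled at all here, which is the disambiguation problem the answer mentions.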
