Building or Finding a “relevant terms” suggestion feature

谎友^ 2021-02-02 03:12

Given a few words of input, I want to have a utility that will return a diverse set of relevant terms, phrases, or concepts. A caveat is that it would need to have a large grap

3 Answers
  • 2021-02-02 03:42

    You might be interested in WordNet. It takes a bit of linguistic knowledge to understand the API, but basically the system is a database of meaning-based links between English words, which is more or less what you're searching for. I'm sure I can dig up more information if you want it.

  • 2021-02-02 03:47

    Take a look at the following two papers:

    • Clustering User Queries of a Search Engine [pdf]
    • Topic Detection by Clustering Keywords [pdf]

    Here is my attempt at a very simplified explanation:

    If we have a database of past user queries, we can define a similarity function between two queries. For example: number of words in common. Now for each query in our database, we compute its similarity with each other query, and remember the k most similar queries. The non-overlapping words from these can be returned as "related terms".
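A minimal sketch of the query-similarity idea above, assuming the past queries fit in a plain Python list. The function name `related_terms` and the parameter `k` are illustrative, and instead of precomputing neighbours for every stored query it scores the stored queries against a single incoming one:

```python
from collections import Counter


def related_terms(query, past_queries, k=3):
    """Suggest terms from the k past queries sharing the most words with `query`."""
    query_words = set(query.lower().split())
    scored = []
    for past in past_queries:
        past_words = set(past.lower().split())
        # Similarity function: number of words in common.
        overlap = len(query_words & past_words)
        if overlap > 0 and past_words != query_words:
            scored.append((overlap, past_words))
    # Keep only the k most similar past queries.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    suggestions = Counter()
    for _, past_words in scored[:k]:
        # The non-overlapping words become the "related terms".
        suggestions.update(past_words - query_words)
    return [term for term, _ in suggestions.most_common()]


if __name__ == "__main__":
    past = ["boston celtics schedule", "celtics lakers rivalry", "chocolate cake recipe"]
    print(related_terms("celtics tickets", past, k=2))
```

Word overlap is the crudest possible similarity function; it is used here only because it is the example given above, and it could be swapped for something normalised by query length.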

    We can also take this approach with a database of documents containing information users might be searching for. We can define the similarity between two search terms as the number of documents containing both divided by the number of documents containing either. To decide which terms to test, we can scan the documents and throw out words that are either too common ('and', 'the', etc.) or that are too obscure.
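The document-based similarity described above (documents containing both terms divided by documents containing either) is the Jaccard index over sets of documents. A rough sketch, assuming the documents are short strings held in memory; `build_index`, `candidate_terms`, `term_similarity`, and the thresholds `min_docs` and `max_fraction` are hypothetical names and values chosen for illustration:

```python
def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in enumerate(documents):
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return index


def candidate_terms(index, n_docs, min_docs=2, max_fraction=0.5):
    """Drop words that are too obscure (few documents) or too common (most documents)."""
    return {
        word
        for word, docs in index.items()
        if min_docs <= len(docs) <= max_fraction * n_docs
    }


def term_similarity(index, a, b):
    """Documents containing both terms divided by documents containing either."""
    docs_a = index.get(a, set())
    docs_b = index.get(b, set())
    either = docs_a | docs_b
    if not either:
        return 0.0
    return len(docs_a & docs_b) / len(either)


if __name__ == "__main__":
    docs = [
        "the celtics beat the lakers",
        "the lakers lost to the celtics",
        "the cake recipe uses butter",
        "the celtics won again",
    ]
    index = build_index(docs)
    print(candidate_terms(index, len(docs), max_fraction=0.75))
    print(term_similarity(index, "celtics", "lakers"))
```

Filtering by document frequency stands in for a stop-word list here; a real corpus would likely need tuned thresholds.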

    If our data permits, we could instead look at which queries led users to choose which results, rather than comparing documents by content. For example, if our data showed that users searching for "Celtics" and users searching for "Lakers" both ended up clicking on espn.com, then we could call these related terms.
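A sketch of the click-based variant, assuming the log is available as `(query, clicked_url)` pairs. That log format and the function name `related_by_clicks` are assumptions made for illustration, not something the answer specifies:

```python
from collections import defaultdict


def related_by_clicks(click_log, query):
    """Rank other queries by how many clicked URLs they share with `query`."""
    urls_by_query = defaultdict(set)
    for logged_query, url in click_log:
        urls_by_query[logged_query].add(url)

    target_urls = urls_by_query.get(query, set())
    scores = {}
    for other, urls in urls_by_query.items():
        if other == query:
            continue
        shared = len(target_urls & urls)
        if shared:
            scores[other] = shared
    # Most shared URLs first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda q: (-scores[q], q))


if __name__ == "__main__":
    log = [
        ("Celtics", "espn.com"),
        ("Lakers", "espn.com"),
        ("Lakers", "nba.com"),
        ("cake", "recipes.example"),
    ]
    print(related_by_clicks(log, "Celtics"))
```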

    If you're starting from scratch with no data about past user queries, then you can try Wikipedia, or the Bag of Words dataset as a database of documents. If you are looking for a database of user search terms and results, and if you are feeling adventurous, then you can take a look at the AOL Search Data.

  • 2021-02-02 03:51

    Peter Norvig (director of research at Google) spoke about how they do this at Google (specifically mentioning Google Sets) in a Facebook Tech Talk. The idea is that a relatively simple algorithm on a huge dataset (e.g. the entire web) works much better than a complicated algorithm on a small dataset.

    You could look at Google's n-gram collection as a starting point. You'd start to see what concepts are grouped together. Norvig hinted that internally Google has up to 7-grams for use in things like Google Translate.

    If you're more ambitious, you could download all of Wikipedia's articles in the language you desire and create your own n-gram database.
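To give a feel for what a home-grown n-gram database might look like, here is a small sketch that counts n-grams with `collections.Counter` and then lists the words that share an n-gram with a given term. The tokenising regex and the helper names `ngram_counts` and `neighbours` are illustrative assumptions; a full Wikipedia dump would need streaming and on-disk storage rather than an in-memory counter:

```python
import re
from collections import Counter


def ngram_counts(texts, n=2):
    """Count every run of n consecutive tokens across the given texts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def neighbours(counts, word):
    """Score the words that appear in the same n-gram as `word`."""
    scores = Counter()
    for gram, count in counts.items():
        if word in gram:
            for other in gram:
                if other != word:
                    scores[other] += count
    return scores


if __name__ == "__main__":
    texts = ["New York City", "New York Times", "York Minster"]
    counts = ngram_counts(texts, n=2)
    print(neighbours(counts, "york").most_common(3))
```

Raw co-occurrence counts favour frequent words, so with real data some normalisation would probably be needed before the neighbours are useful as suggestions.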

    The problem is even more complicated if you just have a single word; check out this recent thesis for more details on word sense disambiguation.

    It's not an easy problem, but it is useful as you mentioned. In the end, I think you'll find that a really successful implementation will have a relatively simple algorithm and a whole lot of data.
