Algorithm to find related words in a text

后悔当初 2021-02-03 13:35

I would like to have a word (e.g. "Apple") and process a text (or maybe more). I'd like to come up with related terms. For example: process a document for Apple and find that i

5 answers
  •  野的像风
    2021-02-03 14:13

    As a starting point: your question relates to text mining.

    There are two broad ways to tackle it: a statistical approach, and one from natural language processing (NLP).

    I do not know much about NLP, but I can say something about the statistical approach:

    1. You need a vector space representation of your documents. See http://en.wikipedia.org/wiki/Vector_space_model (vector space model), http://en.wikipedia.org/wiki/Document-term_matrix (document-term matrix) and http://en.wikipedia.org/wiki/Tf%E2%80%93idf (tf-idf).

    2. To learn semantics, that is, that different words can mean the same thing and that one word can have different meanings, you need a large text corpus to learn from. As I said, this is a statistical approach, so you need lots of samples. See http://www.daviddlewis.com/resources/testcollections/

      If you already have lots of documents from the context you are going to work in, that is the best situation.

    3. You have to extract latent factors from this corpus. The most common methods are:

      • LSA (http://en.wikipedia.org/wiki/Latent_semantic_analysis)
      • PLSA (http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis)
      • Non-negative matrix factorization, NMF (http://en.wikipedia.org/wiki/Non-negative_matrix_factorization)
      • Latent Dirichlet allocation, LDA (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)

      These methods involve lots of math. Either you dig into it yourself, or you find good libraries.
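
To make step 1 concrete, here is a minimal sketch in plain Python. It is not one of the latent factor methods from step 3: it only builds a document-term matrix of raw counts and ranks terms by the cosine similarity of their rows, so words that occur in the same documents score as related. The function names (tokenize, term_vectors, cosine, related_terms) and the crude regex tokenizer are assumptions made for illustration. Raw counts are used instead of tf-idf because multiplying a term's whole vector by one idf weight would not change its cosine similarity to other terms.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters (a crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def term_vectors(documents):
    """Return {term: [count in doc 0, count in doc 1, ...]}.

    Each term is a row of the document-term matrix; dimension i of the
    vector is how often the term occurs in document i.
    """
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = set().union(*counts) if counts else set()
    return {term: [c[term] for c in counts] for term in vocabulary}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_terms(word, documents, top_n=5):
    """Rank the other terms by how similar their document distribution is to `word`."""
    vectors = term_vectors(documents)
    word = word.lower()
    target = vectors.get(word)
    if target is None:
        return []  # the word never occurs in the corpus
    scored = [(cosine(target, vec), term)
              for term, vec in vectors.items() if term != word]
    # Highest similarity first; ties are broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(term, round(score, 3)) for score, term in scored[:top_n]]


if __name__ == "__main__":
    docs = [
        "Apple released a new iPhone and a new iPad",
        "The iPhone is made by Apple",
        "Bananas and oranges are fruit",
        "Oranges are rich in vitamin C",
    ]
    print(related_terms("Apple", docs, top_n=3))
```

On this toy corpus "iphone" ranks first for "Apple", because it occurs in exactly the same documents. With only a handful of documents, stop words such as "the" also score highly, which is one reason the large corpus from step 2 and the factorization methods from step 3 matter in practice.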

    I can recommend the following books:

    • http://www.oreilly.de/catalog/9780596529321/toc.html
    • http://www.oreilly.de/catalog/9780596516499/index.html
