How do I compare phrases for similarity?

无人共我 2021-02-03 10:36

When you enter a question, Stack Overflow presents you with a list of questions that it thinks are likely to cover the same topic. I have seen similar features on other sites or in ot

4 Answers
  •  迷失自我
    2021-02-03 10:49

    One approach is the so-called bag-of-words model.

    As you guessed, first you count how many times each word appears in the text (usually called a document in NLP lingo). Then you throw out the so-called stop words, such as "the", "a", "or" and so on.
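As a rough illustration of the counting and stop-word step, here is a minimal Python sketch. The `STOP_WORDS` set, the tokenizing regular expression, and the function name `bag_of_words` are assumptions made for this example; real stop-word lists are much longer and tokenization is usually more careful.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "or", "and", "of", "is", "to", "in"}

def bag_of_words(text):
    """Count word occurrences in text, then drop the stop words."""
    # Lowercase and keep runs of letters, digits, apostrophes and hyphens.
    words = re.findall(r"[a-z0-9'-]+", text.lower())
    counts = Counter(words)
    for stop in STOP_WORDS:
        counts.pop(stop, None)  # no error if the stop word is absent
    return counts

bag = bag_of_words("The aardvark ate an apple or the other aardvark")
print(bag)
```

What remains in `bag` is exactly the "words and word counts" that the rest of the approach works with.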

    You're left with words and word counts. Repeat this over enough documents and you end up with a comprehensive set of the words that appear in them. You can then create an index for these words: "aardvark" is 1, "apple" is 2, ..., "z-index" is 70092.
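Building the word index could look like the sketch below. It assumes each document has already been reduced to a word-to-count mapping, and it numbers the words alphabetically starting at 1 to match the numbering above; `build_index` is a hypothetical helper name.

```python
def build_index(bags):
    """Map every distinct word across all bags to a position, starting at 1."""
    vocabulary = set()
    for bag in bags:
        vocabulary.update(bag)  # adds the bag's words (its dict keys)
    # Alphabetical order gives a stable, reproducible numbering.
    return {word: i for i, word in enumerate(sorted(vocabulary), start=1)}

# Three toy documents, already reduced to word counts.
bags = [{"aardvark": 2}, {"apple": 1, "z-index": 3}, {"apple": 4}]
index = build_index(bags)
print(index)
```

Any fixed ordering works, as long as every document is converted with the same index.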

    Now you can take your word bags and turn them into vectors. For example, if your document contains two occurrences of "aardvark" and nothing else, its vector would look like this:

    [2 0 0 ... 70k zeroes ... 0].
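A minimal sketch of this conversion, assuming a 1-based word index like the one described earlier (the three-word `index` here is a toy stand-in for the full vocabulary, and `to_vector` is a hypothetical helper name):

```python
def to_vector(bag, index):
    """Turn a word-count bag into a dense count vector; index positions start at 1."""
    vector = [0] * len(index)
    for word, count in bag.items():
        if word in index:  # ignore words missing from the vocabulary
            vector[index[word] - 1] = count
    return vector

index = {"aardvark": 1, "apple": 2, "z-index": 3}
vector = to_vector({"aardvark": 2}, index)
print(vector)
```

With a vocabulary of tens of thousands of words most entries are zero, so in practice a sparse representation is often preferred over a plain list.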
    

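    After this you can compute the "angle" between the two vectors using a dot product: dividing the dot product by the product of the two vectors' lengths gives the cosine of the angle. The smaller the angle, the closer the documents are.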
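The angle comparison is commonly expressed as cosine similarity. The sketch below is one plain-Python way to compute it; returning 0.0 for an all-zero vector is a choice made for this example, since the angle is undefined in that case.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: an empty document matches nothing
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([2, 0, 0], [4, 0, 0])
orthogonal = cosine_similarity([2, 0, 0], [0, 1, 0])
print(same_direction, orthogonal)
```

For word-count vectors, whose entries are never negative, the value lies between 0 and 1. A value near 1 means a small angle (similar documents), and 0 means the documents share no indexed words.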

    This is a simple version and there are other, more advanced techniques. May the Wikipedia be with you.
