How do I compare phrases for similarity?


无人共我 2021-02-03 10:36

When entering a question, stackoverflow presents you with a list of questions that it thinks likely to cover the same topic. I have seen similar features on other sites or in ot

4 answers

迷失自我 (original poster)

2021-02-03 10:49
One approach is the so-called bag-of-words model.

As you guessed, first you count how many times each word appears in the text (usually called a document in NLP lingo). Then you throw out the so-called stop words, such as "the", "a", "or" and so on.
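As a rough sketch of that counting step in Python: the stop-word set and the `bag_of_words` helper below are made up for illustration, and a real stop-word list would be much longer.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "an", "or", "and", "of", "to", "is", "in"}

def bag_of_words(text):
    """Count lowercase words in text, skipping stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

print(bag_of_words("The aardvark and the apple"))
```

The regular expression is a deliberately crude tokenizer; it is only there to keep the example self-contained.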

You're left with words and word counts. Do this for a while and you get a comprehensive set of words that appear in your documents. You can then create an index for these words: "aardvark" is 1, "apple" is 2, ..., "z-index" is 70092.
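Building the word index could look something like this. `build_index` is a hypothetical helper, and it numbers words from 0 rather than 1 because that is what Python lists expect.

```python
def build_index(bags):
    """Map every word seen in any bag to a position, in alphabetical order."""
    vocabulary = sorted(set().union(*bags))
    return {word: i for i, word in enumerate(vocabulary)}

# Two toy word bags (word -> count), invented for this example.
bags = [{"aardvark": 2}, {"apple": 1, "z-index": 3}]
index = build_index(bags)
print(index)  # {'aardvark': 0, 'apple': 1, 'z-index': 2}
```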

Now you can take your word bags and turn them into vectors. For example, if your document contains two references to aardvarks and nothing else, it would look like this:
```
[2 0 0 ... 70k zeroes ... 0].
```
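A minimal sketch of that conversion, assuming a tiny three-word index instead of the 70k-word one (the `to_vector` name is just for illustration):

```python
def to_vector(bag, index):
    """Turn a word bag (word -> count) into a list of counts ordered by index."""
    vector = [0] * len(index)
    for word, count in bag.items():
        if word in index:  # words missing from the index are ignored
            vector[index[word]] = count
    return vector

# Toy index, assumed for this example.
index = {"aardvark": 0, "apple": 1, "z-index": 2}
print(to_vector({"aardvark": 2}, index))  # [2, 0, 0]
```

With a realistic vocabulary most entries are zero, so in practice you would probably store these as sparse vectors rather than plain lists.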
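After this you can compute the "angle" between two document vectors with a dot product: dividing the dot product by the product of the two vector lengths gives the cosine of the angle. The smaller the angle (the closer the cosine is to 1), the closer the documents are.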
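In code, the angle comparison usually ends up as cosine similarity. The following is a plain-Python sketch with no libraries; returning 0.0 for an all-zero vector is my own convention, not part of the model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: an empty document matches nothing
    return dot / (norm_u * norm_v)

print(cosine_similarity([2, 0, 0], [1, 0, 0]))  # same direction
print(cosine_similarity([2, 0, 0], [0, 3, 0]))  # no words in common
```

Because word counts are never negative, the result falls between 0 (no shared words) and 1 (same word proportions), so you can rank candidate documents by it directly.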

This is a simple version and there are other, more advanced techniques. May the Wikipedia be with you.