Algorithm to detect similar documents in python script [closed]


时光说笑

10 Answers

鱼传尺愫

2020-12-24 04:30

If you're prepared to index the files that you want to search amongst, Xapian is an excellent engine, and provides Python bindings:

http://xapian.org/

http://xapian.org/docs/bindings/python/


南笙

2020-12-24 04:34

Bayesian filters have exactly this purpose. That's the technology you'll find in most tools that identify spam.

For example, to detect a language (from http://sebsauvage.net/python/snyppets/#bayesian):

from reverend.thomas import Bayes
guesser = Bayes()
guesser.train('french','La souris est rentrée dans son trou.')
guesser.train('english','my tailor is rich.')
guesser.train('french','Je ne sais pas si je viendrai demain.')
guesser.train('english','I do not plan to update my website soon.')

>>> print guesser.guess('Jumping out of cliffs it not a good idea.')
[('english', 0.99990000000000001), ('french', 9.9999999999988987e-005)]

>>> print guesser.guess('Demain il fera très probablement chaud.')
[('french', 0.99990000000000001), ('english', 9.9999999999988987e-005)]

It can detect any category you train it for, though: technical text, songs, jokes, etc., as long as you can provide enough material for the tool to learn what your documents look like.


日久生厌

2020-12-24 04:37

You need to make your question more concrete. If you've already read the fingerprinting papers, you already know the principles at work, so describing common approaches here would not be beneficial. If you haven't, you should also check out the papers on "duplicate detection" and the various papers on web spam detection that have come out of Stanford, Google, Yahoo, and Microsoft in recent years.

Are you having specific problems with coding the described algorithms?

Trouble getting started?

The first thing I'd probably do is separate the tokenization (the process of extracting "words" or other sensible sequences) from the duplicate detection logic, so that it is easy to plug in different parsers for different languages and keep the duplicate detection piece the same.
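As a rough illustration of that separation, here is a minimal sketch in which the tokenizer is just a function passed into the comparison code. The names (simple_word_tokenizer, shingles, jaccard, similarity) and the choice of k-token shingles with Jaccard overlap are assumptions made for this example, not something taken from the fingerprinting papers.

```python
import re
from typing import Callable, List, Set, Tuple

# A tokenizer is any function that turns text into a list of tokens.
Tokenizer = Callable[[str], List[str]]


def simple_word_tokenizer(text: str) -> List[str]:
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())


def shingles(tokens: List[str], k: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of k-token windows (shingles) found in the token list."""
    if len(tokens) < k:
        # Too short for a full window: treat the whole text as one shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(doc_a: str, doc_b: str,
               tokenizer: Tokenizer = simple_word_tokenizer,
               k: int = 3) -> float:
    """Compare two documents; only the tokenizer knows about the language."""
    return jaccard(shingles(tokenizer(doc_a), k), shingles(tokenizer(doc_b), k))
```

To support another language, you would pass a different tokenizer (for example, one that splits on characters instead of words) and leave shingles, jaccard and similarity untouched.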

清酒与你

2020-12-24 04:37

If you are trying to detect documents that discuss the same topic, you could collect the most frequently used words in each and throw away the stop words. Documents with a similar distribution of their most frequent words are probably about similar things. You may need to apply stemming and extend the idea to n-grams if you want higher accuracy. For more advanced techniques, look into machine learning.
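A minimal sketch of the word-frequency idea, using only the standard library. The tiny STOP_WORDS set, the cutoff of 20 terms, and the use of cosine similarity to compare the two distributions are all assumptions for illustration; stemming and n-grams are left out.

```python
import math
import re
from collections import Counter

# A deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = frozenset(
    {"a", "an", "and", "are", "in", "is", "it", "of", "on", "the", "to"}
)


def top_terms(text, n=20, stop_words=STOP_WORDS):
    """Count the words in text, drop stop words, keep the n most frequent."""
    words = (w for w in re.findall(r"[a-z']+", text.lower())
             if w not in stop_words)
    return Counter(dict(Counter(words).most_common(n)))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Calling cosine_similarity(top_terms(doc_a), top_terms(doc_b)) gives a score near 1 for documents that share their most frequent words and 0 for documents with none in common; where to put the threshold is something you would have to tune on your own data.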
