What are some good methods to find the “relatedness” of two bodies of text?


Here's the problem: I have a few thousand small text snippets, anywhere from a few words to a few sentences; the largest snippet is about 2k on disk. I want to be able to c

7 Answers

清酒与你

2021-02-02 04:04

See the Manning and Raghavan course notes on MinHashing and searching for similar items, and a C#(?) version. I believe the techniques come from Ullman and Motwani's research.
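For a rough idea of what MinHashing does, here is a minimal sketch in Python. It is not taken from the course notes mentioned above; the function names (`shingles`, `minhash_signature`, `estimated_jaccard`), the word-shingle size, and the choice of salted BLAKE2b as the hash family are all assumptions made for illustration.

```python
import hashlib


def shingles(text, k=2):
    """Return the set of k-word shingles in text (k=2 is an arbitrary choice)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, item):
    """One member of a family of 64-bit hash functions, selected by seed."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=128):
    """For each hash function, keep the smallest hash value over all items."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


if __name__ == "__main__":
    a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
    b = minhash_signature(shingles("the quick brown fox leaps over the lazy dog"))
    print(estimated_jaccard(a, b))
```

With a few thousand snippets you would compute each signature once and compare signatures rather than the full texts. More hash functions give a tighter estimate at the cost of more work per snippet.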

你的背包

2021-02-02 04:12

I've never used it, but you might want to look into Levenshtein distance.
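For reference, a minimal sketch of Levenshtein distance in Python, using the standard two-row dynamic programming formulation. The `similarity` helper that scales the distance into the range 0 to 1 is an assumption added here for comparing snippets of different lengths; it is not part of the distance definition.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Hypothetical helper: 1.0 for identical strings, 0.0 for maximally different."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # 3
```

Note that this runs in time proportional to the product of the two lengths, so comparing every pair among a few thousand snippets of up to 2k each may be slow.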

囚心锁ツ

2021-02-02 04:12

Phonetic algorithms

The article Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server shows how to install the SimMetrics library into SQL Server and use it. This library lets you find the relative similarity between strings and includes numerous algorithms.

I ended up mostly using Jaro-Winkler to match on names. Here's more information from when I asked about matching names on SO: Matching records based on Person Name

A few algorithms based on Levenshtein distance are also available in the SimMetrics library and would probably be useful in your application.

情深已故

2021-02-02 04:18

This book may be relevant.

Edit: here is a related SO question.

情深已故

2021-02-02 04:28

Jeff talked about something like this on the podcast (podcast 32): finding the Related questions listed on the right side here.

One big tip was to remove all common words, such as "the", "and", and "this". This will leave you with more meaningful words to compare.

And here is a similar question: Is there an algorithm that tells the semantic similarity of two phrases
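The stop-word tip can be sketched in Python as below. This is only an illustration of the idea, not what the site actually does: the tiny `STOP_WORDS` list, the tokenizing regex, and the use of Jaccard overlap to score the remaining words are all assumptions.

```python
import re

# A deliberately tiny, illustrative stop-word list; a real one would be longer.
STOP_WORDS = frozenset({
    "a", "an", "and", "are", "in", "is", "it", "of", "on",
    "that", "the", "this", "to",
})


def content_words(text):
    """Lowercase the text, split it into words, and drop the stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {word for word in words if word not in STOP_WORDS}


def jaccard(text_a, text_b):
    """Shared content words divided by all content words in either text."""
    words_a = content_words(text_a)
    words_b = content_words(text_b)
    if not words_a and not words_b:
        return 0.0  # nothing meaningful to compare
    return len(words_a & words_b) / len(words_a | words_b)


if __name__ == "__main__":
    print(content_words("The cat and the hat"))
    print(jaccard("the red car", "this blue car"))
```

Without the filtering step, two unrelated snippets would still share words like "the" and "and", which inflates the score.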

情书的邮戳

2021-02-02 04:29

These articles on semantic relatedness and semantic similarity may be helpful, as may this SO question about Latent Semantic Analysis.

You could also look into Soundex for words that "sound alike" phonetically.
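To show what Soundex produces, here is a minimal sketch of the common American Soundex rules in Python. The function name and the handling of empty input are assumptions; it works on single words, so for text snippets you would apply it word by word.

```python
# Letter groups and their Soundex digits; vowels, Y, H, and W have no digit.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _letter in _letters:
        _CODES[_letter] = _digit


def soundex(word):
    """Return the 4-character Soundex code of a word, or "" if it has no letters."""
    letters = [ch for ch in word.upper() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W are skipped and do not separate equal codes
        code = _CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeated code is kept
    return (result + "000")[:4]


if __name__ == "__main__":
    print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Words that sound alike map to the same code, so comparing codes instead of spellings tolerates some misspellings.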
