Question
For illustration purposes, let's assume this is a forum service. I need to calculate the "similarity" among each user's posts, so that the result would be something like:
among posts by user A, similarity 60%
among posts by user B, similarity 20%
...
I'm dealing with multibyte strings, so I guess I'm stuck with search engines here. We already use Solr, already have moreLikeThis implemented, but I'm not quite sure how to construct the query. Any help appreciated!
Answer 1:
Possibly Carrot2 will interest you (and this blog related to it)
Answer 2:
This is a strange question in two ways: 1. Why do you have to deal with Solr? 2. The kind of similarity depends on the target problem. Your question sounds too generic to me. There is ongoing research in the area of semantic similarity. There is also the edit-distance algorithm, which is probably not what you want.
So, define your question more precisely and you will get better answers.
Answer 3:
There are several measures of similarity; a simple and effective one is cosine similarity. There are more sophisticated ones, such as Smith-Waterman.
Look at http://sourceforge.net/projects/simmetrics/
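To make the cosine similarity suggestion concrete, here is a minimal sketch in Python. It is not taken from simmetrics or Solr. It assumes that "similarity among a user's posts" can be read as the mean pairwise cosine similarity over character bigram counts, which is only one possible definition. Character n-grams are used because they need no tokenizer, which may help with multibyte text. The function names `char_ngrams`, `cosine`, and `mean_pairwise_similarity` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def char_ngrams(text, n=2):
    """Count overlapping character n-grams (assumption: no word tokenizer needed)."""
    text = "".join(text.split())  # drop all whitespace
    if len(text) < n:
        # Text shorter than n becomes a single feature, or nothing if empty.
        return Counter([text]) if text else Counter()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(posts, n=2):
    """Average cosine similarity over all pairs of posts by one user.

    Hypothetical definition: returns 0.0 when there are fewer than two posts.
    """
    vectors = [char_ngrams(p, n) for p in posts]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    posts_by_user = {
        "A": ["今日は天気がいい", "今日は天気が悪い", "明日は天気がいい"],
        "B": ["Solr の設定方法", "週末に映画を見た"],
    }
    for user, posts in posts_by_user.items():
        score = mean_pairwise_similarity(posts)
        print(f"among posts by user {user}, similarity {score:.0%}")
```

The pairwise loop is quadratic in the number of posts per user, so for users with many posts it may be more practical to sample pairs or to compare each post against a single summed vector instead.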
Source: https://stackoverflow.com/questions/6069922/measuring-similarity-between-document-sets