Computing similarity between two lists

失恋的感觉 2020-12-08 15:37

EDIT: as everyone is getting confused, I want to simplify my question. I have two ordered lists. Now, I just want to compute how similar one list is to the other.

Eg

7 Answers
  • 2020-12-08 15:52

    In addition to what has already been said, I would like to point you to the following excellent paper: W. Webber et al., A Similarity Measure for Indefinite Rankings (2010). Besides containing a good review of existing measures (such as the above-mentioned Kendall tau and Spearman's footrule), the authors propose an intuitively appealing probabilistic measure that is applicable to result lists of varying length and to cases where not all items occur in both lists. Roughly speaking, it is parameterized by a "persistence" probability p that a user scans item k+1 after having inspected item k (rather than abandoning). Rank-Biased Overlap (RBO) is the expected overlap ratio of results at the point the user stops reading.

    The implementation of RBO is slightly more involved; you can take a peek at an implementation in Apache Pig here.
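
    A minimal Python sketch of the truncated RBO sum, assuming duplicate-free ranked lists; it only accumulates the overlap down to the shorter list's depth rather than the extrapolated measure from the paper, and the function name and default p are illustrative:

    def rbo_truncated(list1, list2, p=0.9):
        # Rank-Biased Overlap, truncated at the evaluated depth:
        # (1 - p) * sum over depths d of p^(d-1) * |top-d(list1) & top-d(list2)| / d
        depth = min(len(list1), len(list2))
        score = 0.0
        for d in range(1, depth + 1):
            overlap = len(set(list1[:d]) & set(list2[:d]))
            score += (p ** (d - 1)) * overlap / d
        return (1 - p) * score

    # Example with two made-up lists; a higher p gives deeper ranks more weight.
    print(rbo_truncated(['a', 'b', 'c', 'd'], ['b', 'a', 'c', 'e']))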

    Another simple measure is cosine similarity: the cosine between two vectors whose dimensions correspond to items, with inverse ranks as weights. However, it does not gracefully handle items that occur in only one of the lists (see the implementation in the link above).

    1. For each item i in list 1, let h_1(i) = 1/rank_1(i). For each item i in list 2 not occurring in list 1, let h_1(i) = 0. Do the same for h_2 with respect to list 2.
    2. Compute v12 = sum_i h_1(i) * h_2(i); v11 = sum_i h_1(i) * h_1(i); v22 = sum_i h_2(i) * h_2(i)
    3. Return v12 / sqrt(v11 * v22)

    For your example, this gives a value of 0.7252747.
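
    A minimal Python sketch of steps 1-3 above; the lists below are made up for illustration (they are not the example from the question), and all names are illustrative:

    import math

    def inverse_rank_cosine(list1, list2):
        # Step 1: inverse-rank weights; items absent from a list get weight 0.
        h1 = {item: 1.0 / rank for rank, item in enumerate(list1, start=1)}
        h2 = {item: 1.0 / rank for rank, item in enumerate(list2, start=1)}
        items = set(h1) | set(h2)
        # Step 2: the three sums of products.
        v12 = sum(h1.get(i, 0.0) * h2.get(i, 0.0) for i in items)
        v11 = sum(h1.get(i, 0.0) ** 2 for i in items)
        v22 = sum(h2.get(i, 0.0) ** 2 for i in items)
        # Step 3: the cosine.
        return v12 / math.sqrt(v11 * v22)

    print(inverse_rank_cosine(['a', 'b', 'c', 'd'], ['b', 'a', 'c', 'e']))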

    Please let me also give you some practical advice beyond your immediate question. Unless your 'production system' baseline is perfect (or we are dealing with a gold set), it is almost always better to compare a quality measure (such as the above-mentioned nDCG) than a similarity measure; a new ranking will sometimes be better and sometimes worse than the baseline, and you want to know whether the former happens more often than the latter. Secondly, similarity measures are not trivial to interpret on an absolute scale. For example, if you get a similarity score of, say, 0.72, does that mean the rankings are really similar or significantly different? Similarity measures are more helpful for saying that, e.g., a new ranking method 1 is closer to production than another new ranking method 2.

  • 2020-12-08 15:56

    Is the list of documents exhaustive? That is, is every document rank-ordered by system 1 also rank-ordered by system 2? If so, Spearman's rho may serve your purposes. When the systems don't share the same documents, the big question is how to interpret the result. I don't think there is a measure that answers that question directly, although some may implement an implicit answer to it.
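
    A minimal sketch assuming both systems rank exactly the same documents, using scipy.stats.spearmanr on the rank each document receives in each system (the document IDs are made up):

    from scipy.stats import spearmanr

    system1 = ['d1', 'd2', 'd3', 'd4', 'd5']
    system2 = ['d2', 'd1', 'd3', 'd5', 'd4']

    # Rank of each document in system 2, listed in system 1's order.
    rank_in_2 = {doc: r for r, doc in enumerate(system2, start=1)}
    ranks1 = list(range(1, len(system1) + 1))
    ranks2 = [rank_in_2[doc] for doc in system1]

    rho, p_value = spearmanr(ranks1, ranks2)
    print(rho)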

  • 2020-12-08 16:00

    I suppose you are talking about comparing two Information Retrieval systems, which, trust me, is not trivial. It is a complex Computer Science problem.

    For measuring relevance or doing this kind of A/B testing you need a couple of things:

    1. A competitor against which to measure relevance. As you have two systems, this prerequisite is met.

    2. You need to manually rate the results. You can ask your colleagues to rate query/url pairs for popular queries, and for the holes (i.e., query/url pairs that have not been rated) you can use a dynamic ranking function based on a "Learning to Rank" algorithm (http://en.wikipedia.org/wiki/Learning_to_rank). Don't be surprised by that, but it's true (please read the Google/Bing example below).

    Google and Bing are competitors in the horizontal search market. These search engines employ manual judges around the world and invest millions in them to rate their results for queries. So for each query, generally the top 3 or top 5 query/url pairs are rated. Based on these ratings they may use a metric like NDCG (Normalized Discounted Cumulative Gain), which is one of the finest and most popular metrics.

    According to wikipedia:

    Discounted cumulative gain (DCG) is a measure of effectiveness of a Web search engine algorithm or related applications, often used in information retrieval. Using a graded relevance scale of documents in a search engine result set, DCG measures the usefulness, or gain, of a document based on its position in the result list. The gain is accumulated from the top of the result list to the bottom with the gain of each result discounted at lower ranks.

    Wikipedia explains NDCG well. It is a short article; please go through it.

  • 2020-12-08 16:01

    DCG [Discounted Cumulative Gain] and nDCG [normalized DCG] are usually good measures for ranked lists.

    They give the full gain for a relevant document if it is ranked first, and the gain decreases as the rank decreases.

    Using DCG/nDCG to evaluate your system against the state-of-the-art baseline:

    Note: If you set all results returned by the "state of the art" system as relevant, then your system is identical to the state of the art if the documents receive the same ranks, as measured by DCG/nDCG.

    Thus, a possible evaluation could be: DCG(your_system)/DCG(state_of_the_art_system)

    To further refine it, you can use a graded relevance [relevance will not be binary], determined by how each document was ranked by the state of the art. For example, rel_i = 1/log2(1+i) for the document at rank i in the state-of-the-art system.

    If the value received from this evaluation function is close to 1, your system is very similar to the baseline.

    Example:

    mySystem = [1,2,5,4,6,7]
    stateOfTheArt = [1,2,4,5,6,9]
    

    First you give a score to each document according to the state-of-the-art system [using the formula above]:

    doc1 = 1.0
    doc2 = 0.6309297535714574
    doc3 = 0.0
    doc4 = 0.5
    doc5 = 0.43067655807339306
    doc6 = 0.38685280723454163
    doc7 = 0
    doc8 = 0
    doc9 = 0.3562071871080222
    

    Now you calculate DCG(stateOfTheArt) using the relevance grades defined above [note that relevance is not binary here], and get DCG(stateOfTheArt) = 2.1100933062283396.
    Next, calculate it for your system using the same relevance weights and get DCG(mySystem) = 1.9784040064803783.

    Thus, the evaluation is DCG(mySystem)/DCG(stateOfTheArt) = 1.9784040064803783 / 2.1100933062283396 = 0.9375907693942939
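
    A minimal Python sketch of this evaluation (the function and variable names are illustrative); with a 1/log2(1 + rank) discount it reproduces the numbers above:

    import math

    def dcg(ranking, relevance):
        # DCG with a log2(1 + rank) discount for the document at each 1-based rank.
        return sum(relevance.get(doc, 0.0) / math.log2(1 + rank)
                   for rank, doc in enumerate(ranking, start=1))

    mySystem = [1, 2, 5, 4, 6, 7]
    stateOfTheArt = [1, 2, 4, 5, 6, 9]

    # Graded relevance taken from the baseline's own ranking: rel = 1 / log2(1 + rank).
    relevance = {doc: 1.0 / math.log2(1 + rank)
                 for rank, doc in enumerate(stateOfTheArt, start=1)}

    print(dcg(mySystem, relevance) / dcg(stateOfTheArt, relevance))  # ~0.9376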

  • 2020-12-08 16:01

    I actually know four different measures for that purpose.

    Three have already been mentioned:

    • NDCG
    • Kendall's Tau
    • Spearman's Rho

    But if you have more than two rankings to compare, use Kendall's W.
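
    A minimal sketch of Kendall's W for complete rankings without ties (the function name is illustrative):

    def kendalls_w(rankings):
        # `rankings` holds m rank vectors over the same n items:
        # rankings[r][j] is the rank (1..n) that rater r assigns to item j.
        # No tie correction is applied.
        m = len(rankings)
        n = len(rankings[0])
        rank_sums = [sum(r[j] for r in rankings) for j in range(n)]
        mean_sum = m * (n + 1) / 2.0
        s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
        return 12.0 * s / (m ** 2 * (n ** 3 - n))

    # Three raters ranking the same four items; W = 1 means perfect agreement.
    print(kendalls_w([[1, 2, 3, 4], [1, 3, 2, 4], [2, 1, 3, 4]]))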

  • 2020-12-08 16:03

    Kendall's tau is the metric you want. It measures the number of pairwise inversions between the lists. Spearman's footrule does something similar, but measures displacement distance rather than inversions. Both are designed for the task at hand: measuring the difference between two rank-ordered lists.
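
    A minimal sketch using scipy.stats.kendalltau, assuming the two lists rank the same set of documents (the lists here are made up):

    from scipy.stats import kendalltau

    list1 = ['d1', 'd2', 'd3', 'd4', 'd5']
    list2 = ['d2', 'd1', 'd3', 'd5', 'd4']

    # Compare the rank each document receives in the two lists.
    rank_in_2 = {doc: r for r, doc in enumerate(list2, start=1)}
    ranks1 = list(range(1, len(list1) + 1))
    ranks2 = [rank_in_2[doc] for doc in list1]

    tau, p_value = kendalltau(ranks1, ranks2)
    print(tau)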
