Efficient algorithm to compare similarity between sets of numbers?

甜味超标 2021-02-01 11:13

I have a large number of sets of numbers. Each set contains 10 numbers, and I need to remove every set that shares 5 or more numbers (in any order) with any other set.


12 answers
  •  走了就别回头了
    2021-02-01 11:53

    You should rethink your requirements because as it is, the operation does not even have a well-defined result. For example, take these sets:

    set 1: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} 
    set 2: {6, 7, 8, 9, 10, 11, 12, 13, 14, 15} 
    set 3: {11, 12, 13, 14, 15, 16, 17, 18, 19, 20}
    

    If you first treat sets 1 and 2 as "duplicates" and eliminate set 1, then sets 2 and 3 are also "duplicates", and you end up with only one remaining set. But if you eliminate set 2 first, then sets 1 and 3 have no numbers in common and you are left with two sets.
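
    The order dependence described above can be reproduced with a small sketch. The names `similar` and `eliminate` are hypothetical helpers, and the elimination rule (drop a set if it is similar to any set still kept, visiting sets in a given order) is only one assumed reading of the requirement:

```python
def similar(a, b, threshold=5):
    """True when the two sets share at least `threshold` numbers."""
    return len(a & b) >= threshold


def eliminate(sets, order, threshold=5):
    """Visit set indices in `order`; drop a set if it is similar to any
    other set that is still kept at that moment. Returns the kept indices."""
    alive = set(range(len(sets)))
    for i in order:
        if any(similar(sets[i], sets[j], threshold) for j in alive if j != i):
            alive.discard(i)
    return sorted(alive)


# The three sets from the example (indices 0, 1, 2 stand for sets 1, 2, 3).
sets = [set(range(1, 11)), set(range(6, 16)), set(range(11, 21))]

print(eliminate(sets, [0, 1, 2]))  # [2]     -> one set survives
print(eliminate(sets, [1, 0, 2]))  # [0, 2]  -> two sets survive
```

    The same input gives a different result depending only on the visiting order, which is the ambiguity the requirement needs to resolve.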

    You can easily expand this to your full 10,000 sets so that it would be possible that depending on which sets you compare and eliminate first, you could be left with only a single set, or with 5,000 sets. I don't think that is what you want.

    Mathematically speaking, your problem is that you are trying to find equivalence classes, but the relation "similarity" you use to define them is not an equivalence relation. Specifically, it is not transitive. In layman's terms, if set A is "similar" to set B and set B is "similar" to set C, then your definition does not ensure that A is also "similar" to C, and therefore you cannot meaningfully eliminate similar sets.
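
    To see the missing transitivity concretely, a brute-force check can list every triple where A is similar to B and B is similar to C, but A is not similar to C. This is only an illustrative sketch; `transitivity_violations` is a hypothetical name, and the cubic loop is meant for small examples, not for large inputs:

```python
from itertools import permutations


def similar(a, b, threshold=5):
    """True when the two sets share at least `threshold` numbers."""
    return len(a & b) >= threshold


def transitivity_violations(sets, threshold=5):
    """Return index triples (a, b, c) where a~b and b~c hold but a~c does not."""
    return [
        (a, b, c)
        for a, b, c in permutations(range(len(sets)), 3)
        if similar(sets[a], sets[b], threshold)
        and similar(sets[b], sets[c], threshold)
        and not similar(sets[a], sets[c], threshold)
    ]


sets = [set(range(1, 11)), set(range(6, 16)), set(range(11, 21))]
print(transitivity_violations(sets))  # [(0, 1, 2), (2, 1, 0)]
```

    Any non-empty result means the relation cannot be used directly to form equivalence classes.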

    You need to first clarify your requirements to deal with this problem before worrying about an efficient implementation. Either find a way to define a transitive similarity, or keep all sets and work only with comparisons (or with a list of similar sets for each single set).
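
    For the second option (keeping all sets and storing a list of similar sets for each one), a possible sketch uses an inverted index from each number to the sets containing it, so that only pairs sharing at least one number are ever counted. The function name `similar_lists` and the index-based output are assumptions made for illustration. If a single number occurs in very many sets, this still degrades toward comparing all pairs:

```python
from collections import defaultdict
from itertools import combinations


def similar_lists(sets, threshold=5):
    """Map each set index to the sorted indices of the sets that share
    at least `threshold` numbers with it."""
    # Inverted index: number -> indices of the sets that contain it.
    index = defaultdict(list)
    for i, s in enumerate(sets):
        for x in s:
            index[x].append(i)

    # Count shared numbers per pair; each shared number adds one to the pair.
    shared = defaultdict(int)
    for owners in index.values():
        for i, j in combinations(owners, 2):
            shared[(i, j)] += 1

    neighbours = {i: [] for i in range(len(sets))}
    for (i, j), count in shared.items():
        if count >= threshold:
            neighbours[i].append(j)
            neighbours[j].append(i)
    return {i: sorted(v) for i, v in neighbours.items()}


sets = [set(range(1, 11)), set(range(6, 16)), set(range(11, 21))]
print(similar_lists(sets))  # {0: [1], 1: [0, 2], 2: [1]}
```

    Nothing is deleted here, so the result does not depend on processing order; what to do with each list of similar sets is left to the clarified requirements.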
