How to merge similar items in a list

花落未央 2021-01-19 02:50

I haven't found anything relevant on Google, so I'm hoping to find some help here :)

I've got a Python list as follows:

[['hoose', 200], ["Ba


        
6 answers
  • 2021-01-19 02:54

    In common with the other comments, I'm not sure that doing this makes much sense, but here's a solution that does what you want, I think. It's very inefficient, O(n^2) where n is the number of words in your list, but I'm not sure there's a better way of doing it:

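    # Python 2 code; levenshtein(a, b) is assumed to return the edit distance
    # between two strings (for example, Levenshtein.distance from python-Levenshtein).
    data = [['hoose', 200],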
            ["Bananphone", 10],
            ['House', 200],
            ["Bonerphone", 10],
            ['UniqueValue', 777]]
    
    already_merged = []
    
    for word, score in data:
        added_to_existing = False
        for merged in already_merged:
            for potentially_similar in merged[0]:
                if levenshtein(word, potentially_similar) < 5:
                    merged[0].add(word)
                    merged[1] += score
                    added_to_existing = True
                    break
            if added_to_existing:
                break
        if not added_to_existing:
            already_merged.append([set([word]),score])
    
    print already_merged
    

    The output is:

    [[set(['House', 'hoose']), 400], [set(['Bonerphone', 'Bananphone']), 20], [set(['UniqueValue']), 777]]
    

    One of the obvious problems with this approach is that the word that you're considering might be close enough to many of the different sets of words that you've already considered, but this code will just lump it into the first one it finds. I've voted +1 for Space_C0wb0y's answer ;)
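    A rough sketch of one way around that limitation, assuming you are happy to join together every group a word is close to instead of picking the first match. This is a different technique from the loop above (transitive grouping with a union-find structure), so distant words can end up chained into one group. The pure-Python `levenshtein` helper and the name `merge_similar` are illustrative and not from any library:

```python
def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute cost 1 each)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def merge_similar(data, threshold=5):
    """Group words transitively: two words closer than threshold share a group."""
    parent = list(range(len(data)))  # union-find forest over item indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    # Link every pair of items whose words are within the threshold.
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            if levenshtein(data[i][0], data[j][0]) < threshold:
                parent[find(j)] = find(i)

    # Collect the words and the summed score of each group.
    groups = {}
    for i, (word, score) in enumerate(data):
        words, total = groups.get(find(i), (set(), 0))
        words.add(word)
        groups[find(i)] = (words, total + score)
    return list(groups.values())


data = [['hoose', 200],
        ["Bananphone", 10],
        ['House', 200],
        ["Bonerphone", 10],
        ['UniqueValue', 777]]

print(merge_similar(data))
```

    On this sample data the grouping is the same as above; the two approaches only differ when a word is within the threshold of more than one existing group.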

  • 2021-01-19 02:54

    @Mark Longair I was getting some errors in Python 3.5, so I corrected them as below:

    import Levenshtein
    data = [['hoose', 200],
           ["Bananphone", 10],
           ['House', 200],
           ["Bonerphone", 10],
           ['UniqueValue', 777]]
    
    already_merged = []
    
    for word, score in data:
        added_to_existing = False
        for merged in already_merged:
            for potentially_similar in merged[0]:
                if Levenshtein.distance(word, potentially_similar) < 5:
                    merged[0].add(word)
                    merged[1] += score
                    added_to_existing = True
                    break
            if added_to_existing:
                break
        if not added_to_existing:
            already_merged.append([set([word]),score])
    
    print(already_merged)
    

    @Mark thanks for such an easy solution.

  • 2021-01-19 02:57

    To bring home the point from my comment, I just grabbed an implementation of that distance from here, and calculated some distances:

    d('House', 'hoose') = 2
    d('House', 'trousers') = 4
    d('trousers', 'hoose') = 5
    

    Now, suppose your threshold is 4. You would have to merge House and hoose, as well as House and trousers, but not trousers and hoose. Are you sure something like this can never happen with your data?
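    If you want to check those numbers yourself, here is a minimal sketch. It assumes a plain pure-Python Levenshtein function (not the implementation linked above) and treats "distance of at most 4" as the merge rule:

```python
def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute cost 1 each)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


words = ['House', 'hoose', 'trousers']
threshold = 4  # assumed rule: merge when the distance is at most 4

# Compare every unordered pair and report whether the rule would merge it.
for i, first in enumerate(words):
    for second in words[i + 1:]:
        d = levenshtein(first, second)
        verdict = 'merge' if d <= threshold else 'keep apart'
        print("d({!r}, {!r}) = {} -> {}".format(first, second, d, verdict))
```

    The pairwise rule is not transitive: House merges with both other words, yet hoose and trousers stay apart.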

    In the end, I think this is more of a clustering problem, so you probably have to look into clustering algorithms. SciPy offers an implementation of hierarchical clustering that works with custom distance functions (be aware that this can be very slow for larger data sets, and it also consumes a lot of memory).

    The main problem is deciding on a measure of cluster quality, because there is no single correct solution to your problem. This paper (PDF) gives you a starting point for understanding that problem.

  • 2021-01-19 03:00

    Blueprint:

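    result = dict()
    for item in [[['hoose', 5], 200], [['House', 5], 200], [["Bananaphone", 5], 10]]:  # ... more items
    
       key = tuple(item[0])  # ('hoose', 5); lists are unhashable, so use a tuple as the key
       value = item[1]  # 200
    
       if key not in result:  # start the running total the first time a key is seen
           result[key] = 0
       result[key] += value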
    

    It might be necessary to adjust the code for unpacking the inner list items.

  • 2021-01-19 03:10

    You didn't say the number of items in your list, but I'm guessing n^2 complexity is OK.

    You also didn't say if you wanted all possible pairs to be compared or just the neighboring ones. I assume all pairs.

    So here's the idea:

    1. Take the first item and calculate the Levenshtein distance against all other items.
    2. Merge all items whose distance is less than 5 by removing them from the list and summing their scores.
    3. In the merged list, take the next item and compare it to all items except the one you just checked.
    4. Repeat until there are no items left in the list.
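
    The steps above could be sketched roughly like this. It assumes items are `[word, score]` pairs like the sample data in the other answers, and it uses a pure-Python `levenshtein` helper; both that helper and the name `merge_by_first_item` are made up for illustration:

```python
def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute cost 1 each)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def merge_by_first_item(items, threshold=5):
    """Repeatedly take the first item and absorb everything close to it."""
    remaining = list(items)  # copy so the caller's list is left untouched
    merged = []
    while remaining:                                # step 4: until the list is empty
        word, score = remaining.pop(0)              # step 1: take the first item
        group, total, rest = [word], score, []
        for other_word, other_score in remaining:
            if levenshtein(word, other_word) < threshold:
                group.append(other_word)            # step 2: merge and sum scores
                total += other_score
            else:
                rest.append((other_word, other_score))
        merged.append((group, total))
        remaining = rest                            # step 3: continue with the rest
    return merged


data = [['hoose', 200],
        ["Bananphone", 10],
        ['House', 200],
        ["Bonerphone", 10],
        ['UniqueValue', 777]]

print(merge_by_first_item(data))
```

    Note that each group is built only around its first word, so the result depends on the order of the list.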
  • 2021-01-19 03:12
    import Levenshtein
    import cluster
    
    class Item(object):
        @classmethod
        def fromList(cls,lst):
            return cls(lst[0][0], lst[0][1], lst[1])
    
        def __init__(self, name, val=0, score=0):
            super(Item,self).__init__()
            self.name     = name
            self.val      = val
            self.score    = score
    
        def dist(self, other):
            return 100 if other is self else Levenshtein.distance(self.name, other.name)
    
        def __str__(self):
            return "('{0}', {1})".format(self.name, self.val)
    
    def main():
        myList = [
            [['hoose', 5], 200],
            [['House', 5], 200],
            [["Bananaphone", 5], 10],
            [['trousers', 5], 100]
        ]
        items = [Item.fromList(i) for i in myList]
    
        cl = cluster.HierarchicalClustering(items, (lambda x,y: x.dist(y)))
        for group in cl.getlevel(5):
            groupScore = sum(item.score for item in group)
            groupStr   = ', '.join(str(item) for item in group)
            print("{0}: {1}".format(groupScore, groupStr))
    
    if __name__=="__main__":
        main()
    

    returns

    10: ('Bananaphone', 5)
    500: ('trousers', 5), ('hoose', 5), ('House', 5)
    