I have a list of strings and I want to keep only the most unique strings. Here is how I have implemented this (maybe there's an issue with the loop):
def filter
The problem with your logic is that each time you delete an item from the list, the remaining items shift down by one index, so the loop skips the string that moved into the deleted slot. E.g.:
Assume that this is the list: Description : ["A","A","A","B","C"]
Iteration 1:
i=0 -------------0
description[i]="A"
j=i+1 -------------1
description[j]="A"
similarity_ratio>0.6
del description[j]
Now the array is re-indexed like: Description:["A","A","B","C"]. The next step is:
j=j+1 ------------1+1= 2
Description[2]="B"
You have skipped Description[1]="A"
To fix this, replace
j+=1
with
j=i+1
when an item has been deleted. Otherwise, do the normal j=j+1 increment.
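The trace above can be reproduced with a small sketch. The function names `dedupe_buggy` and `dedupe_reset` are made up for this illustration, and plain equality stands in for the `similarity_ratio > 0.6` test, so this is a simplified model of the loop rather than the original code.

```python
def dedupe_buggy(items):
    """Always advances j, even right after a deletion (the case traced above)."""
    items = list(items)  # work on a copy so the caller's list is untouched
    i = 0
    while i < len(items):
        j = i + 1
        while j < len(items):
            if items[i] == items[j]:  # stand-in for similarity_ratio > 0.6
                del items[j]
            j += 1  # bug: skips the element that just slid into position j
        i += 1
    return items


def dedupe_reset(items):
    """Resets j to i + 1 after a deletion, as suggested above."""
    items = list(items)
    i = 0
    while i < len(items):
        j = i + 1
        while j < len(items):
            if items[i] == items[j]:
                del items[j]
                j = i + 1  # go back so no shifted element is missed
            else:
                j += 1
        i += 1
    return items


print(dedupe_buggy(["A", "A", "A", "B", "C"]))  # ['A', 'A', 'B', 'C']
print(dedupe_reset(["A", "A", "A", "B", "C"]))  # ['A', 'B', 'C']
```

The buggy version keeps a duplicate "A" because the second "A" is never compared after the first deletion.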
If you invert your logic, you can avoid modifying the list in place and still keep the number of comparisons low. That is, start with an empty output/unique list and iterate over your descriptions, checking whether each one can be added. The first description can be added immediately, as it cannot be similar to anything in an empty list. The second description only needs to be compared to the first, as opposed to all other descriptions. Later iterations can short-circuit as soon as they find a previous description that they are similar to (and the candidate description is then discarded). I.e.:
import operator

def unique(items, compare=operator.eq):
    # compare is a function that returns True if its two arguments are deemed similar to
    # each other and False otherwise.
    unique_items = []
    for item in items:
        if not any(compare(item, uniq) for uniq in unique_items):
            # any will stop as soon as compare(item, uniq) returns True
            # you could also use `if all(not compare(item, uniq) ...` if you prefer
            unique_items.append(item)
    return unique_items
Examples:
assert unique([2,3,4,5,1,2,3,3,2,1]) == [2, 3, 4, 5, 1]
# note that order is preserved
assert unique([1, 2, 0, 3, 4, 5], compare=(lambda x, y: abs(x - y) <= 1)) == [1, 3, 5]
# using a custom comparison function we can exclude items that are too similar to previous
# items. Here 2 and 0 are excluded because they are too close to 1 which was accepted
# as unique first. Change the order of 3 and 4, and then 5 would also be excluded.
With your code your comparison function would look like:
from difflib import SequenceMatcher

MAX_SIMILAR_ALLOWED = 0.6 #40% unique and 60% similar

def description_cmp(candidate_desc, unique_desc):
    # pass unique_desc to SequenceMatcher first, as this keeps the argument order the same
    # as in your filter function, where the first description is the one that is retained
    # if the two descriptions are deemed to be too similar
    similarity_ratio = SequenceMatcher(None, unique_desc, candidate_desc).ratio()
    return similarity_ratio > MAX_SIMILAR_ALLOWED

def filter_descriptions(descriptions):
    # This would be the new definition of your filter_descriptions function
    return unique(descriptions, compare=description_cmp)
The number of comparisons should be exactly the same. That is, in your implementation the first element is compared to all the others, and the second element is only compared to elements that were deemed not similar to the first element and so on. In this implementation the first item is not compared to anything initially, but all other items must be compared to it to be allowed to be added to the unique list. Only items deemed not similar to the first item will be compared to the second unique item, and so on.
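One way to check the claim about comparison counts is to instrument both approaches. The sketch below is illustrative only: `count_in_place`, `count_unique` and `close` are hypothetical helpers, the in-place variant assumes the corrected loop (j is not advanced after a deletion), and the comparison function is assumed to be symmetric.

```python
def count_in_place(items, compare):
    """In-place deletion variant; returns (result, number of compare calls)."""
    items = list(items)  # copy so the caller's list is not modified
    calls = 0
    i = 0
    while i < len(items):
        j = i + 1
        while j < len(items):
            calls += 1
            if compare(items[i], items[j]):
                del items[j]  # j stays put: the next item has moved into slot j
            else:
                j += 1
        i += 1
    return items, calls


def count_unique(items, compare):
    """Build-a-new-list variant; returns (result, number of compare calls)."""
    unique_items = []
    calls = 0
    for item in items:
        for uniq in unique_items:
            calls += 1
            if compare(uniq, item):
                break  # too similar to an accepted item, so discard it
        else:
            unique_items.append(item)  # no accepted item was similar
    return unique_items, calls


def close(x, y):
    # Toy similarity test: numbers within 1 of each other count as similar.
    return abs(x - y) <= 1


data = [1, 2, 0, 3, 4, 5]
print(count_in_place(data, close))  # ([1, 3, 5], 7)
print(count_unique(data, close))    # ([1, 3, 5], 7)
```

Both variants compare the same (accepted item, candidate) pairs, only in a different order, which is why the counts match for this input.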
The unique implementation will do less copying, as it only has to copy the unique list when the backing array runs out of space, whereas with the del statement parts of the list must be copied each time it is used (to move all subsequent items into their new correct positions). This will likely have a negligible impact on performance though, as the bottleneck is probably the ratio calculation in the sequence matcher.
The value of j should not change when an item is deleted from the list (since a different list item will be present at that position in the next iteration). Doing j=i+1 restarts the inner iteration every time an item is deleted (which is not what is desired). The updated code now only increments j in the else branch.
from difflib import SequenceMatcher

def filter_descriptions(descriptions):
    MAX_SIMILAR_ALLOWED = 0.6 #40% unique and 60% similar
    i = 0
    while i < len(descriptions):
        print("Processing {}/{}...".format(i + 1, len(descriptions)))
        desc_to_evaluate = descriptions[i]
        j = i + 1
        while j < len(descriptions):
            similarity_ratio = SequenceMatcher(None, desc_to_evaluate, descriptions[j]).ratio()
            if similarity_ratio > MAX_SIMILAR_ALLOWED:
                # do not advance j: the next item has shifted into position j
                del descriptions[j]
            else:
                j += 1
        i += 1
    return descriptions
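Note that this version deletes from the list it is given, so the caller's list changes too. A minimal usage sketch, assuming you want to keep the original list: pass a copy. The function below repeats the same loop without the progress print, and the `max_similar` parameter is a hypothetical addition for this example only.

```python
from difflib import SequenceMatcher


def filter_descriptions(descriptions, max_similar=0.6):
    # Same loop as above, minus the progress print. max_similar is a
    # hypothetical parameter introduced for this sketch.
    i = 0
    while i < len(descriptions):
        j = i + 1
        while j < len(descriptions):
            ratio = SequenceMatcher(None, descriptions[i], descriptions[j]).ratio()
            if ratio > max_similar:
                del descriptions[j]  # mutates the list that was passed in
            else:
                j += 1
        i += 1
    return descriptions


originals = ["aaaa", "aaaa", "zzzz"]

kept = filter_descriptions(list(originals))  # pass a copy to protect originals
print(kept)       # ['aaaa', 'zzzz']
print(originals)  # ['aaaa', 'aaaa', 'zzzz']

filter_descriptions(originals)               # no copy: originals is modified
print(originals)  # ['aaaa', 'zzzz']
```

The sample strings are chosen so the ratios are exactly 1.0 (identical) and 0.0 (no characters in common), which keeps the outcome independent of the threshold.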