Searching a list based on values in another list

Anonymous (unverified), submitted 2019-12-03 00:45:01

Question:

I have a list of names which I'm trying to pull out of a list of strings. I keep getting false positives such as partial matches. The other caveat is that I'd like it to also grab a last name where applicable.

    names = ['Chris', 'Jack', 'Kim']
    target = ['Chris Smith', 'I hijacked this thread', 'Kimberly', 'Christmas is here', 'CHRIS']

    desired_output = ['Chris Smith', 'Kimberly', 'CHRIS']

I've tried this code:

    [i for e in names for i in target if i.startswith(e)]

This predictably returns Chris Smith, Christmas is here, and Kimberly.

How would I best approach this? Using regex or can it be done with list comprehensions? Performance may be an issue as the real names list is ~880,000 names long.

(python 2.7)

EDIT: I've realized that my criteria in this example are unrealistic: wanting to include Kimberly while excluding Christmas is here is an impossible request. To mitigate this issue, I've found a more complete names list which includes variations (both Kim and Kimberly are included).

Answer 1:

A complete guess (again), since I can't see how you could exclude Christmas is here under any reasonable criteria:

This'll match any targets that have any word that starts with a word from names...

    import re

    names = ['Chris', 'Jack', 'Kim']
    target = ['Chris Smith', 'I hijacked this thread', 'Kimberly', 'Christmas is here', 'CHRIS']

    matches = [targ for targ in target
               if any(re.search(r'\b{}'.format(name), targ, re.I) for name in names)]
    print matches
    # ['Chris Smith', 'Kimberly', 'Christmas is here', 'CHRIS']

If you change the pattern to r'\b{}\b', then you'll get ['Chris Smith', 'CHRIS'], so you lose Kimberly...
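For a long names list, it may be cheaper to compile a single alternation once instead of calling re.search for every name and target pair. The following is a minimal sketch, not a measured improvement: the helper match_targets and its whole_word flag are hypothetical names introduced here, and re.escape is added on the assumption that real names may contain regex metacharacters.

```python
import re

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly',
          'Christmas is here', 'CHRIS']


def match_targets(names, target, whole_word=False):
    # Hypothetical helper (not from the answer above).
    if not names:
        # An empty alternation would match everywhere, so bail out early.
        return []
    # Build one case-insensitive alternation out of all escaped names.
    alternation = '|'.join(re.escape(name) for name in names)
    pattern = r'\b(?:' + alternation + ')'
    if whole_word:
        # Require a word boundary after the name as well.
        pattern += r'\b'
    regex = re.compile(pattern, re.I)
    # Keep every target in which the compiled pattern matches somewhere.
    return [targ for targ in target if regex.search(targ)]


prefix_matches = match_targets(names, target)
whole_matches = match_targets(names, target, whole_word=True)
```

With the sample data, prefix_matches is ['Chris Smith', 'Kimberly', 'Christmas is here', 'CHRIS'] and whole_matches is ['Chris Smith', 'CHRIS']. If the names list also contains 'Kimberly', as the question's edit describes, the whole-word variant keeps 'Kimberly' and still drops 'Christmas is here'.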



Answer 2:

Does this work?

    names = ['Chris', 'Jack', 'Kim']
    target = ['Chris Smith', 'I hijacked this thread', 'Kimberly', 'Christmas is here', 'CHRIS']

    res = []
    for tof in target:
        for name in names:
            if tof.lower().startswith(name.lower()):
                res.append(tof)
                break
    print res
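The same check can be written more compactly, because str.startswith accepts a tuple of prefixes. This is only a sketch of an equivalent formulation; the variable prefixes is a name introduced here, and the names are still scanned one by one internally, so it is not a different algorithm.

```python
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly',
          'Christmas is here', 'CHRIS']

# Lowercase the names once instead of once per comparison.
prefixes = tuple(name.lower() for name in names)

# str.startswith returns True if the string starts with any prefix in the tuple.
res = [tof for tof in target if tof.lower().startswith(prefixes)]
```

For the sample data, res is ['Chris Smith', 'Kimberly', 'Christmas is here', 'CHRIS'], so 'Christmas is here' is still included.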


Answer 3:

From your description, I take the rules to be:

  1. ignore case;
  2. the target word must start with the key word;
  3. if the target word is not exactly the key word, then the target word must be the only word in the sentence.

Try this:

    names = ['Chris', 'Jack', 'Kim']
    target = ['Chris Smith', 'I hijacked this thread', 'Kimberly', 'Christmas is here', 'CHRIS']
    desired_output = ['Chris Smith', 'Kimberly', 'CHRIS']

    actual_output = []
    for key in names:
        for words in target:
            for word in words.split():
                if key.lower() == word.lower():
                    actual_output.append(words)
                elif key.lower() == word.lower()[:len(key)] and len(words.split()) == 1:
                    actual_output.append(words)
    print(actual_output)

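It will output exactly the items in your desired output, although grouped by name rather than in target order: ['Chris Smith', 'CHRIS', 'Kimberly'] (btw, are you sure you really want this?). Don't be put off by the three nested loops. If you have N names and M sentences, and the number of words in each sentence is bounded, then the complexity of this code is O(MN), which is as good as it gets for an approach that compares every name against every sentence.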
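The three rules above can also be sketched with a set for the exact-match test, which avoids looping over every name for every word and keeps the results in target order without duplicates. The function name filter_targets is hypothetical, and this is an illustration under the stated rules rather than a benchmarked replacement; the prefix test still scans the names.

```python
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly',
          'Christmas is here', 'CHRIS']
desired_output = ['Chris Smith', 'Kimberly', 'CHRIS']


def filter_targets(names, target):
    # Hypothetical helper: lowercase the names once (rule 1).
    lowered = [name.lower() for name in names]
    exact = set(lowered)        # constant-time exact lookups
    prefixes = tuple(lowered)   # str.startswith accepts a tuple
    kept = []
    for sentence in target:
        words = sentence.lower().split()
        if any(word in exact for word in words):
            # Some word is exactly a name (rule 2 with an exact match).
            kept.append(sentence)
        elif len(words) == 1 and words[0].startswith(prefixes):
            # A prefix-only match counts only for one-word sentences (rule 3).
            kept.append(sentence)
    return kept


actual_output = filter_targets(names, target)
```

Here actual_output equals desired_output, including the order, because the outer loop runs over target instead of names.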



Answer 4:

There is no deterministic way to drop the match 'Christmas is here', since the system may not be able to tell whether Christmas is a name or something else. If you instead want to speed up the process, you can try this approach, which I would call O(n) in the number of targets, although the difflib documentation gives SequenceMatcher itself quadratic worst-case time. I have not timed it, but I expect it to be faster than your code or the other proposed solutions.

    from difflib import SequenceMatcher

    names = ['Chris', 'Jack', 'Kim']
    target = ['Chris Smith', 'I hijacked this thread', 'Kimberly', 'Christmas is here', 'CHRIS']

    def foo(names, target):
        # Create a generator to search the names
        def bar(names, target):
            # which, for each target,
            for t in target:
                # finds the first matching block, a triplet (i, j, n),
                # meaning that a[i:i+n] == b[j:j+n]
                match = SequenceMatcher(None, names, t).get_matching_blocks()[0]
                # match.size == 0 means no match,
                # and match.b > 0 means the match does not happen at the start
                if match.size > 0 and match.b == 0:
                    # and yields the matching target
                    yield t
        # Join the names to create a single string
        names = ','.join(names)
        # then call the generator and return its results as a list
        return list(bar(names, target))

    >>> foo(names, target)
    ['Chris Smith', 'Kimberly', 'Christmas is here', 'CHRIS']
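To make the (i, j, n) triplet concrete, here is a small sketch that prints nothing and only inspects the first matching block for two strings. The helper first_block and the target 'Cat' are made up for illustration and do not appear in the question.

```python
from difflib import SequenceMatcher

# The joined string that foo() builds from the sample names.
joined = ','.join(['Chris', 'Jack', 'Kim'])  # 'Chris,Jack,Kim'


def first_block(target_string):
    # Hypothetical helper: first (a, b, size) block for joined vs. target_string.
    matcher = SequenceMatcher(None, joined, target_string)
    return matcher.get_matching_blocks()[0]


# 'Kim' sits at joined[11:14] and at the very start of 'Kimberly'.
kim = first_block('Kimberly')

# 'Cat' shares only its leading 'C' with the joined names.
cat = first_block('Cat')
```

Here kim is Match(a=11, b=0, size=3) and cat is Match(a=0, b=0, size=1). Since the test is only match.size > 0 and match.b == 0, a single shared leading character is enough to accept a target, which is also why the case-sensitive comparison still accepts 'CHRIS'. It may be worth requiring a minimum match.size if that is too permissive for your data.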

