
Question:

I have a list of names which I'm trying to pull out of a list of strings. I keep getting false positives such as partial matches. The other caveat is that I'd like it to also grab a last name where applicable.

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly', 'Christmas is here', 'CHRIS']

desired_output = ['Chris Smith', 'Kimberly', 'CHRIS']

I've tried this code:

[i for e in names for i in target if i.startswith(e)]

This predictably returns Chris Smith, Christmas is here, and Kimberly.

How would I best approach this? Using regex or can it be done with list comprehensions? Performance may be an issue as the real names list is ~880,000 names long.

(Python 2.7)

EDIT: I've realized that my criteria in this example are unrealistic, given the impossible request of including Kimberly while excluding Christmas is here. To mitigate this issue, I've found a more complete names list that includes variations (both Kim and Kimberly are included).

Answer 1:

Complete guess (again) since I can't see how you can not have Christmas is here given any reasonable criteria:

This'll match any targets that have any word that starts with a word from names...

import re

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly', 'Christmas is here', 'CHRIS']

matches = [targ for targ in target if any(re.search(r'\b{}'.format(name), targ, re.I) for name in names)]
print matches
# ['Chris Smith', 'Kimberly', 'Christmas is here', 'CHRIS']

If you change the pattern to r'\b{}\b', you'll get ['Chris Smith', 'CHRIS'], so you lose the Kim match...
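To see the two patterns side by side, here is a minimal sketch. The helper name match_targets and its whole_word flag are made up for this illustration, and re.escape is added on the assumption that real names may contain regex metacharacters.

```python
import re

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly',
          'Christmas is here', 'CHRIS']


def match_targets(names, target, whole_word=False):
    # Hypothetical helper: whole_word=False uses \bname, True uses \bname\b.
    template = r'\b{}\b' if whole_word else r'\b{}'
    # re.escape is an added precaution, not part of the original answer.
    patterns = [re.compile(template.format(re.escape(name)), re.I)
                for name in names]
    return [targ for targ in target
            if any(p.search(targ) for p in patterns)]


print(match_targets(names, target))
# ['Chris Smith', 'Kimberly', 'Christmas is here', 'CHRIS']
print(match_targets(names, target, whole_word=True))
# ['Chris Smith', 'CHRIS']
```

Compiling each pattern once avoids rebuilding it for every target, but every name is still tried against every target, so this is unlikely to scale well to ~880,000 names.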

Answer 2:

Does this work?

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly', 'Christmas is here', 'CHRIS']

res = []
for tof in target:
    for name in names:
        if tof.lower().startswith(name.lower()):
            res.append(tof)
            break
print res
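As a side note, str.startswith also accepts a tuple of prefixes, so the inner loop can be folded into a single call. This is a minimal sketch of the same check, with prefixes as an illustrative name:

```python
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly',
          'Christmas is here', 'CHRIS']

# Lowercase the names once; startswith() takes a tuple of candidate prefixes.
prefixes = tuple(name.lower() for name in names)
res = [tof for tof in target if tof.lower().startswith(prefixes)]
print(res)
# ['Chris Smith', 'Kimberly', 'Christmas is here', 'CHRIS']
```

This is shorter, but each prefix is still tried in turn, so it should not be expected to change the overall cost for a very long names list.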

Answer 3:

From your description, I take the rules to be:

ignore case;
the target word must start with the key word;
if the target word is not exactly the key word, then the target word must be the only word in the sentence.

Try this:

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly', 'Christmas is here', 'CHRIS']
desired_output = ['Chris Smith', 'Kimberly', 'CHRIS']

actual_output = []
for key in names:
    for words in target:
        for word in words.split():
            if key.lower() == word.lower():
                actual_output.append(words)
            elif key.lower() == word.lower()[:len(key)] and len(words.split()) == 1:
                actual_output.append(words)
print(actual_output)

This outputs the same items as your desired output, though in a different order, since the outer loop runs over the names (by the way, are you sure you really want this?). Don't be put off by the three nested loops. If you have N names and M sentences, and the number of words in each sentence is bounded, then the complexity of this code is O(NM), which this name-by-name comparison cannot improve on.
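For comparison, if the fuller names list from the question's EDIT is used, so that only whole-word matches are needed, each word can be looked up in a set instead of being compared against every name. This is a sketch under that assumption and not an implementation of the three rules above; name_set and find_by_word are illustrative names, and 'Kimberly' is added to names here only to mimic the fuller list.

```python
# Assumption: the names list already contains variants such as 'Kimberly'.
names = ['Chris', 'Jack', 'Kim', 'Kimberly']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly',
          'Christmas is here', 'CHRIS']

# Build the lowercase lookup set once, up front.
name_set = set(name.lower() for name in names)


def find_by_word(name_set, target):
    # Keep a sentence if any of its whitespace-separated words is a known name.
    return [words for words in target
            if any(word.lower() in name_set for word in words.split())]


print(find_by_word(name_set, target))
# ['Chris Smith', 'Kimberly', 'CHRIS']
```

Set membership tests take roughly constant time on average, so the work grows with the total number of words in the targets and not with the product of names and sentences. Punctuation attached to a word is not stripped in this sketch.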

Answer 4:

There is no deterministic way to drop the match 'Christmas is here', as the system may not be able to determine whether Christmas is a name or something else. If you instead want to speed up the process, you can try this approach, which makes a single pass over the targets (O(n) in the number of targets, not counting the cost of each SequenceMatcher call). I have not timed it, but I expect it to be faster than your code or the other proposed solutions.

from difflib import SequenceMatcher

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly', 'Christmas is here', 'CHRIS']

def foo(names, target):
    # Create a generator to search the names
    def bar(names, target):
        # which, for each target,
        for t in target:
            # finds the matching blocks; each is a triplet (i, j, n),
            # meaning that a[i:i+n] == b[j:j+n]
            match = SequenceMatcher(None, names, t).get_matching_blocks()[0]
            # match.size == 0 means no match,
            # and match.b > 0 means the match does not happen at the start
            if match.size > 0 and match.b == 0:
                # and generate the matching target
                yield t
    # Join the names to create a single string
    names = ','.join(names)
    # and call the generator, returning its results as a list
    return list(bar(names, target))

>>> foo(names, target)
['Chris Smith', 'Kimberly', 'Christmas is here', 'CHRIS']
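To make the check easier to follow, this sketch prints the first matching block for a few of the targets. get_matching_blocks() returns Match(a, b, size) triples; first_block and joined are illustrative names that are not part of the answer's code.

```python
from difflib import SequenceMatcher

joined = ','.join(['Chris', 'Jack', 'Kim'])  # 'Chris,Jack,Kim'


def first_block(joined, t):
    # First matching block between the joined names and one target.
    return SequenceMatcher(None, joined, t).get_matching_blocks()[0]


for t in ['Chris Smith', 'Kimberly', 'CHRIS']:
    m = first_block(joined, t)
    print('{} {}'.format(t, tuple(m)))
# Chris Smith (0, 0, 5)
# Kimberly (11, 0, 3)
# CHRIS (0, 0, 1)
```

Note that 'CHRIS' passes only because the comparison is case-sensitive and the single character 'C' lines up at position 0 of the target, so it is worth checking the results against your own data before relying on this approach.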

Source: Searching a list based on values in another list
