Question:
I have a list of names which I'm trying to pull out of a list of strings. I keep getting false positives such as partial matches. The other caveat is that I'd like it to also grab a last name where applicable.
    names = ['Chris', 'Jack', 'Kim']
    target = ['Chris Smith', 'I hijacked this thread', 'Kimberly', 'Christmas is here', 'CHRIS']
    desired_output = ['Chris Smith', 'Kimberly', 'CHRIS']
I've tried this code:
[i for e in names for i in target if i.startswith(e)]
This predictably returns Chris Smith, Christmas is here, and Kimberly.
How would I best approach this? Using regex or can it be done with list comprehensions? Performance may be an issue as the real names list is ~880,000 names long.
(python 2.7)
EDIT: I've realized that my criteria in this example are unrealistic, given the impossible request of including Kimberly while excluding Christmas is here. To mitigate this, I've found a more complete names list that includes variations (both Kim and Kimberly are in it).
Answer 1:
This is a complete guess (again), since I can't see how you can avoid matching Christmas is here given any reasonable criteria:
This'll match any targets that have any word that starts with a word from names...
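    import re

    names = ['Chris', 'Jack', 'Kim']
    target = ['Chris Smith', 'I hijacked this thread', 'Kimberly', 'Christmas is here', 'CHRIS']

    # Keep each target in which some word starts with one of the names (case-insensitive).
    # re.escape guards against names that contain regex metacharacters.
    matches = [targ for targ in target
               if any(re.search(r'\b{}'.format(re.escape(name)), targ, re.I) for name in names)]
    print(matches)
    # ['Chris Smith', 'Kimberly', 'Christmas is here', 'CHRIS']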
If you change the pattern to r'\b{}\b', then you'll get ['Chris Smith', 'CHRIS'], so you lose the Kim match (Kimberly).
...
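To make the trade-off between the two patterns concrete, here is a minimal sketch that runs both over the sample data. The helper name match_targets and its whole_word flag are made up for illustration, and joining the names into one alternation is a variation on the per-name search above, not something I have measured against an 880,000-name list.

```python
import re

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly',
          'Christmas is here', 'CHRIS']

def match_targets(names, target, whole_word):
    # Hypothetical helper: build one alternation so each target is scanned once.
    # re.escape guards against names that contain regex metacharacters.
    alternation = '|'.join(re.escape(name) for name in names)
    # A trailing \b demands that the name is a complete word, not just a prefix.
    suffix = r'\b' if whole_word else ''
    pattern = re.compile(r'\b(?:' + alternation + ')' + suffix, re.I)
    return [targ for targ in target if pattern.search(targ)]

prefix_matches = match_targets(names, target, whole_word=False)
exact_matches = match_targets(names, target, whole_word=True)
print(prefix_matches)  # ['Chris Smith', 'Kimberly', 'Christmas is here', 'CHRIS']
print(exact_matches)   # ['Chris Smith', 'CHRIS']
```

Neither pattern can keep Kimberly while dropping Christmas is here, which is the conflict in the original criteria.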
Answer 2:
Does this work?
    names = ['Chris', 'Jack', 'Kim']
    target = ['Chris Smith', 'I hijacked this thread', 'Kimberly', 'Christmas is here', 'CHRIS']

    res = []
    for tof in target:
        for name in names:
            # Case-insensitive prefix test; stop at the first matching name.
            if tof.lower().startswith(name.lower()):
                res.append(tof)
                break
    print(res)
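As a possible shorthand for the nested loop above: str.startswith accepts a tuple of prefixes, so the inner loop can be folded into a single call. This is only a sketch with the same matching rule (so Christmas is here is still included), and the variable name prefixes is my own.

```python
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly',
          'Christmas is here', 'CHRIS']

# Lowercase the names once instead of once per comparison.
prefixes = tuple(name.lower() for name in names)

# startswith tries each prefix in the tuple and stops at the first hit.
res = [tof for tof in target if tof.lower().startswith(prefixes)]
print(res)  # ['Chris Smith', 'Kimberly', 'Christmas is here', 'CHRIS']
```

Each call still tries the prefixes one by one, so the work per target grows with the number of names.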
Answer 3:
From your description, I take the rules to be:
- ignore case;
- the target word must begin with the key word;
- if the target word is not exactly the key word, then the target word must be the only word in the sentence.
Try this:
    names = ['Chris', 'Jack', 'Kim']
    target = ['Chris Smith', 'I hijacked this thread', 'Kimberly', 'Christmas is here', 'CHRIS']
    desired_output = ['Chris Smith', 'Kimberly', 'CHRIS']

    actual_output = []
    for key in names:
        for words in target:
            for word in words.split():
                if key.lower() == word.lower():
                    # Exact (case-insensitive) match on a word.
                    actual_output.append(words)
                elif key.lower() == word.lower()[:len(key)] and len(words.split()) == 1:
                    # Prefix match, allowed only for single-word targets.
                    actual_output.append(words)
    print(actual_output)
It will output exactly the items in your desired output, grouped by name: ['Chris Smith', 'CHRIS', 'Kimberly'] (by the way, are you sure you really want this?). Don't be put off by the three nested loops. If you have N names and M sentences, and the number of words in each sentence is bounded, then the complexity of this code is O(MN), which is the best you can do when every name is compared against every sentence.
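If the fuller names list mentioned in the question's EDIT is used (so that Kimberly is itself an entry), the pairwise comparison can be avoided by putting the names in a set and testing only the first word of each target. The following is a sketch under those assumptions: the function name first_word_matches is hypothetical, 'Kimberly' is added to names by hand to stand in for the fuller list, and only the first word is checked because the question wants a first name optionally followed by a last name.

```python
# 'Kimberly' is added here to mimic the fuller names list (an assumption).
names = ['Chris', 'Jack', 'Kim', 'Kimberly']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly',
          'Christmas is here', 'CHRIS']

def first_word_matches(names, target):
    # Hypothetical helper: build the lowercase set once; membership tests
    # are then constant time on average, however many names there are.
    name_set = set(name.lower() for name in names)
    matches = []
    for sentence in target:
        words = sentence.split()
        # Only the first word is compared, and it must equal a name exactly.
        if words and words[0].lower() in name_set:
            matches.append(sentence)
    return matches

result = first_word_matches(names, target)
print(result)  # ['Chris Smith', 'Kimberly', 'CHRIS']
```

With exact matching, Christmas is here drops out without any special case, and the cost is roughly one set lookup per target rather than one comparison per name.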
Answer 4:
There is no deterministic way to drop the match 'Christmas is here', as the system may not be able to determine whether Christmas is a name or something else. If instead you want to speed up the process, you can try this approach, which makes a single pass over the targets (O(n) in the number of targets). I have not timed it, but I expect it to be faster than your version or the other proposed solutions.
    from difflib import SequenceMatcher

    names = ['Chris', 'Jack', 'Kim']
    target = ['Chris Smith', 'I hijacked this thread', 'Kimberly', 'Christmas is here', 'CHRIS']

    def foo(names, target):
        # Create a generator to search the names
        def bar(names, target):
            # which, for each target,
            for t in target:
                # finds the matching blocks; each is a triplet (i, j, n)
                # meaning that a[i:i+n] == b[j:j+n]
                match = SequenceMatcher(None, names, t).get_matching_blocks()[0]
                # match.size == 0 means no match,
                # and match.b > 0 means the match does not happen at the start
                if match.size > 0 and match.b == 0:
                    # and generate the matching target
                    yield t
        # Join the names to create a single string
        names = ','.join(names)
        # and call the generator, returning its results as a list
        return list(bar(names, target))

    >>> foo(names, target)
    ['Chris Smith', 'Kimberly', 'Christmas is here', 'CHRIS']