How to efficiently search for a list of strings in another list of strings using Python?

Posted by 女生的网名这么多〃 on 2021-02-05 08:01:26

Question


I have two lists of names (strings) that look like this:

executives = ['Brian Olsavsky', 'Some Guy', 'Some Lady']

analysts = ['Justin Post', 'Some Dude', 'Some Chick']

I need to find where those names occur in a list of strings that looks like this:

str = ['Justin Post - Bank of America',
 "Great. Thank you for taking my question. I guess the big one is the deceleration in unit growth or online stores.", 
"I know it's a tough 3Q comp, but could you comment a little bit about that?",
 'Brian Olsavsky - Amazon.com',
 "Thank you, Justin. Yeah, let me just remind you a couple of things from last year.", 
"We had two reactions on our Super Saver Shipping threshold in the first half." ,
 "I'll just remind you that the units  those do not count",
 "In-stock is very strong, especially as we head into the holiday period.",
 'Dave Fildes - Amazon.com',
"And, Justin, this is Dave. Just to add on to that. You mentioned the online stores."]

The reason I need to do this is so that I can concatenate the conversation strings together (which are separated by the names). How would I go about doing this efficiently?

I looked at some similar questions and tried the solutions to no avail, such as this:

if any(x in str for x in executives):
    print('yes')

and this ...

match = next((x for x in executives if x in str), False)
match
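A likely reason these attempts print nothing: applying in to a list tests whether some element of the list equals the left-hand value exactly; it does not look for substrings inside the elements. A minimal illustration (with abbreviated data):

```python
# `in` on a list checks for an exact element match, not substrings.
lines = ['Justin Post - Bank of America', 'Brian Olsavsky - Amazon.com']

print('Brian Olsavsky' in lines)                  # False: no element equals the name
print(any('Brian Olsavsky' in s for s in lines))  # True: substring test per element
```

So the substring check has to be applied to each string in the list, as the answers below do.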

Answer 1:


I'm not sure if this is what you are looking for:

executives = ['Brian Olsavsky', 'Some Guy', 'Some Lady']
text = ['Justin Post - Bank of America',
 "Great. Thank you for taking my question. I guess the big one is the deceleration in unit growth or online stores.", 
"I know it's a tough 3Q comp, but could you comment a little bit about that?",
 'Brian Olsavsky - Amazon.com',
 "Thank you, Justin. Yeah, let me just remind you a couple of things from last year.", 
"We had two reactions on our Super Saver Shipping threshold in the first half." ,
 "I'll just remind you that the units  those do not count",
 "In-stock is very strong, especially as we head into the holiday period.",
 'Dave Fildes - Amazon.com',
"And, Justin, this is Dave. Just to add on to that. You mentioned the online stores."]

result = [s for s in text if any(ex in s for ex in executives)]
print(result)

output: ['Brian Olsavsky - Amazon.com']
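Since the stated goal was to concatenate the conversation strings between the name lines, here is a minimal sketch of one way to group them, assuming a line introduces a new speaker exactly when it contains one of the known names (data abbreviated):

```python
executives = ['Brian Olsavsky']
analysts = ['Justin Post']
text = ['Justin Post - Bank of America',
        'Great. Thank you for taking my question.',
        'Brian Olsavsky - Amazon.com',
        'Thank you, Justin.',
        'We had two reactions on our Super Saver Shipping threshold.']

speakers = executives + analysts
conversation = []  # list of [speaker_line, concatenated_speech]
for line in text:
    if any(name in line for name in speakers):
        conversation.append([line, ''])  # this line introduces a new speaker
    elif conversation:
        # append to the current speaker's running speech
        conversation[-1][1] = (conversation[-1][1] + ' ' + line).strip()

print(conversation)
```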




Answer 2:


text = ['Justin Post - Bank of America',
 "Great. Thank you for taking my question. I guess the big one is the deceleration in unit growth or online stores.",
 "I know it's a tough 3Q comp, but could you comment a little bit about that?",
 'Brian Olsavsky - Amazon.com',
 "Thank you, Justin. Yeah, let me just remind you a couple of things from last year.",
 "We had two reactions on our Super Saver Shipping threshold in the first half.",
 "I'll just remind you that the units  those do not count",
 "In-stock is very strong, especially as we head into the holiday period.",
 'Dave Fildes - Amazon.com',
 "And, Justin, this is Dave. Just to add on to that. You mentioned the online stores"]

executives = ['Brian Olsavsky', 'Justin', 'Some Guy', 'Some Lady']

In addition, if you need the exact location, you could use this (the list is named text here to avoid shadowing the built-in str):

print([[name, text.index(s), s.index(name)] for name in executives for s in text if name in s])

this outputs

[['Brian Olsavsky', 3, 0], ['Justin', 0, 0], ['Justin', 4, 11], ['Justin', 9, 5]]
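One caveat: list.index returns the position of the first matching element, so the comprehension above would mislabel a sentence that appears twice in the list. An enumerate-based variant (a sketch with abbreviated data) avoids this:

```python
executives = ['Brian Olsavsky', 'Justin']
text = ['Justin Post - Bank of America',
        'Brian Olsavsky - Amazon.com',
        'Thank you, Justin.']

# enumerate yields each sentence with its own index, so duplicates are safe
locations = [[name, i, s.index(name)]
             for name in executives
             for i, s in enumerate(text)
             if name in s]
print(locations)
```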



Answer 3:


TLDR

This answer focuses on efficiency. Use the other answers if efficiency is not a key concern. If it is, build a dict from the corpus you are searching in, then use that dict to find what you are looking for.


#import stuff we need later

import string
import random
import numpy as np
import time
import matplotlib.pyplot as plt

Create example corpus

First we create a list of strings which we will search in.

Create random words (by which I mean random sequences of characters), with lengths drawn from a Poisson distribution, using this function:

def poissonlength_words(lam_word): #generating words, length chosen from a Poisson distrib
    return ''.join([random.choice(string.ascii_lowercase) for _ in range(np.random.poisson(lam_word))])

(lam_word being the parameter of the Poisson distribution.)

Let's create number_of_sentences variable-length sentences from these words (by a sentence I mean a list of the randomly generated words, separated by spaces).

The length of the sentences can also be drawn from a Poisson distribution.

lam_word=5
lam_sentence=1000
number_of_sentences = 10000

sentences = [' '.join([poissonlength_words(lam_word) for _ in range(np.random.poisson(lam_sentence))])
             for x in range(number_of_sentences)]

sentences[0] now will start like this:

tptt lxnwf iem fedg wbfdq qaa aqrys szwx zkmukc...

Let's also create the names we will search for. Let these names be bigrams. The first name (i.e. the first element of the bigram) will be n characters long, the last name (the second element) m characters long, both consisting of random characters:

def bigramgen(n,m):
    return ''.join([random.choice(string.ascii_lowercase) for _ in range(n)])+' '+\
           ''.join([random.choice(string.ascii_lowercase) for _ in range(m)])

The task

Let's say we want to find sentences in which a bigram such as ab c appears. We don't want to find dab c or ab cd, only places where ab c stands alone.
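The space-padded containment check used below assumes a space on both sides of the bigram, so it would miss a bigram at the very start or end of a sentence. If that matters, a whitespace-lookaround regex is one way to express the "stands alone" condition; this is a side sketch, not part of the benchmark:

```python
import re

def standalone(bigram, sentence):
    # match only when the bigram is not glued to other non-space characters
    pattern = r'(?<!\S)' + re.escape(bigram) + r'(?!\S)'
    return re.search(pattern, sentence) is not None

print(standalone('ab c', 'xx ab c yy'))   # True
print(standalone('ab c', 'xx dab c yy'))  # False
print(standalone('ab c', 'xx ab cd yy'))  # False
print(standalone('ab c', 'ab c yy'))      # True even at the sentence start
```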

To test how fast a method is, let's find an ever-increasing number of bigrams, and measure the elapsed time. The number of bigrams we search for can be, for example:

number_of_bigrams_we_search_for = [10,30,50,100,300,500,1000,3000,5000,10000]
  • Brute force method

Just loop through each bigram, loop through each sentence, use in to find matches. Meanwhile, measure elapsed time with time.time().

bruteforcetime=[]
for number_of_bigrams in number_of_bigrams_we_search_for:
    bigrams = [bigramgen(2,1) for _ in range(number_of_bigrams)]
    start = time.time()
    for bigram in bigrams:
        #the core of the brute force method starts here
        reslist=[]
        for sentencei, sentence in enumerate(sentences):
            if ' '+bigram+' ' in sentence:
                reslist.append([bigram,sentencei])
        #and ends here
    end = time.time()
    bruteforcetime.append(end-start)

bruteforcetime will hold the number of seconds necessary to find 10, 30, 50 ... bigrams.

Warning: this might take a long time for a high number of bigrams.

  • The sort your stuff to make it quicker method

Let's create an empty set for every word appearing in any of the sentences (using dict comprehension):

worddict={word:set() for sentence in sentences for word in sentence.split(' ')}

To each of these sets, add the index of every sentence in which the word appears:

for sentencei, sentence in enumerate(sentences):
    for wordi, word in enumerate(sentence.split(' ')):
        worddict[word].add(sentencei)
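On a toy corpus, the resulting mapping looks like this (a self-contained sketch, independent of the randomly generated data above):

```python
sentences = ['ab c x', 'c y ab', 'x y']

# one empty set per distinct word, then fill in the sentence indexes
worddict = {word: set() for sentence in sentences for word in sentence.split(' ')}
for sentencei, sentence in enumerate(sentences):
    for word in sentence.split(' '):
        worddict[word].add(sentencei)

print(worddict['ab'])  # {0, 1}
print(worddict['x'])   # {0, 2}
```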

Note that we only do this once, no matter how many bigrams we search later.

Using this dictionary, we can search for the sentences in which each part of the bigram appears. This is very fast, since a dict lookup is very fast. We then take the intersection of these sets: when we are searching for ab c, we get the set of sentence indexes in which both ab and c appear.

for bigram in bigrams:
    reslist = []
    setlist = [worddict[gram] for gram in bigram.split(' ')]
    intersection = set.intersection(*setlist)
    for candidate in intersection:
        if ' ' + bigram + ' ' in sentences[candidate]:
            reslist.append([bigram, candidate])

Let's put the whole thing together, and measure the time elapsed:

logtime=[]
for number_of_bigrams in number_of_bigrams_we_search_for:
    
    bigrams = [bigramgen(2,1) for _ in range(number_of_bigrams)]
    
    start_time=time.time()
    
    worddict={word:set() for sentence in sentences for word in sentence.split(' ')}

    for sentencei, sentence in enumerate(sentences):
        for wordi, word in enumerate(sentence.split(' ')):
            worddict[word].add(sentencei)

    for bigram in bigrams:
        reslist = []
        setlist = [worddict[gram] for gram in bigram.split(' ')]
        intersection = set.intersection(*setlist)
        for candidate in intersection:
            if ' ' + bigram + ' ' in sentences[candidate]:
                reslist.append([bigram, candidate])

    end_time=time.time()
    
    logtime.append(end_time-start_time)

Warning: this might take a long time for a high number of bigrams, but less than the brute force method.


Results

We can plot how much time each method took.

plt.plot(number_of_bigrams_we_search_for, bruteforcetime,label='linear')
plt.plot(number_of_bigrams_we_search_for, logtime,label='log')
plt.legend()
plt.xlabel('Number of bigrams searched')
plt.ylabel('Time elapsed (sec)')

Or, plotting the y axis on a log scale:

plt.plot(number_of_bigrams_we_search_for, bruteforcetime,label='linear')
plt.plot(number_of_bigrams_we_search_for, logtime,label='log')
plt.yscale('log')
plt.legend()
plt.xlabel('Number of bigrams searched')
plt.ylabel('Time elapsed (sec)')

Giving us the plots:

Making the worddict dictionary takes a lot of time, which is a disadvantage when searching for a small number of names. There is a point, however, where the corpus is big enough and the number of names we are searching for is high enough that this cost is outweighed by the speed of searching in the dict, compared to the brute force method. If these conditions are satisfied, I recommend using this method.


(Notebook available here.)



Source: https://stackoverflow.com/questions/54492158/how-to-efficiently-search-for-a-list-of-strings-in-another-list-of-strings-using
