Search a list of strings for any sub-string from another list

挽巷 2020-12-29 06:05

Given these 3 lists of data and a list of keywords:

good_data1 = ['hello, world', 'hey, world']
good_data2 = ['hey, man', 'whats up']
bad_data = ['h


        
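The question text is truncated above. Judging from the answers below, the "original" approach they benchmark against was presumably a plain nested scan, roughly like the following hypothetical reconstruction (the keywords list shown is an assumption, not from the question):

keywords = ['hello', 'hey']  # assumed example keywords

def check_data(data):
    # Check every item against every keyword, one pair at a time.
    for s in data:
        for k in keywords:
            if k in s:
                return True
    return False
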
5 Answers
  • 2020-12-29 06:58

    In your example, with so few items, it doesn't really matter. But if you have a list of several thousand items, this might help.

    Since you don't care which element in the list contains the keyword, you can scan the whole list once (as one string) instead of one item at a time. For that you need a join character that you know won't occur in the keywords, in order to avoid false positives. I use the newline in this example.

    def check_data(data):
        # Join the whole list into one long string so each keyword
        # is searched for once, instead of once per item.
        s = "\n".join(data)
        for k in keywords:
            if k in s:
                return True
        return False
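
    For instance, with a hypothetical keywords list:

    keywords = ['hello', 'hey']  # hypothetical keywords for illustration

    print(check_data(['hello, world', 'hey, world']))  # True
    print(check_data(['hi, earth']))                   # False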
    

    In my completely unscientific test, my version checked a list of 5000 items 100000 times in about 30 seconds. I stopped your version after 3 minutes -- got tired of waiting to post =)

  • 2020-12-29 06:59

    If you have many keywords, you might want to try a suffix tree [1]. Insert all the words from the three data lists, storing which list each word comes from in its terminating node. Then you can perform queries on the tree for each keyword really, really fast.

    Warning: suffix trees are very complicated to implement!

    [1] http://en.wikipedia.org/wiki/Suffix_tree
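
    A far simpler stand-in that captures the same query idea is a naive "suffix array": a sorted list of all suffixes of the joined data, searched by bisection. This sketch is illustration only (building it is quadratic, unlike a proper suffix tree), and the function names are made up:

    import bisect

    def build_suffixes(data, sep="\n"):
        # Naive suffix array: every suffix of the joined text, sorted.
        text = sep.join(data)
        return sorted(text[i:] for i in range(len(text)))

    def occurs(suffixes, keyword):
        # A keyword occurs in the text iff it is a prefix of some
        # suffix; binary search locates the candidate in O(log n).
        i = bisect.bisect_left(suffixes, keyword)
        return i < len(suffixes) and suffixes[i].startswith(keyword)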

  • 2020-12-29 06:59

    You may be able to improve matters by building your list of keywords as a regular expression.

    This may allow them to be tested in parallel, but will very much depend on what the keywords are (e.g. some work may be reused testing for "hello" and "hell", rather than searching every phrase from the start for each word).

    You could do this by executing:

    import re
    keyword_re = re.compile("|".join(map(re.escape, keywords)))
    

    Then:

    >>> bool(keyword_re.search('hello, world'))
    True
    >>> bool(keyword_re.search('hi, earth'))
    False
    

    (It will actually return a match object when a match is found, and None if not found - this might be useful if you need to know which keyword matched.)
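
    For example, to see which keyword matched (assuming 'hello' is among the keywords):

    m = keyword_re.search('hello, world')
    if m:
        print(m.group(0))  # the keyword that matched, e.g. 'hello'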

    However, how much (if anything) this gains you will depend on the keywords. If you only have one or two, keep your current approach. If you have a large list, it may be worth trying and profiling to see which performs better.

    [Edit] For reference, here's how the approaches do for your example:

                   good1   good2  good3   bad1   bad2
    original     : 0.206   0.233  0.229   0.390   63.879
    gnud (join)  : 0.257   0.347  4.600   0.281    6.706
    regex        : 0.766   1.018  0.397   0.764  124.351
    regex (join) : 0.345   0.337  3.305   0.481   48.666
    

    Obviously for this case, your approach performs far better than the regex one. Whether this will always be the case depends a lot on the number and complexity of keywords, and the input data that will be checked. For large numbers of keywords, and lengthy lists or rarely matching phrases, regexes may work better, but do get timing information, and perhaps try even simpler optimisations (like moving the most common words to the front of your keyword list) first. Sometimes the simplest approach really is the best.

    [Edit2] Updated the table with gnud's solution, and a similar approach before applying the regexes. I also added 2 new tests:

    good_data3 = good_data2 * 500  # 1000 items, the first of which matches.
    bad_data2 = bad_data * 500     # 1000 items, none of which matches.
    

    These show up the various strengths and weaknesses. Joining does do worse when a match would immediately be found (as there is an always-paid, up-front cost in joining the list - this is the best possible case for the linear search method); however, for non-matching lists it performs better, and much better when there are a large number of items in the list.
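
    For reference, the "regex (join)" row in the table above presumably combines both ideas, roughly:

    def check_data_regex_join(data):
        # Join the list once, then make a single pass with the
        # compiled alternation regex.
        return keyword_re.search("\n".join(data)) is not None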

  • 2020-12-29 07:06

    I think this is pretty efficient and clear, though you could use map() to avoid the many nested loops. I agree with ross on the dictionary idea for larger lists.

  • 2020-12-29 07:09

    Are you looking for

    any(k in s for k in keywords)
    

    It's more compact, but might be less efficient.
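
    Applied to a whole data list, this extends naturally to a single generator expression over both loops:

    def check_data(data):
        # True as soon as any keyword occurs in any item.
        return any(k in s for s in data for k in keywords)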
