Pandas filtering for multiple substrings in series

说谎 2020-11-22 04:08

I need to filter rows in a pandas dataframe so that a specific string column contains at least one of a list of provided substrings. The substrings may have unu

3 answers
  • 2020-11-22 04:26

    Here is a simpler example that also ignores case (upper or lower).

    Filtering and getting a binary vector:

    I want to find all elements of a pd.Series, v, that contain "at" or "Og", and get 1 if the element contains the pattern or 0 if it doesn't.

    I'll use the re module (plus pandas itself):
    import re
    import pandas as pd
    

    My vector:

    v=pd.Series(['cAt','dog','the rat','mouse','froG'])
    
    [Out]:
    
    0        cAt
    1        dog
    2    the rat
    3      mouse
    4       froG
    

    Since I want the elements of v that contain "at" or "Og", I can define my pattern as:

    pattern='at|Og'
    

    I want a vector with 1 where the item contains the pattern and 0 where it doesn't.

    I create a vector of ones with the same length as v:

    v_binary=[1]*len(v)
    

    I obtain a boolean Series s that is True if an element of v contains the pattern and False if it doesn't:

    s=v.str.contains(pattern, flags=re.IGNORECASE, regex=True)
    

    To obtain the binary vector, I multiply v_binary by s:

    v_binary*s
    
    [Out]
    
    0    1
    1    1
    2    1
    3    0
    4    1
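
    The same binary vector can also be obtained without the helper list of ones. A minimal sketch, assuming the same v and pattern as above: case=False stands in for flags=re.IGNORECASE, and astype(int) converts the booleans to 1/0.

```python
import pandas as pd

v = pd.Series(['cAt', 'dog', 'the rat', 'mouse', 'froG'])
pattern = 'at|Og'

# case=False makes the regex match case-insensitively;
# astype(int) maps True/False to 1/0.
v_binary = v.str.contains(pattern, case=False, regex=True).astype(int)
print(v_binary.tolist())
```

    The result has the same values as v_binary*s above, and the index of v is preserved.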
    
  • 2020-11-22 04:32

    You could try using the Aho-Corasick algorithm. In the average case, it is O(n+m+p), where n is the total length of the search strings, m is the length of the searched text, and p is the number of output matches.

    The Aho-Corasick algorithm is often used to find multiple patterns (needles) in an input text (the haystack).

    pyahocorasick is a Python wrapper around a C implementation of the algorithm.
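
    To give a feel for what the automaton does internally, here is a small pure-Python sketch of the algorithm. It is only an illustration, not pyahocorasick's implementation; the names build_automaton and find_all are made up for this example.

```python
from collections import deque

def build_automaton(words):
    """Build goto, fail, and output tables for the given words."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> fallback state on a mismatch
    out = [[]]    # out[state] -> words recognised on reaching this state

    # Phase 1: insert every word into a trie.
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)

    # Phase 2: breadth-first pass that fills in the failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the words that end at the fallback state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(text, automaton):
    """Return (end_index, word) for every occurrence of every word in text."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        # Follow failure links until some state accepts ch (or we reach the root).
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            matches.append((i, word))
    return matches

automaton = build_automaton(["he", "she", "his", "hers"])
matches = find_all("ushers", automaton)
print(matches)
```

    The text is scanned once, whatever the number of needles: in "ushers" the sketch reports "she" and "he" ending at index 3 and "hers" ending at index 5.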


    Let's compare how fast it is versus some alternatives. Below is a benchmark showing using_ahocorasick to be over 30x faster than the original method (shown in the question) on a 50K-row DataFrame test case:

    |                    |     speed factor | ms per loop |
    |                    | compared to orig |             |
    |--------------------+------------------+-------------|
    | using_ahocorasick  |            30.7x |         140 |
    | using_regex        |             2.7x |        1580 |
    | orig               |             1.0x |        4300 |
    

    In [89]: %timeit using_ahocorasick(col, lst)
    10 loops, best of 3: 140 ms per loop
    
    In [88]: %timeit using_regex(col, lst)
    1 loop, best of 3: 1.58 s per loop
    
    In [91]: %timeit orig(col, lst)
    1 loop, best of 3: 4.3 s per loop
    

    Here is the setup used for the benchmark. It also verifies that the output matches the result returned by orig:

    import numpy as np
    import random
    import pandas as pd
    import ahocorasick
    import re
    
    random.seed(321)
    
    def orig(col, lst):
        mask = np.logical_or.reduce([col.str.contains(i, regex=False, case=False) 
                                     for i in lst])
        return mask
    
    def using_regex(col, lst):
        """https://stackoverflow.com/a/48590850/190597 (Alex Riley)"""
        esc_lst = [re.escape(s) for s in lst]
        pattern = '|'.join(esc_lst)
        mask = col.str.contains(pattern, case=False)
        return mask
    
    def using_ahocorasick(col, lst):
        A = ahocorasick.Automaton(ahocorasick.STORE_INTS)
        for word in lst:
            A.add_word(word.lower())
        A.make_automaton() 
        col = col.str.lower()
        mask = col.apply(lambda x: bool(list(A.iter(x))))
        return mask
    
    N = 50000
    # 100 substrings of 5 characters
    lst = [''.join([chr(random.randint(0, 256)) for _ in range(5)]) for _ in range(100)]
    
    # N strings of 20 characters
    strings = [''.join([chr(random.randint(0, 256)) for _ in range(20)]) for _ in range(N)]
    # leave about 10% of the strings unchanged and append a string from lst to the rest,
    # so that most rows match; this helps check that our method works
    strings = [_ if random.randint(0, 99) < 10 else _+random.choice(lst) for _ in strings]
    
    col = pd.Series(strings)
    
    expected = orig(col, lst)
    for name, result in [('using_regex', using_regex(col, lst)),
                         ('using_ahocorasick', using_ahocorasick(col, lst))]:
        status = 'pass' if np.allclose(expected, result) else 'fail'
        print('{}: {}'.format(name, status))
    
  • 2020-11-22 04:37

    If you're sticking to using pure-pandas, for both performance and practicality I think you should use regex for this task. However, you will need to properly escape any special characters in the substrings first to ensure that they are matched literally (and not used as regex meta characters).

    This is easy to do using re.escape:

    >>> import re
    >>> esc_lst = [re.escape(s) for s in lst]
    

    These escaped substrings can then be joined using a regex pipe |. Each of the substrings can be checked against a string until one matches (or they have all been tested).

    >>> pattern = '|'.join(esc_lst)
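
    To see why the escaping step matters, here is a small sketch using only the re module. The substrings are hypothetical and were chosen because they contain regex metacharacters.

```python
import re

# Hypothetical substrings containing regex metacharacters.
substrings = ["a.b", "c+", "(x)"]

raw_pattern = "|".join(substrings)
escaped_pattern = "|".join(re.escape(s) for s in substrings)

text = "axb"  # contains none of the substrings literally

# Unescaped, "a.b" is read as "a, any character, b", so it matches "axb".
raw_hit = re.search(raw_pattern, text) is not None

# Escaped, each substring is matched literally, so there is no match.
escaped_hit = re.search(escaped_pattern, text) is not None

print(raw_hit, escaped_hit)
```

    Without re.escape the pattern reports a false positive for "axb"; with it, only literal occurrences of the substrings match.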
    

    The masking stage then becomes a single low-level loop through the rows:

    df[col].str.contains(pattern, case=False)
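
    Put together, the filtering step might look like the sketch below. The DataFrame, the column name "name", and the substrings are made up for illustration. na=False is an addition not used above; it treats missing values as non-matches so that the mask can be used directly for indexing.

```python
import re
import pandas as pd

# Hypothetical data: one string column with a missing value.
df = pd.DataFrame({"name": ["price is $5", "C++ rocks", "plain text", None, "a+b=c"]})
lst = ["$5", "c++", "+b"]

# Escape each substring so it is matched literally, then join with the regex pipe.
pattern = "|".join(re.escape(s) for s in lst)

# case=False ignores case; na=False turns missing values into False.
mask = df["name"].str.contains(pattern, case=False, na=False)
filtered = df[mask]
print(filtered)
```

    Only the rows containing at least one of the substrings are kept; the row with the missing value is dropped rather than raising an error.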
    

    Here's a simple setup to get a sense of performance:

    import re
    from random import randint, seed

    import pandas as pd
    
    seed(321)
    
    # 100 substrings of 5 characters
    lst = [''.join([chr(randint(0, 256)) for _ in range(5)]) for _ in range(100)]
    
    # 50000 strings of 20 characters
    strings = [''.join([chr(randint(0, 256)) for _ in range(20)]) for _ in range(50000)]
    
    col = pd.Series(strings)
    esc_lst = [re.escape(s) for s in lst]
    pattern = '|'.join(esc_lst)
    

    The proposed method takes about 1 second (so maybe up to 20 seconds for 1 million rows):

    %timeit col.str.contains(pattern, case=False)
    1 loop, best of 3: 981 ms per loop
    

    The method in the question took approximately 5 seconds using the same input data.

    It's worth noting that these times are 'worst case' in the sense that there were no matches (so all substrings were checked). If there are matches, then the timing will improve.
