Pandas filtering for multiple substrings in series

说谎 2020-11-22 04:08

I need to filter rows in a pandas dataframe so that a specific string column contains at least one of a list of provided substrings. The substrings may have unu

3 answers
  •  梦毁少年i
    2020-11-22 04:37

    If you're sticking to pure pandas, I think you should use regex for this task, for both performance and practicality. However, you will need to escape any special characters in the substrings first so that they are matched literally (and not interpreted as regex metacharacters).

    This is easy to do using re.escape:

    >>> import re
    >>> esc_lst = [re.escape(s) for s in lst]
    

    These escaped substrings can then be joined using a regex pipe |. Each of the substrings can be checked against a string until one matches (or they have all been tested).

    >>> pattern = '|'.join(esc_lst)
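
To make the escaping and joining steps concrete, here is a minimal sketch using only the standard library. The substrings in `lst` and the helper `contains_any` are hypothetical, chosen so that several substrings contain regex metacharacters.

```python
import re

# Hypothetical substrings; several contain regex metacharacters.
lst = ["a.b", "c++", "(x)", "1|2"]

# Escape each substring so it is matched literally, then join with "|".
esc_lst = [re.escape(s) for s in lst]
pattern = "|".join(esc_lst)


def contains_any(text):
    """Return True if text contains at least one substring from lst."""
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(contains_any("I write C++ daily"))  # literal "c++", matched case-insensitively
print(contains_any("axb"))                # "." is no longer a wildcard after escaping
```

Without `re.escape`, the unescaped pattern `a.b` would also match `"axb"`, because `.` would be read as "any character" rather than a literal dot.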
    

    The masking stage then becomes a single low-level loop through the rows:

    df[col].str.contains(pattern, case=False)
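
Put together, the filter might look like the sketch below. The DataFrame, the column name `text`, and the substrings are made up for illustration; `na=False` is an addition not in the original answer, used here so that missing values count as non-matches.

```python
import re

import pandas as pd

# Hypothetical data: the frame, column name, and substrings are illustrative only.
df = pd.DataFrame(
    {"text": ["Price: $5.00", "price 5x00", "C++ guide", "plain row", None]}
)
lst = ["$5.00", "c++"]

# Escape, then join into a single alternation pattern.
pattern = "|".join(re.escape(s) for s in lst)

# na=False treats missing values as non-matches, so the mask is purely boolean.
mask = df["text"].str.contains(pattern, case=False, na=False)
filtered = df[mask]
print(filtered)
```

Here `mask` is a boolean Series aligned with `df`, so `df[mask]` keeps only the rows whose `text` contains at least one of the substrings.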
    

    Here's a simple setup to get a sense of performance:

    import re
    from random import randint, seed
    
    import pandas as pd
    
    seed(321)
    
    # 100 substrings of 5 characters
    lst = [''.join([chr(randint(0, 256)) for _ in range(5)]) for _ in range(100)]
    
    # 50000 strings of 20 characters
    strings = [''.join([chr(randint(0, 256)) for _ in range(20)]) for _ in range(50000)]
    
    col = pd.Series(strings)
    esc_lst = [re.escape(s) for s in lst]
    pattern = '|'.join(esc_lst)
    

    The proposed method takes about 1 second for these 50,000 rows (so maybe up to 20 seconds for 1 million rows, assuming roughly linear scaling):

    %timeit col.str.contains(pattern, case=False)
    1 loop, best of 3: 981 ms per loop
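
`%timeit` is an IPython magic, so it will not work in a plain script. A rough equivalent using the standard `timeit` module is sketched below; the helper `random_string` is a hypothetical convenience wrapper, and the timings will depend on your machine and pandas version.

```python
import re
import timeit
from random import randint, seed

import pandas as pd

seed(321)


def random_string(length):
    """Build a string of random characters with code points 0..256."""
    return "".join(chr(randint(0, 256)) for _ in range(length))


# 100 substrings of 5 characters, 50000 strings of 20 characters
lst = [random_string(5) for _ in range(100)]
strings = [random_string(20) for _ in range(50000)]

col = pd.Series(strings)
pattern = "|".join(re.escape(s) for s in lst)

# repeat=3, number=1 mirrors "1 loop, best of 3".
times = timeit.repeat(
    lambda: col.str.contains(pattern, case=False), repeat=3, number=1
)
best = min(times)
print(f"1 loop, best of 3: {best:.3f} s per loop")
```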
    

    The method in the question took approximately 5 seconds using the same input data.

    It's worth noting that these times are 'worst case' in the sense that there were no matches (so all substrings were checked). If there are matches, then the timing will improve.
