Check if pandas column contains all elements from a list

忘掉有多难 2020-12-09 03:14

I have a df like this:

frame = pd.DataFrame({'a' : ['a,b,c', 'a,c,f', 'b,d,f','a,z,c']})

And a list of items:

let         


        
7 answers
  • 2020-12-09 03:34

    IIUC, you can use explode with a boolean filter.

    The idea is to explode the split values into a single Series, then group by the original index and count the matches against your list with a cumulative sum:

    s = frame['a'].str.split(',').explode().isin(letters).groupby(level=0).cumsum()
    
    print(s)
    
    0    1.0
    0    1.0
    0    2.0
    1    1.0
    1    2.0
    1    2.0
    2    0.0
    2    0.0
    2    0.0
    3    1.0
    3    1.0
    3    2.0
    

    frame.loc[s[s.ge(len(letters))].index.unique()]
    
    out:
    
           a
    0  a,b,c
    1  a,c,f
    3  a,z,c
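
    The cumulative sum above counts every matching token, so a row that repeats one letter could reach the threshold without containing the other. A minimal sketch of a variant that counts distinct matches per row instead, assuming letters = ['a', 'c'] as in the other answers (the names tokens, hits and mask are illustrative only):

```python
import pandas as pd

frame = pd.DataFrame({'a': ['a,b,c', 'a,c,f', 'b,d,f', 'a,z,c']})
letters = ['a', 'c']  # assumed list, as in the other answers

# One row per token, keeping the original row label in the index.
tokens = frame['a'].str.split(',').explode()

# Count the distinct wanted tokens found in each original row.
hits = tokens[tokens.isin(letters)].groupby(level=0).nunique()

# Rows without any hit are absent from `hits`, so reindex with 0 first.
mask = hits.reindex(frame.index, fill_value=0).eq(len(set(letters)))
result = frame[mask]
print(result)
```

    On the sample frame this should select rows 0, 1 and 3, the same rows as above.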
    
  • 2020-12-09 03:36

    This also solves it:

    frame[frame['a'].apply(lambda x: np.all([*map(lambda l: l in x, letters)]))]
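
    Note that `l in x` here is a substring test on the raw string, which is fine for single-letter items but can over-match once the items are longer. A small sketch of the difference, using made-up data (demo and wanted are hypothetical, not from the question):

```python
import pandas as pd

# Hypothetical data with a multi-character token, not from the question.
demo = pd.DataFrame({'a': ['ab,c', 'a,c', 'b,c']})
wanted = ['a', 'c']

# Substring test on the raw string: 'a' is found inside 'ab'.
by_substring = demo['a'].apply(lambda x: all(l in x for l in wanted))

# Token test: split first, then check that every wanted item is a token.
by_token = demo['a'].apply(lambda x: set(wanted) <= set(x.split(',')))

print(by_substring.tolist())  # [True, True, False]
print(by_token.tolist())      # [False, True, False]
```

    The first row passes the substring test only because 'a' occurs inside 'ab'.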
    
  • 2020-12-09 03:39

    One way is to split the column values into lists using str.split, and check if set(letters) is a subset of the obtained lists:

    letters_s = set(letters)
    frame[frame.a.str.split(',').map(letters_s.issubset)]
    
           a
    0  a,b,c
    1  a,c,f
    3  a,z,c
    

    Benchmark:

    import numpy as np
    import pandas as pd
    import perfplot
    
    def serge(frame):
        contains = [frame['a'].str.contains(i) for i in letters]
        return frame[np.all(contains, axis=0)]
    
    def yatu(frame):
        letters_s = set(letters)
        return frame[frame.a.str.split(',').map(letters_s.issubset)]
    
    def austin(frame):
        # require every letter to be present, not just any overlap
        mask = frame.a.apply(lambda x: np.intersect1d(x.split(','), letters).size == len(set(letters)))
        return frame[mask]
    
    def datanovice(frame):
        s = frame['a'].str.split(',').explode().isin(letters).groupby(level=0).cumsum()
        return frame.loc[s[s.ge(len(letters))].index.unique()]
    
    perfplot.show(
        setup=lambda n: pd.concat([frame]*n, axis=0).reset_index(drop=True), 
    
        kernels=[
            lambda df: serge(df),
            lambda df: yatu(df),
            lambda df: df[df['a'].apply(lambda x: np.all([*map(lambda l: l in x, letters)]))],
            lambda df: austin(df),
            lambda df: datanovice(df),
        ],
    
        labels=['serge', 'yatu', 'bruno','austin', 'datanovice'],
        n_range=[2**k for k in range(0, 18)],
        equality_check=lambda x, y: x.equals(y),
        xlabel='N'
    )
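
    If perfplot is not installed, a rough comparison can also be made with the standard-library timeit module. The following is only a sketch: it times two of the approaches on an arbitrarily sized frame, assumes letters = ['a', 'c'], and uses illustrative function names (by_contains, by_subset) that do not appear in the answers.

```python
import timeit

import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': ['a,b,c', 'a,c,f', 'b,d,f', 'a,z,c']})
letters = ['a', 'c']  # assumed list, as in the other answers


def by_contains(df):
    # One literal-substring mask per letter, combined row-wise.
    contains = [df['a'].str.contains(i, regex=False) for i in letters]
    return df[np.all(contains, axis=0)]


def by_subset(df):
    # Split into tokens and test set inclusion.
    letters_s = set(letters)
    return df[df['a'].str.split(',').map(letters_s.issubset)]


# Repeat the four sample rows to get a larger frame (size chosen arbitrarily).
big = pd.concat([frame] * 1000, ignore_index=True)

# Both candidates must agree before their timings are worth comparing.
assert by_contains(big).equals(by_subset(big))

timings = {}
for func in (by_contains, by_subset):
    # Best of 3 repeats of 5 calls each, reported per call in seconds.
    best = min(timeit.repeat(lambda: func(big), number=5, repeat=3)) / 5
    timings[func.__name__] = best
    print(f'{func.__name__}: {best * 1e3:.2f} ms per call')
```

    Absolute numbers depend on the machine and the pandas version, so only the relative ordering is meaningful.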
    

  • 2020-12-09 03:47

    I would build a list of Series, and then apply a vectorized np.all:

    contains = [frame['a'].str.contains(i) for i in letters]
    result = frame[np.all(contains, axis=0)]
    

    This gives the expected result:

           a
    0  a,b,c
    1  a,c,f
    3  a,z,c
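
    One caveat: str.contains treats its pattern as a regular expression by default, so items containing characters such as '.' may match more than intended. A sketch of the difference with regex=False, on made-up data (demo and wanted are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one wanted item contains a regex metacharacter.
demo = pd.DataFrame({'a': ['a.b,c', 'axb,c', 'a.b,d']})
wanted = ['a.b', 'c']

# Default regex=True: '.' matches any character, so 'axb' also matches 'a.b'.
as_regex = np.all([demo['a'].str.contains(w) for w in wanted], axis=0)

# regex=False: each item is matched as a literal substring.
as_literal = np.all(
    [demo['a'].str.contains(w, regex=False) for w in wanted], axis=0
)

print(as_regex.tolist())    # [True, True, False]
print(as_literal.tolist())  # [True, False, False]
```

    Even with regex=False this remains a substring match, so it is not a strict per-item comparison.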
    
  • 2020-12-09 03:47
    frame.iloc[[x for x in range(len(frame)) if set(letters).issubset(frame.iloc[x,0])]]
    

    output:

            a
     0  a,b,c
     1  a,c,f
     3  a,z,c
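
    This works on the sample because every item is a single character: issubset iterates over the string character by character. With longer items the cell would need to be split first. A sketch on made-up data (demo and wanted are hypothetical):

```python
import pandas as pd

# Hypothetical data with multi-character tokens, not from the question.
demo = pd.DataFrame({'a': ['ab,cd', 'a,b,c,d', 'cd,ab,x']})
wanted = ['ab', 'cd']

# Iterating a string yields single characters, so 'ab' is never found.
per_char = [set(wanted).issubset(s) for s in demo['a']]

# Splitting first turns each cell into a list of whole tokens.
per_token = [set(wanted).issubset(s.split(',')) for s in demo['a']]

print(per_char)   # [False, False, False]
print(per_token)  # [True, False, True]
print(demo[per_token])
```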
    

    timeit

    %%timeit
    #hermes
    frame.iloc[[x for x in range(len(frame)) if set(letters).issubset(frame.iloc[x,0])]]
    

    output

    300 µs ± 32.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
  • 2020-12-09 03:51

    You can use np.intersect1d:

    import pandas as pd
    import numpy as np
    
    frame = pd.DataFrame({'a' : ['a,b,c', 'a,c,f', 'b,d,f','a,z,c']})
    letters = ['a','c']
    
    # keep a row only if every letter is present, not just any of them
    mask = frame.a.apply(lambda x: np.intersect1d(x.split(','), letters).size == len(set(letters)))
    print(frame[mask])
    
           a
    0  a,b,c
    1  a,c,f
    3  a,z,c
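
    The sample frame cannot distinguish an "any letter" test from an "all letters" test, because row 2 shares no letter with the list at all. A sketch with one hypothetical extra row ('a,d,e') shows the difference and uses np.isin to state the "all" requirement explicitly:

```python
import numpy as np
import pandas as pd

# Hypothetical extra row 'a,d,e' that contains only one of the two letters.
demo = pd.DataFrame({'a': ['a,b,c', 'a,c,f', 'b,d,f', 'a,z,c', 'a,d,e']})
letters = ['a', 'c']

# True when the row shares at least one item with `letters`.
any_mask = demo['a'].apply(
    lambda x: np.intersect1d(x.split(','), letters).size > 0
)

# True only when every item of `letters` occurs among the row's tokens.
all_mask = demo['a'].apply(
    lambda x: bool(np.isin(letters, x.split(',')).all())
)

print(demo[any_mask].index.tolist())  # [0, 1, 3, 4]
print(demo[all_mask].index.tolist())  # [0, 1, 3]
```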
    