Delete column from pandas DataFrame

一生所求 2020-11-22 02:44

When deleting a column in a DataFrame I use:

del df['column_name']

And this works great. Why can't I use the following?

         


        
17 answers
  •  借酒劲吻你
    2020-11-22 03:24

    TL;DR

    A lot of effort went into finding a solution that is only marginally more efficient. The added complexity is hard to justify when it means giving up the simplicity of df.drop(dlst, axis=1, errors='ignore'). One such alternative:

    df.reindex(np.setdiff1d(df.columns.values, dlst), axis=1)  # spelled df.reindex_axis(labels, 1) before pandas 1.0
    

    Preamble
    Deleting a column is semantically the same as selecting the other columns. I'll show a few additional methods to consider.

    I'll also focus on the general solution of deleting multiple columns at once and allowing for the attempt to delete columns not present.

    These solutions are general and work for the simple case as well.


    Setup
    Consider the pd.DataFrame df and the list of columns to delete, dlst:

    import numpy as np
    import pandas as pd
    df = pd.DataFrame(dict(zip('ABCDEFGHIJ', range(1, 11))), range(3))
    dlst = list('HIJKLM')
    

    df
    
       A  B  C  D  E  F  G  H  I   J
    0  1  2  3  4  5  6  7  8  9  10
    1  1  2  3  4  5  6  7  8  9  10
    2  1  2  3  4  5  6  7  8  9  10
    

    dlst
    
    ['H', 'I', 'J', 'K', 'L', 'M']
    

    The result should look like:

    df.drop(dlst, axis=1, errors='ignore')
    
       A  B  C  D  E  F  G
    0  1  2  3  4  5  6  7
    1  1  2  3  4  5  6  7
    2  1  2  3  4  5  6  7
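
    As a quick check of the target result, here is a minimal sketch. It assumes a recent pandas release, where axis has to be passed by keyword; the names kept, same and raised are mine, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(dict(zip("ABCDEFGHIJ", range(1, 11))), range(3))
dlst = list("HIJKLM")

# errors="ignore" silently skips labels that are not present (K, L, M here).
kept = df.drop(columns=dlst, errors="ignore")
same = df.drop(dlst, axis=1, errors="ignore")  # equivalent spelling

# Without errors="ignore", the missing labels raise a KeyError.
try:
    df.drop(columns=dlst)
    raised = False
except KeyError:
    raised = True

print(kept.columns.tolist())
```

    drop returns a new frame, so df itself still has all ten columns afterwards.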
    

    Since I'm equating deleting a column to selecting the other columns, I'll break it into two types:

    1. Label selection
    2. Boolean selection

    Label Selection

    We start by building the list/array of labels for the columns we want to keep, leaving out the columns we want to delete.

    1. df.columns.difference(dlst)

      Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')
      
    2. np.setdiff1d(df.columns.values, dlst)

      array(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype=object)
      
    3. df.columns.drop(dlst, errors='ignore')

      Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')
      
    4. list(set(df.columns.values.tolist()).difference(dlst))

      # does not preserve order
      ['E', 'D', 'B', 'F', 'G', 'A', 'C']
      
    5. [x for x in df.columns.values.tolist() if x not in dlst]

      ['A', 'B', 'C', 'D', 'E', 'F', 'G']
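
    One thing the example above hides is ordering, because its columns are already alphabetical. The sketch below uses a hypothetical frame, df2, with shuffled column names to show which label builders sort their result and which keep the original order.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose columns are deliberately not in alphabetical order.
df2 = pd.DataFrame([[1, 2, 3, 4]], columns=list("CABH"))
dlst = list("HIJ")

by_difference = df2.columns.difference(dlst).tolist()             # sorted
by_setdiff1d = np.setdiff1d(df2.columns.values, dlst).tolist()     # sorted
by_index_drop = df2.columns.drop(dlst, errors="ignore").tolist()   # original order
by_comprehension = [c for c in df2.columns if c not in dlst]       # original order

print(by_difference, by_index_drop)
```

    If column order matters, Index.drop or the comprehension is probably the safer choice.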
      

    Columns from Labels
    For the sake of comparing the selection process, assume:

     cols = [x for x in df.columns.values.tolist() if x not in dlst]
    

    Then we can evaluate:

    1. df.loc[:, cols]
    2. df[cols]
    3. df.reindex(columns=cols)
    4. df.reindex(cols, axis=1)  (spelled df.reindex_axis(cols, 1) before pandas 1.0)

    These all evaluate to:

       A  B  C  D  E  F  G
    0  1  2  3  4  5  6  7
    1  1  2  3  4  5  6  7
    2  1  2  3  4  5  6  7
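
    The selectors agree when every label exists, but they differ on unknown labels. A small sketch, assuming pandas 1.0 or later; the label "Z" and the variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(dict(zip("ABCDEFGHIJ", range(1, 11))), range(3))
dlst = list("HIJKLM")
cols = [c for c in df.columns if c not in dlst]

via_loc = df.loc[:, cols]
via_getitem = df[cols]
via_reindex = df.reindex(columns=cols)
via_reindex_axis = df.reindex(cols, axis=1)  # stands in for df.reindex_axis(cols, 1)

# reindex tolerates an unknown label by adding an all-NaN column,
# while loc (and plain [] indexing) raise a KeyError instead.
with_missing = df.reindex(columns=["A", "Z"])
try:
    df.loc[:, ["A", "Z"]]
    loc_raised = False
except KeyError:
    loc_raised = True
```

    This matters little here because cols is built from df.columns, but it is worth knowing before passing hand-written label lists to reindex.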
    

    Boolean Slice

    We can construct an array/list of booleans for slicing

    1. ~df.columns.isin(dlst)
    2. ~np.in1d(df.columns.values, dlst)
    3. [x not in dlst for x in df.columns.values.tolist()]
    4. (df.columns.values[:, None] != dlst).all(1)
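
    To confirm that the four constructions give the same mask, here is a short sketch. It uses np.isin, which the NumPy documentation recommends over np.in1d for new code, and casts dlst to an object array for the broadcast comparison; both substitutions are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(zip("ABCDEFGHIJ", range(1, 11))), range(3))
dlst = list("HIJKLM")
labels = df.columns.values

mask_isin = ~df.columns.isin(dlst)
mask_np = ~np.isin(labels, dlst)
mask_comp = np.array([x not in dlst for x in labels.tolist()])
# Broadcasting: compare a (10, 1) column of labels against the (6,) delete list,
# then keep a label only if it differs from every entry.
mask_brod = (labels[:, None] != np.array(dlst, dtype=object)).all(axis=1)

print(mask_isin)
```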

    Columns from Boolean
    For the sake of comparison, assume:

    bools = [x not in dlst for x in df.columns.values.tolist()]
    
    1. df.loc[:, bools]

    This evaluates to:

       A  B  C  D  E  F  G
    0  1  2  3  4  5  6  7
    1  1  2  3  4  5  6  7
    2  1  2  3  4  5  6  7
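
    A minimal sketch of applying the mask, with variable names of my own choosing. Note that a boolean list passed to df[...] filters rows rather than columns, so the mask has to go through loc (or through the column Index first).

```python
import pandas as pd

df = pd.DataFrame(dict(zip("ABCDEFGHIJ", range(1, 11))), range(3))
dlst = list("HIJKLM")
bools = [x not in dlst for x in df.columns]

# The comma matters: ":" selects every row, the mask selects the columns.
kept = df.loc[:, bools]
# Equivalent: filter the column Index with the mask, then select by label.
kept_by_label = df[df.columns[bools]]

print(kept.columns.tolist())
```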
    

    Robust Timing

    Functions

    setdiff1d = lambda df, dlst: np.setdiff1d(df.columns.values, dlst)
    difference = lambda df, dlst: df.columns.difference(dlst)
    columndrop = lambda df, dlst: df.columns.drop(dlst, errors='ignore')
    setdifflst = lambda df, dlst: list(set(df.columns.values.tolist()).difference(dlst))
    comprehension = lambda df, dlst: [x for x in df.columns.values.tolist() if x not in dlst]
    
    loc = lambda df, cols: df.loc[:, cols]
    slc = lambda df, cols: df[cols]
    ridx = lambda df, cols: df.reindex(columns=cols)
    ridxa = lambda df, cols: df.reindex(cols, axis=1)  # df.reindex_axis(cols, 1) before pandas 1.0
    
    isin = lambda df, dlst: ~df.columns.isin(dlst)
    in1d = lambda df, dlst: ~np.in1d(df.columns.values, dlst)
    comp = lambda df, dlst: [x not in dlst for x in df.columns.values.tolist()]
    brod = lambda df, dlst: (df.columns.values[:, None] != dlst).all(1)
    

    Testing

    from timeit import timeit
    
    res1 = pd.DataFrame(
        index=pd.MultiIndex.from_product([
            'loc slc ridx ridxa'.split(),
            'setdiff1d difference columndrop setdifflst comprehension'.split(),
        ], names=['Select', 'Label']),
        columns=[10, 30, 100, 300, 1000],
        dtype=float
    )
    
    res2 = pd.DataFrame(
        index=pd.MultiIndex.from_product([
            'loc'.split(),
            'isin in1d comp brod'.split(),
        ], names=['Select', 'Label']),
        columns=[10, 30, 100, 300, 1000],
        dtype=float
    )
    
    res = pd.concat([res1, res2]).sort_index()
    
    dres = pd.Series(index=res.columns, name='drop', dtype=float)
    
    for j in res.columns:
        dlst = list(range(j))
        cols = list(range(j // 2, j + j // 2))
        d = pd.DataFrame(1, range(10), cols)
        dres.at[j] = timeit('d.drop(dlst, axis=1, errors="ignore")', 'from __main__ import d, dlst', number=100)
        for s, l in res.index:
            stmt = '{}(d, {}(d, dlst))'.format(s, l)
            setp = 'from __main__ import d, dlst, {}, {}'.format(s, l)
            res.at[(s, l), j] = timeit(stmt, setp, number=100)
    
    rs = res / dres
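
    For anyone who wants to rerun the comparison, here is a reduced sketch of the same idea. It is not the harness above: it passes callables to timeit instead of importing from __main__, covers only a subset of the candidates, and uses smaller sizes and repeat counts so it finishes quickly. The names labelers, selectors, rows and ratios are hypothetical.

```python
from timeit import timeit

import numpy as np
import pandas as pd

# A subset of the label builders and selectors defined above.
labelers = {
    "setdiff1d": lambda d, dlst: np.setdiff1d(d.columns.values, dlst),
    "columndrop": lambda d, dlst: d.columns.drop(dlst, errors="ignore"),
    "comprehension": lambda d, dlst: [x for x in d.columns.tolist() if x not in dlst],
}
selectors = {
    "loc": lambda d, cols: d.loc[:, cols],
    "ridx": lambda d, cols: d.reindex(columns=cols),
}

sizes = [10, 100]  # kept small so the sketch runs quickly
repeats = 20       # the original uses number=100

rows = []
for n in sizes:
    dlst = list(range(n))
    d = pd.DataFrame(1, range(10), list(range(n // 2, n + n // 2)))
    # Baseline: plain drop, which every candidate is measured against.
    base = timeit(lambda: d.drop(columns=dlst, errors="ignore"), number=repeats)
    for s_name, select in selectors.items():
        for l_name, label in labelers.items():
            t = timeit(lambda: select(d, label(d, dlst)), number=repeats)
            rows.append((s_name, l_name, n, t / base))

# One row per (Select, Label) pair, one column per size, values relative to drop.
ratios = (
    pd.DataFrame(rows, columns=["Select", "Label", "n", "ratio"])
    .pivot_table(index=["Select", "Label"], columns="n", values="ratio")
)
print(ratios)
```

    Absolute numbers will vary by machine and library version, so only the ratios against drop are meaningful.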
    

    rs
    
                              10        30        100       300        1000
    Select Label                                                           
    loc    brod           0.747373  0.861979  0.891144  1.284235   3.872157
           columndrop     1.193983  1.292843  1.396841  1.484429   1.335733
           comp           0.802036  0.732326  1.149397  3.473283  25.565922
           comprehension  1.463503  1.568395  1.866441  4.421639  26.552276
           difference     1.413010  1.460863  1.587594  1.568571   1.569735
           in1d           0.818502  0.844374  0.994093  1.042360   1.076255
           isin           1.008874  0.879706  1.021712  1.001119   0.964327
           setdiff1d      1.352828  1.274061  1.483380  1.459986   1.466575
           setdifflst     1.233332  1.444521  1.714199  1.797241   1.876425
    ridx   columndrop     0.903013  0.832814  0.949234  0.976366   0.982888
           comprehension  0.777445  0.827151  1.108028  3.473164  25.528879
           difference     1.086859  1.081396  1.293132  1.173044   1.237613
           setdiff1d      0.946009  0.873169  0.900185  0.908194   1.036124
           setdifflst     0.732964  0.823218  0.819748  0.990315   1.050910
    ridxa  columndrop     0.835254  0.774701  0.907105  0.908006   0.932754
           comprehension  0.697749  0.762556  1.215225  3.510226  25.041832
           difference     1.055099  1.010208  1.122005  1.119575   1.383065
           setdiff1d      0.760716  0.725386  0.849949  0.879425   0.946460
           setdifflst     0.710008  0.668108  0.778060  0.871766   0.939537
    slc    columndrop     1.268191  1.521264  2.646687  1.919423   1.981091
           comprehension  0.856893  0.870365  1.290730  3.564219  26.208937
           difference     1.470095  1.747211  2.886581  2.254690   2.050536
           setdiff1d      1.098427  1.133476  1.466029  2.045965   3.123452
           setdifflst     0.833700  0.846652  1.013061  1.110352   1.287831
    

    import matplotlib.pyplot as plt
    
    fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharey=True)
    for i, (n, g) in enumerate([(n, g.xs(n)) for n, g in rs.groupby('Select')]):
        ax = axes[i // 2, i % 2]
        g.plot.bar(ax=ax, title=n)
        ax.legend_.remove()
    fig.tight_layout()
    

    These ratios are relative to the time it takes to run df.drop(dlst, axis=1, errors='ignore'); values below 1 are faster than drop. After all that effort, performance improves only modestly.

    In fact, the best solutions use reindex or reindex_axis on the hack list(set(df.columns.values.tolist()).difference(dlst)). A close second, and still only marginally better than drop, is np.setdiff1d.

    rs.idxmin().pipe(
        lambda x: pd.DataFrame(
            dict(idx=x.values, val=rs.min().values),  # rs.lookup was removed in pandas 2.0
            x.index
        )
    )
    
                          idx       val
    10     (ridx, setdifflst)  0.653431
    30    (ridxa, setdifflst)  0.746143
    100   (ridxa, setdifflst)  0.816207
    300    (ridx, setdifflst)  0.780157
    1000  (ridxa, setdifflst)  0.861622
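
    The same "best per size" summary can be written without DataFrame.lookup, which is gone from pandas 2.0. A small sketch on a made-up table shaped like rs; rs_demo and its numbers are illustrative only, not measurements.

```python
import pandas as pd

# Hypothetical ratios: a (Select, Label) MultiIndex by size columns, like rs.
idx = pd.MultiIndex.from_tuples(
    [("loc", "isin"), ("ridx", "setdifflst"), ("slc", "difference")],
    names=["Select", "Label"],
)
rs_demo = pd.DataFrame({10: [1.01, 0.73, 1.47], 1000: [0.96, 1.05, 2.05]}, index=idx)

# idxmin gives the winning (Select, Label) pair per size, min gives its ratio.
best = pd.DataFrame({"idx": rs_demo.idxmin(), "val": rs_demo.min()})
print(best)
```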
    
