Delete column from pandas DataFrame

一生所求 2020-11-22 02:44

When deleting a column in a DataFrame I use:

del df['column_name']

And this works great. Why can't I use the following?

         


        
17 Answers
  • 2020-11-22 03:19

    Another way to delete a column from a pandas DataFrame

    If you don't need in-place deletion, you can create a new DataFrame that keeps only the columns you want by passing them to the DataFrame(...) constructor:

    import pandas as pd

    my_dict = {'name': ['a', 'b', 'c', 'd'], 'age': [10, 20, 25, 22], 'designation': ['CEO', 'VP', 'MD', 'CEO']}
    
    df = pd.DataFrame(my_dict)
    

    Then create the new DataFrame with only the columns to keep:

    newdf = pd.DataFrame(df, columns=['name', 'age'])
    

    The result has the same columns as what you would get with del or drop.
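As a minimal sketch of the approach above (same sample data as in this answer), the constructor builds a separate frame and leaves df untouched:

```python
import pandas as pd

my_dict = {
    'name': ['a', 'b', 'c', 'd'],
    'age': [10, 20, 25, 22],
    'designation': ['CEO', 'VP', 'MD', 'CEO'],
}
df = pd.DataFrame(my_dict)

# Build a new frame from the columns to keep; df itself is not modified.
newdf = pd.DataFrame(df, columns=['name', 'age'])

print(newdf.columns.tolist())
print(df.columns.tolist())
```

Because df keeps all three columns, this suits cases where the original frame is still needed afterwards.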

  • 2020-11-22 03:20

    A nice addition is the ability to drop columns only if they exist. This covers more use cases: of the labels you pass, only those actually present in the DataFrame are dropped, and the rest are ignored.

    Simply add errors='ignore', for example:

    df.drop(['col_name_1', 'col_name_2', ..., 'col_name_N'], inplace=True, axis=1, errors='ignore')
    
    • This is available from pandas 0.16.1 onward; see the DataFrame.drop documentation.
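A small sketch of the difference, using a made-up frame and a made-up label 'missing' that is not one of its columns:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# 'missing' is not a column; with errors='ignore' it is skipped silently.
result = df.drop(['b', 'missing'], axis=1, errors='ignore')

# Without errors='ignore', the same call raises KeyError.
try:
    df.drop(['b', 'missing'], axis=1)
    raised = False
except KeyError:
    raised = True

print(result.columns.tolist(), raised)
```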
  • 2020-11-22 03:21

    Pandas 0.21+ answer

    Pandas 0.21 changed the drop method slightly: it now accepts both index and columns parameters, matching the signatures of the rename and reindex methods.

    df.drop(columns=['column_a', 'column_c'])
    

    Personally, I prefer using the axis parameter to denote columns or index, because it is the predominant keyword parameter used in nearly all pandas methods. But from version 0.21 on you have some added choices.
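To illustrate the two spellings side by side, here is a short sketch on a made-up one-row frame; the column names follow the example above, and column_b is added only so that something remains:

```python
import pandas as pd

df = pd.DataFrame({'column_a': [1], 'column_b': [2], 'column_c': [3]})

# The columns= keyword and the axis=1 form select the same labels.
by_columns = df.drop(columns=['column_a', 'column_c'])
by_axis = df.drop(['column_a', 'column_c'], axis=1)

# The index= keyword is the row-wise counterpart.
by_index = df.drop(index=[0])

print(by_columns.equals(by_axis))
```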

  • 2020-11-22 03:21

    If your original DataFrame df is not too big and you have no memory constraints, and you only need to keep a few columns (or you don't know beforehand the names of all the extra columns you don't need), you might as well create a new DataFrame with only the columns you need:

    new_df = df[['spam', 'sausage']]
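A quick sketch with a made-up frame (the 'eggs' and 'bacon' columns are assumptions added for illustration). Note that every label in the list has to exist, otherwise the selection raises KeyError:

```python
import pandas as pd

df = pd.DataFrame({
    'spam': [1, 2],
    'eggs': [3, 4],
    'sausage': [5, 6],
    'bacon': [7, 8],
})

# Keep only the wanted columns; the unwanted ones never need to be named.
new_df = df[['spam', 'sausage']]

print(new_df.columns.tolist())
```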
    
  • 2020-11-22 03:23

    We can remove a specified column, or several specified columns, with the drop() method.

    Suppose df is a dataframe.

    Column to be removed = column0

    Code:

    df = df.drop(column0, axis=1)
    

    To remove multiple columns col1, col2, . . . , coln, put all the columns to be removed in a list, then remove them with the drop() method.

    Code:

    df = df.drop([col1, col2, . . . , coln], axis=1)
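The reassignment in both snippets matters because drop returns a new DataFrame by default. A minimal sketch, where column0, col1 and col2 are hypothetical variables holding column labels:

```python
import pandas as pd

df = pd.DataFrame({'x': [1], 'y': [2], 'z': [3], 'w': [4]})

# Hypothetical variables holding the labels to remove.
column0 = 'x'
col1, col2 = 'y', 'z'

# Calling drop without using the result leaves df as it was.
df.drop(column0, axis=1)
unchanged = df.columns.tolist()

# Assigning the result back is what actually removes the columns.
df = df.drop(column0, axis=1)
df = df.drop([col1, col2], axis=1)

print(unchanged, df.columns.tolist())
```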
    

    I hope this helps.

  • 2020-11-22 03:24

    TL;DR

    A lot of effort to find a marginally more efficient solution. It is difficult to justify the added complexity while sacrificing the simplicity of df.drop(dlst, axis=1, errors='ignore')

    # reindex_axis was removed in pandas 1.0; on current versions use df.reindex(columns=...)
    df.reindex_axis(np.setdiff1d(df.columns.values, dlst), 1)
    

    Preamble
    Deleting a column is semantically the same as selecting the other columns. I'll show a few additional methods to consider.

    I'll also focus on the general solution of deleting multiple columns at once and allowing for the attempt to delete columns not present.

    These solutions are general and work for the simple case as well.


    Setup
    Consider the pd.DataFrame df and the list of columns to delete, dlst:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(dict(zip('ABCDEFGHIJ', range(1, 11))), range(3))
    dlst = list('HIJKLM')
    

    df
    
       A  B  C  D  E  F  G  H  I   J
    0  1  2  3  4  5  6  7  8  9  10
    1  1  2  3  4  5  6  7  8  9  10
    2  1  2  3  4  5  6  7  8  9  10
    

    dlst
    
    ['H', 'I', 'J', 'K', 'L', 'M']
    

    The result should look like:

    df.drop(dlst, axis=1, errors='ignore')
    
       A  B  C  D  E  F  G
    0  1  2  3  4  5  6  7
    1  1  2  3  4  5  6  7
    2  1  2  3  4  5  6  7
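For reference, a self-contained sketch of the setup and the baseline drop call. It passes axis by keyword, on the assumption that you are running a recent pandas where the positional form is no longer accepted:

```python
import pandas as pd

# Three identical rows; columns A..J hold the values 1..10.
df = pd.DataFrame(dict(zip('ABCDEFGHIJ', range(1, 11))), index=range(3))

# K, L and M are not columns of df, so errors='ignore' is needed.
dlst = list('HIJKLM')

expected = df.drop(dlst, axis=1, errors='ignore')
print(expected)
```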
    

    Since I'm equating deleting a column to selecting the other columns, I'll break it into two types:

    1. Label selection
    2. Boolean selection

    Label Selection

    We start by building the list/array of labels for the columns we want to keep, that is, all columns except the ones we want to delete.

    1. df.columns.difference(dlst)

      Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')
      
    2. np.setdiff1d(df.columns.values, dlst)

      array(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype=object)
      
    3. df.columns.drop(dlst, errors='ignore')

      Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')
      
    4. list(set(df.columns.values.tolist()).difference(dlst))

      # does not preserve order
      ['E', 'D', 'B', 'F', 'G', 'A', 'C']
      
    5. [x for x in df.columns.values.tolist() if x not in dlst]

      ['A', 'B', 'C', 'D', 'E', 'F', 'G']
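The five label-building options above can be checked against each other with a sketch like the following. One point worth noting: difference and np.setdiff1d return sorted labels, which only matches the original column order here because the columns are already in alphabetical order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(zip('ABCDEFGHIJ', range(1, 11))), index=range(3))
dlst = list('HIJKLM')

labels = {
    'difference': df.columns.difference(dlst),
    'setdiff1d': np.setdiff1d(df.columns.values, dlst),
    'columndrop': df.columns.drop(dlst, errors='ignore'),
    # Going through a set loses the column order.
    'setdifflst': list(set(df.columns.values.tolist()).difference(dlst)),
    'comprehension': [x for x in df.columns.values.tolist() if x not in dlst],
}

for name, value in labels.items():
    print(name, sorted(value))
```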
      

    Columns from Labels
    For the sake of comparing the selection process, assume:

     cols = [x for x in df.columns.values.tolist() if x not in dlst]
    

    Then we can evaluate

    1. df.loc[:, cols]
    2. df[cols]
    3. df.reindex(columns=cols)
    4. df.reindex_axis(cols, 1)

    Which all evaluate to:

       A  B  C  D  E  F  G
    0  1  2  3  4  5  6  7
    1  1  2  3  4  5  6  7
    2  1  2  3  4  5  6  7
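A sketch comparing the selection methods on the setup frame. reindex_axis is left out on the assumption that you are on pandas 1.0 or later, where it no longer exists. One behavioural difference to keep in mind: reindex fills labels that are missing from the frame with NaN, while loc and [] raise KeyError for them.

```python
import pandas as pd

df = pd.DataFrame(dict(zip('ABCDEFGHIJ', range(1, 11))), index=range(3))
dlst = list('HIJKLM')

cols = [x for x in df.columns.values.tolist() if x not in dlst]

by_loc = df.loc[:, cols]
by_getitem = df[cols]
by_reindex = df.reindex(columns=cols)

print(by_loc.equals(by_getitem), by_loc.equals(by_reindex))
```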
    

    Boolean Slice

    We can construct an array/list of booleans for slicing

    1. ~df.columns.isin(dlst)
    2. ~np.in1d(df.columns.values, dlst)
    3. [x not in dlst for x in df.columns.values.tolist()]
    4. (df.columns.values[:, None] != dlst).all(1)

    Columns from Boolean
    For the sake of comparison

    bools = [x not in dlst for x in df.columns.values.tolist()]
    
    1. df.loc[:, bools]

    Which evaluates to:

       A  B  C  D  E  F  G
    0  1  2  3  4  5  6  7
    1  1  2  3  4  5  6  7
    2  1  2  3  4  5  6  7
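The boolean masks can be compared in the same way. This sketch swaps np.in1d for np.isin, on the assumption that you are using a recent NumPy where np.isin is the recommended spelling, and it makes the broadcast comparison explicit by turning dlst into an object array first:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(zip('ABCDEFGHIJ', range(1, 11))), index=range(3))
dlst = list('HIJKLM')

masks = {
    'isin': ~df.columns.isin(dlst),
    'np_isin': ~np.isin(df.columns.values, dlst),
    'comp': [x not in dlst for x in df.columns.values.tolist()],
    # Compare every column label with every label in dlst (10 x 6 grid),
    # then keep the columns that differ from all of them.
    'brod': (df.columns.values[:, None] != np.array(dlst, dtype=object)).all(1),
}

# Note the comma: rows are selected with ':' and columns with the mask.
results = {name: df.loc[:, mask] for name, mask in masks.items()}

for name, frame in results.items():
    print(name, frame.columns.tolist())
```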
    

    Robust Timing

    Functions

    setdiff1d = lambda df, dlst: np.setdiff1d(df.columns.values, dlst)
    difference = lambda df, dlst: df.columns.difference(dlst)
    columndrop = lambda df, dlst: df.columns.drop(dlst, errors='ignore')
    setdifflst = lambda df, dlst: list(set(df.columns.values.tolist()).difference(dlst))
    comprehension = lambda df, dlst: [x for x in df.columns.values.tolist() if x not in dlst]
    
    loc = lambda df, cols: df.loc[:, cols]
    slc = lambda df, cols: df[cols]
    ridx = lambda df, cols: df.reindex(columns=cols)
    # ridxa only runs on pandas < 1.0, where reindex_axis still exists
    ridxa = lambda df, cols: df.reindex_axis(cols, 1)
    
    isin = lambda df, dlst: ~df.columns.isin(dlst)
    in1d = lambda df, dlst: ~np.in1d(df.columns.values, dlst)
    comp = lambda df, dlst: [x not in dlst for x in df.columns.values.tolist()]
    brod = lambda df, dlst: (df.columns.values[:, None] != dlst).all(1)
    

    Testing

    from timeit import timeit

    res1 = pd.DataFrame(
        index=pd.MultiIndex.from_product([
            'loc slc ridx ridxa'.split(),
            'setdiff1d difference columndrop setdifflst comprehension'.split(),
        ], names=['Select', 'Label']),
        columns=[10, 30, 100, 300, 1000],
        dtype=float
    )
    
    res2 = pd.DataFrame(
        index=pd.MultiIndex.from_product([
            'loc'.split(),
            'isin in1d comp brod'.split(),
        ], names=['Select', 'Label']),
        columns=[10, 30, 100, 300, 1000],
        dtype=float
    )
    
    res = pd.concat([res1, res2]).sort_index()
    
    dres = pd.Series(index=res.columns, name='drop', dtype=float)
    
    for j in res.columns:
        dlst = list(range(j))
        cols = list(range(j // 2, j + j // 2))
        d = pd.DataFrame(1, range(10), cols)
        dres.at[j] = timeit('d.drop(dlst, axis=1, errors="ignore")', 'from __main__ import d, dlst', number=100)
        for s, l in res.index:
            stmt = '{}(d, {}(d, dlst))'.format(s, l)
            setp = 'from __main__ import d, dlst, {}, {}'.format(s, l)
            res.at[(s, l), j] = timeit(stmt, setp, number=100)
    
    rs = res / dres
    

    rs
    
                              10        30        100       300        1000
    Select Label                                                           
    loc    brod           0.747373  0.861979  0.891144  1.284235   3.872157
           columndrop     1.193983  1.292843  1.396841  1.484429   1.335733
           comp           0.802036  0.732326  1.149397  3.473283  25.565922
           comprehension  1.463503  1.568395  1.866441  4.421639  26.552276
           difference     1.413010  1.460863  1.587594  1.568571   1.569735
           in1d           0.818502  0.844374  0.994093  1.042360   1.076255
           isin           1.008874  0.879706  1.021712  1.001119   0.964327
           setdiff1d      1.352828  1.274061  1.483380  1.459986   1.466575
           setdifflst     1.233332  1.444521  1.714199  1.797241   1.876425
    ridx   columndrop     0.903013  0.832814  0.949234  0.976366   0.982888
           comprehension  0.777445  0.827151  1.108028  3.473164  25.528879
           difference     1.086859  1.081396  1.293132  1.173044   1.237613
           setdiff1d      0.946009  0.873169  0.900185  0.908194   1.036124
           setdifflst     0.732964  0.823218  0.819748  0.990315   1.050910
    ridxa  columndrop     0.835254  0.774701  0.907105  0.908006   0.932754
           comprehension  0.697749  0.762556  1.215225  3.510226  25.041832
           difference     1.055099  1.010208  1.122005  1.119575   1.383065
           setdiff1d      0.760716  0.725386  0.849949  0.879425   0.946460
           setdifflst     0.710008  0.668108  0.778060  0.871766   0.939537
    slc    columndrop     1.268191  1.521264  2.646687  1.919423   1.981091
           comprehension  0.856893  0.870365  1.290730  3.564219  26.208937
           difference     1.470095  1.747211  2.886581  2.254690   2.050536
           setdiff1d      1.098427  1.133476  1.466029  2.045965   3.123452
           setdifflst     0.833700  0.846652  1.013061  1.110352   1.287831
    

    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharey=True)
    for i, (n, g) in enumerate([(n, g.xs(n)) for n, g in rs.groupby('Select')]):
        ax = axes[i // 2, i % 2]
        g.plot.bar(ax=ax, title=n)
        ax.legend_.remove()
    fig.tight_layout()
    

    These numbers are relative to the time it takes to run df.drop(dlst, axis=1, errors='ignore'), so values below 1 are faster than drop. It seems that after all that effort, we only improve performance modestly.

    In fact, the best solutions use reindex or reindex_axis on the hack list(set(df.columns.values.tolist()).difference(dlst)). A close second, and still very marginally better than drop, is np.setdiff1d.

    # rs.min() gives the value at each idxmin position (DataFrame.lookup is gone in pandas 2.0)
    rs.idxmin().pipe(
        lambda x: pd.DataFrame(
            dict(idx=x.values, val=rs.min().values),
            x.index
        )
    )
    
                          idx       val
    10     (ridx, setdifflst)  0.653431
    30    (ridxa, setdifflst)  0.746143
    100   (ridxa, setdifflst)  0.816207
    300    (ridx, setdifflst)  0.780157
    1000  (ridxa, setdifflst)  0.861622
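To repeat a reduced version of this comparison on a current pandas, something like the sketch below may be enough. It is an assumption-laden simplification rather than the benchmark above: make_case and the candidates dictionary are names invented here, only four combinations are timed, and reindex_axis is omitted. Timings vary by machine and library version, so no expected numbers are shown.

```python
from timeit import timeit

import numpy as np
import pandas as pd


def make_case(n):
    """Return a frame with n integer-labelled columns and a list to delete."""
    dlst = list(range(n))
    cols = list(range(n // 2, n + n // 2))
    return pd.DataFrame(1, index=range(10), columns=cols), dlst


candidates = {
    'drop': lambda d, dlst: d.drop(dlst, axis=1, errors='ignore'),
    'reindex+setdifflst': lambda d, dlst: d.reindex(
        columns=list(set(d.columns.tolist()).difference(dlst))),
    'reindex+setdiff1d': lambda d, dlst: d.reindex(
        columns=np.setdiff1d(d.columns.values, dlst)),
    'loc+isin': lambda d, dlst: d.loc[:, ~d.columns.isin(dlst)],
}

timings = {}
for n in (10, 100):
    d, dlst = make_case(n)
    timings[n] = {
        name: timeit(lambda: fn(d, dlst), number=20)
        for name, fn in candidates.items()
    }

# Express each timing relative to plain drop, as in the table above.
relative = pd.DataFrame(timings)
relative = relative / relative.loc['drop']
print(relative)
```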
    