How to remove a row from pandas dataframe based on the length of the column values?

失恋的感觉 2021-01-17 10:16

In the following pandas.DataFrame:

df = 
    alfa    beta   ceta
    a,b,c   c,d,e  g,e,h
    a,b     d,e,f  g,h,k
    j,k     c,k,l  f,k,n
         


        
5 answers
  •  北荒 (OP)
    2021-01-17 11:10

    Here is an option that is probably the easiest to remember and still works directly with the DataFrame, which is the "bleeding heart" of pandas:

    1) Create a new column in the dataframe with a value for the length:

    df['length'] = df.alfa.str.len()
    

    2) Index using the new column:

    df = df[df.length < 3]
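A minimal, self-contained sketch of the two steps on the sample data from the question. One thing to watch: `str.len()` counts characters, commas included, so "a,b" has length 3 and `length < 3` matches no rows in this sample. If the goal is to filter on the number of comma-separated items instead (an assumption, since the exact criterion is not stated here), splitting first is one option. The `n_items` column name and the thresholds are illustrative only.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "alfa": ["a,b,c", "a,b", "j,k"],
    "beta": ["c,d,e", "d,e,f", "c,k,l"],
    "ceta": ["g,e,h", "g,h,k", "f,k,n"],
})

# Step 1: character length of each string (commas are counted too).
df["length"] = df["alfa"].str.len()

# Assumed alternative: number of comma-separated items in each string.
df["n_items"] = df["alfa"].str.split(",").str.len()

# Step 2: boolean indexing on the helper column.
by_chars = df[df["length"] < 3]   # empty here: the shortest string has 3 characters
filtered = df[df["n_items"] < 3].drop(columns=["length", "n_items"])

print(filtered)
```

Bracket access (`df["length"]`) is used instead of `df.length` only to avoid any clash with DataFrame attributes; both work for this column name.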
    

    As for how this compares with the timings above: they are not really relevant in this case because the data is very small, and raw speed usually matters less than how likely you are to remember how to do something without interrupting your workflow. For reference:

    step 1:

    %timeit df['length'] = df.alfa.str.len()
    

    359 µs ± 6.83 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    step 2:

    %timeit df[df.length < 3]
    

    627 µs ± 76.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    The good news is that the time does not grow in proportion to the size of the data. For example, the same operation on 30,000 rows takes about 3 ms (so 10,000x the data for roughly 3x the time). A pandas DataFrame is like a train: it takes energy to get it going, so it looks poor in absolute comparisons on tiny inputs, but that hardly matters, since small data is fast anyway.
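To check the scaling claim outside IPython, something like the following rough sketch could be used. The `time_filter` helper, the repeat count, and the `< 4` threshold are assumptions made for illustration; actual numbers depend on the machine and the pandas version, so no expected output is shown.

```python
import timeit
import pandas as pd

def time_filter(n_rows, repeats=20):
    """Return average seconds per run of the two-step filter on n_rows rows."""
    base = ["a,b,c", "a,b", "j,k"]
    # Repeat the three sample values; n_rows is assumed to be a multiple of 3.
    df = pd.DataFrame({"alfa": base * (n_rows // 3)})

    def run():
        # Step 1: helper column with the character length.
        df["length"] = df["alfa"].str.len()
        # Step 2: boolean indexing on the helper column.
        return df[df["length"] < 4]

    return timeit.timeit(run, number=repeats) / repeats

timings = {n: time_filter(n) for n in (3, 30_000)}
for n, seconds in timings.items():
    print(f"{n:>6} rows: {seconds * 1e6:.0f} us per run")
```

Most of the cost at 3 rows is fixed per-call overhead, which is why the time grows far more slowly than the row count at these sizes.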
