How to determine the length of lists in a pandas dataframe column?

独厮守ぢ 2020-11-27 18:34

How can the length of the lists in the column be determined without iteration?

I have a dataframe like this:

                                                    


        
2 Answers
  • 2020-11-27 18:58
    • In the timings below, pandas.Series.map(len) and pandas.Series.apply(len) take about the same time, and both are slightly faster than pandas.Series.str.len().

      • pandas.Series.map
      • pandas.Series.apply
      • pandas.Series.str.len
    • Difference between map, applymap and apply methods in Pandas

    import pandas as pd
    
    data = {'os': [['ubuntu', 'mac-osx', 'syslinux'], ['ubuntu', 'mod-rewrite', 'laconica', 'apache-2.2'], ['ubuntu', 'nat', 'squid', 'mikrotik']]}
    index = ['2013-12-22 15:25:02', '2009-12-14 14:29:32', '2013-12-22 15:42:00']
    
    df = pd.DataFrame(data, index)
    
    # create Length column
    df['Length'] = df.os.map(len)
    
    # display(df)
                                                                  os  Length
    2013-12-22 15:25:02                  [ubuntu, mac-osx, syslinux]       3
    2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]       4
    2013-12-22 15:42:00               [ubuntu, nat, squid, mikrotik]       4
    

    %timeit

    import pandas as pd
    import random
    import string
    
    random.seed(365)
    
    ser = pd.Series([random.sample(string.ascii_letters, random.randint(1, 20)) for _ in range(10**6)])
    
    %timeit ser.str.len()
    252 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    %timeit ser.map(len)
    220 ms ± 7.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    %timeit ser.apply(len)
    222 ms ± 8.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
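
%timeit is an IPython magic, so the timings above only run in IPython or Jupyter. As a rough sketch for a plain Python script, the same comparison can be made with the standard-library timeit module. The 10**4 row count and the `candidates` dictionary are assumptions chosen here to keep the run short, so the absolute numbers will not match the ones above.

```python
import random
import string
import timeit

import pandas as pd

random.seed(365)

# Assumption: a smaller Series than the 10**6 rows used above, so the script finishes quickly
ser = pd.Series([random.sample(string.ascii_letters, random.randint(1, 20))
                 for _ in range(10**4)])

# The three approaches being compared (the dictionary itself is just for illustration)
candidates = {
    "str.len": lambda: ser.str.len(),
    "map(len)": lambda: ser.map(len),
    "apply(len)": lambda: ser.apply(len),
}

# Run each approach once and keep the result to confirm they agree
results = {name: func() for name, func in candidates.items()}

# Best of 3 repeats, 5 calls per repeat, reported as seconds per call
timings = {name: min(timeit.repeat(func, number=5, repeat=3)) / 5
           for name, func in candidates.items()}

for name, seconds in timings.items():
    print(f"{name:>10}: {seconds * 1e3:.2f} ms per call")
```

Timings depend on the machine and the pandas version, so treat the ranking as indicative rather than fixed.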
    
  • 2020-11-27 19:00

    You can use the str accessor for some list operations as well. In this example,

    df['CreationDate'].str.len()
    

    returns the length of each list. See the docs for str.len.

    df['Length'] = df['CreationDate'].str.len()
    df
    Out: 
                                                        CreationDate  Length
    2013-12-22 15:25:02                  [ubuntu, mac-osx, syslinux]       3
    2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]       4
    2013-12-22 15:42:00               [ubuntu, nat, squid, mikrotik]       4
    

    For these operations, a plain Python list comprehension is generally faster than the pandas methods, although the pandas str accessor handles NaNs. Here are timings:

    ser = pd.Series([random.sample(string.ascii_letters, 
                                   random.randint(1, 20)) for _ in range(10**6)])
    
    %timeit ser.apply(lambda x: len(x))
    1 loop, best of 3: 425 ms per loop
    
    %timeit ser.str.len()
    1 loop, best of 3: 248 ms per loop
    
    %timeit [len(x) for x in ser]
    10 loops, best of 3: 84 ms per loop
    
    %timeit pd.Series([len(x) for x in ser], index=ser.index)
    1 loop, best of 3: 236 ms per loop
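
To illustrate the NaN remark above, here is a minimal sketch with a small made-up Series (`ser_with_nan` is a hypothetical name, not from the question). It shows `str.len()` passing a missing value through, plain `map(len)` failing on it, and `map` with `na_action='ignore'` as one possible way to skip missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two lists and one missing value
ser_with_nan = pd.Series([['ubuntu', 'mac-osx'], np.nan, ['nat']])

# The str accessor leaves the missing entry as NaN
lengths = ser_with_nan.str.len()
print(lengths)

# len() is not defined for a float NaN, so a plain map(len) raises TypeError
raised = False
try:
    ser_with_nan.map(len)
except TypeError as exc:
    raised = True
    print(f"map(len) failed: {exc}")

# na_action='ignore' tells map to skip missing values instead of calling len on them
skipped = ser_with_nan.map(len, na_action='ignore')
print(skipped)
```

A list comprehension such as `[len(x) for x in ser_with_nan]` fails in the same way as `map(len)`, so the speed advantage of vanilla Python only applies when the column has no missing values or when they are handled explicitly.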
    