Pandas split column of lists into multiple columns

爱一瞬间的悲伤 2020-11-21 06:28

I have a pandas DataFrame with one column:

import pandas as pd

df = pd.DataFrame(
    data={
        "teams": [
            


        
8 Answers
  • 2020-11-21 07:03

    Based on the previous answers, here is another solution which returns the same result as df2.teams.apply(pd.Series) with a much faster run time:

    pd.DataFrame([{x: y for x, y in enumerate(item)} for item in df2['teams'].values.tolist()], index=df2.index)
    

    Timings:

    In [1]:
    import pandas as pd
    d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
                    ['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
    df2 = pd.DataFrame(d1)
    df2 = pd.concat([df2]*1000).reset_index(drop=True)
    
    In [2]: %timeit df2['teams'].apply(pd.Series)
    
    8.27 s ± 2.73 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [3]: %timeit pd.DataFrame([{x: y for x, y in enumerate(item)} for item in df2['teams'].values.tolist()], index=df2.index)
    
    35.4 ms ± 5.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
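
One reason the dict-based constructor can stand in for apply(pd.Series) is that it also copes with rows whose lists differ in length. The following is a minimal sketch of that behavior; the `ragged` frame and its values are made up for illustration and are not part of the original question.

```python
import pandas as pd

# Hypothetical data: the second row has only one team.
ragged = pd.DataFrame({"teams": [["SF", "NYG"], ["LA"], ["DEN", "KC"]]})

# One dict per row, keyed by position; positions missing from a row become NaN.
wide = pd.DataFrame(
    [dict(enumerate(item)) for item in ragged["teams"].tolist()],
    index=ragged.index,
)
print(wide)
```

Here `dict(enumerate(item))` builds the same mapping as the `{x: y for x, y in enumerate(item)}` comprehension in the answer.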
    
  • 2020-11-21 07:04

    You can use the DataFrame constructor with lists created by to_list:

    import pandas as pd
    
    d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
                    ['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
    df2 = pd.DataFrame(d1)
    print (df2)
           teams
    0  [SF, NYG]
    1  [SF, NYG]
    2  [SF, NYG]
    3  [SF, NYG]
    4  [SF, NYG]
    5  [SF, NYG]
    6  [SF, NYG]
    

    df2[['team1','team2']] = pd.DataFrame(df2.teams.tolist(), index= df2.index)
    print (df2)
           teams team1 team2
    0  [SF, NYG]    SF   NYG
    1  [SF, NYG]    SF   NYG
    2  [SF, NYG]    SF   NYG
    3  [SF, NYG]    SF   NYG
    4  [SF, NYG]    SF   NYG
    5  [SF, NYG]    SF   NYG
    6  [SF, NYG]    SF   NYG
    

    And for a new DataFrame:

    df3 = pd.DataFrame(df2['teams'].to_list(), columns=['team1','team2'])
    print (df3)
      team1 team2
    0    SF   NYG
    1    SF   NYG
    2    SF   NYG
    3    SF   NYG
    4    SF   NYG
    5    SF   NYG
    6    SF   NYG
    

    The solution with apply(pd.Series) is very slow:

    #7k rows
    df2 = pd.concat([df2]*1000).reset_index(drop=True)
    
    In [121]: %timeit df2['teams'].apply(pd.Series)
    1.79 s ± 52.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [122]: %timeit pd.DataFrame(df2['teams'].to_list(), columns=['team1','team2'])
    1.63 ms ± 54.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
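
The `index=df2.index` argument in the assignment form matters once the frame no longer has the default 0..n-1 index. The sketch below tries to show why, using a hypothetical two-row frame with labels 10 and 20 and `add_prefix` to generate column names; none of these names come from the original answer.

```python
import pandas as pd

# Hypothetical frame with a non-default index (labels 10 and 20).
df = pd.DataFrame({"teams": [["SF", "NYG"], ["LA", "DEN"]]}, index=[10, 20])

# Reuse the original index so the new columns line up row for row.
aligned = pd.DataFrame(df["teams"].tolist(), index=df.index).add_prefix("team")
joined = df.join(aligned)

# Without index=..., the new frame is labelled 0 and 1, so no row matches.
misaligned = df.join(pd.DataFrame(df["teams"].tolist()).add_prefix("team"))

print(joined)
print(misaligned)
```

In the second case the team columns come back entirely NaN, because pandas aligns on index labels rather than on row position.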
    
  • 2020-11-21 07:05

    List comprehension

    A simple implementation with a list comprehension (my favorite):

    df = pd.DataFrame([pd.Series(x) for x in df.teams])
    df.columns = ['team_{}'.format(x+1) for x in df.columns]
    

    Timing:

    CPU times: user 0 ns, sys: 0 ns, total: 0 ns
    Wall time: 2.71 ms
    
    

    output:

      team_1 team_2
    0     SF    NYG
    1     SF    NYG
    2     SF    NYG
    3     SF    NYG
    4     SF    NYG
    5     SF    NYG
    6     SF    NYG
    
  • 2020-11-21 07:13

    If the column holds whitespace-separated strings rather than lists, there is a syntactically simpler way that is also easier to remember than the solutions above. Assuming the column is called 'meta' in a DataFrame df:

    df2 = pd.DataFrame(df['meta'].str.split().values.tolist())
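
The `str.split` call above works on cells that are strings. For that case, `str.split` also accepts `expand=True`, which returns the pieces as a DataFrame directly. A minimal sketch, keeping the column name 'meta' from the answer but with made-up values:

```python
import pandas as pd

# Made-up data: each cell is a single whitespace-separated string.
df = pd.DataFrame({"meta": ["SF NYG", "LA DEN"]})

# expand=True spreads the split pieces across columns 0, 1, ...
df2 = df["meta"].str.split(expand=True)
print(df2)
```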
    
  • 2020-11-21 07:14

    The above solutions didn't work for me because my DataFrame contains NaN values. In my case, df2[['team1','team2']] = pd.DataFrame(df2.teams.values.tolist(), index= df2.index) fails with:

    object of type 'float' has no len()
    

    I solved this with a list comprehension. Here is a reproducible example:

    import pandas as pd
    import numpy as np
    d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
                ['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
    df2 = pd.DataFrame(d1)
    df2.loc[2,'teams'] = np.nan
    df2.loc[4,'teams'] = np.nan
    df2
    

    output:

           teams
    0  [SF, NYG]
    1  [SF, NYG]
    2        NaN
    3  [SF, NYG]
    4        NaN
    5  [SF, NYG]
    6  [SF, NYG]
    
    df2['team1']=np.nan
    df2['team2']=np.nan
    

    solving with list comprehension:

    for i in [0,1]:
        df2['team{}'.format(str(i+1))]=[k[i] if isinstance(k,list) else k for k in df2['teams']]
    
    df2
    

    yields:

           teams team1 team2
    0  [SF, NYG]    SF   NYG
    1  [SF, NYG]    SF   NYG
    2        NaN   NaN   NaN
    3  [SF, NYG]    SF   NYG
    4        NaN   NaN   NaN
    5  [SF, NYG]    SF   NYG
    6  [SF, NYG]    SF   NYG
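
Another way to handle the missing rows, offered only as a sketch and not as the answerer's method, is to expand just the non-null rows and join the result back on the index. The three-row frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing row.
df2 = pd.DataFrame({"teams": [["SF", "NYG"], np.nan, ["LA", "DEN"]]})

# Expand only the rows that actually hold a list, keeping their index labels.
valid = df2["teams"].dropna()
expanded = pd.DataFrame(valid.tolist(), index=valid.index, columns=["team1", "team2"])

# The left join restores the skipped rows with NaN in the new columns.
result = df2.join(expanded)
print(result)
```

This keeps the fast constructor path for the valid rows and avoids hard-coding one loop iteration per output column.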
    
  • 2020-11-21 07:15

    Here's another solution using transform and set_axis:

    >>> (df['teams']
           .transform([lambda x:x[0], lambda x:x[1]])
           .set_axis(['team1','team2'],
                      axis=1)
        )
    
      team1 team2
    0    SF   NYG
    1    SF   NYG
    2    SF   NYG
    3    SF   NYG
    4    SF   NYG
    5    SF   NYG
    6    SF   NYG
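
The same element-wise extraction can also be sketched with the `.str` accessor, whose indexing works on lists as well as strings. This swaps in a different technique from the transform call above, and the two-row frame is made up for illustration.

```python
import pandas as pd

# Made-up data in the same shape as the question's column.
df = pd.DataFrame({"teams": [["SF", "NYG"], ["LA", "DEN"]]})

# .str[i] indexes into each cell, one position per output column.
out = pd.DataFrame({"team1": df["teams"].str[0], "team2": df["teams"].str[1]})
print(out)
```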
    