Expand pandas DataFrame column into multiple rows

星月不相逢 2020-12-05 04:50

If I have a DataFrame such that:

pd.DataFrame({"name": "John",
              "days": [[1, 3, 5, 7]]
              })
7 answers
  • 2020-12-05 05:21

    A 'native' pandas solution: unstack the column into a Series, then join it back on the index:

    import pandas as pd

    # x is the question's DataFrame; expand each list into columns, then unstack into one long Series
    x2 = x.days.apply(lambda lst: pd.Series(lst)).unstack()
    # drop the original 'days' column and join the long Series back on the row index
    x.drop('days', axis=1).join(x2.reset_index(level=0, drop=True).to_frame('days'))
    
  • 2020-12-05 05:21

    Another solution:

    In [139]: (df.apply(lambda x: pd.Series(x.days), axis=1)
       .....:    .stack()
       .....:    .reset_index(level=1, drop=True)
       .....:    .to_frame('day')
       .....:    .join(df['name'])
       .....: )
    Out[139]:
       day  name
    0    1  John
    0    3  John
    0    5  John
    0    7  John
    
  • 2020-12-05 05:37

    You could use df.itertuples to iterate through each row, and use a list comprehension to reshape the data into the desired form:

    import pandas as pd
    
    df = pd.DataFrame( {"name" : ["John", "Eric"], 
                   "days" : [[1, 3, 5, 7], [2,4]]})
    result = pd.DataFrame([(d, tup.name) for tup in df.itertuples() for d in tup.days])
    print(result)
    

    yields

       0     1
    0  1  John
    1  3  John
    2  5  John
    3  7  John
    4  2  Eric
    5  4  Eric
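
    The result above carries the default column labels 0 and 1. A minimal variant of the same idea, assuming you want named columns, passes columns= to the constructor; index=False is an optional choice here that leaves the row index out of each tuple:

```python
import pandas as pd

df = pd.DataFrame({"name": ["John", "Eric"],
                   "days": [[1, 3, 5, 7], [2, 4]]})

# index=False skips the row index, so each tuple holds only the column values
rows = [(tup.name, d) for tup in df.itertuples(index=False) for d in tup.days]

# Passing columns= replaces the default 0/1 labels with meaningful names
result = pd.DataFrame(rows, columns=["name", "days"])
print(result)
```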
    

    Divakar's solution, using_repeat, is fastest:

    In [48]: %timeit using_repeat(df)
    1000 loops, best of 3: 834 µs per loop
    
    In [5]: %timeit using_itertuples(df)
    100 loops, best of 3: 3.43 ms per loop
    
    In [7]: %timeit using_apply(df)
    1 loop, best of 3: 379 ms per loop
    
    In [8]: %timeit using_append(df)
    1 loop, best of 3: 3.59 s per loop
    

    Here is the setup used for the above benchmark:

    import numpy as np
    import pandas as pd
    
    N = 10**3
    df = pd.DataFrame( {"name" : np.random.choice(list('ABCD'), size=N), 
                        "days" : [np.random.randint(10, size=np.random.randint(5))
                                  for i in range(N)]})
    
    def using_itertuples(df):
        return  pd.DataFrame([(d, tup.name) for tup in df.itertuples() for d in tup.days])
    
    def using_repeat(df):
        lens = [len(item) for item in df['days']]
        return pd.DataFrame( {"name" : np.repeat(df['name'].values,lens), 
                              "days" : np.concatenate(df['days'].values)})
    
    def using_apply(df):
        return (df.apply(lambda x: pd.Series(x.days), axis=1)
                .stack()
                .reset_index(level=1, drop=True)
                .to_frame('day')
                .join(df['name']))
    
    def using_append(df):
        df2 = pd.DataFrame(columns = df.columns)
        for i,r in df.iterrows():
            for e in r.days:
                new_r = r.copy()
                new_r.days = e
                # DataFrame.append was removed in pandas 2.0, so this variant needs pandas < 2.0
                df2 = df2.append(new_r)
        return df2
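
    Before comparing timings it can help to confirm that the variants return the same data. The following is only a sketch of such a check for two of the functions above; the seed, the smaller N, and the variable names a and b are assumptions made for the illustration:

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the check is repeatable (an assumption, not in the benchmark)
N = 50             # smaller than the benchmark's 10**3, enough for a sanity check
df = pd.DataFrame({"name": np.random.choice(list("ABCD"), size=N),
                   "days": [np.random.randint(10, size=np.random.randint(5))
                            for _ in range(N)]})

def using_itertuples(df):
    return pd.DataFrame([(d, tup.name) for tup in df.itertuples() for d in tup.days])

def using_repeat(df):
    lens = [len(item) for item in df["days"]]
    return pd.DataFrame({"name": np.repeat(df["name"].values, lens),
                         "days": np.concatenate(df["days"].values)})

a = using_repeat(df)
b = using_itertuples(df)

# using_itertuples labels its columns 0 (day) and 1 (name), so compare by position
same_days = a["days"].tolist() == b[0].tolist()
same_names = a["name"].tolist() == b[1].tolist()
print(same_days, same_names)
```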
    
  • 2020-12-05 05:42

    Building on Divakar's solution, I wrote a wrapper function that flattens a column; it handles np.nan and DataFrames with multiple columns:

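    import numpy as np
    import pandas as pd

    def flatten_column(df, column_name):
        # a NaN cell counts as one element, so its row is kept once
        repeat_lens = [len(item) if item is not np.nan else 1 for item in df[column_name]]
        df_columns = list(df.columns)
        df_columns.remove(column_name)
        # repeat the remaining columns once per list element
        expanded_df = pd.DataFrame(np.repeat(df.drop(column_name, axis=1).values, repeat_lens, axis=0), columns=df_columns)
        flat_column_values = np.hstack(df[column_name].values)
        expanded_df[column_name] = flat_column_values
        # restore NaN if flattening stringified it to 'nan'; assign back instead of a chained inplace replace
        expanded_df[column_name] = expanded_df[column_name].replace('nan', np.nan)
        return expanded_df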
    
  • 2020-12-05 05:43

    Since pandas 0.25 you can use DataFrame.explode():

    https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html

    import pandas as pd
    df = pd.DataFrame( {"name" : "John", 
                   "days" : [[1, 3, 5, 7]]})
    
    print(df.explode('days'))
    

    prints

       name days
    0  John    1
    0  John    3
    0  John    5
    0  John    7
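
    As the output above shows, explode() repeats the original index label for every expanded row, and the exploded column keeps the object dtype. A small sketch, assuming you want a fresh 0..n-1 index and integer values (the two-row frame is the sample data used elsewhere in this thread):

```python
import pandas as pd

df = pd.DataFrame({"name": ["John", "Eric"],
                   "days": [[1, 3, 5, 7], [2, 4]]})

exploded = (df.explode("days")            # one row per list element, index labels repeated
              .reset_index(drop=True)     # replace the repeated labels with 0..n-1
              .astype({"days": "int64"})) # exploded values are object dtype; cast to integers
print(exploded)
```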
    
  • 2020-12-05 05:44

    Here's something with NumPy -

    lens = [len(item) for item in df['days']]
    df_out = pd.DataFrame( {"name" : np.repeat(df['name'].values,lens), 
                   "days" : np.hstack(df['days'])
                  })
    

    As pointed out in @unutbu's solution, np.concatenate(df['days'].values) would be faster than np.hstack(df['days']).

    The lengths of the 'days' elements are extracted with a list comprehension, whose runtime cost should be minimal.

    Sample run -

    >>> df
               days  name
    0  [1, 3, 5, 7]  John
    1        [2, 4]  Eric
    >>> lens = [len(item) for item in df['days']]
    >>> pd.DataFrame( {"name" : np.repeat(df['name'].values,lens), 
    ...                "days" : np.hstack(df['days'])
    ...               })
       days  name
    0     1  John
    1     3  John
    2     5  John
    3     7  John
    4     2  Eric
    5     4  Eric
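
    To try the np.concatenate versus np.hstack remark above on your own machine, something like the following could be used. It is a rough sketch, not a rigorous benchmark: the tiny sample frame, the conversion to a plain list of lists, and the repeat count of 1000 are all assumptions, and no timing figures are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["John", "Eric"],
                   "days": [[1, 3, 5, 7], [2, 4]]})

# Convert the column to a plain list of lists so both calls get the same input
days = df["days"].tolist()

flat_concat = np.concatenate(days)
flat_hstack = np.hstack(days)

# Check that both calls produce the same flattened array
print(np.array_equal(flat_concat, flat_hstack))

# Timings depend on the machine and library versions
t_concat = timeit.timeit(lambda: np.concatenate(days), number=1000)
t_hstack = timeit.timeit(lambda: np.hstack(days), number=1000)
print(f"concatenate: {t_concat:.4f}s, hstack: {t_hstack:.4f}s")
```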
    