Splitting dataframe into multiple dataframes

南方客 2020-11-22 01:16

I have a very large dataframe (around 1 million rows) with data from an experiment (60 respondents).

I would like to split the dataframe into 60 dataframes (a datafra

11 answers
  •  面向向阳花
    2020-11-22 01:34

    • The method in the question works, but it is inefficient. With a dataset this long, it may simply have seemed to run forever.
    • Use .groupby on the 'method' column, and build a dict of DataFrames with a dict-comprehension, using the unique 'method' values as the keys.
      • .groupby returns a groupby object. Iterating over it yields (g, d) pairs, where g is the unique value in 'method' for the group and d is the DataFrame for that group.
    • The value of each key in df_dict is a DataFrame, which can be accessed in the standard way, df_dict['key'].
    • The original question wanted a list of DataFrames, which can be built with a list-comprehension:
      • df_list = [d for _, d in df.groupby('method')]
    import pandas as pd
    import seaborn as sns  # for test dataset
    
    # load data for example
    df = sns.load_dataset('planets')
    
    # display(df.head())
                method  number  orbital_period   mass  distance  year
    0  Radial Velocity       1         269.300   7.10     77.40  2006
    1  Radial Velocity       1         874.774   2.21     56.95  2008
    2  Radial Velocity       1         763.000   2.60     19.84  2011
    3  Radial Velocity       1         326.030  19.40    110.62  2007
    4  Radial Velocity       1         516.220  10.50    119.47  2009
    
    
    # Using a dict-comprehension, the unique 'method' value will be the key
    df_dict = {g: d for g, d in df.groupby('method')}
    
    print(df_dict.keys())
    [out]:
    dict_keys(['Astrometry', 'Eclipse Timing Variations', 'Imaging', 'Microlensing', 'Orbital Brightness Modulation', 'Pulsar Timing', 'Pulsation Timing Variations', 'Radial Velocity', 'Transit', 'Transit Timing Variations'])
    
    # or a specific name for the key, using enumerate (e.g. df1, df2, etc.)
    df_dict = {f'df{i}': d for i, (g, d) in enumerate(df.groupby('method'))}
    
    print(df_dict.keys())
    [out]:
    dict_keys(['df0', 'df1', 'df2', 'df3', 'df4', 'df5', 'df6', 'df7', 'df8', 'df9'])
    
    • df_dict['df0'].head(3) or df_dict['Astrometry'].head(3)
    • This group has only 2 rows
             method  number  orbital_period  mass  distance  year
    113  Astrometry       1          246.36   NaN     20.77  2013
    537  Astrometry       1         1016.00   NaN     14.98  2010
    
    • df_dict['df1'].head(3) or df_dict['Eclipse Timing Variations'].head(3)
                           method  number  orbital_period  mass  distance  year
    32  Eclipse Timing Variations       1         10220.0  6.05       NaN  2009
    37  Eclipse Timing Variations       2          5767.0   NaN    130.72  2008
    38  Eclipse Timing Variations       2          3321.0   NaN    130.72  2008
    
    • df_dict['df2'].head(3) or df_dict['Imaging'].head(3)
         method  number  orbital_period  mass  distance  year
    29  Imaging       1             NaN   NaN     45.52  2005
    30  Imaging       1             NaN   NaN    165.00  2007
    31  Imaging       1             NaN   NaN    140.00  2004
    
    • For more information about the seaborn datasets
      • NASA Exoplanets
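
The list-comprehension variant mentioned at the top can be tried without downloading the seaborn dataset. The following is a minimal sketch, assuming a small hand-made DataFrame whose values are invented purely for illustration; only the 'method' column name is taken from the example above.

```python
import pandas as pd

# Small stand-in for the planets dataset (values are made up for illustration)
df = pd.DataFrame({
    'method': ['Transit', 'Imaging', 'Transit', 'Astrometry', 'Imaging'],
    'number': [1, 1, 2, 1, 3],
})

# One DataFrame per unique 'method' value; groupby sorts the keys by default
df_list = [d for _, d in df.groupby('method')]

# Keep the keys alongside the frames when the group label matters
df_dict = {g: d for g, d in df.groupby('method')}

print(len(df_list))   # number of groups
print(list(df_dict))  # group labels, in sorted order
```

Each sub-DataFrame keeps the row index of the original, so rows can still be traced back to df. A list is enough when only the pieces matter; the dict is more convenient when each piece has to be looked up by its label.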

    Alternatively

    • This is a manual way to create separate DataFrames, using pandas Boolean indexing.
    • It is similar to the accepted answer, but .loc is not required.
    • It is acceptable when only a couple of extra DataFrames are needed.
    • The pythonic way to create multiple objects is to place them in a container (e.g. dict, list, generator, etc.), as shown above.
    df1 = df[df.method == 'Astrometry']
    df2 = df[df.method == 'Eclipse Timing Variations']
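
To see how the manual approach relates to the container approach, the sketch below compares the two on a small hand-made DataFrame. The data and the names df_transit and mean_distance are hypothetical, chosen only for this illustration.

```python
import pandas as pd

# Toy data; the values are invented for illustration
df = pd.DataFrame({
    'method': ['Transit', 'Imaging', 'Transit', 'Astrometry'],
    'distance': [10.0, 45.5, 12.5, 20.8],
})

# Manual approach: one Boolean mask per value of interest
df_transit = df[df['method'] == 'Transit']

# Container approach: build every subset in a single pass
df_dict = {g: d for g, d in df.groupby('method')}

# Both approaches select the same rows and keep the original index
same = df_transit.equals(df_dict['Transit'])
print(same)

# With a container, the same operation can be applied to every subset in a loop
mean_distance = {g: d['distance'].mean() for g, d in df_dict.items()}
print(mean_distance)
```

With many groups (such as the 60 respondents in the question), the dict avoids writing one assignment per group and keeps the subsets easy to loop over.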
    
