Splitting dataframe into multiple dataframes

南方客 2020-11-22 01:16

I have a very large dataframe (around 1 million rows) with data from an experiment (60 respondents).

I would like to split the dataframe into 60 dataframes (a dataframe for each respondent).

11 Answers
  • 2020-11-22 01:33

    I had a similar problem. I had a time series of daily sales for 10 different stores and 50 different items. I needed to split the original dataframe into 500 dataframes (10 stores × 50 items) to apply Machine Learning models to each of them, and I couldn't do it manually.


    I created two lists: one for the dataframe names (df_names) and one for the [item, store] pairs (list_couple_s_i).

        # items and stores hold the unique item and store IDs in df
        items = df['item'].unique()
        stores = df['store'].unique()

        df_names = []
        for i in range(1, len(items) * len(stores) + 1):
            df_names.append('df' + str(i))

        list_couple_s_i = []
        for item in items:
            for store in stores:
                list_couple_s_i.append([item, store])

    Once the two lists are ready, you can loop over them to create the dataframes you want:

        for name, it_st in zip(df_names, list_couple_s_i):
            globals()[name] = df.where((df['item'] == it_st[0]) &
                                       (df['store'] == it_st[1]))
            globals()[name].dropna(inplace=True)
    

    In this way I have created 500 dataframes.
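
    As a side note (my addition, not part of the original answer), the same 500 dataframes can be collected in a dict keyed by (item, store) instead of 500 global names; a minimal sketch, assuming the same df, items and stores as above:

        # Sketch: one boolean-indexed sub-dataframe per (item, store) pair,
        # stored in a dict rather than in globals().
        df_by_pair = {
            (item, store): df[(df['item'] == item) & (df['store'] == store)]
            for item in items
            for store in stores
        }

        # e.g. the dataframe for one pair:
        # df_by_pair[(items[0], stores[0])]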

    Hope this will be helpful!

  • 2020-11-22 01:34

    A method based on a list comprehension and groupby, which stores all the split dataframes in a list variable; each can be accessed by its index.

    Example

    import pandas as pd

    # each element of ans is the sub-DataFrame for one unique value of column_name
    ans = [pd.DataFrame(y) for x, y in DF.groupby('column_name', as_index=False)]

    ans[0]              # the first group, as a DataFrame
    ans[0].column_name  # the grouping column within the first group
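
    For example (a self-contained sketch with made-up data, shaped like the OP's respondent scenario):

    import pandas as pd

    # Hypothetical toy data: six rows from three respondents.
    DF = pd.DataFrame({
        'respondent': [1, 1, 2, 2, 3, 3],
        'answer':     [5, 3, 4, 4, 2, 5],
    })

    ans = [y for _, y in DF.groupby('respondent', as_index=False)]

    print(len(ans))  # 3 -- one DataFrame per respondent
    print(ans[0])    # the rows where respondent == 1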
    
  • 2020-11-22 01:34
    • First, the method in the OP works, but isn't efficient. It may have seemed to run forever because the dataset was long.
    • Use .groupby on the 'method' column, and create a dict of DataFrames with unique 'method' values as the keys, using a dict-comprehension.
      • .groupby returns a groupby object that contains information about the groups, where g is the unique value in 'method' for each group, and d is the DataFrame for that group.
    • The value of each key in df_dict will be a DataFrame, which can be accessed in the standard way, df_dict['key'].
    • The original question wanted a list of DataFrames, which can be done with a list-comprehension:
      • df_list = [d for _, d in df.groupby('method')]
    import pandas as pd
    import seaborn as sns  # for test dataset
    
    # load data for example
    df = sns.load_dataset('planets')
    
    # display(df.head())
                method  number  orbital_period   mass  distance  year
    0  Radial Velocity       1         269.300   7.10     77.40  2006
    1  Radial Velocity       1         874.774   2.21     56.95  2008
    2  Radial Velocity       1         763.000   2.60     19.84  2011
    3  Radial Velocity       1         326.030  19.40    110.62  2007
    4  Radial Velocity       1         516.220  10.50    119.47  2009
    
    
    # Using a dict-comprehension, the unique 'method' value will be the key
    df_dict = {g: d for g, d in df.groupby('method')}
    
    print(df_dict.keys())
    [out]:
    dict_keys(['Astrometry', 'Eclipse Timing Variations', 'Imaging', 'Microlensing', 'Orbital Brightness Modulation', 'Pulsar Timing', 'Pulsation Timing Variations', 'Radial Velocity', 'Transit', 'Transit Timing Variations'])
    
    # or a specific name for the key, using enumerate (e.g. df1, df2, etc.)
    df_dict = {f'df{i}': d for i, (g, d) in enumerate(df.groupby('method'), 1)}
    
    print(df_dict.keys())
    [out]:
    dict_keys(['df1', 'df2', 'df3', 'df4', 'df5', 'df6', 'df7', 'df8', 'df9', 'df10'])
    
    • df_dict['df1'].head(3) or df_dict['Astrometry'].head(3)
    • There are only 2 rows in this group
             method  number  orbital_period  mass  distance  year
    113  Astrometry       1          246.36   NaN     20.77  2013
    537  Astrometry       1         1016.00   NaN     14.98  2010
    
    • df_dict['df2'].head(3) or df_dict['Eclipse Timing Variations'].head(3)
                           method  number  orbital_period  mass  distance  year
    32  Eclipse Timing Variations       1         10220.0  6.05       NaN  2009
    37  Eclipse Timing Variations       2          5767.0   NaN    130.72  2008
    38  Eclipse Timing Variations       2          3321.0   NaN    130.72  2008
    
    • df_dict['df3'].head(3) or df_dict['Imaging'].head(3)
         method  number  orbital_period  mass  distance  year
    29  Imaging       1             NaN   NaN     45.52  2005
    30  Imaging       1             NaN   NaN    165.00  2007
    31  Imaging       1             NaN   NaN    140.00  2004
    
    • For more information about the seaborn planets dataset, see NASA Exoplanets.

    Alternatively

    • This is a manual method to create separate DataFrames using pandas: Boolean Indexing
    • This is similar to the accepted answer, but .loc is not required.
    • This is an acceptable method for creating a couple of extra DataFrames.
    • The pythonic way to create multiple objects is by placing them in a container (e.g. dict, list, generator, etc.), as shown above; a generator variant is sketched after the code below.
    df1 = df[df.method == 'Astrometry']
    df2 = df[df.method == 'Eclipse Timing Variations']
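
    As an aside (my addition, not from the original answer), the generator option mentioned above can look like this; it yields each per-method group lazily instead of materializing them all:

    # Sketch: lazily iterate the groups instead of storing them.
    group_gen = (d for _, d in df.groupby('method'))

    for d in group_gen:
        # process each per-method DataFrame in turn
        print(d['method'].iat[0], d.shape[0])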
    
  • 2020-11-22 01:37

    You can use the groupby command if you already have labels for your data.

     out_list = [group[1] for group in in_series.groupby(label_series.values)]
    

    Here's a detailed example:

    Let's say we want to partition a pd.Series into a list of chunks using some labels. For example, in_series is:

    2019-07-01 08:00:00   -0.10
    2019-07-01 08:02:00    1.16
    2019-07-01 08:04:00    0.69
    2019-07-01 08:06:00   -0.81
    2019-07-01 08:08:00   -0.64
    Length: 5, dtype: float64
    

    And its corresponding label_series is:

    2019-07-01 08:00:00   1
    2019-07-01 08:02:00   1
    2019-07-01 08:04:00   2
    2019-07-01 08:06:00   2
    2019-07-01 08:08:00   2
    Length: 5, dtype: float64
    

    Run

    out_list = [group[1] for group in in_series.groupby(label_series.values)]
    

    which returns out_list, a list of two pd.Series:

    [2019-07-01 08:00:00   -0.10
    2019-07-01 08:02:00   1.16
    Length: 2, dtype: float64,
    2019-07-01 08:04:00    0.69
    2019-07-01 08:06:00   -0.81
    2019-07-01 08:08:00   -0.64
    Length: 3, dtype: float64]
    

    Note that you can also use properties of in_series itself to group the series, e.g., in_series.index.day:
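
    For instance (a sketch, assuming in_series has a DatetimeIndex spanning several days):

    # One pd.Series per calendar day of the index.
    daily_chunks = [group[1] for group in in_series.groupby(in_series.index.day)]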

  • 2020-11-22 01:38
    In [27]: import numpy as np; from pandas import DataFrame

    In [28]: df = DataFrame(np.random.randn(1000000,10))
    
    In [29]: df
    Out[29]: 
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1000000 entries, 0 to 999999
    Data columns (total 10 columns):
    0    1000000  non-null values
    1    1000000  non-null values
    2    1000000  non-null values
    3    1000000  non-null values
    4    1000000  non-null values
    5    1000000  non-null values
    6    1000000  non-null values
    7    1000000  non-null values
    8    1000000  non-null values
    9    1000000  non-null values
    dtypes: float64(10)
    
    In [30]: frames = [ df.iloc[i*60:min((i+1)*60,len(df))] for i in range(int(len(df)/60.) + 1) ]
    
    In [31]: %timeit [ df.iloc[i*60:min((i+1)*60,len(df))] for i in range(int(len(df)/60.) + 1) ]
    1 loops, best of 3: 849 ms per loop
    
    In [32]: len(frames)
    Out[32]: 16667
    

    Here's a groupby way (and you could do an arbitrary apply rather than sum; a sketch follows at the end of this answer):

    In [9]: g = df.groupby(lambda x: x // 60)  # integer division: one group per 60 rows
    
    In [8]: g.sum()    
    
    Out[8]: 
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 16667 entries, 0 to 16666
    Data columns (total 10 columns):
    0    16667  non-null values
    1    16667  non-null values
    2    16667  non-null values
    3    16667  non-null values
    4    16667  non-null values
    5    16667  non-null values
    6    16667  non-null values
    7    16667  non-null values
    8    16667  non-null values
    9    16667  non-null values
    dtypes: float64(10)
    

    Sum is cythonized, which is why this is so fast:

    In [10]: %timeit g.sum()
    10 loops, best of 3: 27.5 ms per loop
    
    In [11]: %timeit df.groupby(lambda x: x // 60)
    1 loops, best of 3: 231 ms per loop
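
    For example (my sketch, not part of the original timings), an arbitrary per-chunk function via apply, here the column-wise range of each 60-row chunk:

    In [12]: ranges = g.apply(lambda chunk: chunk.max() - chunk.min())  # one row per 60-row group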
    