Splitting a large Pandas Dataframe with minimal memory footprint

我寻月下人不归 2021-02-04 17:42

I have a large DataFrame, which I would like to split into a test set and a train set for model building. However, I do not want to duplicate the DataFrame because I am reaching

3 Answers
  • 2021-02-04 18:21

    The other answers focus on reading from a file, so here is something you can do if, for any reason, your DataFrame isn't read from a file.

    Maybe you can take a look at the code of the DataFrame.drop method and adapt it so that it modifies your DataFrame in place (which drop can already do) and also returns the rows that were removed:

    import pandas as pd
    import pandas.core.common as com  # internal pandas module; may change between versions

    class DF(pd.DataFrame):
        # Note: relies on private pandas methods (_get_axis*, _update_inplace).
        def drop(self, labels, axis=0, level=None, inplace=False, errors='raise'):
            axis = self._get_axis_number(axis)
            axis_name = self._get_axis_name(axis)
            axis, axis_ = self._get_axis(axis), axis

            if axis.is_unique:
                if level is not None:
                    if not isinstance(axis, pd.MultiIndex):
                        raise AssertionError('axis must be a MultiIndex')
                    new_axis = axis.drop(labels, level=level, errors=errors)
                else:
                    new_axis = axis.drop(labels, errors=errors)
                dropped = self.reindex(**{axis_name: new_axis})
                try:
                    dropped.axes[axis_].set_names(axis.names, inplace=True)
                except AttributeError:
                    pass
                result = dropped

            else:
                labels = com.index_labels_to_array(labels)
                if level is not None:
                    if not isinstance(axis, pd.MultiIndex):
                        raise AssertionError('axis must be a MultiIndex')
                    indexer = ~axis.get_level_values(level).isin(labels)
                else:
                    indexer = ~axis.isin(labels)

                slicer = [slice(None)] * self.ndim
                slicer[axis_] = indexer

                # .ix was removed from pandas; .loc accepts the boolean indexer
                result = self.loc[tuple(slicer)]

            # Rows (axis 0) or columns (axis 1) that are being removed
            removed = self.loc[labels] if axis_ == 0 else self.loc[:, labels]

            if inplace:
                self._update_inplace(result)
                return removed
            else:
                return result, removed

    

    It works like this:

    df = DF({'one':[1,2,3,4,5,4,3,2,1], 'two':[6,7,8,9,10,9,8,7,6], 'three':[11,12,13,14,15,14,13,12,11]})
    
    dropped = df.drop(range(5), inplace=True)
    # or :
    # partA, partB = df.drop(range(5))
    

    This example probably isn't very memory efficient, but maybe you can figure out something better using an object-oriented approach like this.
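
    A simpler variant of the same idea can be sketched with public pandas calls only: sample the test rows, then drop them from the original frame in place, so the only extra data held afterwards is the test subset. The function name `split_in_place`, the 20% fraction and the seed are assumptions for illustration. pandas may still allocate temporarily inside `drop`, so this does not rule out a transient memory peak.

```python
import numpy as np
import pandas as pd


def split_in_place(df, test_frac=0.2, seed=0):
    """Move a random fraction of rows out of df; df keeps the remainder."""
    rng = np.random.default_rng(seed)
    n_test = int(len(df) * test_frac)
    # Pick row positions for the test set without replacement.
    test_pos = np.sort(rng.choice(len(df), size=n_test, replace=False))
    test_df = df.iloc[test_pos].copy()
    # Remove those rows from the original frame (assumes a unique index).
    df.drop(index=test_df.index, inplace=True)
    return test_df


df = pd.DataFrame({'one': range(10), 'two': range(10, 20)})
test_df = split_in_place(df)
train_df = df  # same object, now holding only the training rows
```

    After the call, `df` and `train_df` are the same object, so no second copy of the training rows exists.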

  • 2021-02-04 18:36

    If you have room for one more column, you could add one holding a random value and then filter on it to select your test rows. Here I used a random integer that is either 0 or 1, which gives a roughly even split, but you could draw from another distribution if you wanted a different proportion.

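    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'one':[1,2,3,4,5,4,3,2,1], 'two':[6,7,8,9,10,9,8,7,6], 'three':[11,12,13,14,15,14,13,12,11]})
    # 0 or 1 per row, chosen at random; filter on this column to split
    df['split'] = np.random.randint(0, 2, size=len(df))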
    

    Of course, that requires space for an entirely new column, which you may not have if your data is very long.
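
    If even one extra column is too much, a possible alternative is to keep the random assignment in a separate boolean NumPy array (one byte per row) and pick the proportion with a threshold. This is only a sketch: `train_frac`, `is_train` and the seed are assumed names and values, and selecting with `df[is_train]` still creates a copy of the selected rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4, 5, 4, 3, 2, 1],
                   'two': [6, 7, 8, 9, 10, 9, 8, 7, 6],
                   'three': [11, 12, 13, 14, 15, 14, 13, 12, 11]})

train_frac = 0.8  # assumed proportion; adjust as needed
rng = np.random.default_rng(42)
# One boolean per row, kept outside the DataFrame.
is_train = rng.random(len(df)) < train_frac

train_df = df[is_train]   # rows where the mask is True
test_df = df[~is_train]   # the complement
```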

    Another option works if, for example, your data is in CSV format and you know the number of rows. Generate random row numbers much as above with randint, but pass the list to the skiprows argument of pandas read_csv():

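    import numpy as np
    import pandas as pd

    num_rows = 100000
    # Line 0 of the file is the header, so data rows are lines 1..num_rows
    all_rows = range(1, num_rows + 1)

    # Lines to skip when loading the training half (path is your CSV file)
    some = np.random.choice(all_rows, replace=False, size=num_rows // 2)
    some.sort()
    trainer_df = pd.read_csv(path, skiprows=some)

    # Skip every other data line to load the remaining half
    skipped = set(some.tolist())
    rest = [i for i in all_rows if i not in skipped]
    df = pd.read_csv(path, skiprows=rest)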
    

    It's a little clunky up front, especially with the loop in the list comprehension, and creating those lists in memory is unfortunate, but it should still be better memory-wise than creating an entire copy of half the data.

    To make it even more memory friendly you could load the trainer subset, train the model, then overwrite the training dataframe with the rest of the data, then apply the model. You'll be stuck carrying some and rest around, but you'll never have to load both halves of the data at the same time.
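
    The load, train, release, reload sequence could look roughly like the sketch below. It swaps the two row lists for a single boolean mask and a callable passed to `skiprows`, which is a different technique from the lists above. The demo file, the column `x`, and the helper `skip_unless` are assumptions made so the example runs on its own.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Demo file standing in for the real data (hypothetical path and column).
path = os.path.join(tempfile.mkdtemp(), 'data.csv')
num_rows = 1000
pd.DataFrame({'x': np.arange(num_rows)}).to_csv(path, index=False)

rng = np.random.default_rng(0)
# in_train[i] says whether data row i belongs to the training half.
in_train = np.zeros(num_rows, dtype=bool)
in_train[rng.choice(num_rows, size=num_rows // 2, replace=False)] = True


def skip_unless(mask, wanted):
    """Build a skiprows callable that keeps data rows where mask == wanted."""
    def skip(line):
        # File line 0 is the header; data row i sits on file line i + 1.
        if line == 0 or line > len(mask):
            return False
        return bool(mask[line - 1] != wanted)
    return skip


train_df = pd.read_csv(path, skiprows=skip_unless(in_train, True))
n_train = len(train_df)
# ... fit the model on train_df here ...
del train_df  # release the training half before loading the rest

test_df = pd.read_csv(path, skiprows=skip_unless(in_train, False))
```

    Only the mask stays in memory between the two reads, at one byte per row.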

  • 2021-02-04 18:43

    I would do something similar to @jeff-l, i.e. keep your data frame in a file on disk. When you read it in as a CSV, use the chunksize keyword. The following script illustrates this:

    import pandas
    import numpy

    test = 5
    m, n = 2*test, 3

    # Build a small demo frame and write it to disk.
    df = pandas.DataFrame(
        data=numpy.random.random((m, n))
    )

    df['test'] = [0] * test + [1] * test

    df.to_csv('tmp.csv', index=False)

    # Read the file back in pieces; only one chunk is in memory at a time.
    for chunk in pandas.read_csv('tmp.csv', chunksize=test):
        print(chunk)
        del chunk
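
    To turn the chunked read into an actual split, one option is to route the rows of each chunk into separate train and test files, so that neither half has to be in memory alongside the full data. This is a sketch under assumptions: the file names, the 20% test fraction, the chunk size and the seed are all illustrative.

```python
import os
import tempfile

import numpy as np
import pandas as pd

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'tmp.csv')
train_path = os.path.join(workdir, 'train.csv')
test_path = os.path.join(workdir, 'test.csv')

# Demo data standing in for the real file.
pd.DataFrame(np.random.random((100, 3))).to_csv(src, index=False)

rng = np.random.default_rng(0)
first = True
for chunk in pd.read_csv(src, chunksize=10):
    # Assign roughly 20% of the rows in this chunk to the test set.
    is_test = rng.random(len(chunk)) < 0.2
    # Append each part to its own file; write the header only once.
    chunk[~is_test].to_csv(train_path, mode='a', header=first, index=False)
    chunk[is_test].to_csv(test_path, mode='a', header=first, index=False)
    first = False

# Demo check only: count the rows that ended up in each file.
n_train = len(pd.read_csv(train_path))
n_test = len(pd.read_csv(test_path))
```

    Afterwards each file can be loaded on its own, for example the training file first and the test file once the model is fitted.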
    