I have a large DataFrame, which I would like to split into a test set and a train set for model building. However, I do not want to duplicate the DataFrame, because I am reaching my memory limit.
I would do something similar to @jeff-l, i.e. keep your DataFrame on disk as a file. When you read it back in as a CSV, use the chunksize keyword. The following script illustrates this:
import pandas
import numpy

# Build a small example frame and flag the first half of the rows as
# train (0) and the second half as test (1)
test = 5
m, n = 2 * test, 3
df = pandas.DataFrame(
    data=numpy.random.random((m, n))
)
df['test'] = [0] * test + [1] * test

# Write the frame to disk, then read it back in chunks rather than all at once
df.to_csv('tmp.csv', index=False)

for chunk in pandas.read_csv('tmp.csv', chunksize=test):
    print(chunk)
    del chunk
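To get an actual train/test split without ever holding the full frame in memory, you can route each chunk's rows into two separate files as you iterate. This is only a rough sketch built on the example above; the file names train.csv and test.csv and the reuse of the 'test' flag column are my own assumptions.

import pandas

# Sketch: stream tmp.csv in chunks and append each chunk's rows to
# train.csv or test.csv depending on the 'test' flag column created above.
first = True
for chunk in pandas.read_csv('tmp.csv', chunksize=5):
    train_rows = chunk[chunk['test'] == 0]
    test_rows = chunk[chunk['test'] == 1]
    # mode='a' appends, so only the first chunk writes the header row
    train_rows.to_csv('train.csv', mode='w' if first else 'a',
                      header=first, index=False)
    test_rows.to_csv('test.csv', mode='w' if first else 'a',
                     header=first, index=False)
    first = False

Each output file then grows chunk by chunk, so peak memory stays at roughly one chunk's worth of rows.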