Splitting a large Pandas DataFrame with minimal memory footprint

我寻月下人不归 2021-02-04 17:42

I have a large DataFrame, which I would like to split into a test set and a train set for model building. However, I do not want to duplicate the DataFrame, because I am reaching the limits of my available memory.

3 Answers
  •  醉酒成梦
    2021-02-04 18:43

    I would do something similar to @jeff-l, i.e. keep your DataFrame on disk. When you read it back in as a CSV, use the chunksize keyword so only one chunk is in memory at a time. The following script illustrates this:

    import pandas
    import numpy
    
    test = 5
    m, n = 2*test, 3
    
    df = pandas.DataFrame(
        data=numpy.random.random((m, n))
    )
    
    df['test'] = [0] * test + [1] * test 
    
    df.to_csv('tmp.csv', index=False)
    
    for chunk in pandas.read_csv('tmp.csv', chunksize=test):
        print(chunk)  # process each chunk here, then release it
        del chunk
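
    Building on the chunked read above, here is a sketch (my own illustration, not from the thread) of how the same streaming approach could route rows into separate train and test files, so that only one chunk is ever held in memory. The "every 5th row goes to the test set" rule and the file names train.csv / test.csv are assumptions for the example; a random mask would work the same way:

    ```python
    import numpy
    import pandas

    # Recreate a small example CSV on disk, as in the answer above.
    m, n = 10, 3
    pandas.DataFrame(numpy.random.random((m, n))).to_csv('tmp.csv', index=False)

    first_train = True   # write the header only on the first write
    first_test = True
    offset = 0           # running row index across chunks

    for chunk in pandas.read_csv('tmp.csv', chunksize=4):
        idx = numpy.arange(offset, offset + len(chunk))
        offset += len(chunk)
        is_test = (idx % 5 == 0)  # deterministic ~20% test split (assumption)

        train_rows = chunk[~is_test]
        test_rows = chunk[is_test]

        if len(train_rows):
            train_rows.to_csv('train.csv', index=False,
                              mode='w' if first_train else 'a',
                              header=first_train)
            first_train = False
        if len(test_rows):
            test_rows.to_csv('test.csv', index=False,
                             mode='w' if first_test else 'a',
                             header=first_test)
            first_test = False
    ```

    The resulting train.csv and test.csv can then each be read (chunked again, if needed) for model building, so the full DataFrame is never duplicated in memory.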
    
