I have a large DataFrame, which I would like to split into a test set and a train set for model building. However, I do not want to duplicate the DataFrame, because I am reaching my memory limit.
I would do something similar to @jeff-l, i.e. keep your DataFrame on disk as a file. When you read it back in as a CSV, use the chunksize keyword. The following script illustrates this:
import pandas
import numpy

# Build a small example frame and flag the first half of the rows as
# train (0) and the second half as test (1)
test = 5
m, n = 2 * test, 3
df = pandas.DataFrame(
    data=numpy.random.random((m, n))
)
df['test'] = [0] * test + [1] * test

# Write the frame to disk, then read it back in chunks rather than all at once
df.to_csv('tmp.csv', index=False)

for chunk in pandas.read_csv('tmp.csv', chunksize=test):
    print(chunk)
    del chunk
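To get an actual train/test split without ever holding the full frame in memory, you can route each chunk's rows into two separate files as you iterate. This is only a rough sketch built on the example above; the file names train.csv and test.csv and the reuse of the 'test' flag column are my own assumptions.

import pandas

# Sketch: stream tmp.csv in chunks and append each chunk's rows to
# train.csv or test.csv depending on the 'test' flag column created above.
first = True
for chunk in pandas.read_csv('tmp.csv', chunksize=5):
    train_rows = chunk[chunk['test'] == 0]
    test_rows = chunk[chunk['test'] == 1]
    # mode='a' appends, so only the first chunk writes the header row
    train_rows.to_csv('train.csv', mode='w' if first else 'a',
                      header=first, index=False)
    test_rows.to_csv('test.csv', mode='w' if first else 'a',
                     header=first, index=False)
    first = False

Each output file then grows chunk by chunk, so peak memory stays at roughly one chunk's worth of rows.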