I am trying to read a large CSV file (approx. 6 GB) in pandas and I am getting a memory error:
MemoryError Traceback (most recen
The above answer already covers the topic. Still, if you need all the data in memory, have a look at bcolz. It compresses the data in memory. I have had really good experience with it, but it is missing a lot of pandas features.
Edit: I got compression rates of around 1/10 of the original size, I think, depending of course on the kind of data. The important missing features were aggregates.
I proceeded like this:
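import pandas as pd

chunks = pd.read_table('aphro.csv', chunksize=1000000, sep=';',
                       names=['lat', 'long', 'rf', 'date', 'slno'], index_col='slno',
                       header=None, parse_dates=['date'])
# Aggregate each chunk on its own so only one chunk is in memory at a time
# (in IPython, prefix the next line with %time to time it).
df = pd.concat(chunk.groupby(['lat', 'long', chunk['date'].dt.year])['rf'].agg(['sum']) for chunk in chunks)
# A group can span several chunks, so combine the partial sums.
df = df.groupby(level=[0, 1, 2]).sum()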
You can try sframe, which has a syntax similar to pandas but lets you manipulate files that are bigger than your RAM.
Chunking shouldn't always be the first port of call for this problem.
Is the file large due to repeated non-numeric data or unwanted columns?
If so, you can sometimes see massive memory savings by reading in columns as categories and selecting only the required columns via the pd.read_csv usecols parameter.
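As a rough illustration of the point above, here is a minimal sketch that reads the same data twice: once with every column and default dtypes, and once with only the needed columns and the repetitive string columns stored as categories. The column names and sample data are made up for the example; the actual savings depend entirely on your file.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the real file.
rows = [("London", "open"), ("Paris", "closed"), ("London", "closed")] * 1000
csv_text = "city,status,value,notes\n" + "".join(
    f"{city},{status},{i},free text {i}\n"
    for i, (city, status) in enumerate(rows)
)

# Baseline: every column, default dtypes.
full = pd.read_csv(io.StringIO(csv_text))

# Only the needed columns, with repetitive strings stored as categories.
slim = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["city", "status", "value"],
    dtype={"city": "category", "status": "category"},
)

# deep=True counts the memory held by the string values themselves.
full_bytes = full.memory_usage(deep=True).sum()
slim_bytes = slim.memory_usage(deep=True).sum()
print(f"all columns: {full_bytes:,} bytes; slim: {slim_bytes:,} bytes")
```

With a real file you would pass the path instead of io.StringIO(...). Categories help most when a column has few distinct values relative to its length.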
Does your workflow require slicing, manipulating, exporting?
If so, you can use dask.dataframe to slice, perform your calculations and export iteratively. Chunking is performed silently by dask, which also supports a subset of the pandas API.
If all else fails, read line by line via chunks.
Chunk via pandas or via csv library as a last resort.
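For the last-resort case, a minimal sketch of chunking with only the standard csv library might look like this. The helper name iter_chunks and the sample data are made up for the example; the idea is simply to pull a fixed number of rows at a time from the reader so the whole file never sits in memory.

```python
import csv
import io
from itertools import islice


def iter_chunks(lines, chunk_size):
    """Yield lists of up to chunk_size rows (as dicts) from an iterable of CSV lines."""
    reader = csv.DictReader(lines)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        yield chunk


# Hypothetical data; with a real file, pass open(path, newline="") instead.
sample = io.StringIO("id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(10)))

total = 0
chunk_sizes = []
for chunk in iter_chunks(sample, chunk_size=4):
    chunk_sizes.append(len(chunk))
    # Values arrive as strings, so convert before doing arithmetic.
    total += sum(int(row["value"]) for row in chunk)

print(chunk_sizes, total)
```

Only the running total is kept between chunks, so memory use is bounded by chunk_size rather than by the file size.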
Solution 1:
Using pandas with large data
Solution 2:
import pandas as pd

TextFileReader = pd.read_csv(path, chunksize=1000)  # the number of rows per chunk
dfList = []
for df in TextFileReader:
    dfList.append(df)
# Note: the concatenated result still has to fit in memory.
df = pd.concat(dfList, sort=False)
Before using the chunksize option, if you want to check the processing function you plan to run inside the chunking for-loop (as mentioned by @unutbu), you can simply use the nrows option to load a small sample.
small_df = pd.read_csv(filename, nrows=100)
Once you are sure that the processing block is ready, you can put it in the chunking for-loop and run it over the entire file.
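The workflow above could be sketched roughly as follows. The process function, column names and sample data are hypothetical; the point is that the same function is first tried on a small nrows sample and then reused unchanged for each chunk.

```python
import io

import pandas as pd

# Hypothetical data standing in for the large file.
csv_text = "key,value\n" + "".join(f"{'ab'[i % 2]},{i}\n" for i in range(1000))


def process(frame):
    """Hypothetical per-chunk step: sum `value` per `key`."""
    return frame.groupby("key")["value"].sum()


# 1. Prototype on the first 100 rows only.
small_df = pd.read_csv(io.StringIO(csv_text), nrows=100)
print(process(small_df))

# 2. Reuse the same function for every chunk.
partials = [
    process(chunk)
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=250)
]

# 3. A key can appear in several chunks, so combine the partial sums.
result = pd.concat(partials).groupby(level=0).sum()
print(result)
```

This only works directly for operations whose partial results can be combined afterwards (sums, counts, min/max); something like a median would need a different approach.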