Memory leak in python pandas reshuffling index

Submitted by 大兔子大兔子 on 2021-01-28 00:28:34

Question


I have a memory leak in my code, which reads a CSV that is too large for memory into pandas. I use chunksize to iterate, but the amount of memory in use grows by the size of a chunk on every iteration. Even after I interrupt the process and clear the namespace, the python process in my task manager is still holding on to n × the chunk size, where n is the number of iterations that finished. Does anyone know which step in the loop creates something in memory that doesn't get removed? And if so, how do I forcibly remove it?

import sys

import numpy as np
import pandas as pd
import pymysql

conn = pymysql.connect(host='localhost', port=3306, user='root', passwd='******', db='')
curr = conn.cursor()
curr.execute('CREATE DATABASE IF NOT EXISTS addclick')
curr.execute('USE addclick')

# Read the CSV lazily, 100,000 rows at a time.
datachunks = pd.read_csv('train.csv', chunksize=int(1e5))
i = 0
print('Start loading main database. This may take a while. Chunks:')
for chunk in datachunks:
    i += 1
    print(i)
    sys.stdout.flush()
    # Shuffle the chunk, then split off the first 10,000 rows for validation.
    shuffle = chunk.reindex(np.random.permutation(chunk.index))
    validationchunk = shuffle.iloc[:int(1e4)]
    # flavor='mysql' existed in the pandas of that era; current pandas
    # requires an SQLAlchemy engine instead of a raw DBAPI connection.
    validationchunk.to_sql('validation', conn, if_exists='append', flavor='mysql', index=False)
    trainchunk = shuffle.iloc[int(1e4):]
    trainchunk.to_sql('train', conn, if_exists='append', flavor='mysql', index=False)

The goal is to split the CSV file into a training set and a validation set, and to put them in an SQL database for easier access to aggregates.
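
For reference, one way to narrow down which line is responsible is line-by-line memory profiling with the memory_profiler package (the same tool the answer below uses through %memit). A minimal sketch, dropping the SQL writes and assuming train.csv is on disk; the decorator prints per-line memory usage when the script runs:

import numpy as np
import pandas as pd
from memory_profiler import profile  # pip install memory_profiler

@profile  # reports memory usage line by line for each call
def load_chunks(path='train.csv'):
    for chunk in pd.read_csv(path, chunksize=int(1e5)):
        # Same operations as the loop above, minus the database writes.
        shuffle = chunk.reindex(np.random.permutation(chunk.index))
        validationchunk = shuffle.iloc[:int(1e4)]
        trainchunk = shuffle.iloc[int(1e4):]

if __name__ == '__main__':
    load_chunks()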


Answer 1:


So, assuming you are using pandas >= 0.15.0: I think np.random.permutation is mutating, in place, the index of the frame you are shuffling. That is a no-no, as pandas indices are meant to be immutable. You can watch the memory grow across repeated runs. (The sessions below assume import numpy as np, from pandas import DataFrame, and memory_profiler loaded so that %memit is available.)

In [1]: df = DataFrame(np.random.randn(10000))

In [2]: def f(df):
   ...:     for dfi in np.array_split(df,100):
   ...:         shuffle = dfi.reindex(np.random.permutation(dfi.index))
   ...:         one = shuffle.iloc[:50]
   ...:         two = shuffle.iloc[50:]
   ...:         

In [3]: %memit f(df)
peak memory: 76.64 MiB, increment: 1.47 MiB

In [4]: %memit f(df)
peak memory: 77.07 MiB, increment: 0.43 MiB

In [5]: %memit f(df)
peak memory: 77.48 MiB, increment: 0.41 MiB

In [6]: %memit f(df)
peak memory: 78.09 MiB, increment: 0.61 MiB

In [7]: %memit f(df)
peak memory: 78.49 MiB, increment: 0.40 MiB

In [8]: %memit f(df)
peak memory: 78.79 MiB, increment: 0.27 MiB

So get the values out instead: dfi.index.values returns a plain ndarray, and an ndarray can be permuted freely without touching the Index itself.

In [9]: def f2(df):
   ...:     for dfi in np.array_split(df,100):
   ...:         shuffle = dfi.reindex(np.random.permutation(dfi.index.values))
   ...:         one = shuffle.iloc[:50]
   ...:         two = shuffle.iloc[50:]
   ...:         

In [10]: %memit f2(df)
peak memory: 78.79 MiB, increment: 0.00 MiB

In [11]: %memit f2(df)
peak memory: 78.79 MiB, increment: 0.00 MiB

In [12]: %memit f2(df)
peak memory: 78.79 MiB, increment: 0.00 MiB

In [13]: %memit f2(df)
peak memory: 78.79 MiB, increment: 0.00 MiB

In [14]: %memit f2(df)
peak memory: 78.80 MiB, increment: 0.00 MiB

In [15]: %memit f2(df)
peak memory: 78.80 MiB, increment: 0.00 MiB

Not really sure who is at fault here (e.g. whether np.random.permutation should guarantee not to mutate its argument, or whether the Index should protect itself against in-place mutation).
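
Applied back to the question's loop, the fix is just to permute chunk.index.values instead of chunk.index. A sketch of the amended loop, under the same connection setup as the original code (and note that newer pandas also offers chunk.sample(frac=1) as a shuffle):

for chunk in pd.read_csv('train.csv', chunksize=int(1e5)):
    # Permute a plain ndarray of labels so the chunk's Index is never
    # mutated and each chunk can be garbage-collected after its iteration.
    shuffle = chunk.reindex(np.random.permutation(chunk.index.values))
    validationchunk = shuffle.iloc[:int(1e4)]
    validationchunk.to_sql('validation', conn, if_exists='append', flavor='mysql', index=False)
    trainchunk = shuffle.iloc[int(1e4):]
    trainchunk.to_sql('train', conn, if_exists='append', flavor='mysql', index=False)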



Source: https://stackoverflow.com/questions/27074469/memory-leak-in-python-pandas-reshuffling-index
