Why am I leaking memory with this python loop?

Submitted 2019-12-20 12:24:04

Question


I am writing a custom file system crawler, which gets passed millions of globs to process through sys.stdin. I'm finding that when running the script, its memory usage increases massively over time and the whole thing crawls practically to a halt. I've written a minimal case below which shows the problem. Am I doing something wrong, or have I found a bug in Python / the glob module? (I am using python 2.5.2).


#!/usr/bin/env python
import glob
import sys
import gc

previous_num_objects = 0

for count, line in enumerate(sys.stdin):
    # Expand one glob pattern per input line.
    glob_result = glob.glob(line.rstrip('\n'))

    # Count every object the garbage collector is tracking.
    current_num_objects = len(gc.get_objects())
    new_objects = current_num_objects - previous_num_objects

    print "(%d) This: %d, New: %d, Python Garbage: %d, Python Collection Counts: %s" \
        % (count, current_num_objects, new_objects, len(gc.garbage), gc.get_count())
    previous_num_objects = current_num_objects

The output looks like:

(0) This: 4042, New: 4042, Python Garbage: 0, Python Collection Counts: (660, 5, 0)
(1) This: 4061, New: 19, Python Garbage: 0, Python Collection Counts: (90, 6, 0)
(2) This: 4064, New: 3, Python Garbage: 0, Python Collection Counts: (127, 6, 0)
(3) This: 4067, New: 3, Python Garbage: 0, Python Collection Counts: (130, 6, 0)
(4) This: 4070, New: 3, Python Garbage: 0, Python Collection Counts: (133, 6, 0)
(5) This: 4073, New: 3, Python Garbage: 0, Python Collection Counts: (136, 6, 0)
(6) This: 4076, New: 3, Python Garbage: 0, Python Collection Counts: (139, 6, 0)
(7) This: 4079, New: 3, Python Garbage: 0, Python Collection Counts: (142, 6, 0)
(8) This: 4082, New: 3, Python Garbage: 0, Python Collection Counts: (145, 6, 0)
(9) This: 4085, New: 3, Python Garbage: 0, Python Collection Counts: (148, 6, 0)

Every 100th iteration, 100 objects are freed, so len(gc.get_objects()) increases by 200 every 100 iterations. len(gc.garbage) never changes from 0. The 2nd generation collection count increases slowly, while the 0th and 1st counts go up and down.


Answer 1:


I tracked this down to the fnmatch module. glob.glob calls fnmatch to actually perform the globbing, and fnmatch has a cache of regular expressions which is never cleared. So in this usage, the cache was growing continuously and unchecked. I've filed a bug against the fnmatch library [1].

[1]: http://bugs.python.org/issue7846




Answer 2:


I cannot reproduce any actual leak on my system, but I think your "every 100th iteration, 100 objects are freed" is caused by hitting the cache for compiled regular expressions (via the glob module). If you peek at re.py you'll see that _MAXCACHE defaults to 100, and by default the entire cache is cleared once it reaches that size (in _compile). If you call re.purge() before your gc calls you will probably see that effect go away.

(Note: I'm only suggesting re.purge() here to check whether that cache is affecting your gc results; it should not be necessary in your actual code.)

I doubt that fixes your massive memory increase problem though.
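The re-cache effect this answer describes can be observed directly. re.purge() is a public function that empties the module's compiled-pattern cache; the _cache attribute inspected below is a CPython implementation detail (its size limit and eviction policy have changed across versions), so this is a diagnostic sketch, not something to rely on in real code:

```python
import re

re.purge()                      # public API: start from an empty pattern cache

# Compile a handful of distinct patterns; each lands in re's internal cache.
for i in range(5):
    re.match(r'pattern_%d' % i, 'pattern_0')

cached_after = len(re._cache)   # CPython detail: grew by at least the 5 patterns
re.purge()                      # empty the cache again

print(cached_after, len(re._cache))
```

This is why sprinkling re.purge() into the measurement loop makes the periodic "100 objects freed" blips disappear: the counts were tracking re's cache churn, not a leak in the user's code.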



Source: https://stackoverflow.com/questions/2184063/why-am-i-leaking-memory-with-this-python-loop
