I'm writing a spider in Python, using the lxml library for parsing HTML and the gevent library for async I/O. I found that after some time of work the lxml parser starts eating up memory.
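The structure is roughly like this; a stripped-down sketch rather than my actual code, with the URL list and function name just placeholders:

    import gevent.monkey
    gevent.monkey.patch_all()  # must run before urllib is imported

    import urllib.request

    import gevent
    from lxml import html

    def crawl(url):
        # fetch a page and pull out the outgoing links
        data = urllib.request.urlopen(url).read()
        doc = html.fromstring(data)
        return doc.xpath("//a/@href")

    # many concurrent greenlets, each building an lxml tree;
    # memory keeps growing as they run
    jobs = [gevent.spawn(crawl, u) for u in ["http://example.com"] * 100]
    gevent.joinall(jobs)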
There is an excellent article at http://www.lshift.net/blog/2008/11/14/tracing-python-memory-leaks which demonstrates graphical debugging of memory structures; this might help you figure out what's not being released and why.
Edit: I found the article from which I got that link - Python memory leaks
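To give the flavour of it, here is a minimal sketch of that kind of inspection using the objgraph package; whether this maps exactly onto the article's own code is an assumption on my part, and "_ElementTree" is simply lxml's tree type. Run it inside your spider after it has processed a batch of pages:

    import gc

    import objgraph  # pip install objgraph

    gc.collect()  # discard anything merely awaiting collection
    objgraph.show_most_common_types(limit=10)  # which types are piling up?

    # If lxml trees dominate, draw the reference chains keeping one
    # alive (writes a PNG via graphviz):
    trees = objgraph.by_type("_ElementTree")
    if trees:
        objgraph.show_backrefs(trees[:1], max_depth=5, filename="backrefs.png")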
You might be keeping some references which keep the documents alive. Be careful with string results from XPath evaluation, for example: by default they are "smart" strings, which provide access to the containing element and therefore keep the whole tree in memory as long as you hold a reference to them. See the docs on XPath return values:
There are certain cases where the smart string behaviour is undesirable. For example, it means that the tree will be kept alive by the string, which may have a considerable memory impact in the case that the string value is the only thing in the tree that is actually of interest. For these cases, you can deactivate the parental relationship using the keyword argument smart_strings.
(I have no idea if this is the problem in your case, but it's a candidate. I've been bitten by this myself once ;-))
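To make this concrete, here is a small self-contained demonstration; the document and the XPath expression are of course just placeholders:

    from lxml import html

    doc = html.fromstring("<html><body><p>hello</p></body></html>")

    # Default: "smart" strings hold a reference back to their parent
    # element, and therefore to the whole tree.
    smart = doc.xpath("//p/text()")[0]
    print(smart.getparent().tag)  # 'p' -- the tree stays reachable

    # With smart_strings=False the results are plain strings, so the
    # tree can be garbage-collected once `doc` goes out of scope.
    plain = doc.xpath("//p/text()", smart_strings=False)[0]
    print(type(plain))  # <class 'str'>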
It seems the issue stems from the library lxml relies on: libxml2, which is written in C. Here is the first report: http://codespeak.net/pipermail/lxml-dev/2010-December/005784.html This bug hasn't been mentioned in either the lxml v2.3 bug fix logs or the libxml2 change logs.
Oh, there are follow-up mails here: https://bugs.launchpad.net/lxml/+bug/728924
Well, I tried to reproduce the issue but got nothing abnormal. Anyone who can reproduce it may help to clarify the problem.
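A loop along these lines is one way to check; this is a sketch, not the exact code I ran, and note that the resource module is Unix-only and ru_maxrss only ever grows, which is fine for spotting a leak:

    import resource

    from lxml import html

    PAGE = "<html><body>" + "<p>x</p>" * 1000 + "</body></html>"

    for i in range(10000):
        doc = html.fromstring(PAGE)
        doc.xpath("//p/text()")
        if i % 1000 == 0:
            # ru_maxrss is reported in KiB on Linux, bytes on macOS
            rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            print(i, rss)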