I was trying to process several web pages with BeautifulSoup4 in Python 2.7.3, but after every parse the memory usage goes up and up. A simplified version of the code reproduces the same behaviour.
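Roughly, the loop looks like this (a minimal sketch reconstructed from the description and the answers below, not the exact original code; the filename index.html and the raw_input() pause are assumptions):

from bs4 import BeautifulSoup

def parse():
    # each call builds a fresh parse tree; memory keeps growing between calls
    f = open("index.html", "r")
    page = BeautifulSoup(f.read())
    f.close()

while True:
    parse()
    raw_input()  # pause so memory usage can be inspected between iterations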
Garbage collection is probably viable, but a context manager seems to handle it pretty well for me without any extra memory usage:
from bs4 import BeautifulSoup as soup

def parse():
    # the with-block closes the file, and the tree in `page` goes out of
    # scope as soon as parse() returns
    with open('testque.xml') as fh:
        page = soup(fh.read())
Also, though not strictly necessary, if you're using raw_input to keep it looping while you test, I find this idiom quite useful:
while not raw_input():
    parse()
It'll continue to loop every time you hit enter, but as soon as you enter any non-empty string it'll stop for you.
Try garbage collecting:
from bs4 import BeautifulSoup
import gc

def parse():
    f = open("index.html", "r")
    page = BeautifulSoup(f.read(), "lxml")
    # drop the only reference to the tree and force a collection pass
    page = None
    gc.collect()
    f.close()

while True:
    parse()
    raw_input()
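If you want to check that the collection pass really reclaims the tree, one rough way to do it (an illustrative sketch, not part of the original answer) is to count the live bs4 objects before and after calling parse():

import gc

def count_bs4_objects():
    # rough count of live objects whose type comes from the bs4 package
    return sum(1 for obj in gc.get_objects()
               if getattr(type(obj), "__module__", "").startswith("bs4"))

print(count_bs4_objects())  # near zero before any parsing
parse()                     # parse() above drops its reference and collects
print(count_bs4_objects())  # should stay low if the tree was reclaimed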
See also:
Python garbage collection
I know this is an old thread, but there's one more thing to keep in mind when parsing pages with BeautifulSoup: when navigating the tree and storing a specific value, make sure you store the string and not a bs4 object. For instance, this caused a memory leak when used in a loop:
category_name = table_data.find('a').contents[0]
This can be fixed by changing it to:
category_name = str(table_data.find('a').contents[0])
In the first example the type of category_name is bs4.element.NavigableString, which keeps a reference back into the parse tree and so keeps the whole tree alive; calling str() gives you a plain string with no such reference.
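A quick way to see why this matters (an illustrative sketch, not part of the original answer; the markup is made up) is to inspect the object and the parent link it keeps:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<td><a href='#'>Widgets</a></td>", "lxml")
value = soup.find('a').contents[0]
print(type(value))        # <class 'bs4.element.NavigableString'>
print(value.parent.name)  # 'a' -- the string still points back into the tree
detached = str(value)     # a plain string with no reference to the parse tree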
When you're done working with each file, try Beautiful Soup's decompose functionality, which destroys the tree:
from bs4 import BeautifulSoup

def parse():
    f = open("index.html", "r")
    page = BeautifulSoup(f.read(), "lxml")
    # page extraction goes here
    page.decompose()  # destroy the tree and release the memory it holds
    f.close()

while True:
    parse()
    raw_input()