Python high memory usage with BeautifulSoup

Asked by 后悔当初 on 2020-12-19 03:24

I was trying to process several web pages with BeautifulSoup4 in Python 2.7.3, but after every parse the memory usage climbs and is never released.

This simplified code reproduces the problem.
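The snippet itself was cut off in this copy; below is a minimal reconstruction, inferred from the code repeated in the answers (the file name index.html and the raw_input pacing are taken from there):

    from bs4 import BeautifulSoup

    def parse():
        f = open("index.html", "r")
        page = BeautifulSoup(f.read())  # memory grows after each call
        f.close()

    while True:
        parse()
        raw_input()  # pause after each parse so memory usage can be inspected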

4 Answers
  • 2020-12-19 03:59

    Garbage collection is probably viable, but a context manager seems to handle it pretty well for me without any extra memory usage:

    from bs4 import BeautifulSoup as soup

    def parse():
      # the tree is only referenced by the local `page`, so it becomes
      # collectable as soon as parse() returns
      with open('testque.xml') as fh:
        page = soup(fh.read())
    

    Also, though not strictly necessary: if you're using raw_input to pace the loop while you test, I find this idiom quite useful:

    while not raw_input():
      parse()
    

    It will loop every time you hit Enter, and stop as soon as you type any non-empty string.
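
    To check the "no extra memory" claim yourself, you can print the process's peak resident set size between iterations. A standalone sketch using the standard-library resource module (Unix only; ru_maxrss is reported in kilobytes on Linux and bytes on macOS):

    import resource
    from bs4 import BeautifulSoup as soup

    def parse():
        with open('testque.xml') as fh:
            page = soup(fh.read())

    while not raw_input():
        parse()
        # peak RSS so far; if it climbs on every iteration,
        # the parse trees are not being released
        print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)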

  • 2020-12-19 04:03

    Try garbage collecting:

    from bs4 import BeautifulSoup
    import gc

    def parse():
        f = open("index.html", "r")
        page = BeautifulSoup(f.read(), "lxml")
        f.close()
        page = None   # drop the only reference to the tree
        gc.collect()  # collect anything caught in reference cycles right away

    while True:
        parse()
        raw_input()
    

    See also:

    Python garbage collection
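
    As a side note, gc.collect() returns the number of unreachable objects it found, so you can check whether the parse trees are actually what is being reclaimed (a standalone sketch):

    import gc
    from bs4 import BeautifulSoup

    page = BeautifulSoup(open("index.html").read(), "lxml")
    page = None
    print("unreachable objects found: %d" % gc.collect())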

  • 2020-12-19 04:13

    I know this is an old thread, but there's one more thing to keep in mind when parsing pages with BeautifulSoup: when navigating the tree and storing a specific value, make sure you store the string and not a bs4 object. For instance, this caused a memory leak when used in a loop:

    category_name = table_data.find('a').contents[0]
    

    Which could be fixed by changing it into:

    category_name = str(table_data.find('a').contents[0])
    

    In the first example, category_name is a bs4.element.NavigableString. A NavigableString carries a reference back to its parent tree, so keeping one alive keeps the entire parsed document alive; calling str() copies the text out as a plain string with no such back-reference.
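
    A quick way to see the difference (a standalone sketch; the HTML snippet and table_data are made up for illustration):

    from bs4 import BeautifulSoup

    html = "<table><tr><td><a href='#'>Books</a></td></tr></table>"
    table_data = BeautifulSoup(html, "lxml").find("td")

    leaky = table_data.find("a").contents[0]
    print(type(leaky))        # bs4.element.NavigableString
    print(leaky.parent.name)  # 'a' -- still wired into the full tree

    safe = str(table_data.find("a").contents[0])
    print(type(safe))         # plain str, no reference to the tree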

  • 2020-12-19 04:18

    Try Beautiful Soup's decompose() functionality, which destroys the tree when you're done working with each file.

    from bs4 import BeautifulSoup

    def parse():
        f = open("index.html", "r")
        page = BeautifulSoup(f.read(), "lxml")
        # page extraction goes here
        page.decompose()  # explicitly break the tree apart and free its nodes
        f.close()
    
    while True:
        parse()
        raw_input()
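
    For completeness, the same idea combined with the context manager from the first answer (a sketch; the path parameter and the title extraction are illustrative, and copying values out as plain strings before decomposing follows the NavigableString advice above):

    from bs4 import BeautifulSoup

    def parse(path):
        with open(path) as fh:
            page = BeautifulSoup(fh.read(), "lxml")
            # copy anything you need out as plain strings first...
            title = str(page.title.string) if page.title else None
            page.decompose()  # ...then tear the tree down explicitly
        return title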
    