How to avoid duplicate download URLs in my Python spider program?

忘掉有多难 2021-01-24 23:05

I wrote a spider program in Python that recursively crawls web pages. I want to avoid downloading the same page twice, so I store the URLs I have already seen in a list, as follows:

urls = []
def download(mainPage):  # mainPage is a link
    global urls
    links = getHrefLinks(mainPage)  # get all links on the page
    for l in links:
        if l not in urls:
            urls.append(l)
            downPage(l)

2 Answers
  •  北恋 2021-01-24 23:46

    You can make urls a set instead of a list:

    urls = set()
    def download(mainPage):  # mainPage is a link
        global urls
        links = getHrefLinks(mainPage)  # all links found on the page
        for l in links:
            if l not in urls:
                urls.add(l)  # set.add() instead of list.append()
                downPage(l)
    

    Membership tests on a set (x in s) run in O(1) time on average, whereas the same test on a list is O(n) on average, so the set stays fast as the number of stored URLs grows.
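
    For context, here is a minimal, self-contained sketch of the same idea. The question does not show getHrefLinks or downPage, so the versions below are placeholder assumptions (a regex-based link extractor and a print stub) added only to make the example runnable:

    import re
    import urllib.request

    urls = set()  # every URL we have already queued for download

    def getHrefLinks(page_url):
        # Placeholder link extractor (assumption, not the OP's code):
        # fetch the page and pull absolute href values out with a regex.
        html = urllib.request.urlopen(page_url).read().decode("utf-8", errors="ignore")
        return re.findall(r'href="(http[^"]+)"', html)

    def downPage(page_url):
        # Placeholder "download" step (assumption, not the OP's code).
        print("downloading", page_url)

    def download(mainPage):
        global urls
        for l in getHrefLinks(mainPage):
            if l not in urls:        # O(1) average-case membership test
                urls.add(l)
                downPage(l)
                # download(l)        # uncomment to keep crawling recursively

    # Example usage:
    # download("http://example.com")

    Note that real crawlers usually also normalize URLs (for example, stripping fragments) before adding them to the set; otherwise http://example.com/page and http://example.com/page#top are treated as different pages.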
