How to avoid duplicate download URLs in my Python spider program?

忘掉有多难 · 2021-01-24 23:05

I wrote a spider program in Python that can recursively crawl web pages. I want to avoid downloading the same page twice, so I store the URLs in a list as follows:

urls = []
def download(mainPage):  # mainPage is a link
    global urls
    links = getHrefLinks(mainPage)
    for l in links:
        if l not in urls:
            urls.append(l)
            downPage(l)

2 Answers
  • 2021-01-24 23:46

    You can make urls into a set:

    urls = set()
    def download(mainPage):  # mainPage is a link
        global urls
        links = getHrefLinks(mainPage)
        for l in links:
            if l not in urls:
                urls.add(l)  # add() instead of list.append()
                downPage(l)
    

    Membership tests on a set (x in s) take O(1) time on average, which is better than the O(n) average case for a list.
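
    A quick way to see the difference is to time the same membership test against a list and a set. This is a minimal, self-contained sketch; the URLs and sizes below are made up purely for illustration:

    import timeit

    # Made-up URLs, only to populate the two containers
    urls_list = [f"http://example.com/page/{i}" for i in range(100_000)]
    urls_set = set(urls_list)

    # A URL near the end of the list, so the list scan does the most work
    target = "http://example.com/page/99999"

    print(timeit.timeit(lambda: target in urls_list, number=100))  # linear scan per call
    print(timeit.timeit(lambda: target in urls_set, number=100))   # hash lookup per call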

  • 2021-01-24 23:47

    In general, as you iterate over your URL results you can store them in a dictionary, with the URL as the key and a boolean as the value indicating whether you have seen that URL before. At the end, the keys of this dict are all the unique URLs.

    Also, looking a URL up in a dict takes O(1) time on average when checking whether it has been seen.

    # Store mapping of {URL: bool}
    url_map = {}

    # Iterate over URL results
    for url in URLs:
        if not url_map.get(url, False):
            url_map[url] = True

    # Keys of the dict hold all the unique URLs
    print(url_map.keys())
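
    For example, when run against a small, made-up list with one duplicate, the loop above keeps only the first occurrence; since Python 3.7, dict keys also preserve insertion order, so the unique URLs come back in crawl order:

    URLs = [
        "http://example.com/a",
        "http://example.com/b",
        "http://example.com/a",  # duplicate, skipped by the get() check
    ]

    url_map = {}
    for url in URLs:
        if not url_map.get(url, False):
            url_map[url] = True

    print(list(url_map.keys()))
    # ['http://example.com/a', 'http://example.com/b']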
    