I wrote a spider program in Python that can recursively crawl web pages. I want to avoid downloading the same page twice, so I store the URLs in a list, as follows:

urls = []
You can make urls into a set:

urls = set()

def download(mainPage):  # mainPage is a link
    global urls
    links = getHrefLinks(mainPage)
    for l in links:
        if l not in urls:
            urls.add(l)   # add instead of append
            download(l)   # recurse into the newly found page
Lookups in a set, i.e. x in s, have average-case complexity O(1), which is better than the average case for a list, where the same check is O(n).
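As a rough illustration of that difference (not part of the original answer), the snippet below times membership checks on a list and on a set built from the same 100,000 integers; the exact numbers depend on your machine, but the set lookup should be dramatically faster:

import timeit

items = list(range(100_000))
as_list = items
as_set = set(items)

# Look for a value near the end of the list, the worst case for a linear scan.
list_time = timeit.timeit(lambda: 99_999 in as_list, number=1_000)
set_time = timeit.timeit(lambda: 99_999 in as_set, number=1_000)

print(f"list lookup: {list_time:.4f}s, set lookup: {set_time:.4f}s")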
In general, as you iterate over your URL results you could store them in a dictionary. The key of this dictionary would be the URL, and the value could be a boolean indicating whether you've seen the URL before. At the end, print the keys of this dict and it will contain all the unique URLs.

Doing the lookup via a dict also gives you O(1) average time when checking whether a URL has been seen.
# Store mapping of {URL: bool}
url_map = {}

# Iterate over URL results
for url in URLs:
    if not url_map.get(url, False):
        url_map[url] = True

# Keys of the dict will hold all unique URLs
print(url_map.keys())
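As a side note (my own addition, not part of the answer above): since dicts preserve insertion order in Python 3.7+, the same deduplication can be done in one line with dict.fromkeys. A small sketch with made-up sample URLs, since the real crawl results aren't shown here:

# Hypothetical sample data standing in for the crawled links.
URLs = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",  # duplicate
]

# dict.fromkeys keeps the first occurrence of each key, preserving order.
unique_urls = list(dict.fromkeys(URLs))
print(unique_urls)  # ['https://example.com/a', 'https://example.com/b']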