Multi-threaded web scraper using urlretrieve on a cookie-enabled site


Creating a multi-threaded web scraper the right way is hard. I'm sure you could handle it, but why not use something that has already been done?

I really suggest you check out Scrapy: http://scrapy.org/

It is a very flexible open-source web scraping framework that will handle most of what you need here as well. With Scrapy, running concurrent spiders is a configuration issue, not a programming issue (http://doc.scrapy.org/topics/settings.html#concurrent-requests-per-spider). You also get support for cookies, proxies, HTTP authentication and much more.
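To give a sense of what that looks like, here is a minimal sketch of a Scrapy spider plus the relevant settings, written against a recent Scrapy release; the site URL, spider name and link extraction are placeholders and not anything from the question:

# settings.py -- concurrency and cookies are configuration, not code
CONCURRENT_REQUESTS = 8   # called CONCURRENT_REQUESTS_PER_SPIDER in the older docs linked above
COOKIES_ENABLED = True    # cookie handling is on by default

# reports_spider.py -- a minimal spider; URL and link extraction are placeholders
import scrapy

class ReportSpider(scrapy.Spider):
    name = "reports"
    start_urls = ["http://example.com/reports"]

    def parse(self, response):
        # follow each link on the listing page and save the response body
        for href in response.css("a::attr(href)").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.save_report)

    def save_report(self, response):
        # placeholder: write the downloaded report to a file named after the URL tail
        with open(response.url.split("/")[-1] or "index.html", "wb") as f:
            f.write(response.body)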

For me, it took around 4 hours to rewrite my scraper in Scrapy. So ask yourself: do you really want to solve the threading issue yourself, or would you rather stand on the shoulders of others and focus on the problems of web scraping, not threading?

PS. Are you using mechanize now? Please note this from the mechanize FAQ http://wwwsearch.sourceforge.net/mechanize/faq.html:

"Is it threadsafe?

No. As far as I know, you can use mechanize in threaded code, but it provides no synchronisation: you have to provide that yourself."

If you really want to keep using mechanize, start reading through documentation on how to provide synchronization. (e.g. http://effbot.org/zone/thread-synchronization.htm, http://effbot.org/pyfaq/what-kinds-of-global-value-mutation-are-thread-safe.htm)
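The simplest form of that synchronisation is to guard a single shared Browser with a lock so only one thread drives it at a time. A minimal sketch (the fetch helper and its use are illustrative, not from the original code):

import threading
import mechanize

browser = mechanize.Browser()       # one shared, non-thread-safe Browser
browser_lock = threading.Lock()     # serialises all access to it

def fetch(url):
    # Only one thread may use the Browser at a time; other threads wait here.
    with browser_lock:
        response = browser.open(url)
        return response.read()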

After working for most of the day, it turns out that Mechanize was not the problem, it looks more like a coding error. After extensive tweaking and cursing, I have gotten the code to work properly.

For future Googlers like myself, I am providing the updated code below:

#manager.py [unchanged from original]
def FetchReports(links,Username,Password,VendorID):
    import Fetch
    import multiprocessing

    # SiteBase, DataPath and _SplitLinksArray are defined elsewhere in the original script.
    pool = multiprocessing.Pool(processes=4, initializer=Fetch._ProcessStart, initargs=(SiteBase,DataPath,Username,Password,VendorID,))
    pool.map(Fetch.DownloadJob,_SplitLinksArray(links))
    pool.close()
    pool.join()


#worker.py
import mechanize
from multiprocessing import current_process

def _ProcessStart(_SiteBase,_DataPath,User,Password,VendorID):
    global cookies
    cookies = mechanize.LWPCookieJar()
    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))

    # Login() is defined elsewhere in the original script; it authenticates through
    # the opener so the session cookies end up in the cookie jar.
    Login(User,Password,opener)

    global SiteBase
    SiteBase = _SiteBase

    global DataPath
    DataPath = _DataPath

    # Persist this worker's cookies to its own file so DownloadJob can reload them.
    cookies.save(DataPath+'\\'+current_process().name+'cookies.txt',ignore_discard=True,ignore_expires=True)

def DownloadJob(link):
    # Reload the cookies that _ProcessStart saved for this worker process.
    cj = mechanize.LWPCookieJar()
    cj.revert(filename=DataPath+'\\'+current_process().name+'cookies.txt',ignore_discard=True,ignore_expires=True)
    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))
    mechanize.install_opener(opener)  # urlretrieve below uses the installed global opener

    # filename and data are derived from the link elsewhere in the original script.
    mechanize.urlretrieve(url=mechanize.urljoin(SiteBase, link),filename=DataPath+'\\'+filename,data=data)

Because I am just downloading links from a list, the non-thread-safe nature of mechanize doesn't seem to be a problem [full disclosure: I have run this process exactly three times, so a problem may appear under further testing]. The multiprocessing module and its worker pool do all the heavy lifting. Maintaining cookies in files was important for me because the web server I am downloading from has to give each worker its own session ID, but other people implementing this code may not need it. I did notice that the process seems to "forget" variables between the init call and the run call, so the cookie jar may not make the jump.
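Because each worker writes a cookie file named after current_process().name, those files accumulate in DataPath. A hypothetical cleanup helper (not part of the original code) could remove them once pool.join() has returned:

import glob
import os

def CleanupCookieFiles(DataPath):
    # Remove the per-worker cookie files (e.g. "PoolWorker-1cookies.txt")
    # that _ProcessStart wrote alongside the downloaded reports.
    for path in glob.glob(os.path.join(DataPath, '*cookies.txt')):
        os.remove(path)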

In order to enable cookie sessions in the first code example, add the following code to the DownloadJob function:

cj = mechanize.LWPCookieJar()
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))
mechanize.install_opener(opener)

Then you can retrieve the URL as before:

mechanize.urlretrieve(mechanize.urljoin(SiteBase, link),filename=DataPath+'\\'+filename,data=data)