Multi-threaded web scraper using urlretrieve on a cookie-enabled site


Creating a multi-threaded web scraper the right way is hard. I'm sure you could handle it, but why not use something that has already been done?

I really suggest you check out Scrapy: http://scrapy.org/

It is a very flexible open-source web scraping framework that will handle most of what you need here as well. With Scrapy, running concurrent spiders is a configuration issue, not a programming issue (http://doc.scrapy.org/topics/settings.html#concurrent-requests-per-spider). You also get support for cookies, proxies, HTTP authentication and much more.
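To give a sense of what that looks like, here is a minimal sketch of a Scrapy spider plus the relevant settings, written against a recent Scrapy release; the site URL, spider name and link extraction are placeholders and not anything from the question:

# settings.py -- concurrency and cookies are configuration, not code
CONCURRENT_REQUESTS = 8   # called CONCURRENT_REQUESTS_PER_SPIDER in the older docs linked above
COOKIES_ENABLED = True    # cookie handling is on by default

# reports_spider.py -- a minimal spider; URL and link extraction are placeholders
import scrapy

class ReportSpider(scrapy.Spider):
    name = "reports"
    start_urls = ["http://example.com/reports"]

    def parse(self, response):
        # follow each link on the listing page and save the response body
        for href in response.css("a::attr(href)").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.save_report)

    def save_report(self, response):
        # placeholder: write the downloaded report to a file named after the URL tail
        with open(response.url.split("/")[-1] or "index.html", "wb") as f:
            f.write(response.body)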

For me, it took around 4 hours to rewrite my scraper in Scrapy. So ask yourself: do you really want to solve the threading issue yourself, or would you rather stand on the shoulders of others and focus on the problems of web scraping, not threading?

PS. Are you using mechanize now? Please note this from the mechanize FAQ http://wwwsearch.sourceforge.net/mechanize/faq.html:

"Is it threadsafe?

No. As far as I know, you can use mechanize in threaded code, but it provides no synchronisation: you have to provide that yourself."

If you really want to keep using mechanize, start reading through documentation on how to provide synchronization. (e.g. http://effbot.org/zone/thread-synchronization.htm, http://effbot.org/pyfaq/what-kinds-of-global-value-mutation-are-thread-safe.htm)
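The simplest form of that synchronisation is to guard a single shared Browser with a lock so only one thread drives it at a time. A minimal sketch (the fetch helper and its use are illustrative, not from the original code):

import threading
import mechanize

browser = mechanize.Browser()       # one shared, non-thread-safe Browser
browser_lock = threading.Lock()     # serialises all access to it

def fetch(url):
    # Only one thread may use the Browser at a time; other threads wait here.
    with browser_lock:
        response = browser.open(url)
        return response.read()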

After working for most of the day, it turns out that Mechanize was not the problem, it looks more like a coding error. After extensive tweaking and cursing, I have gotten the code to work properly.

For future Googlers like myself, I am providing the updated code below:

#manager.py [unchanged from original]
def FetchReports(links,Username,Password,VendorID):
    import Fetch
    import multiprocessing

    # SiteBase, DataPath and _SplitLinksArray are defined elsewhere in the original script.
    pool = multiprocessing.Pool(processes=4, initializer=Fetch._ProcessStart, initargs=(SiteBase,DataPath,Username,Password,VendorID,))
    pool.map(Fetch.DownloadJob,_SplitLinksArray(links))
    pool.close()
    pool.join()


#worker.py
import mechanize
from multiprocessing import current_process

def _ProcessStart(_SiteBase,_DataPath,User,Password,VendorID):
    global cookies
    cookies = mechanize.LWPCookieJar()
    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))

    # Login() is defined elsewhere in the original script; it authenticates through
    # the opener so the session cookies end up in the cookie jar.
    Login(User,Password,opener)

    global SiteBase
    SiteBase = _SiteBase

    global DataPath
    DataPath = _DataPath

    # Persist this worker's cookies to its own file so DownloadJob can reload them.
    cookies.save(DataPath+'\\'+current_process().name+'cookies.txt',ignore_discard=True,ignore_expires=True)

def DownloadJob(link):
    # Reload the cookies that _ProcessStart saved for this worker process.
    cj = mechanize.LWPCookieJar()
    cj.revert(filename=DataPath+'\\'+current_process().name+'cookies.txt',ignore_discard=True,ignore_expires=True)
    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))
    mechanize.install_opener(opener)  # urlretrieve below uses the installed global opener

    # filename and data are derived from the link elsewhere in the original script.
    mechanize.urlretrieve(url=mechanize.urljoin(SiteBase, link),filename=DataPath+'\\'+filename,data=data)

Because I am just downloading links from a list, the non-thread-safe nature of mechanize doesn't seem to be a problem [full disclosure: I have run this process exactly three times, so a problem may appear under further testing]. The multiprocessing module and its worker pool do all the heavy lifting. Maintaining cookies in files was important for me because the web server I am downloading from has to give each worker its own session ID, but other people implementing this code may not need it. I did notice that the process seems to "forget" variables between the init call and the run call, so the cookie jar may not make the jump.
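Because each worker writes a cookie file named after current_process().name, those files accumulate in DataPath. A hypothetical cleanup helper (not part of the original code) could remove them once pool.join() has returned:

import glob
import os

def CleanupCookieFiles(DataPath):
    # Remove the per-worker cookie files (e.g. "PoolWorker-1cookies.txt")
    # that _ProcessStart wrote alongside the downloaded reports.
    for path in glob.glob(os.path.join(DataPath, '*cookies.txt')):
        os.remove(path)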

In order to enable cookie sessions in the first code example, add the following code to the DownloadJob function:

cj = mechanize.LWPCookieJar()
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))
mechanize.install_opener(opener)

Then you can retrieve the URL as before:

mechanize.urlretrieve(mechanize.urljoin(SiteBase, link),filename=DataPath+'\\'+filename,data=data)