Troubles using scrapy with javascript __doPostBack method


Question


Trying to automatically grab the search results from a public search, but running into some trouble. The URL is of the form

http://www.website.com/search.aspx?keyword=#&&page=1&sort=Sorting

As I click through the pages after visiting this first page, the URL changes slightly to

http://www.website.com/search.aspx?keyword=#&&sort=Sorting&page=2

The problem is that if I try to visit the second link directly, without first visiting the first one, I am redirected back to the first link. My current attempt at working around this defines a long list of start_urls in Scrapy:

from scrapy.spider import BaseSpider

class websiteSpider(BaseSpider):
    name = "website"
    allowed_domains = ["website.com"]
    baseUrl = "http://www.website.com/search.aspx?keyword=#&&sort=Sorting&page="
    # One start URL per results page
    start_urls = [baseUrl + str(i) for i in range(1, 1000)]

Currently this code simply ends up visiting the first page over and over again. I feel like the fix is probably straightforward, but I don't quite know how to get around it.

UPDATE: I made some progress investigating this and found that the site updates each page by sending a POST request to the previous page using __doPostBack(arg1, arg2). My question now is: how exactly do I mimic this POST request using Scrapy? I know how to make a POST request, but not how to pass it the arguments I want.
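For reference, ASP.NET's __doPostBack(target, argument) copies its two arguments into the hidden __EVENTTARGET and __EVENTARGUMENT inputs and submits the page's form, along with state fields such as __VIEWSTATE. A minimal sketch of mimicking that from a response already in hand, using Scrapy's FormRequest.from_response so the page's other hidden fields are copied over automatically (parse_page and the event target value are placeholders here):

from scrapy.http import FormRequest

def parse_page(self, response):
    # __doPostBack(target, argument) fills two hidden inputs and submits
    # the page's <form>; from_response() copies the remaining hidden
    # fields (__VIEWSTATE, __EVENTVALIDATION, ...) from the response.
    yield FormRequest.from_response(
        response,
        formdata={'__EVENTTARGET': 'ctl00$empcnt$ucResults$pagination',
                  '__EVENTARGUMENT': '2'},  # page number to request
        dont_click=True,  # the JS submit does not include a button click
        callback=self.parse_page)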

SECOND UPDATE: I've been making a lot of progress! I think... I looked through examples and documentation and eventually slapped together this version of what I think should do the trick:

def start_requests(self):
    baseUrl = "http://www.website.com/search.aspx?keyword=#&&sort=Sorting&page="
    target = 'ctl00$empcnt$ucResults$pagination'
    requests = []
    for i in range(1, 5):
        url = baseUrl + str(i)
        argument = str(i+1)
        data = {'__EVENTTARGET': target, '__EVENTARGUMENT': argument}
        currentPage = FormRequest(url, data)
        requests.append(currentPage)
    return requests

The idea is that this treats the POST request just like a form and updates accordingly. However, when I actually try to run this I get the following tracebacks (condensed for brevity):

2013-03-22 04:03:03-0400 [guru] ERROR: Unhandled error on engine.crawl()
dfd.addCallbacks(request.callback or spider.parse, request.errback)
      File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 280, in addCallbacks
        assert callable(callback)
    exceptions.AssertionError: 

2013-03-22 04:03:03-0400 [-] ERROR: Unhandled error in Deferred:
2013-03-22 04:03:03-0400 [-] Unhandled Error
    Traceback (most recent call last):
    Failure: scrapy.exceptions.IgnoreRequest: Skipped (request already seen)

I'm changing the question to better reflect what this post has turned into.

Thoughts?

P.S. When the second error happens, Scrapy is unable to shut down cleanly and I have to send a SIGINT twice to get things to actually wrap up.


Answer 1:


FormRequest doesn't accept formdata as a positional argument in its constructor:

class FormRequest(Request):
    def __init__(self, *args, **kwargs):
        formdata = kwargs.pop('formdata', None)

so you actually have to say formdata=:

requests.append(FormRequest(url, formdata=data))
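
Putting that fix back into the question's start_requests, a minimal corrected sketch (assuming the spider defines a parse method to receive the responses) might look like:

from scrapy.http import FormRequest

def start_requests(self):
    baseUrl = "http://www.website.com/search.aspx?keyword=#&&sort=Sorting&page="
    target = 'ctl00$empcnt$ucResults$pagination'
    for i in range(1, 5):
        data = {'__EVENTTARGET': target, '__EVENTARGUMENT': str(i + 1)}
        # formdata must be passed by keyword: the second positional
        # parameter of Request is callback, which is why the bare dict
        # tripped "assert callable(callback)" above.
        yield FormRequest(baseUrl + str(i), formdata=data,
                          callback=self.parse)

As for the "request already seen" errors: if the # in the URL is literal rather than a stand-in for the real keyword, everything after it is a URL fragment, which is never sent to the server and is stripped by Scrapy's duplicate filter when fingerprinting requests, so all of these page URLs collapse into the same request. That would also explain why the original start_urls version kept fetching the first page.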


Source: https://stackoverflow.com/questions/15560746/troubles-using-scrapy-with-javascript-dopostback-method
