Crawling through pages with PostBack data (JavaScript, Python, Scrapy)

南方客 2020-12-29 13:59

I'm crawling through some directories built with ASP.NET using Scrapy.

The pages to crawl through are encoded as such:

javascript:__doPostBack('c
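For context, a full `__doPostBack` pagination link typically looks like `javascript:__doPostBack('ctl00$MainContent$agentList','Page$2')` (the control name here is illustrative, not taken from the question, which is truncated). Clicking it fills the hidden `__EVENTTARGET` and `__EVENTARGUMENT` inputs and re-submits the page's single form, so a crawler has to reproduce that POST itself. A small sketch, assuming hrefs of that shape, for pulling the two arguments out:

```python
import re

def parse_dopostback(href):
    """Extract (__EVENTTARGET, __EVENTARGUMENT) from a __doPostBack() href."""
    match = re.search(r"__doPostBack\('([^']*)','([^']*)'\)", href)
    if match is None:
        return None
    return match.group(1), match.group(2)

# Hypothetical example href; the real control name comes from the page itself.
href = "javascript:__doPostBack('ctl00$MainContent$agentList','Page$2')"
target, argument = parse_dopostback(href)
print(target)    # ctl00$MainContent$agentList
print(argument)  # Page$2
```

The two extracted values are exactly what gets sent back to the server as the `__EVENTTARGET` and `__EVENTARGUMENT` form fields.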

1 Answer
  • 2020-12-29 14:41

    This kind of pagination is not as trivial as it may seem, and it was an interesting challenge to solve. A few important notes about the solution below:

    • the idea is to follow the pagination page by page, passing the current page number along in the Request.meta dictionary
    • a plain spider (rather than a CrawlSpider) is used, since the pagination requires custom logic
    • it is important to send headers that mimic a real browser
    • it is important to yield FormRequests with dont_filter=True, since we are repeatedly POSTing to the same URL with different parameters

    The code:

    import re

    from scrapy import Spider
    from scrapy.http import FormRequest


    HEADERS = {
        # tell ASP.NET we are making a partial (AJAX) postback
        'X-MicrosoftAjax': 'Delta=true',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36'
    }
    URL = 'http://exitrealty.com/agent_list.aspx?firstName=&lastName=&country=USA&state=NY'


    class ExitRealtySpider(Spider):
        name = "exit_realty"

        allowed_domains = ["exitrealty.com"]
        start_urls = [URL]

        def parse(self, response):
            # collect every input in the form (first page), including the
            # hidden ASP.NET state fields
            self.data = {}
            for form_input in response.css('form#aspnetForm input'):
                name = form_input.xpath('@name').extract_first()
                value = form_input.xpath('@value').extract_first(default='')
                if name:
                    self.data[name] = value

            # simulate clicking the "page 1" pagination link
            self.data['ctl00$MainContent$ScriptManager1'] = 'ctl00$MainContent$UpdatePanel1|ctl00$MainContent$agentList'
            self.data['__EVENTTARGET'] = 'ctl00$MainContent$List'
            self.data['__EVENTARGUMENT'] = 'Page$1'

            return FormRequest(url=URL,
                               method='POST',
                               callback=self.parse_page,
                               formdata=self.data,
                               meta={'page': 1},
                               dont_filter=True,
                               headers=HEADERS)

        def parse_page(self, response):
            current_page = response.meta['page'] + 1

            # parse agents (TODO: yield items instead of printing)
            for agent in response.xpath('//a[@class="regtext"]/text()'):
                print(agent.extract())
            print("------")

            # the AJAX "delta" response is a pipe-delimited payload, so pull
            # the fresh state tokens out with regexes (response.text is a str)
            data = {
                '__EVENTARGUMENT': 'Page$%d' % current_page,
                '__EVENTVALIDATION': re.search(r"__EVENTVALIDATION\|(.*?)\|", response.text).group(1),
                '__VIEWSTATE': re.search(r"__VIEWSTATE\|(.*?)\|", response.text).group(1),
                '__ASYNCPOST': 'true',
                '__EVENTTARGET': 'ctl00$MainContent$agentList',
                'ctl00$MainContent$ScriptManager1': 'ctl00$MainContent$UpdatePanel1|ctl00$MainContent$agentList',
                '': ''
            }

            return FormRequest(url=URL,
                               method='POST',
                               formdata=data,
                               callback=self.parse_page,
                               meta={'page': current_page},
                               dont_filter=True,
                               headers=HEADERS)
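    The `__VIEWSTATE` / `__EVENTVALIDATION` regexes work because an ASP.NET partial postback (triggered by the `X-MicrosoftAjax: Delta=true` header) returns a pipe-delimited payload rather than a full HTML page. A minimal sketch using a synthetic response body (the token values below are made up):

```python
import re

# Synthetic fragment of a MicrosoftAjax "delta" response; real responses are
# pipe-delimited records, with hidden-field records carrying the state tokens.
body = ("1234|updatePanel|ctl00_MainContent_UpdatePanel1|<div>agents</div>|"
        "8|hiddenField|__VIEWSTATE|/wEPDwUKLTk2Nzg5|"
        "8|hiddenField|__EVENTVALIDATION|/wEWAgKc5uL|")

# Non-greedy capture up to the next pipe grabs just the token value.
viewstate = re.search(r"__VIEWSTATE\|(.*?)\|", body).group(1)
eventvalidation = re.search(r"__EVENTVALIDATION\|(.*?)\|", body).group(1)

print(viewstate)        # /wEPDwUKLTk2Nzg5
print(eventvalidation)  # /wEWAgKc5uL
```

    These two tokens must be echoed back on every subsequent POST, otherwise ASP.NET rejects the request; that is why the spider re-extracts them from each page before requesting the next one.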
    