Crawling through pages with PostBack data (JavaScript, Python, Scrapy)

南方客 2020-12-29 13:59

I'm crawling through some directories built with ASP.NET using Scrapy.

The pages to crawl through are encoded as such:

javascript:__doPostBack('c
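For context, a full `__doPostBack` pagination link typically looks like `javascript:__doPostBack('ctl00$MainContent$agentList','Page$2')` (the control name here is illustrative, not taken from the question, which is truncated). Clicking it fills the hidden `__EVENTTARGET` and `__EVENTARGUMENT` inputs and re-submits the page's single form, so a crawler has to reproduce that POST itself. A small sketch, assuming hrefs of that shape, for pulling the two arguments out:

```python
import re

def parse_dopostback(href):
    """Extract (__EVENTTARGET, __EVENTARGUMENT) from a __doPostBack() href."""
    match = re.search(r"__doPostBack\('([^']*)','([^']*)'\)", href)
    if match is None:
        return None
    return match.group(1), match.group(2)

# Hypothetical example href; the real control name comes from the page itself.
href = "javascript:__doPostBack('ctl00$MainContent$agentList','Page$2')"
target, argument = parse_dopostback(href)
print(target)    # ctl00$MainContent$agentList
print(argument)  # Page$2
```

The two extracted values are exactly what gets sent back to the server as the `__EVENTTARGET` and `__EVENTARGUMENT` form fields.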

1 Answer
  • 2020-12-29 14:41

    This kind of pagination is not as trivial as it may seem, and it was an interesting challenge to solve. A few important notes about the solution below:

    • the idea is to follow the pagination page by page, passing the current page number along in the Request.meta dictionary
    • a plain spider (rather than a CrawlSpider) is used, since the pagination requires custom logic
    • it is important to send headers that mimic a real browser
    • it is important to yield FormRequests with dont_filter=True, since we are repeatedly POSTing to the same URL with different parameters

    The code:

    import re

    from scrapy import Spider
    from scrapy.http import FormRequest


    HEADERS = {
        # tell ASP.NET we are making a partial (AJAX) postback
        'X-MicrosoftAjax': 'Delta=true',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36'
    }
    URL = 'http://exitrealty.com/agent_list.aspx?firstName=&lastName=&country=USA&state=NY'


    class ExitRealtySpider(Spider):
        name = "exit_realty"

        allowed_domains = ["exitrealty.com"]
        start_urls = [URL]

        def parse(self, response):
            # collect every input in the form (first page), including the
            # hidden ASP.NET state fields
            self.data = {}
            for form_input in response.css('form#aspnetForm input'):
                name = form_input.xpath('@name').extract_first()
                value = form_input.xpath('@value').extract_first(default='')
                if name:
                    self.data[name] = value

            # simulate clicking the "page 1" pagination link
            self.data['ctl00$MainContent$ScriptManager1'] = 'ctl00$MainContent$UpdatePanel1|ctl00$MainContent$agentList'
            self.data['__EVENTTARGET'] = 'ctl00$MainContent$List'
            self.data['__EVENTARGUMENT'] = 'Page$1'

            return FormRequest(url=URL,
                               method='POST',
                               callback=self.parse_page,
                               formdata=self.data,
                               meta={'page': 1},
                               dont_filter=True,
                               headers=HEADERS)

        def parse_page(self, response):
            current_page = response.meta['page'] + 1

            # parse agents (TODO: yield items instead of printing)
            for agent in response.xpath('//a[@class="regtext"]/text()'):
                print(agent.extract())
            print("------")

            # the AJAX "delta" response is a pipe-delimited payload, so pull
            # the fresh state tokens out with regexes (response.text is a str)
            data = {
                '__EVENTARGUMENT': 'Page$%d' % current_page,
                '__EVENTVALIDATION': re.search(r"__EVENTVALIDATION\|(.*?)\|", response.text).group(1),
                '__VIEWSTATE': re.search(r"__VIEWSTATE\|(.*?)\|", response.text).group(1),
                '__ASYNCPOST': 'true',
                '__EVENTTARGET': 'ctl00$MainContent$agentList',
                'ctl00$MainContent$ScriptManager1': 'ctl00$MainContent$UpdatePanel1|ctl00$MainContent$agentList',
                '': ''
            }

            return FormRequest(url=URL,
                               method='POST',
                               formdata=data,
                               callback=self.parse_page,
                               meta={'page': current_page},
                               dont_filter=True,
                               headers=HEADERS)
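    The `__VIEWSTATE` / `__EVENTVALIDATION` regexes work because an ASP.NET partial postback (triggered by the `X-MicrosoftAjax: Delta=true` header) returns a pipe-delimited payload rather than a full HTML page. A minimal sketch using a synthetic response body (the token values below are made up):

```python
import re

# Synthetic fragment of a MicrosoftAjax "delta" response; real responses are
# pipe-delimited records, with hidden-field records carrying the state tokens.
body = ("1234|updatePanel|ctl00_MainContent_UpdatePanel1|<div>agents</div>|"
        "8|hiddenField|__VIEWSTATE|/wEPDwUKLTk2Nzg5|"
        "8|hiddenField|__EVENTVALIDATION|/wEWAgKc5uL|")

# Non-greedy capture up to the next pipe grabs just the token value.
viewstate = re.search(r"__VIEWSTATE\|(.*?)\|", body).group(1)
eventvalidation = re.search(r"__EVENTVALIDATION\|(.*?)\|", body).group(1)

print(viewstate)        # /wEPDwUKLTk2Nzg5
print(eventvalidation)  # /wEWAgKc5uL
```

    These two tokens must be echoed back on every subsequent POST, otherwise ASP.NET rejects the request; that is why the spider re-extracts them from each page before requesting the next one.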
    