How to simulate an XHR request using Scrapy when trying to crawl data from an AJAX-based website?


Question


I am new to crawling web pages with Scrapy and, unfortunately, chose a dynamic one to start with...

I've successfully crawled part of it (120 links), thanks to someone helping me here, but not all the links on the target website.

After doing some research, I know that crawling an AJAX page boils down to a few simple steps:

• open the browser developer tools, Network tab

• go to the target site

• click the submit button and see what XHR request goes to the server

• simulate this XHR request in your spider

The last step sounds obscure to me, though: how do I simulate an XHR request?

I've seen people using 'headers' or 'formdata' and other parameters to simulate one, but I can't figure out what that means.
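From what I can tell, 'formdata' is supposed to be the POST body of the XHR (the key/value pairs under "Form Data" in the browser's network tab), and 'headers' are the HTTP headers the browser sent along with it. Here is a minimal sketch of what I understand simulating one to look like in Scrapy; make_xhr_request is just a helper name I made up, the formdata is trimmed to the paging fields, and the X-Requested-With header is a guess on my part, since I don't know which headers this server actually checks:

from scrapy.http import FormRequest

def make_xhr_request(callback):
    """Build a request that mimics the browser's XHR POST."""
    return FormRequest(
        url="https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0",
        method="POST",
        # 'formdata' becomes the POST body, mirroring the "Form Data"
        # section of the XHR captured in the network tab.
        formdata={'start': '0', 'num': '60', 'xhr': '1'},
        # 'headers' replicate what the browser sent; X-Requested-With is how
        # many servers recognize an AJAX request (an assumption on my part).
        headers={'X-Requested-With': 'XMLHttpRequest'},
        callback=callback,
    )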

Here is part of my code:

import re

import scrapy
from scrapy.http import FormRequest

class googleAppSpider(scrapy.Spider):
    name = "googleApp"
    allowed_domains = ['play.google.com']
    start_urls = ['https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0']

    def start_request(self, response):
        for i in range(0, 10):
            yield FormRequest(url="https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0", method="POST", formdata={'start': str(i+60), 'num': '60', 'numChildren': '0', 'ipf': '1', 'xhr': '1', 'token': 'm1VdlomIcpZYfkJT5dktVuqLw2k:1455483261011'}, callback=self.parse)

    def parse(self, response):
        links = response.xpath("//a/@href").extract()
        crawledLinks = []
        LinkPattern = re.compile(r"^/store/apps/details\?id=.")
        for link in links:
            if LinkPattern.match(link) and link not in crawledLinks:
                crawledLinks.append("http://play.google.com" + link + "#release")
        for link in crawledLinks:
            yield scrapy.Request(link, callback=self.parse_every_app)

    def parse_every_app(self, response):
        ...  # parsing of each app's detail page omitted here
The start_request method doesn't seem to play any role here: if I delete it, the spider still crawls the same number of links.

I've worked on this problem for a week... I'd highly appreciate it if you could help me out...


Answer 1:


Try this:

from scrapy import Spider
from scrapy.http import FormRequest

# googleAppItem is the project's item class (a minimal sketch follows below)

class googleAppSpider(Spider):
    name = "googleApp"
    allowed_domains = ['play.google.com']
    start_urls = ['https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0']

    def parse(self, response):
        # Issue one XHR-style POST per page of 60 results.
        for i in range(0, 10):
            yield FormRequest(url="https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0", method="POST", formdata={'start': str(i*60), 'num': '60', 'numChildren': '0', 'ipf': '1', 'xhr': '1', 'token': 'm1VdlomIcpZYfkJT5dktVuqLw2k:1455483261011'}, callback=self.data_parse)

    def data_parse(self, response):
        seen = set()  # skip duplicate links
        links = response.xpath("//a/@href").re(r'/store/apps/details.*')
        for l in links:
            if l not in seen:
                seen.add(l)
                item = googleAppItem()  # one fresh item per link
                item['url'] = l
                yield item
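This assumes a googleAppItem item class, which the answer doesn't show. A minimal sketch (the field name is taken from the usage above) would live in the project's items.py:

import scrapy

class googleAppItem(scrapy.Item):
    # Relative URL of one app's detail page, e.g. /store/apps/details?id=...
    url = scrapy.Field()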

Run the spider with scrapy crawl googleApp -o links.csv or scrapy crawl googleApp -o links.json and you'll get all the links in a CSV or JSON file. To increase the number of pages crawled, widen the range of the for loop.
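As for why the original start_request never ran: Scrapy's hook is spelled start_requests (plural) and takes no response argument, so a method named start_request(self, response) is never called and the spider silently falls back to start_urls. A minimal sketch of the correct hook, in case you'd rather keep the request generation out of parse (the trimmed formdata here is illustrative; the server may still require the token and the other fields shown in the question):

    def start_requests(self):
        # Called once by Scrapy at startup; note the plural name and the
        # absence of a 'response' parameter.
        for i in range(0, 10):
            yield FormRequest(
                url="https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0",
                method="POST",
                formdata={'start': str(i * 60), 'num': '60', 'xhr': '1'},
                callback=self.data_parse,
            )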



Source: https://stackoverflow.com/questions/35472329/how-to-simulate-xhr-request-using-scrapy-when-trying-to-crawl-data-from-an-ajax
