Question
I am new to crawling web pages with Scrapy and unfortunately chose a dynamic one to start with...
I've successfully crawled part of the links (120), thanks to someone helping me here, but not all the links on the target website.
After doing some research, I know that crawling an AJAX page comes down to a few simple steps:
•open your browser's developer tools, Network tab
•go to the target site
•click the submit button and see what XHR request goes to the server
•simulate this XHR request in your spider
The last step still sounds obscure to me, though: how do you simulate an XHR request?
I've seen people use 'headers', 'formdata', and other parameters to simulate it, but I can't figure out what they mean.
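Conceptually, 'formdata' and 'headers' are just the pieces of the HTTP request that DevTools shows you: the 'formdata' dict is urlencoded into the POST body (the "Form Data" panel), and 'headers' become the request headers. A minimal sketch in plain Python of what that encoding looks like, using a few of the Play Store parameters from the question:

```python
# "Simulating an XHR request" at the HTTP level: the formdata dict you
# pass to Scrapy's FormRequest is simply urlencoded into the POST body,
# exactly as DevTools displays it under "Form Data".
from urllib.parse import urlencode

form_data = {
    'start': '60',  # offset of the first result to return
    'num': '60',    # number of results per page
    'xhr': '1',     # flags the request as an AJAX call to the server
}

# This is the raw body the browser (and FormRequest) would send:
body = urlencode(form_data)
print(body)  # start=60&num=60&xhr=1
```

So copying the key/value pairs from the DevTools "Form Data" panel into the formdata argument reproduces the same POST body the browser sent.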
Here is part of my code:
import re
import scrapy
from scrapy.http import FormRequest

class googleAppSpider(scrapy.Spider):
    name = "googleApp"
    allowed_domains = ['play.google.com']
    start_urls = ['https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0']

    def start_request(self, response):
        for i in range(0, 10):
            yield FormRequest(url="https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0",
                              method="POST",
                              formdata={'start': str(i+60), 'num': '60', 'numChildren': '0', 'ipf': '1',
                                        'xhr': '1', 'token': 'm1VdlomIcpZYfkJT5dktVuqLw2k:1455483261011'},
                              callback=self.parse)

    def parse(self, response):
        links = response.xpath("//a/@href").extract()
        crawledLinks = []
        LinkPattern = re.compile(r"^/store/apps/details\?id=.")
        for link in links:
            if LinkPattern.match(link) and link not in crawledLinks:
                crawledLinks.append("http://play.google.com" + link + "#release")
        for link in crawledLinks:
            yield scrapy.Request(link, callback=self.parse_every_app)

    def parse_every_app(self, response):
The start_request method doesn't seem to play any role here. If I delete it, the spider still crawls the same number of links.
I've worked on this problem for a week... I'd highly appreciate it if you could help me out...
Answer 1:
Try this. Your start_request is never called: Scrapy only invokes a method named start_requests (plural, taking no response argument), so here the pagination loop is moved into parse, which runs for each URL in start_urls:
import scrapy
from scrapy import Spider
from scrapy.http import FormRequest

class googleAppSpider(Spider):
    name = "googleApp"
    allowed_domains = ['play.google.com']
    start_urls = ['https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0']

    def parse(self, response):
        for i in range(0, 10):
            yield FormRequest(url="https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0",
                              method="POST",
                              formdata={'start': str(i*60), 'num': '60', 'numChildren': '0', 'ipf': '1',
                                        'xhr': '1', 'token': 'm1VdlomIcpZYfkJT5dktVuqLw2k:1455483261011'},
                              callback=self.data_parse)

    def data_parse(self, response):
        item = googleAppItem()  # your item class with a 'url' field
        seen = {}
        links = response.xpath("//a/@href").re(r'/store/apps/details.*')
        for l in links:
            if l not in seen:
                seen[l] = True
                item['url'] = l
                yield item
Run the spider with scrapy crawl googleApp -o links.csv
or scrapy crawl googleApp -o links.json
and you'll get all the links in a CSV or JSON file. To increase the number of pages crawled, widen the range of the for loop.
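To see concretely what widening the range does (a sketch, not part of the original answer): each POST asks for 60 results starting at offset i * 60, so range(0, 10) covers the first 600 apps, range(0, 20) the first 1200, and so on.

```python
# Sketch of the offsets the answer's loop sends as the 'start' form field.
# per_page=60 matches the 'num' parameter in the FormRequest above.
def page_offsets(pages, per_page=60):
    """Return the 'start' value sent with each paginated POST request."""
    return [i * per_page for i in range(pages)]

offsets = page_offsets(10)
print(offsets[0], offsets[-1], len(offsets))  # 0 540 10
```

With pages=10 the last request starts at offset 540 and returns results 540-599, matching the 600-result coverage described above.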
Source: https://stackoverflow.com/questions/35472329/how-to-simulate-xhr-request-using-scrapy-when-trying-to-crawl-data-from-an-ajax