The aim is to program a CrawlSpider able to:
1) Retrieve the URLs of the links in the table on this page: http://cordis.europa.eu/fp7/security/projects_en
Yes, it is possible to retrieve those URLs automatically, but first you have to figure out the URL from which AJAX loads the content. Here's a short walkthrough.
1. Do your research
In Chrome DevTools, open the Network tab and filter by XHR requests; each request gets an 'Initiator' column. There you can see the JavaScript files containing the code responsible for generating the requests, and DevTools points at the exact lines from which each request is fired.
In your case the most important piece of code is in the file jquery-projects.js at line 415; it looks something like this:
$.ajax({
    async: true,
    type: 'GET',
    url: URL,
As you can see, there is a URL variable here. You need to find where it is defined, just a couple of lines above:
var URL = '/projects/index.cfm?fuseaction=app.csa'; // production

switch(type) {
    ...
    case 'doc':
        URL += '&action=read&xslt-template=projects/xsl/projectdet_' + I18n.locale + '.xslt&rcn=' + me.ref;
        break;
}
So the URL is generated by taking the base URL and appending a query string that starts with &action and interpolates two variables, I18n.locale and me.ref. Keep in mind that this URL is relative, so you will also need the site root.
I18n.locale turns out to be just the string "en", but where does me.ref come from?
Again, Ctrl+F in the Sources tab of DevTools finds this line of jQuery:
// record reference
me.ref = $("#PrjSrch>input[name='REF']").val();
It turns out each project page contains a hidden form, and every time a request is generated it takes the value from that REF input, as the quick sketch below demonstrates.
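Putting the two pieces together outside of Scrapy first, here is a minimal standalone sketch that extracts the reference and builds the absolute AJAX URL. It assumes a saved copy of one project page in project_page.html (the filename is just for illustration), and uses the same Selector class the spider below relies on:

from urlparse import urljoin
from scrapy.selector import Selector

html = open('project_page.html').read()  # a saved copy of one project page
sel = Selector(text=html)
# the XPath equivalent of the jQuery selector above
ref = "".join(sel.xpath('//form[@id="PrjSrch"]//input[@name="REF"]/@value').extract())
# rebuild the relative URL exactly as jquery-projects.js does, with locale 'en'
relative_url = '/projects/index.cfm?fuseaction=app.csa&action=read&xslt-template=projects/xsl/projectdet_en.xslt&rcn=' + ref
# the URL is relative, so join it with the site root
print urljoin('http://cordis.europa.eu', relative_url)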
Now you only need to apply this knowledge in your Scrapy project.
2. Use your knowledge in a Scrapy spider
At this point you know what to do: start from the URL listing all projects, extract all the project links, make a request for each of them, then extract the AJAX reference from each response, and finally generate requests for the AJAX URLs built from those references.
from scrapy.selector import Selector
from scrapy.spider import Spider
from scrapy.http import Request
from eu.items import EuItem
from urlparse import urljoin


class CordisSpider(Spider):
    name = 'cordis'
    start_urls = ['http://cordis.europa.eu/fp7/security/projects_en.html']
    base_url = "http://cordis.europa.eu/projects/"
    # template string for the AJAX request, based on what we learned
    # while investigating the webpage
    base_ajax_url = "http://cordis.europa.eu/projects/index.cfm?fuseaction=app.csa&action=read&xslt-template=projects/xsl/projectdet_en.xslt&rcn=%s"

    def parse(self, response):
        """
        Extract the project links from start_urls, generate a GET
        request for each of them, and assign self.get_ajax_content
        to handle the responses.
        """
        hxs = Selector(response)
        links = hxs.xpath("//ul/li/span/a/@href").extract()
        for link in links:
            link = urljoin(self.base_url, link)
            yield Request(url=link, callback=self.get_ajax_content)

    def get_ajax_content(self, response):
        """
        Extract the AJAX reference and make a GET request for the
        desired content, assigning a callback to handle the response.
        """
        hxs = Selector(response)
        # the XPath analogue of the jQuery line we saw earlier
        ajax_ref = hxs.xpath('//form[@id="PrjSrch"]//input[@name="REF"]/@value').extract()
        ajax_ref = "".join(ajax_ref)
        ajax_url = self.base_ajax_url % (ajax_ref,)
        yield Request(url=ajax_url, callback=self.parse_items)

    def parse_items(self, response):
        """
        The response here contains the content that is normally
        loaded asynchronously with AJAX.
        """
        hxs = Selector(response)
        # do your processing here
        title = hxs.xpath("//div[@class='projttl']//text()").extract()
        i = EuItem()
        i["title"] = title
        return i
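For completeness, here is a minimal items.py sketch for the EuItem imported above, assuming the project module is named eu as in the import; it defines only the title field the spider populates:

from scrapy.item import Item, Field

class EuItem(Item):
    # the only field the spider fills in; add more fields as needed
    title = Field()

You can then run the spider and export the scraped titles with, for example: scrapy crawl cordis -o projects.json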