The aim is to program a CrawlSpider able to:
1) Retrieve the URLs of the links in the table on this page: http://cordis.europa.eu/fp7/security/projects_en
Yes, it is possible to retrieve those URLs automatically, but first you have to figure out the URL from which AJAX loads the content. Here's a short walkthrough.
1. Do your research
In Chrome DevTools, open the Network tab and filter by XHR requests; each request gets an 'Initiator' column. There you can see the JavaScript files containing the code responsible for generating the requests, and DevTools points at the exact lines from which each request is fired.
In your case the most important piece of code is in the file jquery-projects.js at line 415; it looks something like this:
$.ajax({
    async: true,
    type: 'GET',
    url: URL,
As you can see, there is a URL variable here. You need to find where it is defined, just a couple of lines above:
var URL = '/projects/index.cfm?fuseaction=app.csa'; // production

switch(type) {
    ...
    case 'doc':
        URL += '&action=read&xslt-template=projects/xsl/projectdet_' + I18n.locale + '.xslt&rcn=' + me.ref;
        break;
}
So the URL is generated by taking the base URL and appending a query string that starts with &action and interpolates two variables, I18n.locale and me.ref. Keep in mind that this URL is relative, so you will also need the site root.
I18n.locale turns out to be just the string "en", but where does me.ref come from?
Again, Ctrl+F in the Sources tab of DevTools finds this line of jQuery:
// record reference
me.ref = $("#PrjSrch>input[name='REF']").val();
It turns out each project page contains a hidden form, and every time a request is generated it takes the value from that REF input, as the quick sketch below demonstrates.
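Putting the two pieces together outside of Scrapy first, here is a minimal standalone sketch that extracts the reference and builds the absolute AJAX URL. It assumes a saved copy of one project page in project_page.html (the filename is just for illustration), and uses the same Selector class the spider below relies on:

from urlparse import urljoin
from scrapy.selector import Selector

html = open('project_page.html').read()  # a saved copy of one project page
sel = Selector(text=html)
# the XPath equivalent of the jQuery selector above
ref = "".join(sel.xpath('//form[@id="PrjSrch"]//input[@name="REF"]/@value').extract())
# rebuild the relative URL exactly as jquery-projects.js does, with locale 'en'
relative_url = '/projects/index.cfm?fuseaction=app.csa&action=read&xslt-template=projects/xsl/projectdet_en.xslt&rcn=' + ref
# the URL is relative, so join it with the site root
print urljoin('http://cordis.europa.eu', relative_url)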
Now you only need to apply this knowledge in your Scrapy project.
2. Use your knowledge in a Scrapy spider
At this point you know what to do: start from the URL listing all projects, extract all the project links, make a request for each of them, then extract the AJAX reference from each response, and finally generate requests for the AJAX URLs built from those references.
from scrapy.selector import Selector
from scrapy.spider import Spider
from scrapy.http import Request
from eu.items import EuItem
from urlparse import urljoin


class CordisSpider(Spider):
    name = 'cordis'
    start_urls = ['http://cordis.europa.eu/fp7/security/projects_en.html']
    base_url = "http://cordis.europa.eu/projects/"
    # template string for the AJAX request, based on what we learned
    # while investigating the webpage
    base_ajax_url = "http://cordis.europa.eu/projects/index.cfm?fuseaction=app.csa&action=read&xslt-template=projects/xsl/projectdet_en.xslt&rcn=%s"

    def parse(self, response):
        """
        Extract the project links from start_urls, generate a GET
        request for each of them, and assign self.get_ajax_content
        to handle the responses.
        """
        hxs = Selector(response)
        links = hxs.xpath("//ul/li/span/a/@href").extract()
        for link in links:
            link = urljoin(self.base_url, link)
            yield Request(url=link, callback=self.get_ajax_content)

    def get_ajax_content(self, response):
        """
        Extract the AJAX reference and make a GET request for the
        desired content, assigning a callback to handle the response.
        """
        hxs = Selector(response)
        # the XPath analogue of the jQuery line we saw earlier
        ajax_ref = hxs.xpath('//form[@id="PrjSrch"]//input[@name="REF"]/@value').extract()
        ajax_ref = "".join(ajax_ref)
        ajax_url = self.base_ajax_url % (ajax_ref,)
        yield Request(url=ajax_url, callback=self.parse_items)

    def parse_items(self, response):
        """
        The response here contains the content that is normally
        loaded asynchronously with AJAX.
        """
        hxs = Selector(response)
        # do your processing here
        title = hxs.xpath("//div[@class='projttl']//text()").extract()
        i = EuItem()
        i["title"] = title
        return i
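For completeness, here is a minimal items.py sketch for the EuItem imported above, assuming the project module is named eu as in the import; it defines only the title field the spider populates:

from scrapy.item import Item, Field

class EuItem(Item):
    # the only field the spider fills in; add more fields as needed
    title = Field()

You can then run the spider and export the scraped titles with, for example: scrapy crawl cordis -o projects.json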