How to automatically retrieve the URLs of AJAX calls?

野趣味 2021-01-16 10:08

The aim is to program a CrawlSpider able to:

1) Retrieve the URLs of the links in the table on this page: http://cordis.europa.eu/fp7/security/projects_en

1 Answer
  • 2021-01-16 10:43

Yes, it is possible to automatically retrieve those URLs, but you have to figure out the URL from which the AJAX call loads its content. Here's a simple tutorial.

    1. Do your research

In Chrome DevTools, open the Network tab and filter by XHR requests; each request has an 'Initiator' field. On the right you can see the JavaScript files containing the code responsible for generating the requests, and the console shows the exact lines from which each request is made.


In your case the most important piece of code is in the file jquery-projects.js, line 415. The line says something like this:

        $.ajax({
            async:      true,
            type:       'GET',
            url:        URL,
    

As you can see, there is a URL variable here. You need to find where it is defined, just a couple of lines above:

        var URL = '/projects/index.cfm?fuseaction=app.csa'; // production
    
        switch(type) {
            ...
            case 'doc':
                URL += '&action=read&xslt-template=projects/xsl/projectdet_' + I18n.locale + '.xslt&rcn=' + me.ref;
                break;
        }
    

So the URL is generated by concatenating the base URL, a query string starting with &action, and the two variables I18n.locale and me.ref. Keep in mind that this URL is relative, so you also need the URL root.

I18n.locale turns out to be just the string "_en". But where does me.ref come from?

Again, use Ctrl+F in the Sources tab of DevTools, and you find this line of jQuery:

        // record reference
        me.ref = $("#PrjSrch>input[name='REF']").val();
    

It turns out there is a hidden form on each project page, and each time a request is generated it takes the value from this REF field.
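Extracting that hidden value outside the browser is straightforward. Here is a minimal sketch using only the standard library; the HTML snippet is an invented stand-in for a real project page:

```python
from html.parser import HTMLParser

class RefExtractor(HTMLParser):
    """Collect the value of <input name="REF"> inside the #PrjSrch form."""
    def __init__(self):
        super().__init__()
        self.in_form = False
        self.ref = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("id") == "PrjSrch":
            self.in_form = True
        elif tag == "input" and self.in_form and attrs.get("name") == "REF":
            self.ref = attrs.get("value")

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False

# hypothetical fragment of a project page
html = '<form id="PrjSrch"><input type="hidden" name="REF" value="110000"></form>'
parser = RefExtractor()
parser.feed(html)
print(parser.ref)
```

In the Scrapy spider below the same extraction is done more concisely with an XPath expression.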

Now you only need to apply this knowledge to your Scrapy project.

2. Use your knowledge in a Scrapy spider.

At this point you know what you have to do: start with the start URL for all projects, collect all the project links, make a request for each link, extract the AJAX URL from the content received in each response, and then generate requests for those AJAX URLs.

    from urllib.parse import urljoin

    from scrapy.spiders import Spider
    from scrapy.http import Request

    from eu.items import EuItem


    class CordisSpider(Spider):
        name = 'cordis'
        start_urls = ['http://cordis.europa.eu/fp7/security/projects_en.html']
        base_url = "http://cordis.europa.eu/projects/"
        # template string for the AJAX request, based on what we learned
        # from investigating the web page
        base_ajax_url = ("http://cordis.europa.eu/projects/index.cfm?"
                         "fuseaction=app.csa&action=read&"
                         "xslt-template=projects/xsl/projectdet_en.xslt&rcn=%s")

        def parse(self, response):
            """
            Extract project links from the start URL, generate a GET request
            for each, and assign self.get_ajax_content to handle the response.
            """
            links = response.xpath("//ul/li/span/a/@href").extract()
            for link in links:
                link = urljoin(self.base_url, link)
                yield Request(url=link, callback=self.get_ajax_content)

        def get_ajax_content(self, response):
            """
            Extract the AJAX reference and make a GET request for the
            desired content, with a callback to handle the response.
            """
            # XPath analogue of the jQuery line we saw above
            ajax_ref = response.xpath(
                '//form[@id="PrjSrch"]//input[@name="REF"]/@value').extract()
            ajax_ref = "".join(ajax_ref)
            ajax_url = self.base_ajax_url % (ajax_ref,)
            yield Request(url=ajax_url, callback=self.parse_items)

        def parse_items(self, response):
            """
            The response here contains the content that is normally
            loaded asynchronously with AJAX.
            """
            # you can do your processing here
            title = response.xpath("//div[@class='projttl']//text()").extract()
            i = EuItem()
            i["title"] = title
            return i
    