Scraping Javascript generated data

Question

I'm working on a project with the World Bank analyzing their procurement processes.

The WB maintains a website for each of its projects, containing links and data for the associated contracts issued (example). Contract-related data is available under the procurement tab.

I'd like to be able to pull a project's contract information from this site, but the links and associated data are generated using embedded JavaScript, and the URLs of the pages displaying contract awards and other data don't seem to follow a discernible schema (example).

Is there any way I can scrape the browser-rendered data in the first example through R?

Answer 1:

The main page calls a JavaScript function:

javascript:callTabContent('p','P090644','','en','procurement','procurementId');

The key argument here is the project ID, P090644. Together with the required language code, en, it is passed as a parameter to a form at http://www.worldbank.org/p2e/procurement.html.

This form call can be replicated with the URL http://www.worldbank.org/p2e/procurement.html?lang=en&projId=P090644.

Code to extract the relevant project description URLs follows:

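library(XML)

# Build the procurement page URL for a given project ID
projID <- "P090644"
projDetails <- paste0("http://www.worldbank.org/p2e/procurement.html?lang=en&projId=", projID)

# Parse the returned HTML and collect every href inside the contract awards table
pdData <- htmlParse(projDetails)
pdDescribtions <- xpathSApply(pdData, '//*/table[@id="contractawards"]//*/@href')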

# > pdDescribtions
#                                                                 href
# "http://search.worldbank.org/wcontractawards/procdetails/OP00005718"
#                                                                 href
# "http://search.worldbank.org/wcontractawards/procdetails/OP00005702"
#                                                                 href
# "http://search.worldbank.org/wcontractawards/procdetails/OP00005709"
#                                                                 href
# "http://search.worldbank.org/wcontractawards/procdetails/OP00005715"

Note that Excel links are also provided, which may be of use to you. They may contain the data you intend to scrape from the description links:

library(gdata)  # read.xls() relies on a Perl installation

# Excel exports for procurement notices, contract awards and contract data
procNotice <- paste0("http://search.worldbank.org/wprocnotices/projectdetails/", projID, ".xls")
conAward <- paste0("http://search.worldbank.org/wcontractawards/projectdetails/", projID, ".xls")
conData <- paste0("http://search.worldbank.org/wcontractdata/projectdetails/", projID, ".xls")

# Read each spreadsheet into a data frame
pnData <- read.xls(procNotice)
caData <- read.xls(conAward)
cdData <- read.xls(conData)

UPDATE:

To find out what is being posted, we can examine what happens when the JavaScript function is called. Using Firebug or a similar tool, we can intercept the request header, which starts:

POST /p2e/procurement.html HTTP/1.1
Host: www.worldbank.org

and has parameters:

lang=en
projId=P090644
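The intercepted request can also be replayed from R as a POST instead of being rewritten as a GET URL. The following is a minimal sketch, assuming the httr and XML packages are installed and that the endpoint still accepts the two form fields listed above. The helper names procurementRequest and fetchProcurement are hypothetical and used only for illustration.

```r
# Build the pieces of the POST request seen in the intercepted header.
# procurementRequest() is a hypothetical helper name, not a package function.
procurementRequest <- function(projID, lang = "en") {
  list(
    url  = "http://www.worldbank.org/p2e/procurement.html",
    body = list(lang = lang, projId = projID)
  )
}

# Send the request and parse the returned HTML fragment.
# Assumes httr and XML are installed; nothing is fetched until this is called.
fetchProcurement <- function(projID, lang = "en") {
  req  <- procurementRequest(projID, lang)
  # encode = "form" sends the body as application/x-www-form-urlencoded
  resp <- httr::POST(req$url, body = req$body, encode = "form")
  httr::stop_for_status(resp)
  html <- httr::content(resp, as = "text", encoding = "UTF-8")
  XML::htmlParse(html, asText = TRUE)
}

# Usage (requires network access):
# pdData <- fetchProcurement("P090644")
# XML::xpathSApply(pdData, '//*/table[@id="contractawards"]//*/@href')
```

Keeping the request description separate from the network call makes it easy to check the parameters before anything is sent.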

Alternatively, we can examine the JavaScript at http://siteresources.worldbank.org/cached/extapps/cver116/p2e/js/script.js and look at the function callTabContent:

function callTabContent(tabparam, projIdParam, contextPath, langCd, htmlId, anchorTagId) {
    if (tabparam == 'n' || tabparam == 'h') {
        $.ajax( {
            type : "POST",
            url : contextPath + "/p2e/"+htmlId+".html",
            data : "projId=" + projIdParam + "&lang=" + langCd,
            success : function(msg) {
                if(tabparam=="n"){
                    $("#newsfeed").replaceWith(msg);
                } else{
                    $("#cycle").replaceWith(msg);
                }
                stickNotes();
            }
        });
    } else {
        $.ajax( {
            type : "POST",
            url : contextPath + "/p2e/"+htmlId+".html",
            data : "projId=" + projIdParam + "&lang=" + langCd,
            success : function(msg) {
                $("#tabContent").replaceWith(msg);
                $('#map_container').hide();
                changeAlternateColors();
                $("#tab_menu a").removeClass("selected");
                $('#'+anchorTagId).addClass("selected");                
                stickNotes();
            }
        });
    }
}

Examining the content of the function, we can see that it simply posts the relevant parameters to a form and then updates the webpage.
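Since the function only concatenates its arguments into a URL and a form string, the same request can be derived from any javascript:callTabContent(...) link without running a browser. Below is a rough sketch in base R; parseTabCall and tabRequest are hypothetical helper names, and the host is an assumption taken from the intercepted request header.

```r
# Argument names as declared in the callTabContent() signature.
argNames <- c("tabparam", "projIdParam", "contextPath",
              "langCd", "htmlId", "anchorTagId")

# parseTabCall() is a hypothetical helper: it splits a
# "javascript:callTabContent(...)" href into its six quoted arguments.
parseTabCall <- function(href) {
  # Keep only the text between the parentheses of the call
  inner <- sub("^.*callTabContent\\((.*)\\).*$", "\\1", href)
  # Match each single-quoted argument, including empty ones such as ''
  args <- regmatches(inner, gregexpr("'[^']*'", inner))[[1]]
  stopifnot(length(args) == length(argNames))
  args <- gsub("'", "", args, fixed = TRUE)
  names(args) <- argNames
  args
}

# Rebuild the URL and form data the same way the $.ajax() call does.
# The default host is an assumption based on the intercepted header.
tabRequest <- function(href, host = "http://www.worldbank.org") {
  a <- parseTabCall(href)
  list(
    url  = paste0(host, a[["contextPath"]], "/p2e/", a[["htmlId"]], ".html"),
    data = paste0("projId=", a[["projIdParam"]], "&lang=", a[["langCd"]])
  )
}

href <- "javascript:callTabContent('p','P090644','','en','procurement','procurementId');"
str(tabRequest(href))
```

The same helper should work for other tabs, as long as their links follow the six-argument call shown in the function definition.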

Answer 2:

I am not sure I have understood every detail of your problem, but what I know for sure is that CasperJS works great for JavaScript-generated content.

You can have a look at it here: http://casperjs.org/

It's written in JavaScript and has a bunch of useful functions, all well documented at the link I provided.

I have used it myself lately for a personal project, and it can be set up easily with a few lines of code.

Give it a go! Hope that helps.

Source: https://stackoverflow.com/questions/15330393/scraping-javascript-generated-data

Tags

scrape