Parsing HTML and following a JavaScript link


An academic colleague has asked me to extract information from a website, where I need to link the content of a webpage in a table - not too hard with the contents of a text fil


2 answers

故里飘歌

2021-02-04 21:15

This is somewhat tricky, and not fully integrated in R, but some system()-fiddling will get you started.

1. Download and install PhantomJS: http://code.google.com/p/phantomjs/
2. Check the short script on http://menne-biomed.de/uni/JavaButton.html, which emulates your case. When you click the JavaScript anchor, it redirects to http://cran.at.r-project.org/ via doPostBack(inaccessibleJavascriptVar).
3. Save the following script locally as javabutton.js:


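// Load the page, read the javascript: href, and print the URL it resolves to.
var page = require('webpage').create();
page.open('http://www.menne-biomed.de/uni/JavaButton.html', function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        // This function runs inside the page, so the page's variables are visible.
        var ua = page.evaluate(function () {
            var t = document.getElementById('tk1').href;
            // Capture the argument between the parentheses of doPostBack(...).
            var re = /\((.*)\)/;
            // Evaluate the captured variable name to obtain its value.
            return eval(re.exec(t)[1]);
        });
        console.log(ua); // Outputs http://cran.at.r-project.org/
    }
    phantom.exit();
});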
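
The extraction step can also be tried in isolation under Node.js, without PhantomJS. The following is only a sketch: `extractCallArgument`, `resolveArgument` and the `pageGlobals` object are hypothetical names introduced here for illustration, and `pageGlobals` merely stands in for the global variables of the demo page.

```javascript
// Sketch: pull the argument out of a "javascript:fn(arg)" href and look it up.
// extractCallArgument and resolveArgument are hypothetical helper names.

function extractCallArgument(href) {
    // A regex literal keeps the escaped parentheses literal; group 1 is
    // everything between the first "(" and the last ")".
    var match = /\((.*)\)/.exec(href);
    return match ? match[1] : null;
}

function resolveArgument(varName, scope) {
    // Look the variable name up in an explicit object instead of using eval().
    return Object.prototype.hasOwnProperty.call(scope, varName) ? scope[varName] : null;
}

// Stand-in for the page's global variables (assumed for illustration only).
var pageGlobals = { inaccessibleJavascriptVar: 'http://' + 'cran.at.r-project.org/' };

var href = 'javascript:doPostBack(inaccessibleJavascriptVar)';
var varName = extractCallArgument(href);
console.log(varName);
console.log(resolveArgument(varName, pageGlobals));
```

Inside page.evaluate the same lookup could probably be written as window[varName] instead of eval(), since the demo page assigns the variable as a global; that is an assumption about the target page, not something every site will satisfy.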

With phantomjs on your PATH, run

phantomjs javabutton.js

The link will be printed to the console. Use any method to get it into RCurl, for example by capturing the command's output with system().

Not elegant, but maybe someone will wrap phantomjs into R one day. In case the link to JavaButton.html is ever lost, here is the page as code.

<!DOCTYPE html>
<html>
<head>
<script>
inaccessibleJavascriptVar = 'http://' + 'cran.at.r-project.org/';
function doPostBack(myref)
{
    window.location.href = myref;
    return false;
}
</script>
</head>
<body>
<a id="tk1" href="javascript:doPostBack(inaccessibleJavascriptVar)">Click here</a>
</body>
</html>


醉话见心

2021-02-04 21:16

Have a look at the RCurl package:

http://www.omegahat.org/RCurl/
