Parsing HTML and following a JavaScript link

[愿得一人] 2021-02-04 20:43

I have been asked to extract info by an academic colleague from a website where I need to link the content of a webpage in a table - not too hard with the contents of a text fil

2 answers
  • 2021-02-04 21:15

    This is somewhat tricky, and not fully integrated in R, but some system()-fiddling will get you started.

    • Download and install PhantomJS: http://code.google.com/p/phantomjs/
    • Check the short script on http://menne-biomed.de/uni/JavaButton.html, which emulates your case. When you click the javascript anchor, it redirects to http://cran.at.r-project.org/ via doPostBack(inaccesibleJavascriptVar).
    • Save the following script locally as javabutton.js:

    
    // Load the page, then resolve the link target inside the page context.
    var page = new WebPage();
    page.open('http://www.menne-biomed.de/uni/JavaButton.html', function (status) {
        if (status !== 'success') {
            console.log('Unable to access network');
        } else {
            var ua = page.evaluate(function () {
                var t = document.getElementById('tk1').href;
                // Capture the argument between the parentheses of doPostBack(...).
                var re = /\((.*)\)/;
                // eval resolves the captured variable name to its value in the page.
                return eval(re.exec(t)[1]);
            });
            console.log(ua); // Outputs http://cran.at.r-project.org/
        }
        phantom.exit();
    });

    • With phantomjs on the PATH, run:

      phantomjs javabutton.js

    The link is printed to the console. Use any method to pass it on to RCurl.
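The extraction step inside page.evaluate can be tried on its own, outside PhantomJS. The sketch below assumes the href string is exactly what the sample page produces, and the helper name firstGroup is made up for illustration. It compares a pattern built from a string literal, where the backslashes before the parentheses are dropped, with a regex literal, where they survive.

```javascript
// The href of the anchor, as document.getElementById('tk1').href reports it
// on the sample page (assumed here, not fetched).
var href = 'javascript:doPostBack(inaccesibleJavascriptVar)';

// Pitfall: inside a string literal '\(' collapses to '(', so this pattern
// is really ((.*)) and group 1 captures the whole href.
var unescaped = new RegExp('\((.*)\)');

// A regex literal keeps the backslashes, so the parentheses are matched
// literally and only the argument is captured.
var escaped = /\((.*)\)/;

// Hypothetical helper: return capture group 1, or null if nothing matched.
function firstGroup(re, text) {
    var m = re.exec(text);
    return m ? m[1] : null;
}

console.log(firstGroup(unescaped, href)); // the whole href
console.log(firstGroup(escaped, href));   // inaccesibleJavascriptVar
```

Only the second result is a variable name that eval can resolve to the link target. Passing the whole href to eval would call doPostBack instead of returning the URL.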

    Not elegant, but maybe someone will wrap phantomjs into R one day. In case the link to JavaButton.html goes dead, here is the page source:

    <!DOCTYPE html>
    <html>
    <head>
    <script>
    inaccesibleJavascriptVar = 'http://' + 'cran.at.r-project.org/';
    function doPostBack(myref)
    {
        window.location.href = myref;
        return false;
    }
    </script>
    </head>
    <body>
    <a id="tk1" href="javascript:doPostBack(inaccesibleJavascriptVar)">Click here</a>
    </body>
    </html>
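On a page like this one, where the argument of doPostBack is a single global variable, the value could also be looked up by name instead of going through eval. This is a minimal sketch under that assumption. resolveLinkTarget and fakeWindow are hypothetical names, and fakeWindow stands in for the window object that would be available inside page.evaluate.

```javascript
// Hypothetical helper: pull the identifier out of a javascript: href and
// look it up on an object of globals (window, when run inside the page).
function resolveLinkTarget(href, globals) {
    // Match a single identifier between parentheses, e.g. "(someVar)".
    var match = /\(\s*([A-Za-z_$][\w$]*)\s*\)/.exec(href);
    if (!match) {
        return null;
    }
    var name = match[1];
    return (name in globals) ? globals[name] : null;
}

// Stand-in for the page's window object (assumption for this demo).
var fakeWindow = {
    inaccesibleJavascriptVar: 'http://' + 'cran.at.r-project.org/'
};

var target = resolveLinkTarget(
    'javascript:doPostBack(inaccesibleJavascriptVar)', fakeWindow);
console.log(target); // http://cran.at.r-project.org/
```

This only covers the plain-variable case. If the argument is an expression or a function call, the eval approach is still needed.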
    
  • 2021-02-04 21:16

    Have a look at the RCurl package:

    http://www.omegahat.org/RCurl/
