How to web-scrape secured pages (https links) in R using readHTMLTable from the XML package?


There are good answers on SO about how to use readHTMLTable from the XML package, and I did that with regular http pages; however, I am not able to solve my problem with https pages.

3 Answers
  • 2020-12-01 08:58

    The new package httr provides a wrapper around RCurl to make it easier to scrape all kinds of pages.

    Still, this page gave me a fair amount of trouble. The following works, but no doubt there are easier ways of doing it.

    library("httr")
    library("XML")
    
    # Define certicificate file
    cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
    
    # Read page
    page <- GET(
      "https://ned.nih.gov/", 
      path="search/ViewDetails.aspx", 
      query="NIHID=0010121048",
      config(cainfo = cafile)
    )
    
    # Use regex to extract the desired table
    x <- text_content(page)
    tab <- sub('.*(<table class="grid".*?>.*</table>).*', '\\1', x)
    
    # Parse the table
    readHTMLTable(tab)
    

    The results:

    $ctl00_ContentPlaceHolder_dvPerson
                    V1                                      V2
    1      Legal Name:                    Dr Francis S Collins
    2  Preferred Name:                      Dr Francis Collins
    3          E-mail:                 francis.collins@nih.gov
    4        Location: BG 1 RM 1261 CENTER DRBETHESDA MD 20814
    5       Mail Stop:
    6           Phone:                            301-496-2433
    7             Fax:
    8              IC:             OD (Office of the Director)
    9    Organization:            Office of the Director (HNA)
    10 Classification:                                Employee
    11            TTY:
    

    Get httr here: http://cran.r-project.org/web/packages/httr/index.html


    EDIT: Useful page with FAQ about the RCurl package: http://www.omegahat.org/RCurl/FAQ.html
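
    As an aside, a regex-free route is possible once you have the page: parse the whole response and let readHTMLTable collect every table, then pick the one you want by its id (as the output above shows, readHTMLTable names the list entries after the table ids). A minimal sketch, reusing the `page` object and libraries from the code above:

    # Parse the full document instead of regexing out one table
    doc    <- htmlParse(content(page, as = "text"))
    tables <- readHTMLTable(doc)
    tables[["ctl00_ContentPlaceHolder_dvPerson"]]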

  • 2020-12-01 08:59

    This is the function I use to deal with this problem. It detects whether the URL uses https and goes through httr if it does.

    readHTMLTable2 <- function(url, which = NULL, ...) {
      require(httr)
      require(XML)
      # grepl avoids the stringr dependency of str_detect
      if (grepl("^https", url)) {
        page <- GET(url, user_agent("httr-soccer-ranking"))
        # content(page, as = "text") replaces the old text_content()
        doc <- htmlParse(content(page, as = "text"))
        if (is.null(which)) {
          tmp <- readHTMLTable(doc, ...)
        } else {
          tableNodes <- getNodeSet(doc, "//table")
          tab <- tableNodes[[which]]
          tmp <- readHTMLTable(tab, ...)
        }
      } else {
        tmp <- readHTMLTable(url, which = which, ...)
      }
      return(tmp)
    }
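
    A quick usage sketch, with the NED URL from the question; `which = 1` is only an assumption that the table of interest is the first one on the page:

    # Falls through to the httr branch because the URL is https
    tab <- readHTMLTable2(
      "https://ned.nih.gov/search/ViewDetails.aspx?NIHID=0010121048",
      which = 1
    )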
    
  • 2020-12-01 09:04

    Using Andrie's great way to get past the https issue, below is a way to get at the data without readHTMLTable.

    A table in HTML may have an ID. In this case the table has a nice one, and the XPath expression passed to getNodeSet picks it out cleanly.

    library("httr")
    library("XML")

    # Define certificate file
    cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")

    # Read page
    page <- GET(
      "https://ned.nih.gov/",
      path = "search/ViewDetails.aspx",
      query = "NIHID=0010121048",
      config(cainfo = cafile, ssl.verifypeer = FALSE)
    )

    # Parse the response body, then select the table by its ID
    h <- htmlParse(content(page, as = "text"))
    ns <- getNodeSet(h, "//table[@id = 'ctl00_ContentPlaceHolder_dvPerson']")
    ns
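
    The returned node set can be fed straight to readHTMLTable, which accepts individual table nodes; a minimal follow-up sketch:

    # Convert the matched table node to a data frame
    readHTMLTable(ns[[1]])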
    

    I still need to extract the IDs behind the hyperlinks.

    For example, instead of just "Colleen Barros" as manager, I need to get to the ID 0010080638 behind the link:

    Manager: Colleen Barros
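
    One hedged way to get at those IDs: pull the href attributes out of the same table and strip out the NIHID parameter with a regex. This sketch assumes the manager link has the form ViewDetails.aspx?NIHID=0010080638, which is not shown above:

    # Hypothetical: assumes hrefs of the form "...NIHID=<digits>..."
    hrefs <- xpathSApply(h, "//table[@id = 'ctl00_ContentPlaceHolder_dvPerson']//a/@href")
    ids   <- sub(".*NIHID=([0-9]+).*", "\\1", hrefs)
    ids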
