R - form web scraping with rvest


Question


First I'd like to take a moment to thank the SO community. You have helped me many times in the past without my ever needing to create an account.

My current problem involves web scraping with R, which is not my strong point.

I would like to scrape http://www.cbs.dtu.dk/services/SignalP/

what I have tried:

    library(rvest)
    url <- "http://www.cbs.dtu.dk/services/SignalP/"
    seq <- "MTSKTCLVFFFSSLILTNFALAQDRAPHGLAYETPVAFSPSAFDFFHTQPENPDPTFNPCSESGCSPLPVAAKVQGASAKAQESDIVSISTGTRSGIEEHGVVGIIFGLAFAVMM"

    session <- rvest::html_session(url)
    form <- rvest::html_form(session)[[2]]
    form <- rvest::set_values(form, `SEQPASTE` = seq)
    form_res_cbs <- rvest::submit_form(session, form)
    #rvest prints out:
    Submitting with 'trunc'

    rvest::html_text(rvest::html_nodes(form_res_cbs, "head"))
    #output:
    "Configuration error"

    rvest::html_text(rvest::html_nodes(form_res_cbs, "body"))

    #output:
    "Exception:WebfaceConfigErrorPackage:Webface::service : 358Message:Unhandled #parameter 'NULL' in form "

I am unsure what the unhandled parameter is. Is the problem in the submit button? I cannot seem to force:

    form_res_cbs <- rvest::submit_form(session, form, submit = "submit")
    #rvest prints out
    Error: Unknown submission name 'submit'.
    Possible values: trunc

Is the problem that the submit button's name is NULL?

    form[["fields"]][[23]] # the submit field

I tried defining the fake submit button as suggested here: Submit form with no submit button in rvest

with no luck.

I am open to solutions using rvest or RCurl/httr; I would like to avoid using RSelenium.

EDIT: Thanks to hrbrmstr's awesome answer, I was able to build a function for this task. It is available in the ragp package: https://github.com/missuse/ragp


Answer 1:


Well, this is doable. But it's going to require elbow grease.

This part:

library(rvest)
library(httr)
library(tidyverse)

POST(
  url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
  encode = "form",
  body=list(
    `configfile` = "/usr/opt/www/pub/CBS/services/SignalP-4.1/SignalP.cf",
    `SEQPASTE` = "MTSKTCLVFFFSSLILTNFALAQDRAPHGLAYETPVAFSPSAFDFFHTQPENPDPTFNPCSESGCSPLPVAAKVQGASAKAQESDIVSISTGTRSGIEEHGVVGIIFGLAFAVMM",
    `orgtype` = "euk",
    `Dcut-type` = "default",
    `Dcut-noTM` = "0.45",
    `Dcut-TM` = "0.50",
    `graphmode` = "png",
    `format` = "summary",
    `minlen` = "",
    `method` = "best",
    `trunc` = ""
  ),
  verbose()
) -> res

This makes the request you made. I left verbose() in so you can watch what happens. It's missing the "filename" field, but since you pasted the sequence in as a string, it's a good mimic of what you did.

Now, the tricky part is that it uses an intermediary redirect page that gives you a chance to enter an e-mail address for notification when the query is done. It does do a regular (every ~10s or so) check to see if the query is finished and will redirect quickly if so.

That page has the query id which can be extracted via:

content(res, as="parsed") %>% 
  html_nodes("input[name='jobid']") %>% 
  html_attr("value") -> jobid

Now, we can mimic the final request, but I'd add in a Sys.sleep(20) before doing so to ensure the report is done.
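Instead of a fixed sleep, you could poll until the job page stops reporting a waiting state. This is only a sketch: the "queued"/"active" status strings it checks for are assumptions about this particular service's wait page, not documented behavior.

```r
library(httr)

# Poll the webface2 endpoint until the job no longer reports as queued/running.
# The status strings checked below are assumptions about this service's wait page.
wait_for_job <- function(jobid, tries = 30, pause = 10) {
  for (i in seq_len(tries)) {
    res <- GET(
      url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
      query = list(jobid = jobid, wait = "20")
    )
    page <- content(res, as = "text")
    # If the page no longer mentions a waiting state, assume the results are ready
    if (!grepl("queued|active", page, ignore.case = TRUE)) {
      return(res)
    }
    Sys.sleep(pause)
  }
  stop("Job ", jobid, " did not finish after ", tries, " checks")
}
```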

GET(
  url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
  query = list(
    jobid = jobid,
    wait = "20"
  ),
  verbose()
) -> res2

That grabs the final results page:

library(htmltools) # html_print() and HTML() come from htmltools
html_print(HTML(content(res2, as="text")))

You can see images are missing because GET only retrieves the HTML content. You can use functions from rvest/xml2 to parse through the page and scrape out the tables and the URLs that you can then use to get new content.
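As a sketch of that last step, the tables and image links can be pulled out of `res2` like this (the selectors and the base URL used to absolutize relative paths are assumptions about the results page's markup):

```r
library(httr)
library(rvest)
library(xml2)

# Parse the results page held in res2 from the previous step
pg <- read_html(content(res2, as = "text"))

# All tables on the page, as a list of data frames
result_tables <- pg %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)

# Absolute URLs for any images (e.g. the signal peptide plots)
image_urls <- pg %>%
  html_nodes("img") %>%
  html_attr("src") %>%
  url_absolute(base = "http://www.cbs.dtu.dk/")

# Each image could then be fetched with GET() and saved via writeBin(), e.g.:
# writeBin(content(GET(image_urls[1]), as = "raw"), basename(image_urls[1]))
```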

To work all this out, I used Burp Suite to intercept a browser session and then my burrp R package to inspect the results. You can also inspect things visually in Burp Suite and build the requests more manually.



Source: https://stackoverflow.com/questions/46091447/r-form-web-scraping-with-rvest
