rvest Webscraping in R with form inputs

Submitted by 眉间皱痕 on 2021-02-07 10:10:57

Question


I can't get my head around this problem in R, and I would really appreciate any advice.

I am trying to scrape historical bond yield data from https://www.investing.com/rates-bonds/spain-5-year-bond-yield-historical-data for personal use only (of course).

The solution provided in "webscraping data tables and data from a web page" works really well, but it only scrapes the first 24 daily observations.

What I am trying to achieve is to change the date range in order to scrape more historical data. According to the SelectorGadget tool, the XPath of the date-range input is //*[(@id = "widgetFieldDateRange")], i.e. the element's id is widgetFieldDateRange.

I have also tried using the following lines of code to change the date values but without success:

library(rvest)
 
url1 <- "https://www.investing.com/rates-bonds/spain-5-year-bond-yield-historical-data" #Spain 5yr yield

session <- html_session(url1)
pgform <- html_form(session)[[1]]

pgform$fields[[3]]$value <- "01/01/2010 - 09/10/2020"
result <- submit_form(session, pgform)

Question: Any idea how to submit the new date range correctly and retrieve the extended time series?

Thank you very much in advance for your help!

PS: Unfortunately, the URL does not change based on the date range.


Answer 1:


You can perform the POST request directly:

POST https://www.investing.com/instruments/HistoricalDataAjax

You need to scrape a few pieces of information from the page that the request requires:

  • the pair_ids attribute from a div tag
  • the header value from the h2 tag inside the element with class .instrumentHeader
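
As a minimal offline sketch of those two lookups, the snippet below runs the same selectors against a small hand-written HTML fragment. The fragment, the pair ID 12345, and the heading text are invented for illustration and only assume the page structure described above; the real markup may differ.

```r
library(rvest)

# Hypothetical fragment mimicking the structure described above
# (the pair_ids value and the heading text are made up for this example).
page <- read_html('
  <html><body>
    <div class="instrumentHeader"><h2>Example Bond Yield Historical Data</h2></div>
    <div pair_ids="12345"></div>
  </body></html>')

# Attribute selector: any div that carries a pair_ids attribute
pair_ids <- page %>%
    html_nodes("div[pair_ids]") %>%
    html_attr("pair_ids")

# Descendant selector: an h2 inside an element with class instrumentHeader
header <- page %>% html_nodes(".instrumentHeader h2") %>% html_text()

print(pair_ids)
print(header)
```

If either selector returns an empty result on the live page, the page structure has probably changed and the selectors need adjusting.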

The full code:

library(rvest)
library(httr)

startDate <- as.Date("2020-06-01")
endDate <- Sys.Date() #today

userAgent <- "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
mainUrl <- "https://www.investing.com/rates-bonds/spain-5-year-bond-yield-historical-data"

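# Open a session on the main page
s <- html_session(mainUrl)

# Instrument ID, stored in the pair_ids attribute of a div
pair_ids <- s %>% 
    html_nodes("div[pair_ids]") %>%
    html_attr("pair_ids")

# Heading text, sent back to the server as the "header" field
header <- s %>% html_nodes(".instrumentHeader h2") %>% html_text()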

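# Note: rvest:::request_POST is an internal (unexported) rvest function,
# so it may not be available in other rvest versions.
resp <- s %>% rvest:::request_POST(
    "https://www.investing.com/instruments/HistoricalDataAjax",
    add_headers('X-Requested-With'= 'XMLHttpRequest'),
    user_agent(userAgent),
    body = list(
        curr_id = pair_ids,
        header = header[[1]],
        # Dates are sent as month/day/year
        st_date = format(startDate, format="%m/%d/%Y"),
        end_date = format(endDate, format="%m/%d/%Y"),
        interval_sec = "Daily",
        sort_col = "date",
        sort_ord = "DESC",
        action = "historical_data"
    ), 
    encode = "form") %>%
    html_table

# html_table returns a list of tables; the first one holds the data
print(resp[[1]])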

Output:

            Date  Price   Open   High    Low Change %
1   Oct 09, 2020 -0.339 -0.338 -0.333 -0.361    2.42%
2   Oct 08, 2020 -0.331 -0.306 -0.306 -0.338    7.47%
3   Oct 07, 2020 -0.308 -0.323 -0.300 -0.324   -0.65%
4   Oct 06, 2020 -0.310 -0.288 -0.278 -0.319    7.27%
5   Oct 05, 2020 -0.289 -0.323 -0.278 -0.331  -10.39%
6   Oct 03, 2020 -0.322 -0.322 -0.322 -0.322    1.42%
7   Oct 02, 2020 -0.318 -0.311 -0.302 -0.320    5.65%
.....................................................
.....................................................
96  Jun 08, 2020 -0.162 -0.152 -0.133 -0.173   13.29%
97  Jun 05, 2020 -0.143 -0.129 -0.127 -0.154   13.49%
98  Jun 04, 2020 -0.126 -0.089 -0.063 -0.148   38.46%
99  Jun 03, 2020 -0.091 -0.120 -0.087 -0.128  -35.00%
100 Jun 02, 2020 -0.140 -0.148 -0.137 -0.166   14.75%
101 Jun 01, 2020 -0.122 -0.140 -0.101 -0.150  -17.57%
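
The dates and percentage changes in this table are text. If numeric analysis is the goal, a cleanup step along these lines may help. It is only a sketch: it retypes the first three rows of the output above by hand, keeps just three of the columns, and assumes an English locale so that %b matches abbreviations such as Oct.

```r
# First three rows of the output above, retyped by hand for illustration.
yields <- data.frame(
  Date = c("Oct 09, 2020", "Oct 08, 2020", "Oct 07, 2020"),
  Price = c(-0.339, -0.331, -0.308),
  `Change %` = c("2.42%", "7.47%", "-0.65%"),
  check.names = FALSE,       # keep the column name "Change %" as is
  stringsAsFactors = FALSE
)

# "%b %d, %Y" matches strings like "Oct 09, 2020"; %b is locale dependent.
yields$Date <- as.Date(yields$Date, format = "%b %d, %Y")

# Strip the percent sign and convert the remainder to a number.
yields[["Change %"]] <- as.numeric(sub("%", "", yields[["Change %"]], fixed = TRUE))

# Sort oldest first, which is usually more convenient for time series work.
yields <- yields[order(yields$Date), ]
print(yields)
```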

This also works for other pages of the same kind if you replace the value of the mainUrl variable.
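
To reuse the request for other pages or date ranges, the form body can be built by a small helper. This is only a sketch: build_history_body is a hypothetical name, the field names are copied from the code above, and the pair ID and header in the example call are placeholders rather than real values.

```r
# Hypothetical helper: assembles the form fields used in the POST above.
build_history_body <- function(pair_id, header, start_date, end_date) {
  start_date <- as.Date(start_date)
  end_date <- as.Date(end_date)
  stopifnot(start_date <= end_date)
  list(
    curr_id = pair_id,
    header = header,
    # Dates are formatted as month/day/year, as in the code above.
    st_date = format(start_date, "%m/%d/%Y"),
    end_date = format(end_date, "%m/%d/%Y"),
    interval_sec = "Daily",
    sort_col = "date",
    sort_ord = "DESC",
    action = "historical_data"
  )
}

# Placeholder values; on a real page these come from the scraped
# pair_ids attribute and h2 text.
body <- build_history_body("12345", "Example Historical Data",
                           "2010-01-01", "2020-10-09")
str(body)
```

The returned list could then be passed as the body argument of the POST call shown earlier.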



Source: https://stackoverflow.com/questions/64298886/rvest-webscraping-in-r-with-form-inputs
