Question
I can't get my head around this problem in R and I would really appreciate if you could leave a piece of advice for me here.
I am trying to scrape historical bond yield data from https://www.investing.com/rates-bonds/spain-5-year-bond-yield-historical-data for personal use only (of course).
The solution provided in "webscraping data tables and data from a web page" works really well, but it only scrapes the first 24 daily observations.
What I am trying to achieve is to change the date range in order to scrape more historical data.
According to the SelectorGadget tool, the XPath of the date-range input is //*[(@id = "widgetFieldDateRange")], i.e. its id is widgetFieldDateRange.
I have also tried using the following lines of code to change the date values but without success:
library(rvest)
url1 <- "https://www.investing.com/rates-bonds/spain-5-year-bond-yield-historical-data" #Spain 5yr yield
session <- html_session(url1)
pgform <- html_form(session)[[1]]
pgform$fields[[3]]$value <- "01/01/2010 - 09/10/2020"
result <- submit_form(session, pgform)
Question: Any idea how to submit the new date range correctly and retrieve the extended time series?
Thank you very much in advance for your help!
PS: Unfortunately, the URL does not change based on the date range.
Answer 1:
You can perform the POST request directly:
POST https://www.investing.com/instruments/HistoricalDataAjax
The request needs two pieces of information that have to be scraped from the page:
- the pair_ids attribute from a div tag
- the header value from the h2 tag inside the .instrumentHeader class
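The two selectors can be tried offline before touching the live site. The following is only a minimal sketch: the HTML string is a hypothetical, simplified stand-in for the real page, and the id "12345" and the heading text are placeholders, not values taken from investing.com.

```r
library(rvest)

# Hypothetical, simplified markup (assumption): the real page is far larger.
html <- paste0(
  '<div class="instrumentHeader"><h2>Example Bond Yield Historical Data</h2></div>',
  '<div pair_ids="12345"></div>'
)
page <- read_html(html)

# Attribute selector: any <div> that carries a pair_ids attribute.
pair_ids <- page %>% html_nodes("div[pair_ids]") %>% html_attr("pair_ids")

# Descendant selector: <h2> elements inside the .instrumentHeader container.
header <- page %>% html_nodes(".instrumentHeader h2") %>% html_text()

print(pair_ids)
print(header)
```

If either selector returns an empty result on the live page, the markup has probably changed and the selectors need to be adjusted.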
The full code :
library(rvest)
library(httr)

startDate <- as.Date("2020-06-01")
endDate <- Sys.Date() # today
userAgent <- "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
mainUrl <- "https://www.investing.com/rates-bonds/spain-5-year-bond-yield-historical-data"

# Open a session on the main page; the POST below is sent through the same session
s <- html_session(mainUrl)

# Instrument id, stored in the pair_ids attribute of a div
pair_ids <- s %>%
  html_nodes("div[pair_ids]") %>%
  html_attr("pair_ids")

# Page heading, sent as the "header" form field
header <- s %>% html_nodes(".instrumentHeader h2") %>% html_text()

# rvest:::request_POST is an internal (non-exported) rvest function,
# so it may change or disappear between rvest versions
resp <- s %>% rvest:::request_POST(
  "https://www.investing.com/instruments/HistoricalDataAjax",
  add_headers('X-Requested-With' = 'XMLHttpRequest'),
  user_agent(userAgent),
  body = list(
    curr_id = pair_ids,
    header = header[[1]],
    st_date = format(startDate, format = "%m/%d/%Y"), # dates are sent as MM/DD/YYYY
    end_date = format(endDate, format = "%m/%d/%Y"),
    interval_sec = "Daily",
    sort_col = "date",
    sort_ord = "DESC",
    action = "historical_data"
  ),
  encode = "form") %>%
  html_table()

# The first table in the response holds the historical data
print(resp[[1]])
Output:
    Date          Price   Open   High    Low Change %
1   Oct 09, 2020 -0.339 -0.338 -0.333 -0.361    2.42%
2   Oct 08, 2020 -0.331 -0.306 -0.306 -0.338    7.47%
3   Oct 07, 2020 -0.308 -0.323 -0.300 -0.324   -0.65%
4   Oct 06, 2020 -0.310 -0.288 -0.278 -0.319    7.27%
5   Oct 05, 2020 -0.289 -0.323 -0.278 -0.331  -10.39%
6   Oct 03, 2020 -0.322 -0.322 -0.322 -0.322    1.42%
7   Oct 02, 2020 -0.318 -0.311 -0.302 -0.320    5.65%
.....................................................
.....................................................
96  Jun 08, 2020 -0.162 -0.152 -0.133 -0.173   13.29%
97  Jun 05, 2020 -0.143 -0.129 -0.127 -0.154   13.49%
98  Jun 04, 2020 -0.126 -0.089 -0.063 -0.148   38.46%
99  Jun 03, 2020 -0.091 -0.120 -0.087 -0.128  -35.00%
100 Jun 02, 2020 -0.140 -0.148 -0.137 -0.166   14.75%
101 Jun 01, 2020 -0.122 -0.140 -0.101 -0.150  -17.57%
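The Date and Change % columns in this output are printed as text, so they usually need converting before the series can be used. Here is a minimal sketch, assuming the columns arrive as character strings as printed above and that R runs in an English locale (the %b month abbreviation is locale-dependent). The small data frame is a stand-in for resp[[1]], built from three rows of the output.

```r
# Stand-in for resp[[1]], using three rows copied from the output above.
yields <- data.frame(
  Date = c("Oct 09, 2020", "Oct 08, 2020", "Jun 01, 2020"),
  Price = c(-0.339, -0.331, -0.122),
  `Change %` = c("2.42%", "7.47%", "-17.57%"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# "Oct 09, 2020" -> Date; %b assumes English month abbreviations.
yields$Date <- as.Date(yields$Date, format = "%b %d, %Y")

# "2.42%" -> 2.42 (drop the percent sign, then convert to numeric).
yields[["Change %"]] <- as.numeric(sub("%", "", yields[["Change %"]], fixed = TRUE))

# The request sorts DESC, so reorder to chronological order.
yields <- yields[order(yields$Date), ]
print(yields)
```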
This also works for other historical-data pages on the site if you replace the value of the mainUrl variable.
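Since only the scraped id, the header and the date range change between instruments, the form body could be factored into a small helper. This is just a sketch: build_history_body is a hypothetical name, the field names are copied from the code above, and the id and header passed in the example are placeholders.

```r
# Hypothetical helper (not part of rvest or httr): builds the form fields
# used in the POST above from an id, a header and a date range.
build_history_body <- function(pair_id, header, start_date, end_date) {
  list(
    curr_id = pair_id,
    header = header,
    # Dates are sent as MM/DD/YYYY, as in the code above.
    st_date = format(as.Date(start_date), "%m/%d/%Y"),
    end_date = format(as.Date(end_date), "%m/%d/%Y"),
    interval_sec = "Daily",
    sort_col = "date",
    sort_ord = "DESC",
    action = "historical_data"
  )
}

# Placeholder id and header; the real values come from the scraped page.
body <- build_history_body("12345", "Example Bond Yield Historical Data",
                           "2010-01-01", "2020-10-09")
print(body$st_date)
print(body$end_date)
```

The returned list can then be passed as the body argument of the POST call, with encode = "form" as before.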
Source: https://stackoverflow.com/questions/64298886/rvest-webscraping-in-r-with-form-inputs