I'm trying to download the content from a page and I'm finding that the response data is either malformed or incomplete, as if GET or getURL are pulling before those data
It's important to understand that when you scrape a webpage, you are getting the raw HTML source code for that page; this isn't necessarily what you will be interacting with in a web browser. When you call GET(url) you are getting the actual html/text that is the source of that page. This is what is sent directly from the server. Nowadays most web pages also assume the browser will not only display the HTML, but will also execute the javascript code on that page. This is especially true when a lot of in-page content is generated later by javascript. That's exactly what's going on in this page. The "content" on the page isn't found in the html source of that page; it is downloaded later via javascript.
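To make the distinction concrete, here is a minimal offline sketch. The HTML string, the `fund-table` id and the `data.js` file name are all made up for illustration; they are not taken from the page in the question. It shows that text rendered by a script is absent from the raw source, while the reference to the script itself is present.

```r
# Hypothetical page source: the table container is empty in the raw HTML,
# and a script is expected to fill it in once a browser loads the page.
raw_html <- paste0(
  '<html><body>',
  '<div id="fund-table"></div>',
  '<script src="data.js"></script>',
  '</body></html>'
)

# The text a browser user would eventually see is not in the source ...
has_fund_name <- grepl("FTSE Canada", raw_html, fixed = TRUE)

# ... but the reference to the script that would load it is.
script_tags <- regmatches(
  raw_html,
  gregexpr('<script[^>]*src="[^"]+"', raw_html)
)[[1]]
script_srcs <- sub('.*src="([^"]+)"', "\\1", script_tags)

print(has_fund_name)
print(script_srcs)
```

Listing the script sources like this is one rough way to find candidate files that may carry the data.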
Neither httr nor RCurl will execute the javascript required to "fill" the page with the table you are actually viewing. There is a package called RSelenium which is capable of interacting with a browser to execute javascript, but in this case we actually can get around that.
First, just a side note on why getURL didn't work. It seems this web server sniffs the user-agent sent by the requesting program and sends different content back depending on it. Whatever user-agent RCurl sends by default isn't deemed "good" enough to get the html from the server. You can get around this by specifying a different user agent. For example,
d2 <- getURL(url, .opts=list(useragent="Mozila 5.0"))
seems to work.
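If you would rather stay with httr, the same idea can be sketched as below. `fetch_with_agent` is a hypothetical helper name and the default agent string is only an example; whether a given server accepts it is something you would have to check.

```r
# Sketch: set an explicit user-agent with httr instead of RCurl.
# `fetch_with_agent` is a hypothetical helper, not part of httr.
fetch_with_agent <- function(url, agent = "Mozilla/5.0") {
  resp <- httr::GET(url, httr::user_agent(agent))
  httr::stop_for_status(resp)  # raise an R error on 4xx/5xx responses
  httr::content(resp, as = "text", encoding = "UTF-8")
}

# Usage (needs network access and the httr package installed):
# html <- fetch_with_agent(url)
```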
But getting back to the main problem. When working on problems like this, I strongly recommend you use the Chrome Developer tools (or whatever the equivalent is in your favorite browser). In the Chrome developer tools, specifically on the Network tab, you can see all requests made by Chrome to get the data.
If you click on the first one ("etfs.html") you can see the headers and response for that request. On the response sub-tab, you should see exactly the same content that is found by GET or getURL. After that, the browser downloads a bunch of CSS and javascript files. The file that looked most interesting was "GetETFJson.js". This seems to hold most of the data in an almost JSON-like format. It has some true javascript in front of the JSON block that kind of gets in the way. But we can download that file with
d3 <- GET("https://www.vanguardcanada.ca/individual/mvc/GetETFJson.js")
and extract the content as text with
p3 <- content(d3, as="text")
and then turn it into an R object with
library(jsonlite)
r3 <- fromJSON(substr(p3,13,nchar(p3)))
Again, we are using substr above to strip off the non-JSON stuff at the beginning to make it easier to parse.
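A hard-coded offset like 13 will break if the javascript prefix ever changes length. One possible alternative, sketched here on a made-up payload (the `var data = ...;` wrapper is an assumption, not the actual contents of GetETFJson.js), is to locate the first `{` and the last `}` and keep what lies between them.

```r
# Hypothetical payload: a javascript assignment wrapped around a JSON object.
p <- 'var data = {"fund": [{"portId": "9548"}, {"portId": "9561"}]};'

# Position of the first opening brace and the last closing brace
first_brace <- regexpr("{", p, fixed = TRUE)[1]
last_brace  <- max(gregexpr("}", p, fixed = TRUE)[[1]])

# Keep only the JSON block between them
json_txt <- substr(p, first_brace, last_brace)

# Parse it if jsonlite is available
if (requireNamespace("jsonlite", quietly = TRUE)) {
  parsed <- jsonlite::fromJSON(json_txt)
  print(parsed$fund$portId)
}
```

This assumes the payload contains exactly one top-level JSON object; if the file held several, the brace search would need to be more careful.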
Now you can explore the object returned. It looks like the data you want is stored in the following vectors:
cbind(r3$fundData$Fund$profile$portId, r3$fundData$Fund$profile$benchMark)
[,1] [,2]
[1,] "9548" "FTSE All World ex Canada Index in CAD"
[2,] "9561" "FTSE Canada All Cap Index in CAD"
[3,] "9554" "Spliced Canada Index"
[4,] "9559" "FTSE Canada All Cap Real Estate Capped 25% Index"
[5,] "9560" "FTSE Canada High Dividend Yield Index"
[6,] "9550" "FTSE Developed Asia Pacific Index in CAD"
[7,] "9549" "FTSE Developed Europe Index in CAD"
[8,] "9558" "FTSE Developed ex North America Index in CAD"
[9,] "9555" "Spliced FTSE Developed ex North America Index Hedged in CAD"
[10,] "9556" "Spliced Emerging Markets Index in CAD"
[11,] "9563" "S&P 500 Index in CAD"
[12,] "9562" "S&P 500 Index in CAD Hedged"
[13,] "9566" "NASDAQ US Dividend Achievers Select Index in CAD"
[14,] "9564" "NASDAQ US Dividend Achievers Select Index Hedged in CAD"
[15,] "9557" "CRSP US Total Market Index in CAD"
[16,] "9551" "Spliced US Total Market Index Hedged in CAD"
[17,] "9552" "Barclays Global Aggregate CAD Float Adjusted Index in CAD"
[18,] "9553" "Barclays Global Aggregate CAD 1-5 Year Govt/Credit Float Adj Ix in CAD"
[19,] "9565" "Barclays Global Aggregate Canadian 1-5 Year Credit Float Adjusted Index in CAD"
[20,] "9568" "Barclays Global Aggregate ex-USD Float Adjusted RIC Capped Index Hedged in CAD"
[21,] "9567" "Barclays U.S. Aggregate Float Adjusted Index Hedged in CAD"
So hopefully that should be sufficient to extract the data you need to identify the path to the URL with more data.
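If you want to look values up by id rather than by row number, the two vectors can be put into a data frame. The sketch below uses a small stand-in for r3 built from two rows of the output above; the real object returned by fromJSON may be nested somewhat differently, so treat the structure here as an assumption.

```r
# Stand-in for the parsed object, using two rows from the output above.
# The nesting (fundData > Fund > profile) follows the cbind() call.
r3 <- list(fundData = list(Fund = list(profile = list(
  portId    = c("9548", "9561"),
  benchMark = c("FTSE All World ex Canada Index in CAD",
                "FTSE Canada All Cap Index in CAD")
))))

profile <- r3$fundData$Fund$profile

# One row per fund, with named columns instead of [,1] and [,2]
funds <- data.frame(
  portId    = profile$portId,
  benchMark = profile$benchMark,
  stringsAsFactors = FALSE
)

# Look up a benchmark by portfolio id
funds$benchMark[funds$portId == "9561"]
```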