Question
I need to extract a table as a data.frame from the following HTML page:
https://www.forbes.com/powerful-brands/list/#tab:rank.html
Answer 1:
That table is loaded dynamically, so a plain HTTP request will not return it and you need to drive a (headless) browser. RSelenium should be your first choice. You also need rvest to extract the table from the rendered page.
Note: After you navigate to that page, a transition page appears first. You can click "continue" manually or just wait a few seconds.
Code:
library(rvest)
library(RSelenium)

# Start a Selenium server and open a Chrome session
remDr <- rsDriver(port = 4445L, browser = "chrome")
myclient <- remDr$client

# Navigate to the page. A transition page is shown first: click the "continue"
# button manually, select and click it via CSS, or just wait a few seconds.
myclient$navigate("https://www.forbes.com/powerful-brands/list/#tab:rank")

# Scroll down several times; otherwise you only get the top 10 rows of the list
replicate(20, myclient$sendKeysToActiveElement(list(key = "page_down")))

# Get the rendered page source
mypagesource <- unlist(myclient$getPageSource())

# Use rvest to extract the table
mytable <- read_html(mypagesource) %>% html_node("#the_list") %>% html_table()
> str(mytable)
'data.frame': 109 obs. of 8 variables:
$ : logi NA NA NA NA NA NA ...
$ Rank : chr "#1" "#2" "#3" "#4" ...
$ Brand : chr "Apple" "Google" "Microsoft" "Facebook" ...
$ Brand Value : chr "$170 B" "$101.8 B" "$87 B" "$73.5 B" ...
$ 1-Yr Value Change : chr "10%" "23%" "16%" "40%" ...
$ Brand Revenue : chr "$214.2 B" "$80.5 B" "$85.3 B" "$25.6 B" ...
$ Company Advertising: chr "$1.8 B" "$3.9 B" "$1.6 B" "$310 M" ...
$ Industry : chr "Technology" "Technology" "Technology" "Technology" ...
You can then clean the data afterwards.
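A minimal sketch of such a cleanup step, using base R only. It is not part of the original answer: `sample_table` is a small stand-in built from the first rows of the `str()` output above (with a placeholder column `V1` for the unnamed all-NA column), and `parse_money` is a hypothetical helper that assumes values always look like "$170 B" or "$310 M".

```r
# Stand-in for `mytable`, built from the first rows shown in the str() output.
sample_table <- data.frame(
  V1 = NA,  # placeholder for the unnamed, all-NA first column
  Rank = c("#1", "#2", "#3", "#4"),
  Brand = c("Apple", "Google", "Microsoft", "Facebook"),
  `Brand Value` = c("$170 B", "$101.8 B", "$87 B", "$73.5 B"),
  `1-Yr Value Change` = c("10%", "23%", "16%", "40%"),
  `Company Advertising` = c("$1.8 B", "$3.9 B", "$1.6 B", "$310 M"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Hypothetical helper: convert "$170 B" / "$310 M" to a number in billions.
# Assumes the only unit suffixes are "B" (billions) and "M" (millions).
parse_money <- function(x) {
  value <- as.numeric(gsub("[^0-9.-]", "", x))  # keep digits, dot, minus sign
  ifelse(grepl("M", x, fixed = TRUE), value / 1000, value)
}

# Drop columns that contain only NA values
clean <- sample_table[, colSums(!is.na(sample_table)) > 0, drop = FALSE]

# Convert the text columns to numeric types
clean$Rank <- as.integer(sub("#", "", clean$Rank, fixed = TRUE))
clean[["Brand Value"]] <- parse_money(clean[["Brand Value"]])
clean[["Company Advertising"]] <- parse_money(clean[["Company Advertising"]])
clean[["1-Yr Value Change"]] <-
  as.numeric(sub("%", "", clean[["1-Yr Value Change"]], fixed = TRUE))

str(clean)
```

To apply the same steps to the scraped data, replace `sample_table` with `mytable`, and treat "Brand Revenue" the same way as "Brand Value". Check the result for NA values, since rows that do not follow the assumed format will not parse.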
Introductions and tutorials for those packages:
https://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-basics.html
https://stat4701.github.io/edav/2015/04/02/rvest_tutorial/
Source: https://stackoverflow.com/questions/49959431/how-do-i-get-extract-a-table-from-an-html-page-as-a-data-frame-using-xml-and-rcu