Question
I've been trying to extract a table from a webpage. The data is flight track data from a live flight tracking website (https://flightaware.com/live/flight/WJA1508/history/20150814/1720Z/CYYC/KSFO/tracklog).
I've tried the XML, RCurl, and curl packages, but none of them worked. I believe this is most likely because I couldn't figure out how to get past SSL, or how to skip the columns that contain notes on the flight status (i.e., the first two from the top and the third from the bottom of the table).
Does anyone know how to extract this table into R?
Answer 1:
As noted by @hrbrmstr in the comments above, this violates FlightAware's TOS, but what you do with your code is your business. :) This should get you most of the way there using the rvest package:
library(rvest)
u <- "https://flightaware.com/live/flight/WJA1508/history/20150814/1720Z/CYYC/KSFO/tracklog"
## read_html() replaces html(), which is deprecated in later rvest releases
html_read <- read_html(u)
tbl <- html_table(
  html_nodes(html_read, "table"),
  fill = TRUE,
  header = FALSE,
  trim = TRUE
)[[2]]
## Keep everything from the first row of data (row 6) onward
## and drop the extra columns that are entirely NA:
tbl_o <- tbl[6:nrow(tbl), ]
tbl_o <- tbl_o[, colSums(is.na(tbl_o)) != nrow(tbl_o)]
names(tbl_o) <- c(
  "Time", "Lat", "Lon",
  "Course", "Direction",
  "KTS", "MPH", "Alt",
  "Rate", "Location"
)
str(tbl_o)
Which yields:
'data.frame': 292 obs. of 10 variables:
$ Time : chr "Fri 01:41:34 PM" "Fri 01:48:59 PM" "Fri 01:49:14 PM" "Fri 01:50:05 PM" ...
$ Lat : chr "51.0833" "51.1551" "51.1683" "51.2235" ...
$ Lon : chr "-113.9667" "-114.0209" "-114.0209" "-114.0220" ...
$ Course : chr "335°" "0°" "0°" "358°" ...
$ Direction: chr "Northwest" "North" "North" "North" ...
$ KTS : chr "20" "201" "219" "149" ...
$ MPH : chr "23" "231" "252" "171" ...
$ Alt : chr "3,500" "4,900" "5,200" "6,800" ...
$ Rate : chr "" "222" "1,727" "1,701" ...
$ Location : chr "Edmonton Center" "FlightAware ADS-B (CYYC)" "FlightAware ADS-B (CYYC)" "FlightAware ADS-B (CEG2)" ...
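Every column in the str() output above comes back as character, including values such as "3,500" and "335°". If you need numbers, something along these lines might work. This is only a sketch: the small data frame is rebuilt by hand from the first two values shown above so the snippet runs on its own, and to_num() is a hypothetical helper, not part of rvest.

```r
# Sample values copied from the str() output above (first two rows only);
# in practice you would apply this to the scraped tbl_o directly.
tbl_o <- data.frame(
  Lat    = c("51.0833", "51.1551"),
  Lon    = c("-113.9667", "-114.0209"),
  Course = c("335°", "0°"),
  KTS    = c("20", "201"),
  MPH    = c("23", "231"),
  Alt    = c("3,500", "4,900"),
  Rate   = c("", "222"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop everything except digits, the decimal point
# and the minus sign, then convert. Empty strings become NA.
to_num <- function(x) as.numeric(gsub("[^0-9.-]", "", x))

num_cols <- c("Lat", "Lon", "Course", "KTS", "MPH", "Alt", "Rate")
tbl_o[num_cols] <- lapply(tbl_o[num_cols], to_num)

str(tbl_o)
```

The Time, Direction, and Location columns are left as character here; parsing Time would need the flight date as well, since the table only shows the weekday and clock time.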
Source: https://stackoverflow.com/questions/32021051/extracting-html-table-into-r