Question
I'm somewhat new to scraping with R, but I'm getting an error message that I can't make sense of. My code:
library(rvest)

url <- "https://en.wikipedia.org/wiki/California_State_Legislature,_2017%E2%80%9318_session"
leg <- read_html(url)
testdata <- leg %>%
  html_nodes('table') %>%
  .[6] %>%
  html_table()
To which I get the response:
Error in out[j + k, ] : subscript out of bounds
When I swap out html_table with html_text I don't get the error. Any idea what I'm doing wrong?
Thanks!
Answer 1:
Hope this helps!
library(htmltab)
library(dplyr)
library(tidyr)

url <- "https://en.wikipedia.org/wiki/California_State_Legislature,_2017%E2%80%9318_session"
url %>%
  htmltab(6, rm_nodata_cols = FALSE) %>%                        # parse the 6th table on the page
  .[,-1] %>%                                                    # drop the first column
  replace_na(list(Notes = "", "Term-limited?" = "")) %>%        # blank out missing values
  `rownames<-` (seq_len(nrow(.)))                               # renumber rows from 1
Output is:
  District              Name      Party       Residence Term-limited? Notes
1        1        Ted Gaines Republican El Dorado Hills
2        2      Mike McGuire Democratic      Healdsburg
3        3         Bill Dodd Democratic            Napa
4        4       Jim Nielsen Republican          Gerber
5        5 Cathleen Galgiani Democratic        Stockton
6        6       Richard Pan Democratic      Sacramento
...
Answer 2:
Why not just target the table better?
library(rvest)
wp_url <- "https://en.wikipedia.org/wiki/California_State_Legislature,_2017%E2%80%9318_session"
leg <- read_html(wp_url)
html_node(leg, xpath=".//table[contains(., 'District')]") %>%
  html_table()
## Position Position Name Party District
## 1 Lieutenant Governor Gavin Newsom Democratic
## 2 President pro tempore Kevin de León Democratic 24th–Los Angeles
## 3 Majority leader Bill Monning Democratic 17th–Carmel
## 4 Majority whip Nancy Skinner Democratic 9th–Berkeley
## 5 Majority caucus chair Connie Leyva Democratic 20th–Chino
## 6 Majority caucus vice chair Mike McGuire Democratic 2nd–Healdsburg
## 7 Minority leader Patricia Bates Republican 36th–Laguna Niguel
## 8 Minority caucus chair Jim Nielsen Republican 4th–Gerber
## 9 Minority whip Ted Gaines Republican 1st–El Dorado Hills
## 10 Secretary Secretary Daniel Alvarez Daniel Alvarez Daniel Alvarez
## 11 Sergeant-at-Arms Sergeant-at-Arms Debbie Manning Debbie Manning Debbie Manning
## 12 Chaplain Chaplain Sister Michelle Gorman Sister Michelle Gorman Sister Michelle Gorman
ARGH! Wrong table. It's still unwise to just use numeric indexes like that. We can still target the table you want better:
library(rvest)
library(purrr)
wp_url <- "https://en.wikipedia.org/wiki/California_State_Legislature,_2017%E2%80%9318_session"
leg <- read_html(wp_url)
target_table <- html_node(leg, xpath=".//span[@id='Members']/../following-sibling::table")
But rvest::html_table() is causing the error, and you should absolutely file a bug report on the package's GitHub page for it.
The htmltab pkg used in the other answer looks handy (and feel free to accept that answer instead of this one, since it's shorter and works).
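To see what that XPath selects, here is a minimal sketch on a made-up page (the HTML below is a hypothetical stand-in, not the real Wikipedia markup). It assumes, as on the target page, that the section heading wraps a span carrying the id and that the table is a later sibling of that heading:

```r
library(rvest)

# Hypothetical toy page: two headed sections, each followed by a table
toy_page <- read_html('<html><body>
  <h2><span id="Officers">Officers</span></h2>
  <table><tr><th>Position</th></tr><tr><td>Secretary</td></tr></table>
  <h2><span id="Members">Members</span></h2>
  <table><tr><th>Name</th></tr><tr><td>Ted Gaines</td></tr></table>
</body></html>')

# Find the span by id, step up to its parent heading, then take
# the first table among the heading's following siblings
toy_table <- html_node(
  toy_page,
  xpath = ".//span[@id='Members']/../following-sibling::table"
)

# Text of the first data cell in the selected table
toy_cell <- html_text(html_node(toy_table, xpath = ".//td"))
toy_cell
```

Note that following-sibling::table matches every later sibling table; html_node() keeps only the first match, which is the one right after the heading here.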
We'll do it the old-fashioned way, but will need a helper function to make better column names:
mcga <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:][:space:]]+", "_", x)
  x <- gsub("_+", "_", x)
  x <- gsub("(^_|_$)", "", x)
  make.unique(x, sep = "_")
}
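A quick check of what the helper does to header text like the headers on this page. The function is repeated so the snippet runs on its own, and the input vectors are just illustrative:

```r
# Same helper as above, repeated so this snippet is self-contained
mcga <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:][:space:]]+", "_", x)
  x <- gsub("_+", "_", x)
  x <- gsub("(^_|_$)", "", x)
  make.unique(x, sep = "_")
}

# Punctuation and spaces collapse to single underscores,
# and leading/trailing underscores are trimmed
clean_names <- mcga(c("District", "Term-limited?", "Notes"))
clean_names

# Duplicate names get a numeric suffix from make.unique()
dup_names <- mcga(c("Position", "Position"))
dup_names
```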
Now, we extract the header row and the data rows:
header_row <- html_node(target_table, xpath=".//tr[th]")
data_rows <- html_nodes(target_table, xpath=".//tr[td]")
We peek at the header row and see that there's an evil colspan in there. We'll make use of this knowledge later.
html_children(header_row)
## {xml_nodeset (6)}
## [1] <th scope="col" width="30" colspan="2">District</th>
## [2] <th scope="col" width="170">Name</th>
## [3] <th scope="col" width="70">Party</th>
## [4] <th scope="col" width="130">Residence</th>
## [5] <th scope="col" width="60">Term-limited?</th>
## [6] <th scope="col" width="370">Notes</th>
Get the column names, and make them tidy:
html_children(header_row) %>%
  html_text() %>%
  tolower() %>%
  mcga() -> col_names
Now, iterate over the rows, pull out the values, remove the extra first value (because of the colspan, each data row has one more cell than the header has names) and turn the whole thing into a data frame:
map_df(data_rows, ~{
  kid_txt <- html_children(.x) %>% html_text()
  as.list(setNames(kid_txt[-1], col_names))
})
## # A tibble: 40 x 6
##    district name              party      residence       term_limited notes
##    <chr>    <chr>             <chr>      <chr>           <chr>        <chr>
##  1 1        Ted Gaines        Republican El Dorado Hills
##  2 2        Mike McGuire      Democratic Healdsburg
##  3 3        Bill Dodd         Democratic Napa
##  4 4        Jim Nielsen       Republican Gerber
##  5 5        Cathleen Galgiani Democratic Stockton
##  6 6        Richard Pan       Democratic Sacramento
##  7 7        Steve Glazer      Democratic Orinda
##  8 8        Tom Berryhill     Republican Twain Harte     Yes
##  9 9        Nancy Skinner     Democratic Berkeley
## 10 10       Bob Wieckowski    Democratic Fremont
## # ... with 30 more rows
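The colspan mismatch is easier to see on a small offline example. This is a sketch on a hypothetical toy table (made-up markup with fewer columns than the real page), assuming the same layout quirk: the first header cell spans two columns, so every data row carries one more cell than the header has names.

```r
library(rvest)
library(purrr)

# Hypothetical toy table mimicking the colspan layout
toy <- read_html('<table>
  <tr><th colspan="2">District</th><th>Name</th><th>Party</th></tr>
  <tr><td></td><td>1</td><td>Ted Gaines</td><td>Republican</td></tr>
  <tr><td></td><td>2</td><td>Mike McGuire</td><td>Democratic</td></tr>
</table>')

toy_header <- html_nodes(toy, xpath = ".//tr[th]/th")
toy_rows   <- html_nodes(toy, xpath = ".//tr[td]")
toy_names  <- tolower(html_text(toy_header))

# The header has fewer cells than each data row
n_header_cells <- length(toy_header)
n_row_cells    <- map_int(toy_rows, ~ length(html_children(.x)))

# Drop the extra leading cell so the values line up with the names
toy_df <- map_df(toy_rows, ~{
  cells <- html_text(html_children(.x))
  as.list(setNames(cells[-1], toy_names))
})
toy_df
```

Dropping a fixed number of leading cells only works when every data row has the same surplus; if rows vary, the cell counts in n_row_cells are worth checking first.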
Source: https://stackoverflow.com/questions/47585699/rvest-html-table-error-error-in-outj-k-subscript-out-of-bounds