Rvest html_table error - Error in out[j + k, ] : subscript out of bounds

Submitted by  ̄綄美尐妖づ on 2020-02-02 03:17:12


I'm somewhat new to scraping with R, but I'm getting an error message that I can't make sense of. My code:

library(rvest)

url <- "https://en.wikipedia.org/wiki/California_State_Legislature,_2017%E2%80%9318_session"

leg <- read_html(url)

testdata <- leg %>%
  html_nodes('table') %>%
  .[6] %>%
  html_table()

To which I get the response:

Error in out[j + k, ] : subscript out of bounds

When I swap out html_table with html_text I don't get the error. Any idea what I'm doing wrong?



Hope this helps!


library(htmltab)   # htmltab()
library(tidyr)     # replace_na()
library(magrittr)  # %>%

url <- "https://en.wikipedia.org/wiki/California_State_Legislature,_2017%E2%80%9318_session"
url %>%
  htmltab(6, rm_nodata_cols = FALSE) %>%
  .[,-1] %>%  # drop the first column
  replace_na(list(Notes = "", "Term-limited?" = "")) %>%
  `rownames<-` (seq_len(nrow(.)))

Output is:

  District              Name      Party       Residence Term-limited? Notes
1        1        Ted Gaines Republican El Dorado Hills                    
2        2      Mike McGuire Democratic      Healdsburg                    
3        3         Bill Dodd Democratic            Napa                    
4        4       Jim Nielsen Republican          Gerber                    
5        5 Cathleen Galgiani Democratic        Stockton                    
6        6       Richard Pan Democratic      Sacramento                    


Why not just target the table better?


library(rvest)

wp_url <- "https://en.wikipedia.org/wiki/California_State_Legislature,_2017%E2%80%9318_session"

leg <- read_html(wp_url)

html_node(leg, xpath=".//table[contains(., 'District')]") %>%
  html_table()
##            Position                   Position                   Name                  Party               District
## 1                          Lieutenant Governor           Gavin Newsom             Democratic                       
## 2                        President pro tempore          Kevin de León             Democratic       24th–Los Angeles
## 3                              Majority leader           Bill Monning             Democratic            17th–Carmel
## 4                                Majority whip          Nancy Skinner             Democratic           9th–Berkeley
## 5                        Majority caucus chair           Connie Leyva             Democratic             20th–Chino
## 6                   Majority caucus vice chair           Mike McGuire             Democratic         2nd–Healdsburg
## 7                              Minority leader         Patricia Bates             Republican     36th–Laguna Niguel
## 8                        Minority caucus chair            Jim Nielsen             Republican             4th–Gerber
## 9                                Minority whip             Ted Gaines             Republican    1st–El Dorado Hills
## 10        Secretary                  Secretary         Daniel Alvarez         Daniel Alvarez         Daniel Alvarez
## 11 Sergeant-at-Arms           Sergeant-at-Arms         Debbie Manning         Debbie Manning         Debbie Manning
## 12         Chaplain                   Chaplain Sister Michelle Gorman Sister Michelle Gorman Sister Michelle Gorman

ARGH! Wrong table. It's still unwise to just use numeric indexes like that. We can still target the table you want better:


wp_url <- "https://en.wikipedia.org/wiki/California_State_Legislature,_2017%E2%80%9318_session"

leg <- read_html(wp_url)

target_table <- html_node(leg, xpath=".//span[@id='Members']/../following-sibling::table")
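If it helps to see what that XPath is doing, here is a minimal sketch on a made-up page fragment (toy_page and its contents are hypothetical, not the real Wikipedia markup). It finds the span with id "Members", steps up to its parent heading, and takes the tables that follow it as siblings; html_node() then keeps the first one.

```r
library(rvest)

# Hypothetical fragment: one table before the heading, two after it
toy_page <- read_html('<body>
  <table><tr><td>first table</td></tr></table>
  <h2><span id="Members">Members</span></h2>
  <table><tr><td>members table</td></tr></table>
  <table><tr><td>later table</td></tr></table>
</body>')

# Same XPath as above: span -> parent heading -> following sibling tables
hit <- html_node(toy_page, xpath = ".//span[@id='Members']/../following-sibling::table")

html_text(hit)
## [1] "members table"
```

The table before the heading is skipped, and the later one is ignored because html_node() returns only the first match.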

But rvest::html_table() is what raises the error on this table, and you should absolutely file a bug report on its GitHub page.

The htmltab pkg used in the other answer looks handy (and feel free to accept that answer over this one, since it's shorter and works).

We'll do it the old-fashioned way, but will need a helper function to make better column names:

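mcga <- function(x) {
  x <- tolower(x)                             # lower-case everything
  x <- gsub("[[:punct:][:space:]]+", "_", x)  # punctuation/whitespace runs -> "_"
  x <- gsub("_+", "_", x)                     # collapse repeated underscores
  x <- gsub("(^_|_$)", "", x)                 # trim a leading or trailing underscore
  make.unique(x, sep = "_")                   # de-duplicate repeated names
}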
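A quick check of what the helper produces. The function is repeated so the snippet runs on its own, and the second input (with a repeated header) is made up just to exercise make.unique():

```r
# Copy of the helper so this snippet is self-contained
mcga <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:][:space:]]+", "_", x)
  x <- gsub("_+", "_", x)
  x <- gsub("(^_|_$)", "", x)
  make.unique(x, sep = "_")
}

# Header text as it appears in the Members table
clean_members <- mcga(c("District", "Name", "Party", "Residence", "Term-limited?", "Notes"))
clean_members
## [1] "district"     "name"         "party"        "residence"    "term_limited" "notes"

# Hypothetical input with a duplicated header
clean_dupes <- mcga(c("Position", "Position", "Name"))
clean_dupes
## [1] "position"   "position_1" "name"
```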

Now, we extract the header row and the data rows:

header_row <- html_node(target_table, xpath=".//tr[th]")
data_rows <- html_nodes(target_table, xpath=".//tr[td]")

We peek at the header row and see that there's an evil colspan in there. We'll make use of this knowledge later.

html_children(header_row)
## {xml_nodeset (6)}
## [1] <th scope="col" width="30" colspan="2">District</th>
## [2] <th scope="col" width="170">Name</th>
## [3] <th scope="col" width="70">Party</th>
## [4] <th scope="col" width="130">Residence</th>
## [5] <th scope="col" width="60">Term-limited?</th>
## [6] <th scope="col" width="370">Notes</th>

Get the column names, and make them tidy:

html_children(header_row) %>%
  html_text() %>%
  tolower() %>%
  mcga() -> col_names

Now, iterate over the rows, pull out the values, remove the extra first value and turn the whole thing into a data frame:

library(purrr)

map_df(data_rows, ~{
  kid_txt <- html_children(.x) %>% html_text()
  # each row has one more cell than the header (the colspan), so drop the first
  as.list(setNames(kid_txt[-1], col_names))
})
## # A tibble: 40 x 6
##    district              name      party       residence term_limited notes
##       <chr>             <chr>      <chr>           <chr>        <chr> <chr>
##  1        1        Ted Gaines Republican El Dorado Hills                   
##  2        2      Mike McGuire Democratic      Healdsburg                   
##  3        3         Bill Dodd Democratic            Napa                   
##  4        4       Jim Nielsen Republican          Gerber                   
##  5        5 Cathleen Galgiani Democratic        Stockton                   
##  6        6       Richard Pan Democratic      Sacramento                   
##  7        7      Steve Glazer Democratic          Orinda                   
##  8        8     Tom Berryhill Republican     Twain Harte          Yes      
##  9        9     Nancy Skinner Democratic        Berkeley                   
## 10       10    Bob Wieckowski Democratic         Fremont                   
## # ... with 30 more rows
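
For anyone who wants to try the row-by-row approach without hitting Wikipedia, here's a minimal sketch on a made-up table. toy_html and everything in it is hypothetical, only shaped like the Members table: the first header cell has colspan="2", so every data row carries one more cell than there are header cells.

```r
library(rvest)
library(purrr)

# Hypothetical miniature of the Members table
toy_html <- '<table>
  <tr><th colspan="2">District</th><th>Name</th></tr>
  <tr><td></td><td>1</td><td>Ted Gaines</td></tr>
  <tr><td></td><td>2</td><td>Mike McGuire</td></tr>
</table>'

toy_table <- html_node(read_html(toy_html), "table")

# Header row is the <tr> holding <th> cells; data rows hold <td> cells
toy_header <- html_node(toy_table, xpath = ".//tr[th]")
toy_rows <- html_nodes(toy_table, xpath = ".//tr[td]")

toy_names <- tolower(html_text(html_children(toy_header)))

# Drop the extra leading cell in each row, then name the rest by header
toy_df <- map_df(toy_rows, ~{
  cells <- html_text(html_children(.x))
  as.list(setNames(cells[-1], toy_names))
})
```

toy_df ends up with two rows and the columns district and name. Hard-coding [-1] only works because the extra cell is always the first one in this layout; a table with colspans elsewhere would need a different rule.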

