Convert xml_nodeset to data.frame

攒了一身酷 2021-02-02 17:33

I am using rvest and would like to convert the result to a data frame:

> links <- pgsession %>% jump_to(urls[2])  %>%  read_html() %&         


        
1 answer
  •  不思量自难忘°
    2021-02-02 18:32

    This will get you all the attributes from the links into a tbl_df. Each link becomes one row, and bind_rows gives you "fill" for free: any attribute a link does not have comes out as NA:

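    library(rvest)
    library(xml2)   # xml_attrs() comes from xml2; load it explicitly in case rvest does not attach it
    library(dplyr)
    
    pg <- read_html("https://en.wikipedia.org/wiki/Main_Page")
    links <- html_nodes(pg, "a")
    # One single-row data frame per link; bind_rows() stacks them and fills missing columns with NA
    bind_rows(lapply(xml_attrs(links), function(x) data.frame(as.list(x), stringsAsFactors=FALSE)))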
    
    ## Source: local data frame [310 x 10]
    ## 
    ##       id                         href                  title class   dir accesskey   rel  lang hreflang style
    ##    (chr)                        (chr)                  (chr) (chr) (chr)     (chr) (chr) (chr)    (chr) (chr)
    ## 1    top                           NA                     NA    NA    NA        NA    NA    NA       NA    NA
    ## 2     NA                     #mw-head                     NA    NA    NA        NA    NA    NA       NA    NA
    ## 3     NA                    #p-search                     NA    NA    NA        NA    NA    NA       NA    NA
    ## 4     NA              /wiki/Wikipedia              Wikipedia    NA    NA        NA    NA    NA       NA    NA
    ## 5     NA           /wiki/Free_content           Free content    NA    NA        NA    NA    NA       NA    NA
    ## 6     NA           /wiki/Encyclopedia           Encyclopedia    NA    NA        NA    NA    NA       NA    NA
    ## 7     NA /wiki/Wikipedia:Introduction Wikipedia:Introduction    NA    NA        NA    NA    NA       NA    NA
    ## 8     NA     /wiki/Special:Statistics     Special:Statistics    NA    NA        NA    NA    NA       NA    NA
    ## 9     NA       /wiki/English_language       English language    NA    NA        NA    NA    NA       NA    NA
    ## 10    NA            /wiki/Portal:Arts            Portal:Arts    NA    NA        NA    NA    NA       NA    NA
    ## ..   ...                          ...                    ...   ...   ...       ...   ...   ...      ...   ...
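
    To see the fill behaviour without hitting the network, here is a minimal sketch on a made-up HTML snippet. The markup and the names doc, attrs and link_df are my own assumptions for illustration, and it uses xml2's xml_find_all() in place of html_nodes() so that only xml2 and dplyr are needed:

```r
library(xml2)
library(dplyr)

# Hypothetical inline HTML so the example runs offline.
doc <- read_html('<html><body>
  <a id="top">Top</a>
  <a href="/wiki/Wikipedia" title="Wikipedia">Wikipedia</a>
  <a href="#mw-head">Jump</a>
</body></html>')

links <- xml_find_all(doc, "//a")

# One named character vector of attributes per link.
attrs <- xml_attrs(links)

# Turn each vector into a one-row data frame, then stack them.
# bind_rows() fills attributes that a link lacks with NA.
link_df <- bind_rows(lapply(attrs, function(x) {
  as.data.frame(as.list(x), stringsAsFactors = FALSE)
}))

print(link_df)
```

    Each row has a value only for the attributes that link actually carries; the other cells are NA. Note that every link in this snippet has at least one attribute.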
    

    Alternately, you could use purrr:

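    library(rvest)
    library(xml2)   # xml_attrs() comes from xml2; load it explicitly in case rvest does not attach it
    library(purrr)
    
    pg <- read_html("https://en.wikipedia.org/wiki/Main_Page")
    html_nodes(pg, "a") %>% 
      map(xml_attrs) %>%      # one named character vector of attributes per link
      map_df(~as.list(.))     # row-bind them into a tibble, filling gaps with NA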
    
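    ## # A tibble: 342 × 10
    ##       id                         href                  title class   dir accesskey   rel hreflang  lang style
    ##    <chr>                        <chr>                  <chr> <chr> <chr>     <chr> <chr>    <chr> <chr> <chr>
    ## 1    top                         <NA>                   <NA>  <NA>  <NA>      <NA>  <NA>     <NA>  <NA>  <NA>
    ## 2   <NA>                     #mw-head                   <NA>  <NA>  <NA>      <NA>  <NA>     <NA>  <NA>  <NA>
    ## 3   <NA>                    #p-search                   <NA>  <NA>  <NA>      <NA>  <NA>     <NA>  <NA>  <NA>
    ## 4   <NA>              /wiki/Wikipedia              Wikipedia  <NA>  <NA>      <NA>  <NA>     <NA>  <NA>  <NA>
    ## 5   <NA>           /wiki/Free_content           Free content  <NA>  <NA>      <NA>  <NA>     <NA>  <NA>  <NA>
    ## 6   <NA>           /wiki/Encyclopedia           Encyclopedia  <NA>  <NA>      <NA>  <NA>     <NA>  <NA>  <NA>
    ## 7   <NA> /wiki/Wikipedia:Introduction Wikipedia:Introduction  <NA>  <NA>      <NA>  <NA>     <NA>  <NA>  <NA>
    ## 8   <NA>     /wiki/Special:Statistics     Special:Statistics  <NA>  <NA>      <NA>  <NA>     <NA>  <NA>  <NA>
    ## 9   <NA>       /wiki/English_language       English language  <NA>  <NA>      <NA>  <NA>     <NA>  <NA>  <NA>
    ## 10  <NA>            /wiki/Portal:Arts            Portal:Arts  <NA>  <NA>      <NA>  <NA>     <NA>  <NA>  <NA>
    ## # ... with 332 more rows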
    

    which I think is more functionally idiomatic and an overall cleaner approach.
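
    For an offline check of the purrr route, a minimal sketch under the same kind of assumptions: the HTML snippet and the names doc and link_tbl are made up for illustration, and map_dfr() needs dplyr installed because it row-binds through it.

```r
library(xml2)
library(purrr)

# Hypothetical inline HTML so the example runs offline.
doc <- read_html('<html><body>
  <a id="top">Top</a>
  <a href="/wiki/Wikipedia" title="Wikipedia">Wikipedia</a>
  <a href="#mw-head">Jump</a>
</body></html>')

links <- xml_find_all(doc, "//a")

# xml_attrs() on a nodeset returns one named character vector per link.
# as.list() turns each into a named list, and map_dfr() row-binds those
# lists into a single tibble, filling absent attributes with NA.
link_tbl <- map_dfr(xml_attrs(links), as.list)

print(link_tbl)
```

    The result has one row per link and one column per attribute seen anywhere in the set, the same shape as the bind_rows version.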
