Harvest (rvest) multiple HTML pages from a list of urls

前端未结

关注

 1  892

I have a dataframe that looks like this:

country <- c(\"Canada\", \"US\", \"Japan\", \"China\")
url <- c(\"http://en.wikipedia.org/wiki/United_States\"


                      
              相关标签:


      
      
        
          1条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  情话喂你        
                
              
                            
                2020-12-06 03:54
              
            
            
                                                                       
This will scrape them into a full data frame (one row per TOC entry). Tedious-but-straightforward "print/output" code left to the OP:

library(rvest)
library(dplyr)

country <- c("Canada", "US", "Japan", "China")
url <- c("http://en.wikipedia.org/wiki/United_States", 
         "http://en.wikipedia.org/wiki/Canada",
         "http://en.wikipedia.org/wiki/Japan", 
         "http://en.wikipedia.org/wiki/China")
df <- data.frame(country, url)

bind_rows(lapply(url, function(x) {

  data.frame(url=x, toc_entry=toc <- html(url[1]) %>%
    html_nodes(".toctext") %>%
    html_text())

})) -> toc_entries

df <- toc_entries %>% left_join(df)

df[sample(nrow(df), 10),]

## Source: local data frame [10 x 3]
## 
##                                           url                            toc_entry country
## 1          http://en.wikipedia.org/wiki/Japan                   Government finance   Japan
## 2         http://en.wikipedia.org/wiki/Canada        Cold War and civil rights era      US
## 3  http://en.wikipedia.org/wiki/United_States                                 Food  Canada
## 4          http://en.wikipedia.org/wiki/Japan                               Sports   Japan
## 5         http://en.wikipedia.org/wiki/Canada                             Religion      US
## 6          http://en.wikipedia.org/wiki/China        Cold War and civil rights era   China
## 7          http://en.wikipedia.org/wiki/Japan Literature, philosophy, and the arts   Japan
## 8  http://en.wikipedia.org/wiki/United_States                           Population  Canada
## 9          http://en.wikipedia.org/wiki/Japan                          Settlements   Japan
## 10        http://en.wikipedia.org/wiki/Canada                             Military      US

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复