How to get rid of the error while scraping the web in R?

谎友^ 2021-01-28 21:21

I'm scraping this website and get the error message "Tibble columns must have compatible sizes". What should I do in this case?

library(rvest)
library(tidyverse)
1 Answer

2021-01-28 22:13

    The problem is that not every restaurant has a complete record. In this example, the 13th item on the list did not include a price, so the price vector ended up with 14 items while the place vector had 15.
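
    The failure is easy to reproduce with two vectors of different lengths (a minimal sketch, not the scraped data):

    library(tibble)
    place <- rep("place", 15)  #15 place names were scraped
    price <- rep("$$", 14)     #but only 14 prices
    tibble(place, price)
    #> Error: Tibble columns must have compatible sizes.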

    One way to solve this is to find the common parent node for each search result and then parse those nodes with the html_node() function. Unlike html_nodes(), html_node() always returns exactly one value per parent node, even if that value is NA.

    library(rvest)
    library(dplyr)
    library(tibble)
    
    
    url <- "https://www.zomato.com/tr/toronto/drinks-and-nightlife?page=5"
    readpage <- function(url){
       #read the page once
       page <- read_html(url)
    
       #parse out the parent nodes
       results <- page %>% html_nodes("article.search-result")
    
       #retrieve the place and price from each parent
       place <- results %>% html_node("a.result-title.hover_feedback.zred.bold.ln24.fontsize0") %>%
          html_attr("title")
       price <- results %>% html_node("div.res-cost.clearfix span.col-s-11.col-m-12.pl0") %>%
          html_text()
    
       #return a tibble/data.frame
       tibble(url, place, price)
    }
    
    readpage(url)
    

    Also note that in your code example above, you were reading the same page multiple times. This is slow, puts an unnecessary load on the server, and could even be viewed as a "denial of service" attack.
    It is best to read the page once into memory and then work with that copy.
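
    For illustration, a sketch of the two patterns (the selectors here are abbreviated versions of the ones used above):

    #reading repeatedly - one HTTP request per field
    place <- read_html(url) %>% html_nodes("a.result-title") %>% html_attr("title")
    price <- read_html(url) %>% html_nodes("div.res-cost span") %>% html_text()

    #reading once - a single request; both fields are parsed from the in-memory copy
    page  <- read_html(url)
    place <- page %>% html_nodes("a.result-title") %>% html_attr("title")
    price <- page %>% html_nodes("div.res-cost span") %>% html_text()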

    Update
    To answer your question about multiple pages: wrap the function above in lapply() and then bind the returned list of data frames (or tibbles):

    dfs <- lapply(listofurls, readpage)
    finalanswer <- bind_rows(dfs)
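
    For example, if the listing pages differ only in their page query parameter, listofurls might be built like this (the page range 1:5 is an assumption):

    #hypothetical page range - adjust to the number of pages on the site
    listofurls <- paste0("https://www.zomato.com/tr/toronto/drinks-and-nightlife?page=", 1:5)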
    