Question
I am trying to scrape data from Yelp. One step is to extract the links for each restaurant. For example, I search for restaurants in NYC and get some results. Then I want to extract the links of all 10 restaurants Yelp recommends on page 1. Here is what I have tried:
library(rvest)
page=read_html("http://www.yelp.com/search?find_loc=New+York,+NY,+USA")
page %>% html_nodes(".biz-name span") %>% html_attr('href')
But the code always returns NA. Can anyone help me with that? Thanks!
Answer 1:
library(rvest)
page <- read_html("http://www.yelp.com/search?find_loc=New+York,+NY,+USA")
page %>% html_nodes(".biz-name") %>% html_attr('href')
The href attribute lives on the .biz-name anchor itself, not on the nested span, so dropping span from the selector returns the links. Hope this solves your problem.
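To see why the original selector returned NA without hitting the live site, here is a minimal offline sketch using a made-up HTML snippet that mimics Yelp's old markup (the class name and URL are assumptions for illustration):

```r
library(rvest)

# Hypothetical fragment: href sits on the <a>, not on its child <span>
snippet <- read_html(
  '<a class="biz-name" href="/biz/some-restaurant"><span>Some Restaurant</span></a>'
)

# Selecting the nested <span> finds nodes with no href attribute -> NA
snippet %>% html_nodes(".biz-name span") %>% html_attr("href")

# Selecting the anchor itself returns the link
snippet %>% html_nodes(".biz-name") %>% html_attr("href")
```

html_attr() returns NA for any matched node that lacks the requested attribute, which is exactly what happened with the span selector.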
Answer 2:
I was also able to clean up the results from above, which for me were quite noisy:
links <- page %>% html_nodes("a") %>% html_attr("href")
with simple regex string matching:
links <- links[which(regexpr('common-url-element', links) >= 1)]
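As a sketch of how that filtering behaves, here is the same idea applied to a small made-up vector of links (the paths are hypothetical; grepl() is the idiomatic equivalent of the regexpr(...) >= 1 test above):

```r
# Hypothetical scraped hrefs: restaurant pages mixed with ad/redirect noise
links <- c("/biz/some-restaurant", "/adredir?src=abc", "/biz/another-spot")

# Keep only entries whose URL contains the common element, e.g. "/biz/"
links[grepl("/biz/", links)]
# leaves "/biz/some-restaurant" and "/biz/another-spot"
```

Substitute whatever string is common to the URLs you actually want for the pattern.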
Source: https://stackoverflow.com/questions/35247033/using-rvest-to-extract-links