How to extract text from several “div class” elements (HTML) using R?

别那么骄傲 2021-01-29 01:13

My goal is to extract info from this html page to create a database: https://drive.google.com/folderview?id=0B0aGd85uKFDyOS1XTTc2QnNjRmc&usp=sharing

One of the varia

1 answer
  • 2021-01-29 02:15

    There are lots of ways to do this, and depending on how consistent the HTML is, one may be better than another. A reasonably simple strategy that works in this case, though:

    library(rvest)
    
    page <- read_html('page.html')
    
    # find all nodes with a class of "listing_row_price"
    listings <- html_nodes(page, css = '.listing_row_price')
    
    # for each listing, if it has two children get the text of the first, else return NA
    prices <- vapply(listings, function(x) {
      children <- html_children(x)
      if (length(children) == 2) html_text(children[1]) else NA_character_
    }, character(1))
    # replace everything that's not a number with nothing, and turn it into an integer
    prices <- as.integer(gsub('[^0-9]', '', prices))
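To see the strategy end to end without the original file, here is a minimal sketch run against a small inline HTML snippet. The markup, the sample prices, and the helper name `extract_price` are assumptions made up for illustration; the real page may be structured differently, so the "exactly two children" rule may need adjusting.

```r
library(rvest)

# Hypothetical markup: each listing's price block has either two children
# (price + unit) or one child (no numeric price available).
html <- '
<div class="listing_row_price"><span>$1,200</span><span>/month</span></div>
<div class="listing_row_price"><span>Price on request</span></div>
<div class="listing_row_price"><span>$850</span><span>/month</span></div>
'

page <- read_html(html)
listings <- html_nodes(page, css = ".listing_row_price")

# Hypothetical helper: return the first child's text when the node has
# exactly two children, otherwise NA.
extract_price <- function(node) {
  children <- html_children(node)
  if (length(children) == 2) html_text(children[1]) else NA_character_
}

raw_prices <- vapply(listings, extract_price, character(1))

# Strip everything that is not a digit, then convert to integer.
prices <- as.integer(gsub("[^0-9]", "", raw_prices))
print(prices)
```

The listing without a second child comes back as `NA`, which keeps the result aligned with the listings so it can be added as a column in a data frame alongside other fields.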
    