Scraping with rvest - complete with NAs when tag is not present

清酒与你 2020-11-30 12:20

I want to parse this HTML and get these elements from it:

a) p tag with class "normal_encontrado".
b) div with c

4 answers
  • 2020-11-30 12:25

    Go one level up from your target and lapply over each parent element:

    library(xml2)
    library(rvest)
    
    pg <- read_html('<html>
    <head></head>
    <body>
    
    <div class="product_price" id="product_price_186251">
      <p class="normal_encontrado">
        S/. 2,799.00
      </p>
    
      <div id="WC_CatalogEntryDBThumbnailDisplayJSPF_10461_div_10" class="price">
        S/. 2,299.00
      </div>    
    </div>
    
    <div class="product_price" id="product_price_232046">
      <div id="WC_CatalogEntryDBThumbnailDisplayJSPF_10461_div_10" class="price">
        S/. 4,999.00
      </div>
    </div>
    </body>
    </html>')
    
    prod <- html_nodes(pg, "div.product_price")
    do.call(rbind, lapply(prod, function(x) {
      # tryCatch() turns a failed lookup into NA instead of stopping the loop
      norm <- tryCatch(xml_text(html_node(x, "p.normal_encontrado")),
                       error=function(err) {NA})
      price <- tryCatch(xml_text(html_node(x, "div.price")),
                        error=function(err) {NA})
      data.frame(norm, price, stringsAsFactors=FALSE)
    }))
    
    ##                     norm                  price
    ## 1 \n    S/. 2,799.00\n   \n    S/. 2,299.00\n  
    ## 2                   <NA> \n    S/. 4,999.00\n  
    

    I have no idea if you wanted the strings trimmed or anything else done, but those machinations are pretty easy.
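
    As a rough illustration of that clean-up step, here is a minimal base R sketch. The input strings are copied from the printed output above; the helper name clean_price is hypothetical, and it assumes prices always use "," for thousands and "." for decimals.

```r
# Raw strings as printed by the scraper above; NA marks the missing <p> tag.
raw <- c("\n    S/. 2,799.00\n  ", NA, "\n    S/. 4,999.00\n  ")

# trimws() strips leading/trailing whitespace (including newlines); NA stays NA.
trimmed <- trimws(raw)

# Hypothetical helper: drop the "S/. " prefix and the thousands separator,
# then convert to numeric. Assumes "," = thousands and "." = decimals.
clean_price <- function(x) {
  x <- sub("^S/\\.\\s*", "", trimws(x))
  as.numeric(gsub(",", "", x))
}

prices <- clean_price(raw)
print(trimmed)
print(prices)
```

    Missing values pass through both steps untouched, so the NA for the absent tag survives the conversion.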

  • 2020-11-30 12:26

    Using the XML package, parse the input with xmlTreeParse and then use xpathSApply to iterate over the div nodes of class product_price. For each such node, the anonymous function gets the values of the div and p subnodes. The resulting character matrix m is reworked into a data frame DF, and the columns are cleaned by removing any character that is not a dot or digit and also removing any dot followed by a non-digit. The result is then converted to numeric. Note that no special processing for the missing p case is needed.

    # input
    
    Lines <- '<html>
    <head></head>
    <body>
    
    <div class="product_price" id="product_price_186251">
      <p class="normal_encontrado">
        S/. 2,799.00
      </p>
    
      <div id="WC_CatalogEntryDBThumbnailDisplayJSPF_10461_div_10" class="price">
        S/. 2,299.00
      </div>    
    </div>
    
    <div class="product_price" id="product_price_232046">
      <div id="WC_CatalogEntryDBThumbnailDisplayJSPF_10461_div_10" class="price">
        S/. 4,999.00
      </div>
    </div>
    </body>
    </html>'
    
    # code to read input and produce a data.frame
    
    library(XML)
    doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)
    
    m <- xpathSApply(doc, "//div[@class = 'product_price']", function(node) {
      list(p = xmlValue(node[["p"]]), div = xmlValue(node[["div"]])) })
    
    DF <- as.data.frame(t(m), stringsAsFactors = FALSE) # rework into data frame
    DF[] <- lapply(DF, function(x) as.numeric(gsub("[^.0-9]|[.]\\D", "", x))) # clean
    

    The result is:

    > DF
         p  div
    1 2799 2299
    2   NA 4999
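
    To see what the cleaning regex does, here is a small base R sketch that applies it to one of the raw strings. The two-step variant is only an illustration (the names no_stray_dot and two_step are made up for it) and is meant to show the two alternatives of the pattern separately for this input, not to replace the one-liner.

```r
x <- "\n    S/. 2,799.00\n  "

# Single pass, as in the answer: remove every character that is not a digit
# or a dot, plus any dot followed by a non-digit (the dot in "S/. ").
one_pass <- gsub("[^.0-9]|[.]\\D", "", x)

# The same thing in two steps, for illustration only.
no_stray_dot <- gsub("[.]\\D", "", x)           # drops ". " after "S/"
two_step <- gsub("[^.0-9]", "", no_stray_dot)   # drops the remaining non-digit, non-dot characters

print(one_pass)
print(as.numeric(one_pass))
```

    The decimal point is kept because it is followed by a digit, so only the dot in the currency prefix is removed.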
    
  • It may not be the most idiomatic way to do this, but you can use lapply over the .product_price nodes like this:

    r.precio.antes <- page_source %>% html_nodes(".product_price") %>%
      lapply(. %>% html_nodes(".normal_encontrado") %>% html_text() %>% 
         ifelse(identical(., character(0)), NA, .)) %>% unlist
    

    This will return NA whenever the .normal_encontrado element is not found.

    r.precio.antes
    # [1] "\n                    S/. 2,799.00\n                "
    # [2] NA  
    
    length(r.precio.antes) # 2
    

    To expand the code and make it clearer, I would first isolate the .product_price nodes:

    product_nodes <- page_source %>% html_nodes(".product_price")
    

    Then I could use lapply in a more traditional way:

    r.precio.antes <- lapply(product_nodes, function(pn) {
      pn %>% html_nodes(".normal_encontrado") %>% html_text()
    })
    r.precio.antes <- unlist(r.precio.antes)
    

    Instead, I'm using the magrittr functional-sequence syntax inside lapply; see the end of the Functional sequences section of the magrittr documentation.

    One final hurdle is that if the element is not found, this will return character(0) rather than NA like you wanted, and unlist would silently drop it. So I'm adding ifelse(identical(., character(0)), NA, .) to the pipe inside the lapply to fix that.
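
    The effect of that replacement can be sketched in base R without rvest. Here per_node is a stand-in for the per-node html_text() results, and first_or_na is a hypothetical helper (not part of rvest or magrittr) that plays the role of the ifelse call.

```r
# Hypothetical helper: first element of x, or NA if x is empty.
first_or_na <- function(x) if (length(x) == 0) NA_character_ else x[[1]]

# Stand-in for per-node html_text() results: the second node had no match.
per_node <- list("S/. 2,799.00", character(0))

dropped <- unlist(per_node)                            # empty result disappears
filled <- vapply(per_node, first_or_na, character(1))  # gap is kept as NA

print(length(dropped))
print(filled)
```

    Keeping one value per parent node is what lets the result line up with other columns scraped from the same nodes.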

  • 2020-11-30 12:40

    If the tag is not found, rvest returns a character(0). So assuming you will find at most one current and one regular price in each div.product_price, you can use this:

    pacman::p_load("rvest", "dplyr")
    
    get_prices <- function(node){
      r.precio.antes <- html_nodes(node, 'p.normal_encontrado') %>% html_text
      r.precio.actual <- html_nodes(node, 'div.price') %>% html_text
    
      data.frame(
        precio.antes = ifelse(length(r.precio.antes)==0, NA, r.precio.antes),
        precio.actual = ifelse(length(r.precio.actual)==0, NA, r.precio.actual), 
        stringsAsFactors=FALSE
      )
    
    }
    
    doc <- read_html('test.html') %>% html_nodes("div.product_price")
    # bind_rows() replaces rbind_all(), which is no longer available in current dplyr
    lapply(doc, get_prices) %>%
      bind_rows
    

    Edited: I misunderstood the input data, so changed the script to work with just a single html page.
