Question
I want to parse this HTML and get these elements from it:
a) the p tag with class "normal_encontrado".
b) the div with class "price".
Sometimes the p tag is not present for some products. In that case, an NA should be added to the vector collecting the text from these nodes.
The idea is to have 2 vectors of the same length and then join them into a data.frame. Any ideas?
The HTML part:
<html>
<head></head>
<body>
<div class="product_price" id="product_price_186251">
<p class="normal_encontrado">
S/. 2,799.00
</p>
<div id="WC_CatalogEntryDBThumbnailDisplayJSPF_10461_div_10" class="price">
S/. 2,299.00
</div>
</div>
<div class="product_price" id="product_price_232046">
<div id="WC_CatalogEntryDBThumbnailDisplayJSPF_10461_div_10" class="price">
S/. 4,999.00
</div>
</div>
</body>
</html>
R Code:
library(rvest)
page_source <- read_html("r.html")
r.precio.antes <- page_source %>%
html_nodes(".normal_encontrado") %>%
html_text()
r.precio.actual <- page_source %>%
html_nodes(".price") %>%
html_text()
Answer 1:
If the tag is not found, html_text() returns character(0). So, assuming you will find at most one current and one regular price in each div.product_price, you can use this:
pacman::p_load("rvest", "dplyr")
get_prices <- function(node){
  # html_text() gives character(0) when the selector matches nothing
  r.precio.antes <- html_nodes(node, 'p.normal_encontrado') %>% html_text
  r.precio.actual <- html_nodes(node, 'div.price') %>% html_text
  data.frame(
    precio.antes = ifelse(length(r.precio.antes)==0, NA, r.precio.antes),
    precio.actual = ifelse(length(r.precio.actual)==0, NA, r.precio.actual),
    stringsAsFactors=FALSE
  )
}
doc <- read_html('test.html') %>% html_nodes("div.product_price")
# combine the per-product rows into one data frame
lapply(doc, get_prices) %>%
  bind_rows
Edited: I misunderstood the input data, so changed the script to work with just a single html page.
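The character(0)-to-NA step can also be isolated in a small helper, which may be easier to read than the inline ifelse. This is only a sketch in base R; the name first_or_na is hypothetical and not part of rvest.

```r
# Hypothetical helper: return NA when a lookup produced no matches,
# otherwise the first match (mirrors what the inline ifelse() does).
first_or_na <- function(x) {
  if (length(x) == 0) NA_character_ else x[[1]]
}

no_match  <- first_or_na(character(0))     # selector matched nothing
one_match <- first_or_na("S/. 2,799.00")   # selector matched one node
```

Like the ifelse version, this keeps only the first match, so it relies on the same assumption of at most one price of each kind per product.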
Answer 2:
Using the XML package, parse the input with xmlTreeParse and then use xpathSApply to iterate over the div nodes of class product_price. For each such node, the anonymous function gets the values of the div and p subnodes. The resulting character matrix m is reworked into a data frame DF, and the columns are cleaned by removing any character that is not a dot or digit and also removing any dot followed by a non-digit. The result is then converted to numeric. Note that no special processing is needed for the missing p case.
# input
Lines <- '<html>
<head></head>
<body>
<div class="product_price" id="product_price_186251">
<p class="normal_encontrado">
S/. 2,799.00
</p>
<div id="WC_CatalogEntryDBThumbnailDisplayJSPF_10461_div_10" class="price">
S/. 2,299.00
</div>
</div>
<div class="product_price" id="product_price_232046">
<div id="WC_CatalogEntryDBThumbnailDisplayJSPF_10461_div_10" class="price">
S/. 4,999.00
</div>
</div>
</body>
</html>'
# code to read input and produce a data.frame
library(XML)
doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)
m <- xpathSApply(doc, "//div[@class = 'product_price']", function(node) {
list(p = xmlValue(node[["p"]]), div = xmlValue(node[["div"]])) })
DF <- as.data.frame(t(m), stringsAsFactors = FALSE) # rework into data frame
DF[] <- lapply(DF, function(x) as.numeric(gsub("[^.0-9]|[.]\\D", "", x))) # clean
The result is:
> DF
p div
1 2799 2299
2 NA 4999
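To see what the cleaning pattern does on its own, here is a small base R sketch applied to strings shaped like the scraped prices. The sample vector raw is made up for illustration.

```r
# Sample values shaped like the scraped text (illustrative only)
raw <- c("\n  S/. 2,799.00\n ", NA, "\n  S/. 4,999.00\n ")

# Drop every character that is not a dot or digit, and drop any dot
# that is followed by a non-digit (the dot in "S/. "), then convert.
clean <- as.numeric(gsub("[^.0-9]|[.]\\D", "", raw))
clean
# [1] 2799   NA 4999
```

The NA passes through gsub and as.numeric unchanged, which is why the missing price needs no separate handling at this stage.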
Answer 3:
Go one level up from your target and lapply over each parent element:
library(xml2)
library(rvest)
pg <- read_html('<html>
<head></head>
<body>
<div class="product_price" id="product_price_186251">
<p class="normal_encontrado">
S/. 2,799.00
</p>
<div id="WC_CatalogEntryDBThumbnailDisplayJSPF_10461_div_10" class="price">
S/. 2,299.00
</div>
</div>
<div class="product_price" id="product_price_232046">
<div id="WC_CatalogEntryDBThumbnailDisplayJSPF_10461_div_10" class="price">
S/. 4,999.00
</div>
</div>
</body>
</html>')
prod <- html_nodes(pg, "div.product_price")
do.call(rbind, lapply(prod, function(x) {
  # fall back to NA if looking up the node raises an error
  norm <- tryCatch(xml_text(html_node(x, "p.normal_encontrado")),
                   error=function(err) {NA})
  price <- tryCatch(xml_text(html_node(x, "div.price")),
                    error=function(err) {NA})
  data.frame(norm, price, stringsAsFactors=FALSE)
}))
## norm price
## 1 \n S/. 2,799.00\n \n S/. 2,299.00\n
## 2 <NA> \n S/. 4,999.00\n
I have no idea if you wanted the strings trimmed or anything else done, but those machinations are pretty easy.
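For example, trimming the surrounding whitespace could be sketched in base R as below. The data frame res is rebuilt by hand here to stand in for the result printed above.

```r
# Stand-in for the data frame printed above (values typed in by hand)
res <- data.frame(
  norm  = c("\n    S/. 2,799.00\n   ", NA),
  price = c("\n    S/. 2,299.00\n   ", "\n    S/. 4,999.00\n   "),
  stringsAsFactors = FALSE
)

# trimws() strips leading/trailing spaces, tabs and newlines; NA stays NA
res[] <- lapply(res, trimws)
```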
Answer 4:
It may not be the most idiomatic way to do this, but you can use lapply over the .product_price nodes like this:
r.precio.antes <- page_source %>% html_nodes(".product_price") %>%
lapply(. %>% html_nodes(".normal_encontrado") %>% html_text() %>%
ifelse(identical(., character(0)), NA, .)) %>% unlist
This will return NA whenever the .normal_encontrado element is not found.
r.precio.antes
# [1] "\n S/. 2,799.00\n "
# [2] NA
length(r.precio.antes) # 2
If I wanted to expand the code to make it clearer, I would first isolate the .product_price nodes:
product_nodes <- page_source %>% html_nodes(".product_price")
Then I could use lapply in a more traditional way:
r.precio.antes <- lapply(product_nodes, function(pn) {
pn %>% html_nodes(".normal_encontrado") %>% html_text()
})
r.precio.antes <- unlist(r.precio.antes)
Instead, I'm using the magrittr syntax for lapply; see e.g. the end of the Functional sequences paragraph here.
One final hurdle is that if the element is not found, this will return character(0) rather than NA like you wanted. So I'm adding ifelse(identical(., character(0)), NA, .) to the pipe inside the lapply to fix that.
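Putting the pieces together for both columns, a minimal sketch might look like the following. It assumes rvest is installed; the helper name text_or_na and the condensed inline HTML are illustrative, and the trimws call is an extra that the question did not ask for.

```r
library(rvest)

# Condensed copy of the question's HTML, inlined so the sketch is self-contained
page_source <- read_html('<html><body>
  <div class="product_price">
    <p class="normal_encontrado">S/. 2,799.00</p>
    <div class="price">S/. 2,299.00</div>
  </div>
  <div class="product_price">
    <div class="price">S/. 4,999.00</div>
  </div>
</body></html>')

# Hypothetical helper: for each product node, the trimmed text of the first
# element matching `css`, or NA when nothing matches.
text_or_na <- function(nodes, css) {
  vapply(nodes, function(node) {
    txt <- html_text(html_nodes(node, css))
    if (length(txt) == 0) NA_character_ else trimws(txt[[1]])
  }, character(1))
}

product_nodes <- html_nodes(page_source, ".product_price")

# Both vectors have one entry per product, so they line up in a data.frame
precios <- data.frame(
  precio.antes  = text_or_na(product_nodes, ".normal_encontrado"),
  precio.actual = text_or_na(product_nodes, ".price"),
  stringsAsFactors = FALSE
)
```

Because each vector is built by iterating over the same product nodes, a missing p tag yields an NA in place rather than a shorter vector.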
Source: https://stackoverflow.com/questions/33250826/scraping-with-rvest-complete-with-nas-when-tag-is-not-present