Question
Is there a way to add whitespace to each element that contains text? For this example:
movie <- read_html("http://www.imdb.com/title/tt1490017/")
cast <- html_nodes(movie, "#titleCast span.itemprop")
cast %>% html_structure()
[[1]]
<span.itemprop [itemprop]>
{text}
[[2]]
<span.itemprop [itemprop]>
{text}
I would like to add trailing whitespace to each text element before calling html_text(). I have another use case where I want to call html_text() higher up in the document hierarchy. The result is that several texts are combined into a single vector element, which makes it impossible to tell where each part starts and ends.
Answer 1:
Do you mean something like this?
doc <- minimal_html("Hello<p>World</p>")
doc %>% html_text # HelloWorld
doc %>% html_text_collapse(" ") # Hello World
If so, here is the code:
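library(stringi)
library(rvest)
# Collect every non-blank text node below x and join them with `collapse`.
html_text_collapse <- function(x, collapse = " ", trim = TRUE) {
  # normalize-space() drops text nodes that contain only whitespace
  text <- html_text(html_nodes(x, xpath = ".//text()[normalize-space()]"))
  if (trim) {
    # strip leading and trailing whitespace from each piece
    text <- stri_trim_both(text)
  }
  paste(text, collapse = collapse)
}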
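To connect this back to the question's use case (calling html_text() on a parent node and losing the boundaries between children), here is a minimal sketch. The HTML string, the `#cast` id and the names Alice and Bob are made up for illustration and do not come from the IMDb page. The helper is restated so the snippet runs on its own, with base R's trimws() swapped in for stri_trim_both() so that rvest is the only dependency (an assumption of this sketch, not part of the answer above).

```r
library(rvest)

# Same idea as the helper above; trimws() is a base R substitute for
# stringi::stri_trim_both() used here only to avoid the extra dependency.
html_text_collapse <- function(x, collapse = " ", trim = TRUE) {
  text <- html_text(html_nodes(x, xpath = ".//text()[normalize-space()]"))
  if (trim) {
    text <- trimws(text)
  }
  paste(text, collapse = collapse)
}

# Hypothetical stand-in for the cast list: two adjacent spans, no whitespace between them.
doc <- read_html('<div id="cast"><span class="itemprop">Alice</span><span class="itemprop">Bob</span></div>')
cast <- html_node(doc, "#cast")

plain <- html_text(cast)                     # "AliceBob": the boundary is lost
separated <- html_text_collapse(cast, " | ") # "Alice | Bob": the boundary is kept
print(plain)
print(separated)
```

Choosing a separator that cannot occur in the text itself (for example " | " or "\n") makes it possible to split the result again with strsplit() if the individual parts are needed later.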
Source: https://stackoverflow.com/questions/42003932/adding-whitespace-to-text-elements