Convert HTML Character Entity Encoding in R

情话喂你

Is there a way in R to convert HTML Character Entity Encodings?

I would like to convert HTML character entities like &amp;amp; to &amp;.

3 Answers

攒了一身酷

2020-12-05 03:46

Unescape XML/HTML values using the xml2 package:

```
unescape_xml <- function(str) {
  xml2::xml_text(xml2::read_xml(paste0("<x>", str, "</x>")))
}

unescape_html <- function(str) {
  xml2::xml_text(xml2::read_html(paste0("<x>", str, "</x>")))
}
```

Examples:

```
unescape_xml("3 &lt; x &amp; x &gt; 9")
## [1] "3 < x & x > 9"
unescape_html("&euro; 2.99")
## [1] "€ 2.99"
```


情话喂你

2020-12-05 03:47
While the solution by Jeroen does the job, it has the disadvantage that it is not vectorised and is therefore slow when applied to a large number of strings. It only works on a character vector of length one, so a longer character vector has to be processed with sapply().

To demonstrate this, I first create a large character vector:
```
set.seed(123)
strings <- c("abcd", "&amp; &apos; &gt;", "&amp;", "&euro; &lt;")
many_strings <- sample(strings, 10000, replace = TRUE)
```
And apply the function:
```
unescape_html <- function(str) {
  xml2::xml_text(xml2::read_html(paste0("<x>", str, "</x>")))
}

system.time(res <- sapply(many_strings, unescape_html, USE.NAMES = FALSE))
##    user  system elapsed 
##   2.327   0.000   2.326 
head(res)
## [1] "& ' >" "€ <"   "& ' >" "€ <"   "€ <"   "abcd" 
```
It is much faster if all the strings in the character vector are combined into a single, large string, such that read_html() and xml_text() need only be used once. The strings can then easily be separated again using strsplit():
```
unescape_html2 <- function(str){
  html <- paste0("<x>", paste0(str, collapse = "#_|"), "</x>")
  parsed <- xml2::xml_text(xml2::read_html(html))
  strsplit(parsed, "#_|", fixed = TRUE)[[1]]
}

system.time(res2 <- unescape_html2(many_strings))
##    user  system elapsed 
##   0.011   0.000   0.010 
identical(res, res2)
## [1] TRUE
```
Of course, you need to make sure that the separator used to join the strings in str ("#_|" in my example) does not appear anywhere in str. Otherwise, the result will be wrong when the large string is split again at the end.
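One way to guard against a separator collision is to check the input before joining. The following is only a minimal sketch: unescape_html_checked() and its sep argument are hypothetical names that are not part of xml2, and it assumes the input contains no NA values. It also only inspects the raw input, so a separator that appears only after entity decoding would still slip through.

```
# Hypothetical variant of unescape_html2(): stops early if the separator
# already occurs in the (still encoded) input strings.
unescape_html_checked <- function(str, sep = "#_|") {
  # Assumption: str is a character vector without NA values.
  if (any(grepl(sep, str, fixed = TRUE))) {
    stop("separator occurs in the input; choose a different `sep`")
  }
  # Join everything into one document so the parser runs only once.
  html <- paste0("<x>", paste0(str, collapse = sep), "</x>")
  parsed <- xml2::xml_text(xml2::read_html(html))
  # Split the decoded text back into one element per input string.
  strsplit(parsed, sep, fixed = TRUE)[[1]]
}

unescape_html_checked(c("&amp;", "abcd"))
```

If the check fails, calling the function again with a different separator, for example sep = "@@SEP@@", should be enough in most cases.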

感动是毒

2020-12-05 04:01

Update: this answer is outdated. Please check the answer based on the newer xml2 package.

Try something along the lines of:

```
# load XML package
library(XML)

# Convenience function to convert html codes
html2txt <- function(str) {
  xpathApply(htmlParse(str, asText = TRUE),
             "//body//text()",
             xmlValue)[[1]]
}

# html encoded string
( x <- paste("i", "s", "n", "&", "a", "p", "o", "s", ";", "t", sep = "") )
## [1] "isn&apos;t"

# converted string
html2txt(x)
## [1] "isn't"
```

UPDATE: Edited the html2txt() function so it applies to more situations
