Convert HTML entities to proper characters in R

拈花ヽ惹草 submitted on 2019-12-02 03:06:28

Here's one way via the XML package:

txt <- "wine/name: 2003 Karth&#228;userhof Eitelsbacher Karth&#228;userhofberg Riesling Kabinett"

library("XML")
xmlValue(getNodeSet(htmlParse(txt, asText = TRUE), "//p")[[1]])

> xmlValue(getNodeSet(htmlParse(txt, asText = TRUE), "//p")[[1]])
[1] "wine/name: 2003 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Kabinett"

The [[1]] is needed because getNodeSet() returns a list of matching nodes, even when there is only one match, as is the case here.
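To make the role of each call clearer, the one-liner can be unrolled into separate steps. This is only a sketch of the same approach; the intermediate names doc, nodes, decoded and decoded_alt are mine, and it assumes the XML package is installed. The last line uses xpathSApply(), also from XML, as an alternative that applies xmlValue() to every match instead of indexing with [[1]].

```r
library(XML)  # assumption: the XML package is installed

txt <- "wine/name: 2003 Karth&#228;userhof Eitelsbacher Karth&#228;userhofberg Riesling Kabinett"

# Step 1: parse the string as HTML. The "//p" query used in the answer
# relies on the parser wrapping the bare text in a <p> node.
doc <- htmlParse(txt, asText = TRUE)

# Step 2: select all <p> nodes; this is a list, even for a single match.
nodes <- getNodeSet(doc, "//p")

# Step 3: extract the text of the first node; entities are decoded here.
decoded <- xmlValue(nodes[[1]])

# Alternative: apply xmlValue() to every match and simplify to a vector.
decoded_alt <- xpathSApply(doc, "//p", xmlValue)
```

Inspecting length(nodes) before indexing is a cheap way to check that the parser really produced the single node you expect.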

This was taken/modified from a reply to the R-Help list by Henrique Dallazuanna in 2010.

To run this on a character vector of length > 1, wrap it in a function and lapply() over the vector:

txt <- rep(txt, 2)
decode <- function(x) {
  xmlValue(getNodeSet(htmlParse(x, asText = TRUE), "//p")[[1]])
}
lapply(txt, decode)

or if you want it as a vector, vapply():

> vapply(txt, decode, character(1), USE.NAMES = FALSE)
[1] "wine/name: 2003 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Kabinett"
[2] "wine/name: 2003 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Kabinett"
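Real data often contains missing or empty strings. The following is a hedged sketch, not part of the original answer: decode_all() is a hypothetical wrapper that passes NA and "" through untouched and only hands the remaining elements to decode(). I am assuming that is the behaviour you want for such values.

```r
library(XML)  # assumption: the XML package is installed

# decode() as defined in the answer
decode <- function(x) {
  xmlValue(getNodeSet(htmlParse(x, asText = TRUE), "//p")[[1]])
}

# Hypothetical helper: decode only non-missing, non-empty strings.
decode_all <- function(x) {
  out <- as.character(x)
  ok <- !is.na(out) & nzchar(out)   # elements worth sending to the parser
  out[ok] <- vapply(out[ok], decode, character(1), USE.NAMES = FALSE)
  out
}

res <- decode_all(c("Sp&#228;tlese", NA, ""))
```

The result has the same length and order as the input, which makes it safe to assign back into a data frame column.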

For the multi-line example, use the original (non-vectorised) version. The result is a single string with embedded newlines, so you have to write it back out to a file if you want it as a multi-line document again:

txt <- "wine/name: 2001 Karth&#228;userhof Eitelsbacher Karth&#228;userhofberg 
Riesling Sp&#228;tlese
wine/wineId: 3058
wine/variant: Riesling
wine/year: 2001
review/points: N/A
review/time: 1095120000
review/userId: 1
review/userName: Eric
review/text: Hideously corked!"

out <- xmlValue(getNodeSet(htmlParse(txt, asText = TRUE), "//p")[[1]])

This gives me

> out
[1] "wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg \nRiesling Spätlese\nwine/wineId: 3058\nwine/variant: Riesling\nwine/year: 2001\nreview/points: N/A\nreview/time: 1095120000\nreview/userId: 1\nreview/userName: Eric\nreview/text: Hideously corked!"

You can write this out using writeLines():

writeLines(out, "wines.txt")

You'll get a text file, which can be read in again using your other parsing code:

> readLines("wines.txt")
 [1] "wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg "
 [2] "Riesling Spätlese"                                            
 [3] "wine/wineId: 3058"                                            
 [4] "wine/variant: Riesling"                                       
 [5] "wine/year: 2001"                                              
 [6] "review/points: N/A"                                           
 [7] "review/time: 1095120000"                                      
 [8] "review/userId: 1"                                             
 [9] "review/userName: Eric"                                        
[10] "review/text: Hideously corked!"

And it really is a multi-line file on disk (from my Bash terminal):

$ cat wines.txt 
wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg 
Riesling Spätlese
wine/wineId: 3058
wine/variant: Riesling
wine/year: 2001
review/points: N/A
review/time: 1095120000
review/userId: 1
review/userName: Eric
review/text: Hideously corked!
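If the file has to be portable across locales, it may help to be explicit about the encoding when writing and reading. The sketch below is my own addition: it uses a temporary file instead of wines.txt and a shortened stand-in for out, forces UTF-8 on output with enc2utf8() and useBytes = TRUE, and declares UTF-8 again when reading.

```r
# Shortened stand-in for the decoded string `out` from above
out <- "wine/name: 2001 Karth\u00e4userhofberg \nRiesling Sp\u00e4tlese\nwine/wineId: 3058"

path <- tempfile(fileext = ".txt")  # temporary file, not wines.txt

# Write the UTF-8 bytes as-is, without re-encoding to the native locale.
writeLines(enc2utf8(out), path, useBytes = TRUE)

# Read back, marking the lines as UTF-8.
lines <- readLines(path, encoding = "UTF-8")

# Joining the lines with "\n" should reproduce the original string.
roundtrip <- paste(lines, collapse = "\n")
```

Each embedded newline becomes a line break in the file, so lines has one element per line of the original record.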