Convert HTML entity to proper character in R

挽巷 2021-01-24 06:45

Does anyone know of a generic function in R that can convert an HTML entity such as &#228; to its Unicode character ä? I have seen some functions that take in â

1 Answer
  • 2021-01-24 07:20

    Here's one way via the XML package:

    txt <- "wine/name: 2003 Karth&#228;userhof Eitelsbacher Karth&#228;userhofberg Riesling Kabinett"
    
    library("XML")
    xmlValue(getNodeSet(htmlParse(txt, asText = TRUE), "//p")[[1]])
    
    > xmlValue(getNodeSet(htmlParse(txt, asText = TRUE), "//p")[[1]])
    [1] "wine/name: 2003 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Kabinett"
    

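    The "//p" XPath works because htmlParse() wraps the bare text in a <p> element when it builds the HTML document. The [[1]] is needed because getNodeSet() returns a list of matching nodes, even when there is only one, as is the case here.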

    This was taken/modified from a reply to the R-Help list by Henrique Dallazuanna in 2010.

    If you want to run this on a character vector of length > 1, wrap the call in a function and lapply() it over the elements:

    txt <- rep(txt, 2)
    decode <- function(x) {
      xmlValue(getNodeSet(htmlParse(x, asText = TRUE), "//p")[[1]])
    }
    lapply(txt, decode)
    

    Or, if you want the result as a character vector rather than a list, use vapply():

    > vapply(txt, decode, character(1), USE.NAMES = FALSE)
    [1] "wine/name: 2003 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Kabinett"
    [2] "wine/name: 2003 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Kabinett"
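
    If the XML package is not available and the input only contains decimal numeric references such as &#228;, a base-R sketch along the following lines may be enough. This is not the approach used above: decode_numeric_entities is a hypothetical helper name, and it deliberately ignores named entities (&auml;) and hexadecimal references (&#xE4;).

```r
# Hypothetical helper (not part of any package): decodes decimal numeric
# character references of the form "&#<digits>;" and nothing else.
decode_numeric_entities <- function(x) {
  # Locate every "&#<digits>;" reference in each element of x.
  m <- gregexpr("&#[0-9]+;", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(refs) {
    # Keep only the digits to get the code point, then convert it
    # to the corresponding character.
    codes <- as.integer(gsub("[^0-9]", "", refs))
    vapply(codes, intToUtf8, character(1))
  })
  x
}

txt <- "2003 Karth&#228;userhof Riesling"
decoded <- decode_numeric_entities(c(txt, "no entities here"))
decoded
```

    Because gregexpr() and regmatches() are vectorised, no lapply() over the input is needed here; elements without any reference are returned unchanged.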
    

    For a multi-line string, use the original, non-vectorised call. The result is a single string with embedded newlines, so you have to write it back out to a file if you want it as a multi-line document again:

    txt <- "wine/name: 2001 Karth&#228;userhof Eitelsbacher Karth&#228;userhofberg 
    Riesling Sp&#228;tlese
    wine/wineId: 3058
    wine/variant: Riesling
    wine/year: 2001
    review/points: N/A
    review/time: 1095120000
    review/userId: 1
    review/userName: Eric
    review/text: Hideously corked!"
    
    out <- xmlValue(getNodeSet(htmlParse(txt, asText = TRUE), "//p")[[1]])
    

    This gives me:

    > out
    [1] "wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg \nRiesling Spätlese\nwine/wineId: 3058\nwine/variant: Riesling\nwine/year: 2001\nreview/points: N/A\nreview/time: 1095120000\nreview/userId: 1\nreview/userName: Eric\nreview/text: Hideously corked!"
    

    Write that out using writeLines():

    writeLines(out, "wines.txt")
    

    You'll get a text file, which can be read in again using your other parsing code:

    > readLines("wines.txt")
     [1] "wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg "
     [2] "Riesling Spätlese"                                            
     [3] "wine/wineId: 3058"                                            
     [4] "wine/variant: Riesling"                                       
     [5] "wine/year: 2001"                                              
     [6] "review/points: N/A"                                           
     [7] "review/time: 1095120000"                                      
     [8] "review/userId: 1"                                             
     [9] "review/userName: Eric"                                        
    [10] "review/text: Hideously corked!"
    

    And it really is a plain text file on disk (from my Bash terminal):

    $ cat wines.txt 
    wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg 
    Riesling Spätlese
    wine/wineId: 3058
    wine/variant: Riesling
    wine/year: 2001
    review/points: N/A
    review/time: 1095120000
    review/userId: 1
    review/userName: Eric
    review/text: Hideously corked!
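
    If all you need is one element per line inside R, the round trip through a file can probably be skipped. A minimal sketch, assuming the decoded result is a single string with "\n" separators like out above (a shortened stand-in is used here so the snippet runs on its own):

```r
# A shortened stand-in for the decoded string `out` shown earlier.
out <- "wine/name: 2001 Karth\u00e4userhof \nRiesling Sp\u00e4tlese\nwine/wineId: 3058"

# Split on literal newlines. strsplit() returns a list with one entry
# per input string, hence the [[1]].
lines <- strsplit(out, "\n", fixed = TRUE)[[1]]
lines
```

    This yields the same kind of character vector that readLines() returns, without touching the disk.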
    