R: Converting “special” letters into UTF-8?

忘了有多久 2021-01-12 18:21

I run into problems matching tables where one dataframe contains special characters and the other doesn't. Example: Doña Ana County vs. Dona Ana County

2 Answers
  • 2021-01-12 19:05

    The first problem is that acs::fips.place is badly mangled: it provides, e.g., \\xf1a (a literal backslash followed by xf1a) where it means \xf1a (the Latin-1 byte for ñ followed by a). A bug should be reported to the package maintainer. In the meantime, here is one workaround:

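    library(dplyr)    # as_tibble(), mutate(), filter(), %>%
    library(stringr)  # str_c()

    # Re-read the column through scan() so that escapes such as \xf1 are interpreted
    as_tibble(acs::fips.place) %>%
        mutate(COUNTY = scan(text = str_c(COUNTY, collapse = "\n"),
                             sep = "\n",
                             what = "character",
                             allowEscapes = TRUE)) -> fp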
    
    Encoding(fp$COUNTY) <- "latin1"
    
    fp %>%
        filter(COUNTY == "Doña Ana County")
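
    If you would rather not round-trip through scan(), the same clean-up can be sketched in base R. This is only a sketch under stated assumptions: unescape_hex is a hypothetical helper (not part of acs or any other package), and it assumes every literal \xNN sequence stands for a Latin-1 code point, which has the same number as its Unicode code point.

    ```r
    # Hypothetical helper: turn literal "\xNN" text into the character it stands for.
    # Assumes NN is a Latin-1 code point (same number as its Unicode code point).
    unescape_hex <- function(x) {
      x <- as.character(x)                      # also accepts factors
      m <- gregexpr("\\\\x[0-9A-Fa-f]{2}", x)   # a backslash, "x", two hex digits
      regmatches(x, m) <- lapply(regmatches(x, m), function(hits) {
        codes <- strtoi(substring(hits, 3), base = 16L)  # drop the leading "\x"
        vapply(codes, intToUtf8, character(1))           # code point -> UTF-8 character
      })
      x
    }

    mangled <- c("Do\\xf1a Ana County", "Bernalillo County")
    fixed <- unescape_hex(mangled)
    fixed[1] == "Do\u00f1a Ana County"   # TRUE ("\u00f1" is ñ)
    ```

    Because the result is already UTF-8, no Encoding() call is needed afterwards; strings without any escape pass through unchanged.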
    

    Once the escapes have been cleaned up, you can transliterate non-ASCII characters into ASCII substitutes. The stringi package makes it easy:

    library(stringi)
    fp$COUNTY <- stri_trans_general(fp$COUNTY, "latin-ascii")
    
    fp %>%
        filter(COUNTY == "Dona Ana County") 
    
  • 2021-01-12 19:15

    Use

     dplyr::as_tibble(acs::fips.place) %>% filter(COUNTY == "Do\\xf1a Ana County")
    

    What your dataset really contains is Do\\xf1a, that is, a literal backslash followed by xf1a rather than the single character ñ. You can check this in the R console, for instance:

    acs::fips.place[grep("Ana", acs::fips.place$COUNTY), ]
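
    To make the difference concrete without the package installed, here is a small sketch comparing the two spellings byte by byte. The variable names mangled and proper are only illustrative.

    ```r
    mangled <- "Do\\xf1a"    # what the data holds: D, o, backslash, x, f, 1, a
    proper  <- "Do\u00f1a"   # what was intended: D, o, ñ, a

    nchar(mangled)       # 7 characters
    nchar(proper)        # 4 characters
    charToRaw(mangled)   # 44 6f 5c 78 66 31 61 (5c is the backslash)
    grepl("\\x", mangled, fixed = TRUE)  # TRUE: a literal "\x" is present
    ```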
    

    The functions to use are iconv(x, from = "", to = ""), or enc2utf8 and enc2native, which don't take a "from" argument and rely on the declared Encoding() of the string instead. In most cases, to build a package you need to convert data to UTF-8 (I have to transcode all my French strings when building packages). Here I think the encoding is latin1, but the \ has been escaped.

    x <- "Do\xf1a Ana County"  # single backslash: \xf1 is the latin1 byte for ñ
    Encoding(x)<-"latin1"
    charToRaw(x)
    #  [1] 44 6f f1 61 20 41 6e 61 20 43 6f 75 6e 74 79
    xx<-iconv(x, "latin1", "UTF-8")
    charToRaw(xx)
    # [1] 44 6f c3 b1 61 20 41 6e 61 20 43 6f 75 6e 74 79
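
    For comparison, enc2utf8 can be sketched on the same string. It takes no "from" argument, so it depends on the Encoding() mark being set first; the variable names are only illustrative.

    ```r
    x <- "Do\xf1a Ana County"   # \xf1 is the latin1 byte for ñ
    Encoding(x) <- "latin1"     # declare the source encoding
    y <- enc2utf8(x)            # no "from" argument: uses the declared encoding

    Encoding(y)                 # "UTF-8"
    identical(charToRaw(y), charToRaw(iconv(x, "latin1", "UTF-8")))  # TRUE
    ```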
    

    Finally, if you need to clean up your output to get comparable strings, you can use this function (straight from my own encoding hell).

    to.plain <- function(s) {
      # old1 <- iconv("èéêëù", "UTF8")  # use this if your console is in LATIN1
      # new1 <- iconv("eeeeu", "UTF8")  # use this if your console is in LATIN1
      old1 <- "èéêëù"
      new1 <- "eeeeu"
      # chartr() maps each character of old1 to the character at the same position in new1
      chartr(old1, new1, s)
    }
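
    As a rough sketch of how such a function helps with the original matching problem, the example below normalises a key column in two made-up data frames and merges on it. Everything here is an assumption for illustration: to_plain is a hypothetical variant whose character map also covers ñ, and the tables a and b are invented.

    ```r
    # Hypothetical normaliser; the character map covers only the letters listed here
    to_plain <- function(s) chartr("\u00e8\u00e9\u00ea\u00eb\u00f9\u00f1", "eeeeun", s)

    # Made-up tables: one spells the county with ñ ("\u00f1"), the other without
    a <- data.frame(COUNTY = c("Do\u00f1a Ana County", "Bernalillo County"),
                    in_a = c(TRUE, TRUE), stringsAsFactors = FALSE)
    b <- data.frame(COUNTY = c("Dona Ana County", "Bernalillo County"),
                    in_b = c(TRUE, TRUE), stringsAsFactors = FALSE)

    # Join on a plain-ASCII key instead of the raw names
    a$key <- to_plain(a$COUNTY)
    b$key <- to_plain(b$COUNTY)
    merged <- merge(a, b, by = "key")
    nrow(merged)   # 2: both counties now match
    ```

    Keeping the original COUNTY columns next to the key means the accented spelling is still available for display after the join.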
    