Remove accents from a dataframe column in R

感情败类 2021-02-05 04:03

I have a data.table called base, and it has a term column.

class(base$term)
[1] "character"
length(base$term)
[1] 27486

I'm able to remove

4 Answers
  • 2021-02-05 04:25

    It might be easier to use the stringi package. That way, you don't need to check the encoding beforehand. Furthermore, stringi behaves consistently across operating systems, whereas iconv does not.

    library(data.table)
    library(stringi)
    
    base <- data.table(terme = c("Millésime", 
                                 "boulangère", 
                                 "üéâäàåçêëèïîì"))
    
    base[, terme := stri_trans_general(str = terme, 
                                       id = "Latin-ASCII")]
    
    > base
               terme
    1:     Millesime
    2:    boulangere
    3: ueaaaaceeeiii
    
  • 2021-02-05 04:25

    You can use this function, which removes either all accents or only the accent types selected via pattern:

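    rm_accent <- function(str, pattern = "all") {
      # Coerce non-character input (e.g. factors) to character
      if (!is.character(str))
        str <- as.character(str)

      pattern <- unique(pattern)

      # Treat an upper-case cedilla as the lower-case selector
      if (any(pattern == "Ç"))
        pattern[pattern == "Ç"] <- "ç"

      # Accented characters, grouped by accent type
      symbols <- c(
        acute = "áéíóúÁÉÍÓÚýÝ",
        grave = "àèìòùÀÈÌÒÙ",
        circunflex = "âêîôûÂÊÎÔÛ",
        tilde = "ãõÃÕñÑ",
        umlaut = "äëïöüÄËÏÖÜÿ",
        cedil = "çÇ"
      )

      # Unaccented replacements, matched to `symbols` position by position
      nudeSymbols <- c(
        acute = "aeiouAEIOUyY",
        grave = "aeiouAEIOU",
        circunflex = "aeiouAEIOU",
        tilde = "aoAOnN",
        umlaut = "aeiouAEIOUy",
        cedil = "cC"
      )

      # Selectors accepted in `pattern`, in the same order as `symbols`
      accentTypes <- c("´", "`", "^", "~", "¨", "ç")

      # Option: remove all accents ("all", Portuguese "todos", or an abbreviation)
      if (any(c("all", "al", "a", "todos", "t", "to", "tod", "todo") %in% pattern))
        return(chartr(paste(symbols, collapse = ""), paste(nudeSymbols, collapse = ""), str))

      # Otherwise remove only the requested accent types
      for (i in which(accentTypes %in% pattern))
        str <- chartr(symbols[i], nudeSymbols[i], str)

      return(str)
    }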
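
    For reference, the core of this approach is chartr(), which replaces each character of its first argument with the character at the same position in its second argument. Here is a minimal sketch using a hypothetical five-character mapping (not the full table from the function above). The accented characters are written as \u escapes so the example should not depend on the encoding of the script file:

    ```r
    # Hypothetical, deliberately small mapping: accented characters and their
    # replacements, matched by position.
    accented   <- "\u00e9\u00e8\u00ea\u00e0\u00e7"   # é è ê à ç
    unaccented <- "eeeac"

    # chartr() requires both strings to have the same number of characters
    stopifnot(nchar(accented) == nchar(unaccented))

    words <- c("Mill\u00e9sime", "boulang\u00e8re")  # Millésime, boulangère
    plain <- chartr(accented, unaccented, words)
    print(plain)
    ```

    Any character missing from the mapping is left untouched, so with this approach the table has to cover every accented character that occurs in the data.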
    
  • 2021-02-05 04:39

    OK, here is how to solve the problem:

    Encoding(base$terme[2])
    [1] "UTF-8"
    iconv(base$terme[2],from="UTF-8",to="ASCII//TRANSLIT")
    [1] "Millesime"
    

    Thanks to @nicola
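
    To convert the whole column rather than a single element, something like the following should work. It is only a sketch: to_ascii is a hypothetical helper name, and the output of //TRANSLIT depends on the platform's iconv implementation (some are known to emit extra characters such as ' or ? in place of the accent), so check the result on your own system.

    ```r
    # Hypothetical helper: transliterate a character vector to ASCII with iconv().
    to_ascii <- function(x) {
      # Make sure the input really is UTF-8 before declaring it as such
      out <- iconv(enc2utf8(x), from = "UTF-8", to = "ASCII//TRANSLIT")
      # iconv() returns NA for elements it cannot convert; flag those
      failed <- is.na(out) & !is.na(x)
      if (any(failed))
        warning(sum(failed), " element(s) could not be converted")
      out
    }

    terme <- c("Mill\u00e9sime", "boulang\u00e8re", NA)  # Millésime, boulangère
    ascii_terme <- to_ascii(terme)
    print(ascii_terme)

    # With the data.table from the question, this would be:
    # base[, terme := to_ascii(terme)]
    ```

    Missing values stay NA, and the warning makes it visible if any non-missing element failed to convert.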

  • 2021-02-05 04:43

    Three ways to remove accents - shown and compared to each other below.
    The data to play with:

    library(data.table)
    library(magrittr)  # provides %>%, used for the result table further down
    dtCases <- fread("https://github.com/ishaberry/Covid19Canada/raw/master/cases.csv", stringsAsFactors = FALSE)
    dim(dtCases) #  751526     16
    

    Benchmarking:

    > system.time(dtCases [, city0 := health_region])
       user  system elapsed 
      0.009   0.001   0.012 
    > system.time(dtCases [, city1 := base::iconv (health_region, to="ASCII//TRANSLIT")]) # or ... iconv (health_region, from="UTF-8", to="ASCII//TRANSLIT")
       user  system elapsed 
      0.165   0.001   0.200 
    > system.time(dtCases [, city2 := textclean::replace_non_ascii (health_region)])
       user  system elapsed 
      9.108   0.063   9.351 
    > system.time(dtCases [, city3 := stringi::stri_trans_general (health_region,id = "Latin-ASCII")])
       user  system elapsed 
       4.34    0.00    4.46 
    

    Result:

    > dtCases[city0!=city1, city0:city3] %>% unique
                               city0                         city1                         city2                         city3
                              <char>                        <char>                        <char>                        <char>
    1:                      Montréal                      Montreal                      Montreal                      Montreal
    2:                    Montérégie                    Monteregie                    Monteregie                    Monteregie
    3:          Chaudière-Appalaches          Chaudiere-Appalaches          Chaudiere-Appalaches          Chaudiere-Appalaches
    4:                    Lanaudière                    Lanaudiere                    Lanaudiere                    Lanaudiere
    5:                Nord-du-Québec                Nord-du-Quebec                Nord-du-Quebec                Nord-du-Quebec
    6:         Abitibi-Témiscamingue         Abitibi-Temiscamingue         Abitibi-Temiscamingue         Abitibi-Temiscamingue
    7: Gaspésie-Îles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine
    8:                     Côte-Nord                     Cote-Nord                     Cote-Nord                     Cote-Nord
    

    Conclusion:

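    base::iconv() is the fastest of the three (0.2 s elapsed, versus 4.46 s for stringi::stri_trans_general() and 9.35 s for textclean::replace_non_ascii()), and all three give identical results on the rows shown above, so it is the preferred method here. This was tested on French words only, not on other languages.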
