Remove accents from a dataframe column in R

I have a data.table called base, and it has a column named term.

class(base$term)
[1] character
length(base$term)
[1] 27486

I'm able to remove

4 Answers

借酒劲吻你

2021-02-05 04:25

It might be easier to use the stringi package. That way, you don't need to check the encoding beforehand. Furthermore, stringi behaves consistently across operating systems, whereas iconv does not.

library(data.table)
library(stringi)

base <- data.table(terme = c("Millésime", 
                             "boulangère", 
                             "üéâäàåçêëèïîì"))

base[, terme := stri_trans_general(str = terme, 
                                   id = "Latin-ASCII")]

> base
           terme
1:     Millesime
2:    boulangere
3: ueaaaaceeeiii
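Since the question title mentions a dataframe column, here is a minimal sketch of the same call on a plain data.frame. The names `df` and `terme_ascii` are hypothetical and used only for illustration; the block assumes the stringi package is installed and skips the work otherwise.

```r
# Sketch: Latin-ASCII transliteration on a plain data.frame column.
# `df` and `terme_ascii` are illustrative names, not from the original post.
if (requireNamespace("stringi", quietly = TRUE)) {
  df <- data.frame(terme = c("Millésime", "boulangère"),
                   stringsAsFactors = FALSE)

  # Write the result to a new column so the original values stay available
  df$terme_ascii <- stringi::stri_trans_general(df$terme, id = "Latin-ASCII")

  print(df)
}
```

Writing to a separate column is only a design choice for the example; assigning back to `df$terme` overwrites in place, as the data.table version above does.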

感动是毒

2021-02-05 04:25

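You can apply this function, which maps accented characters to their unaccented equivalents with chartr(). With pattern = "all" (the default) it strips every accent type; otherwise pass one or more of the symbols listed in accentTypes to strip only those.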

rm_accent <- function(str, pattern = "all") {
  # Coerce non-character input (e.g. factors) to character
  if (!is.character(str))
    str <- as.character(str)

  pattern <- unique(pattern)

  # Treat an uppercase cedilla selector as the lowercase one
  if (any(pattern == "Ç"))
    pattern[pattern == "Ç"] <- "ç"

  # Accented characters, grouped by accent type
  symbols <- c(
    acute = "áéíóúÁÉÍÓÚýÝ",
    grave = "àèìòùÀÈÌÒÙ",
    circumflex = "âêîôûÂÊÎÔÛ",
    tilde = "ãõÃÕñÑ",
    umlaut = "äëïöüÄËÏÖÜÿ",
    cedil = "çÇ"
  )

  # Unaccented replacements, matched position by position
  nudeSymbols <- c(
    acute = "aeiouAEIOUyY",
    grave = "aeiouAEIOU",
    circumflex = "aeiouAEIOU",
    tilde = "aoAOnN",
    umlaut = "aeiouAEIOUy",
    cedil = "cC"
  )

  # Selectors accepted in `pattern`, in the same order as `symbols`
  accentTypes <- c("´", "`", "^", "~", "¨", "ç")

  # Option: remove all accents
  if (any(c("all", "al", "a", "todos", "t", "to", "tod", "todo") %in% pattern))
    return(chartr(paste(symbols, collapse = ""), paste(nudeSymbols, collapse = ""), str))

  # Otherwise remove only the requested accent types
  for (i in which(accentTypes %in% pattern))
    str <- chartr(symbols[i], nudeSymbols[i], str)

  return(str)
}
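The core of the function above is a one-to-one character mapping with chartr(). A stripped-down sketch of that idea, using base R only, might look like this; `strip_accents`, `accented` and `plain` are hypothetical names, and the map covers lowercase letters only to keep the example short.

```r
# Sketch of the chartr() technique: each character in `accented`
# is replaced by the character at the same position in `plain`.
accented <- "áéíóúàèìòùâêîôûãõñäëïöüç"
plain    <- "aeiouaeiouaeiouaonaeiouc"

# chartr() needs both maps to line up character by character
stopifnot(nchar(accented) == nchar(plain))

strip_accents <- function(x) chartr(accented, plain, x)

strip_accents(c("Millésime", "boulangère", "São Paulo"))
```

Because chartr() maps one character to exactly one character, anything missing from the map is left untouched, and characters that would need a multi-letter replacement (for example "œ" or "ß") cannot be handled this way.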

一生所求

2021-02-05 04:39
OK, here is how to solve the problem:
```
Encoding(base$terme[2])
[1] "UTF-8"
iconv(base$terme[2],from="UTF-8",to="ASCII//TRANSLIT")
[1] "Millesime"
```
Thanks to @nicola
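What "ASCII//TRANSLIT" produces depends on the platform's iconv implementation, so it may be worth checking the result rather than trusting it blindly. The sketch below is one possible check, not a definitive recipe: the vector `x` and the variable names are made up for illustration, and the list of "suspicious" punctuation is an assumption about what some implementations emit in place of an accent.

```r
# Illustrative input; the NA shows how missing values pass through
x <- c("Millésime", "boulangère", NA)

# Normalise to UTF-8 first so that from = "UTF-8" is actually true
x_utf8 <- enc2utf8(x)
out <- iconv(x_utf8, from = "UTF-8", to = "ASCII//TRANSLIT")

# iconv() returns NA for strings it cannot convert (when sub = NA, the default)
failed <- is.na(out) & !is.na(x)

# Assumption: some implementations replace an accent with punctuation
# such as ? or ' so flag results where such a mark newly appears
marks <- "[?'`^~\"]"
suspicious <- !is.na(out) & grepl(marks, out) & !grepl(marks, x_utf8)

data.frame(input = x, output = out, failed = failed, suspicious = suspicious)
```

Rows flagged as `failed` or `suspicious` are candidates for another method, such as the stringi approach from the other answer.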

别跟我提以往

2021-02-05 04:43

Three ways to remove accents - shown and compared to each other below.
The data to play with:

library(data.table)
library(magrittr)  # provides %>%, used in the Result section

dtCases <- fread("https://github.com/ishaberry/Covid19Canada/raw/master/cases.csv", stringsAsFactors = FALSE)
dim(dtCases) #  751526     16

Benchmarking:

> system.time(dtCases [, city0 := health_region])
   user  system elapsed 
  0.009   0.001   0.012 
> system.time(dtCases [, city1 := base::iconv (health_region, to="ASCII//TRANSLIT")]) # or ... iconv (health_region, from="UTF-8", to="ASCII//TRANSLIT")
   user  system elapsed 
  0.165   0.001   0.200 
> system.time(dtCases [, city2 := textclean::replace_non_ascii (health_region)])
   user  system elapsed 
  9.108   0.063   9.351 
> system.time(dtCases [, city3 := stringi::stri_trans_general (health_region,id = "Latin-ASCII")])
   user  system elapsed 
   4.34    0.00    4.46

Result:

> dtCases[city0!=city1, city0:city3] %>% unique
                           city0                         city1                         city2                         city3
                          <char>                        <char>                        <char>                        <char>
1:                      Montréal                      Montreal                      Montreal                      Montreal
2:                    Montérégie                    Monteregie                    Monteregie                    Monteregie
3:          Chaudière-Appalaches          Chaudiere-Appalaches          Chaudiere-Appalaches          Chaudiere-Appalaches
4:                    Lanaudière                    Lanaudiere                    Lanaudiere                    Lanaudiere
5:                Nord-du-Québec                Nord-du-Quebec                Nord-du-Quebec                Nord-du-Quebec
6:         Abitibi-Témiscamingue         Abitibi-Temiscamingue         Abitibi-Temiscamingue         Abitibi-Temiscamingue
7: Gaspésie-Îles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine
8:                     Côte-Nord                     Cote-Nord                     Cote-Nord                     Cote-Nord

Conclusion:

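In this benchmark, base::iconv() is by far the fastest (about 0.2 s elapsed, versus about 4.5 s for stringi::stri_trans_general() and about 9.4 s for textclean::replace_non_ascii()), and all three give the same result on this data, so it is the preferred method here. This was tested on French words only, not on other languages.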
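To repeat the comparison without downloading the dataset, a rough sketch on synthetic data could look like the following. The vector `x` is a made-up stand-in for the health_region column, its size is arbitrary, and the stringi part runs only if that package is installed; timings and even the iconv output will vary by machine and platform.

```r
# Synthetic stand-in for the health_region column (a few values from the result table)
words <- enc2utf8(c("Montréal", "Montérégie", "Lanaudière", "Côte-Nord"))
x <- rep(words, 1e4)  # 40,000 strings; the size is an arbitrary choice

# Time the base R approach
t_iconv <- system.time(
  a <- iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")
)
cat("iconv elapsed:  ", t_iconv[["elapsed"]], "\n")

# Run the stringi comparison only if the package is available
if (requireNamespace("stringi", quietly = TRUE)) {
  t_stringi <- system.time(
    b <- stringi::stri_trans_general(x, id = "Latin-ASCII")
  )
  cat("stringi elapsed:", t_stringi[["elapsed"]], "\n")
  cat("same output:    ", identical(a, b), "\n")
}
```

If the two outputs differ on a given system, that is a hint that the local iconv transliteration does not behave like the one used in the benchmark above.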
