I have a data.table called base, with a terme column in it:
class(base$terme)
[1] "character"
length(base$terme)
[1] 27486
I am not able to remove the accents from the values of this column. How can I do that?
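For illustration, the values look something like this (a made-up sample borrowed from the examples below; the real column has 27486 entries):

library(data.table)
# hypothetical sample, just to show the kind of accented values involved
base <- data.table(terme = c("Millésime", "boulangère", "Côte-Nord"))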
It might be easier to use the stringi package. This way, you don't need to check the encoding beforehand. Furthermore, stringi is consistent across operating systems, whereas iconv is not.
library(data.table)
library(stringi)

base <- data.table(terme = c("Millésime",
                             "boulangère",
                             "üéâäàåçêëèïîì"))
base[, terme := stri_trans_general(str = terme,
                                   id = "Latin-ASCII")]
> base
terme
1: Millesime
2: boulangere
3: ueaaaaceeeiii
You can apply this function to strip the accents (a usage example follows the definition):
rm_accent <- function(str, pattern = "all") {
  # Replace accented characters in `str` with their unaccented counterparts.
  # `pattern` selects which accent types to strip ("all" by default).
  if(!is.character(str))
    str <- as.character(str)
  pattern <- unique(pattern)
  if(any(pattern == "Ç"))
    pattern[pattern == "Ç"] <- "ç"
  symbols <- c(
    acute = "áéíóúÁÉÍÓÚýÝ",
    grave = "àèìòùÀÈÌÒÙ",
    circunflex = "âêîôûÂÊÎÔÛ",
    tilde = "ãõÃÕñÑ",
    umlaut = "äëïöüÄËÏÖÜÿ",
    cedil = "çÇ"
  )
  nudeSymbols <- c(
    acute = "aeiouAEIOUyY",
    grave = "aeiouAEIOU",
    circunflex = "aeiouAEIOU",
    tilde = "aoAOnN",
    umlaut = "aeiouAEIOUy",
    cedil = "cC"
  )
  accentTypes <- c("´", "`", "^", "~", "¨", "ç")
  if(any(c("all", "al", "a", "todos", "t", "to", "tod", "todo") %in% pattern)) # option: remove all accents
    return(chartr(paste(symbols, collapse = ""), paste(nudeSymbols, collapse = ""), str))
  for(i in which(accentTypes %in% pattern))
    str <- chartr(symbols[i], nudeSymbols[i], str)
  return(str)
}
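For example, applied to the column from the question (a quick sketch, assuming the column is called terme as in the other answers):

base[, terme := rm_accent(terme)]

chartr() is vectorised, so the function works on the whole column in one call.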
OK, this is the way I solved the problem:
Encoding(base$terme[2])
[1] "UTF-8"
iconv(base$terme[2],from="UTF-8",to="ASCII//TRANSLIT")
[1] "Millesime"
Thanks to @nicola
Three ways to remove accents, shown and compared with each other below.
The data to play with:
library(data.table)
library(magrittr)   # for the %>% pipe used below

dtCases <- fread("https://github.com/ishaberry/Covid19Canada/raw/master/cases.csv", stringsAsFactors = FALSE)
dim(dtCases) # 751526 16
Benchmarking:
> system.time(dtCases[, city0 := health_region])
   user  system elapsed
  0.009   0.001   0.012
> system.time(dtCases[, city1 := base::iconv(health_region, to = "ASCII//TRANSLIT")]) # or ... iconv(health_region, from = "UTF-8", to = "ASCII//TRANSLIT")
   user  system elapsed
  0.165   0.001   0.200
> system.time(dtCases[, city2 := textclean::replace_non_ascii(health_region)])
   user  system elapsed
  9.108   0.063   9.351
> system.time(dtCases[, city3 := stringi::stri_trans_general(health_region, id = "Latin-ASCII")])
   user  system elapsed
   4.34    0.00    4.46
Result:
> dtCases[city0!=city1, city0:city3] %>% unique
city0 city1 city2 city3
<char> <char> <char> <char>
1: Montréal Montreal Montreal Montreal
2: Montérégie Monteregie Monteregie Monteregie
3: Chaudière-Appalaches Chaudiere-Appalaches Chaudiere-Appalaches Chaudiere-Appalaches
4: Lanaudière Lanaudiere Lanaudiere Lanaudiere
5: Nord-du-Québec Nord-du-Quebec Nord-du-Quebec Nord-du-Quebec
6: Abitibi-Témiscamingue Abitibi-Temiscamingue Abitibi-Temiscamingue Abitibi-Temiscamingue
7: Gaspésie-Îles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine
8: Côte-Nord Cote-Nord Cote-Nord Cote-Nord
Conclusion:
base::iconv() is the fastest and is the preferred method.
Tested on French words. Not tested on other languages.