I have a data.table called base, and it has a term column:
class(base$term)
[1] "character"
length(base$term)
[1] 27486
I'm able to remove
Three ways to remove accents are shown and compared below.
The data to play with:
library(data.table)
dtCases <- fread("https://github.com/ishaberry/Covid19Canada/raw/master/cases.csv", stringsAsFactors = FALSE)
dim(dtCases) # 751526 16
Benchmarking:
> system.time(dtCases[, city0 := health_region])
   user  system elapsed
  0.009   0.001   0.012
> system.time(dtCases[, city1 := base::iconv(health_region, to = "ASCII//TRANSLIT")]) # or ... iconv(health_region, from = "UTF-8", to = "ASCII//TRANSLIT")
   user  system elapsed
  0.165   0.001   0.200
> system.time(dtCases[, city2 := textclean::replace_non_ascii(health_region)])
   user  system elapsed
  9.108   0.063   9.351
> system.time(dtCases[, city3 := stringi::stri_trans_general(health_region, id = "Latin-ASCII")])
   user  system elapsed
   4.34    0.00    4.46
Result:
> unique(dtCases[city0 != city1, city0:city3])
                           city0                         city1                         city2                         city3
1:                      Montréal                      Montreal                      Montreal                      Montreal
2:                    Montérégie                    Monteregie                    Monteregie                    Monteregie
3:          Chaudière-Appalaches          Chaudiere-Appalaches          Chaudiere-Appalaches          Chaudiere-Appalaches
4:                    Lanaudière                    Lanaudiere                    Lanaudiere                    Lanaudiere
5:                Nord-du-Québec                Nord-du-Quebec                Nord-du-Quebec                Nord-du-Quebec
6:         Abitibi-Témiscamingue         Abitibi-Temiscamingue         Abitibi-Temiscamingue         Abitibi-Temiscamingue
7: Gaspésie-Îles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine
8:                     Côte-Nord                     Cote-Nord                     Cote-Nord                     Cote-Nord
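The comparison above is done by eye. A minimal sketch of how it could be checked in code follows, on a toy vector rather than the full dataset. The helper is_ascii() and the variable names are made up for illustration, and the stringi comparison only runs if that package is installed. Note that what //TRANSLIT produces depends on the platform's iconv implementation, so no particular output is assumed here.

```r
# Toy input; \u escapes keep the script independent of the file encoding.
x <- c("Montr\u00e9al", "C\u00f4te-Nord", "Lanaudi\u00e8re")

# Hypothetical helper: TRUE for strings made only of ASCII code points.
# NA inputs (failed conversions) count as FALSE.
is_ascii <- function(s) {
  !is.na(s) & vapply(s, function(z) all(utf8ToInt(z) < 128L),
                     logical(1), USE.NAMES = FALSE)
}

via_iconv <- iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")

# Did every element end up as plain ASCII?
print(all(is_ascii(via_iconv)))

# Cross-check against stringi, but only when it is available.
if (requireNamespace("stringi", quietly = TRUE)) {
  via_stringi <- stringi::stri_trans_general(x, id = "Latin-ASCII")
  # TRUE when both methods agree on every element.
  print(identical(via_iconv, via_stringi))
}
```

If either check prints FALSE on your machine, look at the offending rows before relying on the iconv() result.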
Conclusion:
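base::iconv() is the fastest by a wide margin (about 0.2 s elapsed, versus about 4.5 s for stringi and 9.4 s for textclean on these 751526 rows). All three produced the same output for the accented names shown above, so iconv() is the preferred method.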
Tested on French words. Not tested on other languages.
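Applied to the table from the question, this could look roughly like the sketch below. The three-row base here is a made-up stand-in for the real 27486-row table, and the column name term_ascii is my own choice. Writing to a new column instead of overwriting term makes it easy to inspect what changed.

```r
library(data.table)

# Made-up stand-in for the question's table; \u escapes avoid encoding issues.
base <- data.table(term = c("Montr\u00e9al", "C\u00f4te-Nord", "plain"))

# Transliterate into a new column so the original values are kept.
# The exact result of //TRANSLIT is platform-dependent.
base[, term_ascii := iconv(term, from = "UTF-8", to = "ASCII//TRANSLIT")]

# Rows that changed, for a visual check.
print(unique(base[term != term_ascii]))

# Strings that could not be converted come back as NA.
print(base[is.na(term_ascii)])
```

Once the changed rows look right, the new column can replace the old one with base[, term := term_ascii].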