Remove accents from a dataframe column in R

感情败类 2021-02-05 04:03

I have a data.table called base, and it has a term column:

class(base$term)
[1] "character"
length(base$term)
[1] 27486

I'm able to remove

4 answers
  •  别跟我提以往
    2021-02-05 04:43

    Here are three ways to remove accents, benchmarked and compared below.
    The data to play with:

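    library(data.table)  # fread() and the := syntax
    library(magrittr)    # %>% used in the result check
    dtCases <- fread("https://github.com/ishaberry/Covid19Canada/raw/master/cases.csv", stringsAsFactors = FALSE)
    dim(dtCases) #  751526     16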
    

    Benchmarking:

    > system.time(dtCases [, city0 := health_region])
       user  system elapsed 
      0.009   0.001   0.012 
    > system.time(dtCases [, city1 := base::iconv (health_region, to="ASCII//TRANSLIT")]) # or ... iconv (health_region, from="UTF-8", to="ASCII//TRANSLIT")
       user  system elapsed 
      0.165   0.001   0.200 
    > system.time(dtCases [, city2 := textclean::replace_non_ascii (health_region)])
       user  system elapsed 
      9.108   0.063   9.351 
    > system.time(dtCases [, city3 := stringi::stri_trans_general (health_region,id = "Latin-ASCII")])
       user  system elapsed 
       4.34    0.00    4.46 
    

    Result:

    > dtCases[city0!=city1, city0:city3] %>% unique
                               city0                         city1                         city2                         city3
                                                                                                      
    1:                      Montréal                      Montreal                      Montreal                      Montreal
    2:                    Montérégie                    Monteregie                    Monteregie                    Monteregie
    3:          Chaudière-Appalaches          Chaudiere-Appalaches          Chaudiere-Appalaches          Chaudiere-Appalaches
    4:                    Lanaudière                    Lanaudiere                    Lanaudiere                    Lanaudiere
    5:                Nord-du-Québec                Nord-du-Quebec                Nord-du-Quebec                Nord-du-Quebec
    6:         Abitibi-Témiscamingue         Abitibi-Temiscamingue         Abitibi-Temiscamingue         Abitibi-Temiscamingue
    7: Gaspésie-Îles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine
    8:                     Côte-Nord                     Cote-Nord                     Cote-Nord                     Cote-Nord
    

    Conclusion:

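    base::iconv() is the fastest of the three on these 751,526 rows (about 0.2 s elapsed, versus about 4.5 s for stringi::stri_trans_general() and about 9.4 s for textclean::replace_non_ascii()), and all three gave the same output, so it is the preferred method here. This was tested only on French place names, not on other languages.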
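
Applied to the term column from the question, the iconv() approach might look like the sketch below. The sample values are made up for illustration, the new column name term_ascii is hypothetical, and from = "UTF-8" assumes the input is UTF-8 encoded. The result of //TRANSLIT depends on the platform's iconv implementation, so it is worth checking a few values before relying on it.

```r
library(data.table)

# Hypothetical stand-in for the asker's table; \u escapes keep the strings UTF-8
base <- data.table(term = c("Montr\u00e9al", "C\u00f4te-Nord", "Toronto"))

# Keep the original column and add a transliterated copy next to it
base[, term_ascii := iconv(term, from = "UTF-8", to = "ASCII//TRANSLIT")]

# Rows that transliteration changed, or failed on (iconv returns NA on failure)
changed <- base[is.na(term_ascii) | term != term_ascii]
print(changed)
```

Writing to a new column rather than overwriting term makes it easy to inspect what changed and to spot any NA values before dropping the original.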
