Force character vector encoding from “unknown” to “UTF-8” in R

挽巷 2020-11-29 18:31

I have a problem with inconsistent encoding of a character vector in R.

The text file which I read a table from is encoded (via Notepad++

2 answers
  • 2020-11-29 18:58

    The Encoding function returns "unknown" both when a character string carries the "native encoding" mark (CP-1250 in your case) and when it is pure ASCII. To tell these two cases apart, call:

    library(stringi)
    stri_enc_mark(poli.dt$word)
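
    As a small illustration of the difference, here is a sketch on a made-up vector x (an assumption for this example, not the asker's data). Encoding() reports "unknown" for a pure-ASCII string, while stri_enc_mark() names that case explicitly:

    ```r
    library(stringi)

    # Hypothetical sample: one pure-ASCII string and one string that R
    # marks as UTF-8 because it is written with \u escapes.
    x <- c("abc", "za\u017C\u00F3\u0142\u0107")

    enc_base <- Encoding(x)        # base R:  "unknown" "UTF-8"
    enc_mark <- stri_enc_mark(x)   # stringi: "ASCII"   "UTF-8"
    print(enc_base)
    print(enc_mark)
    ```

    A string for which stri_enc_mark() returns "native" is the other case: non-ASCII bytes with no declared encoding.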
    

    To check whether each string is a valid UTF-8 byte sequence, call:

    all(stri_enc_isutf8(poli.dt$word))
    

    If this returns FALSE, at least one string is not a valid UTF-8 byte sequence, so your file is definitely not in UTF-8.
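
    To see what a failed check looks like, here is a minimal sketch that uses base R's validUTF8() in place of stri_enc_isutf8(). The byte values are an assumption chosen for illustration (0xBF is "ż" in CP-1250):

    ```r
    # Valid UTF-8 by construction, since \u escapes yield UTF-8 strings.
    good <- "za\u017C"

    # The same word as CP-1250 bytes: 0x7A 0x61 0xBF. A lone 0xBF is a
    # continuation byte, so this sequence is not valid UTF-8.
    bad <- rawToChar(as.raw(c(0x7A, 0x61, 0xBF)))

    is_valid <- validUTF8(c(good, bad))
    print(is_valid)        # TRUE FALSE
    print(all(is_valid))   # FALSE
    ```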

    I suspect that you have not forced UTF-8 mode in the function that reads the data (inspect the contents of poli.dt$word to verify this). If my guess is right, try:

    read.csv2(file("filename", encoding="UTF-8"))
    

    or

    poli.dt$word <- stri_encode(poli.dt$word, "", "UTF-8") # convert from the native encoding ("") to UTF-8 and mark the result
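
    Both ideas can be tried on a throwaway file. The sketch below is only an illustration under stated assumptions: the file contents are made up, it uses the encoding argument of read.csv2() itself (a variant of the file(..., encoding=) call above), and it re-encodes with base iconv() using the explicit source name "CP1250", whose availability depends on the platform's iconv.

    ```r
    # Write a two-line, semicolon-separated file as UTF-8 bytes.
    path <- tempfile(fileext = ".csv")
    word <- "za\u017C\u00F3\u0142\u0107"
    writeLines(enc2utf8(c("id;word", paste0("1;", word))), path, useBytes = TRUE)

    # Declare the input as UTF-8 so non-ASCII strings get the UTF-8 mark.
    df <- read.csv2(path, encoding = "UTF-8", stringsAsFactors = FALSE)
    print(Encoding(df$word))

    # Re-encode CP-1250 bytes (0x7A 0x61 0xBF) with an explicit source encoding.
    cp1250 <- rawToChar(as.raw(c(0x7A, 0x61, 0xBF)))
    fixed <- iconv(cp1250, from = "CP1250", to = "UTF-8")
    print(Encoding(fixed))

    unlink(path)  # remove the temporary file
    ```

    Naming the source encoding explicitly avoids relying on whatever the session's native encoding happens to be.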
    

    If data.table still complains about the "mixed" encodings, you may want to transliterate the non-ASCII characters, e.g.:

    stri_trans_general("Zażółć gęślą jaźń", "Latin-ASCII")
    ## [1] "Zazolc gesla jazn"
    
  • 2020-11-29 19:19

    I had a similar problem and could not find a solution myself: I was unable to convert characters of unknown encoding, read from a txt file, into something more manageable in R.

    As a result, the same character appeared more than once in the same dataset because it was encoded differently ("X" in a Latin setting and "X" in a Greek setting). Saving to txt preserved that difference, as it should.
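
    One possible reading of this (an assumption, since the original data is not shown) is that the duplicates are look-alike characters from different scripts. A minimal sketch for checking that by comparing code points:

    ```r
    latin_x   <- "X"        # LATIN CAPITAL LETTER X
    greek_chi <- "\u03A7"   # GREEK CAPITAL LETTER CHI, which looks like "X"

    cp_latin <- utf8ToInt(latin_x)     # 88
    cp_greek <- utf8ToInt(greek_chi)   # 935
    print(c(cp_latin, cp_greek))

    same <- latin_x == greek_chi       # FALSE: different characters
    print(same)
    ```

    If the code points differ, no change of encoding mark will merge the two values; they have to be mapped onto each other explicitly.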

    I tried some of the methods above, but none of them worked. The problem is well described as “cannot distinguish ASCII from UTF-8 and the bit will not stick even if you set it”.
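
    The quoted behaviour is easy to reproduce in base R. In this small sketch (the strings are made up for illustration), an ASCII-only string keeps the "unknown" mark even after an explicit assignment:

    ```r
    x <- "abc"
    Encoding(x) <- "UTF-8"          # try to declare an encoding on an ASCII string
    mark_ascii <- Encoding(x)
    print(mark_ascii)               # "unknown": the mark does not stick

    y <- "za\u017C"                 # contains a non-ASCII character
    mark_non_ascii <- Encoding(y)
    print(mark_non_ascii)           # "UTF-8"
    ```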

    A good workaround is to "export your data.frame to a CSV temporary file and reimport with data.table::fread(), specifying Latin-1 as source encoding."

    Copying the example given by that source:

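    library(data.table)
    # Placeholder: substitute your own data frame with mixed UTF-8, Latin-1 and unknown-encoded strings
    df <- your_data_frame_with_mixed_utf8_or_latin1_and_unknown_str_fields
    fwrite(df, "temp.csv")  # export to a temporary CSV file
    your_clean_data_table <- fread("temp.csv", encoding = "Latin-1")  # reimport, declaring Latin-1 as the source encoding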
    

    I hope this helps someone.
