RStudio cannot read Chinese characters in a txt file properly

孤独总比滥情好 2021-01-17 07:16

While I was trying to read a txt file with read.table(), I ran into problems viewing the dataset in RStudio. The original txt file consists of three columns of data inc

1 answer
    一向 (original poster)
    2021-01-17 07:39

    Input file (added a line in my native locale):

    100008251304976 Třiatřicet žlutých šišinek  2019-10-04 16:52:15
    100008251304976 你又知喎    2019-10-04 16:52:15
    100027970365477 甘你買多幾包花生,小心熱氣   2019-10-04 16:23:43
    

    R code snippet (converting individual rows of the x data frame could be done in a loop, I know…):

    sessionInfo()
    
    library(stringi)
    library(magrittr)
    
    x <- read.table('d:\\bat\\R\\comment.txt', encoding = 'UTF-8', quote = "\"", fill = TRUE, sep = '\t')
    
    print(x)
    
    x['V2'][1,] %>% 
      stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>% 
      stri_unescape_unicode() %>% 
      stri_enc_toutf8()
    x['V2'][2,] %>% 
      stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>% 
      stri_unescape_unicode() %>% 
      stri_enc_toutf8()
    x['V2'][3,] %>% 
      stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>% 
      stri_unescape_unicode() %>% 
      stri_enc_toutf8()
    

    Result (after pasting the code snippet into an open RStudio console):

    > sessionInfo()
    R version 3.4.1 (2017-06-30)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 10 x64 (build 18363)
    
    Matrix products: default
    
    locale:
    [1] LC_COLLATE=Czech_Czechia.1250  LC_CTYPE=Czech_Czechia.1250    LC_MONETARY=Czech_Czechia.1250
    [4] LC_NUMERIC=C                   LC_TIME=Czech_Czechia.1250    
    
    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base     
    
    other attached packages:
    [1] magrittr_1.5  stringi_1.1.5
    
    loaded via a namespace (and not attached):
    [1] compiler_3.4.1 tools_3.4.1   
    > library(stringi)
    > library(magrittr)
    > 
    > x <- read.table('d:\\bat\\R\\comment.txt', encoding = 'UTF-8', quote = "\"", fill = TRUE, sep = '\t')
    > 
    > print(x)
                V1                                                                                                V2
    1 1.000083e+14                                                                        Třiatřicet žlutých šišinek
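    2 1.000083e+14                                                                  <U+4F60><U+53C8><U+77E5><U+558E>
    3 1.000280e+14 <U+7518><U+4F60><U+8CB7><U+591A><U+5E7E><U+5305><U+82B1><U+751F>,<U+5C0F><U+5FC3><U+71B1><U+6C23>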
                       V3
    1 2019-10-04 16:52:15
    2 2019-10-04 16:52:15
    3 2019-10-04 16:23:43
    > 
    > x['V2'][1,] %>% 
    +   stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>% 
    +   stri_unescape_unicode() %>% 
    +   stri_enc_toutf8()
    [1] "Třiatřicet žlutých šišinek"
    > x['V2'][2,] %>% 
    +   stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>% 
    +   stri_unescape_unicode() %>% 
    +   stri_enc_toutf8()
    [1] "你又知喎"
    > x['V2'][3,] %>% 
    +   stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>% 
    +   stri_unescape_unicode() %>% 
    +   stri_enc_toutf8()
    [1] "甘你買多幾包花生,小心熱氣"
    >
    

    The conversion follows the accepted answer to a question about converting UTF-8 code point strings such as <U+4F60> to UTF-8.
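
Since the stringi functions are vectorized, the row-by-row calls can probably be collapsed into one call over the whole column instead of a loop. The sketch below is only an illustration: the helper name `unescape_codepoints` and the sample vector `v2` are made up for this example, and it assumes the escapes always have exactly four hex digits, as in the output shown here.

```r
library(stringi)

# Hypothetical helper (not from the original answer): turn "<U+XXXX>"
# escapes back into the characters they stand for, for a whole vector.
unescape_codepoints <- function(s) {
  s <- as.character(s)  # read.table() may have returned a factor
  # Rewrite "<U+4F60>" as the six-character escape sequence \u4F60 ...
  escaped <- stri_replace_all_regex(s, "<U\\+([0-9A-Fa-f]{4})>", "\\\\u$1")
  # ... then let stringi interpret every \uXXXX sequence.
  # Caveat: any other backslash in the text is treated as an escape too.
  stri_unescape_unicode(escaped)
}

# Sample values standing in for the V2 column (assumed, for illustration).
v2 <- c("T\u0159iat\u0159icet", "<U+4F60><U+53C8><U+77E5><U+558E>")
result <- unescape_codepoints(v2)
print(result)
```

With the data frame from the answer, the same helper would be applied as `x$V2 <- unescape_codepoints(x$V2)`.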
