UTF-8 encoding with dplyr and SQLite

灰色年华 2021-02-06 07:58

I have a table in SQLite that I'd like to open with dplyr. I use SQLite Expert version 35.58.2478 and RStudio version 0.98.1062 on a PC running Windows 7.

After connecting to th

1 answer
  • 2021-02-06 08:35

    I had the same problem and solved it as shown below, although I can't guarantee that the solution is rock solid. Give it a try:

    library(dplyr)
    library(sqldf)
    
    # Modifying built-in mtcars dataset
    
    mtcars$test <- 
      c("č", "ž", "š", "č", "ž", "š", letters) %>% 
      enc2utf8(.)
    
    mtcars$češćžä <- 
      c("č", "ž", "š", "č", "ž", "š", letters) %>% 
      enc2utf8(.)
    
    names(mtcars) <- 
      iconv(names(mtcars), "cp1250", "utf-8")
    
    # Connecting to sqlite database
    
    my_db <- src_sqlite("my_db.sqlite3", create = TRUE)
    
    # exporting mtcars dataset to database
    copy_to(my_db, mtcars, temporary = FALSE)
    
    # dbSendQuery(my_db$con, "drop table mtcars")
    
    # getting data from sqlite database
    my_mtcars_from_db <-
      collect(tbl(my_db, "mtcars"))
    
    # disconnecting from database
    dbDisconnect(my_db$con)
    

    convert_to_encoding() function

    # a function that encodes 
    # column names and values in character columns
    # with specified encodings
    convert_to_encoding <- 
      function(x, from_encoding = "UTF-8", to_encoding = "cp1250"){
    
        # names of columns are encoded in specified encoding
        my_names <- 
          iconv(names(x), from_encoding, to_encoding) 
    
        # replace the names only if every name converted cleanly;
        # iconv() returns NA for a name it cannot convert,
        # in which case the original names are kept
        if(!any(is.na(my_names))){
          names(x) <- my_names
        }
    
        # get column classes
        x_char_columns <- sapply(x, class)
        # identify character columns
        x_cols <- names(x_char_columns[x_char_columns == "character"])
    
        # convert all string values in character columns to 
        # specified encoding
        x <- 
          x %>%
          mutate_each_(funs(iconv(., from_encoding, to_encoding)), 
                       x_cols)
        # return x
        return(x)
      }
    
    # usage example
    convert_to_encoding(my_mtcars_from_db, "UTF-8", "cp1250")
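mutate_each_() and funs() have since been deprecated in dplyr, so the function above may not run unchanged on a current installation. Below is a minimal base R sketch of the same idea that needs no packages. The name convert_to_encoding_base and the small test data frame are hypothetical and not part of the original answer.

```r
# Hypothetical base R variant of convert_to_encoding(); not from the original answer.
convert_to_encoding_base <- function(x, from_encoding = "UTF-8", to_encoding = "cp1250") {
  # Re-encode the column names; keep the originals if any conversion fails (NA)
  new_names <- iconv(names(x), from_encoding, to_encoding)
  if (!anyNA(new_names)) {
    names(x) <- new_names
  }

  # Re-encode the values of character columns only
  is_char <- vapply(x, is.character, logical(1))
  if (any(is_char)) {
    x[is_char] <- lapply(x[is_char], iconv, from = from_encoding, to = to_encoding)
  }
  x
}

# Illustrative data: "č" and "ž" written as Unicode escapes
df <- data.frame(id = 1:2, txt = c("\u010d", "\u017e"), stringsAsFactors = FALSE)
out <- convert_to_encoding_base(df, "UTF-8", "cp1250")
```

Numeric columns pass through untouched, and the converted strings hold cp1250 bytes, so they only display correctly in a session whose native encoding is cp1250.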
    

    Results

    # before conversion
    my_mtcars_from_db
    
    Source: local data frame [32 x 13]
    
        mpg cyl  disp  hp drat    wt  qsec vs am gear carb češćžä test
    1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4          ÄŤ   ÄŤ
    2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4          Ĺľ   Ĺľ
    3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1          š   š
    4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1          ÄŤ   ÄŤ
    5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2          Ĺľ   Ĺľ
    6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1          š   š
    7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4           a    a
    8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2           b    b
    9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2           c    c
    10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4           d    d
    ..  ... ...   ... ...  ...   ...   ... .. ..  ...  ...         ...  ...
    
    # after conversion
    convert_to_encoding(my_mtcars_from_db, "UTF-8", "cp1250")
    
    Source: local data frame [32 x 13]
    
        mpg cyl  disp  hp drat    wt  qsec vs am gear carb test češćžä
    1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4    č      č
    2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4    ž      ž
    3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1    š      š
    4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1    č      č
    5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2    ž      ž
    6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1    š      š
    7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4    a      a
    8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2    b      b
    9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2    c      c
    10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4    d      d
    ..  ... ...   ... ...  ...   ...   ... .. ..  ...  ...  ...    ...
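One plausible reading of the garbled values in the "before" output is that the strings come back from the database as UTF-8 bytes and are then displayed as if they were cp1250. The sketch below reproduces that effect for "č" outside of any database. It is an illustration under that assumption, not a diagnosis of what RSQLite does internally.

```r
# "č" (U+010D) as a UTF-8 string
original <- "\u010d"
utf8_bytes <- charToRaw(enc2utf8(original))   # two bytes: c4 8d

# Misread those two bytes as cp1250 text and convert the result to UTF-8
garbled <- iconv(rawToChar(utf8_bytes), from = "cp1250", to = "UTF-8")

# Undo the damage: convert back to cp1250 bytes, then mark them as UTF-8
restored <- iconv(garbled, from = "UTF-8", to = "cp1250")
Encoding(restored) <- "UTF-8"
```

Here garbled is the two-character string "ÄŤ" seen in the first table, and restored is identical to the original "č".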
    

    Session information

    devtools::session_info()
    
    Session info -------------------------------------------------------------------
     setting  value                       
     version  R version 3.2.0 (2015-04-16)
     system   x86_64, mingw32             
     ui       RStudio (0.99.441)          
     language (EN)                        
     collate  Slovenian_Slovenia.1250     
     tz       Europe/Prague               
    
    Packages -----------------------------------------------------------------------
     package    * version date       source        
     assertthat * 0.1     2013-12-06 CRAN (R 3.2.0)
     chron      * 2.3-45  2014-02-11 CRAN (R 3.2.0)
     DBI          0.3.1   2014-09-24 CRAN (R 3.2.0)
     devtools   * 1.7.0   2015-01-17 CRAN (R 3.2.0)
     dplyr        0.4.1   2015-01-14 CRAN (R 3.2.0)
     gsubfn       0.6-6   2014-08-27 CRAN (R 3.2.0)
     lazyeval   * 0.1.10  2015-01-02 CRAN (R 3.2.0)
     magrittr   * 1.5     2014-11-22 CRAN (R 3.2.0)
     proto        0.3-10  2012-12-22 CRAN (R 3.2.0)
     R6         * 2.0.1   2014-10-29 CRAN (R 3.2.0)
     Rcpp       * 0.11.6  2015-05-01 CRAN (R 3.2.0)
     RSQLite      1.0.0   2014-10-25 CRAN (R 3.2.0)
     rstudioapi * 0.3.1   2015-04-07 CRAN (R 3.2.0)
     sqldf        0.4-10  2014-11-07 CRAN (R 3.2.0)
    