Reading an .RData file with a different encoding

别跟我提以往 2020-12-10 04:50

I have an .RData file to read on my Linux (UTF-8) machine, but I know the file is in Latin1 because I created it myself on Windows. Unfortunately, I don't have access

4 answers
  • 2020-12-10 05:23

    Thank you for posting this. I took the liberty of modifying your function to handle a data frame in which some columns are character and some are not. Otherwise, an error occurs:

    > fix.encoding(adress)
    Error in `Encoding<-`(`*tmp*`, value = "latin1") :
     a character vector argument expected
    

    So here is the modified function:

    fix.encoding <- function(df, originalEncoding = "latin1") {
        # Mark only character columns; Encoding<- fails on other types
        for (col in seq_len(ncol(df))) {
            if (is.character(df[[col]])) {
                Encoding(df[[col]]) <- originalEncoding
            }
        }
        return(df)
    }
    

    However, this will not change the encoding of the level names in a factor column. Luckily, I found the following snippet, which converts every factor column in the data frame to character (this may not be the best approach, but it was what I needed in my case):

    i <- sapply(df, is.factor)
    df[i] <- lapply(df[i], as.character)
    
  • 2020-12-10 05:27

    Thanks to 42's comment, I've managed to write a function to recode the file:

    fix.encoding <- function(df, originalEncoding = "latin1") {
      numCols <- ncol(df)
      # Assumes every column is a character vector; Encoding<- errors on other types
      for (col in 1:numCols) Encoding(df[, col]) <- originalEncoding
      return(df)
    }
    

    The meat here is the command Encoding(df[, col]) <- "latin1", which takes column col of the data frame df and declares its strings to be latin1-encoded. It does not convert any bytes; it only tells R how to interpret them. Encoding only accepts character vectors, not whole data frames, so I had to write a function that sweeps over all columns of a data frame and marks each one.
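
To make the effect of Encoding concrete, here is a minimal sketch on a single string. The example string is made up for illustration; the point is that setting the encoding leaves the bytes untouched, while enc2utf8() actually rewrites them.

```r
# Hypothetical example string: the single byte 0xE7 is "ç" in latin1.
x <- "gar\xe7on"
bytes_before <- charToRaw(x)

# Declare the encoding; the bytes themselves stay the same.
Encoding(x) <- "latin1"
bytes_after <- charToRaw(x)

# enc2utf8() performs an actual conversion, so "ç" becomes two bytes (c3 a7).
y <- enc2utf8(x)
utf8_bytes <- charToRaw(y)
```

Once the strings are marked correctly, R can translate them to the native encoding when printing, which is usually all that is needed after loading the file.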

    Of course, if your problem is in just a couple of columns, you're better off just applying the Encoding to those columns instead of the whole dataframe (you can modify the function above to take a set of columns as input). Also, if you're facing the inverse problem, i.e. reading an R object created in Linux or Mac OS into Windows, you should use originalEncoding = "UTF-8".
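
A version restricted to a set of columns could look roughly like this. The function name fix.encoding.cols and its cols argument are hypothetical, and the sketch skips any selected column that is not a character vector instead of failing on it.

```r
# Hypothetical variant: mark only the columns named in `cols`.
fix.encoding.cols <- function(df, cols = names(df), originalEncoding = "latin1") {
  for (col in cols) {
    # Skip non-character columns, since Encoding<- would raise an error on them
    if (is.character(df[[col]])) {
      Encoding(df[[col]]) <- originalEncoding
    }
  }
  df
}

# Small made-up data frame with latin1 bytes in two columns
example_df <- data.frame(
  a = "gar\xe7on",
  b = "ma\xf1ana",
  n = 1,
  stringsAsFactors = FALSE
)

# Only column "a" is marked; "b" and "n" are left as they were
fixed_df <- fix.encoding.cols(example_df, cols = "a")
```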

  • 2020-12-10 05:34

    Following up on the previous answers, here is a minor update that makes the function work with factors and dplyr's tibbles. Thanks for the inspiration.

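    fix.encoding <- function(df, originalEncoding = "UTF-8") {
      # Work on a plain data.frame so column access behaves the same for tibbles
      df <- as.data.frame(df)
      for (col in seq_len(ncol(df))) {
        if (is.character(df[[col]])) {
          Encoding(df[[col]]) <- originalEncoding
        }
        if (is.factor(df[[col]])) {
          # For a factor, the strings live in its levels
          Encoding(levels(df[[col]])) <- originalEncoding
        }
      }
      # as_tibble() replaces the deprecated as_data_frame()
      return(tibble::as_tibble(df))
    }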
    
  • 2020-12-10 05:48

    Another option using dplyr's mutate_if:

    library(dplyr)  # provides %>% and mutate_if()

    fix_encoding <- function(x) {
      Encoding(x) <- "latin1"
      return(x)
    }
    data <- data %>%
      mutate_if(is.character, fix_encoding)
    

    And for factor variables that have to be recoded:

    fix_encoding_factor <- function(x) {
      x <- as.character(x)
      Encoding(x) <- "latin1"
      x <- as.factor(x)
      return(x)
    }
    data <- data %>% 
      mutate_if(is.factor,fix_encoding_factor) 
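
One thing to keep in mind with the factor version is that going through as.character() and as.factor() rebuilds the levels in sorted order, so a custom level order is lost. As a rough alternative without dplyr, the sketch below marks the levels in place. The function name fix_encoding_any and the sample data are made up for illustration.

```r
# Hypothetical helper: mark character vectors and factor levels, leave the rest alone.
fix_encoding_any <- function(x, from = "latin1") {
  if (is.character(x)) {
    Encoding(x) <- from
  } else if (is.factor(x)) {
    # Marking the levels keeps their order and the underlying integer codes
    Encoding(levels(x)) <- from
  }
  x
}

# Made-up data with latin1 bytes and a non-alphabetical level order
city_levels <- c("S\xe3o Paulo", "Bras\xedlia")
example_df <- data.frame(
  city = factor(city_levels, levels = city_levels),
  note = c("a\xe7\xe3o", "ok"),
  n = 1:2,
  stringsAsFactors = FALSE
)

# Assigning into example_df[] applies the helper to every column and keeps the data frame
example_df[] <- lapply(example_df, fix_encoding_any)
```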
    