How can I find out the internal code representation of a WINDOWS-1252 character?

温柔的废话 2021-01-05 00:54

I am processing SPSS data from a questionnaire that must have originated in M$ Word. Word automatically changes hyphens into long hyphens, and gets converted into character

3 answers
  • 2021-01-05 01:06

    After some head-scratching, a lot of reading of help files, and trial and error, I created two little functions that do what I need. They work by converting their input to UTF-8 and then returning the integer vector for the UTF-8 encoded character vector, and vice versa.

    # Convert character to integer vector
    # Optional encoding specifies encoding of x, defaults to current locale
    encToInt <- function(x, encoding=localeToCharset()){
        utf8ToInt(iconv(x, encoding, "UTF-8"))
    }
    
    # Convert integer vector (Unicode code points) to character vector
    # Optional encoding specifies the encoding of the result, defaults to current locale
    intToEnc <- function(x, encoding=localeToCharset()){
        iconv(intToUtf8(x), "UTF-8", encoding)
    }
    

    Some examples (run in a Latin-1/CP1252 locale, since encoding defaults to the current locale):

    x <- "\xfa"
    encToInt(x)
    [1] 250
    
    intToEnc(250)
    [1] "ú"
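
    For the Windows-1252 dash from the question, the encoding can be passed explicitly instead of relying on the locale. The following is only a minimal sketch: it assumes your platform's iconv accepts the name "CP1252" (some builds may want "WINDOWS-1252"), and it builds the test byte with rawToChar purely for illustration. Note that encToInt returns the Unicode code point after conversion, while charToRaw shows the original single-byte value.

    ```r
    # Same helper as above, repeated so the snippet runs on its own.
    encToInt <- function(x, encoding = localeToCharset()) {
      utf8ToInt(iconv(x, encoding, "UTF-8"))
    }

    # Illustrative input: byte 0x96 is the en dash in Windows-1252.
    dash <- rawToChar(as.raw(0x96))

    # Unicode code point after converting from CP1252 (U+2013).
    encToInt(dash, encoding = "CP1252")
    # [1] 8211

    # The byte value as stored in the Windows-1252 data.
    as.integer(charToRaw(dash))
    # [1] 150
    ```

    If the byte value is what you are after, charToRaw alone is enough; the iconv round trip only matters when you want the Unicode code point.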
    
  • 2021-01-05 01:16

    I use a variation on Andrie's code:

    • Vectorised on x so that I can apply it to a vector/column of characters
    • Handles a character encoded as two UTF-8 bytes (like "\u0098", which gives c(194, 152)) by simply returning the last integer.

    This is useful, for example, for mapping latin1/cp1252 characters to an integer range, which is my application ("more compact file format", they say). It is obviously not appropriate if you need to convert the integer back to a character at some point.

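    # encoding has no default here and must be supplied by the caller
    encToInt <- Vectorize(
      function(x, encoding){
        out <- utf8ToInt(iconv(x, encoding, "UTF-8"))
        out[length(out)]  # keep only the last integer
      },
      vectorize.args="x", USE.NAMES = FALSE, SIMPLIFY = TRUE)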
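
    A short usage sketch of the vectorised version, applied to a small character vector. The input values and the choice of "latin1" are illustrative assumptions, not taken from the original data.

    ```r
    # Vectorised helper as above, repeated so the snippet runs on its own.
    encToInt <- Vectorize(
      function(x, encoding) {
        out <- utf8ToInt(iconv(x, encoding, "UTF-8"))
        out[length(out)]  # keep only the last integer
      },
      vectorize.args = "x", USE.NAMES = FALSE, SIMPLIFY = TRUE)

    # Illustrative latin1 input: "a", u-acute (0xfa), e-acute (0xe9).
    chars <- c("a", "\xfa", "\xe9")

    # One integer per element of the input vector.
    codes <- encToInt(chars, encoding = "latin1")
    codes
    # [1]  97 250 233
    ```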
    
  • 2021-01-05 01:18

    If you load the SPSS sav file via read.spss from the foreign package, you can import the data frame with the correct encoding by specifying it, like this:

    read.spss("foo.sav", reencode="CP1252")
    