I am processing SPSS data from a questionnaire that must have originated in M$ Word. Word automatically changes hyphens into en dashes ("long hyphens"), and these get converted into stray non-ASCII characters in the data. I need a way to convert characters to their integer code points (and back again) so I can track down and clean up these characters.
After some head-scratching, lots of reading of help files, and trial and error, I created two little functions that do what I need. These functions work by converting their input into UTF-8 and then returning the integer vector for the UTF-8 encoded character vector, and vice versa.
# Convert a character to an integer vector of Unicode code points
# Optional encoding specifies the encoding of x (defaults to current locale)
encToInt <- function(x, encoding = localeToCharset()) {
  utf8ToInt(iconv(x, encoding, "UTF-8"))
}
# Convert an integer vector of Unicode code points back to a character
# Optional encoding specifies the target encoding (defaults to current locale)
intToEnc <- function(x, encoding = localeToCharset()) {
  iconv(intToUtf8(x), "UTF-8", encoding)
}
Some examples:
x <- "\xfa"
encToInt(x)
# [1] 250
intToEnc(250)
# [1] "ú"
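To connect this back to the original problem, here is a sketch assuming the questionnaire file is CP1252-encoded, where Word's en dash ("long hyphen") is byte 0x96, i.e. Unicode code point 8211 (U+2013):
# Assuming CP1252 input: Word's en dash is byte 0x96,
# which is Unicode code point 8211 (U+2013)
encToInt("\x96", encoding = "CP1252")
# [1] 8211
intToEnc(8211, encoding = "CP1252")
# [1] "–"  (the en dash, re-encoded to CP1252)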
I use a variation on Andrie's code, vectorized so that I can apply it to a vector/column of characters. This is useful, for example, to map latin1/cp1252 characters to an integer range, which is my application ("more compact file format", they say). Since only the last code point of each element is kept, this is obviously not appropriate if you need to convert the integers back to characters at some point.
encToInt <- Vectorize(
  function(x, encoding) {
    out <- utf8ToInt(iconv(x, encoding, "UTF-8"))
    out[length(out)]  # keep only the last code point
  },
  vectorize.args = "x", USE.NAMES = FALSE, SIMPLIFY = TRUE)
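For example, applied to a column of single latin1 characters (hypothetical data; note that the encoding must be supplied explicitly, since this version defines no default):
chars <- c("a", "\xe9", "\xfa")  # hypothetical latin1-encoded values
encToInt(chars, encoding = "latin1")
# [1]  97 233 250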
If you load the SPSS .sav file via read.spss from package foreign, you can easily import the data frame with the correct encoding by specifying it, like this:
read.spss("foo.sav", reencode = "CP1252")
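For completeness, a minimal sketch ("foo.sav" is a placeholder file name); to.data.frame = TRUE makes read.spss return a data frame instead of its default list:
library(foreign)
# to.data.frame = TRUE returns a data frame instead of the
# default list of variables; "foo.sav" is a placeholder
dat <- read.spss("foo.sav", to.data.frame = TRUE, reencode = "CP1252")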