I have a column in a dataframe in R with values "-1","0","1". I'd like to replace these values with "no", "maybe" and "yes" respectively. I'll do this by usi
By adding 2 to -1, 0, and 1, you could get indices into a vector of the desired outcomes:
c("no", "maybe", "yes")[dat + 2]
# [1] "no" "yes" "maybe" "yes" "yes" "no"
A related option could make use of the match function to figure out the indexing:
c("no", "maybe", "yes")[match(dat, -1:1)]
# [1] "no" "yes" "maybe" "yes" "yes" "no"
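To make the indexing step explicit, here is a small sketch that prints the positions match returns before they are used to subset the labels. The name idx is introduced only for illustration.

```r
dat <- c(-1, 1, 0, 1, 1, -1)

# Position of each element of dat within the lookup table -1:1
idx <- match(dat, -1:1)
idx
# [1] 1 3 2 3 3 1

# Those positions then select the corresponding labels
labels_out <- c("no", "maybe", "yes")[idx]
```

Values that do not appear in the lookup table get NA from match, so they would come out as NA rather than raising an error.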
Alternately, you could use a named vector for recoding:
unname(c("-1"="no", "0"="maybe", "1"="yes")[as.character(dat)])
# [1] "no" "yes" "maybe" "yes" "yes" "no"
You could also use a nested ifelse:
ifelse(dat == -1, "no", ifelse(dat == 0, "maybe", "yes"))
# [1] "no" "yes" "maybe" "yes" "yes" "no"
If you don't mind loading a new package, the Recode function from the car package does this:
library(car)
Recode(dat, "-1='no'; 0='maybe'; 1='yes'")
# [1] "no" "yes" "maybe" "yes" "yes" "no"
Data:
dat <- c(-1, 1, 0, 1, 1, -1)
Note that all but the first would also work if dat were stored as strings; in the first you would need to use as.numeric(dat) before adding 2.
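As a minimal sketch of that note, assuming the same values stored as strings (dat_chr is a name made up for this example):

```r
# Same values as dat, but stored as character strings (hypothetical name)
dat_chr <- c("-1", "1", "0", "1", "1", "-1")

# The first option needs an explicit conversion before the arithmetic
res_index <- c("no", "maybe", "yes")[as.numeric(dat_chr) + 2]

# match() coerces both arguments to a common type, so it works unchanged
res_match <- c("no", "maybe", "yes")[match(dat_chr, -1:1)]

identical(res_index, res_match)
# [1] TRUE
```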
If code clarity is your main objective, then you should pick the one that you find easiest to understand -- I would personally pick the second or last but that is personal preference.
If code speed is of interest, then you can benchmark the solutions. Here are benchmarks of the five options I've presented, plus the two solutions currently posted as other answers, run on a random vector of length 100k:
set.seed(144)
dat <- sample(c(-1, 0, 1), replace=TRUE, 100000)
opt1 <- function(dat) c("no", "maybe", "yes")[dat + 2]
opt2 <- function(dat) c("no", "maybe", "yes")[match(dat, -1:1)]
opt3 <- function(dat) unname(c("-1"="no", "0"="maybe", "1"="yes")[as.character(dat)])
opt4 <- function(dat) ifelse(dat == -1, "no", ifelse(dat == 0, "maybe", "yes"))
opt5 <- function(dat) Recode(dat, "-1='no'; 0='maybe'; 1='yes'")
AnandaMahto <- function(dat) factor(dat, levels = c(-1, 0, 1), labels = c("no", "maybe", "yes"))
hrbrmstr <- function(dat) sapply(as.character(dat), switch, `-1`="no", `0`="maybe", `1`="yes", USE.NAMES=FALSE)
library(microbenchmark)
microbenchmark(opt1(dat), opt2(dat), opt3(dat), opt4(dat), opt5(dat), AnandaMahto(dat), hrbrmstr(dat))
# Unit: milliseconds
# expr min lq mean median uq max neval
# opt1(dat) 1.513500 2.553022 2.763685 2.656010 2.837673 4.384149 100
# opt2(dat) 2.153438 3.013502 3.251850 3.117058 3.269230 5.851234 100
# opt3(dat) 59.716271 61.890470 64.978685 62.509046 63.723048 144.708757 100
# opt4(dat) 62.934734 64.715815 71.181477 65.652195 71.123384 123.840577 100
# opt5(dat) 82.976441 84.849147 89.071808 85.752429 88.473162 155.347273 100
# AnandaMahto(dat) 57.267227 58.643889 60.508402 59.065642 60.368913 80.852157 100
# hrbrmstr(dat) 137.883307 148.626496 158.051220 153.441243 162.594752 228.271336 100
The first two options appear to be more than an order of magnitude quicker than any of the other options, though either the vector would have to be pretty huge or you would need to be repeating the operation a number of times for any of this to make a difference.
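Before reading too much into timings, it can be worth checking that the approaches actually agree. The sketch below does that for the base-R options on a small sample. The variable names are made up here, and the factor result is converted with as.character so that everything is compared as plain character vectors.

```r
set.seed(144)
small <- sample(c(-1, 0, 1), 20, replace = TRUE)

by_index  <- c("no", "maybe", "yes")[small + 2]
by_match  <- c("no", "maybe", "yes")[match(small, -1:1)]
by_name   <- unname(c("-1" = "no", "0" = "maybe", "1" = "yes")[as.character(small)])
by_ifelse <- ifelse(small == -1, "no", ifelse(small == 0, "maybe", "yes"))
by_factor <- as.character(
  factor(small, levels = c(-1, 0, 1), labels = c("no", "maybe", "yes"))
)

# stopifnot() is silent when every comparison holds
stopifnot(
  identical(by_index, by_match),
  identical(by_index, by_name),
  identical(by_index, by_ifelse),
  identical(by_index, by_factor)
)
```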
As pointed out by @AnandaMahto, these results are qualitatively different if we have character input instead of numeric input:
set.seed(144)
dat <- sample(c("-1", "0", "1"), replace=TRUE, 100000)
opt1 <- function(dat) c("no", "maybe", "yes")[as.numeric(dat) + 2]
opt2 <- function(dat) c("no", "maybe", "yes")[match(dat, -1:1)]
opt3 <- function(dat) unname(c("-1"="no", "0"="maybe", "1"="yes")[as.character(dat)])
opt4 <- function(dat) ifelse(dat == -1, "no", ifelse(dat == 0, "maybe", "yes"))
opt5 <- function(dat) Recode(dat, "-1='no'; 0='maybe'; 1='yes'")
AnandaMahto <- function(dat) factor(dat, levels = c(-1, 0, 1), labels = c("no", "maybe", "yes"))
hrbrmstr <- function(dat) sapply(dat, switch, `-1`="no", `0`="maybe", `1`="yes", USE.NAMES=FALSE)
library(microbenchmark)
microbenchmark(opt1(dat), opt2(dat), opt3(dat), opt4(dat), opt5(dat), AnandaMahto(dat), hrbrmstr(dat))
# Unit: milliseconds
# expr min lq mean median uq max neval
# opt1(dat) 8.397194 9.519075 10.784108 9.693706 10.163203 55.78417 100
# opt2(dat) 2.281438 3.091418 4.231162 3.210794 3.436038 49.39879 100
# opt3(dat) 3.606863 5.481115 6.466393 5.720282 6.344651 48.47924 100
# opt4(dat) 66.819638 69.996704 74.596960 71.290522 73.404043 127.52415 100
# opt5(dat) 32.897019 35.701401 38.488489 36.336489 38.950272 88.20915 100
# AnandaMahto(dat) 1.329443 2.114504 2.824306 2.275736 2.493907 46.19333 100
# hrbrmstr(dat) 81.898572 91.043729 154.331766 100.006203 141.425717 1594.17447 100
Now, the factor solution proposed by @AnandaMahto is the quickest, followed by vector indexing with match and named vector lookup. Again, all runtimes are fast enough that you would need a large vector or many runs for any of this to matter.
factor is commonly used for this type of task, and leads to some pretty easily readable code:
vec <- c(0, 1, -1, -1, 1, 0)
vec
# [1] 0 1 -1 -1 1 0
factor(vec, levels = c(-1, 0, 1), labels = c("no", "maybe", "yes"))
# [1] maybe yes no no yes maybe
# Levels: no maybe yes
If you want just the character output, wrap it in as.character.
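A short sketch of that wrapping, reusing vec from above (the name f is only for illustration):

```r
vec <- c(0, 1, -1, -1, 1, 0)
f <- factor(vec, levels = c(-1, 0, 1), labels = c("no", "maybe", "yes"))

# Plain character vector of the labels
as.character(f)
# [1] "maybe" "yes"   "no"    "no"    "yes"   "maybe"

# For comparison, the underlying integer codes of the factor
as.integer(f)
# [1] 2 3 1 1 3 2
```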
If the column values are already strings, you just modify the levels argument in factor to use as.character:
vec2 <- as.character(c(0, 1, -1, -1, 1, 0))
vec2
# [1] "0" "1" "-1" "-1" "1" "0"
factor(vec2, levels = as.character(c(-1, 0, 1)), labels = c("no", "maybe", "yes"))
# [1] maybe yes no no yes maybe
# Levels: no maybe yes
This could also be an evil application for switch:
set.seed(1492)
thing <- sample(c(-1, 0, 1), 100, replace=TRUE)
sapply(as.character(thing), switch, `-1`="no", `0`="maybe", `1`="yes", USE.NAMES=FALSE)
If they are in fact characters already you can leave off the as.character() bit.
NOTE: I'm not necessarily recommending this, just showing all the possible ways (and this is more of a way out of twisty mazes of ifelse passages).
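If you do go the switch route, one possible variation (not part of the approach above, just a sketch) is to swap sapply for vapply, which checks that every call returns a single string. The input vector here is a small fixed one chosen for illustration.

```r
thing <- c(-1, 1, 0, 1, 1, -1)

# vapply() errors if any call does not return exactly one character value,
# for example when switch() finds no matching case and returns NULL
res <- vapply(as.character(thing), switch, FUN.VALUE = character(1),
              `-1` = "no", `0` = "maybe", `1` = "yes", USE.NAMES = FALSE)
res
# [1] "no"    "yes"   "maybe" "yes"   "yes"   "no"
```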
IMO factors are the way to go.
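Since the question is about a data frame column, here is a minimal sketch of the factor approach applied in place. The data frame df and its column names id and answer are hypothetical, invented for this example.

```r
# Hypothetical data frame with a numeric column to recode
df <- data.frame(id = 1:6, answer = c(-1, 1, 0, 1, 1, -1))

# Overwrite the column with its labelled factor version
df$answer <- factor(df$answer, levels = c(-1, 0, 1),
                    labels = c("no", "maybe", "yes"))

# Counts per label, in level order
table(df$answer)
```

Wrap the right-hand side in as.character() if the column should hold plain strings rather than a factor.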