I run across this often enough that I figure there has to be a good idiom for it. Suppose I have a data.frame with a bunch of attributes, including "product." I also have a key which translates products to brand + size. Product codes 1-3 are Tylenol, 4-6 are Advil, 7-9 are Bayer, 10-12 are Generic.
What's the fastest (in terms of human time) way to code this up?
I tend to use nested ifelse
's if there are 3 or fewer categories, and type out the data table and merge it in if there are more than 3. Any better ideas? Stata has a recode
command that is pretty nifty for this sort of thing, although I believe it promotes data-code intermixing a little too much.
dat <- structure(list(product = c(11L, 11L, 9L, 9L, 6L, 1L, 11L, 5L,
7L, 11L, 5L, 11L, 4L, 3L, 10L, 7L, 10L, 5L, 9L, 8L)), .Names = "product", row.names = c(NA,
-20L), class = "data.frame")
One could use a list as an associative array to define the brand -> product code
mapping, i.e.:
brands <- list(Tylenol=1:3, Advil=4:6, Bayer=7:9, Generic=10:12)
Once you have this, you can then either invert this to create a product code -> brand
list (could take a lot of memory), or just use a search function:
find.key <- function(x, li, default=NA) {
ret <- rep.int(default, length(x))
for (key in names(li)) {
ret[x %in% li[[key]]] <- key
}
return(ret)
}
I'm sure there are better ways of writing this function (the for
loop is annoying me!), but at least it is vectorised, so it only requires a single pass through the list.
Using it would be something like:
> dat$brand <- find.key(dat$product, brands)
> dat
product brand
1 11 Generic
2 11 Generic
3 9 Bayer
4 9 Bayer
5 6 Advil
6 1 Tylenol
7 11 Generic
8 5 Advil
9 7 Bayer
10 11 Generic
11 5 Advil
12 11 Generic
13 4 Advil
14 3 Tylenol
15 10 Generic
16 7 Bayer
17 10 Generic
18 5 Advil
19 9 Bayer
20 8 Bayer
The recode
and levels<-
solutions are very nice, but they are also significantly slower than this one (and once you have find.key
this is easier-for-humans than recode
and on par with the levels<-
):
> microbenchmark(
recode=recode(dat$product,recodes="1:3='Tylenol';4:6='Advil';7:9='Bayer';10:12='Generic'"),
find.key=find.key(dat$product, brands),
levels=`levels<-`(factor(dat$product),brands))
Unit: microseconds
expr min lq median uq max
1 find.key 64.325 69.9815 76.8950 83.8445 221.748
2 levels 240.535 248.1470 274.7565 306.8490 1477.707
3 recode 1636.039 1683.4275 1730.8170 1855.8320 3095.938
(I can't get the switch
version to benchmark properly, but it appears to be faster than all of the above, although it is even worse-for-humans than the recode
solution.)
You could convert your variable to a factor and change its levels by levels<-
function. In one command it could be like:
`levels<-`(
factor(dat$product),
list(Tylenol=1:3, Advil=4:6, Bayer=7:9, Generic=10:12)
)
In steps:
brands <- factor(dat$product)
levels(brands) <- list(Tylenol=1:3, Advil=4:6, Bayer=7:9, Generic=10:12)
I like the recode
function in the car
package:
library(car)
dat$brand <- recode(dat$product,
recodes="1:3='Tylenol';4:6='Advil';7:9='Bayer';10:12='Generic'")
# > dat
# product brand
# 1 11 Generic
# 2 11 Generic
# 3 9 Bayer
# 4 9 Bayer
# 5 6 Advil
# 6 1 Tylenol
# 7 11 Generic
# 8 5 Advil
# 9 7 Bayer
# 10 11 Generic
# 11 5 Advil
# 12 11 Generic
# 13 4 Advil
# 14 3 Tylenol
# 15 10 Generic
# 16 7 Bayer
# 17 10 Generic
# 18 5 Advil
# 19 9 Bayer
# 20 8 Bayer
I often use the technique below:
key <- c()
key[1:3] <- "Tylenol"
key[4:6] <- "Advil"
key[7:9] <- "Bayer"
key[10:12] <- "Generic"
Then,
> key[dat$product]
[1] "Generic" "Generic" "Bayer" "Bayer" "Advil" "Tylenol" "Generic" "Advil" "Bayer" "Generic"
[11] "Advil" "Generic" "Advil" "Tylenol" "Generic" "Bayer" "Generic" "Advil" "Bayer" "Bayer"
The "database approach" is to keep a separate table (a data.frame) for your product keys definitions. It makes even more sense since you say your product keys translate into not only a brand, but also a size:
product.keys <- read.table(textConnection("
product brand size
1 Tylenol small
2 Tylenol medium
3 Tylenol large
4 Advil small
5 Advil medium
6 Advil large
7 Bayer small
8 Bayer medium
9 Bayer large
10 Generic small
11 Generic medium
12 Generic large
"), header = TRUE)
Then, you can join your data using merge
:
merge(dat, product.keys, by = "product")
# product brand size
# 1 1 Tylenol small
# 2 3 Tylenol large
# 3 4 Advil small
# 4 5 Advil medium
# 5 5 Advil medium
# 6 5 Advil medium
# 7 6 Advil large
# 8 7 Bayer small
# 9 7 Bayer small
# 10 8 Bayer medium
# 11 9 Bayer large
# 12 9 Bayer large
# 13 9 Bayer large
# 14 10 Generic small
# 15 10 Generic small
# 16 11 Generic medium
# 17 11 Generic medium
# 18 11 Generic medium
# 19 11 Generic medium
# 20 11 Generic medium
As you notice, the order of the rows is not preserved by merge
. If this is a problem, the plyr
package has a join
function that does preserve the order:
library(plyr)
join(dat, product.keys, by = "product")
# product brand size
# 1 11 Generic medium
# 2 11 Generic medium
# 3 9 Bayer large
# 4 9 Bayer large
# 5 6 Advil large
# 6 1 Tylenol small
# 7 11 Generic medium
# 8 5 Advil medium
# 9 7 Bayer small
# 10 11 Generic medium
# 11 5 Advil medium
# 12 11 Generic medium
# 13 4 Advil small
# 14 3 Tylenol large
# 15 10 Generic small
# 16 7 Bayer small
# 17 10 Generic small
# 18 5 Advil medium
# 19 9 Bayer large
# 20 8 Bayer medium
Finally, if your tables are large and speed is an issue, consider using data.tables (from the data.table
package) instead of data.frames.
This one takes some typing but if you really do have a huge data set this may be the way to go. Bryangoodrich and Dason at talkstats.com taught me this one. It's using a hash table or creating a environment that contains a look up table. I actually keep this one on my .Rprofile (the hash function that is) for dictionary type look ups.
I replicated your data 1000 times to make it a bit larger.
#################################################
# THE HASH FUNCTION (CREATES A ENW ENVIRONMENT) #
#################################################
hash <- function(x, type = "character") {
e <- new.env(hash = TRUE, size = nrow(x), parent = emptyenv())
char <- function(col) assign(col[1], as.character(col[2]), envir = e)
num <- function(col) assign(col[1], as.numeric(col[2]), envir = e)
FUN <- if(type=="character") char else num
apply(x, 1, FUN)
return(e)
}
###################################
# YOUR DATA REPLICATED 1000 TIMES #
###################################
dat <- dat <- structure(list(product = c(11L, 11L, 9L, 9L, 6L, 1L, 11L, 5L,
7L, 11L, 5L, 11L, 4L, 3L, 10L, 7L, 10L, 5L, 9L, 8L)), .Names = "product", row.names = c(NA,
-20L), class = "data.frame")
dat <- dat[rep(seq_len(nrow(dat)), 1000), , drop=FALSE]
rownames(dat) <-NULL
dat
#########################
# CREATE A LOOKUP TABLE #
#########################
med.lookup <- data.frame(val=as.character(1:12),
med=rep(c('Tylenol', 'Advil', 'Bayer', 'Generic'), each=3))
########################################
# USE hash TO CREATE A ENW ENVIRONMENT #
########################################
meds <- hash(med.lookup)
##############################
# CREATE A RECODING FUNCTION #
##############################
recoder <- function(x){
x <- as.character(x) #turn the numbers to character
rc <- function(x){
if(exists(x, env = meds))get(x, e = meds) else NA
}
sapply(x, rc, USE.NAMES = FALSE)
}
#############
# HASH AWAY #
#############
recoder(dat[, 1])
In this case hashing is slow but if you have more levels to recode then it will increase in speed over others.
Somewhat more readable than nested ifelse
's:
unlist(lapply(as.character(dat$product), switch,
`1`=,`2`=,`3`='tylenol',
`4`=,`5`=,`6`='advil',
`7`=,`8`=,`9`='bayer',
`10`=,`11`=,`12`='generic'))
Caveat: not very efficient.
I tend to use this function:
recoder <- function (x, from = c(), to = c()) {
missing.levels <- unique(x)
missing.levels <- missing.levels[!missing.levels %in% from]
if (length(missing.levels) > 0) {
from <- append(x = from, values = missing.levels)
to <- append(x = to, values = missing.levels)
}
to[match(x, from)]
}
As in:
recoder(x = dat$product, from = 1:12, to = c(rep("Product1", 3), rep("Product2", 3), rep("Product3", 3), rep("Product4", 3)))
If you have codes in sequential groups like in the example, this may cut
the mustard:
cut(dat$product,seq(0,12,by=3),labels=c("Tylenol","Advil","Bayer","Generic"))
[1] Generic Generic Bayer Bayer Advil Tylenol Generic Advil Bayer
[10] Generic Advil Generic Advil Tylenol Generic Bayer Generic Advil
[19] Bayer Bayer
Levels: Tylenol Advil Bayer Generic
There's also arules:discretize
, but I like it less because it makes you separate the labels from the range of values:
library(arules)
discretize( dat$product, method = "fixed", categories = c( 1,3,6,9,12 ), labels = c("Tylenol","Advil","Bayer","Generic") )
[1] Generic Generic Generic Generic Bayer Tylenol Generic Advil Bayer Generic Advil Generic Advil Advil Generic Bayer Generic Advil Generic Bayer
Levels: Tylenol Advil Bayer Generic
Another version, that would work in this case:
c("Tylenol","Advil","Bayer","Generic")[(dat$product %/% 3.1) + 1]
For completeness (and probably fastest and simplest solution) one can create and named vector and use it for lookup. Credit: http://adv-r.had.co.nz/Subsetting.html#applications
product.code <- c(1='Tylenol', 2='Tylenol', 3='Tylenon', 4='Advil', 5 ='Advil', 6='Advil', 7='Bayer', 8='Bayer', 9='Bayer', 10='Generic', 11='Generic', 12='Generic')
To get the output
$unname(product.code[dat$product])
Bench-marking for speed with the top solutions
$microbenchmark(
named_vector = unname(product.code[dat$product]),
find.key = find.key(dat$product, brands),
levels = `levels<-`(factor(dat$product),brands))
Unit: microseconds
expr min lq mean median uq max neval
named_vector 11.777 20.4810 26.12832 23.0410 28.1610 207.360 100
find.key 34.305 55.8090 58.75804 59.1370 65.5370 130.049 100
levels 143.361 224.7685 234.02545 247.5525 255.7445 338.944 100
This solution is very similar to @kohske's solution but would work for non-numerical lookup.
来源:https://stackoverflow.com/questions/10431403/idiom-for-ifelse-style-recoding-for-multiple-categories