Question
I am working in R with a data frame that has a character variable and a numeric variable. My data frame DF looks like this (the dput version is at the end):
a1 b1
1 a 10.15
2 a 25.10
3 a 32.40
4 a 56.70
5 a 89.02
6 b 90.50
7 b 78.53
8 b 98.12
9 b 34.30
10 b 99.75
In DF, a1 is a grouping variable and b1 is a numeric variable. Here is the dilemma: I want to create a new variable named c1 with the cut function, using different breaks for each group in a1. To do this I combine ifelse() and cut() in the following code:
DF$c1=ifelse(DF$a1=="a",
cut(DF$b1,breaks = c(0,25,50,70,max(DF$b1)),right = TRUE,include.lowest = TRUE),
ifelse(DF$a1=="b",
cut(DF$b1,breaks = c(0,50,max(DF$b1)),right = TRUE,include.lowest = TRUE),NA))
The code runs without error, but the new values in c1 are confusing. Instead of a factor, the combination of ifelse() and cut() returns integers, and I get this result:
table(DF$c1,exclude=NULL)
1 2 3 4 <NA>
2 6 1 1 0
Even though I defined the breaks, the integers stored in c1 change the result. This does not happen when I use cut() without ifelse(), but then I cannot apply group-specific breaks. For example, the next line of code returns this result:
DF$c1=cut(DF$b1,breaks = c(0,25,50,70,max(DF$b1)),right = TRUE,include.lowest = TRUE)
table(DF$c1,exclude=NULL)
[0,25] (25,50] (50,70] (70,99.8] <NA>
1 3 1 5 0
I would like to know how to fix this interaction between ifelse() and cut(), because the returned integers change the final result. In this example a1 has only two groups, but my real data set is large and has many groups, which is why I combine the functions to get different cuts for each group. The break values can also change, so writing the labels by hand would be tedious. Is it possible for the combination of these two functions to return the correct labels for each group (a factor) instead of integers? The dput() version of my data frame DF is:
DF<-structure(list(a1 = c("a", "a", "a", "a", "a", "b", "b", "b",
"b", "b"), b1 = c(10.15, 25.1, 32.4, 56.7, 89.02, 90.5, 78.53,
98.12, 34.3, 99.75)), .Names = c("a1", "b1"), row.names = c(NA,
-10L), class = "data.frame")
Thanks for your help!
Answer 1:
The problem is that both cut() calls return factors, but ifelse() drops their attributes (class and levels), so only the underlying integer codes are kept. One solution is to wrap each cut() in as.character(), so the labels survive, and then call factor() on the whole result:
DF$c1=factor(ifelse(DF$a1=="a",
as.character(cut(DF$b1,breaks = c(0,25,50,70,max(DF$b1)),right = TRUE,include.lowest = TRUE)),
ifelse(DF$a1=="b",
as.character(cut(DF$b1,breaks = c(0,50,max(DF$b1)),right = TRUE,include.lowest = TRUE)),NA)))
DF
a1 b1 c1
1 a 10.15 [0,25]
2 a 25.10 (25,50]
3 a 32.40 (25,50]
4 a 56.70 (50,70]
5 a 89.02 (70,99.8]
6 b 90.50 (50,99.8]
7 b 78.53 (50,99.8]
8 b 98.12 (50,99.8]
9 b 34.30 [0,50]
10 b 99.75 (50,99.8]
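To see the effect in isolation, here is a minimal sketch on a toy factor (the vectors f and test are made up for illustration and are not part of the question's data). It suggests that ifelse() returns integer codes for factor input, and that converting to character first keeps the labels:

```r
# Toy factor; levels sort alphabetically, so "hi" has code 1 and "lo" has code 2.
f <- factor(c("lo", "hi", "lo"))
test <- c(TRUE, FALSE, TRUE)

# ifelse() drops the factor attributes, leaving only the integer codes.
codes <- ifelse(test, f, f)
class(codes)  # "integer"

# Converting to character before ifelse() keeps the labels.
labs <- ifelse(test, as.character(f), as.character(f))
class(labs)   # "character"
```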
Answer 2:
@scoa is right; you're passing two factors with different levels through ifelse(), so your results come back as integer codes and you lose the levels. Here's another approach with a smaller form factor, which will be more scalable.
First, make a named list of all your breaks:
breaks <- list('a' = c(0, 25, 50, 70, max(DF$b1)), 'b' = c(0, 50, max(DF$b1)))
breaks
> $a
> 0 25 50 70 99.75
> $b
> 0 50 99.75
Then use unlist(list(some, factors)) (or in this case, unlist(lapply(...))), which neatly merges factors, keeping all the levels. (It's sorta magic; it's one of those built-in functionalities that's really not obvious.)
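As a small illustration of that merging behaviour, the sketch below builds two factors with different levels (f1 and f2 are illustrative names, not from the answer) and combines them with unlist():

```r
# Two factors produced by cut() with different break sets.
f1 <- cut(c(10, 30), breaks = c(0, 25, 50))
f2 <- cut(c(40, 90), breaks = c(0, 50, 100))

# unlist() on a list of factors returns a factor whose levels
# are the union of the input level sets.
merged <- unlist(list(f1, f2))
class(merged)   # "factor"
levels(merged)
```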
DF$c1 <- unlist(lapply(seq_along(breaks), function(x) {
  # cut the b1 values of one group with that group's breaks
  cut(DF[DF$a1 == names(breaks)[x], 'b1'],
      breaks = breaks[[x]],
      right = TRUE,
      include.lowest = TRUE)
}))
DF
> a1 b1 c1
> 1 a 10.15 [0,25]
> 2 a 25.10 (25,50]
> 3 a 32.40 (25,50]
> 4 a 56.70 (50,70]
> 5 a 89.02 (70,99.8]
> 6 b 90.50 (50,99.8]
> 7 b 78.53 (50,99.8]
> 8 b 98.12 (50,99.8]
> 9 b 34.30 [0,50]
> 10 b 99.75 (50,99.8]
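It's ultimately 2 lines of code, and should scale to a larger, more complicated data set. Note that the results are concatenated group by group, so the rows of DF must be sorted by a1 in the same order as names(breaks).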
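Because the unlist(lapply(...)) approach concatenates the results group by group, it relies on the row order of the data. If the rows might be interleaved, one possible variant is to assign by logical index instead. This is only a sketch on a small made-up data frame; it uses Inf as the upper break (an assumption, not the max(DF$b1) used above):

```r
# Toy data with interleaved groups (not the data from the question).
DF <- data.frame(a1 = c("b", "a", "b", "a"),
                 b1 = c(90.5, 10.15, 34.3, 56.7),
                 stringsAsFactors = FALSE)
breaks <- list(a = c(0, 25, 50, 70, Inf), b = c(0, 50, Inf))

# Fill c1 group by group; indexing by group keeps each label on its own row.
DF$c1 <- NA_character_
for (g in names(breaks)) {
  idx <- DF$a1 == g
  DF$c1[idx] <- as.character(cut(DF$b1[idx], breaks = breaks[[g]],
                                 right = TRUE, include.lowest = TRUE))
}
DF$c1 <- factor(DF$c1)
```

Groups missing from the breaks list are left as NA in c1, which makes them easy to spot.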
Answer 3:
This is not a direct answer to your question, but rather an alternative approach to the overall task.
Because you have "a large database with many groups [with] different cuts for each group", it seems to me that code with many nested ifelse calls may soon get quite messy. Perhaps it is a matter of taste, but I think the code would be easier to read and maintain if you store the breaks for each group in a separate table instead.
Here's how you might do it using data.table:
library(data.table)
dt_brk <- data.table(grp = c("a", "a", "a", "a", "a", "b", "b", "b"),
brk = c(0, 25, 50, 70, Inf, 0, 50, Inf))
Note that I use Inf as the upper limit of the breaks, rather than max(your-values).
We convert your data frame "DF" to a data.table using setDT. Then, for each level of "a1" (by = a1), we cut "b1", using breaks from "dt_brk" where "grp" equals "a1" (dt_brk[grp == a1, brk]).
setDT(DF)[, c1 := as.character(cut(b1, breaks = dt_brk[grp == a1, brk])), by = a1]
DF
# a1 b1 c1
# 1: a 10.15 (0,25]
# 2: a 25.10 (25,50]
# 3: a 32.40 (25,50]
# 4: a 56.70 (50,70]
# 5: a 89.02 (70,Inf]
# 6: b 90.50 (50,Inf]
# 7: b 78.53 (50,Inf]
# 8: b 98.12 (50,Inf]
# 9: b 34.30 (0,50]
# 10: b 99.75 (50,Inf]
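One difference from the other answers is that this cut() call does not set include.lowest, so the first interval is open on the left, as the (0,25] labels show. A minimal sketch with made-up values (x is illustrative) suggests what this means for a value equal to the lowest break:

```r
x <- c(0, 10, 99.75)

# Default: intervals are (a, b], so a value equal to the lowest break gets NA.
open_low <- cut(x, breaks = c(0, 25, Inf))
as.character(open_low)

# include.lowest = TRUE closes the first interval: [0,25].
closed_low <- cut(x, breaks = c(0, 25, Inf), include.lowest = TRUE)
as.character(closed_low)
```

If the data can contain values exactly at the lowest break, adding include.lowest = TRUE to the cut() call is probably the safer choice.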
Source: https://stackoverflow.com/questions/34965401/strange-behavior-between-functions-cut-and-ifelse-in-r