Strange behavior between functions cut and ifelse in R

Question

I am working in R with a data frame made up of a character variable and a numeric variable. My data frame DF looks like this (the dput() version is at the end of the question):

   a1    b1
1   a 10.15
2   a 25.10
3   a 32.40
4   a 56.70
5   a 89.02
6   b 90.50
7   b 78.53
8   b 98.12
9   b 34.30
10  b 99.75

In DF, a1 is a grouping variable and b1 is a numeric variable. Here is my dilemma: I want to create a new variable named c1 with the cut() function, using different breaks depending on the group stored in a1. To do this I combine ifelse() and cut() in the following code:

DF$c1=ifelse(DF$a1=="a",
                cut(DF$b1,breaks = c(0,25,50,70,max(DF$b1)),right = TRUE,include.lowest = TRUE),
                ifelse(DF$a1=="b",
                       cut(DF$b1,breaks = c(0,50,max(DF$b1)),right = TRUE,include.lowest = TRUE),NA))

The code runs without errors, but the new values in c1 are confusing. Instead of a factor with interval labels, c1 contains integers, and I get this result:

table(DF$c1,exclude=NULL)

   1    2    3    4 <NA> 
   2    6    1    1    0

Although I defined the breaks, the integers stored in c1 change the result, because codes from the two groups are counted together. This does not happen when I use cut() without ifelse(), but then I cannot apply different breaks per group. For example, the following code returns this result:

DF$c1=cut(DF$b1,breaks = c(0,25,50,70,max(DF$b1)),right = TRUE,include.lowest = TRUE)

table(DF$c1,exclude=NULL)

   [0,25]   (25,50]   (50,70] (70,99.8]      <NA> 
        1         3         1         5         0

I would like to know how to fix this interaction between ifelse() and cut(), because the returned integers produce differences in the final result. In this example a1 has only two groups, but my real data has many groups, which is why I combine the two functions to get different cuts for each group. The break values can also change, so writing the labels by hand would be tedious. Is it possible for the combination of these two functions to return the correct labels for each group (a factor) instead of integers? The dput() version of my data frame DF is:

DF<-structure(list(a1 = c("a", "a", "a", "a", "a", "b", "b", "b", 
"b", "b"), b1 = c(10.15, 25.1, 32.4, 56.7, 89.02, 90.5, 78.53, 
98.12, 34.3, 99.75)), .Names = c("a1", "b1"), row.names = c(NA, 
-10L), class = "data.frame")

Thanks for your help!

Answer 1:

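The problem is that each cut() call returns a factor, but ifelse() does not keep factor attributes: it builds its result from the logical test, so only the underlying integer codes survive. Because the two factors have different levels, those codes are also ambiguous. A solution is to wrap each cut() in as.character(), so that the labels rather than the codes pass through ifelse(), and then call factor() on the whole output: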

DF$c1=factor(ifelse(DF$a1=="a",
             as.character(cut(DF$b1,breaks = c(0,25,50,70,max(DF$b1)),right = TRUE,include.lowest = TRUE)),
             ifelse(DF$a1=="b",
                    as.character(cut(DF$b1,breaks = c(0,50,max(DF$b1)),right = TRUE,include.lowest = TRUE)),NA)))

DF

   a1    b1        c1
1   a 10.15    [0,25]
2   a 25.10   (25,50]
3   a 32.40   (25,50]
4   a 56.70   (50,70]
5   a 89.02 (70,99.8]
6   b 90.50 (50,99.8]
7   b 78.53 (50,99.8]
8   b 98.12 (50,99.8]
9   b 34.30    [0,50]
10  b 99.75 (50,99.8]
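
The coercion described above can be reproduced on a tiny example. This is only a sketch with made-up names (f, test, codes, labels are not from the question), meant to show that ifelse() hands back the integer codes of a factor, while character input keeps the labels:

```r
# A small factor; its levels are sorted alphabetically: "hi", "lo"
f <- factor(c("lo", "hi", "lo"))
test <- c(TRUE, FALSE, TRUE)

# ifelse() builds its result from `test`, so the factor attributes are lost
# and only the integer codes remain
codes <- ifelse(test, f, f)
class(codes)

# Converting to character first keeps the labels; factor() restores a factor
labels <- ifelse(test, as.character(f), as.character(f))
class(labels)
factor(labels)
```

Note that the integers appear even though both branches here share the same levels, so the loss of labels comes from ifelse() itself rather than from the levels differing.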

Answer 2:

@scoa is right; you're trying to combine two factors with different levels, so your results are getting coerced to integers and you're losing the levels. Here's another approach with a smaller form factor, which will be more scalable.

First, make a named list of all your breaks:

breaks <- list('a' = c(0, 25, 50, 70, max(DF$b1)), 'b' = c(0, 50, max(DF$b1)))
breaks

> $a
>     0 25 50 70 99.75 
> $b
>     0 50 99.75

Then use unlist(list(some, factors)) (or in this case, lapply), which neatly merges factors, keeping all the levels. (It's sorta magic; it's one of those built-in functionalities that's really not obvious.)

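# Assumes the rows of DF are sorted by a1 in the same order as names(breaks),
# because the merged factor is assigned back by position.
DF$c1 <- unlist(lapply(seq_along(breaks), 
                   function(x){cut(DF[DF$a1 == names(breaks)[x], 'b1'], 
                                   breaks = breaks[[x]], 
                                   right = TRUE, 
                                   include.lowest = TRUE)}
                   ))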
DF

>    a1    b1        c1
> 1   a 10.15    [0,25]
> 2   a 25.10   (25,50]
> 3   a 32.40   (25,50]
> 4   a 56.70   (50,70]
> 5   a 89.02 (70,99.8]
> 6   b 90.50 (50,99.8]
> 7   b 78.53 (50,99.8]
> 8   b 98.12 (50,99.8]
> 9   b 34.30    [0,50]
> 10  b 99.75 (50,99.8]

It's ultimately 2 lines of code, and should be robust on a larger, more complicated data set.
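
Two points about this approach can be checked with a short sketch. First, unlist() on a list of factors does return a factor whose levels are the union of the inputs. Second, the unlist(lapply(...)) result is assigned by position, so it lines up with DF only when the rows are already grouped in the order of names(breaks). The variant below is one possible order-independent alternative, not the answerer's method; the shuffled data frame and the names merged, g and idx are illustrative assumptions:

```r
# unlist() on a list of factors keeps the labels and merges the levels
merged <- unlist(list(factor(c("x", "y")), factor("z")))
levels(merged)

# Same values as in the question, but with the rows deliberately interleaved
DF <- data.frame(
  a1 = c("b", "a", "b", "a", "a", "b", "a", "b", "a", "b"),
  b1 = c(90.50, 10.15, 78.53, 25.10, 32.40, 98.12, 56.70, 34.30, 89.02, 99.75),
  stringsAsFactors = FALSE
)
breaks <- list(a = c(0, 25, 50, 70, max(DF$b1)), b = c(0, 50, max(DF$b1)))

# Fill c1 group by group through a logical index, so row order does not matter
DF$c1 <- NA_character_
for (g in names(breaks)) {
  idx <- DF$a1 == g
  DF$c1[idx] <- as.character(
    cut(DF$b1[idx], breaks = breaks[[g]], right = TRUE, include.lowest = TRUE)
  )
}
DF$c1 <- factor(DF$c1)
DF
```

Rows whose group has no entry in breaks simply stay NA, which mirrors the final NA branch of the nested ifelse() in the question.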

Answer 3:

This is not a direct answer to your question, but rather an alternative approach to the overall task.

Because you have "a large database with many groups [with] different cuts for each group", it seems to me that a code with many nested ifelse soon may get quite messy. Perhaps a matter of taste, but I think that the code would be easier to read and maintain if you store the breaks for each group in a separate table instead.

Here's how you might do it using data.table:

library(data.table)
dt_brk <- data.table(grp = c("a", "a", "a", "a", "a", "b", "b", "b"),
                     brk = c(0, 25, 50, 70, Inf, 0, 50, Inf))

Note that I use Inf as the upper limit of the breaks, rather than max(your-values).
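
One likely reason to prefer Inf, sketched here with a hypothetical later observation (new_value, brk_max and brk_inf are illustrative names, not from the question): a value above the original maximum falls outside data-derived breaks and becomes NA, whereas an open upper limit still catches it.

```r
brk_max <- c(0, 50, 99.75)  # upper limit taken from the current data
brk_inf <- c(0, 50, Inf)    # open-ended upper limit

new_value <- 120            # hypothetical value larger than anything seen so far

out_max <- cut(new_value, breaks = brk_max)  # falls outside the breaks
out_inf <- cut(new_value, breaks = brk_inf)  # lands in the last interval

is.na(out_max)
as.character(out_inf)
```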

We convert your data frame "DF" to a data.table using setDT. Then, for each level of "a1" (by = a1), we cut "b1", using breaks from "dt_brk", where "grp" equals "a1" (dt_brk[grp == a1, brk]).

setDT(DF)[, c1 := as.character(cut(b1, breaks = dt_brk[grp == a1, brk])), by = a1]

DF
#     a1    b1       c1
# 1:   a 10.15   (0,25]
# 2:   a 25.10  (25,50]
# 3:   a 32.40  (25,50]
# 4:   a 56.70  (50,70]
# 5:   a 89.02 (70,Inf]
# 6:   b 90.50 (50,Inf]
# 7:   b 78.53 (50,Inf]
# 8:   b 98.12 (50,Inf]
# 9:   b 34.30   (0,50]
# 10:  b 99.75 (50,Inf]

Source: https://stackoverflow.com/questions/34965401/strange-behavior-between-functions-cut-and-ifelse-in-r

Tags

r-factor