With min() in R return NA instead of Inf

Question

Please consider the following:

I recently 'discovered' the awesome plyr and dplyr packages and use those for analysing patient data that is available to me in a data frame. Such a data frame could look like this:

df <- data.frame(id = c(1, 1, 1, 2, 2), # patient ID
                 diag = c(rep("dia1", 3), rep("dia2", 2)), # diagnosis
                 age = c(7.8, NA, 7.9, NA, NA)) # patient age

I would like to compute the minimum age of each patient and then summarise those minimum ages with a median and mean. I did the following:

min.age <- df %>% 
  group_by(id) %>% 
  summarise(min.age = min(age, na.rm = T))

Since there are NAs in the data frame I receive the warning:

`Warning message: In min(age, na.rm = T) :
no non-missing arguments to min; returning Inf`

With Inf I cannot call summary(df$min.age) in a meaningful way.

Using pmin() instead of min returned the error message:

Error in summarise_impl(.data, dots) :
 Column 'min.age' must be length 1 (a summary value), not 3

What can I do to avoid any Inf and instead get NA so that I can further proceed with: summary(df$min.age)?

Thanks a lot!

Answer 1:

You could use is.infinite() to detect the infinities and ifelse to conditionally set them to NA.

#using your df and the dplyr package
min.age <- 
  df %>% 
  group_by(id) %>% 
  summarise(min.age = min(age, na.rm = T)) %>%
  mutate(min.age = ifelse(is.infinite(min.age), NA, min.age))
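
To close the loop with the question, here is a minimal self-contained sketch of the same pipeline followed by summary(). The suppressWarnings() wrapper is an addition, not part of the answer; it only silences the "returning Inf" warning that min() still emits before the mutate step runs.

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 2, 2),                     # patient ID
                 diag = c(rep("dia1", 3), rep("dia2", 2)),  # diagnosis
                 age = c(7.8, NA, 7.9, NA, NA))             # patient age

min.age <- df %>%
  group_by(id) %>%
  # min() still returns Inf for the all-NA group; the warning is silenced here
  summarise(min.age = suppressWarnings(min(age, na.rm = TRUE))) %>%
  # convert the Inf produced for the all-NA group into NA
  mutate(min.age = ifelse(is.infinite(min.age), NA, min.age))

# summary() now reports the missing value separately instead of using Inf
summary(min.age$min.age)
```

Note that the summary is taken on min.age$min.age, the column of the new summary table, rather than on the original df.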

Answer 2:

Your code does the following:

1. Splits the data frame into groups by id.
2. Applies the min function to the age variable within each group, with the na.rm=TRUE option enabled.

So for id of 1 you get min(c(7.8, NA, 7.9), na.rm=TRUE), which is the same as min(c(7.8, 7.9)) which is just 7.8.

Then, for id of 2 you get min(c(NA, NA), na.rm=TRUE), which is the same as min(c()).

Now, what is the minimum of an empty set of numbers? The "minimum" is a value no larger than any value in the set, and it must satisfy the property that min(A) <= min(B) whenever B is a subset of A. Since the empty set is a subset of every set, its minimum has to be at least as large as every possible minimum. One way to satisfy this is to define the minimum of the empty set as "infinity", and that's how R treats the situation.
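
The reasoning can be checked directly in base R. The following is a minimal sketch; the vectors A and B and the name empty_min are illustrative and not taken from the question.

```r
# min() of an empty numeric vector returns Inf (with a warning, silenced here)
empty_min <- suppressWarnings(min(numeric(0)))
is.infinite(empty_min)

# Subset property: B is a subset of A, so min(A) <= min(B)
A <- c(7.8, 7.9, 9.1)
B <- c(7.9, 9.1)
min(A) <= min(B)

# The empty vector is a subset of A as well; Inf keeps the property true
min(A) <= empty_min

# Inf also acts as the identity when partial minima are combined
min(min(A), empty_min) == min(A)
```

Each of the comparisons evaluates to TRUE, which is why Inf is a consistent answer even though it is inconvenient for later summaries.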

You can't really avoid getting Inf in this situation. But you could add another mutate to your chain to change any Inf to whatever you like, such as NA.

df %>% group_by(id) %>% summarize(min_age = min(age, na.rm = TRUE)) %>% 
    mutate(min_age = ifelse(is.infinite(min_age), NA, min_age))

Answer 3:

(min.age <- df %>% 
    group_by(id) %>% 
    summarise(min.age = ifelse(all(is.na(age)),NA,min(age, na.rm = T))))
# A tibble: 2 x 2
     id min.age
  <dbl>   <dbl>
1     1     7.8
2     2      NA

Answer 4:

An even simpler solution is the s function from the hablar package. It replaces an empty vector with NA before it is evaluated by min/max. The code chunk by @awchisholm could be:

library(hablar)

min.age <- df %>% 
  group_by(id) %>% 
  summarise(min.age = min(s(age)))

Disclaimer: I am biased toward this solution since I authored the package.

Answer 5:

The question has been answered, but it is useful to point out that if the column in question is a Date or a datetime, then it will still appear to be an NA in the summary table, but actually isn't. This is doubly confusing! Consider:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- data.frame(date = as.Date(c("2013-01-01", "2013-05-23", "", "2017-04-15", "", "")),
                 int = c(1L, 2L, NA, 4L, NA, NA),
                 group = rep(LETTERS[1:3],2))

s1 <- df %>%
  group_by(group) %>%
  summarise(min_date = min(date), min_int = min(int)) %>%
  mutate(min_date_missing = is.na(min_date), min_int_missing = is.na(min_int))
#> Warning: package 'bindrcpp' was built under R version 3.4.4
s2 <- df %>%
  group_by(group) %>%
  summarise(min_date = min(date, na.rm = TRUE), min_int = min(int, na.rm = TRUE)) %>%
  mutate(min_date_missing = is.na(min_date), min_int_missing = is.na(min_int))

df
#>         date int group
#> 1 2013-01-01   1     A
#> 2 2013-05-23   2     B
#> 3       <NA>  NA     C
#> 4 2017-04-15   4     A
#> 5       <NA>  NA     B
#> 6       <NA>  NA     C
s1
#> # A tibble: 3 x 5
#>   group min_date   min_int min_date_missing min_int_missing
#>   <fct> <date>       <dbl> <lgl>            <lgl>          
#> 1 A     2013-01-01      1. FALSE            FALSE          
#> 2 B     NA             NA  TRUE             TRUE           
#> 3 C     NA             NA  TRUE             TRUE
s2
#> # A tibble: 3 x 5
#>   group min_date   min_int min_date_missing min_int_missing
#>   <fct> <date>       <dbl> <lgl>            <lgl>          
#> 1 A     2013-01-01      1. FALSE            FALSE          
#> 2 B     2013-05-23      2. FALSE            FALSE          
#> 3 C     NA            Inf  FALSE            FALSE

s1[[3,2]]
#> [1] NA
s2[[3,2]]
#> [1] NA

is.na(s1[[3,2]])
#> [1] TRUE
is.na(s2[[3,2]])
#> [1] FALSE

s1[[3,2]] == Inf
#> [1] NA
s2[[3,2]] == Inf
#> [1] TRUE

s1[[3,3]]
#> [1] NA
s2[[3,3]]
#> [1] Inf

is.na(s1[[3,3]])
#> [1] TRUE
is.na(s2[[3,3]])
#> [1] FALSE

s1[[3,3]] == Inf
#> [1] NA
s2[[3,3]] == Inf
#> [1] TRUE

sessionInfo()
#> R version 3.4.3 (2017-11-30)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.5
#> 
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] bindrcpp_0.2.2 dplyr_0.7.4   
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_0.12.17     utf8_1.1.3       crayon_1.3.4     digest_0.6.15   
#>  [5] rprojroot_1.3-2  assertthat_0.2.0 R6_2.2.2         backports_1.1.2 
#>  [9] magrittr_1.5     evaluate_0.10.1  pillar_1.2.1     cli_1.0.0       
#> [13] rlang_0.2.0.9001 stringi_1.1.7    rmarkdown_1.9    tools_3.4.3     
#> [17] stringr_1.3.0    glue_1.2.0       yaml_2.1.18      compiler_3.4.3  
#> [21] pkgconfig_2.0.1  htmltools_0.3.6  bindr_0.1.1      knitr_1.20      
#> [25] tibble_1.4.2

Created on 2018-06-27 by the reprex package (v0.2.0.9000).
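
One way to handle the Date case is to test the underlying number rather than the printed value. The sketch below is an illustration under that assumption; inf_to_na is a hypothetical helper name, not a function from dplyr or base R.

```r
# Hypothetical helper: replace infinite values with NA while keeping the class
inf_to_na <- function(x) {
  x[is.infinite(unclass(x))] <- NA
  x
}

# A group in which every date is missing
all_missing <- as.Date(c(NA, NA))
d <- suppressWarnings(min(all_missing, na.rm = TRUE))

# The printed value looks like NA, but the number underneath is infinite
is.infinite(unclass(d))

# After the conversion the value is a genuine missing Date
d_fixed <- inf_to_na(d)
is.na(d_fixed)
class(d_fixed)
```

Because unclass() leaves plain numeric vectors unchanged, the same helper can be applied to both the date and the integer columns in a mutate step.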

Answer 6:

I prefer to choose my own invalid value. Say 200 is treated as an invalid value for age.

Now one can twist the use of the min function slightly, e.g. min(age, 200, na.rm = TRUE). This ensures that the age is shown as 200 instead of +Inf when all values are missing. The result on df will be:

min.age <- df %>% 
  group_by(id) %>% 
  summarise(min.age = min(age, 200, na.rm = T))

> min.age
# A tibble: 2 x 2
#     id min.age
#  <dbl>   <dbl>
#1  1.00    7.80
#2  2.00  200

Now it's up to the programmer how to use or replace this invalid value.

Answer 7:
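
As a minimal base R sketch of one such follow-up, the sentinel can be mapped back to NA once the minima are computed. The names sentinel and ages_by_id are illustrative, and the approach assumes the sentinel is larger than any real age in the data; otherwise it would win the min() and overwrite a genuine value.

```r
sentinel <- 200  # assumed to exceed any plausible age in the data

# Ages grouped by patient ID, mirroring the df from the question
ages_by_id <- list(`1` = c(7.8, NA, 7.9), `2` = c(NA, NA))

# Passing the sentinel as an extra argument means min() never sees an empty
# set, so no warning is raised for the all-NA group
min_age <- sapply(ages_by_id, function(a) min(a, sentinel, na.rm = TRUE))

# Map the sentinel back to NA before computing summaries
min_age[min_age == sentinel] <- NA
summary(min_age)
```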

This one seems interesting as it avoids the warning:

myMin <- function(vec) {
  ifelse(length(vec[!is.na(vec)]) == 0, NA_real_, min(vec, na.rm = TRUE))
}
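
A short usage sketch follows, in base R so that it runs without extra packages. It applies myMin to the ages from the question and compares it with a plain if/else variant; safe_min is a hypothetical name introduced here, not part of the answer. Setting options(warn = 2) turns any warning into an error, so the calls completing is itself evidence that no warning was raised.

```r
# myMin as given in the answer
myMin <- function(vec) {
  ifelse(length(vec[!is.na(vec)]) == 0, NA_real_, min(vec, na.rm = TRUE))
}

# Hypothetical equivalent using a scalar if/else
safe_min <- function(vec) {
  if (all(is.na(vec))) NA_real_ else min(vec, na.rm = TRUE)
}

id  <- c(1, 1, 1, 2, 2)
age <- c(7.8, NA, 7.9, NA, NA)

old <- options(warn = 2)  # any warning would now stop execution
res_ifelse <- sapply(split(age, id), myMin)
res_if     <- sapply(split(age, id), safe_min)
options(old)              # restore the previous warning level

res_ifelse
```

Inside summarise() either function can be called in place of min(age, na.rm = TRUE). The if/else form makes it explicit that min() is only evaluated when at least one non-missing value exists.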

Source: https://stackoverflow.com/questions/48342962/with-min-in-r-return-na-instead-of-inf

Tags

dplyr

plyr

min