Question
Please consider the following:
I recently 'discovered' the awesome plyr and dplyr packages and use those for analysing patient data that is available to me in a data frame. Such a data frame could look like this:
df <- data.frame(id = c(1, 1, 1, 2, 2), # patient ID
diag = c(rep("dia1", 3), rep("dia2", 2)), # diagnosis
age = c(7.8, NA, 7.9, NA, NA)) # patient age
I would like to compute the minimum age of each patient and then summarise those minima across all patients with a median and mean. I did the following:
min.age <- df %>%
group_by(id) %>%
summarise(min.age = min(age, na.rm = T))
Since there are NAs in the data frame I receive the warning:
`Warning message: In min(age, na.rm = T) :
no non-missing arguments to min; returning Inf`
With Inf I cannot call summary(df$min.age) in a meaningful way.
Using pmin() instead of min returned the error message:
Error in summarise_impl(.data, dots) :
Column 'min.age' must be length 1 (a summary value), not 3
What can I do to avoid any Inf and instead get NA so that I can further proceed with summary(df$min.age)?
Thanks a lot!
Answer 1:
You could use is.infinite() to detect the infinities and ifelse() to conditionally set them to NA.
#using your df and the dplyr package
min.age <-
df %>%
group_by(id) %>%
summarise(min.age = min(age, na.rm = T)) %>%
mutate(min.age = ifelse(is.infinite(min.age), NA, min.age))
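The same replacement can be tried on a plain vector without dplyr. This is only a minimal sketch: the vector min_age below is hypothetical stand-in data that mimics the per-patient minima, not output taken from the question.

```r
# Hypothetical per-patient minima, as produced when one group has no ages.
min_age <- c(7.8, Inf)

# Replace any infinite value with NA (this also catches -Inf).
min_age[is.infinite(min_age)] <- NA

print(min_age)
print(summary(min_age))  # summary() now reports the missing value separately
```

Indexing with is.infinite() does the same job as the ifelse() call above; which one to use is mostly a matter of taste.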
Answer 2:
Your code does the following:
- Splits the data frame into groups by id
- Applies the min function within each group to the age variable, with the na.rm=TRUE option enabled.
So for id of 1 you get min(c(7.8, NA, 7.9), na.rm=TRUE), which is the same as min(c(7.8, 7.9)), which is just 7.8.
Then, for id of 2 you get min(c(NA, NA), na.rm=TRUE), which is the same as min(c()).
Now, what is the minimum of an empty set of numbers? A minimum must be no larger than any value in the set, and it must satisfy the property that min(A) <= min(B) whenever B is a subset of A. The empty set is a subset of every set, so its minimum has to be at least as large as every other minimum. One way to define the minimum of the empty set is therefore to say it is "infinity", and that's how R treats the situation.
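The empty-set behaviour can be checked directly in base R. This is a small sketch; the variable names are made up for illustration, and suppressWarnings() is used only to keep the output quiet.

```r
# min() and max() of an empty numeric vector return Inf and -Inf
# (each call also emits a warning, silenced here).
empty_min <- suppressWarnings(min(numeric(0)))
empty_max <- suppressWarnings(max(numeric(0)))
print(empty_min)
print(empty_max)

# Inf acts as the identity for min(): combining it with any other
# minimum leaves that minimum unchanged.
x <- c(7.8, 7.9)
print(min(min(x), empty_min) == min(x))
```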
With min(age, na.rm = TRUE) alone you can't really avoid getting Inf in this situation. But you could add another mutate to your chain to change any Inf to whatever you like, such as NA.
df %>% group_by(id) %>% summarize(min_age = min(age, na.rm = TRUE)) %>%
mutate(min_age = ifelse(is.infinite(min_age), NA, min_age))
Answer 3:
(min.age <- df %>%
group_by(id) %>%
summarise(min.age = ifelse(all(is.na(age)),NA,min(age, na.rm = T))))
# A tibble: 2 x 2
id min.age
<dbl> <dbl>
1 1 7.8
2 2 NA
Answer 4:
An even simpler solution is the s() function from the hablar package. It replaces an empty vector with NA before it is evaluated in min/max. The code chunk by @awchisholm could be:
library(hablar)
min.age <- df %>%
group_by(id) %>%
summarise(min.age = min(s(age)))
Disclaimer: I am biased towards this solution since I authored the package.
Answer 5:
The question has been answered, but it is useful to point out that if the column in question is a Date or a datetime, then the Inf result will still appear to be an NA in the summary table, but actually isn't one. This is doubly confusing! Consider:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- data.frame(date = as.Date(c("2013-01-01", "2013-05-23", "", "2017-04-15", "", "")),
int = c(1L, 2L, NA, 4L, NA, NA),
group = rep(LETTERS[1:3],2))
s1 <- df %>%
  group_by(group) %>%
  summarise(min_date = min(date), min_int = min(int)) %>%
  mutate(min_date_missing = is.na(min_date), min_int_missing = is.na(min_int))
#> Warning: package 'bindrcpp' was built under R version 3.4.4
s2 <- df %>%
  group_by(group) %>%
  summarise(min_date = min(date, na.rm = TRUE), min_int = min(int, na.rm = TRUE)) %>%
  mutate(min_date_missing = is.na(min_date), min_int_missing = is.na(min_int))
df
#> date int group
#> 1 2013-01-01 1 A
#> 2 2013-05-23 2 B
#> 3 <NA> NA C
#> 4 2017-04-15 4 A
#> 5 <NA> NA B
#> 6 <NA> NA C
s1
#> # A tibble: 3 x 5
#> group min_date min_int min_date_missing min_int_missing
#> <fct> <date> <dbl> <lgl> <lgl>
#> 1 A 2013-01-01 1. FALSE FALSE
#> 2 B NA NA TRUE TRUE
#> 3 C NA NA TRUE TRUE
s2
#> # A tibble: 3 x 5
#> group min_date min_int min_date_missing min_int_missing
#> <fct> <date> <dbl> <lgl> <lgl>
#> 1 A 2013-01-01 1. FALSE FALSE
#> 2 B 2013-05-23 2. FALSE FALSE
#> 3 C NA Inf FALSE FALSE
s1[[3,2]]
#> [1] NA
s2[[3,2]]
#> [1] NA
is.na(s1[[3,2]])
#> [1] TRUE
is.na(s2[[3,2]])
#> [1] FALSE
s1[[3,2]] == Inf
#> [1] NA
s2[[3,2]] == Inf
#> [1] TRUE
s1[[3,3]]
#> [1] NA
s2[[3,3]]
#> [1] Inf
is.na(s1[[3,3]])
#> [1] TRUE
is.na(s2[[3,3]])
#> [1] FALSE
s1[[3,3]] == Inf
#> [1] NA
s2[[3,3]] == Inf
#> [1] TRUE
sessionInfo()
#> R version 3.4.3 (2017-11-30)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.5
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] bindrcpp_0.2.2 dplyr_0.7.4
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_0.12.17 utf8_1.1.3 crayon_1.3.4 digest_0.6.15
#> [5] rprojroot_1.3-2 assertthat_0.2.0 R6_2.2.2 backports_1.1.2
#> [9] magrittr_1.5 evaluate_0.10.1 pillar_1.2.1 cli_1.0.0
#> [13] rlang_0.2.0.9001 stringi_1.1.7 rmarkdown_1.9 tools_3.4.3
#> [17] stringr_1.3.0 glue_1.2.0 yaml_2.1.18 compiler_3.4.3
#> [21] pkgconfig_2.0.1 htmltools_0.3.6 bindr_0.1.1 knitr_1.20
#> [25] tibble_1.4.2
Created on 2018-06-27 by the reprex package (v0.2.0.9000).
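The Date pitfall shown above can also be reproduced without dplyr, together with one possible guard. This is a sketch under assumptions: safe_min_date is a hypothetical helper name, not part of any package, and the behaviour of the unguarded call is the one shown in the reprex above (R 3.4.3).

```r
# Two missing dates, like group C in the example above.
dates <- as.Date(c(NA, NA))

# Unguarded minimum: per the reprex above, it prints as NA but is not NA.
d <- suppressWarnings(min(dates, na.rm = TRUE))
print(is.na(d))

# Hypothetical guard: return a genuine NA date when nothing is observed.
safe_min_date <- function(x) {
  if (all(is.na(x))) as.Date(NA) else min(x, na.rm = TRUE)
}

print(is.na(safe_min_date(dates)))
print(safe_min_date(as.Date(c("2013-05-23", NA))))
```

With the guard, is.na() on the summarised column agrees with what is printed.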
Answer 6:
I prefer to choose my own invalid value. Say 200 will be the invalid value for age.
Now one can twist the use of the min function slightly, e.g. min(age, 200, na.rm = TRUE). This ensures that the age is shown as 200 instead of +Inf when all values are missing. The result on df will be:
min.age <- df %>%
group_by(id) %>%
summarise(min.age = min(age, 200, na.rm = T))
> min.age
# A tibble: 2 x 2
# id min.age
# <dbl> <dbl>
#1 1.00 7.80
#2 2.00 200
Now it's up to the programmer how they use/replace this invalid value.
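One way to use the sentinel afterwards is to map it back to NA. This is a minimal sketch, assuming 200 can never occur as a real age; min_with_sentinel is a hypothetical helper name introduced only for illustration.

```r
sentinel <- 200  # assumption: no real patient age reaches this value

# Hypothetical helper: take the minimum with the sentinel included,
# then translate the sentinel back to NA.
min_with_sentinel <- function(age) {
  m <- min(age, sentinel, na.rm = TRUE)  # never empty, so no warning
  if (m == sentinel) NA_real_ else m
}

print(min_with_sentinel(c(7.8, NA, 7.9)))
print(min_with_sentinel(c(NA, NA)))
```

The trade-off is that any genuine value equal to or above the sentinel would be silently turned into NA, so the sentinel has to be chosen with the data in mind.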
Answer 7:
This one seems interesting as it avoids the warning:
myMin <- function(vec) {
ifelse(length(vec[!is.na(vec)]) == 0, NA_real_, min(vec, na.rm = TRUE))
}
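A possible way to apply such a helper to the data from the question, sketched here in base R with tapply() instead of dplyr. The name safe_min is hypothetical, and the if/else form is a swapped-in variant of the ifelse() version above, not the answer's own code.

```r
# Variant of the helper above using if/else; min() is only called
# when at least one non-missing value exists, so no warning is raised.
safe_min <- function(vec) {
  if (all(is.na(vec))) NA_real_ else min(vec, na.rm = TRUE)
}

# The data frame from the question (diagnosis column omitted).
df <- data.frame(id = c(1, 1, 1, 2, 2),
                 age = c(7.8, NA, 7.9, NA, NA))

# Per-patient minimum age; patient 2 gets NA instead of Inf.
min_age <- tapply(df$age, df$id, safe_min)
print(min_age)
print(summary(as.vector(min_age)))
```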
Source: https://stackoverflow.com/questions/48342962/with-min-in-r-return-na-instead-of-inf