Question
I'm trying to follow this process with a dataset. Here is a test dataframe:
id <- c("Johnboy","Johnboy","Johnboy")
orderno <- c(2,2,1)
validorder <- c(0,1,1)
ordertype <- c(95,94,95)
orderdate <- as.Date(c("2019-06-17","2019-03-26","2018-08-23"))
df <- data.frame(id, orderno, validorder, ordertype, orderdate)
Then I do the following:
library(dplyr)
## compute order date for order types
df <- df %>%
  mutate(orderdate_dried = if_else(validorder == 1 & ordertype == 95,
                                   orderdate, as.Date(NA)),
         orderdate_fresh = if_else(validorder == 1 & ordertype == 94,
                                   orderdate, as.Date(NA)))
## take minimum order date by type by order number
df <- df %>%
  group_by(id, orderno) %>%
  mutate(orderdate_dried = min(orderdate_dried, na.rm = TRUE),
         orderdate_fresh = min(orderdate_fresh, na.rm = TRUE)) %>%
  ungroup()
## aggregate order date for each type over individual
df <- df %>%
  group_by(id) %>%
  mutate(max_orderdate_dried = max(orderdate_dried, na.rm = TRUE),
         max_orderdate_fresh = max(orderdate_fresh, na.rm = TRUE)) %>%
  ungroup()
But all the maximum dates at the end of this process are NA! I don't understand how. Further, if I test the original orderdate_dried for NAs:
is.na(df$orderdate_dried)
I get NAs for each row! How is this happening?!
Answer 1:
Very interesting question, and the answer is hidden in the question itself. For clarity, instead of overwriting the same df every time, I will use df1, df2, etc.
Let's first start with data.
id <- c("Johnboy","Johnboy","Johnboy")
orderno <- c(2,2,1)
validorder <- c(0,1,1)
ordertype <- c(95,94,95)
orderdate <- as.Date(c("2019-06-17","2019-03-26","2018-08-23"))
df <- data.frame(id, orderno, validorder, ordertype, orderdate)
library(dplyr)
Step 1 -
df1 <- df %>%
  mutate(orderdate_dried = if_else(validorder == 1 & ordertype == 95,
                                   orderdate, as.Date(NA)),
         orderdate_fresh = if_else(validorder == 1 & ordertype == 94,
                                   orderdate, as.Date(NA)))
df1
#       id orderno validorder ordertype  orderdate orderdate_dried orderdate_fresh
#1 Johnboy       2          0        95 2019-06-17            <NA>            <NA>
#2 Johnboy       2          1        94 2019-03-26            <NA>      2019-03-26
#3 Johnboy       1          1        95 2018-08-23      2018-08-23            <NA>
Everything as expected here.
Step 2 -
df2 <- df1 %>%
  group_by(id, orderno) %>%
  mutate(orderdate_dried = min(orderdate_dried, na.rm = TRUE),
         orderdate_fresh = min(orderdate_fresh, na.rm = TRUE)) %>%
  ungroup()
df2
# A tibble: 3 x 7
#  id      orderno validorder ordertype orderdate  orderdate_dried orderdate_fresh
#  <fct>     <dbl>      <dbl>     <dbl> <date>     <date>          <date>
#1 Johnboy       2          0        95 2019-06-17 NA              2019-03-26
#2 Johnboy       2          1        94 2019-03-26 NA              2019-03-26
#3 Johnboy       1          1        95 2018-08-23 2018-08-23      NA
Everything seems as expected here as well: a group appears to get NA when it contains no valid date of that type.
Step 3 -
df3 <- df2 %>%
  group_by(id) %>%
  mutate(max_orderdate_dried = max(orderdate_dried, na.rm = TRUE),
         max_orderdate_fresh = max(orderdate_fresh, na.rm = TRUE)) %>%
  ungroup()
df3
# A tibble: 3 x 9
#  id      orderno validorder ordertype orderdate  orderdate_dried orderdate_fresh max_orderdate_dried max_orderdate_fresh
#  <fct>     <dbl>      <dbl>     <dbl> <date>     <date>          <date>          <date>              <date>
#1 Johnboy       2          0        95 2019-06-17 NA              2019-03-26      NA                  NA
#2 Johnboy       2          1        94 2019-03-26 NA              2019-03-26      NA                  NA
#3 Johnboy       1          1        95 2018-08-23 2018-08-23      NA              NA                  NA
This is where things go wrong: both maximum columns are NA in every row. These are the same steps you performed and the same output you are getting, so nothing is different so far.
One thing we missed, though, is that step 2 produced a warning:
Warning messages:
1: In min.default(c(NA_real_, NA_real_), na.rm = TRUE) :
  no non-missing arguments to min; returning Inf
2: In min.default(NA_real_, na.rm = TRUE) :
  no non-missing arguments to min; returning Inf
Because some groups had no non-NA value, min returned Inf for them, even though the printed df2 shows NA (why Inf is displayed as NA is explained at the end of this answer). So testing those values with is.na returns FALSE:
is.na(df2$orderdate_dried)
#[1] FALSE FALSE FALSE
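Hence, max with na.rm = TRUE does not help either: there is no NA to remove, so it returns the infinite date, which again prints as NA.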
max(df2$orderdate_dried, na.rm = TRUE)
#[1] NA
Hence, you get all NAs in step 3.
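The hidden Inf can be made visible by stripping the Date class. The following is a minimal base R sketch, not part of the original pipeline: the vector x is a hand-built stand-in for df2$orderdate_dried (two infinite entries and one real date), so it runs without dplyr.

```r
# Stand-in (an assumption for illustration) for df2$orderdate_dried:
# two groups where min() returned Inf, one group with a real date.
x <- structure(c(Inf, Inf, as.numeric(as.Date("2018-08-23"))), class = "Date")

print(x)               # the infinite entries are displayed as NA
print(unclass(x))      # the underlying numbers, with Inf visible
print(is.na(x))        # FALSE FALSE FALSE: Inf is not a missing value
print(is.infinite(x))  # TRUE TRUE FALSE

# na.rm = TRUE has nothing to remove, so the maximum is the infinite date
m <- max(x, na.rm = TRUE)
print(unclass(m))
```

Calling unclass() on a suspicious date column is a quick way to tell a real NA apart from an infinite date that is merely displayed as NA.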
Solution
The solution is to keep only the finite values, using is.finite, before taking the maximum:
df3 <- df2 %>%
  group_by(id) %>%
  mutate(max_orderdate_dried = max(orderdate_dried[is.finite(orderdate_dried)], na.rm = TRUE),
         max_orderdate_fresh = max(orderdate_fresh[is.finite(orderdate_fresh)], na.rm = TRUE)) %>%
  ungroup()
df3
# A tibble: 3 x 9
#  id      orderno validorder ordertype orderdate  orderdate_dried orderdate_fresh max_orderdate_dried max_orderdate_fresh
#  <fct>     <dbl>      <dbl>     <dbl> <date>     <date>          <date>          <date>              <date>
#1 Johnboy       2          0        95 2019-06-17 NA              2019-03-26      2018-08-23          2019-03-26
#2 Johnboy       2          1        94 2019-03-26 NA              2019-03-26      2018-08-23          2019-03-26
#3 Johnboy       1          1        95 2018-08-23 2018-08-23      NA              2018-08-23          2019-03-26
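An alternative is to stop Inf from entering the column in the first place. The sketch below uses a hypothetical helper, safe_min (the name and the function are assumptions made for illustration, not part of dplyr or base R), which returns a real NA of the input's class when nothing remains after dropping the NAs.

```r
# Hypothetical helper: behaves like min(x, na.rm = TRUE), but returns a
# genuine NA (of the same class as x) instead of Inf for all-NA input.
safe_min <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) {
    return(x[NA_integer_])  # indexing with NA gives one NA of x's class
  }
  min(x)
}

dried <- as.Date(c(NA, NA))            # group with no valid dried order
fresh <- as.Date(c(NA, "2019-03-26"))  # group with one valid fresh order

print(safe_min(dried))         # a real missing Date, and no warning
print(is.na(safe_min(dried)))  # TRUE
print(safe_min(fresh))
```

Used in step 2 in place of min(orderdate_dried, na.rm = TRUE), a helper like this should leave real NAs in the column, so the original step 3 with na.rm = TRUE would no longer need the is.finite filter.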
Why does it show the value as NA when the value is Inf?
In step 2, what we are basically doing is
min(NA, na.rm = TRUE)
#[1] Inf
Warning message:
In min(NA, na.rm = TRUE) : no non-missing arguments to min; returning Inf
This returns Inf, along with the warning we saw. However, a column can hold values of only one class, and Inf is numeric:
class(Inf)
#[1] "numeric"
but the orderdate_dried column of df1 has class "Date":
class(df1$orderdate_dried)
#[1] "Date"
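so Inf is then coerced to class "Date", and the result is displayed as NA:
# origin is supplied because older R versions require it for numeric input
as.Date(min(NA, na.rm = TRUE), origin = "1970-01-01")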
#[1] NA
Again, this prints NA, but it is not a real NA, so is.na returns FALSE for it:
is.na(as.Date(min(NA, na.rm = TRUE), origin = "1970-01-01"))
#[1] FALSE
Hence, step 3 doesn't work as expected.
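Since the placeholder is an infinite Date rather than a missing one, another option is to repair a column after the fact by converting its infinite entries to real NAs. This is a base R sketch under the assumption that fresh stands in for df2$orderdate_fresh; it is not taken from the original answer.

```r
# Stand-in (an assumption for illustration) for df2$orderdate_fresh:
# two real dates and one infinite Date that is displayed as NA.
fresh <- structure(c(rep(as.numeric(as.Date("2019-03-26")), 2), Inf),
                   class = "Date")

fixed <- fresh
fixed[is.infinite(fixed)] <- as.Date(NA)  # replace Inf with a real NA

print(is.na(fixed))             # FALSE FALSE TRUE
print(max(fixed, na.rm = TRUE)) # na.rm now has a real NA to drop
```

After this repair, is.na and na.rm = TRUE behave as expected on the column.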
I hope this answer is clear and not too confusing.
Source: https://stackoverflow.com/questions/60632568/date-columns-with-nas-in-r-unexpected-behaviour-with-mutate