Date columns with NAs in R - unexpected behaviour with mutate

Posted by 醉酒当歌 on 2020-08-27 19:55:04

Question


I'm trying to follow this process with a dataset. Here is a test dataframe:

id <- c("Johnboy","Johnboy","Johnboy")
orderno <- c(2,2,1)
validorder <- c(0,1,1)
ordertype <- c(95,94,95)
orderdate <- as.Date(c("2019-06-17","2019-03-26","2018-08-23"))

df <- data.frame(id, orderno, validorder, ordertype, orderdate)

Then I do the following:

library(dplyr)

## compute order date for order types
df <- df %>%
  mutate(orderdate_dried = if_else(validorder == 1 &
                                  ordertype == 95,
                                  orderdate, as.Date(NA)),
         orderdate_fresh = if_else(validorder == 1 &
                                  ordertype == 94,
                                  orderdate, as.Date(NA)))

## take minimum order date by type by order number
df <- df %>%
  group_by(id, orderno) %>%
  mutate(orderdate_dried = min(orderdate_dried, na.rm = TRUE),
         orderdate_fresh = min(orderdate_fresh, na.rm = TRUE)) %>%
  ungroup()

## aggregate order date for each type over individual
df <- df %>%
  group_by(id) %>%
  mutate(max_orderdate_dried = max(orderdate_dried, na.rm=TRUE),
         max_orderdate_fresh = max(orderdate_fresh, na.rm=TRUE)) %>%
  ungroup()

But all the maximum dates at the end of this process are NA, and I don't understand why. Further, if I test the original orderdate_dried for NAs:

is.na(df$orderdate_dried)

I get NAs for each row! How is this happening?!


Answer 1:


Very interesting question, and the answer is hidden in the question itself. For clarity, instead of overwriting the same df every time, I will use df1, df2, and so on.

Let's first start with data.

id <- c("Johnboy","Johnboy","Johnboy")
orderno <- c(2,2,1)
validorder <- c(0,1,1)
ordertype <- c(95,94,95)
orderdate <- as.Date(c("2019-06-17","2019-03-26","2018-08-23"))

df <- data.frame(id, orderno, validorder, ordertype, orderdate)

library(dplyr)

Step 1 -

df1 <- df %>%
        mutate(orderdate_dried = if_else(validorder == 1 &
                                         ordertype == 95,
                                        orderdate, as.Date(NA)),
               orderdate_fresh = if_else(validorder == 1 &
                                         ordertype == 94,
                                         orderdate, as.Date(NA)))

df1
#       id orderno validorder ordertype  orderdate orderdate_dried orderdate_fresh
#1 Johnboy       2          0        95 2019-06-17            <NA>            <NA>
#2 Johnboy       2          1        94 2019-03-26            <NA>      2019-03-26
#3 Johnboy       1          1        95 2018-08-23      2018-08-23            <NA>

Everything as expected here.

Step 2 -

df2 <- df1 %>%
        group_by(id, orderno) %>%
        mutate(orderdate_dried = min(orderdate_dried, na.rm = TRUE),
                orderdate_fresh = min(orderdate_fresh, na.rm = TRUE)) %>%
        ungroup()

df2
# A tibble: 3 x 7
#  id      orderno validorder ordertype orderdate  orderdate_dried orderdate_fresh
#  <fct>     <dbl>      <dbl>     <dbl> <date>     <date>          <date>         
#1 Johnboy       2          0        95 2019-06-17 NA              2019-03-26     
#2 Johnboy       2          1        94 2019-03-26 NA              2019-03-26     
#3 Johnboy       1          1        95 2018-08-23 2018-08-23      NA           

Everything seems as expected here as well: a group with no non-missing date is displayed as NA.

Step 3 -

df3 <- df2 %>%
        group_by(id) %>%
        mutate(max_orderdate_dried = max(orderdate_dried, na.rm=TRUE),
               max_orderdate_fresh = max(orderdate_fresh, na.rm=TRUE)) %>%
         ungroup()

df3
# A tibble: 3 x 9
#  id      orderno validorder ordertype orderdate  orderdate_dried orderdate_fresh max_orderdate_dried max_orderdate_fresh
#  <fct>     <dbl>      <dbl>     <dbl> <date>     <date>          <date>          <date>              <date>
#1 Johnboy       2          0        95 2019-06-17 NA              2019-03-26      NA                  NA                 
#2 Johnboy       2          1        94 2019-03-26 NA              2019-03-26      NA                  NA                 
#3 Johnboy       1          1        95 2018-08-23 2018-08-23      NA              NA                  NA    

Here the result is clearly wrong: both maximum columns show NA in every row. These are the same steps you performed and this is the same output you are getting, so we haven't done anything different so far.

One thing we have overlooked, though, is that step 2 produced a warning:

Warning messages:
1: In min.default(c(NA_real_, NA_real_), na.rm = TRUE) :
  no non-missing arguments to min; returning Inf
2: In min.default(NA_real_, na.rm = TRUE) :
  no non-missing arguments to min; returning Inf

Because those groups contain no non-NA value, min() returned Inf for them, even though the printed df2 shows NA (why it is displayed as NA is explained at the end of this answer). So an is.na() test on the column does not flag them:

is.na(df2$orderdate_dried)
#[1] FALSE FALSE FALSE

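For the same reason, max() with na.rm = TRUE does not skip these values: Inf is not NA, so it is kept and wins as the maximum.

max(df2$orderdate_dried, na.rm = TRUE)
#[1] NA

That is why every maximum in step 3 is displayed as NA.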
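The behaviour can be reproduced without dplyr or the data frame. The following is a minimal base R sketch; the variable names all_missing and m are only illustrative, and suppressWarnings() is used just to keep the output quiet.

```r
# An all-NA Date vector, like orderdate_dried within a group that has no valid order.
all_missing <- as.Date(c(NA, NA))

# min() warns and returns Inf; suppressWarnings() only hides the warning here.
m <- suppressWarnings(min(all_missing, na.rm = TRUE))

class(m)        # still "Date"
unclass(m)      # the underlying number is Inf, whatever print(m) displays
is.na(m)        # FALSE: it is not a missing value
is.infinite(m)  # TRUE
```

Looking at unclass() of a suspicious Date column is a quick way to tell a real NA from an infinite value.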


Solution

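The solution is to keep only the finite values before taking the maximum. is.finite() returns FALSE for NA as well as for Inf and -Inf, so the subset contains only real dates. (A group with no finite date at all would still give -Inf with a warning.)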

df3 <- df2 %>%
        group_by(id) %>%
        mutate(max_orderdate_dried = max(orderdate_dried[is.finite(orderdate_dried)], na.rm = TRUE),
               max_orderdate_fresh = max(orderdate_fresh[is.finite(orderdate_fresh)], na.rm = TRUE)) %>%
        ungroup()


df3
# A tibble: 3 x 9
#  id      orderno validorder ordertype orderdate  orderdate_dried orderdate_fresh max_orderdate_dried max_orderdate_fresh
#  <fct>     <dbl>      <dbl>     <dbl> <date>     <date>          <date>          <date>              <date>             
#1 Johnboy       2          0        95 2019-06-17 NA              2019-03-26      2018-08-23          2019-03-26         
#2 Johnboy       2          1        94 2019-03-26 NA              2019-03-26      2018-08-23          2019-03-26         
#3 Johnboy       1          1        95 2018-08-23 2018-08-23      NA              2018-08-23          2019-03-26   
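An alternative is to stop the Inf from being created in step 2 in the first place. The sketch below assumes a small wrapper around min(); min_date_or_na is a hypothetical helper name, not a function from dplyr or base R.

```r
# Hypothetical helper (not part of dplyr or base R): minimum of a Date vector
# that yields a true NA, rather than Inf, when every value is missing.
min_date_or_na <- function(x) {
  x <- x[!is.na(x)]                        # drop missing values ourselves
  if (length(x) == 0) as.Date(NA) else min(x)
}

all_missing <- as.Date(c(NA, NA))
some_dates  <- as.Date(c("2019-03-26", NA))

is.na(min_date_or_na(all_missing))  # TRUE: a genuine NA, and no warning
min_date_or_na(some_dates)          # the one non-missing date
```

In step 2 this would be used as mutate(orderdate_dried = min_date_or_na(orderdate_dried), ...). Since the column then holds real NAs, the original step 3 with na.rm = TRUE should work unchanged as long as each id has at least one date per type.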

Why is the value displayed as NA when it is actually Inf?

In step 2, what we are basically doing is

min(NA, na.rm = TRUE)
#[1] Inf

Warning message:
In min(NA, na.rm = TRUE) : no non-missing arguments to min; returning Inf

So min() returns Inf and emits the warning we saw in step 2.

However, a column can hold values of only one class, and Inf is numeric:

class(Inf)
#[1] "numeric"

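But the orderdate_dried column in df1 holds data of class "Date":

class(df1$orderdate_dried)
#[1] "Date"

min() on a Date vector works on the underlying day counts and then puts the "Date" class back on the result, so we end up with a Date whose underlying value is Inf. Converting Inf to a Date directly gives the same thing:

as.Date(min(NA, na.rm = TRUE), origin = "1970-01-01")
#[1] NA

This is displayed as NA, but it is not a real NA, so is.na() returns FALSE for it:

is.na(as.Date(min(NA, na.rm = TRUE), origin = "1970-01-01"))
#[1] FALSE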

Hence, step 3 doesn't work as expected.
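If the infinite values are already in a column, another option is to turn them back into real NAs so that is.na() and na.rm = TRUE behave as expected afterwards. This is a minimal base R sketch on a toy vector x (an illustrative name), not the original data frame.

```r
# Three kinds of value that can end up in a Date column:
# a real date, a true NA, and the Inf left behind by min() on an all-NA group.
x <- c(as.Date("2018-08-23"),
       as.Date(NA),
       suppressWarnings(min(as.Date(NA), na.rm = TRUE)))

is.na(x)      # only the true NA is flagged
is.finite(x)  # only the real date is flagged

# Replace the infinite placeholders with true NAs.
x[is.infinite(x)] <- NA

# Now na.rm = TRUE skips both missing entries and the real date is returned.
max(x, na.rm = TRUE)
```

The same replacement could be applied to orderdate_dried and orderdate_fresh between step 2 and step 3.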

I hope this answer is clear and not too confusing.



Source: https://stackoverflow.com/questions/60632568/date-columns-with-nas-in-r-unexpected-behaviour-with-mutate
