convert string date to R Date FAST for all dates

天涯浪人 2020-12-31 14:38

This has been asked several times with no clear answer: I would like to convert an R character string of the form "YYYY-mm-dd" into a Date. The as.Date<

5 Answers
  • 2020-12-31 14:47

    The function parse_date_time from the 'lubridate' package is quite fast too:

    library(date)
    library(lubridate)
    set.seed(21)
    x <- as.character(Sys.Date()-sample(40000, 1e6, TRUE))
    system.time(date1 <- as.Date(x))
    #  user  system elapsed 
    # 12.86    0.00   12.94 
    system.time(date2 <- as.Date(as.date(x,"ymd"))) # from package 'date'
    #  user  system elapsed 
    #  4.82    0.00    4.85 
    system.time(date3 <- as.Date(parse_date_time(x,'%y-%m-%d'))) # from package 'lubridate'
    #  user  system elapsed 
    #  0.27    0.00    0.26 
    all(date1 == date2)
    #  TRUE
    all(date1 == date3)
    #  TRUE
    
  • 2020-12-31 14:50

    Consider the very fast anytime package, which also copes with dates before 1970. It uses the Boost date_time C++ library and provides the functions anytime() and anydate() for conversions. Comparison:

    require(anytime)        #anydate()
    require(lubridate)      #parse_date_time()
    require(microbenchmark) #microbenchmark()
    
    set.seed(21)
    test.dd <- as.Date("2018-05-16") - sample(40000, 1e6, TRUE) # 1 million random dates (a Date vector, not character strings)
    
    microbenchmark(
        strptime(test.dd, "%Y-%m-%d"),                     #basic strptime
        parse_date_time(test.dd, orders = "ymd"),          #lubridate (POSIXct class)
        as.Date(parse_date_time(test.dd, orders = "ymd")), #lubridate + date class conversion
        anydate(test.dd),                                  #anytime library
        times = 10L, unit = "s"
    )
    

    Result/Output:

    Unit: seconds
                                                 expr          min           lq         mean       median           uq          max neval cld
                        strptime(test.dd, "%Y-%m-%d") 10.177406012 10.472527403 1.064532e+01 10.621221596 10.819156870 11.288330598    10   c
             parse_date_time(test.dd, orders = "ymd")  4.541542019  4.603663894 4.844961e+00  4.869800287  5.055844972  5.128409226    10  b 
    as.Date(parse_date_time(test.dd, orders = "ymd"))  4.461140695  4.568415584 4.867837e+00  4.739026273  5.080610126  5.532028490    10  b 
                                     anydate(test.dd)  0.000000755  0.000004909 5.777500e-06  0.000005664  0.000006042  0.000012839    10 a 
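
    One thing to check when reading these timings: test.dd is built by subtracting integers from a Date, so it is a Date vector rather than a vector of strings. The microsecond-scale anydate() figure is consistent with a function that receives Date input and has nothing to parse, so it probably does not measure string conversion. A minimal base-R sketch (the name test.chr is introduced here for illustration only) shows the class of the input and how to obtain character input for a like-for-like benchmark:

```r
set.seed(21)

# Same construction as in the benchmark, on a small sample
test.dd <- as.Date("2018-05-16") - sample(40000, 10, TRUE)
class(test.dd)    # Date minus integer is still a Date

# Hypothetical name: the same dates as "YYYY-mm-dd" strings
test.chr <- as.character(test.dd)
class(test.chr)
```

    Passing test.chr instead of test.dd to each expression would make every function parse strings; the timings above would then need to be measured again.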
    

    P.S. For working with time series, consider the flipTime package. It has all the required tools and is almost as fast as anytime for conversion purposes:

    require(devtools)
    install_github("Displayr/flipTime")
    
  • 2020-12-31 14:55

    I can get a little speedup by using the date package:

    library(date)
    set.seed(21)
    x <- as.character(Sys.Date()-sample(40000, 1e6, TRUE))
    system.time(dDate <- as.Date(x))
    #    user  system elapsed 
    #    6.54    0.01    6.56 
    system.time(ddate <- as.Date(as.date(x,"ymd")))
    #    user  system elapsed 
    #    3.42    0.22    3.64 
    

    You might want to look at the C code it uses and see if you can modify it to be faster for your specific situation.
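
    As one illustration of specializing for a single fixed format, here is a base-R sketch. It is not the date package's C code. It assumes every string is exactly "YYYY-mm-dd" and valid, cuts the fields out with substr(), and turns them into a day count with integer arithmetic (the usual days-from-civil calculation for the proleptic Gregorian calendar). The function name ymd_to_date is hypothetical:

```r
# Sketch: convert fixed-width "YYYY-mm-dd" strings to Date without a
# general-purpose parser. No validation is performed.
ymd_to_date <- function(x) {
  y <- as.integer(substr(x, 1L, 4L))
  m <- as.integer(substr(x, 6L, 7L))
  d <- as.integer(substr(x, 9L, 10L))

  # Treat January and February as months 11 and 12 of the previous year,
  # so the leap day falls at the end of the shifted year.
  y   <- y - (m <= 2L)
  era <- y %/% 400L                      # 400-year cycle (floor division)
  yoe <- y - era * 400L                  # year of era, 0..399
  mp  <- m + ifelse(m > 2L, -3L, 9L)     # shifted month, March = 0
  doy <- (153L * mp + 2L) %/% 5L + d - 1L                # day of shifted year
  doe <- yoe * 365L + yoe %/% 4L - yoe %/% 100L + doy    # day of era

  # 719468 is the day-count offset that places 1970-01-01 at day 0
  days <- era * 146097L + doe - 719468L
  structure(as.numeric(days), class = "Date")
}

x <- c("1970-01-01", "2000-02-29", "2020-12-31")
res <- ymd_to_date(x)
all(res == as.Date(x))
```

    Malformed input gives NA or wrong results, and whether this beats as.Date() on your data should be measured rather than assumed.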

  • 2020-12-31 15:07

    A further speedup: You already work with data.table. So, create a lookup table with your dates and merge them with your data.

    library(lubridate)
    library(data.table)
    
    y <- seq(as.Date('1900-01-01'), Sys.Date(), by = 'day')
    id.date <- data.table(id = as.character(y), date = as.Date(y), key = 'id')
    
    set.seed(21)
    x <- as.character(Sys.Date()-sample(40000, 1e6, TRUE))
    
    system.time(date3 <- as.Date(parse_date_time(x,'%y-%m-%d'))) # from package 'lubridate'
    #  user  system elapsed 
    #  0.15  0.00   0.15  
    
    system.time(date4 <- id.date[setDT(list(id = x)), on='id', date])
    #  user  system elapsed 
    #  0.08  0.00   0.08
    
    all(date3 == date4)
    # TRUE
    

    It's kind of a workaround, but I believe that's how data.table is intended to be used. I don't know whether the date/time packages mentioned above are internally based on parsing algorithms or also on lookup tables (hash tables).

    For larger datasets, whenever character manipulation is involved, which tends to be slow, I consider switching to a lookup against a reference table.
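
    The same lookup idea can be sketched in base R with match() in place of a data.table join. This is a swapped-in illustration, not the code timed above; the names lookup_dates and lookup_keys and the fixed end date are assumptions made for the example. Strings that are not in the table come back as NA:

```r
# Reference table: every day in the range we expect to see
lookup_dates <- seq(as.Date("1900-01-01"), as.Date("2020-12-31"), by = "day")
lookup_keys  <- as.character(lookup_dates)   # "YYYY-mm-dd" strings

# Last element lies outside the table on purpose
x <- c("2020-12-31", "1999-01-01", "1900-01-01", "1899-12-31")

# match() finds the position of each string in the keys;
# indexing the Date vector with those positions does the conversion
converted <- lookup_dates[match(x, lookup_keys)]
is.na(converted)
```

    The table has to cover the full range of the data, so it is worth checking for NA values after the lookup.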

  • 2020-12-31 15:14

    I had a similar problem a while ago and came up with the following solution:

    1. Convert the string to a factor (if not already a factor)
    2. Convert the levels of the factor to a Date
    3. Expand the converted levels to the solution using the index vector of the factor

    Extending Joshua Ulrich's example, I get (with slower timings on my laptop):

    library(date)
    set.seed(21)
    x <- as.character(Sys.Date()-sample(40000, 1e6, TRUE))
    system.time(dDate <- as.Date(x))
    #    user  system elapsed 
    #    12.09   0.00   12.12 
    system.time(ddate <- as.Date(as.date(x,"ymd")))
    #    user  system elapsed 
    #    6.97    0.04    7.05 
    system.time({
        xf <- as.factor(x)
        dDate <- as.Date(levels(xf))[as.integer(xf)]
    })
    #    user  system elapsed 
    #    1.16    0.00    1.15
    

    Here, the cost of step 2 depends on the number of distinct dates rather than on the length of x, so it stops growing once x is large enough to contain every distinct date, and step 3 scales extremely well (simple vector indexing). The bottleneck should be step 1, which can be avoided if the data is already stored as a factor.
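
    The three steps can be wrapped in a small helper. The sketch below uses hypothetical function names. The second function is a variant that swaps as.factor() for unique() and match(), which skips sorting the levels but is otherwise the same idea. Both assume the strings are in a format that as.Date() accepts by default:

```r
# Steps 1-3 as described: factor, convert levels, expand by integer codes
factor_as_date <- function(x) {
  xf <- as.factor(x)
  as.Date(levels(xf))[as.integer(xf)]
}

# Variant: convert each distinct string once, then map back with match()
unique_as_date <- function(x) {
  ux <- unique(x)
  as.Date(ux)[match(x, ux)]
}

set.seed(21)
x <- as.character(as.Date("2020-12-31") - sample(40000, 1e4, TRUE))

d1 <- factor_as_date(x)
d2 <- unique_as_date(x)
all(d1 == d2)
```

    Either way, as.Date() runs once per distinct value, so the gain depends on how many repeated dates the data contains.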
