I started using the data.table package in R to boost the performance of my code. I am using the following code:
sp500 <- read.csv('../rawdata/GMTSP.csv')
days <- c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday")
As others mentioned, strptime (converting from character to POSIXlt) is the bottleneck here. Another simple solution uses the lubridate package and its fast_strptime method instead.
Here's what it looks like on my data:
> tables()
NAME NROW MB COLS
[1,] pp 3,718,339 126 session_id,date,user_id,path,num_sessions
KEY
[1,] user_id,date
Total: 126MB
> pp[, 2]
date
1: 2013-09-25
2: 2013-09-25
3: 2013-09-25
4: 2013-09-25
5: 2013-09-25
---
3718335: 2013-09-25
3718336: 2013-09-25
3718337: 2013-09-25
3718338: 2013-10-11
3718339: 2013-10-11
> system.time(pp[, date := as.Date(fast_strptime(date, "%Y-%m-%d"))])
user system elapsed
0.315 0.026 0.344
For comparison:
> system.time(pp[, date := as.Date(date, "%Y-%m-%d")])
user system elapsed
108.193 0.399 108.844
That's ~316 times faster!
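If you don't have my data to hand, here's a minimal self-contained sketch of the same comparison (the table and sizes are made up for illustration):
library(data.table)
library(lubridate)

# Made-up sample: three million ISO-formatted date strings
dt <- data.table(date = rep("2013-09-25", 3e6))

system.time(dt[, d_slow := as.Date(date, "%Y-%m-%d")])                 # goes via strptime
system.time(dt[, d_fast := as.Date(fast_strptime(date, "%Y-%m-%d"))])  # lubridate's C-level parser
identical(dt$d_slow, dt$d_fast)                                        # TRUE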
Thanks for the suggestions. I solved it by implementing the Gaussian weekday algorithm myself and got far better results; see below.
getWeekDay <- function(year, month, day) {
    # Gauss's algorithm for the day of the week: 0 = Sunday, ..., 6 = Saturday
    Y <- year
    Y[month < 3] <- (Y[month < 3] - 1)   # January and February belong to the previous year
    d <- day
    m <- ((month + 9) %% 12) + 1         # shifted month: March = 1, ..., February = 12
    c <- floor(Y / 100)                  # century part
    y <- Y - c * 100                     # year within the century
    dayofweek <- (d + floor(2.6 * m - 0.2) + y + floor(y / 4) + floor(c / 4) - 2 * c) %% 7
    return(dayofweek)
}
library(data.table)   # needed for data.table() and :=
sp500 <- read.csv('../rawdata/GMTSP.csv')
days <- c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday")
# Using data.table to make things much, much faster
sp500 <- data.table(sp500, key="Date")
sp500 <- sp500[,Month:=as.integer(substr(Date,1,2))]
sp500 <- sp500[,Day:=as.integer(substr(Date,4,5))]
sp500 <- sp500[,Year:=as.integer(substr(Date,7,10))]
#sp500 <- sp500[,Date:=as.Date(Date, "%m/%d/%Y")]
#sp500 <- sp500[,Weekday:=factor(weekdays(sp500[,Date]), levels=days, ordered=T)]
sp500 <- sp500[,Weekday:=factor(getWeekDay(Year, Month, Day))]
levels(sp500$Weekday) <- days
Running the whole block above (including reading the data from the csv) gives the timing below. data.table is truly impressive.
user system elapsed
19.074 0.803 20.284
The timing of the conversion itself is 3.49 seconds elapsed.
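As a quick sanity check of the algorithm (not part of the timings above, and assuming an English locale for weekdays()):
# Compare getWeekDay() against base R's weekdays() over a range of dates
d <- seq(as.Date("2000-01-01"), as.Date("2012-12-31"), by="day")
got <- getWeekDay(as.integer(format(d, "%Y")),
                  as.integer(format(d, "%m")),
                  as.integer(format(d, "%d")))
expected <- match(weekdays(d), days) - 1L   # names -> 0 = Sunday, ..., 6 = Saturday
all(got == expected)                        # TRUE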
This is an old question, but I think this tiny trick might be useful. If you have multiple rows with the same date, you can do
data[, date := as.Date(date[1]), by = date]
It's much faster since it only converts each distinct date once (on my dataset of 40 million rows it goes from 25 seconds to 0.5 seconds).
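For completeness, here's a self-contained sketch of the trick on generated data (the table name and sizes are made up; I assign to a new column so the grouped := doesn't have to change the existing column's type):
library(data.table)

# Made-up data: ten million rows drawn from ~4,400 distinct date strings
Range <- seq(as.Date("2000-01-01"), as.Date("2012-01-01"), by="days")
data <- data.table(date = format(sample(Range, 1e7, replace=TRUE)))

# Each distinct date string is converted only once per group
system.time(data[, date2 := as.Date(date[1]), by = date])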
I originally thought: "The argument to as.Date above does not have the format specified."
I now think: I assumed the Date value that you were keying on was in a standard format. I guess not. So you are doing two things: reformatting from character to Date format, and re-sorting based on the new values, which have a completely different collation sequence.
I think it's just that as.Date converts character to Date via POSIXlt, using strptime. And strptime is very slow, I believe.
To trace it through yourself, type as.Date, then methods(as.Date), then look at the character method.
> as.Date
function (x, ...)
UseMethod("as.Date")
<bytecode: 0x2cf4b20>
<environment: namespace:base>
> methods(as.Date)
[1] as.Date.character as.Date.date as.Date.dates as.Date.default
[5] as.Date.factor as.Date.IDate* as.Date.numeric as.Date.POSIXct
[9] as.Date.POSIXlt
Non-visible functions are asterisked
> as.Date.character
function (x, format = "", ...)
{
charToDate <- function(x) {
xx <- x[1L]
if (is.na(xx)) {
j <- 1L
while (is.na(xx) && (j <- j + 1L) <= length(x)) xx <- x[j]
if (is.na(xx))
f <- "%Y-%m-%d"
}
if (is.na(xx) || !is.na(strptime(xx, f <- "%Y-%m-%d",
tz = "GMT")) || !is.na(strptime(xx, f <- "%Y/%m/%d",
tz = "GMT")))
return(strptime(x, f))
stop("character string is not in a standard unambiguous format")
}
res <- if (missing(format))
charToDate(x)
else strptime(x, format, tz = "GMT") #### slow part, I think ####
as.Date(res)
}
<bytecode: 0x2cf6da0>
<environment: namespace:base>
>
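As a side note, charToDate means that when format is missing only the two guessed formats are accepted; a quick illustration:
as.Date("2013-09-25")   # fine: matches the guessed "%Y-%m-%d"
as.Date("2013/09/25")   # fine: matches "%Y/%m/%d"
as.Date("09/25/2013")   # Error: character string is not in a standard unambiguous format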
Why is as.POSIXlt(Date)$year+1900 relatively fast? Again, trace it through :
> as.POSIXct
function (x, tz = "", ...)
UseMethod("as.POSIXct")
<bytecode: 0x2936de8>
<environment: namespace:base>
> methods(as.POSIXct)
[1] as.POSIXct.date as.POSIXct.Date as.POSIXct.dates as.POSIXct.default
[5] as.POSIXct.IDate* as.POSIXct.ITime* as.POSIXct.numeric as.POSIXct.POSIXlt
Non-visible functions are asterisked
> as.POSIXlt.Date
function (x, ...)
{
y <- .Internal(Date2POSIXlt(x))
names(y$year) <- names(x)
y
}
<bytecode: 0x395e328>
<environment: namespace:base>
>
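To see how cheap that C-level path is, here's a quick timing sketch (sizes are made up; numbers will vary by machine):
# One .Internal(Date2POSIXlt(x)) call, no string parsing at all
d <- rep(as.Date("2013-09-25"), 3e6)
system.time(y <- as.POSIXlt(d)$year + 1900L)
head(y)   # 2013 2013 2013 ...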
Intrigued, let's dig into Date2POSIXlt. For this bit we need to grep src/main to know which .c file to look at.
~/R/Rtrunk/src/main$ grep Date2POSIXlt *
names.c:{"Date2POSIXlt",do_D2POSIXlt, 0, 11, 1, {PP_FUNCALL, PREC_FN, 0}},
$
Now we know we need to look for D2POSIXlt :
~/R/Rtrunk/src/main$ grep D2POSIXlt *
datetime.c:SEXP attribute_hidden do_D2POSIXlt(SEXP call, SEXP op, SEXP args, SEXP env)
names.c:{"Date2POSIXlt",do_D2POSIXlt, 0, 11, 1, {PP_FUNCALL, PREC_FN, 0}},
$
Oh, we could have guessed datetime.c. Anyway, so looking at latest live copy :
datetime.c
Search in there for D2POSIXlt and you'll see how simple it is to go from Date (numeric) to POSIXlt. You'll also see how POSIXlt is one real vector (8 bytes) plus eight integer vectors (4 bytes each). That's 40 bytes, per date!
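You can inspect those vectors from R as well (a quick check; exact attributes may vary by R version):
# Nine parallel vectors: sec is double, the other eight are integers
# (mon is 0-based, year counts from 1900)
str(unclass(as.POSIXlt(as.Date("2013-09-25"))))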
So the crux of the issue (I think) is why strptime is so slow, and maybe that can be improved in R. Or just avoid POSIXlt, either directly or indirectly.
Here's a reproducible example using the number of items stated in the question (3,000,000):
> Range = seq(as.Date("2000-01-01"),as.Date("2012-01-01"),by="days")
> Date = format(sample(Range,3000000,replace=TRUE),"%m/%d/%Y")
> system.time(as.Date(Date, "%m/%d/%Y"))
user system elapsed
21.681 0.060 21.760
> system.time(strptime(Date, "%m/%d/%Y"))
user system elapsed
29.594 8.633 38.270
> system.time(strptime(Date, "%m/%d/%Y", tz="GMT"))
user system elapsed
19.785 0.000 19.802
Passing tz appears to speed up strptime, which as.Date.character does. So maybe it depends on your locale. But strptime appears to be the culprit, not data.table. Perhaps rerun this example and see if it takes 90 seconds for you on your machine?
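If you want to avoid POSIXlt indirectly, one workaround on the same generated Date vector is to parse only the unique strings and index back (the same idea as the by-date trick in another answer; a sketch, assuming relatively few distinct dates):
ud <- unique(Date)                                              # ~4,400 distinct strings
system.time(Date2 <- as.Date(ud, "%m/%d/%Y")[match(Date, ud)])  # one strptime call per distinct string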