Question
I'm trying to prepare some demographic data retrieved from Eurostat for further processing; among other things, I want to replace any missing data with approximated values.
First I was using data.frames only, but then I got convinced that data.tables might offer some advantages over regular data.frames, so I migrated to data.tables.
One thing I've observed while doing so is that I get different results when using na.spline in combination with apply versus calling na.spline as part of the data.table.
#1 source data
library(data.table)
library(zoo)   # provides na.spline()

(dt0 <- data.table(
  "age,sex,geo\\time" = c("TOTAL,F,AD", "TOTAL,F,AL", "TOTAL,F,AM", "TOTAL,F,AT", "TOTAL,F,AZ"),
  "2014" = c(NA, NA, NA, 4351253, NA),
  "2013" = c(37408, NA, NA, 4328238, 4707690),
  "2012" = c(38252, NA, 1684000, 4309977, 4651601),
  "2011" = c(38252, 1409931, 1679066, 4296293, 4594023),
  "2010" = c(40296, NA, 1673656, 4285442, 4542083)
))
Which generates
age,sex,geo\\time 2014 2013 2012 2011 2010
1: TOTAL,F,AD NA 37408 38252 38252 40296
2: TOTAL,F,AL NA NA NA 1409931 NA
3: TOTAL,F,AM NA NA 1684000 1679066 1673656
4: TOTAL,F,AT 4351253 4328238 4309977 4296293 4285442
5: TOTAL,F,AZ NA 4707690 4651601 4594023 4542083
Split into "row headings"...
(dt0a <- dt0[, 1, with=FALSE])
(cn0a <- colnames(dt0a))
...and population data...
(dt0b <- dt0[, 2:ncol(dt0), with=FALSE])
(cn0b <- colnames(dt0b))
#2 replace NAs by "na.spline" & "apply"
(dt1b <- data.table(t(apply(dt0b, 1, na.spline, na.rm=FALSE))))
(setnames(dt1b, cn0b))
(dt1 <- cbind(dt0a, dt1b))
Which results in...
age,sex,geo\\time 2014 2013 2012 2011 2010
1: TOTAL,F,AD 32832 37408 38252 38252 40296
2: TOTAL,F,AL 1409931 1409931 1409931 1409931 1409931
3: TOTAL,F,AM 1692440 1688458 1684000 1679066 1673656
4: TOTAL,F,AT 4351253 4328238 4309977 4296293 4285442
5: TOTAL,F,AZ 4755163 4707690 4651601 4594023 4542083
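As an optional sanity check (an added sketch, not part of the original post): na.spline should only fill the NAs and pass the observed values through unchanged, which can be verified against the source data:
m0 <- as.matrix(dt0b)
m1 <- as.matrix(dt1b)
all(m1[!is.na(m0)] == m0[!is.na(m0)])   # should be TRUE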
#3 replace NAs within "data.table"
(dt2b <- dt0b[,lapply(.SD, na.spline, na.rm=FALSE)])
(dt2 <- cbind(dt0a, dt2b))
Et voila...
age,sex,geo\\time 2014 2013 2012 2011 2010
1: TOTAL,F,AD 4351253 37408 38252.0 38252 40296.0
2: TOTAL,F,AL 4351253 1993097 -611513.8 1409931 -629363.2
3: TOTAL,F,AM 4351253 3423374 1684000.0 1679066 1673656.0
4: TOTAL,F,AT 4351253 4328238 4309977.0 4296293 4285442.0
5: TOTAL,F,AZ 4351253 4707690 4651601.0 4594023 4542083.0
#4 compare results
(identical(dt1, dt2))
Considering the aforementioned, this is not quite a surprise...
[1] FALSE
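For a value-level comparison rather than identical()'s strict structural check, a small added sketch (not in the original post) reports which columns agree:
mapply(function(a, b) isTRUE(all.equal(a, b)), dt1, dt2)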
(The values calculated as replacements for the NAs in approach #2 are the ones I'm interested in; I just want to generate them via approach #3.)
The reason for pursuing the data.table route (approach #3) is performance: as has been pointed out in various posts, apply coerces the data to a matrix and works on that, which takes considerably longer than a corresponding approach using data.table only.
Being pretty new to R, I reckon I've done something quite fundamentally wrong; the only thing is, I haven't got the faintest idea what that might be.
Any help pointing me in the right direction is more than appreciated!
-Sil68
Answer 1:
Using a matrix. A matrix operation on a matrix is not slow:
mat <- t(as.matrix(dt0[,-1]))
colnames(mat) <- dt0[[1]]
mat[] <- na.spline(mat,na.rm=FALSE)
which gives
TOTAL,F,AD TOTAL,F,AL TOTAL,F,AM TOTAL,F,AT TOTAL,F,AZ
2014 32832 1409931 1692440 4351253 4755163
2013 37408 1409931 1688458 4328238 4707690
2012 38252 1409931 1684000 4309977 4651601
2011 38252 1409931 1679066 4296293 4594023
2010 40296 1409931 1673656 4285442 4542083
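If the result is needed back in the question's original layout (one row per country, years as columns), a minimal added sketch, assuming dt0 from the question is still available:
res <- data.table("age,sex,geo\\time" = colnames(mat), t(mat))
res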
Using a data.table. If you instead want to use a data.table, do
mat <- t(as.matrix(dt0[,-1]))
colnames(mat) <- dt0[[1]]
DT <- data.table(mat,keep.rownames=TRUE)
vn <- names(DT)[-1]   # the value columns to fill
DT[,(vn):=lapply(.SD,na.spline,na.rm=FALSE),.SDcols=vn]
which updates DT by reference, giving
rn TOTAL,F,AD TOTAL,F,AL TOTAL,F,AM TOTAL,F,AT TOTAL,F,AZ
1: 2014 32832 1409931 1692440 4351253 4755163
2: 2013 37408 1409931 1688458 4328238 4707690
3: 2012 38252 1409931 1684000 4309977 4651601
4: 2011 38252 1409931 1679066 4296293 4594023
5: 2010 40296 1409931 1673656 4285442 4542083
Benchmarking:
mat <- t(as.matrix(dt0[,-1]))
colnames(mat) <- dt0[[1]]
DT <- data.table(mat,keep.rownames=TRUE)
vn <- names(DT)[-1]
tvn <- names(dt0)[-1]
require(microbenchmark)
microbenchmark(
transp = dt0[,as.list(na.spline(unlist(.SD), na.rm=FALSE)),by=1:nrow(dt0),.SDcols=tvn],
lapply = DT[,lapply(.SD,na.spline,na.rm=FALSE),.SDcols=vn],
apply = apply(mat,2,na.spline,na.rm=FALSE),
fun = na.spline(mat,na.rm=FALSE),
times=10)
Results:
Unit: milliseconds
expr min lq mean median uq max neval
transp 4.666934 4.734891 4.850268 4.787690 4.897202 5.259957 10
lapply 3.923823 4.010356 4.327646 4.039445 4.049957 6.976446 10
apply 2.505556 2.525601 2.578890 2.585978 2.592090 2.758801 10
fun 1.945290 1.994178 2.063063 2.068490 2.085112 2.272846 10
The "transp" result shows the timing of @shadow's solution, which preserves the OP's formatting. apply is not necessary here, thanks to how na.spline works.
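As an added cross-check (not part of the original answer), the matrix route and the lapply route should fill in exactly the same values, using mat, DT and vn from the benchmark setup above:
mat2 <- mat
mat2[] <- na.spline(mat, na.rm=FALSE)   # same pattern as above, keeps dimnames
all.equal(as.data.table(mat2),
          DT[, lapply(.SD, na.spline, na.rm=FALSE), .SDcols=vn])   # should report TRUE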
Answer 2:
You are doing row-wise operations in your apply version and column-wise operations in your data.table version. You can do the row-wise operation in data.table if you set by = 1:nrow(dt).
dt2b <- dt0b[, as.list(na.spline(unlist(.SD), na.rm=FALSE)), by = 1:nrow(dt0b)]
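The by = 1:nrow(...) grouping returns generic column names (a grouping column nrow plus V1 ... V5); a small added sketch (not part of the original answer) restores the question's layout, reusing dt0a and cn0b from the question:
setnames(dt2b, c("nrow", cn0b))   # put the year names back
dt2b[, nrow := NULL]              # drop the helper grouping column
dt2 <- cbind(dt0a, dt2b)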
You can also use .SDcols, so that you don't need to split up the data. If age,sex,geo\\time is unique, the following will work:
dt0[, as.list(na.spline(unlist(.SD), na.rm=FALSE)), by = `age,sex,geo\\time`, .SDcols = -"age,sex,geo\\time"]
## age,sex,geo\\time V1 V2 V3 V4 V5
## 1: TOTAL,F,AD 32832 37408 38252 38252 40296
## 2: TOTAL,F,AL 1409931 1409931 1409931 1409931 1409931
## 3: TOTAL,F,AM 1692440 1688458 1684000 1679066 1673656
## 4: TOTAL,F,AT 4351253 4328238 4309977 4296293 4285442
## 5: TOTAL,F,AZ 4755163 4707690 4651601 4594023 4542083
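To see where the odd numbers in the question's approach #3 came from: lapply(.SD, na.spline, na.rm=FALSE) runs the spline down each column, i.e. across countries instead of across years. A minimal added illustration using the question's 2013 column:
na.spline(c(37408, NA, NA, 4328238, 4707690), na.rm=FALSE)
# The two filled-in values correspond to the 1993097 and 3423374 that appear
# for AL and AM in the 2013 column of the question's approach #3 output.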
Source: https://stackoverflow.com/questions/30075752/how-to-replace-na-values-in-a-data-table-with-na-spline