Indexing sequence to use for addressing an element of a data frame

Submitted by 一曲冷凌霜 on 2019-12-13 18:41:21

Question


There are several ways to access a specific element in a data frame, using various combinations of brackets ([ ]) and dollar signs ($). In time-sensitive functions, the choice between them can be important.

Benchmarking some of the possible combinations:

library(microbenchmark)
df <- data.frame(a=1:6,b=1:6,c=1:6,d=1:6,e=1:6,f=1:6)
microbenchmark(df$c[3],
               df[3,]$c,
               df[3,3],
               df[3,][3],
               df[3,][[3]],
               df[,3][3],
               times=1e3)

yields these timings:

Unit: microseconds
         expr    min       lq      mean   median       uq      max neval
      df$c[3]  9.836  11.4505  14.03068  12.2015  12.9280 1252.854  1000
    df[3, ]$c 77.204  89.5750 100.18752  92.2445  98.6395 1351.521  1000
     df[3, 3] 15.719  18.9850  21.04074  19.6010  20.7400   82.519  1000
   df[3, ][3] 88.599 100.5920 110.59009 104.0415 110.5435  409.050  1000
 df[3, ][[3]] 75.856  87.2200  98.67104  89.9360  96.1695 1391.299  1000
   df[, 3][3] 11.639  13.4225  14.77493  13.9510  14.6905   55.172  1000

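We see that df$c[3] is fastest, closely followed by df[,3][3]. By median time, df[3,3] is about 1.6 times slower, and the forms that extract row 3 first (df[3,] followed by $c, [3] or [[3]]) are roughly seven to nine times slower.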
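The six expressions do not all return the same kind of object, which is worth checking before comparing their timings. A minimal base-R sketch (the list name `results` is only illustrative and does not come from the original post):

```r
df <- data.frame(a=1:6, b=1:6, c=1:6, d=1:6, e=1:6, f=1:6)

# Evaluate each benchmarked expression once and keep the result.
results <- list(
  "df$c[3]"      = df$c[3],
  "df[3, ]$c"    = df[3, ]$c,
  "df[3, 3]"     = df[3, 3],
  "df[3, ][3]"   = df[3, ][3],    # single bracket on a data frame keeps a data frame
  "df[3, ][[3]]" = df[3, ][[3]],
  "df[, 3][3]"   = df[, 3][3]
)

# Class of each result: "integer" everywhere except df[3, ][3],
# which is a data frame with one row and one column.
print(sapply(results, class))
```

So df[3,][3] is not a drop-in replacement for the other five forms, whatever its timing.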

In time-sensitive applications, I often use data tables rather than data frames, because sorting and subsetting operations are typically much faster. However, addressing operations can be much slower, as we see if we repeat the above for a data.table:

library(data.table)
dt <- as.data.table(df)
microbenchmark(dt$c[3],
               dt[3,]$c,
               dt[3,3],
               dt[3,][[3]],
               times=1e3)
Unit: microseconds
         expr     min       lq      mean   median       uq      max neval
      dt$c[3]   9.503  11.4020  14.90066  12.6820  13.8950 1336.407  1000
    dt[3, ]$c 417.756 437.0495 480.26532 448.8625 463.6350 2909.038  1000
     dt[3, 3] 205.115 218.9590 238.78000 227.9575 239.1265 1554.503  1000
 dt[3, ][[3]] 414.378 435.2115 470.76853 447.1505 461.3310 1906.432  1000
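For comparison, here is a small sketch of column-first access on a data.table. It assumes the data.table package is installed (the guard makes it a no-op otherwise), and the variable names x1 to x3 are illustrative only. The dt[3, c] form is the usual data.table idiom for evaluating a column expression on selected rows; it was not part of the benchmark above.

```r
# Guarded so the snippet does nothing when data.table is not installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  dt <- as.data.table(data.frame(a=1:6, b=1:6, c=1:6))

  x1 <- dt$c[3]        # list-style access, does not go through `[.data.table`
  x2 <- dt[["c"]][3]   # same idea with an explicit column name
  x3 <- dt[3, c]       # j is evaluated inside dt and returns a plain vector

  print(class(dt[3, ]))   # a one-row data.table is built before $c or [[3]] is applied
  print(c(x1, x2, x3))
}
```

The row-first forms pay for a full call to `[.data.table` that builds a new table, which is consistent with the larger timings shown above.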

My question is this: Is $[ ] guaranteed to always be the fastest addressing method, or can this depend on factors such as the types of data in the data frame (or table), the platform (OS), or the build version? If anyone can explain the reasons underlying the differences in timing, and/or the pros and cons of the various approaches, that would also be useful.

UPDATE

Following the suggestions in the answer from 42-, the test is repeated here with more rows and with the additional syntax options suggested by 42- and by A.Webb, who proposed df[[3,3]] in a comment as the fastest. (Note: I also ran the same test accessing higher row numbers, and the timing seems to be independent of which row is selected.)

df <- data.frame(a=1:1000,b=1:1000,c=1:1000,d=1:1000,e=1:1000,f=1:1000)

Unit: microseconds
         expr    min      lq       mean  median       uq      max neval
      df$c[3]  8.314  9.7610  12.870667 10.6260  12.0950 1250.339  1000
 df[["c"]][3]  6.932  8.0670   9.652672  8.7075   9.9445   26.512  1000
  (df[3, ])$c 72.395 77.2390  90.893724 79.8320  95.8540  256.082  1000
     df[3, 3] 14.871 16.2625  19.377482 17.1180  20.1720   47.720  1000
   df[3, ][3] 82.446 86.7680 102.462603 89.9660 107.7965  232.685  1000
 df[3, ][[3]] 70.559 75.2140  93.581394 78.3385  93.4235 1507.933  1000
   df[, 3][3]  9.933 11.4770  13.430309 12.1090  14.0900   38.213  1000
   df[[3, 3]]  6.465  7.8355   9.236773  8.4500   9.6355   29.833  1000

So it looks like df[[i,j]] is fastest, followed extremely closely by df[["colname"]][j]. Which of these to use would probably depend on whether you need to use column names or numbers.

The question of whether we can assume this is always the case, on all platforms and for all data types, is still open.
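One way to probe this on a given machine is to repeat the comparison over several column types. The following is a rough base-R sketch rather than a rigorous benchmark: the helper time_access, the accessors list and the frames list are hypothetical names introduced here, and system.time over a loop is much coarser than microbenchmark.

```r
# Hypothetical helper: total elapsed seconds for n calls of one accessor.
time_access <- function(f, df, n = 5000) {
  system.time(for (k in seq_len(n)) f(df))[["elapsed"]]
}

# Four of the access forms compared in the question.
accessors <- list(
  dollar      = function(d) d$c[3],
  name_first  = function(d) d[["c"]][3],
  matrix_like = function(d) d[3, 3],
  double_ij   = function(d) d[[3, 3]]
)

# Same shape, different column types.
n <- 1000
frames <- list(
  integer   = data.frame(a = 1:n, b = 1:n, c = 1:n),
  double    = data.frame(a = runif(n), b = runif(n), c = runif(n)),
  character = data.frame(a = as.character(1:n), b = as.character(1:n),
                         c = as.character(1:n), stringsAsFactors = FALSE)
)

# Rows are accessors, columns are data types.
timings <- sapply(frames, function(d) sapply(accessors, time_access, df = d))
print(timings)
```

Running this on different operating systems and R versions gives evidence for one's own setup, but it cannot establish a guarantee.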


Answer 1:


As stated in my comments, df$c[3] is effectively evaluated as '[['(df, 'c')[3], so it's not surprising that calling [[ directly results in slightly faster execution. The data.table comparisons are mostly not equivalent, except when using $, which is not really a data.table function.

Unit: microseconds
         expr     min       lq      mean   median       uq      max neval   cld
      df$c[3]  16.035  16.8245  17.63600  17.3090  17.9400   31.158  1000 ab   
 df[["c"]][3]  13.008  13.9090  14.60883  14.2775  14.8355  121.634  1000 a    
  (df[3, ])$c 137.376 140.4895 143.57778 141.6055 143.8310  175.180  1000    d 
     df[3, 3]  29.316  30.5715  31.25617  30.9040  31.3165   49.764  1000   c  
   df[3, ][3] 156.524 159.4180 167.99243 160.3910 162.3120 2636.693  1000     e
 df[3, ][[3]] 134.975 137.3945 142.92265 138.3810 140.2370 2675.090  1000    d 
   df[, 3][3]  20.108  21.2860  21.94357  21.5810  21.8640   59.057  1000  b   

I admit to being surprised that the code I wrote, '[['(df, 'c')[3], was deparsed as df[["c"]][3], and I am rather puzzled by some of the results. The general rule, though, is that selecting the column first and then the position within the resulting vector is much faster than selecting the row first.

Also, this needs to be tested with larger objects, in particular ones with rows >> cols.
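The relabelling can be checked directly in base R. In this sketch the names same_call, dollar_fn and same_value are illustrative only, and the link to the benchmark label assumes that microbenchmark labels each expression by deparsing it.

```r
df <- data.frame(a=1:6, b=1:6, c=1:6)

# The functional form and the bracket form are the same call object,
# so deparsing either one gives the bracket notation.
same_call <- identical(quote(`[[`(df, "c")[3]), quote(df[["c"]][3]))
print(deparse(quote(`[[`(df, "c")[3])))

# df$c is a call to `$`; for list-like objects it returns the same value
# as df[["c", exact = FALSE]].
dollar_fn  <- as.character(quote(df$c)[[1]])
same_value <- identical(df$c, df[["c", exact = FALSE]])
print(c(same_call = same_call, same_value = same_value))
```

Both forms reach the column as a list element and then index the resulting vector, which matches the column-first rule stated above.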



Source: https://stackoverflow.com/questions/35905688/indexing-sequence-to-use-for-addressing-an-element-of-a-data-frame
