Fast alternative to split in R


I'm partitioning a data frame with split() in order to use parLapply() to call a function on each partition in parallel. The data frame has 1.3 m

2 answers

萌比男神i

2020-12-20 13:39
Split the row indexes of pop rather than pop itself
```
idx <- split(seq_len(nrow(pop)), list(pop$ID, pop$code))
```
split() itself is not slow, e.g.,
```
> system.time(split(seq_len(1300000), sample(250000, 1300000, TRUE)))
   user  system elapsed 
  1.056   0.000   1.058 
```
so if yours is, I guess there's some aspect of your data that slows things down, e.g., ID and code are both factors with many levels, so their complete interaction, rather than only the level combinations appearing in your data set, is calculated
```
> length(split(1:10, list(factor(1:10), factor(10:1))))
[1] 100
> length(split(1:10, paste(letters[1:10], letters[1:10], sep="-")))
[1] 10
```
or perhaps you're running out of memory.
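
If the empty level combinations turn out to be the problem, one option is the `drop = TRUE` argument of split(). A minimal sketch, with a toy data frame standing in for `pop` (the real columns and sizes are not shown in the question, so `value` and the level names here are made up):
```r
# Toy stand-in for `pop`; only the ID and code column names come from the question.
pop <- data.frame(
  ID    = factor(c("a", "a", "b", "c")),
  code  = factor(c("x", "x", "y", "z")),
  value = 1:4
)

# Default: one list element per level combination (3 * 3), most of them empty.
idx_all <- split(seq_len(nrow(pop)), list(pop$ID, pop$code))
length(idx_all)  # 9

# drop = TRUE keeps only the combinations that actually occur in the data.
idx <- split(seq_len(nrow(pop)), list(pop$ID, pop$code), drop = TRUE)
length(idx)      # 3
```
With two factors of many levels each, the difference between the two lengths can be very large, so it is worth checking length(idx) on the real data.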

Use mclapply rather than parLapply if you're using processes on a non-Windows machine (which I guess is the case since you ask for detectCores()).
```
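library(parallel)  # provides mclapply(); pop and func are the data frame and function from the question
par_pop <- mclapply(idx, function(i, pop, fun) fun(pop[i, ]), pop, func)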
```
Conceptually it sounds like you're really aiming for pvec (distribute a vectorized calculation over processors) rather than mclapply (iterate over individual rows in your data frame).
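
For illustration, a rough sketch of what a pvec call looks like. Here sqrt() is only a placeholder for whatever vectorized calculation func performs, and the core count is an arbitrary choice:
```r
library(parallel)

# pvec() relies on forking, which is unavailable on Windows,
# so fall back to a single core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

x <- seq_len(1e5)

# pvec() cuts x into chunks, applies the vectorized function to each chunk
# in a separate process, and concatenates the results in the original order.
res <- pvec(x, sqrt, mc.cores = cores)
```
This only pays off when the function is vectorized and expensive enough per element to outweigh the cost of forking.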

Also, and really as the initial step, consider identifying the bottlenecks in func; the data is large but not that big, so perhaps parallel evaluation is not needed -- maybe you've written PDI code instead of R code? Pay attention to data types in the data frame, e.g., factor versus character. It's not unusual to get a 100x speed-up between poorly written and efficient R code, whereas parallel evaluation is at best proportional to the number of cores.
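
As a made-up illustration of the kind of gap meant here (func itself is not shown in the question, so both functions below are hypothetical), compare a loop that grows its result one element at a time with the vectorized equivalent:
```r
set.seed(42)
values <- runif(1e4)

# Slow pattern: the result vector is copied and extended on every iteration.
slow_scale <- function(v) {
  out <- c()
  for (i in seq_along(v)) out <- c(out, v[i] * 2 + 1)
  out
}

# Vectorized equivalent: one call operating on the whole vector.
fast_scale <- function(v) v * 2 + 1

system.time(r_slow <- slow_scale(values))
system.time(r_fast <- fast_scale(values))
```
Timing func on a single partition in the same way, before parallelizing anything, shows whether the function or the split is the real cost.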

猫巷女王i

2020-12-20 13:40
split(x, f) is slow if x is a factor and f contains a lot of distinct values

So, this code is fast:
```
system.time(split(seq_len(1300000), sample(250000, 1300000, TRUE)))
```
But, this is very slow:
```
system.time(split(factor(seq_len(1300000)), sample(250000, 1300000, TRUE)))
```
And this is fast again because there are only 25 groups:
```
system.time(split(factor(seq_len(1300000)), sample(25, 1300000, TRUE)))
```
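One possible workaround, assuming the slowdown comes from subsetting the factor once per group, is to split the plain integer codes and keep the levels separately. A small sketch with toy data (sizes reduced so it runs instantly):
```r
set.seed(1)
x <- factor(sample(letters, 1000, TRUE))
f <- sample(25, 1000, TRUE)

# as.integer() strips the factor class, so split() sees a plain integer vector.
codes <- split(as.integer(x), f)
lev <- levels(x)

# Rebuild a factor for a single group only when it is actually needed.
g3 <- factor(lev[codes[["3"]]], levels = lev)
```
Splitting as.character(x) instead is another option if the labels are needed directly in each group.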