Question
R version: 3.2.4
RStudio version: 0.99.893
Windows 7
Intel i7
480 GB RAM
str(df): 161976 obs. of 11 variables
I am a relative novice to R and do not have a software programming background. My task is to perform clustering on a data set.
The variables have been scaled and centered. I am using the following code to find the optimal number of clusters:
d <- dist(df, method = "euclidean")
library(cluster)   # pam()
library(fpc)       # pamk() comes from fpc, not cluster
pamk.best <- pamk(d)
plot(pam(d, pamk.best$nc))
I have noticed that the system never uses more than 22% of the CPU's processing power.
I have taken the following actions so far:
- Unsuccessfully tried to change the Set Priority and Set Affinity settings for rsession.exe in the Processes tab of the Windows Task Manager. For some reason, the priority always comes back to Low even when I set it to High or Realtime or anything else on that list. The Set Affinity setting shows that the system is allowing R to use all of the cores.
- Adjusted the High Performance settings on my machine by going to Control Panel -> Power Options -> Change advanced power settings -> Processor Power Management and setting it to 100%.
- Read up on the parallel processing CRAN Task View for High Performance Computing. I may be wrong, but I don't think that calculating the distances between observations in a data set is a task that should be parallelized, in the sense of dividing the data set into subsets and performing the distance calculations on the subsets in parallel on different cores. Please correct me if I am wrong (roughly what I have in mind is sketched below).
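For concreteness, this is the kind of split-and-parallelize distance computation I mean. It is only an illustration and assumes the parallelDist package, whose parDist() computes the same Euclidean "dist" object as dist() while spreading the pairwise calculations across the available cores:
library(parallelDist)
# Same result as dist(df, method = "euclidean"), but computed on multiple threads.
d <- parDist(as.matrix(df), method = "euclidean")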
One option I have is to perform clustering on a subset of the data set and then predict cluster membership for the rest of the data set. But, I am thinking that if I have the processing power and the memory available, why can't I perform the clustering on the whole data set!
Is there a way to have the machine or R use a higher percentage of the processing power and complete the task more quickly?
EDIT: I think that my issue is different from the one described in Multithreading in R, because I am not trying to run different functions in R. Rather, I am running only one function on one data set and would like the machine to use more of the processing power that is available to it.
Answer 1:
It is probably using one core only.
There is no automatic way to parallelize computations. So what you need to do is rewrite parts of R (here, probably the dist and pam functions, which supposedly are C or Fortran code) to use more than one core.
Or you use a different tool where someone has done the work already. I'm a big fan of ELKI, but it's mostly single-core. I think Julia may be worth a look because it is more similar to R (it is very similar to Matlab) and it was designed to use multiple cores better. Of course there may also be an R module that parallelizes this; I'd look at the Rcpp modules, which are usually very fast.
But the key to fast and scalable clustering is to avoid distance matrices. Consider: a 4-core system yields maybe a 3.5x speedup (often much less, because of Turbo Boost) and an 8-core system yields up to 6.5x better performance. But if you increase the data set size 10x, you need 100x as much memory and computation. This is a race that you cannot win, except with clever algorithms.
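To put numbers on this: with 161,976 rows, the lower triangle of the distance matrix alone holds about 1.3e10 doubles, on the order of 100 GB, before any clustering starts. If you want to stay in R, one common way to avoid the full matrix for PAM-style clustering is clara() from the cluster package, which runs PAM on repeated subsamples and assigns the remaining points to the nearest medoid. A minimal sketch (the values of k, samples and sampsize here are illustrative, not recommendations):
library(cluster)
k <- pamk.best$nc                # or any number of clusters you want to try
# clara() never builds the n-by-n distance matrix: it draws `samples`
# subsamples of size `sampsize`, runs PAM on each, and keeps the best medoids.
fit <- clara(df, k, metric = "euclidean",
             samples = 50, sampsize = 1000, pamLike = TRUE)
table(fit$clustering)            # cluster sizes for the full data set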
Answer 2:
Here is a quick example of using multiple CPU cores. The task has to be split up, similar to a for loop, but you cannot access any intermediate results for further calculations until the loop has fully executed.
library(doParallel)
registerDoParallel(cores = detectCores(all.tests = FALSE, logical = TRUE))
This would be a basic example of how you can split a task:
vec <- c(1, 3, 5)
do <- function(n) n^2
foreach(i = seq_along(vec)) %dopar% do(vec[i])
If packages are required within your do() function, you can load them in the following way:
foreach(i = seq_along(vec), .packages = c("pkg1", "pkg2")) %dopar% do(vec[i])  # replace "pkg1", "pkg2" with the packages do() needs
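Applied to the question, one natural use of this pattern is to evaluate several candidate cluster counts in parallel rather than speeding up a single call. A rough sketch, assuming the dissimilarity object d from the question fits in memory (it is copied to every worker, so this is memory-hungry) and using an arbitrary candidate range of 2:10:
library(doParallel)
library(cluster)
registerDoParallel(cores = detectCores(logical = TRUE))

ks <- 2:10                           # candidate numbers of clusters (arbitrary)
# Each worker fits pam() for one k and returns the average silhouette width;
# foreach exports d from the calling environment, so every worker gets a copy.
avg.sil <- foreach(k = ks, .combine = c, .packages = "cluster") %dopar% {
  pam(d, k)$silinfo$avg.width
}
best.k <- ks[which.max(avg.sil)]     # k with the widest average silhouette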
Source: https://stackoverflow.com/questions/37034687/how-can-i-have-r-utilize-more-of-the-processing-power-on-my-pc