Question
I have an igraph object mygraph
with ~10,000 nodes and ~145,000 edges, and I need to create a number of subgraphs of different sizes from this graph.
What I need is to create subgraphs of a given size (from 5 nodes to 500 nodes) where all the nodes in each subgraph are connected. I need to create ~1,000 subgraphs for each size (i.e., 1,000 subgraphs for size 5, 1,000 for size 6, and so on), and then calculate some values for each subgraph according to different node attributes.
I have some code but it takes a long time to do all the calculations. I thought about using the graphlets
function in order to get the different sizes, but every time I run it on my computer it crashes due to memory issues.
Here is the code I am using:
The first step was to create a function that builds the subgraphs of different sizes and does the calculations needed.
random_network <- function(size, G){
  # Score a subgraph from its node attributes
  score_fun <- function(g){
    sum(V(g)$weight*V(g)$RWRNodeweight)/sqrt(sum(V(g)$RWRNodeweight^2))
  }
  genes.idx <- V(G)$name
  perm <- c()
  while(length(perm) < 1000){
    # Grow a connected node set from a random starting node
    seed <- sample(genes.idx, 1)
    while(length(seed) < size){
      tmp.neigh <- V(G)[unlist(neighborhood(G, 1, seed))]$name
      tmp.neigh <- setdiff(tmp.neigh, seed)
      if(length(tmp.neigh) > 0)
        seed <- c(seed, sample(tmp.neigh, 1)) else break
    }
    # Score the set only if it reached the requested size
    if(length(seed) == size)
      perm <- c(perm, score_fun(induced.subgraph(G, seed)))
  }
  perm
}
The second step was to apply the function to the actual graph:
### generate some example data
library(igraph)
my_graph <- erdos.renyi.game(10000, 0.0003)
V(my_graph)$name <- 1:vcount(my_graph)
V(my_graph)$weight <- rnorm(10000)
V(my_graph)$RWRNodeweight <- runif(10000, min=0, max=0.05)
### Run the code to get the subgraphs from different size and do calculations based on nodes
genesets.length <- seq(5, 500)  # seq(5:500) was a bug: it yields 1:496, not 5:500
genesets.length.null.dis <- list()
for(k in 5:max(genesets.length)){
genesets.length.null.dis[[as.character(k)]] <- random_network(size=k,G=my_graph)
}
Answer 1:
One approach to speed up your code further than what's possible in base R would be to use the Rcpp package. Consider the following Rcpp implementation of the full operation. It takes the following inputs:
- valid: the indices of all nodes that are in a large-enough component
- el, deg, firstPos: a representation of the graph's edge list (node i's neighbors are el[firstPos[i]], el[firstPos[i]+1], ..., el[firstPos[i]+deg[i]-1])
- size: the subgraph size to sample
- nrep: the number of repetitions
- weights: the node weights stored in V(G)$weight
- RWRNodeweight: the node weights stored in V(G)$RWRNodeweight
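To make that edge-list representation concrete, here is a small illustrative sketch (a toy 4-node graph; toy is just an illustrative name, and the other names mirror the arguments above):
library(igraph)
toy <- graph_from_edgelist(matrix(c(1,2, 2,3, 3,4, 1,3), ncol=2, byrow=TRUE), directed=FALSE)
el <- get.edgelist(toy, names=FALSE) - 1  # 0-based node ids for C++
el <- rbind(el, el[,2:1])                 # include both directions of each edge
el <- el[order(el[,1]),]                  # group rows by source node
deg <- degree(toy)
first.pos <- c(0, cumsum(head(deg, -1))) # offset of each node's block in el
# Neighbors of 0-based node i: el[(first.pos[i+1]+1):(first.pos[i+1]+deg[i+1]), 2]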
library(Rcpp)
cppFunction(
"NumericVector scores(IntegerVector valid, IntegerVector el, IntegerVector deg,
                      IntegerVector firstPos, const int size, const int nrep,
                      NumericVector weights, NumericVector RWRNodeweight) {
  const int n = deg.size();
  std::vector<bool> used(n, false);
  std::vector<bool> neigh(n, false);
  std::vector<int> neighList;
  std::vector<double> scores(nrep);
  for (int outerIter=0; outerIter < nrep; ++outerIter) {
    // Initialize variables
    std::fill(used.begin(), used.end(), false);
    std::fill(neigh.begin(), neigh.end(), false);
    neighList.clear();
    // Random first node
    int recent = valid[rand() % valid.size()];
    used[recent] = true;
    double wrSum = weights[recent] * RWRNodeweight[recent];
    double rrSum = RWRNodeweight[recent] * RWRNodeweight[recent];
    // Each additional node
    for (int idx=1; idx < size; ++idx) {
      // Add neighbors of recent
      for (int p=firstPos[recent]; p < firstPos[recent] + deg[recent]; ++p) {
        if (!neigh[el[p]] && !used[el[p]]) {
          neigh[el[p]] = true;
          neighList.push_back(el[p]);
        }
      }
      // Compute new node to add from all neighbors
      int newPos = rand() % neighList.size();
      recent = neighList[newPos];
      used[recent] = true;
      wrSum += weights[recent] * RWRNodeweight[recent];
      rrSum += RWRNodeweight[recent] * RWRNodeweight[recent];
      // Remove from neighList
      neighList[newPos] = neighList[neighList.size() - 1];
      neighList.pop_back();
    }
    // Compute score from wrSum and rrSum
    scores[outerIter] = wrSum / sqrt(rrSum);
  }
  return NumericVector(scores.begin(), scores.end());
}
")
Now all that we need to do in base R is generate the arguments for scores, which can be done pretty easily:
josilber.rcpp <- function(size, num.rep, G) {
  n <- length(V(G)$name)
  # Determine which nodes fall in sufficiently large connected components
  comp <- components(G)
  valid <- which(comp$csize[comp$membership] >= size)
  # Construct an edge list representation for use in the Rcpp code
  el <- get.edgelist(G, names=FALSE) - 1
  el <- rbind(el, el[,2:1])
  el <- el[order(el[,1]),]
  deg <- degree(G)
  first.pos <- c(0, cumsum(head(deg, -1)))
  # Run the proper number of replications
  scores(valid-1, el[,2], deg, first.pos, size, num.rep,
         as.numeric(V(G)$weight), as.numeric(V(G)$RWRNodeweight))
}
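As a usage sketch (assuming the example my_graph built in the question and the compiled scores function above; perm10 is just an illustrative name), a single call returns the vector of 1000 scores for one size:
perm10 <- josilber.rcpp(size=10, num.rep=1000, G=my_graph)
head(perm10)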
The time to perform 1000 replications is blazing fast compared to the original code and all the igraph solutions we've seen so far (note that for much of this benchmarking I tested the original josilber and random_network functions for 1 replication instead of 1000, because testing 1000 replications would have taken a prohibitively long time):
- Size=10: 0.06 seconds (a 1200x speedup over my previously proposed josilber function and a 4000x speedup over the original random_network function)
- Size=100: 0.08 seconds (an 8700x speedup over my previously proposed josilber function and a 162000x speedup over the original random_network function)
- Size=1000: 0.13 seconds (a 32000x speedup over my previously proposed josilber function and a 20.4 million times speedup over the original random_network function)
- Size=5000: 0.32 seconds (a 68000x speedup over my previously proposed josilber function and a 290 million times speedup over the original random_network function)
In short, Rcpp probably makes it feasible to compute 1000 replicates for each size from 5 to 500 (this operation will probably take about 1 minute in total), while it would be prohibitively slow to compute the 1000 replicates for each of these sizes using the pure R code that's been proposed so far.
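As a sketch of what that full sweep could look like (reusing the names from the question's loop, with josilber.rcpp substituted for random_network):
genesets.length.null.dis <- list()
for (k in 5:500) {
  genesets.length.null.dis[[as.character(k)]] <- josilber.rcpp(size=k, num.rep=1000, G=my_graph)
}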
Answer 2:
Basically, your algorithm for sampling a subgraph could be described as initializing the node set with a randomly selected node and then iteratively adding a neighbor of your current set until either there are no more neighbors or you have reached the desired subset size.
The expensive repeated operation here is determining the neighbors of the current set, which you do with the following:
tmp.neigh <- V(G)[unlist(neighborhood(G,1,seed))]$name
tmp.neigh <- setdiff(tmp.neigh, seed)
In short, you are looking at the neighbors of every node in your selected subset at each iteration. A more efficient approach is to store a vector of neighbors and update it each time you add a new node; that way you only need to consider the neighbors of the newly added node.
josilber <- function(size, num.rep, G) {
  score_fun <- function(vert) sum(vert$weight*vert$RWRNodeweight)/sqrt(sum(vert$RWRNodeweight^2))
  n <- length(V(G)$name)
  # Determine which nodes fall in sufficiently large connected components
  comp <- components(G)
  valid <- which(comp$csize[comp$membership] >= size)
  perm <- replicate(num.rep, {
    first.node <- sample(valid, 1)
    used <- (1:n) == first.node                   # Is this node selected?
    neigh <- (1:n) %in% neighbors(G, first.node)  # Does each node neighbor our selections?
    for (iter in 2:size) {
      # Index explicitly to avoid sample()'s special-casing of length-1 vectors
      candidates <- which(neigh & !used)
      new.node <- candidates[sample.int(length(candidates), 1)]
      used[new.node] <- TRUE
      neigh[neighbors(G, new.node)] <- TRUE
    }
    score_fun(V(G)[used])
  })
  perm
}
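A minimal usage sketch, assuming the example my_graph from the question:
perm <- josilber(size=50, num.rep=1000, G=my_graph)
summary(perm)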
For a single replicate, this yields significant speedups over a single replicate of the code in the question:
- For size=50, a single replicate takes 0.3 seconds for this code and 3.8 seconds for the posted code
- For size=100, a single replicate takes 0.6 seconds for this code and 15.2 seconds for the posted code
- For size=200, a single replicate takes 1.5 seconds for this code and 69.4 seconds for the posted code
- For size=500, a single replicate for this code takes 2.7 seconds (so 1000 replicates should take about 45 minutes); I did not test a single replicate of the posted code.
As mentioned in the other answers, parallelization could further improve the performance of doing 1000 replicates for a given graph size. The following uses the doParallel package to parallelize the repeated step (the adjustment is pretty much identical to the one made by @Chris in his answer):
library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)
josilber2 <- function(size, num.rep, G) {
  score_fun <- function(vert) sum(vert$weight*vert$RWRNodeweight)/sqrt(sum(vert$RWRNodeweight^2))
  n <- length(V(G)$name)
  # Determine which nodes fall in sufficiently large connected components
  comp <- components(G)
  valid <- which(comp$csize[comp$membership] >= size)
  perm <- foreach (i=1:num.rep, .combine='c') %dopar% {
    library(igraph)
    first.node <- sample(valid, 1)
    used <- (1:n) == first.node                   # Is this node selected?
    neigh <- (1:n) %in% neighbors(G, first.node)  # Does each node neighbor our selections?
    for (iter in 2:size) {
      candidates <- which(neigh & !used)          # same sample() fix as above
      new.node <- candidates[sample.int(length(candidates), 1)]
      used[new.node] <- TRUE
      neigh[neighbors(G, new.node)] <- TRUE
    }
    score_fun(V(G)[used])
  }
  perm
}
On my Macbook Air, josilber(100, 1000, my_graph) takes 670 seconds to run (this is the non-parallel version), while josilber2(100, 1000, my_graph) takes 239 seconds to run (this is the parallel version configured with 4 workers). For the size=100 case we have therefore gotten a 20x speedup from code improvements and an additional 3x speedup from parallelization, for a total speedup of 60x.
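After all the replications are finished, the worker processes can be released:
stopCluster(cl)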
Answer 3:
I don't have a complete answer, but here are some things to consider to help speed it up (assuming there is not a much faster approach using a different method).
Remove from your graph any nodes which are not part of a component as large as the one you are looking for. It will really depend on your network structure, but it looks like your networks are genes, so there are likely many genes with very low degree, and you could get some speedups by removing them. Something like this code:
cgraph <- clusters(G)
tooSmall <- which(cgraph$csize < size)
toKeep <- setdiff(1:length(V(G)), which(cgraph$membership %in% tooSmall))
graph <- induced.subgraph(G, vids=toKeep)
Consider running this in parallel to take advantage of multiple cores, for example using the parallel package and mclapply:
library(parallel)
genesets.length <- seq(5, 500)
names(genesets.length) <- genesets.length
genesets.length.null.dis <- mclapply(genesets.length, mc.cores=7, function(length) {
  random_network(size=length, G=my_graph)
})
Answer 4:
I think it would be much more efficient to use the cliques function in igraph, as a clique is a subgraph of completely connected nodes. Simply set min and max equal to the size of the subgraph you are searching for and it will return all cliques of that size. You can then take whatever subset of these meets your needs. Unfortunately, with the example Erdos-Renyi graph you generated, the largest clique is often smaller than 5, so this will not work for the example. However, it should work just fine for a real network that exhibits more clustering than the Erdos-Renyi graph, as yours most likely does.
library(igraph)
##Should be 0.003, not 0.0003 (145000/choose(10000,2))
my_graph <- erdos.renyi.game(10000, 0.003)
cliques(my_graph,min=5,max=5)
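As a sketch of taking whatever subset meets your needs (assuming the graph has enough cliques of the requested size and that the weight and RWRNodeweight node attributes have been set as in the question's example data; cl5 and picked are illustrative names):
cl5 <- cliques(my_graph, min=5, max=5)
picked <- sample(cl5, min(1000, length(cl5)))  # up to 1000 random cliques of size 5
clique.scores <- sapply(picked, function(vs) {
  g <- induced.subgraph(my_graph, vs)
  # Same score as the question's score_fun
  sum(V(g)$weight * V(g)$RWRNodeweight) / sqrt(sum(V(g)$RWRNodeweight^2))
})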
Answer 5:
You have a number of problems with your code (you don't pre-allocate vectors, etc.). Please see the code I came up with below. I have only tested it up to subgraphs of size 100, though. However, the speed savings increase quite a bit as subgraph size goes up, compared to your code. You should install the foreach package as well; note that %dopar% also needs a registered parallel backend (otherwise foreach falls back to sequential execution with a warning). I ran this on a laptop with 4 cores, 2.1 GHz.
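A minimal backend setup sketch for %dopar% (assuming doParallel, as in Answer 2; any foreach backend would do):
library(foreach)
library(doParallel)
cl <- makeCluster(4)  # match the number of cores available
registerDoParallel(cl)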
random_network_new <- function(gsize, G) {
  score_fun <- function(g) {
    sum(V(g)$weight * V(g)$RWRNodeweight) / sqrt(sum(V(g)$RWRNodeweight^2))
  }
  genes.idx <- V(G)$name
  perm <- foreach (i=seq_len(1e3), .combine='c', .packages='igraph') %dopar% {
    seed <- rep(0, length=gsize)
    seed[1] <- sample(genes.idx, 1)
    for (j in 2:gsize) {
      # Only the most recently added node's neighbors are considered here
      tmp.neigh <- neighbors(G, as.numeric(seed[j-1]))
      tmp.neigh <- setdiff(tmp.neigh, seed)
      if (length(tmp.neigh) > 0) {
        seed[j] <- tmp.neigh[sample.int(length(tmp.neigh), 1)]
      } else {
        break
      }
    }
    # Drop unfilled slots in case the walk ran out of neighbors early
    score_fun(induced.subgraph(G, seed[seed != 0]))
  }
  perm
}
Note that I renamed the function to random_network_new and the argument to gsize.
system.time(genesets <- random_network_new(gsize=100, G=my_graph))
user system elapsed
1011.157 2.974 360.925
system.time(genesets <- random_network_new(gsize=50, G=my_graph))
user system elapsed
822.087 3.119 180.358
system.time(genesets <- random_network_new(gsize=25, G=my_graph))
user system elapsed
379.423 1.130 74.596
system.time(genesets <- random_network_new(gsize=10, G=my_graph))
user system elapsed
144.458 0.677 26.508
One example using your code (mine is over 10x faster for subgraph size 10; it would be much faster with bigger subgraphs):
system.time(genesets_slow <- random_network(10, my_graph))
user system elapsed
350.112 0.038 350.492
Source: https://stackoverflow.com/questions/33084860/sampling-subgraphs-from-different-sizes-using-igraph