Sampling subgraphs of different sizes using igraph


Question


I have an igraph object mygraph with ~10,000 nodes and ~145,000 edges, and I need to create a number of subgraphs of different sizes from this graph. What I need is to create subgraphs of a given size (from 5 nodes to 500 nodes) where all the nodes in each subgraph are connected. I need to create ~1,000 subgraphs for each size (i.e., 1,000 subgraphs for size 5, 1,000 for size 6, and so on), and then calculate some values for each subgraph based on different node attributes. I have some code, but it takes a long time to do all the calculations. I thought of using the graphlets function to get the different sizes, but every time I run it on my computer it crashes due to memory issues.

Here is the code I am using:

The first step was to create a function that builds the subgraphs of different sizes and does the required calculations.

random_network <- function(size, G) {
  # Score of a subgraph: weighted sum of node attributes, normalized
  score_fun <- function(g) {
    sum(V(g)$weight * V(g)$RWRNodeweight) / sqrt(sum(V(g)$RWRNodeweight^2))
  }

  genes.idx <- V(G)$name
  perm <- c()
  while (length(perm) < 1000) {
    # Grow a connected node set starting from a random seed node
    seed <- sample(genes.idx, 1)
    while (length(seed) < size) {
      tmp.neigh <- V(G)[unlist(neighborhood(G, 1, seed))]$name
      tmp.neigh <- setdiff(tmp.neigh, seed)
      if (length(tmp.neigh) > 0)
        seed <- c(seed, sample(tmp.neigh, 1))
      else
        break
    }
    # Keep the score only if the node set reached the requested size
    if (length(seed) == size)
      perm <- c(perm, score_fun(induced.subgraph(G, seed)))
  }
  perm
}

The second step was to apply the function to the actual graph:

 ### Generate some example data
 library(igraph)
 my_graph <- erdos.renyi.game(10000, 0.0003)
 V(my_graph)$name <- 1:vcount(my_graph)
 V(my_graph)$weight <- rnorm(10000)
 V(my_graph)$RWRNodeweight <- runif(10000, min=0, max=0.05)

 ### Run the code to get the subgraphs of different sizes and do calculations based on nodes
 genesets.length <- seq(5, 500)
 genesets.length.null.dis <- list()
 for (k in 5:max(genesets.length)) {
     genesets.length.null.dis[[as.character(k)]] <- random_network(size=k, G=my_graph)
 }

Answer 1:


One approach to speeding up your code beyond what's possible in base R would be to use the Rcpp package. Consider the following Rcpp implementation of the full operation. It takes as input the following:

  • valid: The indices of all nodes that are in a large-enough component
  • el, deg, firstPos: A representation of the graph's edge list (node i's neighbors are el[firstPos[i]], el[firstPos[i]+1], ..., el[firstPos[i]+deg[i]-1]).
  • size: The subgraph size to sample
  • nrep: The number of repetitions
  • weights: The node weights stored in V(G)$weight
  • RWRNodeweight: The node weights stored in V(G)$RWRNodeweight
library(Rcpp)
cppFunction(
"NumericVector scores(IntegerVector valid, IntegerVector el, IntegerVector deg,
                      IntegerVector firstPos, const int size, const int nrep,
                      NumericVector weights, NumericVector RWRNodeweight) {
  const int n = deg.size();
  std::vector<bool> used(n, false);
  std::vector<bool> neigh(n, false);
  std::vector<int> neighList;
  std::vector<double> scores(nrep);
  for (int outerIter=0; outerIter < nrep; ++outerIter) {
    // Initialize variables
    std::fill(used.begin(), used.end(), false);
    std::fill(neigh.begin(), neigh.end(), false);
    neighList.clear();

    // Random first node
    int recent = valid[rand() % valid.size()];
    used[recent] = true;
    double wrSum = weights[recent] * RWRNodeweight[recent];
    double rrSum = RWRNodeweight[recent] * RWRNodeweight[recent];

    // Each additional node
    for (int idx=1; idx < size; ++idx) {
      // Add neighbors of recent
      for (int p=firstPos[recent]; p < firstPos[recent] + deg[recent]; ++p) {
        if (!neigh[el[p]] && !used[el[p]]) {
          neigh[el[p]] = true;
          neighList.push_back(el[p]);
        }
      }

      // Compute new node to add from all neighbors
      int newPos = rand() % neighList.size();
      recent = neighList[newPos];
      used[recent] = true;
      wrSum += weights[recent] * RWRNodeweight[recent];
      rrSum += RWRNodeweight[recent] * RWRNodeweight[recent];

      // Remove from neighList
      neighList[newPos] = neighList[neighList.size() - 1];
      neighList.pop_back();
    }

    // Compute score from wrSum and rrSum
    scores[outerIter] = wrSum / sqrt(rrSum);
  }
  return NumericVector(scores.begin(), scores.end());
}
")

Now all that we need to do in base R is generate the arguments for scores, which can be done pretty easily:

josilber.rcpp <- function(size, num.rep, G) {
  n <- length(V(G)$name)

  # Determine which nodes fall in sufficiently large connected components
  comp <- components(G)
  valid <- which(comp$csize[comp$membership] >= size)

  # Construct an edge list representation for use in the Rcpp code
  el <- get.edgelist(G, names=FALSE) - 1
  el <- rbind(el, el[,2:1])
  el <- el[order(el[,1]),]
  deg <- degree(G)
  first.pos <- c(0, cumsum(head(deg, -1)))

  # Run the proper number of replications
  scores(valid-1, el[,2], deg, first.pos, size, num.rep,
         as.numeric(V(G)$weight), as.numeric(V(G)$RWRNodeweight))
}
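
For reference, here is a minimal usage sketch (assuming the example data from the question, i.e. my_graph with the weight and RWRNodeweight vertex attributes already set):

# Sketch: time 1000 replications for size-100 subgraphs with the Rcpp-backed wrapper
system.time(res <- josilber.rcpp(size=100, num.rep=1000, G=my_graph))
head(res)   # numeric vector of 1000 subgraph scores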

The time to perform 1000 replications is blazing fast compared to the original code and all igraph solutions we've seen so far (note that for much of this benchmarking I tested the original josilber and random_network functions for 1 replication instead of 1000 because testing for 1000 would take a prohibitively long time):

  • Size=10: 0.06 seconds (a 1200x speedup over my previously proposed josilber function and a 4000x speedup over the original random_network function)
  • Size=100: 0.08 seconds (a 8700x speedup over my previously proposed josilber function and a 162000x speedup over the original random_network function)
  • Size=1000: 0.13 seconds (a 32000x speedup over my previously proposed josilber function and a 20.4 million times speedup over the original random_network function)
  • Size=5000: 0.32 seconds (a 68000x speedup over my previously proposed josilber function and a 290 million times speedup over the original random_network function)

In short, Rcpp probably makes it feasible to compute 1000 replicates for each size from 5 to 500 (this operation will probably take about 1 minute in total), while it would be prohibitively slow to compute the 1000 replicates for each of these sizes using the pure R code that's been proposed so far.
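
As a rough sketch (not part of the original answer), the full set of null distributions for sizes 5 through 500 could then be collected with a loop that mirrors the question's code, reusing josilber.rcpp and my_graph from above:

# Sketch: one null distribution (1000 scores) per subgraph size from 5 to 500
genesets.length.null.dis <- list()
for (k in 5:500) {
  genesets.length.null.dis[[as.character(k)]] <- josilber.rcpp(size=k, num.rep=1000, G=my_graph)
}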




Answer 2:


Basically, your algorithm for sampling a subgraph could be described as initializing the node set to a single randomly selected node and then iteratively adding a neighbor of the current set until either there are no more neighbors or you have reached the desired subgraph size.

The expensive repeated operation here is determining the neighbors of the current set, which you do with the following:

tmp.neigh <- V(G)[unlist(neighborhood(G,1,seed))]$name
tmp.neigh <- setdiff(tmp.neigh, seed)

In short, you are looking at the neighbors of every node in your selected subset at each iteration. A more efficient approach is to store a vector of neighbors and update it each time you add a new node, since then you only need to consider the neighbors of the newly added node.

josilber <- function(size, num.rep, G) {
  score_fun <- function(vert) sum(vert$weight*vert$RWRNodeweight)/sqrt(sum(vert$RWRNodeweight^2))
  n <- length(V(G)$name)

  # Determine which nodes fall in sufficiently large connected components
  comp <- components(G)
  valid <- which(comp$csize[comp$membership] >= size)

  perm <- replicate(num.rep, {
    first.node <- sample(valid, 1)
    used <- (1:n) == first.node  # Is this node selected?
    neigh <- (1:n) %in% neighbors(G, first.node)  # Does each node neighbor our selections?
    for (iter in 2:size) {
      new.node <- sample(which(neigh & !used), 1)
      used[new.node] <- TRUE
      neigh[neighbors(G, new.node)] <- TRUE
    }
    score_fun(V(G)[used])
  })
  perm
}
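
For example, a single size-50 replicate (the first benchmark case below) can be timed with something like:

# Sketch: one replicate for size 50 on the question's example graph
system.time(josilber(size=50, num.rep=1, G=my_graph))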

For a single replicate, this yields significant speedups over a single replicate of the code in the question:

  • For size=50, a single replicate takes 0.3 seconds for this code and 3.8 seconds for the posted code
  • For size=100, a single replicate takes 0.6 seconds for this code and 15.2 seconds for the posted code
  • For size=200, a single replicate takes 1.5 seconds for this code and 69.4 seconds for the posted code
  • For size=500, a single replicate for this code takes 2.7 seconds (so 1000 replicates should take about 45 minutes); I did not test a single replicate of the posted code.

As mentioned in the other answers, parallelization could further improve the performance of doing 1000 replicates for a given graph size. The following uses the doParallel package to parallelize the repeated step (the adjustment is pretty much identical to the one made by @Chris in his answer):

library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)
josilber2 <- function(size, num.rep, G) {
  score_fun <- function(vert) sum(vert$weight*vert$RWRNodeweight)/sqrt(sum(vert$RWRNodeweight^2))
  n <- length(V(G)$name)

  # Determine which nodes fall in sufficiently large connected components
  comp <- components(G)
  valid <- which(comp$csize[comp$membership] >= size)

  perm <- foreach (i=1:num.rep, .combine='c') %dopar% {
    library(igraph)
    first.node <- sample(valid, 1)
    used <- (1:n) == first.node  # Is this node selected?
    neigh <- (1:n) %in% neighbors(G, first.node)  # Does each node neighbor our selections?
    for (iter in 2:size) {
      new.node <- sample(which(neigh & !used), 1)
      used[new.node] <- TRUE
      neigh[neighbors(G, new.node)] <- TRUE
    }
    score_fun(V(G)[used])
  }
  perm
}

On my MacBook Air, josilber(100, 1000, my_graph) takes 670 seconds to run (this is the non-parallel version), while josilber2(100, 1000, my_graph) takes 239 seconds to run (this is the parallel version configured with 4 workers). For the size=100 case, we have therefore gotten a 20x speedup from code improvements and an additional 3x speedup from parallelization, for a total speedup of about 60x.




Answer 3:


I don't have a complete answer but here are some things to consider to help speed it up (assuming there is not a much faster approach using a different method).

  1. Remove from your graph any nodes that are not part of a component as large as the one you are looking for. It will really depend on your network structure, but it looks like your networks are gene networks, so there are likely many genes with very low degree, and you could get some speedup by removing them. Something like this code:

    cgraph <- clusters(G)
    tooSmall <- which(cgraph$csize < size)
    toKeep <- setdiff(1:length(V(G)), which(cgraph$membership %in% tooSmall))
    graph <- induced.subgraph(G, vids=toKeep)
    
  2. Consider running this in parallel to take advantage of multiple cores. For example, using the parallel package and mclapply.

    library(parallel)
    genesets.length<- seq(5, 500)
    names(genesets.length) <- genesets.length
    genesets.length.null.dis <- mclapply(genesets.length, mc.cores=7,
                                         function(length) {
                                           random_network(size=length, G=my_graph)
                                         })
    



Answer 4:


I think it would be much more efficient to use the cliques function in igraph, since a clique is a subgraph of completely connected nodes. Simply set min and max equal to the subgraph size you are searching for, and it will return all cliques of that size (e.g., all cliques of size 5). You can then take whatever subset of these meets your needs (a sampling sketch follows the code below). Unfortunately, with the example Erdos-Renyi graph you generated, the largest clique is often smaller than 5, so this will not work for the example. However, it should work just fine for a real network, which most likely exhibits more clustering than an Erdos-Renyi graph, as yours probably does.

library(igraph)
##Should be 0.003, not 0.0003 (145000/choose(10000,2))
my_graph <- erdos.renyi.game(10000, 0.003)

cliques(my_graph,min=5,max=5)
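
If the graph does contain enough cliques of the target size, a sketch (not part of the original answer) for drawing roughly 1,000 of them at random would be:

# Sketch: sample up to 1000 size-5 cliques; each element is a vertex set
cl5 <- cliques(my_graph, min=5, max=5)
if (length(cl5) >= 1) {
  picked <- sample(cl5, min(1000, length(cl5)))
  # score_fun from the question could then be applied to
  # induced.subgraph(my_graph, v) for each v in picked, provided the
  # weight/RWRNodeweight vertex attributes are set on this graph
}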



Answer 5:


You have a number of problems with your code (you don't pre-allocate vectors, etc.). Please see the code I came up with below. I have only tested it up to subgraphs of size 100, though. However, the speed savings increase quite a bit as the subgraph size goes up, compared to your code. You should install the foreach package as well. I ran this on a laptop with 4 cores at 2.1 GHz.

random_network_new <- function(gsize, G) {
  score_fun <- function(g) {
    subsum <- sum(V(g)$weight * V(g)$RWRNodeweight) / sqrt(sum(V(g)$RWRNodeweight^2))
  }

  genes.idx <- V(G)$name

  perm <- foreach (i=seq_len(1e3), .combine='c') %dopar% {
    seed <- rep(0, length=gsize)
    seed[1] <- sample(genes.idx, 1)

    for (j in 2:gsize) {
      tmp.neigh <- neighbors(G, as.numeric(seed[j-1]))
      tmp.neigh <- setdiff(tmp.neigh, seed)
      if (length(tmp.neigh) > 0) {
        seed[j] <- sample(tmp.neigh, 1)
      } else {
        break
      }
    }
    score_fun(induced.subgraph(G, seed))
  }
  perm
}

Note that I renamed the function to random_network_new and the argument to gsize.
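
Note also that %dopar% needs a registered parallel backend before random_network_new is called; if the doParallel setup from the earlier answer has not already been run, something along these lines (an assumption, not shown in the original answer) is needed, and the workers must also have igraph available (e.g. via foreach's .packages argument or a library(igraph) call inside the loop):

# Sketch: register a backend for %dopar% (matching the 4 cores mentioned above)
library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)
# ... run random_network_new(...) here ...
stopCluster(cl)   # release the workers when finished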

system.time(genesets <- random_network_new(gsize=100, G=my_graph))                                            
   user   system  elapsed 
1011.157    2.974  360.925 
system.time(genesets <- random_network_new(gsize=50, G=my_graph))
   user  system elapsed 
822.087   3.119 180.358 
system.time(genesets <- random_network_new(gsize=25, G=my_graph))
   user  system elapsed 
379.423   1.130  74.596 
system.time(genesets <- random_network_new(gsize=10, G=my_graph))
   user  system elapsed 
144.458   0.677  26.508 

One example using your code (mine is over 10x faster for subgraph size 10; it would be much faster with bigger subgraphs):

system.time(genesets_slow <- random_network(10, my_graph))
   user  system elapsed 
350.112   0.038 350.492 


Source: https://stackoverflow.com/questions/33084860/sampling-subgraphs-from-different-sizes-using-igraph
