Finding Connected Components using Hadoop/MapReduce

误落风尘 2021-02-06 06:11

I need to find connected components for a huge dataset. (The graph is undirected.)

One obvious choice is MapReduce. But I'm a newbie to MapReduce and am quite short of time.

4 Answers
  • 2021-02-06 06:38

    This is a little old question, but here is something you may want to check out: we implemented connected components using MapReduce on the Spark platform.

    https://github.com/kwartile/connected-component

  • 2021-02-06 06:41

    I don't really know of an API that has methods to find strongly connected components. But I implemented the BFS algorithm to find the distance from a source node to all other nodes in the graph (the graph was directed and had as many as 65 million nodes).

    The idea was to explore the neighbors (at distance 1) of each node in one iteration and feed the output of reduce back into map until the distances converge. The map emits the shortest distances possible from each node, and the reduce updates each node with the shortest distance in the list.

    I would suggest checking this out. Also, this could help. These two links should give you the basic idea of graph algorithms in the MapReduce paradigm (if you are not already familiar with it). Essentially, you need to tweak the algorithm to use DFS instead of BFS. A rough sketch of one such map/reduce iteration is shown below.
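
    Roughly, one iteration of that pattern can be written as a plain Hadoop job like the sketch below (untested, just to show the shape). The input line format (node id, current distance, comma-separated neighbor list) and the NODE/DIST record tags are assumptions for this sketch; a driver (not shown) re-runs the job on its own output until no distance changes between iterations.

        // One BFS iteration in Hadoop MapReduce: each reached node offers (distance + 1)
        // to its neighbors; the reducer keeps the minimum and restores the adjacency list.
        import java.io.IOException;

        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;

        public class BfsIteration {

          // Assumed input line: "nodeId<TAB>distance<TAB>n1,n2,...", with Long.MAX_VALUE
          // marking nodes that have not been reached yet.
          public static class BfsMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
              String[] parts = line.toString().split("\t");
              String node = parts[0];
              long dist = Long.parseLong(parts[1]);
              String neighbors = parts.length > 2 ? parts[2] : "";

              // Pass the node record through so the reducer can restore its adjacency list.
              ctx.write(new Text(node), new Text("NODE\t" + dist + "\t" + neighbors));

              // Offer a one-hop-longer distance to every neighbor of an already reached node.
              if (dist != Long.MAX_VALUE && !neighbors.isEmpty()) {
                for (String n : neighbors.split(",")) {
                  ctx.write(new Text(n), new Text("DIST\t" + (dist + 1)));
                }
              }
            }
          }

          public static class BfsReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text node, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
              long best = Long.MAX_VALUE;
              String neighbors = "";
              for (Text value : values) {
                String[] parts = value.toString().split("\t");
                if (parts[0].equals("NODE") && parts.length > 2) {
                  neighbors = parts[2];     // keep the adjacency list for the next round
                }
                best = Math.min(best, Long.parseLong(parts[1]));
              }
              // Written in the same format as the input, so the output can be fed
              // straight back into the next iteration.
              ctx.write(node, new Text(best + "\t" + neighbors));
            }
          }
        }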

  • 2021-02-06 06:54

    You may want to look at the Pegasus project from Carnegie Mellon University. They provide an efficient and elegant implementation using MapReduce. They also provide binaries, samples, and very detailed documentation.

    The implementation itself is based on the Generalized Iterative Matrix-Vector multiplication (GIM-V) proposed by U Kang in 2009.

    U Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. "PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations." In IEEE International Conference on Data Mining (ICDM 2009).
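
    For connected components, the GIM-V iteration boils down to minimum-label propagation. Paraphrasing the paper (not its exact notation), with N(i) the neighbors of vertex i:

        v_i^{(0)} = i
        v_i^{(t+1)} = \min\bigl( v_i^{(t)}, \; \min_{j \in N(i)} v_j^{(t)} \bigr)

    In GIM-V terms, combine2 passes a neighbor's current label through, combineAll takes the minimum of the incoming labels, and assign keeps the smaller of the old and new label; the iteration stops once no label changes, which takes at most on the order of the graph's diameter passes.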

    EDIT: The official implementation is actually limited to 2.1 billion nodes (node ids are stored as integers). I'm creating a fork on GitHub (https://github.com/placeiq/pegasus) to share my patch and other enhancements (e.g. Snappy compression).

  • 2021-02-06 07:01

    I blogged about it myself:

    http://codingwiththomas.blogspot.de/2011/04/graph-exploration-with-hadoop-mapreduce.html

    But MapReduce isn't a good fit for this kind of graph analysis. BSP (bulk synchronous parallel) is a better fit; Apache Hama provides a good graph API on top of Hadoop HDFS.

    I've written a connected-components algorithm (mindist search) with MapReduce here:

    https://github.com/thomasjungblut/tjungblut-graph/tree/master/src/de/jungblut/graph/mapreduce

    A BSP version for Apache Hama can also be found here:

    https://github.com/thomasjungblut/tjungblut-graph/blob/master/src/de/jungblut/graph/bsp/MindistSearch.java

    The BSP implementation isn't as difficult as the MapReduce one, and it is at least 10 times faster. If you're interested, check out the latest version in TRUNK and visit our mailing list; a rough vertex-centric sketch of the mindist-search idea follows after the links below.

    http://hama.apache.org/

    http://apache.org/hama/mail-lists.html
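
    To give an idea of the vertex-centric style, here is a self-contained, single-machine sketch of mindist search in the BSP/Pregel spirit: every vertex repeatedly adopts the smallest vertex id it has heard of and forwards it to its neighbors. The little Vertex class and superstep() loop below are just for illustration and are not the Hama graph API.

        // Mindist search sketch: labels converge to the minimum vertex id per component.
        import java.util.ArrayList;
        import java.util.List;

        public class MindistSearchSketch {

          static class Vertex {
            final int id;
            int label;                                   // smallest id seen so far = component id
            final List<Vertex> neighbors = new ArrayList<>();
            final List<Integer> inbox = new ArrayList<>();

            Vertex(int id) { this.id = id; this.label = id; }
          }

          // One superstep: fold incoming messages into the label; if it shrank (or this is
          // the first superstep), notify the neighbors. Returns true while messages were sent.
          static boolean superstep(List<Vertex> graph, boolean first) {
            List<int[]> outbox = new ArrayList<>();      // (target vertex index, label) pairs
            for (Vertex v : graph) {
              int min = v.label;
              for (int m : v.inbox) min = Math.min(min, m);
              v.inbox.clear();
              if (min < v.label || first) {
                v.label = min;
                for (Vertex n : v.neighbors) outbox.add(new int[] { n.id, min });
              }
              // otherwise the vertex stays quiet, the analogue of voting to halt
            }
            for (int[] msg : outbox) graph.get(msg[0]).inbox.add(msg[1]);
            return !outbox.isEmpty();
          }

          public static void main(String[] args) {
            // Small undirected example: edges 0-1, 1-2, 3-4 form components {0,1,2} and {3,4}.
            List<Vertex> g = new ArrayList<>();
            for (int i = 0; i < 5; i++) g.add(new Vertex(i));
            int[][] edges = { {0, 1}, {1, 2}, {3, 4} };
            for (int[] e : edges) {
              g.get(e[0]).neighbors.add(g.get(e[1]));
              g.get(e[1]).neighbors.add(g.get(e[0]));
            }

            boolean busy = superstep(g, true);
            while (busy) busy = superstep(g, false);

            for (Vertex v : g) System.out.println("vertex " + v.id + " -> component " + v.label);
          }
        }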
