Finding Connected Components using Hadoop/MapReduce

误落风尘 2021-02-06 06:11

I need to find connected components for a huge dataset. (The graph is undirected.)

One obvious choice is MapReduce. But I'm a newbie to MapReduce and am quite short of time.

4 Answers
  • 2021-02-06 06:38

    This is a little old question, but here is something you may want to check out: we implemented connected components using MapReduce on the Spark platform.

    https://github.com/kwartile/connected-component

  • 2021-02-06 06:41

    I don't really know of an API that has methods to find strongly connected components. But I implemented the BFS algorithm to find the distance from a source node to all other nodes in the graph (the graph was directed and had as many as 65 million nodes).

    The idea was to explore the neighbors (at distance 1) of each node in one iteration and feed the output of reduce back into map until the distances converge. The map emits the shortest distances possible from each node, and the reduce updates each node with the shortest distance in the list.

    I would suggest checking this out. Also, this could help. These two links should give you the basic idea of graph algorithms in the MapReduce paradigm (if you are not already familiar with it). Essentially, you need to tweak the algorithm to use DFS instead of BFS. A rough sketch of one such map/reduce iteration is shown below.
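
    Roughly, one iteration of that pattern can be written as a plain Hadoop job like the sketch below (untested, just to show the shape). The input line format (node id, current distance, comma-separated neighbor list) and the NODE/DIST record tags are assumptions for this sketch; a driver (not shown) re-runs the job on its own output until no distance changes between iterations.

        // One BFS iteration in Hadoop MapReduce: each reached node offers (distance + 1)
        // to its neighbors; the reducer keeps the minimum and restores the adjacency list.
        import java.io.IOException;

        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;

        public class BfsIteration {

          // Assumed input line: "nodeId<TAB>distance<TAB>n1,n2,...", with Long.MAX_VALUE
          // marking nodes that have not been reached yet.
          public static class BfsMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
              String[] parts = line.toString().split("\t");
              String node = parts[0];
              long dist = Long.parseLong(parts[1]);
              String neighbors = parts.length > 2 ? parts[2] : "";

              // Pass the node record through so the reducer can restore its adjacency list.
              ctx.write(new Text(node), new Text("NODE\t" + dist + "\t" + neighbors));

              // Offer a one-hop-longer distance to every neighbor of an already reached node.
              if (dist != Long.MAX_VALUE && !neighbors.isEmpty()) {
                for (String n : neighbors.split(",")) {
                  ctx.write(new Text(n), new Text("DIST\t" + (dist + 1)));
                }
              }
            }
          }

          public static class BfsReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text node, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
              long best = Long.MAX_VALUE;
              String neighbors = "";
              for (Text value : values) {
                String[] parts = value.toString().split("\t");
                if (parts[0].equals("NODE") && parts.length > 2) {
                  neighbors = parts[2];     // keep the adjacency list for the next round
                }
                best = Math.min(best, Long.parseLong(parts[1]));
              }
              // Written in the same format as the input, so the output can be fed
              // straight back into the next iteration.
              ctx.write(node, new Text(best + "\t" + neighbors));
            }
          }
        }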

  • 2021-02-06 06:54

    You may want to look at the Pegasus project from Carnegie Mellon University. They provide an efficient and elegant implementation using MapReduce. They also provide binaries, samples, and very detailed documentation.

    The implementation itself is based on the Generalized Iterative Matrix-Vector multiplication (GIM-V) proposed by U Kang in 2009.

    U Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. "PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations." In IEEE International Conference on Data Mining (ICDM 2009).
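
    For connected components, the GIM-V iteration boils down to minimum-label propagation. Paraphrasing the paper (not its exact notation), with N(i) the neighbors of vertex i:

        v_i^{(0)} = i
        v_i^{(t+1)} = \min\bigl( v_i^{(t)}, \; \min_{j \in N(i)} v_j^{(t)} \bigr)

    In GIM-V terms, combine2 passes a neighbor's current label through, combineAll takes the minimum of the incoming labels, and assign keeps the smaller of the old and new label; the iteration stops once no label changes, which takes at most on the order of the graph's diameter passes.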

    EDIT: The official implementation is actually limited to 2.1 billion nodes (node ids are stored as integers). I'm creating a fork on GitHub (https://github.com/placeiq/pegasus) to share my patch and other enhancements (e.g. Snappy compression).

  • 2021-02-06 07:01

    I blogged about it myself:

    http://codingwiththomas.blogspot.de/2011/04/graph-exploration-with-hadoop-mapreduce.html

    But MapReduce isn't a good fit for this kind of graph analysis. BSP (bulk synchronous parallel) is a better fit; Apache Hama provides a good graph API on top of Hadoop HDFS.

    I've written a connected-components algorithm (mindist search) with MapReduce here:

    https://github.com/thomasjungblut/tjungblut-graph/tree/master/src/de/jungblut/graph/mapreduce

    A BSP version for Apache Hama can also be found here:

    https://github.com/thomasjungblut/tjungblut-graph/blob/master/src/de/jungblut/graph/bsp/MindistSearch.java

    The BSP implementation isn't as difficult as the MapReduce one, and it is at least 10 times faster. If you're interested, check out the latest version in TRUNK and visit our mailing list; a rough vertex-centric sketch of the mindist-search idea follows after the links below.

    http://hama.apache.org/

    http://apache.org/hama/mail-lists.html
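
    To give an idea of the vertex-centric style, here is a self-contained, single-machine sketch of mindist search in the BSP/Pregel spirit: every vertex repeatedly adopts the smallest vertex id it has heard of and forwards it to its neighbors. The little Vertex class and superstep() loop below are just for illustration and are not the Hama graph API.

        // Mindist search sketch: labels converge to the minimum vertex id per component.
        import java.util.ArrayList;
        import java.util.List;

        public class MindistSearchSketch {

          static class Vertex {
            final int id;
            int label;                                   // smallest id seen so far = component id
            final List<Vertex> neighbors = new ArrayList<>();
            final List<Integer> inbox = new ArrayList<>();

            Vertex(int id) { this.id = id; this.label = id; }
          }

          // One superstep: fold incoming messages into the label; if it shrank (or this is
          // the first superstep), notify the neighbors. Returns true while messages were sent.
          static boolean superstep(List<Vertex> graph, boolean first) {
            List<int[]> outbox = new ArrayList<>();      // (target vertex index, label) pairs
            for (Vertex v : graph) {
              int min = v.label;
              for (int m : v.inbox) min = Math.min(min, m);
              v.inbox.clear();
              if (min < v.label || first) {
                v.label = min;
                for (Vertex n : v.neighbors) outbox.add(new int[] { n.id, min });
              }
              // otherwise the vertex stays quiet, the analogue of voting to halt
            }
            for (int[] msg : outbox) graph.get(msg[0]).inbox.add(msg[1]);
            return !outbox.isEmpty();
          }

          public static void main(String[] args) {
            // Small undirected example: edges 0-1, 1-2, 3-4 form components {0,1,2} and {3,4}.
            List<Vertex> g = new ArrayList<>();
            for (int i = 0; i < 5; i++) g.add(new Vertex(i));
            int[][] edges = { {0, 1}, {1, 2}, {3, 4} };
            for (int[] e : edges) {
              g.get(e[0]).neighbors.add(g.get(e[1]));
              g.get(e[1]).neighbors.add(g.get(e[0]));
            }

            boolean busy = superstep(g, true);
            while (busy) busy = superstep(g, false);

            for (Vertex v : g) System.out.println("vertex " + v.id + " -> component " + v.label);
          }
        }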
