发表新帖

发表新帖

Why isn't Hadoop implemented using MPI?

前端未结

关注

 6  819

一个人的身影 2021-01-30 01:41

Correct me if I\'m wrong, but my understanding is that Hadoop does not use MPI for communication between different nodes.

What are the technical reasons for this?

<

6条回答

北荒 (楼主)

2021-01-30 02:28

MPI is Message Passing Interface. Right there in the name - there is no data locality. You send the data to another node for it to be computed on. Thus MPI is network-bound in terms of performance when working with large data.

MapReduce with the Hadoop Distributed File System duplicates data so that you can do your compute in local storage - streaming off the disk and straight to the processor. Thus MapReduce takes advantage of local storage to avoid the network bottleneck when working with large data.

This is not to say that MapReduce doesn't use the network... it does: and the shuffle is often the slowest part of a job! But it uses it as little, and as efficiently as possible.

To sum it up: Hadoop (and Google's stuff before it) did not use MPI because it could not have used MPI and worked. MapReduce systems were developed specifically to address MPI's shortcomings in light of trends in hardware: disk capacity exploding (and data with it), disk speed stagnant, networks slow, processor gigahertz peaked, multi-core taking over Moore's law.

0 讨论(0)

查看其它6个回答
发布评论:

提交评论
- 加载中...

热议问题