Deciding key-value pair for deduplication using Hadoop MapReduce

Posted by 此生再无相见时 on 2019-12-13 12:34:01

Question


I want to implement deduplication of files using Hadoop MapReduce. I plan to do it by computing the MD5 sum of every file in the input directory in my mapper function. These MD5 hashes would be the keys to the reducer, so files with the same hash would go to the same reducer.

By default (with TextInputFormat), the mapper in Hadoop receives the byte offset of each line as the key and that line's content as the value.

I also read that if a file is large, it is split into 64 MB chunks, 64 MB being the default block size in Hadoop.

How can I set the keys to be the names of the files, so that in my mapper I can compute the hash of each file? Also, how can I ensure that no two nodes compute the hash for the same file?


Answer 1:


If you need the entire file as input to one mapper, you need to keep isSplitable returning false. In this scenario the mapper takes in the whole file as input, and you can apply MD5 to it and emit the hash as the key. Because an unsplittable file becomes exactly one input split, exactly one map task processes it, so no two nodes will hash the same file.

WholeFileInputFormat (not part of the Hadoop codebase) can be used here. You can find an implementation online, or in the book Hadoop: The Definitive Guide; a sketch follows below.
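Here is a minimal sketch of such an input format and its record reader, modeled on the Definitive Guide version; the class names are conventions from that book, not Hadoop built-ins:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Treats each input file as a single, unsplittable record.
public class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split a file across mappers
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        WholeFileRecordReader reader = new WholeFileRecordReader();
        reader.initialize(split, context);
        return reader;
    }
}

// Reads the entire file backing a split into one BytesWritable value.
class WholeFileRecordReader
        extends RecordReader<NullWritable, BytesWritable> {

    private FileSplit fileSplit;
    private TaskAttemptContext context;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
        this.fileSplit = (FileSplit) split;
        this.context = context;
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (processed) {
            return false; // a whole file is exactly one record
        }
        byte[] contents = new byte[(int) fileSplit.getLength()];
        Path file = fileSplit.getPath();
        FileSystem fs = file.getFileSystem(context.getConfiguration());
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }

    @Override
    public NullWritable getCurrentKey() { return NullWritable.get(); }

    @Override
    public BytesWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() { return processed ? 1.0f : 0.0f; }

    @Override
    public void close() { /* nothing held open between records */ }
}
```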

The value can be the file name. Calling getInputSplit() on the Context instance gives you the input split, which can be cast to a FileSplit; fileSplit.getPath().getName() then yields the file name, which you can emit as the value.
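Putting those pieces together, here is a hedged sketch of the mapper (and a matching reducer) built on the input format above; DedupMapper, DedupReducer, and the output layout are illustrative choices, not anything from the original question:

```java
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Emits (MD5 hex digest of file contents, file name): identical files
// produce identical keys and therefore meet at the same reduce call.
public class DedupMapper
        extends Mapper<NullWritable, BytesWritable, Text, Text> {

    @Override
    protected void map(NullWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // Recover the file name from the split, as described above.
        FileSplit split = (FileSplit) context.getInputSplit();
        String fileName = split.getPath().getName();

        // Hash only getLength() bytes; BytesWritable's backing array may be padded.
        String md5;
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update(value.getBytes(), 0, value.getLength());
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            md5 = hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IOException("MD5 unavailable", e); // never thrown on a standard JVM
        }

        context.write(new Text(md5), new Text(fileName));
    }
}

// All file names sharing a digest arrive together; everything after the
// first name in a group is a duplicate of that first file.
class DedupReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text digest, Iterable<Text> fileNames, Context context)
            throws IOException, InterruptedException {
        StringBuilder names = new StringBuilder();
        for (Text name : fileNames) {
            if (names.length() > 0) {
                names.append(',');
            }
            names.append(name.toString());
        }
        context.write(digest, new Text(names.toString()));
    }
}
```

The driver would wire this up with job.setInputFormatClass(WholeFileInputFormat.class); setting that input format is also what guarantees one map task per file.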

I have not worked with org.apache.hadoop.hdfs.util.MD5FileUtils, but the javadocs suggest it might work well for you.

Textbook source links for WholeFileInputFormat and the associated RecordReader are included for reference:

1) WholeFileInputFormat

2) WholeFileRecordReader

The grepcode link to MD5FileUtils is also included.



Source: https://stackoverflow.com/questions/22740710/deciding-key-value-pair-for-deduplication-using-hadoop-mapreduce
