How to use Hadoop Streaming with LZO-compressed Sequence Files?

抹茶落季 · 2021-01-13 05:20

I'm trying to play around with the Google ngrams dataset using Amazon's Elastic Map Reduce. There's a public dataset at http://aws.amazon.com/datasets/8172056142375670, a

4 Answers
  • 2021-01-13 05:51

    LZO is packaged as part of Elastic MapReduce, so there's no need to install anything.

    I just tried this and it works:

     hadoop jar ~hadoop/contrib/streaming/hadoop-streaming.jar \
      -D mapred.reduce.tasks=0 \
      -input s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/ \
      -inputformat SequenceFileAsTextInputFormat \
      -output test_output \
      -mapper org.apache.hadoop.mapred.lib.IdentityMapper
    
  • 2021-01-13 05:57

    I was getting weird results using LZO, and my problem was resolved by switching to another codec:

    -D mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
    -D mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
    

    Then things just work. You don't need to (and maybe also shouldn't) change the -inputformat.

    Version: 0.20.2-cdh3u4, 214dd731e3bdb687cb55988d3f47dd9e248c5690
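
    Putting those flags into a complete streaming invocation might look like the sketch below. The bucket paths, the identity mapper, and the extra mapred.output.compress flag are illustrative assumptions, not part of the original answer:

    ```shell
    # Sketch only: assumes an EMR/CDH3-era cluster with the streaming jar
    # in its usual contrib location; input/output paths are placeholders.
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
      -D mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
      -D mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
      -D mapred.output.compress=true \
      -input s3n://my-bucket/input/ \
      -output s3n://my-bucket/output/ \
      -mapper org.apache.hadoop.mapred.lib.IdentityMapper
    ```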
    
  • 2021-01-13 06:02

    LZO compression has been removed from Hadoop 0.20.x onwards due to licensing issues. If you want to process LZO-compressed sequence files, the LZO native libraries have to be installed and configured on the Hadoop cluster.

    Kevin Weil's hadoop-lzo project is the current working solution I am aware of. I have tried it, and it works.

    Install the lzo-devel packages at the OS level, if not already done. These packages enable LZO compression at the OS level, without which Hadoop's LZO compression won't work.

    Follow the instructions in the hadoop-lzo README and compile it. After the build, you get the hadoop-lzo jar and the hadoop-lzo native libraries. Make sure you compile on the machine (or a machine of the same architecture) where your cluster is configured.

    Hadoop's standard native libraries are also required; these are provided in the distribution by default for Linux. If you are using Solaris, you will also need to build Hadoop from source in order to get the standard native libraries.

    Restart the cluster once all changes are made.
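
    The steps above can be sketched as follows. This assumes an RPM-based Linux host and the classic ant build of kevinweil/hadoop-lzo; package names, build targets, and install paths may differ on your distribution and hadoop-lzo version:

    ```shell
    # Sketch, assuming an RPM-based distro; on Debian/Ubuntu the package
    # is liblzo2-dev instead of lzo-devel.
    sudo yum install -y lzo lzo-devel            # OS-level LZO libraries

    # Build hadoop-lzo on a machine matching the cluster's architecture.
    git clone https://github.com/kevinweil/hadoop-lzo.git
    cd hadoop-lzo
    ant compile-native tar                       # builds the jar and native libs

    # Copy the artifacts into the Hadoop installation on every node,
    # then restart the cluster.
    cp build/hadoop-lzo-*.jar "$HADOOP_HOME/lib/"
    cp -r build/native/* "$HADOOP_HOME/lib/native/"
    ```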

  • 2021-01-13 06:04

    You may want to look at this: https://github.com/kevinweil/hadoop-lzo
