I'm trying to play around with the Google ngrams dataset using Amazon's Elastic Map Reduce. There's a public dataset at http://aws.amazon.com/datasets/8172056142375670, a
LZO is packaged as part of Elastic MapReduce, so there's no need to install anything.
I just tried this and it works:
hadoop jar ~hadoop/contrib/streaming/hadoop-streaming.jar \
  -D mapred.reduce.tasks=0 \
  -input s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/ \
  -inputformat SequenceFileAsTextInputFormat \
  -output test_output \
  -mapper org.apache.hadoop.mapred.lib.IdentityMapper
I got weird results using LZO, and my problem was resolved by switching to another codec:
-D mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
Then things just work. You don't need to (and probably shouldn't) change the -inputformat.
Version: 0.20.2-cdh3u4, 214dd731e3bdb687cb55988d3f47dd9e248c5690
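For reference, here is a rough sketch of how the two codec options might be combined with the streaming command from the earlier answer. The helper name run_snappy_job, the /home/hadoop jar location, and the STREAMING_JAR, OUTPUT_DIR and HADOOP overrides are my own assumptions for illustration, and it presumes your Hadoop build ships SnappyCodec. As written it only prints the arguments.

```shell
#!/bin/sh
# Sketch only: run_snappy_job is a hypothetical helper name. The jar path,
# input path and output directory are assumptions to adjust for your cluster.
run_snappy_job() {
  # HADOOP can be overridden (e.g. HADOOP=echo) to print instead of submit.
  # The -D generic options are placed before the streaming options.
  "${HADOOP:-hadoop}" jar "${STREAMING_JAR:-/home/hadoop/contrib/streaming/hadoop-streaming.jar}" \
    -D mapred.reduce.tasks=0 \
    -D mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
    -D mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
    -input s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/ \
    -inputformat SequenceFileAsTextInputFormat \
    -output "${OUTPUT_DIR:-test_output}" \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper
}

# Dry run: print the arguments instead of submitting anything.
HADOOP=echo run_snappy_job
```

Calling run_snappy_job without the HADOOP=echo override would submit the job with the real hadoop command.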
LZO compression was removed from Hadoop from 0.20.x onwards due to licensing issues. If you want to process LZO-compressed sequence files, the LZO native libraries have to be installed and configured on the Hadoop cluster.
Kevin's Hadoop-lzo project is the current working solution I am aware of. I have tried it. It works.
Install the lzo-devel packages on the OS, if not already done. These packages provide LZO compression at the OS level, without which Hadoop LZO compression won't work.
Follow the instructions in the hadoop-lzo README and compile it. After the build, you will have the hadoop-lzo-lib jar and the hadoop-lzo native libraries. Make sure you compile on the machine where your cluster is configured, or on a machine of the same architecture.
The standard Hadoop native libraries are also required; the distribution provides them by default for Linux. If you are using Solaris, you will also need to build Hadoop from source in order to get the standard Hadoop native libraries.
Restart the cluster once all changes are made.
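Before restarting, it may help to confirm that the build artifacts actually landed in the Hadoop installation. The following is a minimal sketch, not part of hadoop-lzo: check_lzo_setup is a hypothetical helper, and it assumes the jar is named hadoop-lzo*.jar and the native library libgplcompression*, both copied somewhere under the Hadoop directory (for example lib/ and lib/native/).

```shell
#!/bin/sh
# check_lzo_setup is a hypothetical helper. The file name patterns and the
# idea that everything lives under one Hadoop directory are assumptions;
# adjust them to match your own layout.
check_lzo_setup() {
  hadoop_dir="$1"
  status=0

  # Java side: a hadoop-lzo jar somewhere under the Hadoop directory.
  if [ -z "$(find "$hadoop_dir" -name 'hadoop-lzo*.jar' 2>/dev/null)" ]; then
    echo "missing: hadoop-lzo jar under $hadoop_dir"
    status=1
  fi

  # Native side: the libgplcompression library produced by the build.
  if [ -z "$(find "$hadoop_dir" -name 'libgplcompression*' 2>/dev/null)" ]; then
    echo "missing: libgplcompression native library under $hadoop_dir"
    status=1
  fi

  if [ "$status" -eq 0 ]; then
    echo "hadoop-lzo files found under $hadoop_dir"
  fi
  return "$status"
}

# Example: check the directory named by HADOOP_HOME, if it is set.
if [ -n "${HADOOP_HOME:-}" ]; then
  check_lzo_setup "$HADOOP_HOME" || true
fi
```

This only checks that the files exist; it does not verify that Hadoop is configured to load them.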
You may want to look at this: https://github.com/kevinweil/hadoop-lzo