How to use Hadoop Streaming with LZO-compressed Sequence Files?

抹茶落季 · 2021-01-13 05:20

I'm trying to play around with the Google ngrams dataset using Amazon's Elastic Map Reduce. There's a public dataset at http://aws.amazon.com/datasets/8172056142375670, a

4 Answers
  • 2021-01-13 05:51

    LZO is packaged as part of Elastic MapReduce, so there's no need to install anything.

    I just tried this and it works:

     hadoop jar ~hadoop/contrib/streaming/hadoop-streaming.jar \
      -D mapred.reduce.tasks=0 \
      -input s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/ \
      -inputformat SequenceFileAsTextInputFormat \
      -output test_output \
      -mapper org.apache.hadoop.mapred.lib.IdentityMapper
    
  • 2021-01-13 05:57

    I was getting weird results using LZO, and my problem was resolved by switching to another codec:

    -D mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
    -D mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
    

    Then things just work. You don't need to (and maybe also shouldn't) change the -inputformat.

    Version: 0.20.2-cdh3u4, 214dd731e3bdb687cb55988d3f47dd9e248c5690
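
    Putting those flags into a complete streaming invocation might look like the sketch below. The bucket paths, the identity mapper, and the extra mapred.output.compress flag are illustrative assumptions, not part of the original answer:

    ```shell
    # Sketch only: assumes an EMR/CDH3-era cluster with the streaming jar
    # in its usual contrib location; input/output paths are placeholders.
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
      -D mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
      -D mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
      -D mapred.output.compress=true \
      -input s3n://my-bucket/input/ \
      -output s3n://my-bucket/output/ \
      -mapper org.apache.hadoop.mapred.lib.IdentityMapper
    ```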
    
  • 2021-01-13 06:02

    LZO compression has been removed from Hadoop 0.20.x onwards due to licensing issues. If you want to process LZO-compressed sequence files, the LZO native libraries have to be installed and configured on the Hadoop cluster.

    Kevin Weil's hadoop-lzo project is the current working solution I am aware of. I have tried it, and it works.

    Install the lzo-devel packages at the OS level, if not already done. These packages enable LZO compression at the OS level, without which Hadoop's LZO compression won't work.

    Follow the instructions in the hadoop-lzo README and compile it. After the build, you get the hadoop-lzo jar and the hadoop-lzo native libraries. Make sure you compile on the machine (or a machine of the same architecture) where your cluster is configured.

    Hadoop's standard native libraries are also required; these are provided in the distribution by default for Linux. If you are using Solaris, you will also need to build Hadoop from source in order to get the standard native libraries.

    Restart the cluster once all changes are made.
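
    The steps above can be sketched as follows. This assumes an RPM-based Linux host and the classic ant build of kevinweil/hadoop-lzo; package names, build targets, and install paths may differ on your distribution and hadoop-lzo version:

    ```shell
    # Sketch, assuming an RPM-based distro; on Debian/Ubuntu the package
    # is liblzo2-dev instead of lzo-devel.
    sudo yum install -y lzo lzo-devel            # OS-level LZO libraries

    # Build hadoop-lzo on a machine matching the cluster's architecture.
    git clone https://github.com/kevinweil/hadoop-lzo.git
    cd hadoop-lzo
    ant compile-native tar                       # builds the jar and native libs

    # Copy the artifacts into the Hadoop installation on every node,
    # then restart the cluster.
    cp build/hadoop-lzo-*.jar "$HADOOP_HOME/lib/"
    cp -r build/native/* "$HADOOP_HOME/lib/native/"
    ```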

  • 2021-01-13 06:04

    You may want to look at this: https://github.com/kevinweil/hadoop-lzo
