Class com.hadoop.compression.lzo.LzoCodec not found for Spark on CDH 5?

别那么骄傲 2020-12-05 20:32

I have been working on this problem for two days and still have not found a way to solve it.

Problem: Our Spark installed via newest CDH 5 always complains abo

3 Answers
  • 2020-12-05 20:50

    On Hortonworks (HDP) 2.3.0 with Ambari, Spark needs Custom spark-defaults properties to work with LZO. I added:

    • spark.driver.extraClassPath /usr/hdp/current/hadoop-client/lib/hadoop-lzo-0.6.0.{{hdp_full_version}}.jar
    • spark.driver.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64

    This is based on the HDP 2.3.0 upgrading SPARK 2.2 page (it has some typos).

  • 2020-12-05 20:55

    Solved!! May the solution help others who encounter the same problem.


    In this tutorial, I will show you how to enable LZO compression on Hadoop, Pig and Spark. I assume you already have a working basic Hadoop installation (if not, please refer to other tutorials on Hadoop installation).

    You have probably reached this page because you hit the same problem I did, which usually starts with this Java exception:

    Caused by: java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found.
    

    Since the Apache and Cloudera distributions are two of the most popular, configurations for both are shown. In brief, there are three main steps:

    • Installing native-lzo libraries
    • Installing hadoop-lzo library
    • Setting up environment variables correctly (the part that took me the most time)

    Step 1: Installing native-lzo libraries

    The native-lzo library is required for the installation of hadoop-lzo. You can install it manually or through a package manager (NOTE: Make sure all nodes in the cluster have native-lzo installed.):

    • On Mac OS:

      sudo port install lzop lzo2
      
    • On RHEL or CentOS:

      sudo yum install lzo lzo-devel
      
    • On Debian or Ubuntu:

      sudo apt-get install liblzo2-dev
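
After installing, it can help to confirm on each node that the dynamic linker actually sees the library. The following is only a minimal sketch for Linux, assuming `ldconfig -p` is available (it is not on Mac OS). The function name `lzo2_in_cache` is made up for illustration, and it does nothing more than a textual match on the cache listing it reads from stdin.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the linker cache listing on stdin mentions liblzo2.
lzo2_in_cache() {
    grep -q 'liblzo2\.so'
}

# Typical use on a Linux node (commented out so sourcing this file has no side effects):
#   ldconfig -p | lzo2_in_cache && echo "liblzo2 found" || echo "liblzo2 missing"
```

A match only shows that the shared library is registered with the linker; it does not prove that hadoop-lzo was built against it.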
      

    Step 2: Installing hadoop-lzo library

    For Apache Hadoop

    Because LZO is GPL-licensed, it is not shipped with the official Hadoop distribution, which uses the Apache Software License. I recommend the Twitter version, a fork of hadoop-gpl-compression with notable improvements. If you are running the official Hadoop, installation instructions are provided in the documentation.

    For Cloudera Distribution

    In Cloudera's CDH, hadoop-lzo is shipped to customers as a parcel, and you can download and distribute it conveniently using Cloudera Manager. By default, hadoop-lzo will be installed in /opt/cloudera/parcels/HADOOP_LZO.

    Here we show the configuration on our cluster:

    • Cloudera CDH 5
    • HADOOP_LZO version 0.4.15

    Step 3: Setting up environment variables

    For Apache Hadoop/Pig

    The basic configuration is for Apache Hadoop; Pig piggybacks on it.

    • Set compression codecs libraries in core-site.xml:

      <property>
        <name>io.compression.codecs</name>
        <value>org.apache.hadoop.io.compress.GzipCodec,
            org.apache.hadoop.io.compress.DefaultCodec,
            org.apache.hadoop.io.compress.BZip2Codec,
            com.hadoop.compression.lzo.LzoCodec,
            com.hadoop.compression.lzo.LzopCodec
        </value>
      </property>
      <property>
        <name>io.compression.codec.lzo.class</name>
        <value>com.hadoop.compression.lzo.LzoCodec</value>
      </property>
      
    • Set MapReduce compression configuration in mapred-site.xml:

      <property>
        <name>mapred.compress.map.output</name>
        <value>true</value>
      </property>
      <property>
        <name>mapred.map.output.compression.codec</name>
        <value>com.hadoop.compression.lzo.LzoCodec</value>
      </property>
      <property>
        <name>mapred.child.env</name>
        <value>JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/path/to/your/hadoop-lzo/libs/native</value>
      </property>
      
    • Append HADOOP_CLASSPATH to hadoop-env.sh:

      HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/cloudera/parcels/CDH/lib/hadoop/lib/*
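
When the exception persists after editing these files, a first sanity check is whether the configuration file a node actually reads mentions the codec class at all. Below is a rough sketch, not a real XML parser: the helper name `codec_listed` is hypothetical, and it only greps the file for the class name, so it cannot tell which property the name appears in.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given config file mentions the codec class.
# Returns 2 if the file does not exist, 1 if the class name is absent.
# Usage: codec_listed /path/to/core-site.xml com.hadoop.compression.lzo.LzoCodec
codec_listed() {
    conf_file=$1
    codec_class=$2
    [ -f "$conf_file" ] || return 2
    # -F treats the class name as a fixed string, so the dots are not regex wildcards.
    grep -q -F -- "$codec_class" "$conf_file"
}
```

Running it against the core-site.xml on every node is a cheap way to spot a node where the configuration was never deployed.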
      

    For Cloudera Distribution

    You can apply the same settings through the Cloudera Manager GUI:

    • For the MapReduce component, change the corresponding keys as above:

      > **io.compression.codecs**
      > **mapred.compress.map.output**
      > **mapred.map.output.compression.codec**
      > **MapReduce Client safety valve for mapred-site.xml**
      
    • Edit the MapReduce Client Environment Snippet for hadoop-env.sh to append the HADOOP_CLASSPATH variable.

    Finally, restart the dependent services in the right order and deploy the configuration to all nodes. That's it! You can then test the setup with the command below and should see success messages similar to these:

       $ hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer lzo_logs
       14/05/04 01:13:13 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
       14/05/04 01:13:13 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 49753b4b5a029410c3bd91278c360c2241328387]
       14/05/04 01:13:14 INFO lzo.LzoIndexer: [INDEX] LZO Indexing file datasets/lzo_logs size 0.00 GB...
       14/05/04 01:13:14 INFO Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
       14/05/04 01:13:14 INFO lzo.LzoIndexer: Completed LZO Indexing in 0.39 seconds (0.02 MB/s).  Index size is 0.01 KB.
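
To check this on several nodes without reading the log by eye, the two success messages from the sample output can be matched mechanically. This is a sketch under the assumption that your hadoop-lzo build prints the same message text as shown above; the function name `lzo_native_loaded` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads log text on stdin and succeeds only if both
# success messages from the sample output are present.
lzo_native_loaded() {
    log_text=$(cat)
    printf '%s\n' "$log_text" | grep -q -F 'Loaded native gpl library' &&
    printf '%s\n' "$log_text" | grep -q -F 'Successfully loaded & initialized native-lzo library'
}

# Possible use (commented out; 2>&1 is there in case the INFO lines go to stderr):
#   hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer lzo_logs 2>&1 \
#       | lzo_native_loaded && echo "native LZO OK"
```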
    

    For Spark

    This took me a lot of time because previous posts contain little information about it. But the solution is straightforward given the experience above.

    Whether Spark is installed from a tarball or via Cloudera Manager, you only need to append two path values to spark-env.sh:

       SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/path/to/your/hadoop-lzo/libs/native
       SPARK_CLASSPATH=$SPARK_CLASSPATH:/path/to/your/hadoop-lzo/java/libs
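
Two caveats, offered as hints rather than as part of the original fix. First, a bare directory on a Java classpath only picks up .class files, so if the directory holds the hadoop-lzo jar you may need the jar's full path or a `dir/*` wildcard. Second, Spark 1.0 and later warn that SPARK_CLASSPATH is deprecated in favour of spark.driver.extraClassPath and spark.executor.extraClassPath. A minimal sketch of the same idea expressed as spark-submit options follows; the helper name `lzo_submit_flags` and the example paths are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: prints spark-submit options that put the hadoop-lzo jar
# and the native library directory on both the driver and the executors.
# $1 = path to the hadoop-lzo jar, $2 = directory holding the native libraries.
lzo_submit_flags() {
    lzo_jar=$1
    native_dir=$2
    # Driver side uses the dedicated spark-submit options; executor side uses --conf keys.
    printf '%s ' \
        "--driver-class-path $lzo_jar" \
        "--driver-library-path $native_dir" \
        "--conf spark.executor.extraClassPath=$lzo_jar"
    printf '%s\n' "--conf spark.executor.extraLibraryPath=$native_dir"
}

# Possible use (commented out; relies on word splitting, so paths must not contain spaces):
#   spark-submit $(lzo_submit_flags /path/to/hadoop-lzo.jar /path/to/your/hadoop-lzo/libs/native) your-app.jar
```

The executor paths must exist on every worker node, which is one more reason to install hadoop-lzo in the same location across the cluster.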
    

    Related posts and questions

    A comparison of LZO performance is given elsewhere. A related question has also been asked on StackOverflow, but it had no solution at the time this tutorial was finished. You may also be interested in how to use the LZO parcel from Cloudera.

  • 2020-12-05 21:01

    I just had the same error in my Cloudera 5 installation. In my case the GPLEXTRAS parcel had been installed and distributed but not activated.

    In Cloudera Manager -> Hosts -> Parcels I cleared all the filters, after which I could press Activate on the GPLEXTRAS parcel that had already been distributed.

    That was enough to fix my issue.
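
To confirm from a shell that an activated parcel really exposes the jar, something like the sketch below may help. It assumes parcels live under /opt/cloudera/parcels, as the HADOOP_LZO answer in this thread mentions; the GPLEXTRAS directory name in the usage line is my assumption and may differ on your cluster. The helper name `find_lzo_jar` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: prints every hadoop-lzo jar found under the given directory,
# following symlinks. Returns 2 if the directory is missing, 1 if no jar is found.
find_lzo_jar() {
    search_dir=$1
    [ -d "$search_dir" ] || return 2
    jars=$(find -L "$search_dir" -type f -name 'hadoop-lzo*.jar' 2>/dev/null)
    [ -n "$jars" ] || return 1
    printf '%s\n' "$jars"
}

# Possible use (commented out; the directory name is an assumption):
#   find_lzo_jar /opt/cloudera/parcels/GPLEXTRAS || echo "no hadoop-lzo jar found"
```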
