Running app jar file on spark-submit in a Google Dataproc cluster instance


As you've found, Dataproc includes Hadoop dependencies on the classpath when invoking Spark. This is done primarily so that using Hadoop input formats, file systems, and so on is fairly straightforward. The downside is that you will end up with Hadoop's Guava version, which is 11.0.2 (see HADOOP-10101).
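
To make the failure mode concrete, here is a minimal sketch (the object name is a hypothetical placeholder, and the assumption is that your app was compiled against a newer Guava, say 15.0 or later): code that compiles cleanly can still fail at runtime because Hadoop's older Guava shadows yours on the cluster classpath.

    import com.google.common.base.Stopwatch

    // Hypothetical driver object, compiled against Guava 15+.
    object GuavaClashDemo {
      def main(args: Array[String]): Unit = {
        // Stopwatch.createStarted() was only introduced in Guava 15.0.
        // If Hadoop's Guava 11.0.2 is loaded first on the cluster, this
        // call throws NoSuchMethodError at runtime even though the code
        // compiled without complaint against the newer jar.
        val watch = Stopwatch.createStarted()
        println(s"elapsed: $watch")
      }
    }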

How to work around this depends on your build system. If you are using Maven, the maven-shade-plugin can be used to relocate your version of Guava under a new package name. An example of this can be seen in the GCS Hadoop Connector's packaging, but the crux of it is the following plugin declaration in the build section of your pom.xml:

  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>2.3</version>
    <executions>
      <execution>
        <phase>package</phase>
        <goals>
          <goal>shade</goal>
        </goals>
        <configuration>
          <relocations>
            <relocation>
              <pattern>com.google.common</pattern>
              <shadedPattern>your.repackaged.deps.com.google.common</shadedPattern>
            </relocation>
          </relocations>
        </configuration>
      </execution>
    </executions>
  </plugin>
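
With that in place, a plain `mvn package` produces the shaded jar, and submitting it to the cluster then looks something like `gcloud dataproc jobs submit spark --cluster=your-cluster --class=your.MainClass --jars=target/your-app.jar` (the cluster name, main class, and jar path here are placeholders, not values from the original question).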

Similar relocations can be accomplished with the sbt-assembly plugin for sbt, jarjar for Ant, and either jarjar or the Shadow plugin for Gradle; an sbt sketch follows.
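
For sbt, a relocation equivalent to the Maven example above would look roughly like the following in build.sbt. This is a sketch, not from the original answer: the shaded package prefix simply mirrors the pom.xml above, and the plugin version in the comment is an assumption.

    // build.sbt -- requires the sbt-assembly plugin, e.g. in project/plugins.sbt:
    //   addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

    assemblyShadeRules in assembly := Seq(
      // Rewrite every reference to com.google.common in the fat jar so that
      // your Guava cannot collide with the Guava 11.0.2 that Hadoop puts on
      // the cluster classpath.
      ShadeRule.rename(
        "com.google.common.**" -> "your.repackaged.deps.com.google.common.@1"
      ).inAll
    )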
