Running app jar file on spark-submit in a Google Dataproc cluster instance


As you've found, Dataproc includes Hadoop dependencies on the classpath when invoking Spark. This is done primarily so that using Hadoop input formats, file systems, and so on is fairly straightforward. The downside is that you will end up with Hadoop's Guava version, which is 11.0.2 (see HADOOP-10101).
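
To make the failure mode concrete, here is a minimal sketch (the object name is a hypothetical placeholder, and the assumption is that your app was compiled against a newer Guava, say 15.0 or later): code that compiles cleanly can still fail at runtime because Hadoop's older Guava shadows yours on the cluster classpath.

    import com.google.common.base.Stopwatch

    // Hypothetical driver object, compiled against Guava 15+.
    object GuavaClashDemo {
      def main(args: Array[String]): Unit = {
        // Stopwatch.createStarted() was only introduced in Guava 15.0.
        // If Hadoop's Guava 11.0.2 is loaded first on the cluster, this
        // call throws NoSuchMethodError at runtime even though the code
        // compiled without complaint against the newer jar.
        val watch = Stopwatch.createStarted()
        println(s"elapsed: $watch")
      }
    }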

How to work around this depends on your build system. If you are using Maven, the maven-shade-plugin can be used to relocate your version of Guava under a new package name. An example of this can be seen in the GCS Hadoop Connector's packaging, but the crux of it is the following plugin declaration in the build section of your pom.xml:

  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>2.3</version>
    <executions>
      <execution>
        <phase>package</phase>
        <goals>
          <goal>shade</goal>
        </goals>
        <configuration>
          <relocations>
            <relocation>
              <pattern>com.google.common</pattern>
              <shadedPattern>your.repackaged.deps.com.google.common</shadedPattern>
            </relocation>
          </relocations>
        </configuration>
      </execution>
    </executions>
  </plugin>
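
With that in place, a plain `mvn package` produces the shaded jar, and submitting it to the cluster then looks something like `gcloud dataproc jobs submit spark --cluster=your-cluster --class=your.MainClass --jars=target/your-app.jar` (the cluster name, main class, and jar path here are placeholders, not values from the original question).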

Similar relocations can be accomplished with the sbt-assembly plugin for sbt, jarjar for Ant, and either jarjar or the Shadow plugin for Gradle; an sbt sketch follows.
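
For sbt, a relocation equivalent to the Maven example above would look roughly like the following in build.sbt. This is a sketch, not from the original answer: the shaded package prefix simply mirrors the pom.xml above, and the plugin version in the comment is an assumption.

    // build.sbt -- requires the sbt-assembly plugin, e.g. in project/plugins.sbt:
    //   addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

    assemblyShadeRules in assembly := Seq(
      // Rewrite every reference to com.google.common in the fat jar so that
      // your Guava cannot collide with the Guava 11.0.2 that Hadoop puts on
      // the cluster classpath.
      ShadeRule.rename(
        "com.google.common.**" -> "your.repackaged.deps.com.google.common.@1"
      ).inAll
    )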
