GoogleHadoopFileSystem cannot be cast to hadoop FileSystem?


Short Answer

Indeed it was related to IsolatedClientLoader, and we've tracked down the root cause and verified a fix. I filed https://issues.apache.org/jira/browse/SPARK-9206 to track this issue, and successfully built a clean Spark tarball from my fork with a simple fix: https://github.com/apache/spark/pull/7549

There are a few short-term options:

  1. Use Spark 1.3.1 for now.
  2. In your bdutil deployment, use HDFS as the default filesystem (--default_fs=hdfs); you'll still be able to specify gs:// paths directly in your jobs, it's just that HDFS will be used for intermediate data and staging files. Note that there are some minor incompatibilities when using raw Hive in this mode.
  3. Use a plain SQLContext (val sqlContext = new org.apache.spark.sql.SQLContext(sc)) instead of a HiveContext if you don't need HiveContext features; see the sketch after this list.
  4. git clone https://github.com/dennishuo/spark and run ./make-distribution.sh --name my-custom-spark --tgz --skip-java-test -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver to get a fresh tarball you can specify in your bdutil's spark_env.sh.
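
For option 3, here is a minimal spark-shell sketch (the bucket and paths are hypothetical, not from the original report): a plain SQLContext never goes through the isolated Hive classloader, so gs:// paths resolve normally.

// Option 3 sketch (spark-shell, Spark 1.4.x). A plain SQLContext avoids the
// isolated Hive classloader entirely, so GCS paths load without the cast error.
// Bucket and path below are hypothetical.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.parquetFile("gs://my-bucket/data/events.parquet")
df.registerTempTable("events")
sqlContext.sql("SELECT COUNT(*) FROM events").show()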

Long Answer

We've verified that the problem only manifests when fs.default.name and fs.defaultFS are set to a gs:// path, and that it does so regardless of whether the data is loaded with parquetFile("gs://...") or parquetFile("hdfs://..."); when fs.default.name and fs.defaultFS are set to an HDFS path, loading data from both HDFS and GCS works fine. This is also specific to Spark 1.4+ at the moment, and is not present in Spark 1.3.1 or older.
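
Concretely, on a cluster whose core-site.xml points the default filesystem at GCS (as bdutil does by default), roughly the following fails in spark-shell; the paths are hypothetical, and the scheme makes no difference:

// Repro sketch, assuming core-site.xml sets fs.default.name / fs.defaultFS
// to a gs:// URI. Paths below are hypothetical.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.parquetFile("gs://my-bucket/data.parquet")   // ClassCastException
hiveContext.parquetFile("hdfs:///tmp/data.parquet")      // ClassCastException as well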

The regression appears to have been introduced in https://github.com/apache/spark/commit/9ac8393663d759860c67799e000ec072ced76493, which actually fixes a prior, related classloading issue, SPARK-8368. While that fix is correct for normal cases, the method IsolatedClientLoader.isSharedClass, which determines which classloader to use, interacts with the commit to break GoogleHadoopFileSystem classloading.

The following lines in that file treat everything under com.google.* as a "shared class" because of the Guava (and possibly protobuf) dependencies, which are indeed loaded as shared libraries. Unfortunately, GoogleHadoopFileSystem needs to be loaded as a "hive class" in this case, just like org.apache.hadoop.hdfs.DistributedFileSystem; it just has the bad luck of sharing the com.google.* package namespace.

protected def isSharedClass(name: String): Boolean =
  name.contains("slf4j") ||
  name.contains("log4j") ||
  name.startsWith("org.apache.spark.") ||
  name.startsWith("scala.") ||
  name.startsWith("com.google") ||
  name.startsWith("java.lang.") ||
  name.startsWith("java.net") ||
  sharedPrefixes.exists(name.startsWith)

...

/** The classloader that is used to load an isolated version of Hive. */
protected val classLoader: ClassLoader = new URLClassLoader(allJars, rootClassLoader) {
  override def loadClass(name: String, resolve: Boolean): Class[_] = {
    val loaded = findLoadedClass(name)
    if (loaded == null) doLoadClass(name, resolve) else loaded
  }

  def doLoadClass(name: String, resolve: Boolean): Class[_] = {
    ...
    } else if (!isSharedClass(name)) {
      logDebug(s"hive class: $name - ${getResource(classToPath(name))}")
      super.loadClass(name, resolve)
    } else {
      // For shared classes, we delegate to baseClassLoader.
      logDebug(s"shared class: $name")
      baseClassLoader.loadClass(name)
    }
  }
}
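
The fix in the pull request linked above is, in essence, to carve the GCS connector's packages out of the com.google shared prefix so they get loaded as "hive classes"; the following is a sketch of that shape, with the authoritative diff being the linked PR:

protected def isSharedClass(name: String): Boolean =
  name.contains("slf4j") ||
  name.contains("log4j") ||
  name.startsWith("org.apache.spark.") ||
  name.startsWith("scala.") ||
  // Keep Guava/protobuf shared, but let the isolated loader handle the GCS
  // connector so it sees the same org.apache.hadoop.fs.FileSystem definition:
  (name.startsWith("com.google") && !name.startsWith("com.google.cloud")) ||
  name.startsWith("java.lang.") ||
  name.startsWith("java.net") ||
  sharedPrefixes.exists(name.startsWith)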

This can be verified by adding the following line to ${SPARK_INSTALL}/conf/log4j.properties:

log4j.logger.org.apache.spark.sql.hive.client=DEBUG

And the output shows:

...
15/07/20 20:59:14 DEBUG IsolatedClientLoader: hive class: org.apache.hadoop.hdfs.DistributedFileSystem - jar:file:/home/hadoop/spark-install/lib/spark-assembly-1.4.1-hadoop2.6.0.jar!/org/apache/hadoop/hdfs/DistributedFileSystem.class
...
15/07/20 20:59:14 DEBUG IsolatedClientLoader: shared class: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
java.lang.RuntimeException: java.lang.ClassCastException: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem cannot be cast to org.apache.hadoop.fs.FileSystem
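
The ClassCastException itself is the classic two-classloaders symptom: JVM class identity is the pair (class name, defining loader), so the same bytecode loaded by two unrelated loaders yields two incompatible Class objects. A standalone Scala illustration of just that mechanism (the jar path and class name are hypothetical, not from this post):

import java.net.{URL, URLClassLoader}

// Two unrelated loaders (parent = null, so no delegation) over the same jar:
val jars = Array(new URL("file:/path/to/some.jar"))   // hypothetical jar
val loaderA = new URLClassLoader(jars, null)
val loaderB = new URLClassLoader(jars, null)
val a = loaderA.loadClass("com.example.SomeClass")    // hypothetical class
val b = loaderB.loadClass("com.example.SomeClass")
println(a == b)                 // false: two distinct Class objects
println(a.isAssignableFrom(b))  // false: hence "cannot be cast to ..."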