Missing hive-site when using spark-submit YARN cluster mode

我在风中等你 2020-12-06 18:38

I'm using HDP 2.5.3 and I've been trying to debug some YARN container classpath issues.

Since HDP includes both Spark 1.6 and 2.0.0, there have been some conflicting versions.

4 answers
  • 2020-12-06 19:10

    Found an issue with this:

    If you create an org.apache.spark.sql.SQLContext before creating the HiveContext, hive-site.xml is not picked up properly when the HiveContext is created.

    Solution: create the HiveContext before creating any other SQLContext.
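
    A minimal Scala sketch of that ordering, assuming the Spark 1.6-style API (the app name and object name are placeholders):

        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.sql.hive.HiveContext

        object HiveFirstApp {
          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(new SparkConf().setAppName("hive-first"))

            // Create the HiveContext first so hive-site.xml is picked up
            // before any other SQLContext gets a chance to initialize.
            val hiveContext = new HiveContext(sc)

            // If a plain SQLContext is needed, create it afterwards:
            // val sqlContext = new org.apache.spark.sql.SQLContext(sc)

            hiveContext.sql("SHOW DATABASES").show()
          }
        }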

  • 2020-12-06 19:14

    You can use the Spark property spark.yarn.dist.files and specify the path to hive-site.xml there.
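
    For example (spark.yarn.dist.files is a standard Spark-on-YARN property; the paths, class and jar names below are placeholders):

        spark-submit \
          --master yarn \
          --deploy-mode cluster \
          --conf spark.yarn.dist.files=/etc/spark/conf/hive-site.xml \
          --class com.example.MyApp \
          my-app.jar

        # or, equivalently, in spark-defaults.conf:
        # spark.yarn.dist.files  /etc/spark/conf/hive-site.xml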

  • 2020-12-06 19:24

    In cluster mode, the configuration is read from the conf directory of the machine that runs the driver container, not the one used for spark-submit.
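
    If you want the driver container to pick up the file from the submitting machine anyway, one option is to ship it explicitly at submit time (--files is a standard spark-submit option; the hive-site.xml path and application names below are placeholders for your HDP layout):

        spark-submit \
          --master yarn \
          --deploy-mode cluster \
          --files /usr/hdp/current/spark-client/conf/hive-site.xml \
          --class com.example.MyApp \
          my-app.jar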

  • 2020-12-06 19:26

    The way I understand it, in local or yarn-client modes...

    1. the Launcher checks whether it needs Kerberos tokens for HDFS, YARN, Hive, HBase
      > hive-site.xml is searched in the CLASSPATH by the Hive/Hadoop client libs (including in driver.extraClassPath because the Driver runs inside the Launcher and the merged CLASSPATH is already built at this point)
    2. the Driver checks which kind of metastore to use for internal purposes: a standalone metastore backed by a volatile Derby instance, or a regular Hive metastore
      > that's $SPARK_CONF_DIR/hive-site.xml
    3. when using the Hive interface, a Metastore connection is used to read/write Hive metadata in the Driver
      > hive-site.xml is searched in the CLASSPATH by the Hive/Hadoop client libs (and the Kerberos token is used, if any)

    So you can have one hive-site.xml stating that Spark should use an embedded, in-memory Derby instance as a sandbox (in-memory implying "stop leaving all these temp files behind you"), while another hive-site.xml gives the actual Hive Metastore URI. And all is well.
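
    As an illustration, the two files might look roughly like this (the property names are standard Hive settings; the values are placeholders):

        <!-- $SPARK_CONF_DIR/hive-site.xml : embedded, in-memory Derby sandbox -->
        <configuration>
          <property>
            <name>javax.jdo.option.ConnectionURL</name>
            <value>jdbc:derby:memory:metastore_db;create=true</value>
          </property>
          <property>
            <name>javax.jdo.option.ConnectionDriverName</name>
            <value>org.apache.derby.jdbc.EmbeddedDriver</value>
          </property>
        </configuration>

        <!-- hive-site.xml on the CLASSPATH : the actual Hive Metastore -->
        <configuration>
          <property>
            <name>hive.metastore.uris</name>
            <value>thrift://metastore-host.example.com:9083</value>
          </property>
        </configuration>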


    Now, in yarn-cluster mode, all that mechanism pretty much explodes in a nasty, undocumented mess.

    The Launcher needs its own CLASSPATH settings to create the Kerberos tokens, otherwise it fails silently. Better go to the source code to find out which undocumented env variable you should use.
    It may also need overrides in some properties, because the hard-coded defaults are suddenly not the defaults any more (silently).

    The Driver cannot tap the original $SPARK_CONF_DIR; it has to rely on what the Launcher has made available for upload. Does that include a copy of $SPARK_CONF_DIR/hive-site.xml? Looks like that's not the case.
    So you are probably using a Derby thing as a stub.

    And the Driver has to make do with whatever YARN has forced on the container CLASSPATH, in whatever order.
    Besides, the driver.extraClassPath additions do NOT take precedence by default; for that you have to force spark.yarn.user.classpath.first=true (which is translated to the standard Hadoop property whose exact name I can't remember right now, especially since there are multiple props with similar names that may be deprecated and/or not working in Hadoop 2.x)
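
    For instance, forcing that precedence at submit time might look like this (the extraClassPath value is a placeholder for wherever hive-site.xml and the Hive client JARs actually live on your nodes):

        spark-submit \
          --master yarn \
          --deploy-mode cluster \
          --conf spark.yarn.user.classpath.first=true \
          --conf spark.driver.extraClassPath=/etc/hive/conf:/usr/hdp/current/hive-client/lib/* \
          --class com.example.MyApp \
          my-app.jar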


    Think that's bad? Try connecting to a Kerberized HBase in yarn-cluster mode. The connection is done in the Executors, which is another layer of nastiness. But I digress.

    Bottom line: start your diagnostic again.

    A. Are you really, really sure that the mysterious "Metastore connection errors" are caused by missing properties, and specifically the Metastore URI?

    B. By the way, are your users explicitly using a HiveContext???

    C. What exactly is the CLASSPATH that YARN presents to the Driver JVM, and what exactly is the CLASSPATH that the Driver presents to the Hadoop libs when opening the Metastore connection? (See the diagnostic sketch after this list.)

    D. If the CLASSPATH built by YARN is messed up for some reason, what would be the minimal fix -- change in precedence rules? addition? both?
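
    For question C, a rough diagnostic sketch to drop into the Driver (plain JVM and Hadoop Configuration calls, nothing Spark-specific; call it at the top of main() and read the output in the YARN application logs):

        import org.apache.hadoop.conf.Configuration

        object ClasspathCheck {
          def dump(): Unit = {
            // What YARN actually put on the Driver JVM classpath
            println("java.class.path = " + System.getProperty("java.class.path"))

            // Where (if anywhere) hive-site.xml resolves from on that classpath
            val hiveSite = getClass.getClassLoader.getResource("hive-site.xml")
            println("hive-site.xml resolved from: " + hiveSite)

            // What the Hadoop/Hive client libs will see for the Metastore URI
            val conf = new Configuration()
            conf.addResource("hive-site.xml")
            println("hive.metastore.uris = " + conf.get("hive.metastore.uris"))
          }
        }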
