Using a Hive database in Spark

Question

I am new to Spark and am trying to run some queries on the TPC-DS benchmark tables (http://www.tpc.org/tpcds/) using the Hortonworks Sandbox. Everything works when I use Hive through the shell or the Hive view on the sandbox. The problem is that I don't know how to connect to the database from Spark. How can I use a Hive database in Spark to run the queries? The only solution I know so far is to rebuild each table manually and load the data into it with the following Scala code, which is far from ideal.

scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS employee(id INT, name STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'")
scala> sqlContext.sql("LOAD DATA LOCAL INPATH 'employee.txt' INTO TABLE employee")
scala> val result = sqlContext.sql("FROM employee SELECT id, name, age")
scala> result.show()

I have also read a bit about hive-site.xml, but I don't know where to find it or what changes to make to it in order to connect to the database.

Answer 1:

There is no need to connect to a specific database when using Spark and HiveContext.

You simply need to copy the hive-site.xml file into the Spark conf folder (creating a symlink works as well).

cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/
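
The symlink variant could look roughly like the sketch below. It assumes HIVE_HOME and SPARK_HOME are set to absolute paths. The helper name link_hive_site is made up for illustration, and on some distributions the Hive configuration may live elsewhere, so adjust the directories to your installation.

```shell
# Hypothetical helper: link Hive's hive-site.xml into Spark's conf directory.
# Usage: link_hive_site <hive_conf_dir> <spark_conf_dir>
# Pass absolute paths, otherwise the resulting symlink may not resolve.
link_hive_site() {
    hive_conf_dir="$1"
    spark_conf_dir="$2"
    src="$hive_conf_dir/hive-site.xml"

    # Fail early if the Hive configuration file is not where we expect it.
    if [ ! -f "$src" ]; then
        echo "hive-site.xml not found in $hive_conf_dir" >&2
        return 1
    fi

    # -s creates a symbolic link; -f replaces an existing file or link of the same name.
    ln -sf "$src" "$spark_conf_dir/hive-site.xml"
}

# Example call, assuming both environment variables are set:
# link_hive_site "$HIVE_HOME/conf" "$SPARK_HOME/conf"
```

Compared with cp, a symlink means later changes to the Hive configuration are picked up by Spark without copying the file again.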

Then, in Spark, you can do something like this (I'm not a Scala user, so the syntax might be off):

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
val result = hc.sql("SELECT col1, col2, col3 FROM dbname.tablename")
result.show()

Source: https://stackoverflow.com/questions/38770503/using-hive-database-in-spark

Tags

apache-spark

Hive

apache-spark-sql

hortonworks-sandbox