apache-spark-2.0

Problems to create DataFrame from Rows containing Option[T]

Submitted by 半腔热情 on 2019-12-02 01:29:03

I'm migrating some code from Spark 1.6 to Spark 2.1 and struggling with the following issue. This worked perfectly in Spark 1.6:

    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    val schema = StructType(Seq(StructField("i", LongType, nullable = true)))
    val rows = sparkContext.parallelize(Seq(Row(Some(1L))))
    sqlContext.createDataFrame(rows, schema).show

The same code in Spark 2.1.1:

    import org.apache.spark.sql.types.{FloatType, LongType, StructField, StructType}

    val schema = StructType(Seq(StructField("i", LongType, nullable = true)))
    val rows = ss.sparkContext.parallelize(Seq(Row( ...
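A sketch of two workarounds commonly suggested for this kind of error, assuming a SparkSession named spark; neither is taken from the original post. A Row paired with an explicit schema is expected to hold the plain value (or null) rather than Some(...), whereas Dataset encoders do understand Option fields inside a Product:

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val schema = StructType(Seq(StructField("i", LongType, nullable = true)))

    // Workaround 1: unwrap the Option before building the Row.
    val rows = spark.sparkContext.parallelize(
      Seq(Some(1L), None).map(v => Row(v.map(Long.box).orNull)))
    spark.createDataFrame(rows, schema).show()

    // Workaround 2: keep the Option but put it inside a tuple (or case class),
    // where the encoder turns Some/None into value/null for a nullable column.
    Seq(Tuple1(Option(1L)), Tuple1(Option.empty[Long])).toDF("i").show()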

How to run multiple instances of Spark 2.0 at once (in multiple Jupyter Notebooks)?

Submitted by 房东的猫 on 2019-12-01 10:56:53

I have a script which conveniently allows me to use Spark in a Jupyter Notebook. This is great, except that when I run Spark commands in a second notebook (for instance to test out some scratch work), I get a very long error message, the key parts of which seem to be:

    Py4JJavaError: An error occurred while calling o31.json.
    : java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
    ...
    Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /metastore_db

The problem seems to be that I ...
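One workaround often suggested for this Derby lock conflict (not taken from the original post) is to give each notebook's session its own metastore location instead of the shared metastore_db directory. A minimal sketch in Scala; the same configuration keys can be passed wherever the session is built, including from PySpark, and the paths are only examples:

    import org.apache.spark.sql.SparkSession

    // Give this notebook its own warehouse and embedded Derby metastore directory,
    // so a second notebook does not fight over the default ./metastore_db lock.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("notebook-2")
      .config("spark.sql.warehouse.dir", "/tmp/notebook-2/warehouse")
      .config("spark.hadoop.javax.jdo.option.ConnectionURL",
        "jdbc:derby:;databaseName=/tmp/notebook-2/metastore_db;create=true")
      .getOrCreate()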

Schema for type Any is not supported

Submitted by 只愿长相守 on 2019-12-01 09:25:13

I'm trying to create a Spark UDF to extract a Map of (key, value) pairs from a user-defined case class. The Scala function seems to work fine, but when I try to convert it to a UDF in Spark 2.0, I run into the "Schema for type Any is not supported" error.

    case class myType(c1: String, c2: Int)

    def getCaseClassParams(cc: Product): Map[String, Any] = {
      cc
        .getClass
        .getDeclaredFields           // all field names
        .map(_.getName)
        .zip(cc.productIterator.to)  // zipped with all values
        .toMap
    }

But when I try to instantiate a function value as a UDF, it results in the following error:

    val ccUDF = udf{ ...
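The usual explanation is that Spark cannot derive a Catalyst schema for Any, so the UDF has to return a concrete supported type. A minimal sketch of one common workaround, stringifying the values so the UDF returns Map[String, String]; the names below are illustrative rather than taken from an answer in the thread:

    import org.apache.spark.sql.functions.udf

    case class MyType(c1: String, c2: Int)

    // Return Map[String, String] instead of Map[String, Any]: every type in the
    // UDF's result must be one Spark SQL can encode, and Any is not.
    def caseClassParams(cc: Product): Map[String, String] =
      cc.getClass.getDeclaredFields
        .map(_.getName)
        .zip(cc.productIterator.map(_.toString).toSeq)
        .toMap

    val paramsUDF = udf((c1: String, c2: Int) => caseClassParams(MyType(c1, c2)))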

How to specify sql dialect when creating spark dataframe from JDBC?

Submitted by 僤鯓⒐⒋嵵緔 on 2019-12-01 08:40:36

I'm having an issue reading data via a custom JDBC driver with Spark. How would I go about overriding the SQL dialect inferred from the JDBC URL? The database in question is Vitess (https://github.com/youtube/vitess), which runs a MySQL variant, so I want to specify a MySQL dialect. The JDBC URL begins with jdbc:vitess/. Otherwise the DataFrameReader infers a default dialect which uses " as its quote identifier. As a result, queries via spark.read.jdbc get sent as Select 'id', 'col2', col3', 'etc ...
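Spark chooses a JdbcDialect by matching on the JDBC URL, so one approach commonly suggested for this situation (not from the original post) is to register a custom dialect that claims the jdbc:vitess prefix and quotes identifiers MySQL-style with backticks. A minimal sketch:

    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

    // A tiny dialect that recognises jdbc:vitess URLs and quotes identifiers the
    // way the built-in MySQL dialect does.
    object VitessDialect extends JdbcDialect {
      override def canHandle(url: String): Boolean = url.startsWith("jdbc:vitess")
      override def quoteIdentifier(colName: String): String =
        s"`${colName.replace("`", "``")}`"
    }

    // Register it before calling spark.read.jdbc(...); Spark will then pick it for
    // any connection whose URL this dialect claims.
    JdbcDialects.registerDialect(VitessDialect)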

How to create SparkSession from existing SparkContext

Submitted by 早过忘川 on 2019-11-30 00:27:21

I have a Spark application which uses the new Spark 2.0 API with SparkSession. I am building this application on top of another application which uses SparkContext. I would like to pass the SparkContext to my application and initialize a SparkSession using the existing SparkContext. However, I could not find a way to do that. I found that the SparkSession constructor taking a SparkContext is private, so I can't initialize it that way, and the builder does not offer any setSparkContext method. Do you ...
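For reference, a minimal sketch of the approach most answers converge on, assuming a SparkContext is already running in the JVM: SparkSession.builder.getOrCreate reuses the active context rather than starting a new one, so it is enough to pass its configuration through the builder:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SparkSession

    def sessionFrom(sc: SparkContext): SparkSession =
      // getOrCreate picks up the already-running SparkContext in this JVM instead
      // of creating a new one; its settings are carried over via sc.getConf.
      SparkSession.builder().config(sc.getConf).getOrCreate()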

How to cast a WrappedArray[WrappedArray[Float]] to Array[Array[Float]] in spark (scala)

Submitted by …衆ロ難τιáo~ on 2019-11-29 14:06:59

I'm using Spark 2.0. I have a column of my dataframe containing a WrappedArray of WrappedArrays of Float. An example of a row would be:

    [[1.0 2.0 2.0][6.0 5.0 2.0][4.0 2.0 3.0]]

I'm trying to transform this column into an Array[Array[Float]]. What I have tried so far is the following:

    dataframe.select("mycolumn").rdd.map(r => r.asInstanceOf[Array[Array[Float]]])

but I get the following error:

    Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to [[F

Any idea would be highly appreciated. Thanks

Try this:

    val wawa: WrappedArray ...
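For reference, a minimal sketch of the usual way to pull the nested collection out of each Row and convert it; the column name mycolumn comes from the question, the rest is illustrative and is not the answer that is truncated above:

    import org.apache.spark.sql.Row

    // Each RDD element is a Row wrapping one column; read that column as a Seq of
    // Seq of Float (WrappedArray is a Seq), then convert both levels to arrays.
    val arrays = dataframe.select("mycolumn").rdd.map { row: Row =>
      row.getAs[Seq[Seq[Float]]](0).map(_.toArray).toArray
    }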

Why does using cache on streaming Datasets fail with “AnalysisException: Queries with streaming sources must be executed with writeStream.start()”?

Submitted by 本小妞迷上赌 on 2019-11-29 01:58:21

    SparkSession
      .builder
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "C:/tmp/spark")
      .config("spark.sql.streaming.checkpointLocation", "C:/tmp/spark/spark-checkpoint")
      .appName("my-test")
      .getOrCreate
      .readStream
      .schema(schema)
      .json("src/test/data")
      .cache
      .writeStream
      .start
      .awaitTermination

While executing this sample in Spark 2.1.0 I got an error. Without the .cache option it worked as intended, but with the .cache option I got:

    Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
    FileSource[src/test ...
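The restriction behind the error is that cache tries to materialise the Dataset eagerly, which a streaming source cannot do outside writeStream.start. A hedged sketch of one workaround sometimes used for testing (not taken from the original post): route the stream into the memory sink and cache the batch table that the sink maintains; schema is assumed to be the same value as in the snippet above:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .master("local[*]")
      .appName("my-test")
      .getOrCreate()

    // Stream into the in-memory sink under a query name...
    val query = spark.readStream
      .schema(schema)
      .json("src/test/data")
      .writeStream
      .format("memory")
      .queryName("mydata")
      .start()

    // ...and cache the materialised, non-streaming table kept by the sink.
    val cached = spark.table("mydata").cache()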