apache-spark-2.0

Prebuilt Spark 2.1.0 creates metastore_db folder and derby.log when launching spark-shell

≡放荡痞女 posted on 2019-12-07 01:56:27
I just upgraded from Spark 2.0.2 to Spark 2.1.0 (by downloading the prebuilt version for Hadoop 2.7 and later). No Hive is installed. Upon launching the spark-shell, the metastore_db/ folder and the derby.log file are created in the launch directory, together with a bunch of warning logs (which were not printed in the previous version). Closer inspection of the debug logs shows that Spark 2.1.0 tries to initialise a HiveMetastoreConnection: 17/01/13 09:14:44 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes. Similar debug logs for Spark 2.0.2 do not show any …
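
A workaround often suggested for this (not part of the excerpt above) is to disable the Hive-backed catalog and point Derby somewhere else. A minimal Scala sketch for a standalone application, assuming the session has not been created yet; in spark-shell the same settings would need to be passed on the command line:

import org.apache.spark.sql.SparkSession

// Keep Derby's artifacts (derby.log, metastore_db) out of the launch directory.
System.setProperty("derby.system.home", "/tmp/derby")

val spark = SparkSession.builder()
  .appName("no-hive-catalog")                               // hypothetical app name
  .master("local[*]")
  .config("spark.sql.catalogImplementation", "in-memory")   // skip the Hive metastore entirely
  .getOrCreate()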

Pass system property to spark-submit and read file from classpath or custom path

限于喜欢 posted on 2019-12-06 22:58:57
Question: I have recently found a way to use logback instead of log4j in Apache Spark (both for local use and spark-submit). However, one last piece is missing. The issue is that Spark tries very hard not to see the logback.xml settings on its classpath. I have already found a way to load it during local execution. What I have so far: basically, I check for the system property logback.configurationFile, but load logback.xml from my /src/main/resources/ just in case: // the same as default: https:/ …
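
As a rough illustration (not the poster's code), this is one way to honour -Dlogback.configurationFile while falling back to a logback.xml bundled on the classpath, assuming logback-classic is the active SLF4J binding:

import java.io.FileInputStream
import ch.qos.logback.classic.LoggerContext
import ch.qos.logback.classic.joran.JoranConfigurator
import org.slf4j.LoggerFactory

// Prefer an explicit -Dlogback.configurationFile=..., otherwise fall back to the
// logback.xml shipped in the jar (e.g. from src/main/resources).
val configStream = sys.props.get("logback.configurationFile") match {
  case Some(path) => new FileInputStream(path)
  case None       => getClass.getResourceAsStream("/logback.xml")
}

val context = LoggerFactory.getILoggerFactory.asInstanceOf[LoggerContext]
context.reset()                                 // drop whatever configuration was auto-loaded
val configurator = new JoranConfigurator()
configurator.setContext(context)
configurator.doConfigure(configStream)          // apply the chosen configuration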

spark off heap memory config and tungsten

一个人想着一个人 posted on 2019-12-06 18:45:11
Question: I thought that with the integration of project Tungsten, Spark would automatically use off-heap memory. What are spark.memory.offHeap.size and spark.memory.offHeap.enabled for? Do I need to manually specify the amount of off-heap memory for Tungsten here? Answer 1: Spark/Tungsten uses encoders/decoders to represent JVM objects as highly specialized Spark SQL type objects, which can then be serialized and operated on in a highly performant way. The internal format representation is highly efficient …
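
For reference, both settings have to be provided before the session is created, and the size is given in bytes; a minimal sketch with a hypothetical app name and size:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("offheap-example")
  .master("local[*]")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", (2L * 1024 * 1024 * 1024).toString)  // 2 GB, in bytes
  .getOrCreate()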

Transforming Spark SQL AST with extraOptimizations

你。 posted on 2019-12-06 14:55:07
I want to take a SQL string as user input, then transform it before execution. In particular, I want to modify the top-level projection (the select clause), injecting additional columns to be retrieved by the query. I was hoping to achieve this by hooking into Catalyst using sparkSession.experimental.extraOptimizations. I know that what I'm attempting isn't strictly speaking an optimisation (the transformation changes the semantics of the SQL statement), but the API still seems suitable. However, my transformation seems to be ignored by the query executor. Here is a minimal example to …
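
For orientation, this is roughly how a rule is registered through that API (LogPlanRule is a hypothetical name; a real rule would return a rewritten plan rather than the input). Note that extraOptimizations rules run in the optimizer, after analysis, so they are generally expected to preserve query semantics:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A no-op rule that only logs the plan it sees.
object LogPlanRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    logInfo(s"Optimizer saw plan:\n$plan")
    plan
  }
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.experimental.extraOptimizations ++= Seq(LogPlanRule)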

Performance of UDAF versus Aggregator in Spark

浪子不回头ぞ posted on 2019-12-06 12:09:29
I am trying to write performance-conscious code in Spark and am wondering whether I should write an Aggregator or a user-defined aggregate function (UDAF) for my rollup operations on a DataFrame. I have not been able to find any data anywhere on how fast each of these approaches is, or on which you should use for Spark 2.0+. Source: https://stackoverflow.com/questions/45356452/performance-of-udaf-versus-aggregator-in-spark
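
For context, a minimal typed Aggregator (not from the question; SumLongs is a hypothetical name) looks like this and is used as a column via .toColumn:

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Sums the Longs of a Dataset[Long]; buffer and output are both plain Longs.
object SumLongs extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(buffer: Long, value: Long): Long = buffer + value
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(reduction: Long): Long = reduction
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

// usage, assuming ds: Dataset[Long]
// val total = ds.select(SumLongs.toColumn).first()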

How to write Spark Structured Streaming Data into Hive?

南笙酒味 posted on 2019-12-06 08:20:28
Question: How do I write Spark Structured Streaming data into Hive? There is df.write().saveAsTable(tablename), but I am not sure whether it writes streaming data. I normally do df.writeStream().trigger(new ProcessingTime(1000)).foreach(new KafkaSink()).start() to write streaming data into Kafka, but I don't see anything similar for writing streaming data into the Hive data warehouse. Any ideas? Answer 1: df.createOrReplaceTempView("mytable") spark.sql("create table mytable as select * from mytable"); or df.write() …
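
One pattern that has since become common (available from Spark 2.4, so not applicable to 2.0/2.1 itself) is foreachBatch, which writes each micro-batch to the Hive table like a normal batch job. A sketch with a hypothetical table name, assuming df is a streaming DataFrame and the session was built with enableHiveSupport():

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

// Append each micro-batch to the Hive table as an ordinary batch write.
val writeBatch: (DataFrame, Long) => Unit = (batchDf, batchId) =>
  batchDf.write.mode("append").saveAsTable("mydb.mytable")

val query = df.writeStream
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .foreachBatch(writeBatch)
  .start()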

Saving The RDD pair in particular format in the output file

夙愿已清 posted on 2019-12-06 05:39:49
I have a JavaPairRDD, let's say data, of type <Integer, List<Integer>>. When I do data.saveAsTextFile("output"), the output contains the data in the following format: (1,[1,2,3,4]) etc. I want something like this in the output file instead: 1 1,2,3,4 (i.e. 1\t1,2,3,4). Any help would be appreciated. You need to understand what's happening here. You have an RDD[T,U] where T and U are some object types; read it as an RDD of tuples of T and U. When you call saveAsTextFile() on this RDD, it essentially converts each element of the RDD to a string, hence the text file generated as output. Now, how is an object of some …
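
The usual fix is to map the pairs to strings yourself before saving; a Scala sketch of the idea (the Java version would use map on the JavaPairRDD in the same way), assuming data is an RDD[(Int, List[Int])]:

// Format each pair as "key<TAB>comma-separated-values" and save as plain text.
val formatted = data.map { case (key, values) =>
  s"$key\t${values.mkString(",")}"   // e.g. "1\t1,2,3,4"
}
formatted.saveAsTextFile("output")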

Workaround for importing spark implicits everywhere

ε祈祈猫儿з posted on 2019-12-05 12:52:21
I'm new to Spark 2.0 and am using Datasets in our code base. I've noticed that I need to import spark.implicits._ everywhere in our code. For example:

File A
class A {
  def job(spark: SparkSession) = {
    import spark.implicits._
    //create dataset ds
    val b = new B(spark)
    b.doSomething(ds)
    doSomething(ds)
  }
  private def doSomething(ds: Dataset[Foo], spark: SparkSession) = {
    import spark.implicits._
    ds.map(e => 1)
  }
}

File B
class B(spark: SparkSession) {
  def doSomething(ds: Dataset[Foo]) = {
    import spark.implicits._
    ds.map(e => "SomeString")
  }
}

What I wanted to ask is whether there's a cleaner way to …
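
One workaround seen in the wild (not from the question; SparkSessionWrapper and the class names are hypothetical) is to expose the implicits once through a small trait built on SQLImplicits, so classes mix it in instead of importing spark.implicits._ in every method:

import org.apache.spark.sql.{Dataset, SparkSession, SQLContext, SQLImplicits}

trait SparkSessionWrapper {
  lazy val spark: SparkSession = SparkSession.builder().getOrCreate()

  // A single implicits object bound to this session; import it where needed.
  object implicits extends SQLImplicits {
    protected override def _sqlContext: SQLContext = spark.sqlContext
  }
}

class B extends SparkSessionWrapper {
  import implicits._
  def doSomething(ds: Dataset[String]): Dataset[Int] = ds.map(_.length)
}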

Out of Memory Error when Reading large file in Spark 2.1.0

左心房为你撑大大i posted on 2019-12-05 12:00:46
I want to use Spark to read a large (51 GB) XML file (on an external HDD) into a DataFrame (using the spark-xml plugin), do some simple mapping/filtering, reorder it, and then write it back to disk as a CSV file. But I always get a java.lang.OutOfMemoryError: Java heap space no matter how I tweak this. I want to understand why increasing the number of partitions doesn't stop the OOM error. Shouldn't it split the task into more parts, so that each individual part is smaller and doesn't cause memory problems? (Spark can't possibly be trying to stuff everything in memory and crashing if it doesn't …
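
For orientation only, a sketch of the read/transform/write pipeline with spark-xml (the row tag, column name and paths are placeholders, and an existing SparkSession spark is assumed); it does not by itself fix the heap issue, which usually also requires raising spark.driver.memory / spark.executor.memory before the JVMs start:

import org.apache.spark.sql.functions.col

val df = spark.read
  .format("com.databricks.spark.xml")   // requires the spark-xml package on the classpath
  .option("rowTag", "record")           // hypothetical repeating element
  .load("/path/to/big.xml")

df.filter(col("someColumn").isNotNull)  // hypothetical filter
  .repartition(200)                     // spread rows across tasks before the write
  .write
  .option("header", "true")
  .csv("/path/to/out")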

How to use dataset to groupby

Deadly posted on 2019-12-05 10:30:28
I have a requirement that I currently implement with an RDD:

val test = Seq(("New York", "Jack"), ("Los Angeles", "Tom"), ("Chicago", "David"), ("Houston", "John"), ("Detroit", "Michael"), ("Chicago", "Andrew"), ("Detroit", "Peter"), ("Detroit", "George"))
sc.parallelize(test).groupByKey().mapValues(_.toList).foreach(println)

The result is:

(New York,List(Jack))
(Detroit,List(Michael, Peter, George))
(Los Angeles,List(Tom))
(Houston,List(John))
(Chicago,List(David, Andrew))

How do I do the same with a Dataset in Spark 2.0? I have a way using a custom function, but it feels so complicated; is there no simple …
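
With the Dataset API the same result can be obtained via the typed groupByKey/mapGroups pair; a minimal sketch (not from the question, runnable locally):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val test = Seq(
  ("New York", "Jack"), ("Los Angeles", "Tom"), ("Chicago", "David"),
  ("Houston", "John"), ("Detroit", "Michael"), ("Chicago", "Andrew"),
  ("Detroit", "Peter"), ("Detroit", "George")
)

// Group by the city and collect the names for each group.
test.toDS()
  .groupByKey(_._1)
  .mapGroups { case (city, rows) => (city, rows.map(_._2).toSeq) }
  .show(false)

The untyped alternative is test.toDF("city", "name").groupBy("city").agg(collect_list("name")).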