apache-spark-2.0

Prebuilt Spark 2.1.0 creates metastore_db folder and derby.log when launching spark-shell

≡放荡痞女 posted on 2019-12-07 01:56:27
I just upgraded from Spark 2.0.2 to Spark 2.1.0 (by downloading the prebuilt version for Hadoop 2.7 and later). No Hive is installed. Upon launching the spark-shell, the metastore_db/ folder and the derby.log file are created in the launch directory, together with a bunch of warning logs (which were not printed in the previous version). Closer inspection of the debug logs shows that Spark 2.1.0 tries to initialise a HiveMetastoreConnection: 17/01/13 09:14:44 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes. Similar debug logs for Spark 2.0.2 do not show any …
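
A workaround often suggested for this (not part of the excerpt above) is to disable the Hive-backed catalog and point Derby somewhere else. A minimal Scala sketch for a standalone application, assuming the session has not been created yet; in spark-shell the same settings would need to be passed on the command line:

import org.apache.spark.sql.SparkSession

// Keep Derby's artifacts (derby.log, metastore_db) out of the launch directory.
System.setProperty("derby.system.home", "/tmp/derby")

val spark = SparkSession.builder()
  .appName("no-hive-catalog")                               // hypothetical app name
  .master("local[*]")
  .config("spark.sql.catalogImplementation", "in-memory")   // skip the Hive metastore entirely
  .getOrCreate()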

Pass system property to spark-submit and read file from classpath or custom path

限于喜欢 posted on 2019-12-06 22:58:57
Question: I have recently found a way to use logback instead of log4j in Apache Spark (both for local use and spark-submit). However, one last piece is missing. The issue is that Spark tries very hard not to see the logback.xml settings on its classpath. I have already found a way to load it during local execution. What I have so far: basically, I check for the system property logback.configurationFile, but load logback.xml from my /src/main/resources/ just in case: // the same as default: https:/ …
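
As a rough illustration (not the poster's code), this is one way to honour -Dlogback.configurationFile while falling back to a logback.xml bundled on the classpath, assuming logback-classic is the active SLF4J binding:

import java.io.FileInputStream
import ch.qos.logback.classic.LoggerContext
import ch.qos.logback.classic.joran.JoranConfigurator
import org.slf4j.LoggerFactory

// Prefer an explicit -Dlogback.configurationFile=..., otherwise fall back to the
// logback.xml shipped in the jar (e.g. from src/main/resources).
val configStream = sys.props.get("logback.configurationFile") match {
  case Some(path) => new FileInputStream(path)
  case None       => getClass.getResourceAsStream("/logback.xml")
}

val context = LoggerFactory.getILoggerFactory.asInstanceOf[LoggerContext]
context.reset()                                 // drop whatever configuration was auto-loaded
val configurator = new JoranConfigurator()
configurator.setContext(context)
configurator.doConfigure(configStream)          // apply the chosen configuration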

spark off heap memory config and tungsten

一个人想着一个人 posted on 2019-12-06 18:45:11
Question: I thought that with the integration of project Tungsten, Spark would automatically use off-heap memory. What are spark.memory.offHeap.size and spark.memory.offHeap.enabled for? Do I need to manually specify the amount of off-heap memory for Tungsten here? Answer 1: Spark/Tungsten uses encoders/decoders to represent JVM objects as highly specialized Spark SQL type objects, which can then be serialized and operated on in a highly performant way. The internal format representation is highly efficient …
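
For reference, both settings have to be provided before the session is created, and the size is given in bytes; a minimal sketch with a hypothetical app name and size:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("offheap-example")
  .master("local[*]")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", (2L * 1024 * 1024 * 1024).toString)  // 2 GB, in bytes
  .getOrCreate()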

Transforming Spark SQL AST with extraOptimizations

你。 posted on 2019-12-06 14:55:07
I want to take a SQL string as user input, then transform it before execution. In particular, I want to modify the top-level projection (the select clause), injecting additional columns to be retrieved by the query. I was hoping to achieve this by hooking into Catalyst using sparkSession.experimental.extraOptimizations. I know that what I'm attempting isn't strictly speaking an optimisation (the transformation changes the semantics of the SQL statement), but the API still seems suitable. However, my transformation seems to be ignored by the query executor. Here is a minimal example to …
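
For orientation, this is roughly how a rule is registered through that API (LogPlanRule is a hypothetical name; a real rule would return a rewritten plan rather than the input). Note that extraOptimizations rules run in the optimizer, after analysis, so they are generally expected to preserve query semantics:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A no-op rule that only logs the plan it sees.
object LogPlanRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    logInfo(s"Optimizer saw plan:\n$plan")
    plan
  }
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.experimental.extraOptimizations ++= Seq(LogPlanRule)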

Performance of UDAF versus Aggregator in Spark

浪子不回头ぞ posted on 2019-12-06 12:09:29
I am trying to write performance-conscious code in Spark and am wondering whether I should write an Aggregator or a user-defined aggregate function (UDAF) for my rollup operations on a DataFrame. I have not been able to find any data anywhere on how fast each of these approaches is, or on which you should use for Spark 2.0+. Source: https://stackoverflow.com/questions/45356452/performance-of-udaf-versus-aggregator-in-spark
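
For context, a minimal typed Aggregator (not from the question; SumLongs is a hypothetical name) looks like this and is used as a column via .toColumn:

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Sums the Longs of a Dataset[Long]; buffer and output are both plain Longs.
object SumLongs extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(buffer: Long, value: Long): Long = buffer + value
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(reduction: Long): Long = reduction
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

// usage, assuming ds: Dataset[Long]
// val total = ds.select(SumLongs.toColumn).first()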

How to write Spark Structured Streaming Data into Hive?

南笙酒味 posted on 2019-12-06 08:20:28
Question: How do I write Spark Structured Streaming data into Hive? There is df.write().saveAsTable(tablename), but I am not sure whether it writes streaming data. I normally do df.writeStream().trigger(new ProcessingTime(1000)).foreach(new KafkaSink()).start() to write streaming data into Kafka, but I don't see anything similar for writing streaming data into the Hive data warehouse. Any ideas? Answer 1: df.createOrReplaceTempView("mytable") spark.sql("create table mytable as select * from mytable"); or df.write() …
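
One pattern that has since become common (available from Spark 2.4, so not applicable to 2.0/2.1 itself) is foreachBatch, which writes each micro-batch to the Hive table like a normal batch job. A sketch with a hypothetical table name, assuming df is a streaming DataFrame and the session was built with enableHiveSupport():

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

// Append each micro-batch to the Hive table as an ordinary batch write.
val writeBatch: (DataFrame, Long) => Unit = (batchDf, batchId) =>
  batchDf.write.mode("append").saveAsTable("mydb.mytable")

val query = df.writeStream
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .foreachBatch(writeBatch)
  .start()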

Saving The RDD pair in particular format in the output file

夙愿已清 posted on 2019-12-06 05:39:49
I have a JavaPairRDD, let's say data, of type <Integer, List<Integer>>. When I do data.saveAsTextFile("output"), the output contains the data in the following format: (1,[1,2,3,4]) etc. I want something like this in the output file instead: 1 1,2,3,4 (i.e. 1\t1,2,3,4). Any help would be appreciated. You need to understand what's happening here. You have an RDD[T,U] where T and U are some object types; read it as an RDD of tuples of T and U. When you call saveAsTextFile() on this RDD, it essentially converts each element of the RDD to a string, hence the text file generated as output. Now, how is an object of some …
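
The usual fix is to map the pairs to strings yourself before saving; a Scala sketch of the idea (the Java version would use map on the JavaPairRDD in the same way), assuming data is an RDD[(Int, List[Int])]:

// Format each pair as "key<TAB>comma-separated-values" and save as plain text.
val formatted = data.map { case (key, values) =>
  s"$key\t${values.mkString(",")}"   // e.g. "1\t1,2,3,4"
}
formatted.saveAsTextFile("output")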

Workaround for importing spark implicits everywhere

ε祈祈猫儿з posted on 2019-12-05 12:52:21
I'm new to Spark 2.0 and am using Datasets in our code base. I've noticed that I need to import spark.implicits._ everywhere in our code. For example:

File A
class A {
  def job(spark: SparkSession) = {
    import spark.implicits._
    //create dataset ds
    val b = new B(spark)
    b.doSomething(ds)
    doSomething(ds)
  }
  private def doSomething(ds: Dataset[Foo], spark: SparkSession) = {
    import spark.implicits._
    ds.map(e => 1)
  }
}

File B
class B(spark: SparkSession) {
  def doSomething(ds: Dataset[Foo]) = {
    import spark.implicits._
    ds.map(e => "SomeString")
  }
}

What I wanted to ask is whether there's a cleaner way to …
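
One workaround seen in the wild (not from the question; SparkSessionWrapper and the class names are hypothetical) is to expose the implicits once through a small trait built on SQLImplicits, so classes mix it in instead of importing spark.implicits._ in every method:

import org.apache.spark.sql.{Dataset, SparkSession, SQLContext, SQLImplicits}

trait SparkSessionWrapper {
  lazy val spark: SparkSession = SparkSession.builder().getOrCreate()

  // A single implicits object bound to this session; import it where needed.
  object implicits extends SQLImplicits {
    protected override def _sqlContext: SQLContext = spark.sqlContext
  }
}

class B extends SparkSessionWrapper {
  import implicits._
  def doSomething(ds: Dataset[String]): Dataset[Int] = ds.map(_.length)
}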

Out of Memory Error when Reading large file in Spark 2.1.0

左心房为你撑大大i posted on 2019-12-05 12:00:46
I want to use Spark to read a large (51 GB) XML file (on an external HDD) into a DataFrame (using the spark-xml plugin), do some simple mapping/filtering, reorder it, and then write it back to disk as a CSV file. But I always get a java.lang.OutOfMemoryError: Java heap space no matter how I tweak this. I want to understand why increasing the number of partitions doesn't stop the OOM error. Shouldn't it split the task into more parts, so that each individual part is smaller and doesn't cause memory problems? (Spark can't possibly be trying to stuff everything in memory and crashing if it doesn't …
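
For orientation only, a sketch of the read/transform/write pipeline with spark-xml (the row tag, column name and paths are placeholders, and an existing SparkSession spark is assumed); it does not by itself fix the heap issue, which usually also requires raising spark.driver.memory / spark.executor.memory before the JVMs start:

import org.apache.spark.sql.functions.col

val df = spark.read
  .format("com.databricks.spark.xml")   // requires the spark-xml package on the classpath
  .option("rowTag", "record")           // hypothetical repeating element
  .load("/path/to/big.xml")

df.filter(col("someColumn").isNotNull)  // hypothetical filter
  .repartition(200)                     // spread rows across tasks before the write
  .write
  .option("header", "true")
  .csv("/path/to/out")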

How to use dataset to groupby

Deadly posted on 2019-12-05 10:30:28
I have a requirement that I currently implement with an RDD:

val test = Seq(("New York", "Jack"), ("Los Angeles", "Tom"), ("Chicago", "David"), ("Houston", "John"), ("Detroit", "Michael"), ("Chicago", "Andrew"), ("Detroit", "Peter"), ("Detroit", "George"))
sc.parallelize(test).groupByKey().mapValues(_.toList).foreach(println)

The result is:

(New York,List(Jack))
(Detroit,List(Michael, Peter, George))
(Los Angeles,List(Tom))
(Houston,List(John))
(Chicago,List(David, Andrew))

How do I do the same with a Dataset in Spark 2.0? I have a way using a custom function, but it feels so complicated; is there no simple …
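
With the Dataset API the same result can be obtained via the typed groupByKey/mapGroups pair; a minimal sketch (not from the question, runnable locally):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val test = Seq(
  ("New York", "Jack"), ("Los Angeles", "Tom"), ("Chicago", "David"),
  ("Houston", "John"), ("Detroit", "Michael"), ("Chicago", "Andrew"),
  ("Detroit", "Peter"), ("Detroit", "George")
)

// Group by the city and collect the names for each group.
test.toDS()
  .groupByKey(_._1)
  .mapGroups { case (city, rows) => (city, rows.map(_._2).toSeq) }
  .show(false)

The untyped alternative is test.toDF("city", "name").groupBy("city").agg(collect_list("name")).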