apache-spark-2.0

Split Spark DataFrame into two DataFrames (70% and 30%) based on id column while preserving order

Posted by 房东的猫 on 2020-05-15 08:45:11
Question: I have a Spark DataFrame that looks like this:

id  start_time  feature
1   01-01-2018  3.567
1   01-02-2018  4.454
1   01-03-2018  6.455
2   01-02-2018  343.4
2   01-08-2018  45.4
3   02-04-2018  43.56
3   02-07-2018  34.56
3   03-07-2018  23.6

I want to be able to split this into two DataFrames based on the id column. So I should group by the id column, sort by start_time, and take 70% of the rows into one DataFrame and 30% of the rows into another DataFrame while preserving the order. The result should look like: Dataframe1: id
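A minimal Scala sketch of one way to do such a split (not from the question), assuming the DataFrame is bound to df, using percent_rank over a window partitioned by id and ordered by start_time:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, percent_rank}

// Rank rows within each id by start_time as a fraction in [0, 1].
val w = Window.partitionBy("id").orderBy("start_time")
val ranked = df.withColumn("pr", percent_rank().over(w))

// Roughly the first 70% of each group versus the remaining 30%, order preserved.
val df70 = ranked.filter(col("pr") <= 0.7).drop("pr")
val df30 = ranked.filter(col("pr") > 0.7).drop("pr")

With only a handful of rows per id the cut cannot land exactly on 70/30; percent_rank just gives a deterministic, order-preserving threshold per group.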

Avoid starting HiveThriftServer2 with created context programmatically

Posted by 二次信任 on 2020-01-14 07:17:06
Question: We are trying to use the Thrift server to query data from Spark temp tables, in Spark 2.0.0. First, we created a SparkSession with Hive support enabled. Currently, we start the Thrift server with the sqlContext like this:

HiveThriftServer2.startWithContext(spark.sqlContext());

We have a Spark stream with a registered temp table "spark_temp_table":

StreamingQuery streamingQuery = streamedData.writeStream()
    .format("memory")
    .queryName("spark_temp_table")
    .start();

With beeline we are able to see temp tables
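For reference, a minimal Scala sketch of the setup the question describes (streamedData stands for the streaming DataFrame from the question; other names are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// Session with Hive support, as in the question.
val spark = SparkSession.builder()
  .appName("thrift-temp-tables")
  .enableHiveSupport()
  .getOrCreate()

// Expose this session's temp tables through the Thrift server.
HiveThriftServer2.startWithContext(spark.sqlContext)

// Memory-sink streaming query registered under the temp table name "spark_temp_table".
val streamingQuery = streamedData.writeStream
  .format("memory")
  .queryName("spark_temp_table")
  .start()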

Problems creating a DataFrame from Rows containing Option[T]

Posted by 旧巷老猫 on 2020-01-11 10:25:33
Question: I'm migrating some code from Spark 1.6 to Spark 2.1 and struggling with the following issue. This worked perfectly in Spark 1.6:

import org.apache.spark.sql.types.{LongType, StructField, StructType}

val schema = StructType(Seq(StructField("i", LongType, nullable = true)))
val rows = sparkContext.parallelize(Seq(Row(Some(1L))))
sqlContext.createDataFrame(rows, schema).show

The same code in Spark 2.1.1:

import org.apache.spark.sql.types.{FloatType, LongType, StructField, StructType}

val schema =
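One workaround sketch (an assumption, not from the question, and assuming a SparkSession named spark): unwrap the Option before building the Row, since Spark 2.x expects each cell to hold the external type declared by the schema (a Long or null for a nullable LongType), whereas Spark 1.6 tolerated Some(1L):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val schema = StructType(Seq(StructField("i", LongType, nullable = true)))

// Turn Option[Long] into a boxed Long or null before putting it in a Row.
val data: Seq[Option[Long]] = Seq(Some(1L), None)
val rows = spark.sparkContext.parallelize(data.map(o => Row(o.map(Long.box).orNull)))

spark.createDataFrame(rows, schema).show()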

PySpark: KeyError when converting a DataFrame column of String type to Double

Posted by 生来就可爱ヽ(ⅴ<●) on 2020-01-07 03:00:16
Question: I'm trying to learn machine learning with PySpark. I have a dataset with a couple of String columns whose values are either True or False, or Yes or No. I'm working with DecisionTree, and I wanted to convert these String values to the corresponding Double values, i.e. True and Yes should become 1.0, and False and No should become 0.0. I saw a tutorial where they did the same thing, and I came up with this code:

df = sqlContext.read.csv("C:/../churn-bigml-20.csv",inferSchema=True,header
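The question is PySpark, but the same Column functions exist in both languages; a minimal Scala sketch of the mapping with when/isin, using a hypothetical column name churn:

import org.apache.spark.sql.functions.{col, when}

// "True"/"Yes" -> 1.0, "False"/"No" -> 0.0, anything else -> null.
val converted = df.withColumn(
  "churn_num",
  when(col("churn").isin("True", "Yes"), 1.0)
    .when(col("churn").isin("False", "No"), 0.0)
)

In PySpark the equivalent building blocks are pyspark.sql.functions.when and Column.isin.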

Workaround for importing spark implicits everywhere

Posted by *爱你&永不变心* on 2020-01-02 04:45:08
Question: I'm new to Spark 2.0 and using Datasets in our code base. I've noticed that I need to import spark.implicits._ everywhere in our code. For example:

File A

class A {
  def job(spark: SparkSession) = {
    import spark.implicits._
    //create dataset ds
    val b = new B(spark)
    b.doSomething(ds)
    doSomething(ds)
  }

  private def doSomething(ds: Dataset[Foo], spark: SparkSession) = {
    import spark.implicits._
    ds.map(e => 1)
  }
}

File B

class B(spark: SparkSession) {
  def doSomething(ds: Dataset[Foo]) = {
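One way to cut down on the per-method imports, sketched under the assumption that the session is passed into the class (Foo here is a stand-in for the question's type): import the implicits once at class level, which works because a constructor parameter is a stable identifier:

import org.apache.spark.sql.{Dataset, SparkSession}

// Stand-in for the question's element type.
case class Foo(value: Int)

class B(spark: SparkSession) {
  // One import covers every method in the class body.
  import spark.implicits._

  def doSomething(ds: Dataset[Foo]): Dataset[Int] = ds.map(_.value)

  def doSomethingElse(ds: Dataset[Foo]): Dataset[Foo] = ds.filter(_.value > 0)
}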

Split dataset based on column values in Spark

Posted by …衆ロ難τιáo~ on 2019-12-23 07:36:41
Question: I am trying to split the Dataset into different Datasets based on the contents of the Manufacturer column. It is very slow. Please suggest a way to improve the code so that it executes faster and uses less Java code.

List<Row> lsts = countsByAge.collectAsList();

for (Row lst : lsts) {
    String man = lst.toString();
    man = man.replaceAll("[\\p{Ps}\\p{Pe}]", "");
    Dataset<Row> DF = src.filter("Manufacturer='" + man + "'");
    DF.show();
}

The code, input and output Datasets are as shown below. package org
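A sketch of one alternative in Scala (paths and names are illustrative): instead of collecting rows to the driver and filtering once per value, either write the data partitioned by the column in a single pass, or collect only the distinct keys:

import org.apache.spark.sql.functions.col
import spark.implicits._   // assumes a SparkSession named spark; needed for .as[String]

// One pass: one output directory per distinct Manufacturer value.
src.write.partitionBy("Manufacturer").parquet("/tmp/by_manufacturer")

// If separate Datasets are really required, collect just the keys, not the rows.
val manufacturers = src.select("Manufacturer").distinct().as[String].collect()
val splits = manufacturers.map(m => m -> src.filter(col("Manufacturer") === m)).toMap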

Prebuilt Spark 2.1.0 creates metastore_db folder and derby.log when launching spark-shell

Posted by 我们两清 on 2019-12-23 02:38:39
Question: I just upgraded from Spark 2.0.2 to Spark 2.1.0 (by downloading the prebuilt version for Hadoop 2.7 and later). No Hive is installed. Upon launching the spark-shell, the metastore_db/ folder and derby.log file are created at the launch location, together with a bunch of warning logs (which were not printed in the previous version). Closer inspection of the debug logs shows that Spark 2.1.0 tries to initialise a HiveMetastoreConnection:

17/01/13 09:14:44 INFO HiveUtils: Initializing
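For application code (not spark-shell), a minimal sketch of a session that avoids the Hive catalog entirely, so no Derby-backed metastore_db/ or derby.log should be created (the warehouse path is illustrative):

import org.apache.spark.sql.SparkSession

// No enableHiveSupport(): the in-memory catalog is used instead of a Hive metastore.
val spark = SparkSession.builder()
  .appName("no-hive-catalog")
  .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")
  .getOrCreate()

The prebuilt spark-shell behaves differently because it enables the Hive catalog by default when the Hive classes are on the classpath.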

Spark SQL issue with columns specified

Posted by 非 Y 不嫁゛ on 2019-12-23 01:45:17
Question: We are trying to replicate an Oracle DB into Hive. We get the queries from Oracle and run them in Hive. So we get them in this format:

INSERT INTO schema.table(col1,col2) VALUES ('val','val');

While this query works in Hive directly, when I use spark.sql I get the following error:

org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'emp_id' expecting {'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 20)

== SQL ==
insert into ss.tab(emp_id
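A hedged workaround sketch: the parse error suggests Spark 2.x does not accept a column list in INSERT INTO ... VALUES, so either drop the list (the values must then cover all columns in table order) or append through the DataFrame API. Column names below are illustrative, and a SparkSession named spark is assumed:

// Works when the values cover every column of ss.tab in order.
spark.sql("INSERT INTO ss.tab VALUES ('val1', 'val2')")

// Or build a DataFrame and append it by position.
import spark.implicits._
Seq(("val1", "val2")).toDF("emp_id", "emp_name")
  .write.insertInto("ss.tab")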

Performance of UDAF versus Aggregator in Spark

Posted by 拈花ヽ惹草 on 2019-12-22 17:10:11
Question: I am trying to write performance-conscious code in Spark and am wondering whether I should write an Aggregator or a user-defined aggregate function (UDAF) for my rollup operations on a DataFrame. I have not been able to find any data anywhere on how fast each of these methods is, or which you should be using for Spark 2.0+.

Source: https://stackoverflow.com/questions/45356452/performance-of-udaf-versus-aggregator-in-spark
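For reference, a minimal Spark 2.x Aggregator sketch (the input type and names are illustrative, not from the question); it is used on a typed Dataset via toColumn:

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical input type for a rollup.
case class Sale(region: String, amount: Double)

object SumAmount extends Aggregator[Sale, Double, Double] {
  def zero: Double = 0.0
  def reduce(buf: Double, s: Sale): Double = buf + s.amount
  def merge(b1: Double, b2: Double): Double = b1 + b2
  def finish(buf: Double): Double = buf
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Usage on a Dataset[Sale]:
// ds.groupByKey(_.region).agg(SumAmount.toColumn).show()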

How to enable Tungsten optimization in Spark 2?

Posted by 孤街醉人 on 2019-12-22 07:03:18
Question: I just built Spark 2 with Hive support and deployed it to a cluster with Hortonworks 2.3.4. However, I find that this Spark 2.0.3 is slower than the standard Spark 1.5.3 that comes with HDP 2.3. When I check explain, it seems that my Spark 2.0.3 is not using Tungsten. Do I need to create a special build to enable Tungsten?

Spark 1.5.3 explain:

== Physical Plan ==
TungstenAggregate(key=[id#2], functions=[], output=[id#2])
 TungstenExchange hashpartitioning(id#2)
  TungstenAggregate(key=[id#2], functions=
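A short sketch (assuming a spark-shell session bound to spark) for checking that whole-stage code generation, the Spark 2.x successor to the Tungsten* operators, is active; it is on by default and needs no special build:

// Should return "true"; whole-stage codegen is enabled out of the box in Spark 2.x.
spark.conf.get("spark.sql.codegen.wholeStage", "true")

// Operators prefixed with '*' (e.g. *HashAggregate) run inside WholeStageCodegen;
// the TungstenAggregate/TungstenExchange names from Spark 1.5 no longer appear.
spark.range(100).groupBy("id").count().explain()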