apache-spark-2.0

Split Spark DataFrame into two DataFrames (70% and 30%) based on id column while preserving order

Posted by 房东的猫 on 2020-05-15 08:45:11
Question: I have a Spark DataFrame that looks like this:

id  start_time  feature
1   01-01-2018  3.567
1   01-02-2018  4.454
1   01-03-2018  6.455
2   01-02-2018  343.4
2   01-08-2018  45.4
3   02-04-2018  43.56
3   02-07-2018  34.56
3   03-07-2018  23.6

I want to be able to split this into two DataFrames based on the id column. So I should group by the id column, sort by start_time, and take 70% of the rows into one DataFrame and 30% of the rows into another DataFrame while preserving the order. The result should look like: Dataframe1: id
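A minimal Scala sketch of one way to do such a split (not from the question), assuming the DataFrame is bound to df, using percent_rank over a window partitioned by id and ordered by start_time:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, percent_rank}

// Rank rows within each id by start_time as a fraction in [0, 1].
val w = Window.partitionBy("id").orderBy("start_time")
val ranked = df.withColumn("pr", percent_rank().over(w))

// Roughly the first 70% of each group versus the remaining 30%, order preserved.
val df70 = ranked.filter(col("pr") <= 0.7).drop("pr")
val df30 = ranked.filter(col("pr") > 0.7).drop("pr")

With only a handful of rows per id the cut cannot land exactly on 70/30; percent_rank just gives a deterministic, order-preserving threshold per group.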

Avoid starting HiveThriftServer2 with created context programmatically

Posted by 二次信任 on 2020-01-14 07:17:06
Question: We are trying to use the Thrift server to query data from Spark temp tables, in Spark 2.0.0. First, we created a SparkSession with Hive support enabled. Currently, we start the Thrift server with the sqlContext like this:

HiveThriftServer2.startWithContext(spark.sqlContext());

We have a Spark stream with a registered temp table "spark_temp_table":

StreamingQuery streamingQuery = streamedData.writeStream()
    .format("memory")
    .queryName("spark_temp_table")
    .start();

With beeline we are able to see temp tables
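For reference, a minimal Scala sketch of the setup the question describes (streamedData stands for the streaming DataFrame from the question; other names are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// Session with Hive support, as in the question.
val spark = SparkSession.builder()
  .appName("thrift-temp-tables")
  .enableHiveSupport()
  .getOrCreate()

// Expose this session's temp tables through the Thrift server.
HiveThriftServer2.startWithContext(spark.sqlContext)

// Memory-sink streaming query registered under the temp table name "spark_temp_table".
val streamingQuery = streamedData.writeStream
  .format("memory")
  .queryName("spark_temp_table")
  .start()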

Problems creating a DataFrame from Rows containing Option[T]

Posted by 旧巷老猫 on 2020-01-11 10:25:33
Question: I'm migrating some code from Spark 1.6 to Spark 2.1 and struggling with the following issue. This worked perfectly in Spark 1.6:

import org.apache.spark.sql.types.{LongType, StructField, StructType}

val schema = StructType(Seq(StructField("i", LongType, nullable = true)))
val rows = sparkContext.parallelize(Seq(Row(Some(1L))))
sqlContext.createDataFrame(rows, schema).show

The same code in Spark 2.1.1:

import org.apache.spark.sql.types.{FloatType, LongType, StructField, StructType}

val schema =
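One workaround sketch (an assumption, not from the question, and assuming a SparkSession named spark): unwrap the Option before building the Row, since Spark 2.x expects each cell to hold the external type declared by the schema (a Long or null for a nullable LongType), whereas Spark 1.6 tolerated Some(1L):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val schema = StructType(Seq(StructField("i", LongType, nullable = true)))

// Turn Option[Long] into a boxed Long or null before putting it in a Row.
val data: Seq[Option[Long]] = Seq(Some(1L), None)
val rows = spark.sparkContext.parallelize(data.map(o => Row(o.map(Long.box).orNull)))

spark.createDataFrame(rows, schema).show()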

PySpark: KeyError when converting a DataFrame column of String type to Double

Posted by 生来就可爱ヽ(ⅴ<●) on 2020-01-07 03:00:16
Question: I'm trying to learn machine learning with PySpark. I have a dataset with a couple of String columns whose values are either True or False, or Yes or No. I'm working with DecisionTree, and I wanted to convert these String values to the corresponding Double values, i.e. True and Yes should become 1.0, and False and No should become 0.0. I saw a tutorial where they did the same thing, and I came up with this code:

df = sqlContext.read.csv("C:/../churn-bigml-20.csv",inferSchema=True,header
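The question is PySpark, but the same Column functions exist in both languages; a minimal Scala sketch of the mapping with when/isin, using a hypothetical column name churn:

import org.apache.spark.sql.functions.{col, when}

// "True"/"Yes" -> 1.0, "False"/"No" -> 0.0, anything else -> null.
val converted = df.withColumn(
  "churn_num",
  when(col("churn").isin("True", "Yes"), 1.0)
    .when(col("churn").isin("False", "No"), 0.0)
)

In PySpark the equivalent building blocks are pyspark.sql.functions.when and Column.isin.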

Workaround for importing spark implicits everywhere

Posted by *爱你&永不变心* on 2020-01-02 04:45:08
Question: I'm new to Spark 2.0 and using Datasets in our code base. I've noticed that I need to import spark.implicits._ everywhere in our code. For example:

File A

class A {
  def job(spark: SparkSession) = {
    import spark.implicits._
    //create dataset ds
    val b = new B(spark)
    b.doSomething(ds)
    doSomething(ds)
  }

  private def doSomething(ds: Dataset[Foo], spark: SparkSession) = {
    import spark.implicits._
    ds.map(e => 1)
  }
}

File B

class B(spark: SparkSession) {
  def doSomething(ds: Dataset[Foo]) = {
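One way to cut down on the per-method imports, sketched under the assumption that the session is passed into the class (Foo here is a stand-in for the question's type): import the implicits once at class level, which works because a constructor parameter is a stable identifier:

import org.apache.spark.sql.{Dataset, SparkSession}

// Stand-in for the question's element type.
case class Foo(value: Int)

class B(spark: SparkSession) {
  // One import covers every method in the class body.
  import spark.implicits._

  def doSomething(ds: Dataset[Foo]): Dataset[Int] = ds.map(_.value)

  def doSomethingElse(ds: Dataset[Foo]): Dataset[Foo] = ds.filter(_.value > 0)
}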

Split dataset based on column values in Spark

Posted by …衆ロ難τιáo~ on 2019-12-23 07:36:41
Question: I am trying to split the Dataset into different Datasets based on the contents of the Manufacturer column. It is very slow. Please suggest a way to improve the code so that it executes faster and uses less Java code.

List<Row> lsts = countsByAge.collectAsList();

for (Row lst : lsts) {
    String man = lst.toString();
    man = man.replaceAll("[\\p{Ps}\\p{Pe}]", "");
    Dataset<Row> DF = src.filter("Manufacturer='" + man + "'");
    DF.show();
}

The code, input and output Datasets are as shown below. package org
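A sketch of one alternative in Scala (paths and names are illustrative): instead of collecting rows to the driver and filtering once per value, either write the data partitioned by the column in a single pass, or collect only the distinct keys:

import org.apache.spark.sql.functions.col
import spark.implicits._   // assumes a SparkSession named spark; needed for .as[String]

// One pass: one output directory per distinct Manufacturer value.
src.write.partitionBy("Manufacturer").parquet("/tmp/by_manufacturer")

// If separate Datasets are really required, collect just the keys, not the rows.
val manufacturers = src.select("Manufacturer").distinct().as[String].collect()
val splits = manufacturers.map(m => m -> src.filter(col("Manufacturer") === m)).toMap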

Prebuilt Spark 2.1.0 creates metastore_db folder and derby.log when launching spark-shell

Posted by 我们两清 on 2019-12-23 02:38:39
Question: I just upgraded from Spark 2.0.2 to Spark 2.1.0 (by downloading the prebuilt version for Hadoop 2.7 and later). No Hive is installed. Upon launching the spark-shell, the metastore_db/ folder and derby.log file are created at the launch location, together with a bunch of warning logs (which were not printed in the previous version). Closer inspection of the debug logs shows that Spark 2.1.0 tries to initialise a HiveMetastoreConnection:

17/01/13 09:14:44 INFO HiveUtils: Initializing
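For application code (not spark-shell), a minimal sketch of a session that avoids the Hive catalog entirely, so no Derby-backed metastore_db/ or derby.log should be created (the warehouse path is illustrative):

import org.apache.spark.sql.SparkSession

// No enableHiveSupport(): the in-memory catalog is used instead of a Hive metastore.
val spark = SparkSession.builder()
  .appName("no-hive-catalog")
  .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")
  .getOrCreate()

The prebuilt spark-shell behaves differently because it enables the Hive catalog by default when the Hive classes are on the classpath.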

Spark SQL issue with columns specified

Posted by 非 Y 不嫁゛ on 2019-12-23 01:45:17
Question: We are trying to replicate an Oracle DB into Hive. We get the queries from Oracle and run them in Hive. So we get them in this format:

INSERT INTO schema.table(col1,col2) VALUES ('val','val');

While this query works in Hive directly, when I use spark.sql I get the following error:

org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'emp_id' expecting {'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 20)

== SQL ==
insert into ss.tab(emp_id
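A hedged workaround sketch: the parse error suggests Spark 2.x does not accept a column list in INSERT INTO ... VALUES, so either drop the list (the values must then cover all columns in table order) or append through the DataFrame API. Column names below are illustrative, and a SparkSession named spark is assumed:

// Works when the values cover every column of ss.tab in order.
spark.sql("INSERT INTO ss.tab VALUES ('val1', 'val2')")

// Or build a DataFrame and append it by position.
import spark.implicits._
Seq(("val1", "val2")).toDF("emp_id", "emp_name")
  .write.insertInto("ss.tab")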

Performance of UDAF versus Aggregator in Spark

Posted by 拈花ヽ惹草 on 2019-12-22 17:10:11
Question: I am trying to write performance-conscious code in Spark and am wondering whether I should write an Aggregator or a user-defined aggregate function (UDAF) for my rollup operations on a DataFrame. I have not been able to find any data anywhere on how fast each of these methods is, or which you should be using for Spark 2.0+.

Source: https://stackoverflow.com/questions/45356452/performance-of-udaf-versus-aggregator-in-spark
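For reference, a minimal Spark 2.x Aggregator sketch (the input type and names are illustrative, not from the question); it is used on a typed Dataset via toColumn:

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical input type for a rollup.
case class Sale(region: String, amount: Double)

object SumAmount extends Aggregator[Sale, Double, Double] {
  def zero: Double = 0.0
  def reduce(buf: Double, s: Sale): Double = buf + s.amount
  def merge(b1: Double, b2: Double): Double = b1 + b2
  def finish(buf: Double): Double = buf
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Usage on a Dataset[Sale]:
// ds.groupByKey(_.region).agg(SumAmount.toColumn).show()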

How to enable Tungsten optimization in Spark 2?

Posted by 孤街醉人 on 2019-12-22 07:03:18
Question: I just built Spark 2 with Hive support and deployed it to a cluster with Hortonworks 2.3.4. However, I find that this Spark 2.0.3 is slower than the standard Spark 1.5.3 that comes with HDP 2.3. When I check explain, it seems that my Spark 2.0.3 is not using Tungsten. Do I need to create a special build to enable Tungsten?

Spark 1.5.3 explain:

== Physical Plan ==
TungstenAggregate(key=[id#2], functions=[], output=[id#2])
 TungstenExchange hashpartitioning(id#2)
  TungstenAggregate(key=[id#2], functions=
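A short sketch (assuming a spark-shell session bound to spark) for checking that whole-stage code generation, the Spark 2.x successor to the Tungsten* operators, is active; it is on by default and needs no special build:

// Should return "true"; whole-stage codegen is enabled out of the box in Spark 2.x.
spark.conf.get("spark.sql.codegen.wholeStage", "true")

// Operators prefixed with '*' (e.g. *HashAggregate) run inside WholeStageCodegen;
// the TungstenAggregate/TungstenExchange names from Spark 1.5 no longer appear.
spark.range(100).groupBy("id").count().explain()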