spark-dataframe

PySpark: Add a new column with a tuple created from columns

送分小仙女 submitted on 2020-01-02 05:28:08
Question: Here I have a dataframe created as follows:

    df = spark.createDataFrame([('a',5,'R','X'),('b',7,'G','S'),('c',8,'G','S')], ["Id","V1","V2","V3"])

It looks like:

    +---+---+---+---+
    | Id| V1| V2| V3|
    +---+---+---+---+
    |  a|  5|  R|  X|
    |  b|  7|  G|  S|
    |  c|  8|  G|  S|
    +---+---+---+---+

I'm looking to add a column that is a tuple consisting of V1, V2 and V3. The result should look like:

    +---+---+---+---+-------+
    | Id| V1| V2| V3|V_tuple|
    +---+---+---+---+-------+
    |  a|  5|  R|  X|(5,R,X)|
    |  b|  7|  G|  S|(7,G,S)|
    |  c|  8|
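
The answer is cut off above. As a point of reference, a common approach is a struct column, since Spark DataFrames have no tuple type; the sketch below is my assumption, not necessarily the thread's answer:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [('a', 5, 'R', 'X'), ('b', 7, 'G', 'S'), ('c', 8, 'G', 'S')],
        ["Id", "V1", "V2", "V3"])

    # Pack V1, V2 and V3 into a single struct column named V_tuple.
    df_with_tuple = df.withColumn("V_tuple", F.struct("V1", "V2", "V3"))
    df_with_tuple.show()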

Workaround for importing spark implicits everywhere

*爱你&永不变心* submitted on 2020-01-02 04:45:08
Question: I'm new to Spark 2.0 and using Datasets in our code base. I notice that I need to import spark.implicits._ everywhere in our code. For example:

File A:

    class A {
      def job(spark: SparkSession) = {
        import spark.implicits._
        // create dataset ds
        val b = new B(spark)
        b.doSomething(ds)
        doSomething(ds)
      }

      private def doSomething(ds: Dataset[Foo], spark: SparkSession) = {
        import spark.implicits._
        ds.map(e => 1)
      }
    }

File B:

    class B(spark: SparkSession) {
      def doSomething(ds: Dataset[Foo]) = {

Converting row values into a column array in spark dataframe

假如想象 submitted on 2020-01-01 19:36:09
Question: I am working with Spark dataframes and I need to group by a column and convert the column values of the grouped rows into an array of elements as a new column. Example:

Input:

    employee | Address
    ------------------
    Micheal  | NY
    Micheal  | NJ

Output:

    employee | Address
    ------------------
    Micheal  | (NY,NJ)

Any help is highly appreciated!

Answer 1: Here is an alternate solution where I have converted the dataframe to an RDD for the transformations and converted it back to a dataFrame using sqlContext
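
For reference, the same grouping can also be done without dropping to RDDs, using collect_list. A minimal PySpark sketch under that assumption (this is not the RDD-based answer quoted above):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("Micheal", "NY"), ("Micheal", "NJ")], ["employee", "Address"])

    # Group by employee and collect the addresses into one array column;
    # the order of elements inside the array is not guaranteed.
    result = df.groupBy("employee").agg(F.collect_list("Address").alias("Address"))
    result.show()  # Micheal -> [NY, NJ]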

How to convert datetime from string format into datetime format in pyspark?

回眸只為那壹抹淺笑 submitted on 2020-01-01 14:42:01
Question: I created a dataframe using sqlContext and I have a problem with the datetime format, as it is identified as a string.

    df2 = sqlContext.createDataFrame(i[1])
    df2.show()
    df2.printSchema()

Result:

    2016-07-05T17:42:55.238544+0900
    2016-07-05T17:17:38.842567+0900
    2016-06-16T19:54:09.546626+0900
    2016-07-05T17:27:29.227750+0900
    2016-07-05T18:44:12.319332+0900

    string (nullable = true)

Since the datetime column is a string, I want to change it to datetime format as follows:

    df3 = df2.withColumn('_1', df2['
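
The rest of the question is cut off above. As a sketch only, recent Spark versions can parse such strings with to_timestamp; the pattern string below is my assumption for this particular format (microseconds plus a +0900 offset) and may need adjusting for your Spark version:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df2 = spark.createDataFrame(
        [("2016-07-05T17:42:55.238544+0900",), ("2016-06-16T19:54:09.546626+0900",)],
        ["_1"])

    # Parse the string column into a proper timestamp column.
    df3 = df2.withColumn("_1", F.to_timestamp("_1", "yyyy-MM-dd'T'HH:mm:ss.SSSSSSZ"))
    df3.printSchema()  # _1: timestamp (nullable = true)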

Change output file name in Spark Streaming

让人想犯罪 __ submitted on 2020-01-01 12:10:49
Question: I am running a Spark job which performs extremely well as far as the logic goes. However, the names of my output files are in the format part-00000, part-00001, etc. when I use saveAsTextFile to save the files to an S3 bucket. Is there a way to change the output filename? Thank you.

Answer 1: In Spark, you can use saveAsNewAPIHadoopFile and set the mapreduce.output.basename parameter in the Hadoop configuration to change the prefix (just the "part" prefix):

    val hadoopConf = new Configuration()
    hadoopConf.set(

How can I build a CoordinateMatrix in Spark using a DataFrame?

给你一囗甜甜゛ submitted on 2020-01-01 11:58:10
Question: I am trying to use the Spark implementation of the ALS algorithm for recommendation systems, so I built the DataFrame depicted below as training data:

    |--------------|--------------|--------------|
    |    userId    |    itemId    |    rating    |
    |--------------|--------------|--------------|

Now, I would like to create a sparse matrix to represent the interactions between every user and every item. The matrix will be sparse because if there is no interaction between a user and an item, the corresponding value
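
The accepted answer isn't visible above. One possible sketch: MLlib's CoordinateMatrix stores only the entries you provide, which gives the sparse representation described in the question. Column names follow the DataFrame above; the sample data and everything else are assumptions:

    from pyspark.sql import SparkSession
    from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

    spark = SparkSession.builder.getOrCreate()
    ratings_df = spark.createDataFrame(
        [(0, 0, 5.0), (0, 1, 3.0), (2, 1, 1.0)], ["userId", "itemId", "rating"])

    # Each row becomes one MatrixEntry(i, j, value); user/item pairs with no
    # interaction are simply absent, so the matrix stays sparse.
    entries = ratings_df.rdd.map(lambda r: MatrixEntry(r.userId, r.itemId, r.rating))
    mat = CoordinateMatrix(entries)
    print(mat.numRows(), mat.numCols())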

How to select all columns that start with a common label

时光毁灭记忆、已成空白 submitted on 2020-01-01 08:21:11
Question: I have a dataframe in Spark 1.6 and want to select just some columns out of it. The column names are like:

    colA, colB, colC, colD, colE, colF-0, colF-1, colF-2

I know I can select specific columns like this:

    df.select("colA", "colB", "colE")

but how do I select, say, "colA", "colB" and all the colF-* columns at once? Is there a way to do this like in Pandas?

Answer 1: First grab the column names with df.columns, then filter down to just the column names you want with .filter(_.startsWith("colF")). This gives
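
The answer above is Scala; the same idea in PySpark looks like the sketch below (the sample schema is assumed from the column names in the question):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, 2, 3, 4, 5, 6, 7, 8)],
        ["colA", "colB", "colC", "colD", "colE", "colF-0", "colF-1", "colF-2"])

    # Keep the fixed columns plus every column whose name starts with "colF".
    wanted = ["colA", "colB"] + [c for c in df.columns if c.startswith("colF")]
    df.select(*wanted).show()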

How to do a mathematical operation with two columns in a dataframe using pyspark

拈花ヽ惹草 submitted on 2020-01-01 05:40:32
Question: I have a dataframe with three columns, "x", "y" and "z":

    x    y      z
    bn   12452  221
    mb   14521  330
    pl   12563  160
    lo   22516  142

I need to create another column which is derived by the formula m = z / (y + z). So the new dataframe should look something like this:

    x    y      z    m
    bn   12452  221  .01743
    mb   14521  330  .02222
    pl   12563  160  .01257
    lo   22516  142  .00626

Answer 1:

    df = sqlContext.createDataFrame([('bn', 12452, 221), ('mb', 14521, 330)], ['x', 'y', 'z'])
    df = df.withColumn('m', df['z'] / (df['y'] + df['z']))
    df.head(2)

Getting NullPointerException using spark-csv with DataFrames

十年热恋 submitted on 2020-01-01 05:24:26
Question: Running through the spark-csv README, there's sample Java code like this:

    import org.apache.spark.sql.SQLContext;
    import org.apache.spark.sql.types.*;

    SQLContext sqlContext = new SQLContext(sc);
    StructType customSchema = new StructType(
        new StructField("year", IntegerType, true),
        new StructField("make", StringType, true),
        new StructField("model", StringType, true),
        new StructField("comment", StringType, true),
        new StructField("blank", StringType, true));

    DataFrame df = sqlContext.read()
        .format