spark-dataframe

PySpark: Add a new column with a tuple created from columns

送分小仙女 submitted on 2020-01-02 05:28:08
Question: Here I have a dataframe created as follows:

    df = spark.createDataFrame([('a',5,'R','X'),('b',7,'G','S'),('c',8,'G','S')], ["Id","V1","V2","V3"])

It looks like:

    +---+---+---+---+
    | Id| V1| V2| V3|
    +---+---+---+---+
    |  a|  5|  R|  X|
    |  b|  7|  G|  S|
    |  c|  8|  G|  S|
    +---+---+---+---+

I'm looking to add a column that is a tuple consisting of V1, V2 and V3. The result should look like:

    +---+---+---+---+-------+
    | Id| V1| V2| V3|V_tuple|
    +---+---+---+---+-------+
    |  a|  5|  R|  X|(5,R,X)|
    |  b|  7|  G|  S|(7,G,S)|
    |  c|  8|
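
The answer is cut off above. As a point of reference, a common approach is a struct column, since Spark DataFrames have no tuple type; the sketch below is my assumption, not necessarily the thread's answer:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [('a', 5, 'R', 'X'), ('b', 7, 'G', 'S'), ('c', 8, 'G', 'S')],
        ["Id", "V1", "V2", "V3"])

    # Pack V1, V2 and V3 into a single struct column named V_tuple.
    df_with_tuple = df.withColumn("V_tuple", F.struct("V1", "V2", "V3"))
    df_with_tuple.show()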

Workaround for importing spark implicits everywhere

*爱你&永不变心* submitted on 2020-01-02 04:45:08
Question: I'm new to Spark 2.0 and using Datasets in our code base. I notice that I need to import spark.implicits._ everywhere in our code. For example:

File A:

    class A {
      def job(spark: SparkSession) = {
        import spark.implicits._
        // create dataset ds
        val b = new B(spark)
        b.doSomething(ds)
        doSomething(ds)
      }

      private def doSomething(ds: Dataset[Foo], spark: SparkSession) = {
        import spark.implicits._
        ds.map(e => 1)
      }
    }

File B:

    class B(spark: SparkSession) {
      def doSomething(ds: Dataset[Foo]) = {

Converting row values into a column array in spark dataframe

假如想象 submitted on 2020-01-01 19:36:09
Question: I am working with Spark dataframes and I need to group by a column and convert the column values of the grouped rows into an array of elements as a new column. Example:

Input:

    employee | Address
    ------------------
    Micheal  | NY
    Micheal  | NJ

Output:

    employee | Address
    ------------------
    Micheal  | (NY,NJ)

Any help is highly appreciated!

Answer 1: Here is an alternate solution where I have converted the dataframe to an RDD for the transformations and converted it back to a dataFrame using sqlContext
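
For reference, the same grouping can also be done without dropping to RDDs, using collect_list. A minimal PySpark sketch under that assumption (this is not the RDD-based answer quoted above):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("Micheal", "NY"), ("Micheal", "NJ")], ["employee", "Address"])

    # Group by employee and collect the addresses into one array column;
    # the order of elements inside the array is not guaranteed.
    result = df.groupBy("employee").agg(F.collect_list("Address").alias("Address"))
    result.show()  # Micheal -> [NY, NJ]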

How to convert datetime from string format into datetime format in pyspark?

回眸只為那壹抹淺笑 submitted on 2020-01-01 14:42:01
Question: I created a dataframe using sqlContext and I have a problem with the datetime format, as it is identified as a string.

    df2 = sqlContext.createDataFrame(i[1])
    df2.show()
    df2.printSchema()

Result:

    2016-07-05T17:42:55.238544+0900
    2016-07-05T17:17:38.842567+0900
    2016-06-16T19:54:09.546626+0900
    2016-07-05T17:27:29.227750+0900
    2016-07-05T18:44:12.319332+0900

    string (nullable = true)

Since the datetime column is a string, I want to change it to datetime format as follows:

    df3 = df2.withColumn('_1', df2['
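
The rest of the question is cut off above. As a sketch only, recent Spark versions can parse such strings with to_timestamp; the pattern string below is my assumption for this particular format (microseconds plus a +0900 offset) and may need adjusting for your Spark version:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df2 = spark.createDataFrame(
        [("2016-07-05T17:42:55.238544+0900",), ("2016-06-16T19:54:09.546626+0900",)],
        ["_1"])

    # Parse the string column into a proper timestamp column.
    df3 = df2.withColumn("_1", F.to_timestamp("_1", "yyyy-MM-dd'T'HH:mm:ss.SSSSSSZ"))
    df3.printSchema()  # _1: timestamp (nullable = true)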

Change output file name in Spark Streaming

让人想犯罪 __ submitted on 2020-01-01 12:10:49
Question: I am running a Spark job which performs extremely well as far as the logic goes. However, the names of my output files are in the format part-00000, part-00001, etc. when I use saveAsTextFile to save the files to an S3 bucket. Is there a way to change the output filename? Thank you.

Answer 1: In Spark, you can use saveAsNewAPIHadoopFile and set the mapreduce.output.basename parameter in the Hadoop configuration to change the prefix (just the "part" prefix):

    val hadoopConf = new Configuration()
    hadoopConf.set(

How can I build a CoordinateMatrix in Spark using a DataFrame?

给你一囗甜甜゛ submitted on 2020-01-01 11:58:10
Question: I am trying to use the Spark implementation of the ALS algorithm for recommendation systems, so I built the DataFrame depicted below as training data:

    |--------------|--------------|--------------|
    |    userId    |    itemId    |    rating    |
    |--------------|--------------|--------------|

Now, I would like to create a sparse matrix to represent the interactions between every user and every item. The matrix will be sparse because if there is no interaction between a user and an item, the corresponding value
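
The accepted answer isn't visible above. One possible sketch: MLlib's CoordinateMatrix stores only the entries you provide, which gives the sparse representation described in the question. Column names follow the DataFrame above; the sample data and everything else are assumptions:

    from pyspark.sql import SparkSession
    from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

    spark = SparkSession.builder.getOrCreate()
    ratings_df = spark.createDataFrame(
        [(0, 0, 5.0), (0, 1, 3.0), (2, 1, 1.0)], ["userId", "itemId", "rating"])

    # Each row becomes one MatrixEntry(i, j, value); user/item pairs with no
    # interaction are simply absent, so the matrix stays sparse.
    entries = ratings_df.rdd.map(lambda r: MatrixEntry(r.userId, r.itemId, r.rating))
    mat = CoordinateMatrix(entries)
    print(mat.numRows(), mat.numCols())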

How to select all columns that start with a common label

时光毁灭记忆、已成空白 submitted on 2020-01-01 08:21:11
Question: I have a dataframe in Spark 1.6 and want to select just some columns out of it. The column names are like:

    colA, colB, colC, colD, colE, colF-0, colF-1, colF-2

I know I can select specific columns like this:

    df.select("colA", "colB", "colE")

but how do I select, say, "colA", "colB" and all the colF-* columns at once? Is there a way to do this like in Pandas?

Answer 1: First grab the column names with df.columns, then filter down to just the column names you want with .filter(_.startsWith("colF")). This gives
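
The answer above is Scala; the same idea in PySpark looks like the sketch below (the sample schema is assumed from the column names in the question):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, 2, 3, 4, 5, 6, 7, 8)],
        ["colA", "colB", "colC", "colD", "colE", "colF-0", "colF-1", "colF-2"])

    # Keep the fixed columns plus every column whose name starts with "colF".
    wanted = ["colA", "colB"] + [c for c in df.columns if c.startswith("colF")]
    df.select(*wanted).show()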

How to do a mathematical operation with two columns in a dataframe using pyspark

拈花ヽ惹草 submitted on 2020-01-01 05:40:32
Question: I have a dataframe with three columns, "x", "y" and "z":

    x    y      z
    bn   12452  221
    mb   14521  330
    pl   12563  160
    lo   22516  142

I need to create another column which is derived by the formula m = z / (y + z). So the new dataframe should look something like this:

    x    y      z    m
    bn   12452  221  .01743
    mb   14521  330  .02222
    pl   12563  160  .01257
    lo   22516  142  .00626

Answer 1:

    df = sqlContext.createDataFrame([('bn', 12452, 221), ('mb', 14521, 330)], ['x', 'y', 'z'])
    df = df.withColumn('m', df['z'] / (df['y'] + df['z']))
    df.head(2)

Getting NullPointerException using spark-csv with DataFrames

十年热恋 submitted on 2020-01-01 05:24:26
Question: Running through the spark-csv README, there's sample Java code like this:

    import org.apache.spark.sql.SQLContext;
    import org.apache.spark.sql.types.*;

    SQLContext sqlContext = new SQLContext(sc);
    StructType customSchema = new StructType(
        new StructField("year", IntegerType, true),
        new StructField("make", StringType, true),
        new StructField("model", StringType, true),
        new StructField("comment", StringType, true),
        new StructField("blank", StringType, true));

    DataFrame df = sqlContext.read()
        .format