pyspark-dataframes

Pyspark: how to add Date + numeric value format

别等时光非礼了梦想. Submitted on 2020-08-11 09:31:12
Question: I have 2 dataframes that look like the following. First, df1:

TEST_schema = StructType([StructField("description", StringType(), True),
                          StructField("date", StringType(), True)])
TEST_data = [('START', 20200622), ('END', 20201018)]
rdd3 = sc.parallelize(TEST_data)
df1 = sqlContext.createDataFrame(TEST_data, TEST_schema)
df1.show()

+-----------+--------+
|description|    date|
+-----------+--------+
|      START|20200701|
|        END|20201003|
+-----------+--------+

And the second, df2: TEST_schema = StructType(
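The question body is truncated above, but based on the title ("add Date + numeric value"), a minimal sketch of parsing a yyyyMMdd date column, adding a number of days, and formatting it back might look like this; the 7-day offset and the new column names are assumptions for illustration, not the asker's requirement.

from pyspark.sql import functions as F

# Sketch: parse the yyyyMMdd 'date' column, add days, format back to yyyyMMdd.
# The 7-day offset and the output column names are assumptions.
df1_shifted = (df1
    .withColumn("parsed_date", F.to_date(F.col("date").cast("string"), "yyyyMMdd"))
    .withColumn("date_plus_7", F.date_add(F.col("parsed_date"), 7))
    .withColumn("date_plus_7_str", F.date_format(F.col("date_plus_7"), "yyyyMMdd")))
df1_shifted.show()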

Pyspark: how to code complicated dataframe calculation lead sum

本秂侑毒 Submitted on 2020-08-09 08:54:07
Question: I have been given a dataframe that looks like this. The dataframe is sorted by date, and col1 is just some random value.

TEST_schema = StructType([StructField("date", StringType(), True),
                          StructField("col1", IntegerType(), True)])
TEST_data = [('2020-08-01', 3), ('2020-08-02', 1), ('2020-08-03', -1), ('2020-08-04', -1), ('2020-08-05', 3),
             ('2020-08-06', -1), ('2020-08-07', 6), ('2020-08-08', 4), ('2020-08-09', 5)]
rdd3 = sc.parallelize(TEST_data)
TEST_df = sqlContext.createDataFrame(TEST_data, TEST_schema)
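The excerpt cuts off before the expected output, so the exact business rule is unknown; as a sketch of the window machinery such a "lead sum" usually needs (a lead of col1 plus a running sum, both ordered by date), one might write the following. The column names next_col1 and running_sum are assumptions.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Sketch only: combines lead() with a windowed sum over the date ordering.
w = Window.orderBy("date")  # single partition; acceptable for this small example
TEST_df2 = (TEST_df
    .withColumn("next_col1", F.lead("col1", 1).over(w))
    .withColumn("running_sum",
                F.sum("col1").over(w.rowsBetween(Window.unboundedPreceding, 0))))
TEST_df2.show()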

Spark: Prevent shuffle/exchange when joining two identically partitioned dataframes

陌路散爱 Submitted on 2020-07-17 05:50:10
Question: I have two dataframes, df1 and df2, and I want to join these tables many times on a high-cardinality field called visitor_id. I would like to perform only one initial shuffle and have all the joins take place without shuffling/exchanging data between Spark executors. To do so, I have created another column called visitor_partition that consistently assigns each visitor_id a random value between [0, 1000). I have used a custom partitioner to ensure that df1 and df2 are exactly partitioned
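The question is cut off, but one commonly suggested way to get repeated shuffle-free joins is to persist both sides as tables bucketed on the join key, so each read is already co-partitioned. A sketch follows; the bucket count, table names, and file format are assumptions, not the asker's setup.

# Sketch: write both sides bucketed on visitor_id so later joins on that key
# can skip the exchange step. Requires bucketing support (enabled by default).
(df1.write.format("parquet")
     .bucketBy(1000, "visitor_id")
     .sortBy("visitor_id")
     .saveAsTable("df1_bucketed"))
(df2.write.format("parquet")
     .bucketBy(1000, "visitor_id")
     .sortBy("visitor_id")
     .saveAsTable("df2_bucketed"))

joined = (spark.table("df1_bucketed")
               .join(spark.table("df2_bucketed"), on="visitor_id"))

Because both bucketed tables are hashed into the same number of buckets on visitor_id, Spark can plan the join without exchanging data between executors.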

pyspark dataframe withColumn command not working

杀马特。学长 韩版系。学妹 Submitted on 2020-07-15 09:22:32
Question: I have an input dataframe, df_input (updated df_input):

|comment|inp_col|inp_val|
|11     |a      |a1     |
|12     |a      |a2     |
|15     |b      |b3     |
|16     |b      |b4     |
|17     |c      |&b     |
|17     |c      |c5     |
|17     |d      |&c     |
|17     |d      |d6     |
|17     |e      |&d     |
|17     |e      |e7     |

I want to replace the variables in the inp_val column with their values. I have tried the code below to create a new column, taking the list of values that start with '&':

df_new = df_inp.select(inp_val).where(df.inp_val.substr(0, 1) == '&')

Now I'm iterating over the list to replace the '
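The question is truncated, but assuming the goal is to expand each '&x' reference into the values stored under inp_col x, a join-based sketch could look like the following; it resolves only one level of references, and any names beyond the question's own columns are assumptions.

from pyspark.sql import functions as F

# Sketch: resolve one level of '&' references by self-joining on the referenced
# inp_col. Chained references (&d -> &c -> &b) would need to repeat this step.
refs = (df_input
        .filter(F.col("inp_val").startswith("&"))
        .withColumn("ref_col", F.expr("substring(inp_val, 2, length(inp_val))")))
resolved = (refs.alias("r")
            .join(df_input.alias("v"), F.col("r.ref_col") == F.col("v.inp_col"))
            .select("r.comment", "r.inp_col", F.col("v.inp_val").alias("inp_val")))
df_new = df_input.filter(~F.col("inp_val").startswith("&")).unionByName(resolved)
df_new.show()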

Pyspark forward and backward fill within column level

蓝咒 Submitted on 2020-07-10 10:28:19
Question: I am trying to fill missing data in a pyspark dataframe. The pyspark dataframe looks like this:

+---------+---------+-------------------+----+
| latitude|longitude|      timestamplast|name|
+---------+---------+-------------------+----+
|         | 4.905615|2019-08-01 00:00:00|   1|
|51.819645|         |2019-08-01 00:00:00|   1|
| 51.81964| 4.961713|2019-08-01 00:00:00|   2|
|         |         |2019-08-01 00:00:00|   3|
| 51.82918| 4.911187|                   |   3|
| 51.82385| 4.901488|2019-08-01 00:00:03|   5|
+---------+---------+-------------------+----+

Within
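The body breaks off at "Within", but assuming the intent is to forward- and then backward-fill latitude/longitude within each name group ordered by timestamplast, and that missing entries are nulls rather than empty strings, a sketch would be:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Sketch: forward fill with last(..., ignorenulls=True) over preceding rows,
# then backward fill with first(..., ignorenulls=True) over following rows.
# Assumes missing values are null and df is the dataframe shown above.
w_fwd = (Window.partitionBy("name").orderBy("timestamplast")
               .rowsBetween(Window.unboundedPreceding, 0))
w_bwd = (Window.partitionBy("name").orderBy("timestamplast")
               .rowsBetween(0, Window.unboundedFollowing))
for c in ["latitude", "longitude"]:
    df = df.withColumn(c, F.last(c, ignorenulls=True).over(w_fwd))
    df = df.withColumn(c, F.first(c, ignorenulls=True).over(w_bwd))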

How can I concatenate the rows in a pyspark dataframe with multiple columns using groupby and aggregate

ぃ、小莉子 Submitted on 2020-07-10 03:11:13
Question: I have a pyspark dataframe with multiple columns, for example the one below.

from pyspark.sql import Row
l = [('Jack', "a", "p"), ('Jack', "b", "q"), ('Bell', "c", "r"), ('Bell', "d", "s")]
rdd = sc.parallelize(l)
score_rdd = rdd.map(lambda x: Row(name=x[0], letters1=x[1], letters2=x[2]))
score_card = sqlContext.createDataFrame(score_rdd)

+----+--------+--------+
|name|letters1|letters2|
+----+--------+--------+
|Jack|       a|       p|
|Jack|       b|       q|
|Bell|       c|       r|
|Bell|       d|       s|
+----+--------+--------+

Now I want to
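The excerpt breaks off at "Now I want to", but given the title, a sketch of concatenating each column's values per name with groupBy/agg would be as follows; the ',' separator and keeping the original column names are assumptions.

from pyspark.sql import functions as F

# Sketch: group by name and concatenate each letters column into one string.
result = (score_card.groupBy("name")
          .agg(F.concat_ws(",", F.collect_list("letters1")).alias("letters1"),
               F.concat_ws(",", F.collect_list("letters2")).alias("letters2")))
result.show()

Note that collect_list does not guarantee element order; an ordered aggregation would need an explicit sort key.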