pyspark-sql

How to apply the describe function after grouping a PySpark DataFrame?

Submitted by 大兔子大兔子 on 2020-08-25 06:57:09
Question: I want to find the cleanest way to apply the describe function to a grouped DataFrame (this question could also be extended to applying any DataFrame function to a grouped DataFrame). I tested a grouped aggregate pandas UDF with no luck. There is always a way of doing it by passing each statistic inside the agg function, but that's not the proper way. If we have a sample dataframe:

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))

The idea would be to do something similar to…
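The snippet is cut off there. For reference, a minimal sketch of one common approach (my own assumption, not the thread's answer): on Spark 3.0+, groupBy().applyInPandas() can run pandas describe() inside each group. The output column names below are my own choice.

import pandas as pd

def describe_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Compute pandas describe() for the "v" column of one group of rows.
    s = pdf["v"].describe()
    return pd.DataFrame([{
        "id": pdf["id"].iloc[0],
        "count": s["count"], "mean": s["mean"], "std": s["std"],
        "min": s["min"], "p25": s["25%"], "p50": s["50%"],
        "p75": s["75%"], "max": s["max"],
    }])

stats = df.groupBy("id").applyInPandas(
    describe_group,
    schema="id long, count double, mean double, std double, min double, "
           "p25 double, p50 double, p75 double, max double")
stats.show()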

Why is agg() in PySpark only able to summarize one column at a time? [duplicate]

Submitted by 断了今生、忘了曾经 on 2020-07-04 13:49:12
Question: This question already has answers here: Multiple Aggregate operations on the same column of a spark dataframe (3 answers). Closed 3 years ago.

For the dataframe below:

df = spark.createDataFrame(data=[('Alice', 4.300), ('Bob', 7.677)], schema=['name', 'High'])

When I try to find min & max, I am only getting the min value in the output:

df.agg({'High': 'max', 'High': 'min'}).show()

+---------+
|min(High)|
+---------+
|  2094900|
+---------+

Why can't agg() give both max & min like in Pandas?

Answer 1: As you…
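The answer above is truncated; the usual explanation is that a Python dict literal cannot hold the key 'High' twice, so the 'max' entry is silently overwritten before Spark ever sees it. A minimal sketch of the standard workaround, passing Column expressions instead of a dict:

from pyspark.sql import functions as F

# Each aggregation is its own Column expression, so both survive.
df.agg(F.min('High').alias('min_High'),
       F.max('High').alias('max_High')).show()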

How to see the dataframe in the console (equivalent of .show() for structured streaming)?

Submitted by 白昼怎懂夜的黑 on 2020-06-17 13:35:08
Question: I'm trying to see what's coming in as my DataFrame. Here is the Spark code:

from pyspark.sql import SparkSession
import pyspark.sql.functions as psf
import logging
import time

spark = SparkSession \
    .builder \
    .appName("Console Example") \
    .getOrCreate()

logging.info("started to listen to the host..")

lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "127.0.0.1") \
    .option("port", 9999) \
    .load()

data = lines.selectExpr("CAST(value AS STRING)")

query1 = data.writeStream.format…
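The snippet is cut off at the writeStream call. For reference, a minimal sketch of the console sink, which is the structured-streaming equivalent of .show(): it prints each micro-batch to stdout.

query1 = data.writeStream \
    .outputMode("append") \
    .format("console") \
    .option("truncate", False) \
    .start()

query1.awaitTermination()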

How to detect when a pattern changes in a pyspark dataframe column

Submitted by 血红的双手。 on 2020-06-11 10:39:08
Question: I have a dataframe like below:

+-------------------+--------+------+
|DateTime           |UID.    |result|
+-------------------+--------+------+
|2020-02-29 11:42:34|0000111D|30    |
|2020-02-30 11:47:34|0000111D|30    |
|2020-02-30 11:48:34|0000111D|30    |
|2020-02-30 11:49:34|0000111D|30    |
|2020-02-30 11:50:34|0000111D|30    |
|2020-02-25 11:50:34|0000111D|29    |
|2020-02-25 11:50:35|0000111D|29    |
|2020-02-26 11:52:35|0000111D|29    |
|2020-02-27 11:52:35|0000111D|29    |
|2020-02-28 11:52:35|0000111D|29    |
…
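The snippet ends at the sample table. A sketch of one common way to detect where the result value changes (my own assumption of the intent, not the thread's accepted answer): compare each row with the previous row per device using lag() over a window ordered by DateTime. I treat the trailing dot in the "UID." header as formatting and use a plain UID column.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Compare each row's result with the previous row's result for the same UID.
w = Window.partitionBy("UID").orderBy("DateTime")

changes = (df
    .withColumn("prev_result", F.lag("result").over(w))
    .withColumn("pattern_changed",
                F.col("prev_result").isNotNull()
                & (F.col("result") != F.col("prev_result"))))

changes.filter("pattern_changed").show(truncate=False)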

pyspark - merge 2 columns of sets

Submitted by 天涯浪子 on 2020-06-11 06:12:12
Question: I have a Spark dataframe that has 2 columns formed by the function collect_set. I would like to combine these 2 columns of sets into 1 column of sets. How should I do so? They are both sets of strings. For instance, I have 2 columns formed from calling collect_set:

Fruits                 | Meat
[Apple, Orange, Pear]  | [Beef, Chicken, Pork]

How do I turn it into:

Food
[Apple, Orange, Pear, Beef, Chicken, Pork]

Thank you very much for your help in advance.

Answer 1: Let's say df has

+--------------------+--------------------…
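The answer above is truncated. For reference, a minimal sketch of one standard approach (assuming the column names from the example): array_union (Spark 2.4+) merges the two arrays and drops duplicates, which preserves the set semantics.

from pyspark.sql import functions as F

merged = df.withColumn("Food", F.array_union(F.col("Fruits"), F.col("Meat")))
merged.select("Food").show(truncate=False)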

PySpark groupby and max value selection

Submitted by 六月ゝ 毕业季﹏ on 2020-06-11 05:33:19
Question: I have a PySpark dataframe like:

name   city    date
satya  Mumbai  13/10/2016
satya  Pune    02/11/2016
satya  Mumbai  22/11/2016
satya  Pune    29/11/2016
satya  Delhi   30/11/2016
panda  Delhi   29/11/2016
brata  BBSR    28/11/2016
brata  Goa     30/10/2016
brata  Goa     30/10/2016

I need to find out the most preferred CITY for each name, and the logic is "take the city as fav_city if that city has the max number of occurrences for the aggregated 'name' + 'city' pair". If multiple cities have the same occurrence count, then consider the city with the latest date. Will…
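The question is cut off there. A sketch of one common approach (my own assumption, not the thread's accepted answer): count the name/city pairs, then keep the top row per name, breaking ties by the latest date. The dd/MM/yyyy date format is taken from the sample data.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

counts = (df
    .withColumn("d", F.to_date("date", "dd/MM/yyyy"))
    .groupBy("name", "city")
    .agg(F.count("*").alias("cnt"), F.max("d").alias("latest")))

# Rank each name's cities by occurrence count, then by most recent date.
w = Window.partitionBy("name").orderBy(F.desc("cnt"), F.desc("latest"))

fav = (counts
    .withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .select("name", F.col("city").alias("fav_city")))

fav.show()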

How to drop multiple column names given in a list from Spark DataFrame?

Submitted by 耗尽温柔 on 2020-05-25 12:15:50
Question: I have a dynamic list which is created based on the value of n:

n = 3
drop_lst = ['a' + str(i) for i in range(n)]
df.drop(drop_lst)

But the above is not working. Note: my use case requires a dynamic list. If I just do the following without a list, it works:

df.drop('a0', 'a1', 'a2')

How do I make the drop function work with a list? Spark 2.2 doesn't seem to have this capability. Is there a way to make it work without using select()?

Answer 1: You can use the * operator to pass the contents of your list as arguments
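A minimal sketch of what that answer describes: unpacking the list with * so drop() receives the column names as separate positional arguments.

# drop('a0', 'a1', 'a2') and drop(*drop_lst) are equivalent calls.
df = df.drop(*drop_lst)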

What's the difference between RDD and Dataframe in Spark? [duplicate]

Submitted by 我是研究僧i on 2020-05-17 06:09:38
Question: This question already has answers here: Difference between DataFrame, Dataset, and RDD in Spark (15 answers). Closed 9 months ago.

Hi, I am relatively new to Apache Spark. I wanted to understand the difference between RDD, DataFrame and Dataset. For example, I am pulling data from an S3 bucket:

df = spark.read.parquet("s3://output/unattributedunattributed*")

In this case, when I am loading data from S3, what would be the RDD? Also, since an RDD is immutable, and I can change the value of df, df couldn't be an RDD…
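For reference, a short sketch of the relationship (general Spark knowledge, not the linked answers): a DataFrame is a higher-level API built on top of an RDD of Row objects, and that underlying RDD is exposed as an attribute. Rebinding the Python name df does not mutate any RDD; it just points the name at a new, equally immutable dataset.

rdd = df.rdd            # the underlying RDD[Row] behind the DataFrame
print(rdd.take(1))      # e.g. [Row(col1=..., col2=...)], depending on the parquet schema

df = df.limit(10)       # creates a new DataFrame; the previous one is unchanged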

Pyspark - Calculate RMSE between actuals and predictions for a groupby - AssertionError: all exprs should be Column

Submitted by 心不动则不痛 on 2020-05-09 07:10:28
Question: I have a function that calculates RMSE for the preds and actuals of an entire dataframe:

def calculate_rmse(df, actual_column, prediction_column):
    RMSE = F.udf(lambda x, y: ((x - y) ** 2))
    df = df.withColumn(
        "RMSE", RMSE(F.col(actual_column), F.col(prediction_column))
    )
    rmse = df.select(F.avg("RMSE") ** 0.5).collect()
    rmse = rmse[0]["POWER(avg(RMSE), 0.5)"]
    return rmse

test = calculate_rmse(my_df, 'actuals', 'preds')
3690.4535

I would like to apply this to a groupby statement, but when I do,…
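The question is cut off at the groupby step; the AssertionError in the title typically means groupBy().agg() was handed something that is not a Column expression. A sketch of one way around it (my own assumption, not the thread's answer), building the RMSE from built-in functions so it can be aggregated per group; "group_col", "actuals" and "preds" are assumed column names.

from pyspark.sql import functions as F

# Squared error as a Column expression, so it can go inside agg().
squared_error = (F.col("actuals") - F.col("preds")) ** 2

rmse_per_group = (my_df
    .groupBy("group_col")
    .agg(F.sqrt(F.avg(squared_error)).alias("rmse")))

rmse_per_group.show()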