pyspark-sql

How to apply the describe function after grouping a PySpark DataFrame?

Submitted by 大兔子大兔子 on 2020-08-25 06:57:09
Question: I want to find the cleanest way to apply the describe function to a grouped DataFrame (this question could also be extended to applying any DataFrame function to a grouped DataFrame). I tested a grouped aggregate pandas UDF with no luck. There is always a way of doing it by passing each statistic inside the agg function, but that's not the proper way. If we have a sample dataframe:

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))

The idea would be to do something similar to…
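The snippet is cut off there. For reference, a minimal sketch of one common approach (my own assumption, not the thread's answer): on Spark 3.0+, groupBy().applyInPandas() can run pandas describe() inside each group. The output column names below are my own choice.

import pandas as pd

def describe_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Compute pandas describe() for the "v" column of one group of rows.
    s = pdf["v"].describe()
    return pd.DataFrame([{
        "id": pdf["id"].iloc[0],
        "count": s["count"], "mean": s["mean"], "std": s["std"],
        "min": s["min"], "p25": s["25%"], "p50": s["50%"],
        "p75": s["75%"], "max": s["max"],
    }])

stats = df.groupBy("id").applyInPandas(
    describe_group,
    schema="id long, count double, mean double, std double, min double, "
           "p25 double, p50 double, p75 double, max double")
stats.show()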

Why is agg() in PySpark only able to summarize one column at a time? [duplicate]

Submitted by 断了今生、忘了曾经 on 2020-07-04 13:49:12
Question: This question already has answers here: Multiple Aggregate operations on the same column of a spark dataframe (3 answers). Closed 3 years ago.

For the dataframe below:

df = spark.createDataFrame(data=[('Alice', 4.300), ('Bob', 7.677)], schema=['name', 'High'])

When I try to find min & max, I am only getting the min value in the output:

df.agg({'High': 'max', 'High': 'min'}).show()

+---------+
|min(High)|
+---------+
|  2094900|
+---------+

Why can't agg() give both max & min like in Pandas?

Answer 1: As you…
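The answer above is truncated; the usual explanation is that a Python dict literal cannot hold the key 'High' twice, so the 'max' entry is silently overwritten before Spark ever sees it. A minimal sketch of the standard workaround, passing Column expressions instead of a dict:

from pyspark.sql import functions as F

# Each aggregation is its own Column expression, so both survive.
df.agg(F.min('High').alias('min_High'),
       F.max('High').alias('max_High')).show()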

How to see the dataframe in the console (equivalent of .show() for structured streaming)?

Submitted by 白昼怎懂夜的黑 on 2020-06-17 13:35:08
Question: I'm trying to see what's coming in as my DataFrame. Here is the Spark code:

from pyspark.sql import SparkSession
import pyspark.sql.functions as psf
import logging
import time

spark = SparkSession \
    .builder \
    .appName("Console Example") \
    .getOrCreate()

logging.info("started to listen to the host..")

lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "127.0.0.1") \
    .option("port", 9999) \
    .load()

data = lines.selectExpr("CAST(value AS STRING)")

query1 = data.writeStream.format…
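The snippet is cut off at the writeStream call. For reference, a minimal sketch of the console sink, which is the structured-streaming equivalent of .show(): it prints each micro-batch to stdout.

query1 = data.writeStream \
    .outputMode("append") \
    .format("console") \
    .option("truncate", False) \
    .start()

query1.awaitTermination()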

How to detect when a pattern changes in a pyspark dataframe column

Submitted by 血红的双手。 on 2020-06-11 10:39:08
Question: I have a dataframe like below:

+-------------------+--------+------+
|DateTime           |UID.    |result|
+-------------------+--------+------+
|2020-02-29 11:42:34|0000111D|30    |
|2020-02-30 11:47:34|0000111D|30    |
|2020-02-30 11:48:34|0000111D|30    |
|2020-02-30 11:49:34|0000111D|30    |
|2020-02-30 11:50:34|0000111D|30    |
|2020-02-25 11:50:34|0000111D|29    |
|2020-02-25 11:50:35|0000111D|29    |
|2020-02-26 11:52:35|0000111D|29    |
|2020-02-27 11:52:35|0000111D|29    |
|2020-02-28 11:52:35|0000111D|29    |
…
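The snippet ends at the sample table. A sketch of one common way to detect where the result value changes (my own assumption of the intent, not the thread's accepted answer): compare each row with the previous row per device using lag() over a window ordered by DateTime. I treat the trailing dot in the "UID." header as formatting and use a plain UID column.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Compare each row's result with the previous row's result for the same UID.
w = Window.partitionBy("UID").orderBy("DateTime")

changes = (df
    .withColumn("prev_result", F.lag("result").over(w))
    .withColumn("pattern_changed",
                F.col("prev_result").isNotNull()
                & (F.col("result") != F.col("prev_result"))))

changes.filter("pattern_changed").show(truncate=False)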

pyspark - merge 2 columns of sets

Submitted by 天涯浪子 on 2020-06-11 06:12:12
Question: I have a Spark dataframe that has 2 columns formed by the function collect_set. I would like to combine these 2 columns of sets into 1 column of sets. How should I do so? They are both sets of strings. For instance, I have 2 columns formed from calling collect_set:

Fruits                 | Meat
[Apple, Orange, Pear]  | [Beef, Chicken, Pork]

How do I turn it into:

Food
[Apple, Orange, Pear, Beef, Chicken, Pork]

Thank you very much for your help in advance.

Answer 1: Let's say df has

+--------------------+--------------------…
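The answer above is truncated. For reference, a minimal sketch of one standard approach (assuming the column names from the example): array_union (Spark 2.4+) merges the two arrays and drops duplicates, which preserves the set semantics.

from pyspark.sql import functions as F

merged = df.withColumn("Food", F.array_union(F.col("Fruits"), F.col("Meat")))
merged.select("Food").show(truncate=False)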

PySpark groupby and max value selection

Submitted by 六月ゝ 毕业季﹏ on 2020-06-11 05:33:19
Question: I have a PySpark dataframe like:

name   city    date
satya  Mumbai  13/10/2016
satya  Pune    02/11/2016
satya  Mumbai  22/11/2016
satya  Pune    29/11/2016
satya  Delhi   30/11/2016
panda  Delhi   29/11/2016
brata  BBSR    28/11/2016
brata  Goa     30/10/2016
brata  Goa     30/10/2016

I need to find out the most preferred CITY for each name, and the logic is "take the city as fav_city if that city has the max number of occurrences for the aggregated 'name' + 'city' pair". If multiple cities have the same occurrence count, then consider the city with the latest date. Will…
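The question is cut off there. A sketch of one common approach (my own assumption, not the thread's accepted answer): count the name/city pairs, then keep the top row per name, breaking ties by the latest date. The dd/MM/yyyy date format is taken from the sample data.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

counts = (df
    .withColumn("d", F.to_date("date", "dd/MM/yyyy"))
    .groupBy("name", "city")
    .agg(F.count("*").alias("cnt"), F.max("d").alias("latest")))

# Rank each name's cities by occurrence count, then by most recent date.
w = Window.partitionBy("name").orderBy(F.desc("cnt"), F.desc("latest"))

fav = (counts
    .withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .select("name", F.col("city").alias("fav_city")))

fav.show()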

How to drop multiple column names given in a list from Spark DataFrame?

Submitted by 耗尽温柔 on 2020-05-25 12:15:50
Question: I have a dynamic list which is created based on the value of n:

n = 3
drop_lst = ['a' + str(i) for i in range(n)]
df.drop(drop_lst)

But the above is not working. Note: my use case requires a dynamic list. If I just do the following without a list, it works:

df.drop('a0', 'a1', 'a2')

How do I make the drop function work with a list? Spark 2.2 doesn't seem to have this capability. Is there a way to make it work without using select()?

Answer 1: You can use the * operator to pass the contents of your list as arguments
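A minimal sketch of what that answer describes: unpacking the list with * so drop() receives the column names as separate positional arguments.

# drop('a0', 'a1', 'a2') and drop(*drop_lst) are equivalent calls.
df = df.drop(*drop_lst)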

What's the difference between RDD and Dataframe in Spark? [duplicate]

Submitted by 我是研究僧i on 2020-05-17 06:09:38
Question: This question already has answers here: Difference between DataFrame, Dataset, and RDD in Spark (15 answers). Closed 9 months ago.

Hi, I am relatively new to Apache Spark. I wanted to understand the difference between RDD, DataFrame and Dataset. For example, I am pulling data from an S3 bucket:

df = spark.read.parquet("s3://output/unattributedunattributed*")

In this case, when I am loading data from S3, what would be the RDD? Also, since an RDD is immutable, and I can change the value of df, df couldn't be an RDD…
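For reference, a short sketch of the relationship (general Spark knowledge, not the linked answers): a DataFrame is a higher-level API built on top of an RDD of Row objects, and that underlying RDD is exposed as an attribute. Rebinding the Python name df does not mutate any RDD; it just points the name at a new, equally immutable dataset.

rdd = df.rdd            # the underlying RDD[Row] behind the DataFrame
print(rdd.take(1))      # e.g. [Row(col1=..., col2=...)], depending on the parquet schema

df = df.limit(10)       # creates a new DataFrame; the previous one is unchanged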

Pyspark - Calculate RMSE between actuals and predictions for a groupby - AssertionError: all exprs should be Column

Submitted by 心不动则不痛 on 2020-05-09 07:10:28
Question: I have a function that calculates RMSE for the preds and actuals of an entire dataframe:

def calculate_rmse(df, actual_column, prediction_column):
    RMSE = F.udf(lambda x, y: ((x - y) ** 2))
    df = df.withColumn(
        "RMSE", RMSE(F.col(actual_column), F.col(prediction_column))
    )
    rmse = df.select(F.avg("RMSE") ** 0.5).collect()
    rmse = rmse[0]["POWER(avg(RMSE), 0.5)"]
    return rmse

test = calculate_rmse(my_df, 'actuals', 'preds')
3690.4535

I would like to apply this to a groupby statement, but when I do,…
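The question is cut off at the groupby step; the AssertionError in the title typically means groupBy().agg() was handed something that is not a Column expression. A sketch of one way around it (my own assumption, not the thread's answer), building the RMSE from built-in functions so it can be aggregated per group; "group_col", "actuals" and "preds" are assumed column names.

from pyspark.sql import functions as F

# Squared error as a Column expression, so it can go inside agg().
squared_error = (F.col("actuals") - F.col("preds")) ** 2

rmse_per_group = (my_df
    .groupBy("group_col")
    .agg(F.sqrt(F.avg(squared_error)).alias("rmse")))

rmse_per_group.show()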