pyspark-dataframes

pyspark-strange behavior of count function inside agg

ぃ、小莉子 submitted on 2021-01-29 02:52:43
Problem: I am using Spark 2.4.0 and I am observing strange behavior when using the count function to aggregate.

from pyspark.sql import functions as F
tst = sqlContext.createDataFrame([(1,2),(1,5),(2,None),(2,3),(3,None),(3,None)], schema=['col1','col2'])
tst.show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|   1|   5|
|   2|null|
|   2|   3|
|   3|null|
|   3|null|
+----+----+
tst.groupby('col1').agg(F.count('col2')).show()
+----+-----------+
|col1|count(col2)|
+----+-----------+
|   1|          2|
|   3|          0|
|   2|          1|
+----+----------
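
What the excerpt shows is consistent with count(col) ignoring nulls: only the non-null values of col2 are counted in each group, which is why group 3 (all nulls) reports 0. A minimal sketch of the usual workaround, assuming a SparkSession named spark is available, is to count a literal when every row in the group should be counted:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
tst = spark.createDataFrame([(1, 2), (1, 5), (2, None), (2, 3), (3, None), (3, None)],
                            schema=['col1', 'col2'])

# F.count('col2') counts only non-null values; F.count(F.lit(1)) counts every row in the group.
tst.groupby('col1').agg(
    F.count('col2').alias('non_null_col2'),
    F.count(F.lit(1)).alias('all_rows')
).show()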

Adding a List element as a column to existing pyspark dataframe

拜拜、爱过 submitted on 2021-01-28 06:02:19
Problem: I have a list lists=[0,1,2,3,5,6,7]. The order is not sequential. I have a PySpark dataframe with 9 columns.

+-------------------+--------+--------+--------+--------+--------+--------+---------------+-----+----+
|               date|ftt (°c)|rtt (°c)|fbt (°c)|rbt (°c)|fmt (°c)|rmt (°c)|fmhhumidityunit|index|Diff|
+-------------------+--------+--------+--------+--------+--------+--------+---------------+-----+----+
|2019-02-01 05:29:47|     NaN|     NaN|     NaN|     NaN|     NaN|     NaN|            NaN|    0| NaN|
|2019-02-01 05:29:17|     NaN|     NaN|
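
One common pattern for attaching a Python list as a new column is to give both the dataframe and the list a positional join key and join on it. This is a minimal sketch under stated assumptions: the real dataframe is replaced by a tiny stand-in, and row_idx is a hypothetical helper column built from the date ordering.

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
lists = [0, 1, 2, 3, 5, 6, 7]

# Stand-in for the real 9-column dataframe.
df = spark.createDataFrame([("2019-02-01 05:29:47",), ("2019-02-01 05:29:17",)], ["date"])

# Give each dataframe row a positional key (a global window like this pulls the data
# into one partition, so it is only reasonable for modest row counts).
w = Window.orderBy("date")
df_idx = df.withColumn("row_idx", F.row_number().over(w) - 1)

# Turn the list into a (row_idx, value) dataframe and join it back on position.
list_df = spark.createDataFrame(list(enumerate(lists)), ["row_idx", "new_col"])
result = df_idx.join(list_df, on="row_idx", how="left").drop("row_idx")
result.show()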

cannot resolve column due to data type mismatch PySpark

妖精的绣舞 submitted on 2021-01-28 05:11:16
Problem: Error being faced in PySpark:

pyspark.sql.utils.AnalysisException: "cannot resolve '`result_set`.`dates`.`trackers`['token']' due to data type mismatch: argument 2 requires integral type, however, ''token'' is of string type.;;\n'Project [result_parameters#517, result_set#518, <lambda>(result_set#518.dates.trackers[token]) AS result_set.dates.trackers.token#705]\n+- Relation[result_parameters#517,result_set#518] json\n"

Data structure:

 |-- result_set: struct (nullable = true)
 |    |-- currency:
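
The error says the ['token'] lookup is being treated as an array index, which must be an integer, so trackers is presumably an array of structs rather than a map or struct. A minimal sketch of one way around this, assuming (hypothetically) that dates and trackers are both arrays of structs, is to explode the arrays and then select the struct field:

from pyspark.sql import Row, SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the real JSON: result_set.dates is an array of structs
# and each entry carries an array of tracker structs with a token field.
df = spark.createDataFrame([
    Row(result_parameters="p1",
        result_set=Row(currency="USD",
                       dates=[Row(trackers=[Row(token="abc"), Row(token="def")])]))
])

# Indexing an array of structs with a string key fails ([] on an array needs a position);
# exploding the arrays and then selecting the field does not.
tokens = (df
          .select(F.explode("result_set.dates").alias("date_entry"))
          .select(F.explode("date_entry.trackers").alias("tracker"))
          .select(F.col("tracker.token").alias("token")))
tokens.show()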

Premature end of Content-Length delimited message body SparkException while reading from S3 using Pyspark

我的梦境 submitted on 2021-01-28 01:42:06
Problem: I am using the code below to read an S3 CSV file from my local machine.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
import configparser
import os

conf = SparkConf()
conf.set('spark.jars', '/usr/local/spark/jars/aws-java-sdk-1.7.4.jar,/usr/local/spark/jars/hadoop-aws-2.7.4.jar')
# Tried by setting this, but failed
conf.set('spark.executor.memory', '8g')
conf.set('spark.driver.memory', '8g')
spark_session = SparkSession.builder \
    .config(conf=conf) \
    .appName(
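
For reference, this is a minimal sketch of a typical s3a read from a local machine, with placeholder credentials, bucket and path, and a commonly suggested retry setting; it assumes the hadoop-aws jar and a matching aws-java-sdk jar are on the classpath, as in the excerpt.

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set('spark.jars', '/usr/local/spark/jars/aws-java-sdk-1.7.4.jar,'
                       '/usr/local/spark/jars/hadoop-aws-2.7.4.jar')
conf.set('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
conf.set('spark.hadoop.fs.s3a.access.key', 'YOUR_ACCESS_KEY')  # placeholder
conf.set('spark.hadoop.fs.s3a.secret.key', 'YOUR_SECRET_KEY')  # placeholder
# Extra retries are a commonly suggested mitigation when a dropped connection surfaces
# as "Premature end of Content-Length delimited message body".
conf.set('spark.hadoop.fs.s3a.attempts.maximum', '10')

spark = SparkSession.builder.config(conf=conf).appName('s3-read').getOrCreate()
df = spark.read.csv('s3a://your-bucket/path/to/file.csv', header=True)  # placeholder path
df.show(5)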

PySpark first and last function over a partition in one go

北战南征 submitted on 2021-01-27 19:54:45
Problem: I have PySpark code like this:

spark_df = spark_df.orderBy('id', 'a1', 'c1')
out_df = spark_df.groupBy('id', 'a1', 'a2').agg(
    F.first('c1').alias('c1'),
    F.last('c2').alias('c2'),
    F.first('c3').alias('c3'))

I need to keep the data ordered by id, a1 and c1, and then select the columns as shown above over the group defined over the keys id, a1 and c1. Because of the non-determinism of first and last, I changed the code to this ugly-looking code, which works, but I'm not sure it is efficient.

w_first =
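
The usual way to make first and last deterministic is to evaluate them over an explicitly ordered window instead of relying on the row order seen by groupBy. A minimal sketch of that pattern, using small hypothetical data with the column names from the excerpt:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame(
    [(1, 'x', 'y', 3, 30, 300), (1, 'x', 'y', 1, 10, 100), (1, 'x', 'y', 2, 20, 200)],
    ['id', 'a1', 'a2', 'c1', 'c2', 'c3'])

# A window that covers the whole group and carries an explicit ordering on c1.
w = (Window.partitionBy('id', 'a1', 'a2')
           .orderBy('c1')
           .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

out_df = (spark_df
          .withColumn('c1_first', F.first('c1').over(w))
          .withColumn('c2_last', F.last('c2').over(w))
          .withColumn('c3_first', F.first('c3').over(w))
          .select('id', 'a1', 'a2', 'c1_first', 'c2_last', 'c3_first')
          .dropDuplicates())
out_df.show()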

How to calculate difference between dates excluding weekends in Pyspark 2.2.0

|▌冷眼眸甩不掉的悲伤 submitted on 2021-01-07 06:50:49
Problem: I have the below PySpark df, which can be recreated by the code:

df = spark.createDataFrame([(1, "John Doe", "2020-11-30"), (2, "John Doe", "2020-11-27"), (3, "John Doe", "2020-11-29")], ("id", "name", "date"))
+---+--------+----------+
| id|    name|      date|
+---+--------+----------+
|  1|John Doe|2020-11-30|
|  2|John Doe|2020-11-27|
|  3|John Doe|2020-11-29|
+---+--------+----------+

I am looking to create a udf to calculate the difference between 2 rows of dates (using the lag function), excluding weekends, as
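
One way to do this is to lag the previous date inside a window and hand both dates to a Python UDF that counts only weekdays. This is a minimal sketch, assuming the date column holds 'yyyy-MM-dd' strings as in the excerpt and counting weekdays strictly after the earlier date up to and including the later one:

from datetime import datetime, timedelta
from pyspark.sql import SparkSession, Window, functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "John Doe", "2020-11-30"), (2, "John Doe", "2020-11-27"), (3, "John Doe", "2020-11-29")],
    ("id", "name", "date"))

def weekday_diff(start, end):
    if start is None or end is None:
        return None
    d1 = datetime.strptime(start, "%Y-%m-%d").date()
    d2 = datetime.strptime(end, "%Y-%m-%d").date()
    if d1 > d2:
        d1, d2 = d2, d1
    days = 0
    cur = d1 + timedelta(days=1)
    while cur <= d2:
        if cur.weekday() < 5:  # Monday=0 .. Friday=4
            days += 1
        cur += timedelta(days=1)
    return days

weekday_diff_udf = F.udf(weekday_diff, IntegerType())

w = Window.partitionBy("name").orderBy("date")
result = (df
          .withColumn("prev_date", F.lag("date").over(w))
          .withColumn("weekday_diff", weekday_diff_udf(F.col("prev_date"), F.col("date"))))
result.show()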

How to calculate daily basis in pyspark dataframe (time series)

核能气质少年 submitted on 2021-01-01 06:27:25
Problem: So I have a dataframe and I want to calculate some quantity, let's say on a daily basis. Let's say we have 10 columns col1, col2, col3, col4, ..., coln, where each column depends on the values of col1, col2, col3, col4 and so on, and the date resets based on the id.

+----------+----+---+----+ ... +----+
|      date|col1| id|col2| ... |coln|
+----------+----+---+----+ ... +----+
|2020-08-01|   0| M1| ... |   3|
|2020-08-02|   4| M1| ... |  10|
|2020-08-03|   3| M1| ... |   9|
|2020-08-04|   2| M1| ... |   8|
|2020-08-05|   1| M1| ... |   7|
|2020-08
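
The description points at a per-id, date-ordered running calculation. As a minimal sketch of that general pattern only (the real formula is not given in the excerpt), a window partitioned by id and ordered by date gives each row access to its history, and the calculation "resets" at each new id because the partition changes; the running sum and previous-day value below are stand-ins for the actual quantity.

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("2020-08-01", 0, "M1", 3), ("2020-08-02", 4, "M1", 10),
     ("2020-08-03", 3, "M1", 9), ("2020-08-04", 2, "M1", 8)],
    ["date", "col1", "id", "col2"])

# Per-id, date-ordered window: history is only visible within the same id.
w = Window.partitionBy("id").orderBy("date")

result = (df
          .withColumn("col1_running_sum", F.sum("col1").over(w))
          .withColumn("col2_prev_day", F.lag("col2").over(w)))
result.show()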

How to load big double numbers in a PySpark DataFrame and persist it back without changing the numeric format to scientific notation or precision?

五迷三道 submitted on 2020-12-15 07:18:10
Problem: I have a CSV like this:

COL,VAL
TEST,100000000.12345679
TEST2,200000000.1234
TEST3,9999.1234679123

I want to load it with the column VAL as a numeric type (due to other requirements of the project) and then persist it back to another CSV with the structure below:

+-----+------------------+
|  COL|               VAL|
+-----+------------------+
| TEST|100000000.12345679|
|TEST2|    200000000.1234|
|TEST3|   9999.1234679123|
+-----+------------------+

The problem I'm facing is that whenever I load it, the numbers
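
A common way to avoid the scientific notation and precision loss of doubles is to read VAL as a DecimalType with enough precision and scale; this is a minimal sketch with hypothetical input and output paths. Note that a fixed scale gives shorter values trailing zeros when written back, so a final formatting step may still be needed depending on the exact output requirement.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

spark = SparkSession.builder.getOrCreate()

# DecimalType(precision, scale): 20 digits in total, 10 after the decimal point,
# enough to hold 100000000.12345679 and 9999.1234679123 exactly.
schema = StructType([
    StructField("COL", StringType(), True),
    StructField("VAL", DecimalType(20, 10), True),
])

df = spark.read.csv("input.csv", header=True, schema=schema)   # hypothetical path
df.show(truncate=False)
df.write.csv("output_dir", header=True, mode="overwrite")      # hypothetical output directory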