pyspark-dataframes

pyspark-strange behavior of count function inside agg

ぃ、小莉子 submitted on 2021-01-29 02:52:43
Problem: I am using Spark 2.4.0 and I am observing strange behavior when using the count function to aggregate.

from pyspark.sql import functions as F
tst = sqlContext.createDataFrame([(1,2),(1,5),(2,None),(2,3),(3,None),(3,None)], schema=['col1','col2'])
tst.show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|   1|   5|
|   2|null|
|   2|   3|
|   3|null|
|   3|null|
+----+----+
tst.groupby('col1').agg(F.count('col2')).show()
+----+-----------+
|col1|count(col2)|
+----+-----------+
|   1|          2|
|   3|          0|
|   2|          1|
+----+----------
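
What the excerpt shows is consistent with count(col) ignoring nulls: only the non-null values of col2 are counted in each group, which is why group 3 (all nulls) reports 0. A minimal sketch of the usual workaround, assuming a SparkSession named spark is available, is to count a literal when every row in the group should be counted:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
tst = spark.createDataFrame([(1, 2), (1, 5), (2, None), (2, 3), (3, None), (3, None)],
                            schema=['col1', 'col2'])

# F.count('col2') counts only non-null values; F.count(F.lit(1)) counts every row in the group.
tst.groupby('col1').agg(
    F.count('col2').alias('non_null_col2'),
    F.count(F.lit(1)).alias('all_rows')
).show()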

Adding a List element as a column to existing pyspark dataframe

拜拜、爱过 submitted on 2021-01-28 06:02:19
Problem: I have a list lists=[0,1,2,3,5,6,7]. The order is not sequential. I have a PySpark dataframe with 9 columns.

+-------------------+--------+--------+--------+--------+--------+--------+---------------+-----+----+
|               date|ftt (°c)|rtt (°c)|fbt (°c)|rbt (°c)|fmt (°c)|rmt (°c)|fmhhumidityunit|index|Diff|
+-------------------+--------+--------+--------+--------+--------+--------+---------------+-----+----+
|2019-02-01 05:29:47|     NaN|     NaN|     NaN|     NaN|     NaN|     NaN|            NaN|    0| NaN|
|2019-02-01 05:29:17|     NaN|     NaN|
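
One common pattern for attaching a Python list as a new column is to give both the dataframe and the list a positional join key and join on it. This is a minimal sketch under stated assumptions: the real dataframe is replaced by a tiny stand-in, and row_idx is a hypothetical helper column built from the date ordering.

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
lists = [0, 1, 2, 3, 5, 6, 7]

# Stand-in for the real 9-column dataframe.
df = spark.createDataFrame([("2019-02-01 05:29:47",), ("2019-02-01 05:29:17",)], ["date"])

# Give each dataframe row a positional key (a global window like this pulls the data
# into one partition, so it is only reasonable for modest row counts).
w = Window.orderBy("date")
df_idx = df.withColumn("row_idx", F.row_number().over(w) - 1)

# Turn the list into a (row_idx, value) dataframe and join it back on position.
list_df = spark.createDataFrame(list(enumerate(lists)), ["row_idx", "new_col"])
result = df_idx.join(list_df, on="row_idx", how="left").drop("row_idx")
result.show()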

cannot resolve column due to data type mismatch PySpark

妖精的绣舞 submitted on 2021-01-28 05:11:16
Problem: Error being faced in PySpark:

pyspark.sql.utils.AnalysisException: "cannot resolve '`result_set`.`dates`.`trackers`['token']' due to data type mismatch: argument 2 requires integral type, however, ''token'' is of string type.;;\n'Project [result_parameters#517, result_set#518, <lambda>(result_set#518.dates.trackers[token]) AS result_set.dates.trackers.token#705]\n+- Relation[result_parameters#517,result_set#518] json\n"

Data structure:

 |-- result_set: struct (nullable = true)
 |    |-- currency:
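
The error says the ['token'] lookup is being treated as an array index, which must be an integer, so trackers is presumably an array of structs rather than a map or struct. A minimal sketch of one way around this, assuming (hypothetically) that dates and trackers are both arrays of structs, is to explode the arrays and then select the struct field:

from pyspark.sql import Row, SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the real JSON: result_set.dates is an array of structs
# and each entry carries an array of tracker structs with a token field.
df = spark.createDataFrame([
    Row(result_parameters="p1",
        result_set=Row(currency="USD",
                       dates=[Row(trackers=[Row(token="abc"), Row(token="def")])]))
])

# Indexing an array of structs with a string key fails ([] on an array needs a position);
# exploding the arrays and then selecting the field does not.
tokens = (df
          .select(F.explode("result_set.dates").alias("date_entry"))
          .select(F.explode("date_entry.trackers").alias("tracker"))
          .select(F.col("tracker.token").alias("token")))
tokens.show()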

Premature end of Content-Length delimited message body SparkException while reading from S3 using Pyspark

我的梦境 submitted on 2021-01-28 01:42:06
Problem: I am using the code below to read an S3 CSV file from my local machine.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
import configparser
import os

conf = SparkConf()
conf.set('spark.jars', '/usr/local/spark/jars/aws-java-sdk-1.7.4.jar,/usr/local/spark/jars/hadoop-aws-2.7.4.jar')
# Tried by setting this, but failed
conf.set('spark.executor.memory', '8g')
conf.set('spark.driver.memory', '8g')
spark_session = SparkSession.builder \
    .config(conf=conf) \
    .appName(
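
For reference, this is a minimal sketch of a typical s3a read from a local machine, with placeholder credentials, bucket and path, and a commonly suggested retry setting; it assumes the hadoop-aws jar and a matching aws-java-sdk jar are on the classpath, as in the excerpt.

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set('spark.jars', '/usr/local/spark/jars/aws-java-sdk-1.7.4.jar,'
                       '/usr/local/spark/jars/hadoop-aws-2.7.4.jar')
conf.set('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
conf.set('spark.hadoop.fs.s3a.access.key', 'YOUR_ACCESS_KEY')  # placeholder
conf.set('spark.hadoop.fs.s3a.secret.key', 'YOUR_SECRET_KEY')  # placeholder
# Extra retries are a commonly suggested mitigation when a dropped connection surfaces
# as "Premature end of Content-Length delimited message body".
conf.set('spark.hadoop.fs.s3a.attempts.maximum', '10')

spark = SparkSession.builder.config(conf=conf).appName('s3-read').getOrCreate()
df = spark.read.csv('s3a://your-bucket/path/to/file.csv', header=True)  # placeholder path
df.show(5)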

PySpark first and last function over a partition in one go

北战南征 submitted on 2021-01-27 19:54:45
Problem: I have PySpark code like this:

spark_df = spark_df.orderBy('id', 'a1', 'c1')
out_df = spark_df.groupBy('id', 'a1', 'a2').agg(
    F.first('c1').alias('c1'),
    F.last('c2').alias('c2'),
    F.first('c3').alias('c3'))

I need to keep the data ordered by id, a1 and c1, and then select the columns as shown above over the group defined over the keys id, a1 and c1. Because of the non-determinism of first and last, I changed the code to this ugly-looking code, which works, but I'm not sure it is efficient.

w_first =
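
The usual way to make first and last deterministic is to evaluate them over an explicitly ordered window instead of relying on the row order seen by groupBy. A minimal sketch of that pattern, using small hypothetical data with the column names from the excerpt:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame(
    [(1, 'x', 'y', 3, 30, 300), (1, 'x', 'y', 1, 10, 100), (1, 'x', 'y', 2, 20, 200)],
    ['id', 'a1', 'a2', 'c1', 'c2', 'c3'])

# A window that covers the whole group and carries an explicit ordering on c1.
w = (Window.partitionBy('id', 'a1', 'a2')
           .orderBy('c1')
           .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

out_df = (spark_df
          .withColumn('c1_first', F.first('c1').over(w))
          .withColumn('c2_last', F.last('c2').over(w))
          .withColumn('c3_first', F.first('c3').over(w))
          .select('id', 'a1', 'a2', 'c1_first', 'c2_last', 'c3_first')
          .dropDuplicates())
out_df.show()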

How to calculate difference between dates excluding weekends in Pyspark 2.2.0

|▌冷眼眸甩不掉的悲伤 submitted on 2021-01-07 06:50:49
Problem: I have the below PySpark df, which can be recreated by the code:

df = spark.createDataFrame([(1, "John Doe", "2020-11-30"), (2, "John Doe", "2020-11-27"), (3, "John Doe", "2020-11-29")], ("id", "name", "date"))
+---+--------+----------+
| id|    name|      date|
+---+--------+----------+
|  1|John Doe|2020-11-30|
|  2|John Doe|2020-11-27|
|  3|John Doe|2020-11-29|
+---+--------+----------+

I am looking to create a udf to calculate the difference between 2 rows of dates (using the lag function), excluding weekends, as
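
One way to do this is to lag the previous date inside a window and hand both dates to a Python UDF that counts only weekdays. This is a minimal sketch, assuming the date column holds 'yyyy-MM-dd' strings as in the excerpt and counting weekdays strictly after the earlier date up to and including the later one:

from datetime import datetime, timedelta
from pyspark.sql import SparkSession, Window, functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "John Doe", "2020-11-30"), (2, "John Doe", "2020-11-27"), (3, "John Doe", "2020-11-29")],
    ("id", "name", "date"))

def weekday_diff(start, end):
    if start is None or end is None:
        return None
    d1 = datetime.strptime(start, "%Y-%m-%d").date()
    d2 = datetime.strptime(end, "%Y-%m-%d").date()
    if d1 > d2:
        d1, d2 = d2, d1
    days = 0
    cur = d1 + timedelta(days=1)
    while cur <= d2:
        if cur.weekday() < 5:  # Monday=0 .. Friday=4
            days += 1
        cur += timedelta(days=1)
    return days

weekday_diff_udf = F.udf(weekday_diff, IntegerType())

w = Window.partitionBy("name").orderBy("date")
result = (df
          .withColumn("prev_date", F.lag("date").over(w))
          .withColumn("weekday_diff", weekday_diff_udf(F.col("prev_date"), F.col("date"))))
result.show()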

How to calculate daily basis in pyspark dataframe (time series)

核能气质少年 submitted on 2021-01-01 06:27:25
Problem: So I have a dataframe and I want to calculate some quantity, let's say on a daily basis. Let's say we have 10 columns col1, col2, col3, col4, ..., coln, where each column depends on the values of col1, col2, col3, col4 and so on, and the date resets based on the id.

+----------+----+---+----+ ... +----+
|      date|col1| id|col2| ... |coln|
+----------+----+---+----+ ... +----+
|2020-08-01|   0| M1| ... |   3|
|2020-08-02|   4| M1| ... |  10|
|2020-08-03|   3| M1| ... |   9|
|2020-08-04|   2| M1| ... |   8|
|2020-08-05|   1| M1| ... |   7|
|2020-08
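
The description points at a per-id, date-ordered running calculation. As a minimal sketch of that general pattern only (the real formula is not given in the excerpt), a window partitioned by id and ordered by date gives each row access to its history, and the calculation "resets" at each new id because the partition changes; the running sum and previous-day value below are stand-ins for the actual quantity.

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("2020-08-01", 0, "M1", 3), ("2020-08-02", 4, "M1", 10),
     ("2020-08-03", 3, "M1", 9), ("2020-08-04", 2, "M1", 8)],
    ["date", "col1", "id", "col2"])

# Per-id, date-ordered window: history is only visible within the same id.
w = Window.partitionBy("id").orderBy("date")

result = (df
          .withColumn("col1_running_sum", F.sum("col1").over(w))
          .withColumn("col2_prev_day", F.lag("col2").over(w)))
result.show()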

How to load big double numbers in a PySpark DataFrame and persist it back without changing the numeric format to scientific notation or precision?

五迷三道 submitted on 2020-12-15 07:18:10
Problem: I have a CSV like this:

COL,VAL
TEST,100000000.12345679
TEST2,200000000.1234
TEST3,9999.1234679123

I want to load it with the column VAL as a numeric type (due to other requirements of the project) and then persist it back to another CSV with the structure below:

+-----+------------------+
|  COL|               VAL|
+-----+------------------+
| TEST|100000000.12345679|
|TEST2|    200000000.1234|
|TEST3|   9999.1234679123|
+-----+------------------+

The problem I'm facing is that whenever I load it, the numbers
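
A common way to avoid the scientific notation and precision loss of doubles is to read VAL as a DecimalType with enough precision and scale; this is a minimal sketch with hypothetical input and output paths. Note that a fixed scale gives shorter values trailing zeros when written back, so a final formatting step may still be needed depending on the exact output requirement.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

spark = SparkSession.builder.getOrCreate()

# DecimalType(precision, scale): 20 digits in total, 10 after the decimal point,
# enough to hold 100000000.12345679 and 9999.1234679123 exactly.
schema = StructType([
    StructField("COL", StringType(), True),
    StructField("VAL", DecimalType(20, 10), True),
])

df = spark.read.csv("input.csv", header=True, schema=schema)   # hypothetical path
df.show(truncate=False)
df.write.csv("output_dir", header=True, mode="overwrite")      # hypothetical output directory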