pyspark-dataframes

Multiply two pyspark dataframe columns with different types (array[double] vs double) without breeze

Submitted by 自闭症网瘾萝莉.ら on 2020-01-25 06:48:09
Question: I have the same problem as asked here, but I need a solution in PySpark and without Breeze. For example, if my PySpark dataframe looks like this:

user | weight | vec
"u1" | 0.1    | [2, 4, 6]
"u1" | 0.5    | [4, 8, 12]
"u2" | 0.5    | [20, 40, 60]

where column weight has type double and column vec has type Array[Double], I would like to get the weighted sum of the vectors per user, so that I get a dataframe that looks like this:

user | wsum
"u1" | [2.2, 4.4, 6.6]
"u2" | [10, 20, 30]

To do this I have …
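One way this could be done in PySpark without Breeze (a sketch, not the asker's or an accepted solution): explode each vector with its position, scale every element by its row's weight, sum per position, and collect the sums back into an array.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("u1", 0.1, [2.0, 4.0, 6.0]),
     ("u1", 0.5, [4.0, 8.0, 12.0]),
     ("u2", 0.5, [20.0, 40.0, 60.0])],
    ["user", "weight", "vec"],
)

wsum = (
    df.select("user", "weight", F.posexplode("vec").alias("pos", "val"))
      .groupBy("user", "pos")                                      # one row per user and vector position
      .agg(F.sum(F.col("weight") * F.col("val")).alias("elem"))    # weighted sum for that position
      .groupBy("user")
      .agg(F.sort_array(F.collect_list(F.struct("pos", "elem"))).alias("tmp"))  # restore element order
      .select("user", F.col("tmp.elem").alias("wsum"))             # keep only the summed values
)
wsum.show(truncate=False)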

How do I reduce a spark dataframe to a maximum number of rows for each value in a column?

Submitted by 天大地大妈咪最大 on 2020-01-23 19:39:29
Question: I need to reduce a dataframe and export it to Parquet. I need to make sure that I have, for example, 10000 rows for each value in a column. The dataframe I am working with looks like the following:

+-------------+-------------------+
|         Make|              Model|
+-------------+-------------------+
|      PONTIAC|           GRAND AM|
|        BUICK|            CENTURY|
|        LEXUS|             IS 300|
|MERCEDES-BENZ|           SL-CLASS|
|      PONTIAC|           GRAND AM|
|       TOYOTA|              PRIUS|
|   MITSUBISHI|      MONTERO SPORT|
|MERCEDES-BENZ|          SLK-CLASS|
|       TOYOTA|              CAMRY|
|         JEEP|           WRANGLER|
| …
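A sketch of one way the cap could be enforced (the grouping column, the cap, and the output path are assumptions): number the rows inside each Make with a window function, keep at most N of them, and write the result to Parquet.

from pyspark.sql import functions as F, Window

N = 10000  # assumed maximum number of rows per value

# df is taken to be the Make/Model dataframe shown above.
w = Window.partitionBy("Make").orderBy(F.monotonically_increasing_id())
capped = (
    df.withColumn("rn", F.row_number().over(w))  # 1, 2, 3, ... within each Make
      .filter(F.col("rn") <= N)
      .drop("rn")
)
capped.write.mode("overwrite").parquet("s3://some-bucket/capped/")  # hypothetical output path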

Find number of rows in a given week in PySpark

Submitted by 孤者浪人 on 2020-01-16 05:36:08
Question: I have a PySpark dataframe, a small portion of which is given below:

+------+-----+-------------------+-----+
|  name| type|          timestamp|score|
+------+-----+-------------------+-----+
| name1|type1|2012-01-10 00:00:00|   11|
| name1|type1|2012-01-10 00:00:10|   14|
| name1|type1|2012-01-10 00:00:20|    2|
| name1|type1|2012-01-10 00:00:30|    3|
| name1|type1|2012-01-10 00:00:40|   55|
| name1|type1|2012-01-10 00:00:50|   10|
| name5|type1|2012-01-10 00:01:00|    5|
| name2|type2|2012-01-10 00:01:10|    8|
| name5…
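A sketch of one reading of the question (counting how many rows fall into each calendar week; df and the timestamp format are taken from the sample above): truncate each timestamp to the start of its week and count per group.

from pyspark.sql import functions as F

weekly_counts = (
    df.withColumn("week_start", F.date_trunc("week", F.to_timestamp("timestamp")))
      .groupBy("week_start")   # one group per calendar week
      .count()
      .orderBy("week_start")
)
weekly_counts.show(truncate=False)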

How to optimize the percentage check and column drop in a large pyspark dataframe?

Submitted by 删除回忆录丶 on 2020-01-15 09:48:08
Question: I have a sample pandas dataframe like the one shown below, but my real data is 40 million rows and 5200 columns.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    'readings': ['READ_1', 'READ_2', 'READ_1', 'READ_3', np.nan, 'READ_5', np.nan, 'READ_8',
                 'READ_10', 'READ_12', 'READ_11', 'READ_14', 'READ_09', 'READ_08', 'READ_07'],
    'val': [5, 6, 7, np.nan, np.nan, 7, np.nan, 12, 13, 56, 32, 13, 45, 43, 46],
})

from pyspark.sql.types import *
from pyspark.sql.functions import isnan, when, count, col

mySchema = …
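A sketch of how the missing-value percentage could be computed in a single pass and used to drop columns (the Spark dataframe name sdf and the 80% threshold are assumptions):

from pyspark.sql import functions as F

threshold = 0.80  # assumed cutoff: drop columns that are more than 80% missing

# sdf: the Spark dataframe built from the pandas frame above, e.g. spark.createDataFrame(df, mySchema)
numeric = {c for c, t in sdf.dtypes if t in ("double", "float")}

def is_missing(c):
    # NaN can only appear in float/double columns; elsewhere isNull is enough.
    return (F.col(c).isNull() | F.isnan(c)) if c in numeric else F.col(c).isNull()

missing_frac = sdf.select([
    (F.count(F.when(is_missing(c), c)) / F.count(F.lit(1))).alias(c)
    for c in sdf.columns
]).first().asDict()

cols_to_drop = [c for c, frac in missing_frac.items() if frac > threshold]
reduced = sdf.drop(*cols_to_drop)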

PySpark UDF on withColumn to replace column

Submitted by 徘徊边缘 on 2020-01-06 05:43:08
Question: This UDF is written to replace a column's value with a variable. Python 2.7; Spark 2.2.0.

import pyspark.sql.functions as func
from pyspark.sql.types import StringType

def updateCol(col, st):
    return func.expr(col).replace(func.expr(col), func.expr(st))

updateColUDF = func.udf(updateCol, StringType())

Variables L_1 to L_3 hold the updated column values for each row. This is how I am calling it:

updatedDF = orig_df.withColumn("L1", updateColUDF("L1", func.format_string(L_1))). \
    withColumn("L2", updateColUDF("L2", func.format_string(L_2))). \
    …
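For comparison, a minimal sketch of replacing a column with a plain Python variable without any UDF (an assumption about the intent; the values of L_1 to L_3 are illustrative): functions.lit wraps the variable as a literal column.

import pyspark.sql.functions as func

L_1, L_2, L_3 = "val_1", "val_2", "val_3"  # hypothetical replacement values

# orig_df is the asker's dataframe; lit() makes each variable a constant column.
updatedDF = (
    orig_df.withColumn("L1", func.lit(L_1))
           .withColumn("L2", func.lit(L_2))
           .withColumn("L3", func.lit(L_3))
)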

Import pyspark dataframe from multiple S3 buckets, with a column denoting which bucket the entry came from

Submitted by *爱你&永不变心* on 2020-01-06 05:23:07
Question: I have a list of S3 buckets partitioned by date. The first bucket is titled 2019-12-1, the second 2019-12-2, etc. Each of these buckets stores Parquet files that I am reading into a PySpark dataframe, and the dataframe generated from each bucket has the exact same schema. What I would like to do is iterate over these buckets and store all of these Parquet files in a single PySpark dataframe that has a date column denoting which bucket each entry in the dataframe actually came …
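A sketch of one way to build that single dataframe (the bucket names, the s3a path scheme, and the date list are assumptions): read each bucket on its own, tag its rows with lit(date), and union the pieces.

from functools import reduce
from pyspark.sql import functions as F

dates = ["2019-12-1", "2019-12-2", "2019-12-3"]  # hypothetical list of bucket names

parts = [
    spark.read.parquet("s3a://{}/".format(d)).withColumn("date", F.lit(d))
    for d in dates
]
combined = reduce(lambda left, right: left.unionByName(right), parts)  # all buckets, same schema plus date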

Compare rows of two dataframes to find the matching column count of 1's

Submitted by 自闭症网瘾萝莉.ら on 2020-01-04 02:32:04
Question: I have two dataframes with the same schema. I need to compare their rows and keep a count of rows that have at least one column with value 1 in both dataframes. Right now I am making a list of the rows and then comparing the two lists to find whether even one value is equal in both lists and equal to 1:

rowOgList = []
for row in cat_og_df.rdd.toLocalIterator():
    rowOgDict = {}
    for cat in categories:
        rowOgDict[cat] = row[cat]
    rowOgList.append(rowOgDict)

#print(rowOgList[0])

rowPredList = []
for …
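A sketch of a join-based alternative to the local-iterator loops (the second dataframe name cat_pred_df is assumed from context, and monotonically_increasing_id only lines the two frames up if they share the same ordering and partitioning, which is an extra assumption here):

from functools import reduce
from pyspark.sql import functions as F

a = cat_og_df.withColumn("rid", F.monotonically_increasing_id()).alias("og")
b = cat_pred_df.withColumn("rid", F.monotonically_increasing_id()).alias("pred")  # assumed name

joined = a.join(b, on="rid")
any_match = reduce(
    lambda acc, c: acc | ((F.col("og." + c) == 1) & (F.col("pred." + c) == 1)),
    categories,          # the category columns from the question
    F.lit(False),
)
matching_rows = joined.filter(any_match).count()  # rows where at least one category is 1 in both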

pyspark one-to-many join operation

Submitted by 为君一笑 on 2019-12-13 03:18:18
Question: Say there are two PySpark dataframes, dfA and dfB:

dfA: name, class
dfB: class, time

If dfA.select('class').distinct().count() = n, how should I optimize the join between the two for the cases n < 100 and n > 100000?

Source: https://stackoverflow.com/questions/58026274/pyspark-one-to-many-join-operation
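A sketch of one common tuning knob rather than a complete answer (which side is small enough to broadcast, and the partition count, are assumptions): for a small key space, broadcast the lookup side; for a very large one, keep the shuffle join and adjust its parallelism or handle skew.

from pyspark.sql import functions as F

# Small n: broadcast the smaller side so the join happens locally on each executor.
joined_small = dfA.join(F.broadcast(dfB), on="class")

# Large n: keep the regular shuffle (sort-merge) join and tune parallelism if needed.
spark.conf.set("spark.sql.shuffle.partitions", "400")  # hypothetical value
joined_large = dfA.join(dfB, on="class")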

Split JSON string column to multiple columns

Submitted by 那年仲夏 on 2019-12-01 14:37:41
I'm looking for a generic solution to extract all the JSON fields as columns from a JSON string column.

df = spark.read.load(path)
df.show()

The file format of the files in 'path' is Parquet.

Sample data:

|id | json_data
| 1 | {"name":"abc", "depts":["dep01", "dep02"]}
| 2 | {"name":"xyz", "depts":["dep03"],"sal":100}
| 3 | {"name":"pqr", "depts":["dep02"], "address":{"city":"SF","state":"CA"}}

Expected output:

|id | name  | depts              | sal  | address_city | address_state
| 1 | "abc" | ["dep01", "dep02"] | null | null         | null
| 2 | "xyz" | ["dep03"]          | 100  | null         | null
| 3 | "pqr" | ["dep02"]          | null | "SF"         | …
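A sketch of one generic approach (an assumption, not the accepted answer): infer a schema that covers every field by re-reading the JSON strings with spark.read.json, parse the column with from_json, and flatten the resulting struct.

from pyspark.sql import functions as F

# Infer a schema covering all fields that appear anywhere in json_data.
inferred = spark.read.json(df.select("json_data").rdd.map(lambda r: r.json_data)).schema

parsed = df.withColumn("j", F.from_json("json_data", inferred))
flat = parsed.select(
    "id",
    F.col("j.name").alias("name"),
    F.col("j.depts").alias("depts"),
    F.col("j.sal").alias("sal"),
    F.col("j.address.city").alias("address_city"),    # null where address is absent
    F.col("j.address.state").alias("address_state"),
)
flat.show(truncate=False)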
