pyspark-dataframes

Multiply two pyspark dataframe columns with different types (array[double] vs double) without breeze

Submitted by 自闭症网瘾萝莉.ら on 2020-01-25 06:48:09
Question: I have the same problem as asked here, but I need a solution in PySpark and without Breeze. For example, if my PySpark dataframe looks like this:

user | weight | vec
"u1" | 0.1    | [2, 4, 6]
"u1" | 0.5    | [4, 8, 12]
"u2" | 0.5    | [20, 40, 60]

where column weight has type double and column vec has type Array[Double], I would like to get the weighted sum of the vectors per user, so that I get a dataframe that looks like this:

user | wsum
"u1" | [2.2, 4.4, 6.6]
"u2" | [10, 20, 30]

To do this I have …
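One way this could be done in PySpark without Breeze (a sketch, not the asker's or an accepted solution): explode each vector with its position, scale every element by its row's weight, sum per position, and collect the sums back into an array.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("u1", 0.1, [2.0, 4.0, 6.0]),
     ("u1", 0.5, [4.0, 8.0, 12.0]),
     ("u2", 0.5, [20.0, 40.0, 60.0])],
    ["user", "weight", "vec"],
)

wsum = (
    df.select("user", "weight", F.posexplode("vec").alias("pos", "val"))
      .groupBy("user", "pos")                                      # one row per user and vector position
      .agg(F.sum(F.col("weight") * F.col("val")).alias("elem"))    # weighted sum for that position
      .groupBy("user")
      .agg(F.sort_array(F.collect_list(F.struct("pos", "elem"))).alias("tmp"))  # restore element order
      .select("user", F.col("tmp.elem").alias("wsum"))             # keep only the summed values
)
wsum.show(truncate=False)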

How do I reduce a spark dataframe to a maximum number of rows for each value in a column?

Submitted by 天大地大妈咪最大 on 2020-01-23 19:39:29
Question: I need to reduce a dataframe and export it to Parquet. I need to make sure that I have, for example, 10000 rows for each value in a column. The dataframe I am working with looks like the following:

+-------------+-------------------+
|         Make|              Model|
+-------------+-------------------+
|      PONTIAC|           GRAND AM|
|        BUICK|            CENTURY|
|        LEXUS|             IS 300|
|MERCEDES-BENZ|           SL-CLASS|
|      PONTIAC|           GRAND AM|
|       TOYOTA|              PRIUS|
|   MITSUBISHI|      MONTERO SPORT|
|MERCEDES-BENZ|          SLK-CLASS|
|       TOYOTA|              CAMRY|
|         JEEP|           WRANGLER|
| …
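A sketch of one way the cap could be enforced (the grouping column, the cap, and the output path are assumptions): number the rows inside each Make with a window function, keep at most N of them, and write the result to Parquet.

from pyspark.sql import functions as F, Window

N = 10000  # assumed maximum number of rows per value

# df is taken to be the Make/Model dataframe shown above.
w = Window.partitionBy("Make").orderBy(F.monotonically_increasing_id())
capped = (
    df.withColumn("rn", F.row_number().over(w))  # 1, 2, 3, ... within each Make
      .filter(F.col("rn") <= N)
      .drop("rn")
)
capped.write.mode("overwrite").parquet("s3://some-bucket/capped/")  # hypothetical output path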

Find number of rows in a given week in PySpark

Submitted by 孤者浪人 on 2020-01-16 05:36:08
Question: I have a PySpark dataframe, a small portion of which is given below:

+------+-----+-------------------+-----+
|  name| type|          timestamp|score|
+------+-----+-------------------+-----+
| name1|type1|2012-01-10 00:00:00|   11|
| name1|type1|2012-01-10 00:00:10|   14|
| name1|type1|2012-01-10 00:00:20|    2|
| name1|type1|2012-01-10 00:00:30|    3|
| name1|type1|2012-01-10 00:00:40|   55|
| name1|type1|2012-01-10 00:00:50|   10|
| name5|type1|2012-01-10 00:01:00|    5|
| name2|type2|2012-01-10 00:01:10|    8|
| name5…
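A sketch of one reading of the question (counting how many rows fall into each calendar week; df and the timestamp format are taken from the sample above): truncate each timestamp to the start of its week and count per group.

from pyspark.sql import functions as F

weekly_counts = (
    df.withColumn("week_start", F.date_trunc("week", F.to_timestamp("timestamp")))
      .groupBy("week_start")   # one group per calendar week
      .count()
      .orderBy("week_start")
)
weekly_counts.show(truncate=False)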

How to optimize the percentage check and column drop in a large pyspark dataframe?

Submitted by 删除回忆录丶 on 2020-01-15 09:48:08
Question: I have a sample pandas dataframe like the one shown below, but my real data is 40 million rows and 5200 columns.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    'readings': ['READ_1', 'READ_2', 'READ_1', 'READ_3', np.nan, 'READ_5', np.nan, 'READ_8',
                 'READ_10', 'READ_12', 'READ_11', 'READ_14', 'READ_09', 'READ_08', 'READ_07'],
    'val': [5, 6, 7, np.nan, np.nan, 7, np.nan, 12, 13, 56, 32, 13, 45, 43, 46],
})

from pyspark.sql.types import *
from pyspark.sql.functions import isnan, when, count, col

mySchema = …
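A sketch of how the missing-value percentage could be computed in a single pass and used to drop columns (the Spark dataframe name sdf and the 80% threshold are assumptions):

from pyspark.sql import functions as F

threshold = 0.80  # assumed cutoff: drop columns that are more than 80% missing

# sdf: the Spark dataframe built from the pandas frame above, e.g. spark.createDataFrame(df, mySchema)
numeric = {c for c, t in sdf.dtypes if t in ("double", "float")}

def is_missing(c):
    # NaN can only appear in float/double columns; elsewhere isNull is enough.
    return (F.col(c).isNull() | F.isnan(c)) if c in numeric else F.col(c).isNull()

missing_frac = sdf.select([
    (F.count(F.when(is_missing(c), c)) / F.count(F.lit(1))).alias(c)
    for c in sdf.columns
]).first().asDict()

cols_to_drop = [c for c, frac in missing_frac.items() if frac > threshold]
reduced = sdf.drop(*cols_to_drop)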

PySpark UDF on withColumn to replace column

Submitted by 徘徊边缘 on 2020-01-06 05:43:08
Question: This UDF is written to replace a column's value with a variable. Python 2.7; Spark 2.2.0.

import pyspark.sql.functions as func
from pyspark.sql.types import StringType

def updateCol(col, st):
    return func.expr(col).replace(func.expr(col), func.expr(st))

updateColUDF = func.udf(updateCol, StringType())

Variables L_1 to L_3 hold the updated column values for each row. This is how I am calling it:

updatedDF = orig_df.withColumn("L1", updateColUDF("L1", func.format_string(L_1))). \
    withColumn("L2", updateColUDF("L2", func.format_string(L_2))). \
    …
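For comparison, a minimal sketch of replacing a column with a plain Python variable without any UDF (an assumption about the intent; the values of L_1 to L_3 are illustrative): functions.lit wraps the variable as a literal column.

import pyspark.sql.functions as func

L_1, L_2, L_3 = "val_1", "val_2", "val_3"  # hypothetical replacement values

# orig_df is the asker's dataframe; lit() makes each variable a constant column.
updatedDF = (
    orig_df.withColumn("L1", func.lit(L_1))
           .withColumn("L2", func.lit(L_2))
           .withColumn("L3", func.lit(L_3))
)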

Import pyspark dataframe from multiple S3 buckets, with a column denoting which bucket the entry came from

Submitted by *爱你&永不变心* on 2020-01-06 05:23:07
Question: I have a list of S3 buckets partitioned by date. The first bucket is titled 2019-12-1, the second 2019-12-2, etc. Each of these buckets stores Parquet files that I am reading into a PySpark dataframe, and the dataframe generated from each bucket has the exact same schema. What I would like to do is iterate over these buckets and store all of these Parquet files in a single PySpark dataframe that has a date column denoting which bucket each entry in the dataframe actually came …
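A sketch of one way to build that single dataframe (the bucket names, the s3a path scheme, and the date list are assumptions): read each bucket on its own, tag its rows with lit(date), and union the pieces.

from functools import reduce
from pyspark.sql import functions as F

dates = ["2019-12-1", "2019-12-2", "2019-12-3"]  # hypothetical list of bucket names

parts = [
    spark.read.parquet("s3a://{}/".format(d)).withColumn("date", F.lit(d))
    for d in dates
]
combined = reduce(lambda left, right: left.unionByName(right), parts)  # all buckets, same schema plus date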

Compare rows of two dataframes to find the matching column count of 1's

Submitted by 自闭症网瘾萝莉.ら on 2020-01-04 02:32:04
Question: I have two dataframes with the same schema. I need to compare their rows and keep a count of rows that have at least one column with value 1 in both dataframes. Right now I am making a list of the rows and then comparing the two lists to find whether even one value is equal in both lists and equal to 1:

rowOgList = []
for row in cat_og_df.rdd.toLocalIterator():
    rowOgDict = {}
    for cat in categories:
        rowOgDict[cat] = row[cat]
    rowOgList.append(rowOgDict)

#print(rowOgList[0])

rowPredList = []
for …
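A sketch of a join-based alternative to the local-iterator loops (the second dataframe name cat_pred_df is assumed from context, and monotonically_increasing_id only lines the two frames up if they share the same ordering and partitioning, which is an extra assumption here):

from functools import reduce
from pyspark.sql import functions as F

a = cat_og_df.withColumn("rid", F.monotonically_increasing_id()).alias("og")
b = cat_pred_df.withColumn("rid", F.monotonically_increasing_id()).alias("pred")  # assumed name

joined = a.join(b, on="rid")
any_match = reduce(
    lambda acc, c: acc | ((F.col("og." + c) == 1) & (F.col("pred." + c) == 1)),
    categories,          # the category columns from the question
    F.lit(False),
)
matching_rows = joined.filter(any_match).count()  # rows where at least one category is 1 in both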

pyspark one-to-many join operation

Submitted by 为君一笑 on 2019-12-13 03:18:18
Question: Say there are two PySpark dataframes, dfA and dfB:

dfA: name, class
dfB: class, time

If dfA.select('class').distinct().count() = n, how should I optimize the join between the two for the cases n < 100 and n > 100000?

Source: https://stackoverflow.com/questions/58026274/pyspark-one-to-many-join-operation
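A sketch of one common tuning knob rather than a complete answer (which side is small enough to broadcast, and the partition count, are assumptions): for a small key space, broadcast the lookup side; for a very large one, keep the shuffle join and adjust its parallelism or handle skew.

from pyspark.sql import functions as F

# Small n: broadcast the smaller side so the join happens locally on each executor.
joined_small = dfA.join(F.broadcast(dfB), on="class")

# Large n: keep the regular shuffle (sort-merge) join and tune parallelism if needed.
spark.conf.set("spark.sql.shuffle.partitions", "400")  # hypothetical value
joined_large = dfA.join(dfB, on="class")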

Split JSON string column to multiple columns

Submitted by 那年仲夏 on 2019-12-01 14:37:41
I'm looking for a generic solution to extract all the JSON fields as columns from a JSON string column.

df = spark.read.load(path)
df.show()

The file format of the files in 'path' is Parquet.

Sample data:

|id | json_data
| 1 | {"name":"abc", "depts":["dep01", "dep02"]}
| 2 | {"name":"xyz", "depts":["dep03"],"sal":100}
| 3 | {"name":"pqr", "depts":["dep02"], "address":{"city":"SF","state":"CA"}}

Expected output:

|id | name  | depts              | sal  | address_city | address_state
| 1 | "abc" | ["dep01", "dep02"] | null | null         | null
| 2 | "xyz" | ["dep03"]          | 100  | null         | null
| 3 | "pqr" | ["dep02"]          | null | "SF"         | …
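A sketch of one generic approach (an assumption, not the accepted answer): infer a schema that covers every field by re-reading the JSON strings with spark.read.json, parse the column with from_json, and flatten the resulting struct.

from pyspark.sql import functions as F

# Infer a schema covering all fields that appear anywhere in json_data.
inferred = spark.read.json(df.select("json_data").rdd.map(lambda r: r.json_data)).schema

parsed = df.withColumn("j", F.from_json("json_data", inferred))
flat = parsed.select(
    "id",
    F.col("j.name").alias("name"),
    F.col("j.depts").alias("depts"),
    F.col("j.sal").alias("sal"),
    F.col("j.address.city").alias("address_city"),    # null where address is absent
    F.col("j.address.state").alias("address_state"),
)
flat.show(truncate=False)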
