pyspark-sql

getting the new row id from pySpark SQL write to remote mysql db (JDBC)

萝らか妹 submitted on 2020-01-05 06:31:12

Question: I am using pyspark-sql to create rows in a remote MySQL db, using JDBC. I have two tables, parent_table(id, value) and child_table(id, value, parent_id), so each parent_table row may have any number of child_table rows associated with it. Now I want to create some new data and insert it into the database. I'm following the code guidelines here for the write operation, but I would like to be able to do something like:

parentDf = sc.parallelize([5, 6, 7]).toDF(('value',))
parentWithIdDf =
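
The excerpt above stops mid-snippet, so here is a hedged sketch of one way to get usable ids for the parent/child link: generate the ids on the Spark side instead of relying on MySQL AUTO_INCREMENT, offsetting from the current maximum id read back over JDBC. The jdbc_url and connection_properties values are placeholders, and monotonically_increasing_id() gives unique but not consecutive ids.

from pyspark.sql import functions as F

# Hypothetical connection settings -- adjust for your environment.
jdbc_url = "jdbc:mysql://host:3306/mydb"
connection_properties = {"user": "user", "password": "secret",
                         "driver": "com.mysql.cj.jdbc.Driver"}

# Read the current maximum parent id so new ids do not collide.
max_id = (spark.read.jdbc(jdbc_url, "parent_table", properties=connection_properties)
          .agg(F.coalesce(F.max("id"), F.lit(0)).alias("max_id"))
          .first()["max_id"])

# Assign ids on the Spark side so the same values can be reused as
# parent_id when building the child rows.
parent_df = spark.createDataFrame([(5,), (6,), (7,)], ["value"])
parent_with_id = parent_df.withColumn(
    "id", F.monotonically_increasing_id() + F.lit(max_id + 1))

parent_with_id.select("id", "value").write.jdbc(
    jdbc_url, "parent_table", mode="append", properties=connection_properties)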

Count occurrences of a list of substrings in a pyspark df column

若如初见. submitted on 2020-01-04 05:33:07

Question: I want to count the occurrences of a list of substrings and create a column based on a column in the pyspark df which contains a long string.

Input:

ID  History
1   USA|UK|IND|DEN|MAL|SWE|AUS
2   USA|UK|PAK|NOR
3   NOR|NZE
4   IND|PAK|NOR

lst = ['USA','IND','DEN']

Output:

ID  History                     Count
1   USA|UK|IND|DEN|MAL|SWE|AUS  3
2   USA|UK|PAK|NOR              1
3   NOR|NZE                     0
4   IND|PAK|NOR                 1

Answer 1:

# Importing requisite packages and creating a DataFrame
from pyspark.sql.functions import split, col, size, regexp_replace
values = [(1,
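
Since the answer above is cut off, here is a hedged alternative sketch that produces the same counts without the split/regexp_replace route: add one boolean flag per substring and sum them. The spark session name and the values/lst variables mirror the question and are otherwise assumptions.

from functools import reduce
from operator import add
from pyspark.sql import functions as F

values = [(1, 'USA|UK|IND|DEN|MAL|SWE|AUS'),
          (2, 'USA|UK|PAK|NOR'),
          (3, 'NOR|NZE'),
          (4, 'IND|PAK|NOR')]
df = spark.createDataFrame(values, ['ID', 'History'])

lst = ['USA', 'IND', 'DEN']

# contains() yields a boolean per substring; casting to int and summing
# the flags counts how many list entries appear in History.
count_col = reduce(add, [F.col('History').contains(s).cast('int') for s in lst])
df.withColumn('Count', count_col).show(truncate=False)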

Compare rows of two dataframes to find the matching column count of 1's

自闭症网瘾萝莉.ら submitted on 2020-01-04 02:32:04

Question: I have 2 dataframes with the same schema. I need to compare the rows of the dataframes and keep a count of rows that have at least one column with value 1 in both dataframes. Right now I am building a list of the rows and then comparing the two lists to check whether any value is equal in both lists and equal to 1:

rowOgList = []
for row in cat_og_df.rdd.toLocalIterator():
    rowOgDict = {}
    for cat in categories:
        rowOgDict[cat] = row[cat]
    rowOgList.append(rowOgDict)

#print(rowOgList[0])

rowPredList = []
for
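
The snippet above is truncated mid-loop. As a hedged, Spark-native alternative to collecting rows locally, the two frames can be paired on a positional index and tested per row for a column that is 1 in both. This is only a sketch: it assumes both frames share the same row order, and the names cat_og_df, cat_pred_df and categories come from the question.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Attach a positional row index to each frame so rows can be paired up.
# The pairing is only meaningful if both frames share the same ordering.
w = Window.orderBy(F.monotonically_increasing_id())
og = cat_og_df.withColumn("rid", F.row_number().over(w))
pred = cat_pred_df.withColumn("rid", F.row_number().over(w))

joined = og.alias("o").join(pred.alias("p"), "rid")

# A row counts if at least one category column equals 1 in both frames.
match_cond = None
for cat in categories:
    cond = (F.col("o." + cat) == 1) & (F.col("p." + cat) == 1)
    match_cond = cond if match_cond is None else (match_cond | cond)

print(joined.filter(match_cond).count())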

Get Last Monday in Spark

社会主义新天地 submitted on 2020-01-03 18:05:53

Question: I am using Spark 2.0 with the Python API. I have a dataframe with a column of type DateType(). I would like to add a column to the dataframe containing the most recent Monday. I can do it like this:

reg_schema = pyspark.sql.types.StructType([
    pyspark.sql.types.StructField('AccountCreationDate', pyspark.sql.types.DateType(), True),
    pyspark.sql.types.StructField('UserId', pyspark.sql.types.LongType(), True)
])
reg = spark.read.schema(reg_schema).option('header', True).csv(path_to_file)
reg =
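
The excerpt ends before the actual date arithmetic. One common pattern (a sketch, not necessarily the thread's answer) combines next_day with date_sub, since next_day(d, 'Mon') is the first Monday strictly after d:

from pyspark.sql import functions as F

# Stepping back 7 days from the next Monday lands on the most recent
# Monday, and on d itself when d already is a Monday.
reg = reg.withColumn(
    'LastMonday',
    F.date_sub(F.next_day(F.col('AccountCreationDate'), 'Mon'), 7))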

Pyspark : How to pick the values till last from the first occurrence in an array based on the matching values in another column

时光总嘲笑我的痴心妄想 submitted于 will not do -- submitted on 2020-01-03 02:46:22

Question: I have a dataframe where I need to search for a value in one column (StringType) inside another column (ArrayType), and I want to pick the values of the array column from the first occurrence of the string column's value through to the last element of the array. Explained below with examples.

Input DF:

Employee_Name|Employee_ID|Mapped_Project_ID
Name1|E101|[E101, E102, E103]
Name2|E102|[E101, E102, E103]
Name3|E103|[E101, E102, E103, E104, E105]

Output DF should look like:

Employee_Name
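
The expected output above is cut off, so this is only a hedged sketch of the usual approach: take the array slice from the 1-based position of the match through the end. It assumes Spark 2.4+ for array_position and slice, and the output column name Remaining_Project_ID is made up for illustration.

from pyspark.sql import functions as F

data = [("Name1", "E101", ["E101", "E102", "E103"]),
        ("Name2", "E102", ["E101", "E102", "E103"]),
        ("Name3", "E103", ["E101", "E102", "E103", "E104", "E105"])]
df = spark.createDataFrame(data, ["Employee_Name", "Employee_ID", "Mapped_Project_ID"])

# slice() from the position of Employee_ID in the array through its end.
df = df.withColumn(
    "Remaining_Project_ID",
    F.expr("slice(Mapped_Project_ID, "
           "array_position(Mapped_Project_ID, Employee_ID), "
           "size(Mapped_Project_ID) - array_position(Mapped_Project_ID, Employee_ID) + 1)"))
df.show(truncate=False)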

Split one column based the value of another column in pyspark [duplicate]

懵懂的女人 submitted on 2020-01-03 00:56:08

Question: This question already has an answer here: Using a column value as a parameter to a spark DataFrame function (1 answer). Closed 8 months ago.

I have the following data frame:

+----+-------+
|item|   path|
+----+-------+
|   a|  a/b/c|
|   b|  e/b/f|
|   d|e/b/d/h|
|   c|  g/h/c|
+----+-------+

I want to find the relative path of the value in column "item" by locating it in column 'path' and extracting the path's left-hand side, as shown below:

+----+-------+--------+
|item|   path|rel_path|
+----+-------+--------+
|   a|      a
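
The expected-output table is truncated, so the exact rel_path convention is an assumption here: this hedged sketch keeps everything from the start of path up to and including the item value, using a SQL expression so the item column can act as the parameter.

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("a", "a/b/c"), ("b", "e/b/f"), ("d", "e/b/d/h"), ("c", "g/h/c")],
    ["item", "path"])

# instr() locates the item inside the path; substring() then keeps the
# left-hand side up to and including that match.
df = df.withColumn(
    "rel_path",
    F.expr("substring(path, 1, instr(path, item) + length(item) - 1)"))
df.show()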

pyspark use dataframe inside udf

北城以北 submitted on 2020-01-02 18:38:20

Question: I have two dataframes,

df1
+---+---+----------+
|  n|val| distances|
+---+---+----------+
|  1|  1|0.27308652|
|  2|  1|0.24969208|
|  3|  1|0.21314497|
+---+---+----------+

and df2
+---+---+----------+
| x1| x2|         w|
+---+---+----------+
|  1|  2|0.03103427|
|  1|  4|0.19012526|
|  1| 10|0.26805446|
|  1|  8|0.26825935|
+---+---+----------+

I want to add a new column to df1 called gamma, which will contain the sum of the w value from df2 when df1.n == df2.x1 OR df1.n == df2.x2. I tried to use a udf, but
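
A udf cannot reference another DataFrame, so a join plus aggregation is the usual workaround. A hedged sketch reusing the df1/df2 names from the question:

from pyspark.sql import functions as F

# Join df2 to df1 wherever n matches x1 or x2, then sum w per n.
gamma = (df1.join(df2, (df1.n == df2.x1) | (df1.n == df2.x2), "left")
            .groupBy("n")
            .agg(F.sum("w").alias("gamma")))

# Attach the aggregated value back onto df1 as the new 'gamma' column.
df1_with_gamma = df1.join(gamma, "n", "left")
df1_with_gamma.show()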

how to store grouped data into json in pyspark

瘦欲@ submitted on 2020-01-01 17:38:09

Question: I am new to pyspark. I have a dataset which looks like this (just a snapshot of a few columns). I want to group my data by key, where my key is CONCAT(a.div_nbr, a.cust_nbr). My ultimate goal is to convert the data into JSON, formatted like this:

k1[{v1,v2,....},{v1,v2,....}], k2[{v1,v2,....},{v1,v2,....}],....

e.g.

248138339 [{ PRECIMA_ID:SCP 00248 0000138339, PROD_NBR:5553505, PROD_DESC:Shot and a Beer Battered Onion Rings (5553505 and 9285840), PROD_BRND:Molly's Kitchen, PACK_SIZE:4/2.5 LB, QTY_UOM:CA }, {
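
A hedged sketch of the usual pattern for this shape of output: build the key with concat, collect each group's rows as an array of structs, and serialize with to_json. The column names are taken from the example above; the DataFrame name df is an assumption.

from pyspark.sql import functions as F

# Group by the concatenated key, gather each group's rows as structs,
# then serialize the array of structs to a JSON string per key.
grouped = (df.withColumn("key", F.concat(F.col("div_nbr"), F.col("cust_nbr")))
             .groupBy("key")
             .agg(F.collect_list(F.struct(
                 "PRECIMA_ID", "PROD_NBR", "PROD_DESC",
                 "PROD_BRND", "PACK_SIZE", "QTY_UOM")).alias("values"))
             .select("key", F.to_json("values").alias("json")))

grouped.show(truncate=False)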