pyspark-sql

getting the new row id from pySpark SQL write to remote mysql db (JDBC)

萝らか妹 submitted on 2020-01-05 06:31:12

Question: I am using pyspark-sql to create rows in a remote MySQL db, using JDBC. I have two tables, parent_table(id, value) and child_table(id, value, parent_id), so each parent_table row may have any number of child_table rows associated with it. Now I want to create some new data and insert it into the database. I'm following the code guidelines here for the write operation, but I would like to be able to do something like:

parentDf = sc.parallelize([5, 6, 7]).toDF(('value',))
parentWithIdDf =
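
The excerpt above stops mid-snippet, so here is a hedged sketch of one way to get usable ids for the parent/child link: generate the ids on the Spark side instead of relying on MySQL AUTO_INCREMENT, offsetting from the current maximum id read back over JDBC. The jdbc_url and connection_properties values are placeholders, and monotonically_increasing_id() gives unique but not consecutive ids.

from pyspark.sql import functions as F

# Hypothetical connection settings -- adjust for your environment.
jdbc_url = "jdbc:mysql://host:3306/mydb"
connection_properties = {"user": "user", "password": "secret",
                         "driver": "com.mysql.cj.jdbc.Driver"}

# Read the current maximum parent id so new ids do not collide.
max_id = (spark.read.jdbc(jdbc_url, "parent_table", properties=connection_properties)
          .agg(F.coalesce(F.max("id"), F.lit(0)).alias("max_id"))
          .first()["max_id"])

# Assign ids on the Spark side so the same values can be reused as
# parent_id when building the child rows.
parent_df = spark.createDataFrame([(5,), (6,), (7,)], ["value"])
parent_with_id = parent_df.withColumn(
    "id", F.monotonically_increasing_id() + F.lit(max_id + 1))

parent_with_id.select("id", "value").write.jdbc(
    jdbc_url, "parent_table", mode="append", properties=connection_properties)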

Count occurrences of a list of substrings in a pyspark df column

若如初见. submitted on 2020-01-04 05:33:07

Question: I want to count the occurrences of a list of substrings and create a column based on a column in the pyspark df which contains a long string.

Input:

ID  History
1   USA|UK|IND|DEN|MAL|SWE|AUS
2   USA|UK|PAK|NOR
3   NOR|NZE
4   IND|PAK|NOR

lst = ['USA','IND','DEN']

Output:

ID  History                     Count
1   USA|UK|IND|DEN|MAL|SWE|AUS  3
2   USA|UK|PAK|NOR              1
3   NOR|NZE                     0
4   IND|PAK|NOR                 1

Answer 1:

# Importing requisite packages and creating a DataFrame
from pyspark.sql.functions import split, col, size, regexp_replace
values = [(1,
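
Since the answer above is cut off, here is a hedged alternative sketch that produces the same counts without the split/regexp_replace route: add one boolean flag per substring and sum them. The spark session name and the values/lst variables mirror the question and are otherwise assumptions.

from functools import reduce
from operator import add
from pyspark.sql import functions as F

values = [(1, 'USA|UK|IND|DEN|MAL|SWE|AUS'),
          (2, 'USA|UK|PAK|NOR'),
          (3, 'NOR|NZE'),
          (4, 'IND|PAK|NOR')]
df = spark.createDataFrame(values, ['ID', 'History'])

lst = ['USA', 'IND', 'DEN']

# contains() yields a boolean per substring; casting to int and summing
# the flags counts how many list entries appear in History.
count_col = reduce(add, [F.col('History').contains(s).cast('int') for s in lst])
df.withColumn('Count', count_col).show(truncate=False)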

Compare rows of two dataframes to find the matching column count of 1's

自闭症网瘾萝莉.ら submitted on 2020-01-04 02:32:04

Question: I have 2 dataframes with the same schema. I need to compare the rows of the dataframes and keep a count of rows that have at least one column with value 1 in both dataframes. Right now I am building a list of the rows and then comparing the two lists to check whether any value is equal in both lists and equal to 1:

rowOgList = []
for row in cat_og_df.rdd.toLocalIterator():
    rowOgDict = {}
    for cat in categories:
        rowOgDict[cat] = row[cat]
    rowOgList.append(rowOgDict)

#print(rowOgList[0])

rowPredList = []
for
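
The snippet above is truncated mid-loop. As a hedged, Spark-native alternative to collecting rows locally, the two frames can be paired on a positional index and tested per row for a column that is 1 in both. This is only a sketch: it assumes both frames share the same row order, and the names cat_og_df, cat_pred_df and categories come from the question.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Attach a positional row index to each frame so rows can be paired up.
# The pairing is only meaningful if both frames share the same ordering.
w = Window.orderBy(F.monotonically_increasing_id())
og = cat_og_df.withColumn("rid", F.row_number().over(w))
pred = cat_pred_df.withColumn("rid", F.row_number().over(w))

joined = og.alias("o").join(pred.alias("p"), "rid")

# A row counts if at least one category column equals 1 in both frames.
match_cond = None
for cat in categories:
    cond = (F.col("o." + cat) == 1) & (F.col("p." + cat) == 1)
    match_cond = cond if match_cond is None else (match_cond | cond)

print(joined.filter(match_cond).count())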

Get Last Monday in Spark

社会主义新天地 submitted on 2020-01-03 18:05:53

Question: I am using Spark 2.0 with the Python API. I have a dataframe with a column of type DateType(). I would like to add a column to the dataframe containing the most recent Monday. I can do it like this:

reg_schema = pyspark.sql.types.StructType([
    pyspark.sql.types.StructField('AccountCreationDate', pyspark.sql.types.DateType(), True),
    pyspark.sql.types.StructField('UserId', pyspark.sql.types.LongType(), True)
])
reg = spark.read.schema(reg_schema).option('header', True).csv(path_to_file)
reg =
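
The excerpt ends before the actual date arithmetic. One common pattern (a sketch, not necessarily the thread's answer) combines next_day with date_sub, since next_day(d, 'Mon') is the first Monday strictly after d:

from pyspark.sql import functions as F

# Stepping back 7 days from the next Monday lands on the most recent
# Monday, and on d itself when d already is a Monday.
reg = reg.withColumn(
    'LastMonday',
    F.date_sub(F.next_day(F.col('AccountCreationDate'), 'Mon'), 7))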

Pyspark : How to pick the values till last from the first occurrence in an array based on the matching values in another column

时光总嘲笑我的痴心妄想 submitted于 will not do -- submitted on 2020-01-03 02:46:22

Question: I have a dataframe where I need to search for a value in one column (StringType) inside another column (ArrayType), and I want to pick the values of the array column from the first occurrence of the string column's value through to the last element of the array. Explained below with examples.

Input DF:

Employee_Name|Employee_ID|Mapped_Project_ID
Name1|E101|[E101, E102, E103]
Name2|E102|[E101, E102, E103]
Name3|E103|[E101, E102, E103, E104, E105]

Output DF should look like:

Employee_Name
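
The expected output above is cut off, so this is only a hedged sketch of the usual approach: take the array slice from the 1-based position of the match through the end. It assumes Spark 2.4+ for array_position and slice, and the output column name Remaining_Project_ID is made up for illustration.

from pyspark.sql import functions as F

data = [("Name1", "E101", ["E101", "E102", "E103"]),
        ("Name2", "E102", ["E101", "E102", "E103"]),
        ("Name3", "E103", ["E101", "E102", "E103", "E104", "E105"])]
df = spark.createDataFrame(data, ["Employee_Name", "Employee_ID", "Mapped_Project_ID"])

# slice() from the position of Employee_ID in the array through its end.
df = df.withColumn(
    "Remaining_Project_ID",
    F.expr("slice(Mapped_Project_ID, "
           "array_position(Mapped_Project_ID, Employee_ID), "
           "size(Mapped_Project_ID) - array_position(Mapped_Project_ID, Employee_ID) + 1)"))
df.show(truncate=False)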

Split one column based the value of another column in pyspark [duplicate]

懵懂的女人 submitted on 2020-01-03 00:56:08

Question: This question already has an answer here: Using a column value as a parameter to a spark DataFrame function (1 answer). Closed 8 months ago.

I have the following data frame:

+----+-------+
|item|   path|
+----+-------+
|   a|  a/b/c|
|   b|  e/b/f|
|   d|e/b/d/h|
|   c|  g/h/c|
+----+-------+

I want to find the relative path of the value in column "item" by locating it in column 'path' and extracting the path's left-hand side, as shown below:

+----+-------+--------+
|item|   path|rel_path|
+----+-------+--------+
|   a|      a
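
The expected-output table is truncated, so the exact rel_path convention is an assumption here: this hedged sketch keeps everything from the start of path up to and including the item value, using a SQL expression so the item column can act as the parameter.

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("a", "a/b/c"), ("b", "e/b/f"), ("d", "e/b/d/h"), ("c", "g/h/c")],
    ["item", "path"])

# instr() locates the item inside the path; substring() then keeps the
# left-hand side up to and including that match.
df = df.withColumn(
    "rel_path",
    F.expr("substring(path, 1, instr(path, item) + length(item) - 1)"))
df.show()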

pyspark use dataframe inside udf

北城以北 submitted on 2020-01-02 18:38:20

Question: I have two dataframes,

df1
+---+---+----------+
|  n|val| distances|
+---+---+----------+
|  1|  1|0.27308652|
|  2|  1|0.24969208|
|  3|  1|0.21314497|
+---+---+----------+

and df2
+---+---+----------+
| x1| x2|         w|
+---+---+----------+
|  1|  2|0.03103427|
|  1|  4|0.19012526|
|  1| 10|0.26805446|
|  1|  8|0.26825935|
+---+---+----------+

I want to add a new column to df1 called gamma, which will contain the sum of the w value from df2 when df1.n == df2.x1 OR df1.n == df2.x2. I tried to use a udf, but
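
A udf cannot reference another DataFrame, so a join plus aggregation is the usual workaround. A hedged sketch reusing the df1/df2 names from the question:

from pyspark.sql import functions as F

# Join df2 to df1 wherever n matches x1 or x2, then sum w per n.
gamma = (df1.join(df2, (df1.n == df2.x1) | (df1.n == df2.x2), "left")
            .groupBy("n")
            .agg(F.sum("w").alias("gamma")))

# Attach the aggregated value back onto df1 as the new 'gamma' column.
df1_with_gamma = df1.join(gamma, "n", "left")
df1_with_gamma.show()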

how to store grouped data into json in pyspark

瘦欲@ submitted on 2020-01-01 17:38:09

Question: I am new to pyspark. I have a dataset which looks like this (just a snapshot of a few columns). I want to group my data by key, where my key is CONCAT(a.div_nbr, a.cust_nbr). My ultimate goal is to convert the data into JSON, formatted like this:

k1[{v1,v2,....},{v1,v2,....}], k2[{v1,v2,....},{v1,v2,....}],....

e.g.

248138339 [{ PRECIMA_ID:SCP 00248 0000138339, PROD_NBR:5553505, PROD_DESC:Shot and a Beer Battered Onion Rings (5553505 and 9285840), PROD_BRND:Molly's Kitchen, PACK_SIZE:4/2.5 LB, QTY_UOM:CA }, {
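
A hedged sketch of the usual pattern for this shape of output: build the key with concat, collect each group's rows as an array of structs, and serialize with to_json. The column names are taken from the example above; the DataFrame name df is an assumption.

from pyspark.sql import functions as F

# Group by the concatenated key, gather each group's rows as structs,
# then serialize the array of structs to a JSON string per key.
grouped = (df.withColumn("key", F.concat(F.col("div_nbr"), F.col("cust_nbr")))
             .groupBy("key")
             .agg(F.collect_list(F.struct(
                 "PRECIMA_ID", "PROD_NBR", "PROD_DESC",
                 "PROD_BRND", "PACK_SIZE", "QTY_UOM")).alias("values"))
             .select("key", F.to_json("values").alias("json")))

grouped.show(truncate=False)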