user-defined-functions

Filter Pyspark Dataframe with udf on entire row

那年仲夏 submitted on 2020-08-25 07:33:49
Question: Is there a way to select the entire row as a column to input into a PySpark filter udf?

I have a complex filtering function "my_filter" that I want to apply to the entire DataFrame:

    my_filter_udf = udf(lambda r: my_filter(r), BooleanType())
    new_df = df.filter(my_filter_udf(col("*")))

But col("*") throws an error because that's not a valid operation.

I know that I can convert the dataframe to an RDD and then use the RDD's filter method, but I do NOT want to convert it to an RDD and then back
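A commonly cited workaround for this situation (not part of the original question) is to pack all of the columns into a single struct column and pass that to the UDF, which then receives the row as a Row-like object. A minimal sketch, assuming an existing SparkSession and a hypothetical my_filter predicate:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf, struct
    from pyspark.sql.types import BooleanType

    spark = SparkSession.builder.getOrCreate()

    # Example data; the real DataFrame and predicate come from the question's context.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])

    def my_filter(row):
        # The struct arrives as a Row, so fields are accessible by name.
        return row["id"] > 1

    my_filter_udf = udf(my_filter, BooleanType())

    # struct(*df.columns) packs the entire row into one column the UDF can accept.
    new_df = df.filter(my_filter_udf(struct(*df.columns)))

This keeps everything in the DataFrame API and avoids the round trip through an RDD that the question wants to rule out.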

SQL Function to return value from multiple columns

孤人 submitted on 2020-07-23 06:51:05
Question: I've been developing a few stored procedures, and I keep repeating a portion of code that derives a column based on a few other columns. So instead of copying this piece of code from one stored procedure to another, I'm thinking of having a function that takes the input columns and produces the output column. Basically, the function goes as:

    SELECT columnA, columnB, columnC, myFunction(columnA, columnB) AS columnD
    FROM myTable

As we can see, this function will take column A and column B as
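A minimal sketch of such a scalar function, assuming SQL Server (T-SQL); the parameter and return types, the dbo schema, and the body are placeholders standing in for the real derivation logic:

    CREATE FUNCTION dbo.myFunction (@columnA INT, @columnB INT)
    RETURNS INT
    AS
    BEGIN
        -- Placeholder derivation; the real logic combining the inputs goes here.
        RETURN @columnA + @columnB;
    END;
    GO

    -- Called with its schema prefix, as in the query from the question:
    SELECT columnA, columnB, columnC,
           dbo.myFunction(columnA, columnB) AS columnD
    FROM myTable;

Defining the derivation once as a scalar function like this gives the reuse across stored procedures the question is after; note that such functions are invoked with their schema prefix (dbo. here).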