Question
I have a dataframe where I need to search for the value of one column (StringType) in another column (ArrayType), and keep the array values from the first occurrence of that value through to the last element of the array.
Explained below with an example.
Input DF is below:
Employee_Name|Employee_ID|Mapped_Project_ID
Name1|E101|[E101, E102, E103]
Name2|E102|[E101, E102, E103]
Name3|E103|[E101, E102, E103, E104, E105]
Output DF should look like below:
Employee_Name|Employee_ID|Mapped_Project_ID
Name1|E101|[E101, E102, E103]
Name2|E102|[E102, E103]
Name3|E103|[E103, E104, E105]
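In plain Python terms, the required transformation is: find the first occurrence of Employee_ID inside Mapped_Project_ID and keep everything from that position to the end. A minimal sketch of that logic (the helper name and the empty-list fallback for a missing ID are assumptions, not part of the question):

```python
def keep_from_first_occurrence(employee_id, mapped_ids):
    # Keep array elements from the first occurrence of employee_id to the end.
    if employee_id in mapped_ids:
        return mapped_ids[mapped_ids.index(employee_id):]
    return []  # assumed behaviour when the ID is absent

print(keep_from_first_occurrence("E102", ["E101", "E102", "E103"]))
# → ['E102', 'E103']
```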
Answer 1:
As of Spark 2.4 you can use the array_position and slice functions:
import pyspark.sql.functions as f
df = spark.createDataFrame([(["c", "b", "a", "e", "f"], 'a')], ['arraydata', 'item'])
df.select(df.arraydata, f.expr("slice(arraydata, array_position(arraydata, item), size(arraydata))").alias("res")).show()
+---------------+---------+
| arraydata| res|
+---------------+---------+
|[c, b, a, e, f]|[a, e, f]|
+---------------+---------+
Just translate this to your DataFrame's column names. Hope this helps.
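Two details of these SQL functions are easy to trip over: array_position is 1-based (returning 0 when the item is absent), and slice(arr, start, length) takes a start index and a length, not an end index. A plain-Python sketch of what the expression above computes (the helper names are illustrative, not Spark APIs):

```python
def array_position(arr, item):
    # Mimics Spark's array_position: 1-based index of first match, 0 if absent.
    return arr.index(item) + 1 if item in arr else 0

def spark_slice(arr, start, length):
    # Mimics Spark's slice for a positive, 1-based start index.
    return arr[start - 1:start - 1 + length]

arraydata = ["c", "b", "a", "e", "f"]
pos = array_position(arraydata, "a")                # 3 (1-based)
print(spark_slice(arraydata, pos, len(arraydata)))  # → ['a', 'e', 'f']
```

Passing size(arraydata) as the length is safe because slice simply stops at the end of the array.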
Answer 2:
I think this is what you want; I have implemented it on dummy data as well:
import pyspark.sql.types as T
import pyspark.sql.functions as F
df = sqlContext.createDataFrame([['E103', ["E101", "E102", "E103", "E104", "E105"]]], ["eid", "mapped_eid"])
df.persist()
df.show(truncate = False)
+----+------------------------------+
|eid |mapped_eid                    |
+----+------------------------------+
|E103|[E101, E102, E103, E104, E105]|
+----+------------------------------+
@F.udf(returnType=T.ArrayType(T.StringType()))
def find_element(element, temp_list):
    # keep everything from the first occurrence of element to the end
    found = False
    res = []
    for item in temp_list:
        if item == element:
            found = True
        if found:
            res.append(item)
    return res
df.withColumn(
    "res_col",
    find_element(F.col("eid"), F.col("mapped_eid"))
).show(truncate = False)
+----+------------------------------+------------------+
|eid |mapped_eid                    |res_col           |
+----+------------------------------+------------------+
|E103|[E101, E102, E103, E104, E105]|[E103, E104, E105]|
+----+------------------------------+------------------+
Let me know if this works for you.
Source: https://stackoverflow.com/questions/56358413/pyspark-how-to-pick-the-values-till-last-from-the-first-occurrence-in-an-array