Question
I have a dataframe where I need to search for the value of one column (StringType) in another column (ArrayType), and keep the array values from the first occurrence of that value through to the last element of the array.
Explained below with an example.
Input DF is below:
Employee_Name|Employee_ID|Mapped_Project_ID
Name1|E101|[E101, E102, E103]
Name2|E102|[E101, E102, E103]
Name3|E103|[E101, E102, E103, E104, E105]
Output DF should look like below:
Employee_Name|Employee_ID|Mapped_Project_ID
Name1|E101|[E101, E102, E103]
Name2|E102|[E102, E103]
Name3|E103|[E103, E104, E105]
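In plain Python terms, the required transformation is: find the first occurrence of Employee_ID inside Mapped_Project_ID and keep everything from that position to the end. A minimal sketch of that logic (the helper name and the empty-list fallback for a missing ID are assumptions, not part of the question):

```python
def keep_from_first_occurrence(employee_id, mapped_ids):
    # Keep array elements from the first occurrence of employee_id to the end.
    if employee_id in mapped_ids:
        return mapped_ids[mapped_ids.index(employee_id):]
    return []  # assumed behaviour when the ID is absent

print(keep_from_first_occurrence("E102", ["E101", "E102", "E103"]))
# → ['E102', 'E103']
```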
Answer 1:
As of Spark 2.4 you can use the array_position and slice functions:
import pyspark.sql.functions as f
df = spark.createDataFrame([(["c", "b", "a", "e", "f"], 'a')], ['arraydata', 'item'])
df.select(df.arraydata, f.expr("slice(arraydata, array_position(arraydata, item), size(arraydata))").alias("res")).show()
+---------------+---------+
| arraydata| res|
+---------------+---------+
|[c, b, a, e, f]|[a, e, f]|
+---------------+---------+
Just translate this to your DataFrame's column names. Hope this helps.
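Two details of these SQL functions are easy to trip over: array_position is 1-based (returning 0 when the item is absent), and slice(arr, start, length) takes a start index and a length, not an end index. A plain-Python sketch of what the expression above computes (the helper names are illustrative, not Spark APIs):

```python
def array_position(arr, item):
    # Mimics Spark's array_position: 1-based index of first match, 0 if absent.
    return arr.index(item) + 1 if item in arr else 0

def spark_slice(arr, start, length):
    # Mimics Spark's slice for a positive, 1-based start index.
    return arr[start - 1:start - 1 + length]

arraydata = ["c", "b", "a", "e", "f"]
pos = array_position(arraydata, "a")                # 3 (1-based)
print(spark_slice(arraydata, pos, len(arraydata)))  # → ['a', 'e', 'f']
```

Passing size(arraydata) as the length is safe because slice simply stops at the end of the array.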
Answer 2:
I think this is what you want; I have implemented it on dummy data as well:
import pyspark.sql.types as T
import pyspark.sql.functions as F
df = sqlContext.createDataFrame([['E103', ["E101", "E102", "E103", "E104", "E105"]]], ["eid", "mapped_eid"])
df.persist()
df.show(truncate = False)
+----+------------------------------+
|eid |mapped_eid                    |
+----+------------------------------+
|E103|[E101, E102, E103, E104, E105]|
+----+------------------------------+
@F.udf(returnType=T.ArrayType(T.StringType()))
def find_element(element, temp_list):
    # keep everything from the first occurrence of element to the end
    found = False
    res = []
    for item in temp_list:
        if item == element:
            found = True
        if found:
            res.append(item)
    return res
df.withColumn(
    "res_col",
    find_element(F.col("eid"), F.col("mapped_eid"))
).show(truncate = False)
+----+------------------------------+------------------+
|eid |mapped_eid                    |res_col           |
+----+------------------------------+------------------+
|E103|[E101, E102, E103, E104, E105]|[E103, E104, E105]|
+----+------------------------------+------------------+
Let me know if this works for you.
Source: https://stackoverflow.com/questions/56358413/pyspark-how-to-pick-the-values-till-last-from-the-first-occurrence-in-an-array