Question
I have a DataFrame (created by loading multiple blobs from Azure) with a column that holds a list of IDs. I want a list of the unique IDs across this entire column.
Here is an example -
df -
| col1 | col2 | col3 |
| "a" | "b" |"[q,r]"|
| "c" | "f" |"[s,r]"|
Here is my expected response:
resp = [q, r, s]
Any idea how to get there?
My current approach is to convert the strings in col3 to Python lists and then flatten them somehow.
So far I have not been able to do that. I tried user-defined functions in PySpark, but they only return strings, not lists.
flatMap only works on RDDs, not on DataFrames, so that is out of the picture.
Maybe there is a way to specify this during the conversion from RDD to DataFrame, but I am not sure how to do that.
Answer 1:
Here is a method using only DataFrame functions:
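from pyspark.sql import functions as f
df = spark.createDataFrame([('a','b','[q,r,p]'),('c','f','[s,r]')],['col1','col2','col3'])
# Strip the surrounding brackets, then split on commas to get an array column.
df = df.withColumn('col4', f.split(f.regexp_extract('col3', r'\[(.*)\]', 1), ','))
# Explode to one row per element; grouping yields each distinct ID with its count.
df.select(f.explode('col4').alias('exploded')).groupby('exploded').count().show()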
Answer 2:
We can use a UDF along with collect_list. Here is what I tried:
>>> from pyspark.sql import functions as F
>>> from pyspark.sql.types import *
>>> from functools import reduce
>>> df = spark.createDataFrame([('a','b','[q,r]'),('c','f','[s,r]')],['col1','col2','col3'])
>>> df.show()
+----+----+-----+
|col1|col2| col3|
+----+----+-----+
| a| b|[q,r]|
| c| f|[s,r]|
+----+----+-----+
>>> udf1 = F.udf(lambda x: sorted(v for v in reduce(lambda acc, s: acc | set(s), x, set()) if v not in ['[', ']', ',']), ArrayType(StringType()))
## Each col3 value is the string form of a list. We fold all the strings into one set of characters, which removes duplicates.
## That set also contains the punctuation characters '[', ']' and ',', so we filter them out. Note this only works for single-character IDs.
>>> df.select(udf1(F.collect_list('col3')).alias('col3')).first().col3
['q', 'r', 's']
I am not sure about the performance. Hope this helps!
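Because the UDF above works on individual characters, it breaks down once an ID is longer than one character. The following is a minimal plain-Python sketch of the parsing and flattening step that handles longer IDs, assuming the IDs themselves never contain commas or brackets. The names parse_id_list and unique_ids are hypothetical helpers, not part of PySpark. The UDF returns a list here only because ArrayType(StringType()) is passed as the return type; without it, F.udf defaults to a string return type.

```python
def parse_id_list(text):
    """Turn a string such as '[q, r]' into ['q', 'r'] (hypothetical helper)."""
    # Drop surrounding whitespace and the enclosing brackets.
    inner = text.strip().strip('[]')
    # Split on commas and discard empty pieces, e.g. from '[]'.
    return [item.strip() for item in inner.split(',') if item.strip()]


def unique_ids(strings):
    """Flatten the parsed lists and return the distinct IDs, sorted."""
    seen = set()
    for text in strings:
        seen.update(parse_id_list(text))
    return sorted(seen)


# Assumed wiring into the snippet above (not executed here):
#   udf1 = F.udf(unique_ids, ArrayType(StringType()))
#   df.select(udf1(F.collect_list('col3')).alias('col3')).first().col3

print(unique_ids(['[q,r]', '[s,r]']))        # ['q', 'r', 's']
print(unique_ids(['[id10, id2]', '[id2]']))  # ['id10', 'id2']
```

Like the original, this gathers every col3 value into a single list before deduplicating, so it is only suitable when the column fits comfortably in memory.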
Source: https://stackoverflow.com/questions/47793412/pyspark-dataframe-get-unique-elements-from-column-with-string-as-list-of-element