Question
I want to unzip a list of tuples stored in a column of a PySpark DataFrame. Say a column holds [(blue, 0.5), (red, 0.1), (green, 0.7)]; I want to split it into two columns, with the first column as [blue, red, green] and the second as [0.5, 0.1, 0.7]:
+-----+-------------------------------------------+
|Topic|                                     Tokens|
+-----+-------------------------------------------+
|    1|  ('blue', 0.5),('red', 0.1),('green', 0.7)|
|    2|  ('red', 0.9),('cyan', 0.5),('white', 0.4)|
+-----+-------------------------------------------+
which can be created with this code:
df = sqlCtx.createDataFrame(
    [
        (1, [('blue', 0.5), ('red', 0.1), ('green', 0.7)]),
        (2, [('red', 0.9), ('cyan', 0.5), ('white', 0.4)])
    ],
    ('Topic', 'Tokens')
)
And the output should look like:
+-----+--------------------------+-----------------+
|Topic|                    Tokens|          Weights|
+-----+--------------------------+-----------------+
|    1|  ['blue', 'red', 'green']|  [0.5, 0.1, 0.7]|
|    2|  ['red', 'cyan', 'white']|  [0.9, 0.5, 0.4]|
+-----+--------------------------+-----------------+
Answer 1:
You can achieve this with simple indexing using udf():
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, FloatType, StringType
# create the dataframe
df = sqlCtx.createDataFrame(
    [
        (1, [('blue', 0.5), ('red', 0.1), ('green', 0.7)]),
        (2, [('red', 0.9), ('cyan', 0.5), ('white', 0.4)])
    ],
    ('Topic', 'Tokens')
)
# grab the first element of each tuple (the colors)
def get_colors(l):
    return [x[0] for x in l]

# grab the second element of each tuple (the weights)
def get_weights(l):
    return [x[1] for x in l]
# make udfs from the above functions - Note the return types
get_colors_udf = udf(get_colors, ArrayType(StringType()))
get_weights_udf = udf(get_weights, ArrayType(FloatType()))
# use withColumn and apply the udfs
df.withColumn('Weights', get_weights_udf(col('Tokens')))\
    .withColumn('Tokens', get_colors_udf(col('Tokens')))\
    .select(['Topic', 'Tokens', 'Weights'])\
    .show()
Output:
+-----+------------------+---------------+
|Topic|            Tokens|        Weights|
+-----+------------------+---------------+
|    1|[blue, red, green]|[0.5, 0.1, 0.7]|
|    2|[red, cyan, white]|[0.9, 0.5, 0.4]|
+-----+------------------+---------------+
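If you are on Spark 2.4 or later, a similar unzip is possible without a Python udf by using the SQL transform higher-order function (a minimal sketch, not part of the original answer; it assumes the corrected df from above):
from pyspark.sql.functions import expr

# transform() runs the lambda on the JVM side, element by element,
# so no Python serialization round-trip is needed
df.select(
    "Topic",
    expr("transform(Tokens, x -> x._1)").alias("Tokens"),
    expr("transform(Tokens, x -> x._2)").alias("Weights"),
).show()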
Answer 2:
If the schema of your DataFrame looks like this:
root
 |-- Topic: long (nullable = true)
 |-- Tokens: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _1: string (nullable = true)
 |    |    |-- _2: double (nullable = true)
then you can select:
from pyspark.sql.functions import col
df.select(
    col("Topic"),
    col("Tokens._1").alias("Tokens"),
    col("Tokens._2").alias("weights")
).show()
# +-----+------------------+---------------+
# |Topic|            Tokens|        weights|
# +-----+------------------+---------------+
# |    1|[blue, red, green]|[0.5, 0.1, 0.7]|
# |    2|[red, cyan, white]|[0.9, 0.5, 0.4]|
# +-----+------------------+---------------+
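This works because extracting a struct field through an array column projects that field across every element, so Tokens._1 is itself an array of strings. A quick way to confirm (a sketch, assuming the df defined above):
from pyspark.sql.functions import col

# field access on an array-of-structs column yields an array of that field
df.select(col("Tokens._1")).printSchema()
# root
#  |-- _1: array (nullable = true)
#  |    |-- element: string (containsNull = true)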
And generalized:
cols = [
    col("Tokens.{}".format(n))
    for n in df.schema["Tokens"].dataType.elementType.names
]
df.select("Topic", *cols)
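If you control how the DataFrame is built, an explicit schema lets you give the struct fields readable names up front (a sketch, not part of the original answer; the field names Token and Weight are hypothetical stand-ins for _1 and _2):
from pyspark.sql.types import (
    ArrayType, DoubleType, LongType, StringType, StructField, StructType)

# hypothetical field names instead of the default _1/_2
schema = StructType([
    StructField("Topic", LongType()),
    StructField("Tokens", ArrayType(StructType([
        StructField("Token", StringType()),
        StructField("Weight", DoubleType()),
    ]))),
])

df = sqlCtx.createDataFrame(
    [
        (1, [('blue', 0.5), ('red', 0.1), ('green', 0.7)]),
        (2, [('red', 0.9), ('cyan', 0.5), ('white', 0.4)])
    ],
    schema
)
# the generalized loop now yields ['Token', 'Weight'] instead of ['_1', '_2']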
Reference: Querying Spark SQL DataFrame with complex types
Source: https://stackoverflow.com/questions/48446595/unzip-list-of-tuples-in-pyspark-dataframe