Question
I want to reshape my data from 4x3 to 2x3 in PySpark without aggregating. My current output is the following:
columns = ['FAULTY', 'value_HIGH', 'count']
vals = [
    (1, 0, 141),
    (0, 0, 140),
    (1, 1, 21),
    (0, 1, 12)
]
What I want is a contingency table where the binary second column is split into two new columns (value_HIGH_1, value_HIGH_0) that hold the values from the count column, meaning:
columns = ['FAULTY', 'value_HIGH_1', 'value_HIGH_0']
vals = [
    (1, 21, 141),
    (0, 12, 140)
]
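For reference, a minimal sketch of how the input DataFrame df used in the answers below could be built; the SparkSession setup is an assumption, not part of the original question:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# input data from the question (4 rows x 3 columns)
columns = ['FAULTY', 'value_HIGH', 'count']
vals = [
    (1, 0, 141),
    (0, 0, 140),
    (1, 1, 21),
    (0, 1, 12)
]
df = spark.createDataFrame(vals, columns)
df.show()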
Answer 1:
You can use pivot with a fake maximum aggregation (since you have only one element for each group):
import pyspark.sql.functions as F
df.groupBy('FAULTY').pivot('value_HIGH').agg(F.max('count')).selectExpr(
'FAULTY', '`1` as value_high_1', '`0` as value_high_0'
).show()
+------+------------+------------+
|FAULTY|value_high_1|value_high_0|
+------+------------+------------+
|     0|          12|         140|
|     1|          21|         141|
+------+------------+------------+
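Since each (FAULTY, value_HIGH) group contains exactly one row, any aggregation that just passes the single value through works. A small variant of the same idea (assuming the same df as above), using first instead of max and an explicit list of pivot values so the column order is fixed:
import pyspark.sql.functions as F

# pivot over the two known values of value_HIGH, then rename positionally
df.groupBy('FAULTY').pivot('value_HIGH', [1, 0]).agg(F.first('count')).toDF(
    'FAULTY', 'value_HIGH_1', 'value_HIGH_0'
).show()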
Answer 2:
Using groupby and pivot is the natural way to do this, but if you want to avoid any aggregation you can achieve it with a filter and a join:
import pyspark.sql.functions as f
df.where("value_HIGH = 1").select("FAULTY", f.col("count").alias("value_HIGH_1"))\
    .join(
        df.where("value_HIGH = 0").select("FAULTY", f.col("count").alias("value_HIGH_0")),
        on="FAULTY"
    )\
    .show()
#+------+------------+------------+
#|FAULTY|value_HIGH_1|value_HIGH_0|
#+------+------------+------------+
#|     0|          12|         140|
#|     1|          21|         141|
#+------+------------+------------+
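One caveat not covered in the original answer: an inner join silently drops a FAULTY value if either filter returns no row for it. A hedged variant, assuming the same df, that keeps such rows with a 0 count:
import pyspark.sql.functions as f

# outer join keeps FAULTY values that appear on only one side; fillna(0) replaces the resulting nulls
df.where("value_HIGH = 1").select("FAULTY", f.col("count").alias("value_HIGH_1"))\
    .join(
        df.where("value_HIGH = 0").select("FAULTY", f.col("count").alias("value_HIGH_0")),
        on="FAULTY",
        how="outer"
    )\
    .fillna(0)\
    .show()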
Source: https://stackoverflow.com/questions/53103269/pyspark-reshape-data-without-aggregation