Question
I am using Spark v2.4.1 and have a scenario where I need to transpose a table structured as below.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DateType, DoubleType}
import spark.implicits._

val df = Seq(
  ("A", "2016-01-01", "2016-12-01", "0.044999408"),
  ("A", "2016-01-01", "2016-12-01", "0.0449999426"),
  ("A", "2016-01-01", "2016-12-01", "0.045999415"),
  ("B", "2016-01-01", "2016-12-01", "0.0787888909"),
  ("B", "2016-01-01", "2016-12-01", "0.079779426"),
  ("B", "2016-01-01", "2016-12-01", "0.999989415"),
  ("C", "2016-01-01", "2016-12-01", "0.0011999408"),
  ("C", "2016-01-01", "2016-12-01", "0.0087999426"),
  ("C", "2016-01-01", "2016-12-01", "0.0089899941")
).toDF("class_type", "start_date", "end_date", "ratio")
  .withColumn("start_date", to_date($"start_date", "yyyy-MM-dd").cast(DateType))
  .withColumn("end_date", to_date($"end_date", "yyyy-MM-dd").cast(DateType))
  .withColumn("ratio", col("ratio").cast(DoubleType))

df.show(200)
Given table:
+----------+----------+----------+------------+
|class_type|start_date| end_date| ratio|
+----------+----------+----------+------------+
| A|2016-01-01|2016-12-01| 0.044999408|
| A|2016-01-01|2016-12-01|0.0449999426|
| A|2016-01-01|2016-12-01| 0.045999415|
| B|2016-01-01|2016-12-01|0.0787888909|
| B|2016-01-01|2016-12-01| 0.079779426|
| B|2016-01-01|2016-12-01| 0.999989415|
| C|2016-01-01|2016-12-01|0.0011999408|
| C|2016-01-01|2016-12-01|0.0087999426|
| C|2016-01-01|2016-12-01|0.0089899941|
+----------+----------+----------+------------+
Expected table format:
+----------+----------+------------+------------+------------+
|start_date| end_date| A| B| C|
+----------+----------+------------+------------+------------+
|2016-01-01|2016-12-01| 0.044999408|0.0787888909|0.0011999408|
|2016-01-01|2016-12-01|0.0449999426| 0.079779426|0.0087999426|
|2016-01-01|2016-12-01| 0.045999415| 0.999989415|0.0089899941|
+----------+----------+------------+------------+------------+
How can this be done?
I tried the following, but because class_type is also in the groupBy, each class ends up on its own row with nulls in the other class columns:
val pivotDf = df.groupBy("start_date","end_date","class_type").pivot(col("class_type")).agg(first(col("ratio")))
+----------+----------+----------+-----------+------------+------------+
|start_date| end_date|class_type| A| B| C|
+----------+----------+----------+-----------+------------+------------+
|2016-01-01|2016-12-01| A|0.044999408| null| null|
|2016-01-01|2016-12-01| B| null|0.0787888909| null|
|2016-01-01|2016-12-01| C| null| null|0.0011999408|
+----------+----------+----------+-----------+------------+------------+
Answer 1:
Based on the sample data, there is no explicit relationship between ratio and class_type across rows; nothing in the data says which A row lines up with which B row. If the rows are already in the desired order, you can assign a rank within each class and then pivot on that rank. Here is an example using rank:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DateType, DoubleType}
import spark.implicits._

// Rank rows by ratio within each (start_date, end_date, class_type) partition
val byRatio = Window
  .partitionBy(col("start_date"), col("end_date"), col("class_type"))
  .orderBy(col("ratio"))

var df = Seq(
  ("A", "2016-01-01", "2016-12-01", "0.044999408"),
  ("A", "2016-01-01", "2016-12-01", "0.0449999426"),
  ("A", "2016-01-01", "2016-12-01", "0.045999415"),
  ("B", "2016-01-01", "2016-12-01", "0.0787888909"),
  ("B", "2016-01-01", "2016-12-01", "0.079779426"),
  ("B", "2016-01-01", "2016-12-01", "0.999989415"),
  ("C", "2016-01-01", "2016-12-01", "0.0011999408"),
  ("C", "2016-01-01", "2016-12-01", "0.0087999426"),
  ("C", "2016-01-01", "2016-12-01", "0.0089899941")
).toDF("class_type", "start_date", "end_date", "ratio")
  .withColumn("start_date", to_date($"start_date", "yyyy-MM-dd").cast(DateType))
  .withColumn("end_date", to_date($"end_date", "yyyy-MM-dd").cast(DateType))
  .withColumn("ratio", col("ratio").cast(DoubleType))

// The nth-smallest ratio of every class gets the same class_rank,
// which becomes the key that lines rows up across classes
df = df.withColumn("class_rank", rank().over(byRatio))

// One output row per rank; each class_type value becomes a column
var pivotDf = df
  .groupBy("start_date", "end_date", "class_rank")
  .pivot("class_type")
  .agg(max(col("ratio")))

pivotDf = pivotDf.drop(col("class_rank"))
pivotDf.show(10, false)
Based on your data, you will get the output below:
+----------+----------+------------+------------+------------+
|start_date|end_date |A |B |C |
+----------+----------+------------+------------+------------+
|2016-01-01|2016-12-01|0.044999408 |0.0787888909|0.0011999408|
|2016-01-01|2016-12-01|0.0449999426|0.079779426 |0.0087999426|
|2016-01-01|2016-12-01|0.045999415 |0.999989415 |0.0089899941|
+----------+----------+------------+------------+------------+
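One caveat worth noting: rank() assigns the same rank to tied values, so if a class could contain duplicate ratios, the tied rows would collapse into a single output row under max(). A minimal variant using row_number() instead, which guarantees a distinct position per row, is sketched below (the names byRatioRn and pivotedRn are illustrative; df is the DataFrame built above):
// row_number() assigns a unique, gap-free position even when ratios tie,
// so no rows are lost in the pivot; first() then reads the single value
// present in each (rank, class) cell
val byRatioRn = Window
  .partitionBy(col("start_date"), col("end_date"), col("class_type"))
  .orderBy(col("ratio"))

val pivotedRn = df
  .withColumn("class_rank", row_number().over(byRatioRn))
  .groupBy("start_date", "end_date", "class_rank")
  .pivot("class_type")
  .agg(first(col("ratio")))
  .drop("class_rank")

pivotedRn.show(10, false)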
Source: https://stackoverflow.com/questions/66167061/transposing-table-to-given-format-in-spark