Question
I need to split a row into two rows by adjusting the date columns and setting the Amt column to zero in the new row, as in the example below:
Input:
+---+-----------------------+-----------------------+-----+
|KEY|START_DATE |END_DATE |Amt |
+---+-----------------------+-----------------------+-----+
|0 |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|100.0|
|0 |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|200.0|
|0 |2017-10-30T00:00:00.000|2017-11-02T23:59:59.000|67.5 |->split this row: "2017-10-31T23:59:59" falls between its START_DATE and END_DATE
|0 |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|55.3 |
|1 |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|22.2 |
|1 |2017-10-30T00:00:00.000|2017-11-01T23:59:59.000|11.0 |->split this row: "2017-10-31T23:59:59" falls between its START_DATE and END_DATE
|1 |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|87.33|
+---+-----------------------+-----------------------+-----+
If "2017-10-31T23:59:59" is in between row start_date and end_date , then split the row into two rows by changing the end_date for one row and start_date for another row. And make the amt to zero for the new row as below:
Desired Output:
+---+-----------------------+-----------------------+-----+---+
|KEY|START_DATE |END_DATE |Amt |Ind|
+---+-----------------------+-----------------------+-----+---+
|0 |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|100.0|N |
|0 |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|200.0|N |
|0 |2017-10-30T00:00:00.000|2017-10-30T23:59:59.998|67.5 |N |->parent row (changed the END_DATE)
|0 |2017-10-30T23:59:59.999|2017-11-02T23:59:59.000|0.0 |Y |->split new row (changed the START_DATE and Amt=0.0)
|0 |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|55.3 |N |
|1 |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|22.2 |N |
|1 |2017-10-30T00:00:00.000|2017-10-30T23:59:59.998|11.0 |N |->parent row (changed the END_DATE)
|1 |2017-10-30T23:59:59.999|2017-11-01T23:59:59.000|0.0 |Y |->split new row (changed the START_DATE and Amt=0.0)
|1 |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|87.33|N |
+---+-----------------------+-----------------------+-----+---+
I have tried the code below and am able to copy the row, but I am unable to update the rows on the fly.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Encoder, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._

val df1Columns = Seq("KEY", "START_DATE", "END_DATE", "Amt")
val df1Schema = new StructType(df1Columns.map(c => StructField(c, StringType, nullable = false)).toArray)
val input1: Array[String] = Seq("0", "2016-12-14T23:59:59.000", "2017-10-29T23:59:58.000", "100.0").toArray
val row1: Row = Row.fromSeq(input1)
val input2: Array[String] = Seq("0", "2016-12-14T23:59:59.000", "2017-10-29T23:59:58.000", "200.0").toArray
val row2: Row = Row.fromSeq(input2)
val input3: Array[String] = Seq("0", "2017-10-30T00:00:00.000", "2017-11-01T23:59:59.000", "67.5").toArray
val row3: Row = Row.fromSeq(input3)
val input4: Array[String] = Seq("0", "2016-12-14T23:59:59.000", "2017-10-29T23:59:58.000", "55.3").toArray
val row4: Row = Row.fromSeq(input4)
val input5: Array[String] = Seq("1", "2016-12-14T23:59:59.000", "2017-10-29T23:59:58.000", "22.2").toArray
val row5: Row = Row.fromSeq(input5)
val input6: Array[String] = Seq("1", "2017-10-30T00:00:00.000", "2017-11-01T23:59:59.000", "11.0").toArray
val row6: Row = Row.fromSeq(input6)
val input7: Array[String] = Seq("1", "2016-12-14T23:59:59.000", "2017-10-29T23:59:58.000", "87.33").toArray
val row7: Row = Row.fromSeq(input7)
val rdd: RDD[Row] = spark.sparkContext.parallelize(Seq(row1, row2, row3, row4, row5, row6, row7))
val df: DataFrame = spark.createDataFrame(rdd, df1Schema)
//----------------------------------------------------------------
def encoder(columns: Seq[String]): Encoder[Row] =
  RowEncoder(StructType(columns.map(StructField(_, StringType, nullable = true))))

val outputColumns = Seq("KEY", "START_DATE", "END_DATE", "Amt", "Ind")

val result = df.groupByKey(r => r.getAs[String]("KEY"))
  .flatMapGroups((_, rowsForAkey) => {
    var result: List[Row] = List()
    for (row <- rowsForAkey) {
      val qrDate = "2017-10-31T23:59:59"
      val currRowStartDate = row.getAs[String]("START_DATE")
      val rowEndDate = row.getAs[String]("END_DATE")
      if (currRowStartDate <= qrDate && qrDate <= rowEndDate) { // Quota
        val rLayer = row
        result = result :+ rLayer
      }
      val originalRow = row
      result = result :+ originalRow
    }
    result
  })(encoder(df1Columns)).toDF
df.show(false)
result.show(false)
Here is my code output:
+---+-----------------------+-----------------------+-----+
|KEY|START_DATE |END_DATE |Amt |
+---+-----------------------+-----------------------+-----+
|0 |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|100.0|
|0 |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|200.0|
|0 |2017-10-30T00:00:00.000|2017-11-01T23:59:59.000|67.5 |
|0 |2017-10-30T00:00:00.000|2017-11-01T23:59:59.000|67.5 |
|0 |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|55.3 |
|1 |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|22.2 |
|1 |2017-10-30T00:00:00.000|2017-11-01T23:59:59.000|11.0 |
|1 |2017-10-30T00:00:00.000|2017-11-01T23:59:59.000|11.0 |
|1 |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|87.33|
+---+-----------------------+-----------------------+-----+
Answer 1:
I would suggest you go with the built-in functions rather than going through such a complex RDD approach. I have used built-in functions such as lit to populate constants and a udf function to change the time in the date columns. The main idea is to separate the dataframe into two and finally union them (I have commented the code for clarity).
import org.apache.spark.sql.functions._
import spark.implicits._   // needed for the $"column" syntax below
//udf function to change the time
def changeTimeInDate = udf((toCopy: String, withCopied: String)=> withCopied.split("T")(0)+"T"+toCopy.split("T")(1))
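// For illustration (values taken from the question's sample data, not produced by
// running this snippet): the udf keeps the date part of its second argument and the
// time part of its first, so changeTimeInDate("2017-11-02T23:59:59.000",
// "2017-10-30T00:00:00.000") returns "2017-10-30T23:59:59.000".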
//creating Ind column populated with N and saving it in a temporary dataframe
val indDF = df.withColumn("Ind", lit("N"))
//keep only the rows that match the condition from the question, then change the Amt, Ind and START_DATE columns
val duplicatedDF = indDF.filter($"START_DATE" <= "2017-10-31T23:59:59" && $"END_DATE" >= "2017-10-31T23:59:59")
.withColumn("Amt", lit("0.0"))
.withColumn("Ind", lit("Y"))
.withColumn("START_DATE", changeTimeInDate($"END_DATE", $"START_DATE"))
//Changing the END_DATE and finally merging both
val result = indDF.withColumn("END_DATE", changeTimeInDate($"START_DATE", $"END_DATE"))
.union(duplicatedDF)
You should get the desired output:
+---+-----------------------+-----------------------+-----+---+
|KEY|START_DATE |END_DATE |Amt |Ind|
+---+-----------------------+-----------------------+-----+---+
|0 |2016-12-14T23:59:59.000|2017-10-29T23:59:59.000|100.0|N |
|0 |2016-12-14T23:59:59.000|2017-10-29T23:59:59.000|55.3 |N |
|0 |2016-12-14T23:59:59.000|2017-10-29T23:59:59.000|200.0|N |
|0 |2017-10-30T00:00:00.000|2017-11-01T00:00:00.000|67.5 |N |
|0 |2017-10-30T23:59:59.000|2017-11-01T23:59:59.000|0.0 |Y |
|1 |2016-12-14T23:59:59.000|2017-10-29T23:59:59.000|22.2 |N |
|1 |2016-12-14T23:59:59.000|2017-10-29T23:59:59.000|87.33|N |
|1 |2017-10-30T00:00:00.000|2017-11-01T00:00:00.000|11.0 |N |
|1 |2017-10-30T23:59:59.000|2017-11-01T23:59:59.000|0.0 |Y |
+---+-----------------------+-----------------------+-----+---+
Answer 2:
It looks like you're duplicating the rows, rather than altering them.
You can replace the inside of your flatMapGroups function with something like:
rowsForAkey.flatMap { row =>
  val qrDate = "2017-10-31T23:59:59"
  val currRowStartDate = row.getAs[String]("START_DATE")
  val rowEndDate = row.getAs[String]("END_DATE")
  if (currRowStartDate <= qrDate && qrDate <= rowEndDate) { // Quota
    val splitDate = endOfDay(currRowStartDate)
    // need to build two rows: the parent keeps its Amt and gets the new END_DATE,
    // the split row gets the new START_DATE, Amt "0.0" and Ind "Y"
    val parentRow = Row(row(0), row(1), splitDate, row(3), "N")
    val splitRow = Row(row(0), splitDate, row(2), "0.0", "Y")
    List(parentRow, splitRow)
  } else {
    // non-matching rows also get Ind "N" so every row has the five output columns
    // (pass encoder(outputColumns) instead of encoder(df1Columns) outside)
    List(Row(row(0), row(1), row(2), row(3), "N"))
  }
}
Basically, any time you have a for loop building up a list like this in Scala, it's really map or flatMap that you want. Here it's flatMap, since each row will give us either one or two elements in the result. I've assumed you introduce a function endOfDay to make the right timestamp.
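In case it helps, here is a minimal sketch of such an endOfDay helper, assuming the timestamps stay as plain strings in the yyyy-MM-dd'T'HH:mm:ss.SSS layout used in the question (the exact millisecond boundary value is an assumption, not something from the original answer):
// Sketch only: keep the date part of the timestamp string and replace the
// time part with an assumed end-of-day boundary.
def endOfDay(ts: String): String = ts.split("T")(0) + "T23:59:59.999"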
I realize you may be reading data in a way that gives you a DataFrame, but I do want to offer the idea of using Dataset[Some Case Class] instead. It would basically be a drop-in replacement (you're basically viewing your DataFrame as Dataset[Row], which is what it is, after all), and I think it would make things easier to read, plus you'd get type-checking.
Also, as a heads up: if you import spark.implicits._, you shouldn't need the encoder; everything looks to be a string or a float, and those encoders are available.
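To make that concrete, here is a minimal sketch of the typed variant under those assumptions; the case class names Interval and IntervalOut are illustrative, not from the original post, and it reuses the endOfDay helper sketched above. Since each row is handled independently, the groupByKey isn't strictly needed, so this sketch uses a plain flatMap:
import spark.implicits._  // provides encoders for case classes and primitives

// Hypothetical case classes mirroring the input and output schemas
case class Interval(KEY: String, START_DATE: String, END_DATE: String, Amt: String)
case class IntervalOut(KEY: String, START_DATE: String, END_DATE: String, Amt: String, Ind: String)

val qrDate = "2017-10-31T23:59:59"

val typedResult = df.as[Interval].flatMap { r =>
  if (r.START_DATE <= qrDate && qrDate <= r.END_DATE) {
    val splitDate = endOfDay(r.START_DATE)        // assumed helper, sketched above
    Seq(
      IntervalOut(r.KEY, r.START_DATE, splitDate, r.Amt, "N"),  // parent row, END_DATE moved
      IntervalOut(r.KEY, splitDate, r.END_DATE, "0.0", "Y")     // new row, Amt zeroed
    )
  } else {
    Seq(IntervalOut(r.KEY, r.START_DATE, r.END_DATE, r.Amt, "N"))
  }
}

typedResult.toDF.show(false)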
Source: https://stackoverflow.com/questions/49047368/split-a-row-into-two-and-dummy-some-columns