Question
I'm trying to lag a field when it matches certain conditions. Because I need to apply filters, I'm using the MAX window function to do the lagging; the LAG function itself doesn't work the way I need. The code below works for ID_EVENT_LOG, but when I change the ID_EVENT_LOG inside the MAX to the column ENSAIO (so that ENSAIO itself is lagged), it no longer works properly. Example below.
Dataset:
+------------+---------+------+
|ID_EVENT_LOG|ID_PAINEL|ENSAIO|
+------------+---------+------+
| 1| 1| null|
| 2| 1| null|
| 3| 1|INICIO|
| 4| 1| null|
| 5| 1| null|
| 6| 1| null|
| 7| 1| FIM|
| 8| 1| null|
| 9| 1| null|
| 10| 1| null|
| 11| 2| FIM|
| 12| 2| FIM|
| 13| 2|INICIO|
| 14| 2| null|
| 15| 2| FIM|
+------------+---------+------+
Working code:
DFReadFile = spark.read.format('csv').option("header", "true").option('sep',',').load('12_delete_between_inicio_fim_v4.csv')
DFReadFile.show()
DFReadFile.createOrReplaceTempView("12_delete_between_inicio_fim")
sqlDF = spark.sql("SELECT *, \
CASE \
WHEN (ENSAIO like '%null%') THEN \
MAX(CASE WHEN (ENSAIO like '%INICIO%') OR (ENSAIO like '%FIM%') THEN ID_EVENT_LOG END) \
OVER (PARTITION BY ID_PAINEL ORDER BY int(ID_EVENT_LOG) RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) \
ELSE 0 \
END as TESTE \
FROM 12_delete_between_inicio_fim")
sqlDF.show()
Result:
+------------+---------+------+-----+
|ID_EVENT_LOG|ID_PAINEL|ENSAIO|TESTE|
+------------+---------+------+-----+
| 1| 1| null| null|
| 2| 1| null| null|
| 3| 1|INICIO| 0|
| 4| 1| null| 3|
| 5| 1| null| 3|
| 6| 1| null| 3|
| 7| 1| FIM| 0|
| 8| 1| null| 7|
| 9| 1| null| 7|
| 10| 1| null| 7|
| 11| 2| FIM| 0|
| 12| 2| FIM| 0|
| 13| 2|INICIO| 0|
| 14| 2| null| 13|
| 15| 2| FIM| 0|
+------------+---------+------+-----+
Bug to solve:
The dataset is the same.
Not working code (the only change is ID_EVENT_LOG to ENSAIO inside the MAX):
DFReadFile = spark.read.format('csv').option("header", "true").option('sep',',').load('12_delete_between_inicio_fim_v4.csv')
DFReadFile.show()
DFReadFile.createOrReplaceTempView("12_delete_between_inicio_fim")
sqlDF = spark.sql("SELECT *, \
CASE \
WHEN (ENSAIO like '%null%') THEN \
MAX(CASE WHEN (ENSAIO like '%INICIO%') OR (ENSAIO like '%FIM%') THEN ENSAIO END) \
OVER (PARTITION BY ID_PAINEL ORDER BY int(ID_EVENT_LOG) RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) \
ELSE 0 \
END as TESTE \
FROM 12_delete_between_inicio_fim")
sqlDF.show()
Result:
+------------+---------+------+------+
|ID_EVENT_LOG|ID_PAINEL|ENSAIO| TESTE|
+------------+---------+------+------+
| 1| 1| null| null|
| 2| 1| null| null|
| 3| 1|INICIO| 0|
| 4| 1| null|INICIO|
| 5| 1| null|INICIO|
| 6| 1| null|INICIO|
| 7| 1| FIM| 0|
| 8| 1| null|INICIO|
| 9| 1| null|INICIO|
| 10| 1| null|INICIO|
| 11| 2| FIM| 0|
| 12| 2| FIM| 0|
| 13| 2|INICIO| 0|
| 14| 2| null|INICIO|
| 15| 2| FIM| 0|
+------------+---------+------+------+
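Looking at this output, my guess is that MAX compares the strings lexicographically, so 'INICIO' always wins over 'FIM' no matter which marker came last in the frame. A quick plain-Python check of that comparison (same ordering semantics as SQL MAX on strings):

```python
# String comparison is lexicographic: "I" > "F", so "INICIO" sorts above "FIM".
# MAX over a string column therefore returns "INICIO" whenever both markers
# appear earlier in the frame, regardless of which one is more recent.
print("INICIO" > "FIM")        # True
print(max(["FIM", "INICIO"]))  # INICIO
```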
Expected Result:
+------------+---------+------+------+
|ID_EVENT_LOG|ID_PAINEL|ENSAIO| TESTE|
+------------+---------+------+------+
| 1| 1| null| null|
| 2| 1| null| null|
| 3| 1|INICIO| 0|
| 4| 1| null|INICIO|
| 5| 1| null|INICIO|
| 6| 1| null|INICIO|
| 7| 1| FIM| 0|
| 8| 1| null| FIM|
| 9| 1| null| FIM|
| 10| 1| null| FIM|
| 11| 2| FIM| 0|
| 12| 2| FIM| 0|
| 13| 2|INICIO| 0|
| 14| 2| null|INICIO|
| 15| 2| FIM| 0|
+------------+---------+------+------+
Thank you in advance
Source: https://stackoverflow.com/questions/64123064/spark-sql-lag-result-gets-different-rows-when-i-change-column