PySpark SQL: compare records on each day and report the differences

Submitted by 怎甘沉沦 on 2019-12-24 11:13:06

Question


So the problem I have is this dataset:

It shows which businesses are doing business on specific days. What I want to achieve is to report which businesses were added on which day. Perhaps I'm looking for an answer like:

I managed to tidy up all the records using this SQL:

select [Date]
,Mnemonic
,securityDesc
,sum(cast(TradedVolume as money)) as TradedVolumSum
FROM SomeTable
group by [Date],Mnemonic,securityDesc

but I don't know how to compare each day's records with the previous day's and export the records that did not exist on the previous day to another table. I tried the SQL OVER (PARTITION BY) clause, but it makes things complex. I can use either SQL or a PySpark SQL/Python combination.
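Conceptually, something like a day-over-day anti join might be what I'm after. Here is a rough sketch (daily_df stands for the aggregated result of the SQL above; the left_anti join keeps the rows of a day that have no match on the previous day):

import pyspark.sql.functions as F

# daily_df: one row per (Date, Mnemonic, securityDesc) from the aggregation above;
# assumes Date is a date-typed column
prev_day = daily_df.select(
    F.date_add("Date", 1).alias("Date"),  # shift every day forward by one
    "Mnemonic",
    "securityDesc"
)

# rows present on a day but absent on the day before = businesses added that day
new_on_day = daily_df.join(
    prev_day,
    on=["Date", "Mnemonic", "securityDesc"],
    how="left_anti"
)
new_on_day.show()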

Could you let me know how I can resolve this problem?


Answer 1:


Below is the DataFrame operation for your question. You might need to tweak it a little, as I don't have sample data and wrote the code by looking at your screenshots. Please let me know if it solves your problem:

import pyspark.sql.functions as F
from pyspark.sql import Window

# the earliest date seen per security is the day that business was added;
# note: [Date] is T-SQL bracket quoting, in Spark the column is just Date
some_win = Window.partitionBy("securityDesc").orderBy(F.col("Date").asc())
some_table.withColumn(
    "business_added_day",
    F.first(F.col("Date")).over(some_win)
).select(
    "business_added_day",
    "securityDesc",
    "TradedVolumSum",
    "Mnemonic"
).distinct().orderBy("business_added_day").show()
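
For example, with a hypothetical toy dataset (made-up values, just to show the behaviour: each security is tagged with the first date on which it appears):

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical sample rows: Date, Mnemonic, securityDesc, TradedVolumSum
rows = [
    ("2018-08-01", "AAA", "Alpha Corp", 100.0),
    ("2018-08-02", "AAA", "Alpha Corp", 150.0),
    ("2018-08-02", "BBB", "Beta Ltd", 200.0),
]
some_table = spark.createDataFrame(
    rows, ["Date", "Mnemonic", "securityDesc", "TradedVolumSum"]
)

some_win = Window.partitionBy("securityDesc").orderBy(F.col("Date").asc())
some_table.withColumn(
    "business_added_day", F.first(F.col("Date")).over(some_win)
).select(
    "business_added_day", "securityDesc", "TradedVolumSum", "Mnemonic"
).distinct().orderBy("business_added_day").show()
# Alpha Corp gets business_added_day 2018-08-01, Beta Ltd gets 2018-08-02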


Source: https://stackoverflow.com/questions/52140278/pyspark-sql-compare-records-on-each-day-and-report-the-differences
