Remove duplicate rows, regardless of new information - PySpark

Posted by 纵然是瞬间 on 2020-01-15 10:15:39

Question


Say I have a dataframe like so:

ID         Media
1         imgix.com/20830dk
2         imgix.com/202398pwe
3         imgix.com/lvw0923dk
4         imgix.com/082kldcm
4         imgix.com/lks032m
4         imgix.com/903248

I'd like to end up with:

ID         Media
1         imgix.com/20830dk
2         imgix.com/202398pwe
3         imgix.com/lvw0923dk
4         imgix.com/082kldcm

Even though that causes me to lose 2 links for ID = 4, I don't care. Is there a simple way to do this in python/pyspark?
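
For reference, the example DataFrame above could be reproduced with something like this (a minimal sketch; it assumes a local SparkSession rather than any particular cluster setup):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Recreate the example data: ID 4 appears three times with different Media links
    df = spark.createDataFrame(
        [(1, "imgix.com/20830dk"),
         (2, "imgix.com/202398pwe"),
         (3, "imgix.com/lvw0923dk"),
         (4, "imgix.com/082kldcm"),
         (4, "imgix.com/lks032m"),
         (4, "imgix.com/903248")],
        ["ID", "Media"])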


Answer 1:


  1. Group by col('ID')
  2. Aggregate the Media values into a list with collect_list inside agg
  3. Call getItem(0) to extract the first element from the aggregated list

    from pyspark.sql.functions import collect_list

    df.groupBy('ID').agg(collect_list('Media').getItem(0).alias('Media')).show()
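
Run on the example data, this should produce one row per ID, matching the desired output above (illustrative; which Media value survives for ID = 4 may vary, because collect_list does not guarantee the order of the collected elements):

    # +---+-------------------+
    # | ID|              Media|
    # +---+-------------------+
    # |  1|  imgix.com/20830dk|
    # |  2|imgix.com/202398pwe|
    # |  3|imgix.com/lvw0923dk|
    # |  4| imgix.com/082kldcm|
    # +---+-------------------+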
    



Answer 2:


Anton and pault are correct:

    df.drop_duplicates(subset=['ID'])

does indeed work.
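
A minimal usage sketch (assuming df is the example DataFrame from the question; as with the groupBy approach, which of the three ID = 4 rows is kept is not guaranteed):

    # Keep one arbitrary row per distinct ID value
    deduped = df.drop_duplicates(subset=['ID'])
    deduped.show()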



Source: https://stackoverflow.com/questions/50685522/remove-duplicate-rows-regardless-of-new-information-pyspark
