Remove duplicate rows, regardless of new information - PySpark

Submitted by 北城余情 on 2020-01-15 10:15:07

Question


Say I have a dataframe like so:

ID         Media
1         imgix.com/20830dk
2         imgix.com/202398pwe
3         imgix.com/lvw0923dk
4         imgix.com/082kldcm
4         imgix.com/lks032m
4         imgix.com/903248

I'd like to end up with:

ID         Media
1         imgix.com/20830dk
2         imgix.com/202398pwe
3         imgix.com/lvw0923dk
4         imgix.com/082kldcm

Even though that causes me to lose 2 links for ID = 4, I don't care. Is there a simple way to do this in python/pyspark?


Answer 1:


  1. Group by col('ID')
  2. Aggregate with agg, using collect_list to gather the Media values into a list per ID
  3. Call getItem(0) to extract the first element from the aggregated list

    from pyspark.sql.functions import collect_list

    df.groupBy('ID').agg(collect_list('Media').getItem(0).alias('Media')).show()
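
For context, here is a minimal self-contained sketch of this approach; the SparkSession setup and sample data are assumed purely for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import collect_list

    # Assumed setup for illustration; reuse any existing SparkSession
    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(1, 'imgix.com/20830dk'),
         (2, 'imgix.com/202398pwe'),
         (3, 'imgix.com/lvw0923dk'),
         (4, 'imgix.com/082kldcm'),
         (4, 'imgix.com/lks032m'),
         (4, 'imgix.com/903248')],
        ['ID', 'Media'])

    # Collect all Media values per ID, then keep only the first element of each list
    df.groupBy('ID').agg(collect_list('Media').getItem(0).alias('Media')).show()

Note that collect_list does not guarantee any particular ordering after the shuffle, so which of the three links survives for ID 4 is not deterministic; the question explicitly accepts losing the other links.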
    



Answer 2:


Anton and pault are correct:

df.drop_duplicates(subset=['ID']) 

does indeed work.
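
Assuming the same example DataFrame as above, a minimal usage sketch (drop_duplicates is an alias for dropDuplicates):

    # Keep one row per ID and discard the rest
    deduped = df.drop_duplicates(subset=['ID'])
    deduped.show()

Like the groupBy approach, this keeps an arbitrary row for each duplicate ID, which the question says is acceptable.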



Source: https://stackoverflow.com/questions/50685522/remove-duplicate-rows-regardless-of-new-information-pyspark
