Remove all rows that are duplicates with respect to some rows

Submitted by 浪尽此生 on 2019-12-24 09:58:12

Question


I've seen a couple of questions like this, but no satisfactory answer for my situation. Here is a sample DataFrame:

+------+-----+----+
|    id|value|type|
+------+-----+----+
|283924|  1.5|   0|
|283924|  1.5|   1|
|982384|  3.0|   0|
|982384|  3.0|   1|
|892383|  2.0|   0|
|892383|  2.5|   1|
+------+-----+----+
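
For reference, a minimal sketch that reconstructs this sample DataFrame (assuming an existing SparkSession named spark):

# Hypothetical setup reproducing the sample data above
df = spark.createDataFrame(
    [(283924, 1.5, 0), (283924, 1.5, 1),
     (982384, 3.0, 0), (982384, 3.0, 1),
     (892383, 2.0, 0), (892383, 2.5, 1)],
    ['id', 'value', 'type'],
)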

I want to identify duplicates by just the "id" and "value" columns, and then remove all instances.

In this case:

  • Rows 1 and 2 are duplicates (again, we are ignoring the "type" column)
  • Rows 3 and 4 are duplicates, and therefore only rows 5 and 6 should remain.

The output would be:

+------+-----+----+
|    id|value|type|
+------+-----+----+
|892383|  2.5|   1|
|892383|  2.0|   0|
+------+-----+----+

I've tried

df.dropDuplicates(subset = ['id', 'value'], keep = False)

But the "keep" argument doesn't exist in PySpark's dropDuplicates (as it does in pandas.DataFrame.drop_duplicates).

How else could I do this?


Answer 1:


You can do that using window functions:

from pyspark.sql import Window, functions as F

df.withColumn(
  'fg',
  # count how many rows share this row's (id, value) pair
  F.count("id").over(Window.partitionBy("id", "value"))
).where("fg = 1").drop("fg").show()
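
Counting over a window partitioned by ("id", "value") tags each row with the size of its duplicate group, so fg = 1 keeps only rows whose ("id", "value") pair appears exactly once. The same idea written with column expressions instead of SQL strings (an equivalent sketch, not part of the original answer):

from pyspark.sql import Window, functions as F

# Tag each row with the size of its (id, value) group, then keep singletons
w = Window.partitionBy("id", "value")
df.withColumn("fg", F.count("*").over(w)) \
  .filter(F.col("fg") == 1) \
  .drop("fg") \
  .show()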



Answer 2:


You can groupBy the "id" and "value" columns to get the count. Then use a join to filter out the rows in your DataFrame where the count is not 1:

df.join(
    df.groupBy('id', 'value').count().where('count = 1').drop('count'), on=['id', 'value']
).show()
#+------+-----+----+
#|    id|value|type|
#+------+-----+----+
#|892383|  2.5|   1|
#|892383|  2.0|   0|
#+------+-----+----+
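
An equivalent way to express the same filter is a left anti join against the ("id", "value") pairs that occur more than once (a sketch; the name "dupes" is introduced here for illustration):

# (id, value) pairs that appear more than once
dupes = df.groupBy('id', 'value').count().where('count > 1')

# left_anti keeps only the rows of df that have no match in dupes
df.join(dupes, on=['id', 'value'], how='left_anti').show()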


Source: https://stackoverflow.com/questions/51522719/remove-all-rows-that-are-duplicates-with-respect-to-some-rows
