Question
I've seen a couple questions like this but not a satisfactory answer for my situation. Here is a sample DataFrame:
+------+-----+----+
| id|value|type|
+------+-----+----+
|283924| 1.5| 0|
|283924| 1.5| 1|
|982384| 3.0| 0|
|982384| 3.0| 1|
|892383| 2.0| 0|
|892383| 2.5| 1|
+------+-----+----+
I want to identify duplicates by just the "id" and "value" columns, and then remove all instances.
In this case:
- Rows 1 and 2 are duplicates (again we are ignoring the "type" column)
- Rows 3 and 4 are duplicates, and therefore only rows 5 and 6 should remain.
The output would be:
+------+-----+----+
| id|value|type|
+------+-----+----+
|892383| 2.5| 1|
|892383| 2.0| 0|
+------+-----+----+
I've tried:
df.dropDuplicates(subset = ['id', 'value'], keep = False)
But the "keep" argument isn't available in PySpark's dropDuplicates (as it is in pandas.DataFrame.drop_duplicates).
How else could I do this?
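For reference, the pandas behaviour being asked about can be sketched like this (a minimal sketch; pdf is an illustrative name and the data is the sample above):
import pandas as pd

# same sample data as in the question
pdf = pd.DataFrame({
    'id':    [283924, 283924, 982384, 982384, 892383, 892383],
    'value': [1.5, 1.5, 3.0, 3.0, 2.0, 2.5],
    'type':  [0, 1, 0, 1, 0, 1],
})

# keep=False drops every row that has a duplicate on ['id', 'value'],
# which is the behaviour to reproduce in PySpark
print(pdf.drop_duplicates(subset=['id', 'value'], keep=False))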
Answer 1:
You can do that using window functions:
from pyspark.sql import Window, functions as F

# count the rows in each (id, value) group and keep only groups of size 1
df.withColumn(
    'fg',
    F.count("id").over(Window.partitionBy("id", "value"))
).where("fg = 1").drop("fg").show()
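To try this end to end, here is a minimal, self-contained sketch that rebuilds the sample data and applies the same window filter (the SparkSession setup and the names spark and w are assumptions, not part of the original answer):
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# sample DataFrame from the question
df = spark.createDataFrame(
    [(283924, 1.5, 0), (283924, 1.5, 1),
     (982384, 3.0, 0), (982384, 3.0, 1),
     (892383, 2.0, 0), (892383, 2.5, 1)],
    ['id', 'value', 'type'],
)

# per-row count of its (id, value) group; rows in groups of size 1 survive
w = Window.partitionBy('id', 'value')
df.withColumn('fg', F.count('id').over(w)).where('fg = 1').drop('fg').show()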
Answer 2:
You can groupBy the "id" and "value" columns to get the count, then join back to your DataFrame to keep only the rows whose count is 1:
df.join(
    df.groupBy('id', 'value').count().where('count = 1').drop('count'),
    on=['id', 'value']
).show()
#+------+-----+----+
#| id|value|type|
#+------+-----+----+
#|892383| 2.5| 1|
#|892383| 2.0| 0|
#+------+-----+----+
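A variation on the same idea (a sketch, not part of the original answer, and assuming the df from the question) is to collect the duplicated (id, value) keys first and remove them with a left_anti join:
# (id, value) pairs that occur more than once
dupes = df.groupBy('id', 'value').count().where('count > 1').drop('count')

# left_anti keeps only rows whose (id, value) key does NOT appear in dupes
df.join(dupes, on=['id', 'value'], how='left_anti').show()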
Source: https://stackoverflow.com/questions/51522719/remove-all-rows-that-are-duplicates-with-respect-to-some-rows