Question
I've seen a couple questions like this but not a satisfactory answer for my situation. Here is a sample DataFrame:
+------+-----+----+
| id|value|type|
+------+-----+----+
|283924| 1.5| 0|
|283924| 1.5| 1|
|982384| 3.0| 0|
|982384| 3.0| 1|
|892383| 2.0| 0|
|892383| 2.5| 1|
+------+-----+----+
I want to identify duplicates by just the "id" and "value" columns, and then remove all instances.
In this case:
- Rows 1 and 2 are duplicates (again we are ignoring the "type" column)
- Rows 3 and 4 are duplicates, and therefore only rows 5 and 6 should remain.
The output would be:
+------+-----+----+
| id|value|type|
+------+-----+----+
|892383| 2.5| 1|
|892383| 2.0| 0|
+------+-----+----+
I've tried:
df.dropDuplicates(subset = ['id', 'value'], keep = False)
But the "keep" argument isn't available in PySpark's dropDuplicates (as it is in pandas.DataFrame.drop_duplicates).
How else could I do this?
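For reference, the pandas behaviour being asked about can be sketched like this (a minimal sketch; pdf is an illustrative name and the data is the sample above):
import pandas as pd

# same sample data as in the question
pdf = pd.DataFrame({
    'id':    [283924, 283924, 982384, 982384, 892383, 892383],
    'value': [1.5, 1.5, 3.0, 3.0, 2.0, 2.5],
    'type':  [0, 1, 0, 1, 0, 1],
})

# keep=False drops every row that has a duplicate on ['id', 'value'],
# which is the behaviour to reproduce in PySpark
print(pdf.drop_duplicates(subset=['id', 'value'], keep=False))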
Answer 1:
You can do that using window functions:
from pyspark.sql import Window, functions as F

# count the rows in each (id, value) group and keep only groups of size 1
df.withColumn(
    'fg',
    F.count("id").over(Window.partitionBy("id", "value"))
).where("fg = 1").drop("fg").show()
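To try this end to end, here is a minimal, self-contained sketch that rebuilds the sample data and applies the same window filter (the SparkSession setup and the names spark and w are assumptions, not part of the original answer):
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# sample DataFrame from the question
df = spark.createDataFrame(
    [(283924, 1.5, 0), (283924, 1.5, 1),
     (982384, 3.0, 0), (982384, 3.0, 1),
     (892383, 2.0, 0), (892383, 2.5, 1)],
    ['id', 'value', 'type'],
)

# per-row count of its (id, value) group; rows in groups of size 1 survive
w = Window.partitionBy('id', 'value')
df.withColumn('fg', F.count('id').over(w)).where('fg = 1').drop('fg').show()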
Answer 2:
You can groupBy the "id" and "value" columns to get the count, then join back to your DataFrame to keep only the rows whose count is 1:
df.join(
    df.groupBy('id', 'value').count().where('count = 1').drop('count'),
    on=['id', 'value']
).show()
#+------+-----+----+
#| id|value|type|
#+------+-----+----+
#|892383| 2.5| 1|
#|892383| 2.0| 0|
#+------+-----+----+
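A variation on the same idea (a sketch, not part of the original answer, and assuming the df from the question) is to collect the duplicated (id, value) keys first and remove them with a left_anti join:
# (id, value) pairs that occur more than once
dupes = df.groupBy('id', 'value').count().where('count > 1').drop('count')

# left_anti keeps only rows whose (id, value) key does NOT appear in dupes
df.join(dupes, on=['id', 'value'], how='left_anti').show()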
Source: https://stackoverflow.com/questions/51522719/remove-all-rows-that-are-duplicates-with-respect-to-some-rows