I have this Spark DataFrame:
+---+-----+------+----+------------+------------+
| ID|  ID2|Number|Name|Opening_Hour|Closing_Hour|
+---+-----+------+----+------------+------------+
...
To extend pault's really great answer: I often need to subset a DataFrame to only the entries that are repeated x times. Since I need to do this really often, I turned it into a function that I import, along with lots of other helper functions, at the beginning of my scripts:
import pyspark.sql.functions as f
from pyspark.sql import Window

def get_entries_with_frequency(df, cols, num):
    if isinstance(cols, str):
        cols = [cols]
    # count the rows within each group of `cols` values
    w = Window.partitionBy(cols)
    # note: f.count(cols[0]) skips nulls, so pick a column without them
    return df.select('*', f.count(cols[0]).over(w).alias('dupeCount'))\
             .where('dupeCount = {}'.format(num))\
             .drop('dupeCount')
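For example, to keep only the rows whose ('ID', 'ID2', 'Number') combination occurs exactly twice (a quick usage sketch against the question's df):
get_entries_with_frequency(df, ['ID', 'ID2', 'Number'], 2).show()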
One way to do this is by using a pyspark.sql.Window to add a column that counts the number of duplicates for each row's ("ID", "ID2", "Number")
combination. Then select only the rows where the number of duplicates is greater than 1.
import pyspark.sql.functions as f
from pyspark.sql import Window
w = Window.partitionBy('ID', 'ID2', 'Number')
df.select('*', f.count('ID').over(w).alias('dupeCount'))\
    .where('dupeCount > 1')\
    .drop('dupeCount')\
    .show()
#+---+---+------+----+------------+------------+
#| ID|ID2|Number|Name|Opening_Hour|Closing_Hour|
#+---+---+------+----+------------+------------+
#|ALT|QWA| 2|null| 08:54:00| 23:25:00|
#|ALT|QWA| 2|null| 08:53:00| 23:24:00|
#|ALT|QWA| 6|null| 08:59:00| 23:30:00|
#|ALT|QWA| 6|null| 08:55:00| 23:26:00|
#+---+---+------+----+------------+------------+
I used pyspark.sql.functions.count() to count the number of items in each group. This returns a DataFrame containing all of the duplicates (the second output you showed).
If you wanted to get only one row per ("ID", "ID2", "Number")
combination, you could do so by using another Window to order the rows.
For example, below I add another column for the row_number and select only the rows where the duplicate count is greater than 1 and the row number is equal to 1. This guarantees one row per grouping.
w2 = Window.partitionBy('ID', 'ID2', 'Number').orderBy('ID', 'ID2', 'Number')
df.select(
        '*',
        f.count('ID').over(w).alias('dupeCount'),
        f.row_number().over(w2).alias('rowNum')
    )\
    .where('(dupeCount > 1) AND (rowNum = 1)')\
    .drop('dupeCount', 'rowNum')\
    .show()
#+---+---+------+----+------------+------------+
#| ID|ID2|Number|Name|Opening_Hour|Closing_Hour|
#+---+---+------+----+------------+------------+
#|ALT|QWA| 2|null| 08:54:00| 23:25:00|
#|ALT|QWA| 6|null| 08:59:00| 23:30:00|
#+---+---+------+----+------------+------------+
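Note that w2 orders by the same columns it partitions by, so which duplicate ends up with rowNum = 1 is effectively arbitrary. If you need a deterministic pick, order by a column that distinguishes the rows within a group; a sketch, assuming you want the row with the earliest Opening_Hour:
w2 = Window.partitionBy('ID', 'ID2', 'Number').orderBy('Opening_Hour')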
Here is a way to do it without a Window.
A DataFrame with the extra duplicate rows: drop_duplicates keeps one row per ('ID', 'ID2', 'Number') combination, and exceptAll subtracts exactly those kept rows.
df.exceptAll(df.drop_duplicates(['ID', 'ID2', 'Number'])).show()
# +---+---+------+------------+------------+
# | ID|ID2|Number|Opening_Hour|Closing_Hour|
# +---+---+------+------------+------------+
# |ALT|QWA| 2| 08:53:00| 23:24:00|
# |ALT|QWA| 6| 08:55:00| 23:26:00|
# +---+---+------+------------+------------+
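If you instead want one row per duplicated combination (like the row_number example above), you can deduplicate the result again; a sketch reusing the line above:
df.exceptAll(df.drop_duplicates(['ID', 'ID2', 'Number']))\
    .drop_duplicates(['ID', 'ID2', 'Number'])\
    .show()
Note that exceptAll is available from Spark 2.4, and which row drop_duplicates keeps is not deterministic.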
A DataFrame with all duplicated rows, using a left_anti join against the combinations that occur only once:
df.join(df.groupBy('ID', 'ID2', 'Number')
          .count().where('count = 1').drop('count'),
        on=['ID', 'ID2', 'Number'],
        how='left_anti').show()
# +---+---+------+------------+------------+
# | ID|ID2|Number|Opening_Hour|Closing_Hour|
# +---+---+------+------------+------------+
# |ALT|QWA| 2| 08:54:00| 23:25:00|
# |ALT|QWA| 2| 08:53:00| 23:24:00|
# |ALT|QWA| 6| 08:59:00| 23:30:00|
# |ALT|QWA| 6| 08:55:00| 23:26:00|
# +---+---+------+------------+------------+
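The same idea also works flipped around: a left_semi join against the combinations that occur more than once keeps the duplicated rows directly (a sketch, equivalent to the left_anti version above):
df.join(df.groupBy('ID', 'ID2', 'Number')
          .count().where('count > 1').drop('count'),
        on=['ID', 'ID2', 'Number'],
        how='left_semi').show()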