Keep only duplicates from a DataFrame regarding some field

Asked by 青春惊慌失措 on 2020-12-09 13:55

I have this spark DataFrame:

+---+-----+------+----+------------+------------+
| ID|  ID2|Number|Name|Opening_Hour|Closing_Hour|
+---+-----+------+----+------------+------------+
(sample rows truncated in the original)
3 Answers
  • 2020-12-09 14:15

    To extend pault's really great answer: I often need to subset a DataFrame to only the entries that are repeated exactly x times. Since I need this so often, I turned it into a function that I import, along with lots of other helpers, at the beginning of my scripts:

    import pyspark.sql.functions as f
    from pyspark.sql import Window

    def get_entries_with_frequency(df, cols, num):
      # Accept either a single column name or a list of names
      if isinstance(cols, str):
        cols = [cols]
      # Count the rows in each partition defined by `cols`
      w = Window.partitionBy(cols)
      return df.select('*', f.count(cols[0]).over(w).alias('dupeCount'))\
               .where('dupeCount = {}'.format(num))\
               .drop('dupeCount')
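For readers without a Spark session at hand, the same counting logic can be sketched in plain Python; the function name and sample data below are illustrative, not part of the original answer:

```python
from collections import Counter

def entries_with_frequency(rows, key, num):
    """Keep rows whose key occurs exactly `num` times (plain-Python sketch)."""
    counts = Counter(key(r) for r in rows)
    return [r for r in rows if counts[key(r)] == num]

rows = [
    ("ALT", "QWA", 2),
    ("ALT", "QWA", 2),
    ("TDR", "QWA", 3),
]
# Rows whose (ID, ID2, Number) combination appears exactly twice
pairs = entries_with_frequency(rows, lambda r: r, 2)
```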
    
  • 2020-12-09 14:19

    One way to do this is by using a pyspark.sql.Window to add a column that counts the number of duplicates for each row's ("ID", "ID2", "Number") combination. Then select only the rows where the number of duplicates is greater than 1.

    import pyspark.sql.functions as f
    from pyspark.sql import Window
    
    w = Window.partitionBy('ID', 'ID2', 'Number')
    df.select('*', f.count('ID').over(w).alias('dupeCount'))\
        .where('dupeCount > 1')\
        .drop('dupeCount')\
        .show()
    #+---+---+------+----+------------+------------+
    #| ID|ID2|Number|Name|Opening_Hour|Closing_Hour|
    #+---+---+------+----+------------+------------+
    #|ALT|QWA|     2|null|    08:54:00|    23:25:00|
    #|ALT|QWA|     2|null|    08:53:00|    23:24:00|
    #|ALT|QWA|     6|null|    08:59:00|    23:30:00|
    #|ALT|QWA|     6|null|    08:55:00|    23:26:00|
    #+---+---+------+----+------------+------------+
    

    I used pyspark.sql.functions.count() to count the number of items in each group. This returns a DataFrame containing all of the duplicates (the second output you showed).

    If you wanted to get only one row per ("ID", "ID2", "Number") combination, you could use another Window to order the rows.

    For example, below I add another column for the row_number and select only the rows where the duplicate count is greater than 1 and the row number is equal to 1. This guarantees one row per grouping.

    w2 = Window.partitionBy('ID', 'ID2', 'Number').orderBy('ID', 'ID2', 'Number')
    df.select(
            '*',
            f.count('ID').over(w).alias('dupeCount'),
            f.row_number().over(w2).alias('rowNum')
        )\
        .where('(dupeCount > 1) AND (rowNum = 1)')\
        .drop('dupeCount', 'rowNum')\
        .show()
    #+---+---+------+----+------------+------------+
    #| ID|ID2|Number|Name|Opening_Hour|Closing_Hour|
    #+---+---+------+----+------------+------------+
    #|ALT|QWA|     2|null|    08:54:00|    23:25:00|
    #|ALT|QWA|     6|null|    08:59:00|    23:30:00|
    #+---+---+------+----+------------+------------+
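The dupeCount/row_number combination boils down to: keep the first row of every key that occurs more than once. A plain-Python sketch of that logic, on illustrative sample data:

```python
from collections import Counter

rows = [
    ("ALT", "QWA", 2, "08:54:00"),
    ("ALT", "QWA", 2, "08:53:00"),
    ("ALT", "QWA", 6, "08:59:00"),
    ("ALT", "QWA", 6, "08:55:00"),
    ("TDR", "QWA", 3, "08:53:00"),
]
key = lambda r: r[:3]                  # (ID, ID2, Number)
counts = Counter(key(r) for r in rows)

seen = set()
one_per_group = []
for r in rows:
    k = key(r)
    # duplicated key, and this is its first occurrence
    if counts[k] > 1 and k not in seen:
        seen.add(k)
        one_per_group.append(r)
```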
    
  • 2020-12-09 14:29

    Here is a way to do it without Window.

    A DataFrame with the extra duplicate copies (drop_duplicates keeps one row per group, and exceptAll subtracts exactly those rows)

    df.exceptAll(df.drop_duplicates(['ID', 'ID2', 'Number'])).show()
    # +---+---+------+------------+------------+
    # | ID|ID2|Number|Opening_Hour|Closing_Hour|
    # +---+---+------+------------+------------+
    # |ALT|QWA|     2|    08:53:00|    23:24:00|
    # |ALT|QWA|     6|    08:55:00|    23:26:00|
    # +---+---+------+------------+------------+
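In other words, exceptAll here is a multiset difference: one representative per key is removed, and only the surplus copies survive. Sketched in plain Python on illustrative data:

```python
rows = [
    ("ALT", "QWA", 2, "08:54:00"),
    ("ALT", "QWA", 2, "08:53:00"),
    ("ALT", "QWA", 6, "08:59:00"),
    ("ALT", "QWA", 6, "08:55:00"),
]
key = lambda r: r[:3]                  # (ID, ID2, Number)

first_seen = set()
extras = []                            # copies beyond the first per key
for r in rows:
    k = key(r)
    if k in first_seen:
        extras.append(r)
    else:
        first_seen.add(k)
```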
    

    A DataFrame with all duplicates (using left_anti join)

    df.join(df.groupBy('ID', 'ID2', 'Number')\
              .count().where('count = 1').drop('count'),
            on=['ID', 'ID2', 'Number'],
            how='left_anti').show()
    # +---+---+------+------------+------------+
    # | ID|ID2|Number|Opening_Hour|Closing_Hour|
    # +---+---+------+------------+------------+
    # |ALT|QWA|     2|    08:54:00|    23:25:00|
    # |ALT|QWA|     2|    08:53:00|    23:24:00|
    # |ALT|QWA|     6|    08:59:00|    23:30:00|
    # |ALT|QWA|     6|    08:55:00|    23:26:00|
    # +---+---+------+------------+------------+
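The left_anti join drops every row whose key occurs exactly once, so all copies of duplicated keys survive. A plain-Python sketch of the same anti-join, on illustrative data:

```python
from collections import Counter

rows = [
    ("ALT", "QWA", 2, "08:54:00"),
    ("ALT", "QWA", 2, "08:53:00"),
    ("ALT", "QWA", 6, "08:59:00"),
    ("ALT", "QWA", 6, "08:55:00"),
    ("TDR", "QWA", 3, "08:53:00"),
]
key = lambda r: r[:3]                  # (ID, ID2, Number)
counts = Counter(key(r) for r in rows)

# keys occurring exactly once -- the "right side" of the anti-join
singletons = {k for k, c in counts.items() if c == 1}
all_dupes = [r for r in rows if key(r) not in singletons]
```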
    