How to get distinct rows in dataframe using pyspark?

遇见更好的自我 2021-01-04 04:49

I understand this is a very simple question that has most likely been answered somewhere, but as a beginner I still don't get it and am looking for your enlightenment.

2 Answers
  • 2021-01-04 05:21

    If df is the name of your DataFrame, there are two ways to get unique rows:

    df2 = df.distinct()
    

    or

    df2 = df.drop_duplicates()
    
  • 2021-01-04 05:35

    Plain distinct() is not very flexible, because you cannot restrict it to particular columns. In your case it is enough:

    df = df.distinct()
    

    but if the day column holds different values for the same host, distinct() will not give you one row per host:

    +--------------------+---+
    |                host|day|
    +--------------------+---+
    |   in24.inetnebr.com|  1|
    |     uplherc.upl.com|  1|
    |     uplherc.upl.com|  2|
    |     uplherc.upl.com|  1|
    |     uplherc.upl.com|  1|
    |ix-esc-ca2-07.ix....|  1|
    |     uplherc.upl.com|  1|
    +--------------------+---+
    

    After distinct() you get the following:

    df.distinct().show()
    
    +--------------------+---+
    |                host|day|
    +--------------------+---+
    |   in24.inetnebr.com|  1|
    |     uplherc.upl.com|  2|
    |     uplherc.upl.com|  1|
    |ix-esc-ca2-07.ix....|  1|
    +--------------------+---+
    

    So you should use this instead:

    df = df.dropDuplicates(['host'])
    

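    This keeps one row per host, so only one day value survives for each host. It is usually the first one encountered, but Spark does not guarantee which row is kept.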

    If you are familiar with SQL, the following also works:

    df.createOrReplaceTempView("temp_table")
    new_df = spark.sql("select first(host), first(day) from temp_table GROUP BY host")
    new_df.show()
    
    +--------------------+-----------------+
    |  first(host, false)|first(day, false)|
    +--------------------+-----------------+
    |   in24.inetnebr.com|                1|
    |ix-esc-ca2-07.ix....|                1|
    |     uplherc.upl.com|                1|
    +--------------------+-----------------+
    