show distinct column values in pyspark dataframe: python

忘了有多久 2020-12-23 10:55

Please suggest a PySpark dataframe alternative to pandas df['col'].unique().

I want to list out all the unique values in a pyspark dataframe column.

9 answers
  • 2020-12-23 11:24

    You can use df.dropDuplicates(['col1','col2']) to keep only the rows that are distinct with respect to the columns listed in the array.

  • 2020-12-23 11:27

    If you want to see the distinct values of a specific column in your dataframe, you just need to write:

        df.select('colname').distinct().show(100,False)
    

    This shows up to 100 distinct values of the colname column in the df dataframe; the second argument, False, turns off truncation of long values.

    If you want to do further processing on the distinct values, you can keep them in a new dataframe:

        a = df.select('colname').distinct()
    

    Here, a is a dataframe holding all the distinct values of the column colname.

  • 2020-12-23 11:31

    If you want to select the distinct rows across ALL columns of a DataFrame (df), then:

    df.select('*').distinct().show(10,truncate=False)

  • 2020-12-23 11:35

    In addition to dropDuplicates, there is a method with the name familiar from pandas, drop_duplicates:

    drop_duplicates() is an alias for dropDuplicates().

    Example

    # `spark` is the SparkSession; older code used sqlContext.createDataFrame here.
    s_df = spark.createDataFrame([("foo", 1),
                                  ("foo", 1),
                                  ("bar", 2),
                                  ("foo", 3)], ('k', 'v'))
    s_df.show()
    
    +---+---+
    |  k|  v|
    +---+---+
    |foo|  1|
    |foo|  1|
    |bar|  2|
    |foo|  3|
    +---+---+
    

    Drop by subset (which of the duplicate rows is kept is not guaranteed)

    s_df.drop_duplicates(subset = ['k']).show()
    
    +---+---+
    |  k|  v|
    +---+---+
    |bar|  2|
    |foo|  1|
    +---+---+
    
    Drop exact duplicate rows

    s_df.drop_duplicates().show()
    
    
    +---+---+
    |  k|  v|
    +---+---+
    |bar|  2|
    |foo|  3|
    |foo|  1|
    +---+---+
    
  • 2020-12-23 11:38

    This should help to get distinct values of a column:

    df.select('column1').distinct().collect()
    

    Note that .collect() has no built-in limit on how many values it returns, and it brings every row to the driver, so it can be slow or exhaust driver memory on high-cardinality columns. Use .show() instead, or add .limit(20) before .collect() to cap the result.

  • 2020-12-23 11:40

    Run this first

    df.createOrReplaceTempView('df')
    

    Then run

    spark.sql("""
        SELECT DISTINCT
            column_name
        FROM
            df
        """).show()
    