Please suggest a PySpark DataFrame alternative for Pandas df['col'].unique().
I want to list out all the unique values in a PySpark DataFrame column.
You can use df.dropDuplicates(['col1','col2'])
to get only the rows that are distinct with respect to the columns in the list.
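A minimal, self-contained sketch (the demo_df data and column names here are made up for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data: the first two rows share the same (col1, col2) pair.
demo_df = spark.createDataFrame(
    [("a", 1, 10), ("a", 1, 20), ("b", 2, 30)],
    ("col1", "col2", "col3"))

# Keeps one row per distinct (col1, col2) pair; which of the
# duplicate rows survives is not guaranteed.
demo_df.dropDuplicates(["col1", "col2"]).show()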
If you want to see the distinct values of a specific column in your DataFrame, you would just need to write:
df.select('colname').distinct().show(100,False)
This would show up to 100 distinct values (if at least 100 are available) for the colname column of the df DataFrame, without truncating long values.
If you want to do something fancy with the distinct values, you can save them in a separate DataFrame:
a = df.select('colname').distinct()
Here, a would be a DataFrame holding all the distinct values of the column colname.
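If you need a plain Python list, similar to what pandas unique() returns, one option (a sketch, assuming the column is named colname) is to collect the distinct rows back to the driver:
# Each element of a.collect() is a Row; pull out the single column.
unique_vals = [row['colname'] for row in a.collect()]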
If you want to select all columns' data as distinct rows from a DataFrame (df), then:
df.select('*').distinct().show(10,truncate=False)
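Calling distinct() directly on the DataFrame does the same thing, since it already operates over all columns:
df.distinct().show(10, truncate=False)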
In addition to the dropDuplicates option, there is also a method with the name we know from pandas, drop_duplicates:
drop_duplicates() is an alias for dropDuplicates().
Example
s_df = sqlContext.createDataFrame([("foo", 1),
                                   ("foo", 1),
                                   ("bar", 2),
                                   ("foo", 3)], ('k', 'v'))
s_df.show()
+---+---+
| k| v|
+---+---+
|foo| 1|
|foo| 1|
|bar| 2|
|foo| 3|
+---+---+
Drop by subset
s_df.drop_duplicates(subset=['k']).show()
+---+---+
| k| v|
+---+---+
|bar| 2|
|foo| 1|
+---+---+
s_df.drop_duplicates().show()
+---+---+
| k| v|
+---+---+
|bar| 2|
|foo| 3|
|foo| 1|
+---+---+
This should help to get distinct values of a column:
df.select('column1').distinct().collect()
Note that .collect() has no built-in limit on how many values it can return, so this might be slow; use .show() instead, or add .limit(20) before .collect() to keep the result bounded.
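A minimal sketch of the bounded variant suggested above:
# Caps the result at 20 distinct values before collecting to the driver.
sample_vals = df.select('column1').distinct().limit(20).collect()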
Run this first
df.createOrReplaceTempView('df')
Then run
spark.sql("""
SELECT distinct
column name
FROM
df
""").show()