Please suggest a PySpark DataFrame alternative for Pandas df['col'].unique().
I want to list out all the unique values in a PySpark DataFrame column.
You can use df.dropDuplicates(['col1','col2'])
to get only the rows that are distinct with respect to the columns in the list.
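A minimal, self-contained sketch (the demo_df data and column names here are made up for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data: the first two rows share the same (col1, col2) pair.
demo_df = spark.createDataFrame(
    [("a", 1, 10), ("a", 1, 20), ("b", 2, 30)],
    ("col1", "col2", "col3"))

# Keeps one row per distinct (col1, col2) pair; which of the
# duplicate rows survives is not guaranteed.
demo_df.dropDuplicates(["col1", "col2"]).show()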
If you want to see the distinct values of a specific column in your DataFrame, you would just need to write:
df.select('colname').distinct().show(100,False)
This would show up to 100 distinct values (if at least 100 are available) for the colname column of the df DataFrame, without truncating long values.
If you want to do something fancy with the distinct values, you can save them in a separate DataFrame:
a = df.select('colname').distinct()
Here, a would be a DataFrame holding all the distinct values of the column colname.
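If you need a plain Python list, similar to what pandas unique() returns, one option (a sketch, assuming the column is named colname) is to collect the distinct rows back to the driver:
# Each element of a.collect() is a Row; pull out the single column.
unique_vals = [row['colname'] for row in a.collect()]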
If you want to select all columns' data as distinct rows from a DataFrame (df), then:
df.select('*').distinct().show(10,truncate=False)
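Calling distinct() directly on the DataFrame does the same thing, since it already operates over all columns:
df.distinct().show(10, truncate=False)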
In addition to the dropDuplicates option, there is also a method with the name we know from pandas, drop_duplicates:
drop_duplicates() is an alias for dropDuplicates().
Example
s_df = sqlContext.createDataFrame([("foo", 1),
                                   ("foo", 1),
                                   ("bar", 2),
                                   ("foo", 3)], ('k', 'v'))
s_df.show()
+---+---+
| k| v|
+---+---+
|foo| 1|
|foo| 1|
|bar| 2|
|foo| 3|
+---+---+
Drop by subset
s_df.drop_duplicates(subset=['k']).show()
+---+---+
| k| v|
+---+---+
|bar| 2|
|foo| 1|
+---+---+
s_df.drop_duplicates().show()
+---+---+
| k| v|
+---+---+
|bar| 2|
|foo| 3|
|foo| 1|
+---+---+
This should help to get distinct values of a column:
df.select('column1').distinct().collect()
Note that .collect() has no built-in limit on how many values it can return, so this might be slow; use .show() instead, or add .limit(20) before .collect() to keep the result bounded.
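A minimal sketch of the bounded variant suggested above:
# Caps the result at 20 distinct values before collecting to the driver.
sample_vals = df.select('column1').distinct().limit(20).collect()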
Run this first
df.createOrReplaceTempView('df')
Then run
spark.sql("""
SELECT distinct
column name
FROM
df
""").show()