Please suggest a PySpark DataFrame alternative for Pandas df['col'].unique().
I want to list out all the unique values in a PySpark DataFrame column.
You could do:
distinct_column = 'somecol'
# Select just that column, deduplicate, and pull the resulting rows to the driver
distinct_column_vals = df.select(distinct_column).distinct().collect()
# Unwrap the Row objects into a plain list of values
distinct_column_vals = [v[distinct_column] for v in distinct_column_vals]
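For instance, here is a minimal end-to-end sketch; the SparkSession setup and the sample data are hypothetical additions, not part of the original answer:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("foo", 1), ("bar", 2), ("foo", 3)], ["somecol", "v"])

distinct_column = 'somecol'
distinct_column_vals = df.select(distinct_column).distinct().collect()
distinct_column_vals = [v[distinct_column] for v in distinct_column_vals]
print(distinct_column_vals)  # e.g. ['bar', 'foo'] -- row order is not guaranteed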
collect_set can help you get the unique values from a given column of a pyspark.sql.DataFrame:
from pyspark.sql import functions as F

df.select(F.collect_set("column").alias("column")).first()["column"]
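As a quick sanity check (the sample data below is hypothetical), collect_set both deduplicates and ignores nulls:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("foo",), ("bar",), ("foo",), (None,)], ["column"])

# collect_set aggregates the column into a deduplicated array; nulls are dropped
print(df.select(F.collect_set("column").alias("column")).first()["column"])
# e.g. ['bar', 'foo'] -- order within the array is not guaranteed

Keep in mind that all the unique values end up in a single array, so this is only suitable when the number of distinct values is small enough to fit comfortably on the driver.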
Let's assume we're working with the following representation of data (two columns, k and v, where k contains three entries, two of them unique):
+---+---+
| k| v|
+---+---+
|foo| 1|
|bar| 2|
|foo| 3|
+---+---+
With a Pandas dataframe:
import pandas as pd
p_df = pd.DataFrame([("foo", 1), ("bar", 2), ("foo", 3)], columns=("k", "v"))
p_df['k'].unique()
This returns an ndarray, i.e. array(['foo', 'bar'], dtype=object).
You asked for a "pyspark dataframe alternative for pandas df['col'].unique()". Now, given the following Spark dataframe:
s_df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("foo", 3)], ('k', 'v'))
If you want the same result from Spark, i.e. an ndarray, use toPandas():
s_df.toPandas()['k'].unique()
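One caveat: toPandas() collects the entire DataFrame to the driver. If you only need the unique keys, a more economical sketch of the same idea is to deduplicate on the Spark side first and only then convert:

s_df.select('k').distinct().toPandas()['k'].unique()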
Alternatively, if you don't need an ndarray specifically and just want a list of the unique values of column k:
s_df.select('k').distinct().rdd.map(lambda r: r[0]).collect()
Finally, you can also use a list comprehension as follows:
[i.k for i in s_df.select('k').distinct().collect()]
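Both of the last two snippets should yield ['foo', 'bar'] (row order is not guaranteed). If you would rather avoid the RDD accessor entirely, for example because your environment may not expose it (Spark Connect, for instance, does not support the RDD API), an equivalent DataFrame-only sketch is:

[row['k'] for row in s_df.select('k').dropDuplicates().collect()]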