How to get distinct rows in dataframe using pyspark?


I understand this is a very simple question that has most likely been answered somewhere, but as a beginner I still don't get it and would appreciate an explanation.


2 Answers

一整个雨季

2021-01-04 05:21
If df is the name of your DataFrame, there are two ways to get unique rows:
```
df2 = df.distinct()
```
or
```
df2 = df.drop_duplicates()
```

轻奢々

2021-01-04 05:35

The plain distinct() is not very flexible, because you can't restrict it to specific columns. In a simple case it is enough:

df = df.distinct()

But if the day column holds different values for the same host, distinct() will not reduce the result to one row per host:

+--------------------+---+
|                host|day|
+--------------------+---+
|   in24.inetnebr.com|  1|
|     uplherc.upl.com|  1|
|     uplherc.upl.com|  2|
|     uplherc.upl.com|  1|
|     uplherc.upl.com|  1|
|ix-esc-ca2-07.ix....|  1|
|     uplherc.upl.com|  1|
+--------------------+---+

After distinct() you get back the following:

df.distinct().show()

+--------------------+---+
|                host|day|
+--------------------+---+
|   in24.inetnebr.com|  1|
|     uplherc.upl.com|  2|
|     uplherc.upl.com|  1|
|ix-esc-ca2-07.ix....|  1|
+--------------------+---+

So you should use this instead:

df = df.dropDuplicates(['host'])

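It keeps one row per host, typically the first value of day it encounters; without an explicit ordering, Spark does not guarantee which one that is.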

If you are familiar with SQL, this also works:

df.createOrReplaceTempView("temp_table")
new_df = spark.sql("select first(host), first(day) from temp_table GROUP BY host")

+--------------------+-----------------+
|  first(host, false)|first(day, false)|
+--------------------+-----------------+
|   in24.inetnebr.com|                1|
|ix-esc-ca2-07.ix....|                1|
|     uplherc.upl.com|                1|
+--------------------+-----------------+
