I understand this is a very simple question that has most likely been answered somewhere, but as a beginner I still don't get it and am looking for your enlightenment.
If df is the name of your DataFrame, there are two ways to get unique rows:
df2 = df.distinct()
or
df2 = df.drop_duplicates()
The plain distinct() is not very user friendly, because you can't restrict it to specific columns. In this case it is enough for you:
df = df.distinct()
But if the day column contains different values for the same host, distinct() compares whole rows, so you won't get back the distinct hosts:
+--------------------+---+
|                host|day|
+--------------------+---+
|   in24.inetnebr.com|  1|
|     uplherc.upl.com|  1|
|     uplherc.upl.com|  2|
|     uplherc.upl.com|  1|
|     uplherc.upl.com|  1|
|ix-esc-ca2-07.ix....|  1|
|     uplherc.upl.com|  1|
+--------------------+---+
After distinct() you will get back the following:
df.distinct().show()
+--------------------+---+
|                host|day|
+--------------------+---+
|   in24.inetnebr.com|  1|
|     uplherc.upl.com|  2|
|     uplherc.upl.com|  1|
|ix-esc-ca2-07.ix....|  1|
+--------------------+---+
So you should use this instead:
df = df.dropDuplicates(['host'])
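This keeps one row per host. With the data above it keeps the first value of day, but in general Spark does not guarantee which of the duplicate rows survives once the data is spread over several partitions.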
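To see why the two calls give different results, here is a minimal plain-Python sketch of the semantics. It is not Spark itself: the helpers distinct_rows and drop_duplicates_by are made-up names for illustration, and the rows are copied from the table above. It assumes rows are visited in the order shown, which a Python list guarantees but a distributed DataFrame does not.

```python
# Plain-Python model of the two behaviours; no Spark required.
# distinct_rows and drop_duplicates_by are hypothetical helpers, not PySpark APIs.

rows = [
    ("in24.inetnebr.com", 1),
    ("uplherc.upl.com", 1),
    ("uplherc.upl.com", 2),
    ("uplherc.upl.com", 1),
    ("uplherc.upl.com", 1),
    ("ix-esc-ca2-07.ix....", 1),
    ("uplherc.upl.com", 1),
]


def distinct_rows(rows):
    """Model of df.distinct(): compare whole rows, keep the first occurrence."""
    seen = set()
    out = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out


def drop_duplicates_by(rows, key_index):
    """Model of df.dropDuplicates([col]): compare a single column only."""
    seen = set()
    out = []
    for row in rows:
        key = row[key_index]
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out


full_row_result = distinct_rows(rows)
by_host_result = drop_duplicates_by(rows, key_index=0)

print(len(full_row_result))  # 4: uplherc.upl.com survives with day 1 and day 2
print(len(by_host_result))   # 3: exactly one row per host
```

The only difference between the two helpers is what goes into the seen set: the whole row, or just the host. That is the same distinction as distinct() versus dropDuplicates(['host']).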
If you are familiar with SQL, this will also work for you:
df.createOrReplaceTempView("temp_table")
new_df = spark.sql("select first(host), first(day) from temp_table GROUP BY host")
+--------------------+-----------------+
|  first(host, false)|first(day, false)|
+--------------------+-----------------+
|   in24.inetnebr.com|                1|
|ix-esc-ca2-07.ix....|                1|
|     uplherc.upl.com|                1|
+--------------------+-----------------+
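If it matters which day is kept for each host, an explicit aggregate such as MIN makes the choice deterministic. The sketch below uses Python's built-in sqlite3 module as a stand-in for Spark SQL so it runs without a cluster; the table name temp_table matches the view above, and the rows are the sample data from this answer. The SELECT is plain standard SQL, so it should behave the same way when passed to spark.sql, but treat that as an assumption to check on your own setup.

```python
import sqlite3

# In-memory SQLite database standing in for the Spark temp view.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temp_table (host TEXT, day INTEGER)")
conn.executemany(
    "INSERT INTO temp_table VALUES (?, ?)",
    [
        ("in24.inetnebr.com", 1),
        ("uplherc.upl.com", 1),
        ("uplherc.upl.com", 2),
        ("uplherc.upl.com", 1),
        ("uplherc.upl.com", 1),
        ("ix-esc-ca2-07.ix....", 1),
        ("uplherc.upl.com", 1),
    ],
)

# One row per host; MIN(day) picks the smallest day regardless of row order.
query = """
    SELECT host, MIN(day) AS day
    FROM temp_table
    GROUP BY host
    ORDER BY host
"""
result = conn.execute(query).fetchall()
conn.close()

for host, day in result:
    print(host, day)
```

Swapping MIN for MAX keeps the latest day instead. Either way the result no longer depends on the order in which rows happen to be read.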