Question:
Say I have a dataframe like so:
ID  Media
1   imgix.com/20830dk
2   imgix.com/202398pwe
3   imgix.com/lvw0923dk
4   imgix.com/082kldcm
4   imgix.com/lks032m
4   imgix.com/903248
I'd like to end up with:
ID  Media
1   imgix.com/20830dk
2   imgix.com/202398pwe
3   imgix.com/lvw0923dk
4   imgix.com/082kldcm
Even though that causes me to lose two links for ID = 4, I don't care. Is there a simple way to do this in Python/PySpark?
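For reference, a minimal sketch that reproduces the example data in a local PySpark session (the SparkSession setup and variable names are assumptions for a self-contained example, not part of the original question):

from pyspark.sql import SparkSession

# Assumed local session so the snippets below can run standalone
spark = SparkSession.builder.master('local[*]').appName('dedupe-example').getOrCreate()

df = spark.createDataFrame(
    [(1, 'imgix.com/20830dk'),
     (2, 'imgix.com/202398pwe'),
     (3, 'imgix.com/lvw0923dk'),
     (4, 'imgix.com/082kldcm'),
     (4, 'imgix.com/lks032m'),
     (4, 'imgix.com/903248')],
    ['ID', 'Media'])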
Answer 1:
- Group by on col('ID')
- Use collect_list with agg to aggregate the list
- Call getItem(0) to extract the first element from the aggregated list
from pyspark.sql.functions import collect_list

df.groupBy('ID').agg(collect_list('Media').getItem(0).alias('Media')).show()
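Note that collect_list does not guarantee any ordering, so for ID = 4 which of the three links survives is arbitrary. A sketch of an equivalent aggregate using first, which avoids materializing the whole list (this alternative is my addition, not part of the original answer):

from pyspark.sql.functions import first

# first() picks one value per group; like collect_list without an
# explicit ordering, the choice is non-deterministic
df.groupBy('ID').agg(first('Media').alias('Media')).show()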
Answer 2:
Anton and pault are correct:
df.drop_duplicates(subset=['ID'])
does indeed work
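A minimal usage sketch against the example data built above (the variable name is mine):

# Keeps one arbitrary row per ID; which Media value survives is not guaranteed
df_unique = df.drop_duplicates(subset=['ID'])
df_unique.show()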
Source: https://stackoverflow.com/questions/50685522/remove-duplicate-rows-regardless-of-new-information-pyspark