Question
What I have:
df
Name |Vehicle
Dave |Car
Mark |Bike
Steve|Car
Dave |
Steve|
I want to drop duplicates from the Name column, but only if the corresponding value in the Vehicle column is null. I know I can use
df.drop_duplicates(subset=['Name'])
with keep set to either 'first' or 'last', but what I am looking for is a way to drop duplicates from the Name column only where the corresponding value of the Vehicle column is null. So basically, keep the Name if the Vehicle column is NOT null and drop the rest. If a name does not have a duplicate, then keep that row even if the corresponding value in Vehicle is null.
Many Thanks
Answer 1:
I think you need to chain two masks with bitwise AND (&), using Series.notna and Series.duplicated:
m1 = df['Vehicle'].notna()
m2 = ~df['Name'].duplicated()
df1 = df[m1 & m2]
print(df1)
    Name Vehicle
0   Dave     Car
1   Mark    Bike
2  Steve     Car
If you want to do these operations separately, first remove all rows with NaN in Vehicle and then remove the duplicates, which avoids testing for duplicates among the NaN rows (if necessary):
df2 = df.dropna(subset=['Vehicle']).drop_duplicates('Name')
print(df2)
    Name Vehicle
0   Dave     Car
1   Mark    Bike
2  Steve     Car
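The question also asks to keep a row whose Name has no duplicate even if its Vehicle is null. With the masks above such a row would be dropped, because the notna mask is False for it. A minimal sketch of one way to cover that case, assuming a hypothetical extra row "Anna" (not in the original data) to show the behavior:

```python
import pandas as pd

# Sample data from the question, plus a hypothetical "Anna" row
# (unique Name, null Vehicle) added only for illustration.
df = pd.DataFrame({
    "Name": ["Dave", "Mark", "Steve", "Dave", "Steve", "Anna"],
    "Vehicle": ["Car", "Bike", "Car", None, None, None],
})

# True where Vehicle is missing
is_null = df["Vehicle"].isna()
# True for every row whose Name occurs more than once
is_dup_name = df["Name"].duplicated(keep=False)

# Drop a row only when Vehicle is null AND the Name occurs elsewhere
result = df[~(is_null & is_dup_name)]
print(result)
```

This keeps Dave, Mark and Steve with their vehicles, plus Anna with a null Vehicle. Two limitations of this sketch: if every row for a repeated Name has a null Vehicle, all of those rows are dropped, and if a Name has several non-null rows, all of them are kept (a further drop_duplicates('Name') would reduce them to one).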
Answer 2:
This will filter out both None and empty values (provided the name has at least one non-None, non-empty value), keeping for each Name the first Vehicle value after sorting in descending order:
import pandas as pd
df = pd.DataFrame({"Name": ["Dave", "Mark", "Steve", "Dave", "Steve"], "Vehicle": ["Car", "Bike", "Car", None, ""]})
res = df.sort_values("Vehicle", ascending=False).groupby("Name")["Vehicle"].first().reset_index()
Output:
    Name Vehicle
0   Dave     Car
1   Mark    Bike
2  Steve     Car
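The sort-and-groupby approach orders the result by Name and, when a name has several vehicles, keeps the alphabetically last one. If the original row order and the first occurrence matter, one alternative sketch is to turn empty strings into NaN first and then reuse dropna and drop_duplicates. The variable names here are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Dave", "Mark", "Steve", "Dave", "Steve"],
    "Vehicle": ["Car", "Bike", "Car", None, ""],
})

# Replace empty strings with NaN so they are treated like missing values
vehicle = df["Vehicle"].mask(df["Vehicle"] == "")

# Drop rows without a Vehicle, then keep the first row for each Name
res2 = (
    df.assign(Vehicle=vehicle)
      .dropna(subset=["Vehicle"])
      .drop_duplicates("Name")
)
print(res2)
```

Unlike the groupby version, this drops a Name entirely when all of its Vehicle values are missing or empty.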
Source: https://stackoverflow.com/questions/59532750/drop-duplicate-if-the-value-in-another-column-is-null-pandas