Question
I'd like to remove duplicates as well as their corresponding, original values from a dataframe.
Sframe is the name of my dataframe. The fields on which I want to check for duplicates are 'TermName', 'SchoolName', and 'StudentID'.
Here's an example of what I'm starting with:
TermName SchoolName StudentID
14-15 a 1
14-15 a 1
14-15 a 1
14-15 b 2
14-15 b 2
14-15 b 3
14-15 c 4
14-15 c 5
14-15 d 6
14-15 e 7
14-15 f 8
Here's what I'm looking for:
TermName SchoolName StudentID
14-15 a 1
14-15 a 1
14-15 a 1
14-15 b 2
14-15 b 2
@Jubbles showed me how to drop every row that has a duplicate, keeping only the rows that occur exactly once (i.e. the last 6 rows in my first table example above), with this:
# True for every repeat after the first occurrence
Sframe['dup_check_1'] = Sframe.duplicated(cols = ['TermName', 'SchoolName', 'StudentID'], take_last = False)
# True for every occurrence before the last one
Sframe['dup_check_2'] = Sframe.duplicated(cols = ['TermName', 'SchoolName', 'StudentID'], take_last = True)
# keep only the rows flagged by neither check (rows that have no duplicate)
Sframe = Sframe[(Sframe['dup_check_1'] == False) & (Sframe['dup_check_2'] == False)]
# delete the helper columns used for the duplicate checks
del Sframe['dup_check_1'], Sframe['dup_check_2']
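For anyone on a recent pandas release, here is a minimal sketch of the same filter. It assumes the `cols` and `take_last` arguments are no longer accepted and that `subset` and `keep` take their place. The sample frame is rebuilt from the first table above, and `key_cols` and `unique_rows` are names made up for the illustration.

```python
import pandas as pd

# Sample data rebuilt from the first table in the question.
Sframe = pd.DataFrame({
    "TermName": ["14-15"] * 11,
    "SchoolName": ["a", "a", "a", "b", "b", "b", "c", "c", "d", "e", "f"],
    "StudentID": [1, 1, 1, 2, 2, 3, 4, 5, 6, 7, 8],
})

# Hypothetical helper name for the columns that define a duplicate.
key_cols = ["TermName", "SchoolName", "StudentID"]

# keep="first" leaves the first occurrence unflagged (the old take_last=False).
dup_check_1 = Sframe.duplicated(subset=key_cols, keep="first")
# keep="last" leaves the last occurrence unflagged (the old take_last=True).
dup_check_2 = Sframe.duplicated(subset=key_cols, keep="last")

# Rows flagged by neither check occur exactly once in the frame.
unique_rows = Sframe[~dup_check_1 & ~dup_check_2]
print(unique_rows)
```

Keeping the two flags as standalone Series instead of adding them as columns means there is nothing to delete afterwards.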
I tried to get the rows excluded by the above code (and not the rows included above) by changing False to True here:
Sframe = Sframe[(Sframe['dup_check_1'] == True) & (Sframe['dup_check_2'] == True)]
...but it didn't work. It seems like a simple change in code from False to True, but it does not return the correct number of rows (only 6 instead of 354).
Any ideas?
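One possible explanation, sketched on the sample data from the first table: requiring both checks to be True keeps only rows that are neither the first nor the last occurrence in their group, so groups of two vanish entirely and groups of three contribute only their middle row. Combining the checks with `|` instead of `&`, or calling `duplicated` with `keep=False`, should return every row that has a duplicate. The sketch assumes a pandas version that accepts `subset` and `keep`. The names `key_cols`, `first_flag`, `last_flag`, `both`, `either`, and `all_dups` are made up for the illustration.

```python
import pandas as pd

# Sample data rebuilt from the first table in the question.
Sframe = pd.DataFrame({
    "TermName": ["14-15"] * 11,
    "SchoolName": ["a", "a", "a", "b", "b", "b", "c", "c", "d", "e", "f"],
    "StudentID": [1, 1, 1, 2, 2, 3, 4, 5, 6, 7, 8],
})
key_cols = ["TermName", "SchoolName", "StudentID"]

# Flags every occurrence except the first one in each group.
first_flag = Sframe.duplicated(subset=key_cols, keep="first")
# Flags every occurrence except the last one in each group.
last_flag = Sframe.duplicated(subset=key_cols, keep="last")

# AND: only rows that are neither first nor last in their group survive.
both = Sframe[first_flag & last_flag]

# OR: every row that belongs to a group with more than one member.
either = Sframe[first_flag | last_flag]

# keep=False flags all members of a duplicated group in a single call.
all_dups = Sframe[Sframe.duplicated(subset=key_cols, keep=False)]

print(len(both), len(either), len(all_dups))
```

On this sample, `either` and `all_dups` both match the five rows in the second table, while `both` holds only the middle row of the `a 1` group.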
Source: https://stackoverflow.com/questions/27735618/python-3-4-remove-duplicates-and-their-corresponding-values