Python 3.4 Remove duplicates and their corresponding values

Question

I'd like to extract the duplicated rows, together with the original rows they duplicate, from a dataframe.

Sframe is the name of my dataframe. The fields on which I want to check for duplicates are 'TermName', 'SchoolName', and 'StudentID'.

Here's an example of what I'm starting with:

TermName SchoolName StudentID
14-15   a   1
14-15   a   1
14-15   a   1
14-15   b   2
14-15   b   2
14-15   b   3
14-15   c   4
14-15   c   5
14-15   d   6
14-15   e   7
14-15   f   8

Here's what I'm looking for:

TermName SchoolName StudentID
14-15   a   1
14-15   a   1
14-15   a   1
14-15   b   2
14-15   b   2

@Jubbles showed me how to flag the duplicated rows and keep only the rows that have no duplicates (i.e. the last 6 rows in my first example table above) with this:

# Flag every repeat after the first occurrence
# (older pandas releases spelled these arguments cols= and take_last=)
Sframe['dup_check_1'] = Sframe.duplicated(subset=['TermName', 'SchoolName', 'StudentID'], keep='first')
# Flag every repeat before the last occurrence
Sframe['dup_check_2'] = Sframe.duplicated(subset=['TermName', 'SchoolName', 'StudentID'], keep='last')
# Keep only the rows flagged by neither check, i.e. rows with no duplicates
Sframe = Sframe[(Sframe['dup_check_1'] == False) & (Sframe['dup_check_2'] == False)]
# Delete the helper columns
del Sframe['dup_check_1'], Sframe['dup_check_2']
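
To see what this filter keeps, here is a minimal sketch run on the sample table above. It assumes a recent pandas in which `duplicated` accepts `subset` and `keep`. The names `df`, `key_cols`, `not_first`, and `not_last` are illustrative stand-ins, not part of the original code.

```python
import pandas as pd

# Sample data copied from the first table in the question
df = pd.DataFrame({
    "TermName": ["14-15"] * 11,
    "SchoolName": ["a", "a", "a", "b", "b", "b", "c", "c", "d", "e", "f"],
    "StudentID": [1, 1, 1, 2, 2, 3, 4, 5, 6, 7, 8],
})
key_cols = ["TermName", "SchoolName", "StudentID"]

# True for every repeat after the first occurrence
not_first = df.duplicated(subset=key_cols, keep="first")
# True for every repeat before the last occurrence
not_last = df.duplicated(subset=key_cols, keep="last")

# Rows flagged by neither check have no duplicate at all
unique_rows = df[~not_first & ~not_last]
print(unique_rows["StudentID"].tolist())  # [3, 4, 5, 6, 7, 8]
```

On this data the filter keeps the six rows for students 3 through 8, which matches the "last 6 rows" described above.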

I tried to get the rows excluded by the above code (and not the rows included above) by changing False to True here:

Sframe = Sframe[(Sframe['dup_check_1'] == True) & (Sframe['dup_check_2'] == True)]

...but it didn't work. It seems like a simple change in code from False to True, but it does not return the correct number of rows (only 6 instead of 354).

Any ideas?
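
One likely explanation, sketched on the sample table: with `== True` on both sides of `&`, a row must be flagged by both checks, so only copies that are neither the first nor the last in their group survive. Combining the two flags with `|`, or passing `keep=False` (available in recent pandas), flags every member of a duplicated group. The names `df`, `key_cols`, and the result variables are illustrative assumptions.

```python
import pandas as pd

# Sample data copied from the first table in the question
df = pd.DataFrame({
    "TermName": ["14-15"] * 11,
    "SchoolName": ["a", "a", "a", "b", "b", "b", "c", "c", "d", "e", "f"],
    "StudentID": [1, 1, 1, 2, 2, 3, 4, 5, 6, 7, 8],
})
key_cols = ["TermName", "SchoolName", "StudentID"]

not_first = df.duplicated(subset=key_cols, keep="first")
not_last = df.duplicated(subset=key_cols, keep="last")

# AND: only copies that are neither first nor last (the middle "a, 1" row here)
both_flags = df[not_first & not_last]

# OR: every row that belongs to a duplicated group
either_flag = df[not_first | not_last]

# keep=False marks all members of a duplicated group in a single call
all_dups = df[df.duplicated(subset=key_cols, keep=False)]

print(len(both_flags), len(either_flag), len(all_dups))  # 1 5 5
```

On the sample data, the `|` and `keep=False` variants both return the five rows shown in the desired output, while the `&` variant returns a single row.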

Source: https://stackoverflow.com/questions/27735618/python-3-4-remove-duplicates-and-their-corresponding-values

Tags

python

dataframe

duplicate-removal