Question
I have a dataframe like this:
import pandas as pd
dic = {'A': [100, 200, 250, 300],
       'B': ['ci', 'ci', 'po', 'pa'],
       'C': ['s', 't', 'p', 'w']}
df = pd.DataFrame(dic)
My goal is to separate the rows into 2 dataframes:
- df1 = contains all the rows that do not repeat values along column B (unique rows).
- df2 = contains only the rows that repeat themselves.
The result should look like this:
df1 =      A   B  C      df2 =      A   B  C
      0  250  po  p            0  100  ci  s
      1  300  pa  w            1  200  ci  t
Note:
- the dataframes could in general be very big and have many values that repeat in column B, so the answer should be as generic as possible
- if there are no duplicates, df2 should be empty and all the rows should end up in df1
Answer 1:
You can use Series.duplicated with the parameter keep=False to create a mask that marks all duplicates, then use boolean indexing, with ~ to invert the mask:
mask = df.B.duplicated(keep=False)
print(mask)
0     True
1     True
2    False
3    False
Name: B, dtype: bool
print(df[mask])
     A   B  C
0  100  ci  s
1  200  ci  t
print(df[~mask])
     A   B  C
2  250  po  p
3  300  pa  w
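To get the two dataframes the question asks for, the mask approach above could be wrapped in a small helper. This is only a sketch: split_by_duplicates is a hypothetical name, not part of pandas, and the reset_index(drop=True) call assumes you want each result renumbered from 0, as in the question's expected output.

```python
import pandas as pd


def split_by_duplicates(df, column):
    """Hypothetical helper: return (unique_rows, repeated_rows) for `column`."""
    # keep=False marks every member of a duplicated group, not only the later ones
    mask = df[column].duplicated(keep=False)
    # reset_index(drop=True) renumbers each result from 0 (an assumption about the desired layout)
    return df[~mask].reset_index(drop=True), df[mask].reset_index(drop=True)


df = pd.DataFrame({'A': [100, 200, 250, 300],
                   'B': ['ci', 'ci', 'po', 'pa'],
                   'C': ['s', 't', 'p', 'w']})
df1, df2 = split_by_duplicates(df, 'B')

# When no value repeats in the column, the second frame comes back empty
no_dups = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y'], 'C': ['u', 'v']})
only_unique, empty = split_by_duplicates(no_dups, 'B')
```

Because the mask is computed in a single vectorized pass over the column, the same call should work unchanged on large dataframes, and the no-duplicates case needs no special handling.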
Source: https://stackoverflow.com/questions/41042996/how-to-select-duplicate-rows-with-pandas