Question
This is my dataframe:
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 3, 4, 5, 3, 3],
                   'B': [0, 2, 3, 4, 5, 6, 7],
                   'C': [[1, 4, 4, 4], [1, 4, 4, 4], [3, 4, 4, 5], [3, 4, 4, 5], [4, 4, 2, 1], [1, 2, 3, 4], [7, 8, 9, 1]]})
I want to drop the duplicate values inside each list in column C (per row), but not drop duplicate rows.
This is what I hope to get:
pd.DataFrame({'A': [1, 3, 3, 4, 5, 3, 3],
              'B': [0, 2, 3, 4, 5, 6, 7],
              'C': [[1, 4], [1, 4], [3, 4, 5], [3, 4, 5], [4, 2, 1], [1, 2, 3, 4], [7, 8, 9, 1]]})
Answer 1:
If you're using Python 3.7+, you can map with dict.fromkeys and obtain a list from the dictionary keys (the version is relevant because dicts maintain insertion order starting from 3.7):
df['C'] = df.C.map(lambda x: list(dict.fromkeys(x).keys()))
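To see why this works: dict.fromkeys builds a dict whose keys are the list elements in order of first appearance (all values are None), so listing the keys yields the de-duplicated list. A quick illustration:

x = [1, 4, 4, 4]
dict.fromkeys(x)               # {1: None, 4: None}
list(dict.fromkeys(x).keys())  # [1, 4]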
For older Python versions you can use collections.OrderedDict:

from collections import OrderedDict

df['C'] = df.C.map(lambda x: list(OrderedDict.fromkeys(x).keys()))
print(df)
   A  B             C
0  1  0        [1, 4]
1  3  2        [1, 4]
2  3  3     [3, 4, 5]
3  4  4     [3, 4, 5]
4  5  5     [4, 2, 1]
5  3  6  [1, 2, 3, 4]
6  3  7  [7, 8, 9, 1]
As mentioned by cs95 in the comments, if we don't need to preserve order we can go with a set for a more concise approach:
df['C'] = df.C.map(lambda x: [*{*x}])
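The [*{*x}] spelling just unpacks x into a set and then back into a list; it is equivalent to list(set(x)), with no guarantee about the resulting order:

x = [4, 4, 2, 1]
[*{*x}]       # same elements as list(set(x)); a set's iteration order is not guaranteed
list(set(x))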
Since several approaches have been proposed and it is hard to tell how they will perform on large dataframes, it is probably worth benchmarking them with perfplot:
import numpy as np
import perfplot

df = pd.concat([df] * 50000, axis=0).reset_index(drop=True)

perfplot.show(
    setup=lambda n: df.iloc[:int(n)],
    kernels=[
        lambda df: df.C.map(lambda x: list(dict.fromkeys(x).keys())),
        lambda df: df['C'].map(lambda x: pd.factorize(x)[1]),
        lambda df: [np.unique(item) for item in df['C'].values],
        lambda df: df['C'].explode().groupby(level=0).unique(),
        lambda df: df.C.map(lambda x: [*{*x}]),
    ],
    labels=['dict.fromkeys', 'factorize', 'np.unique', 'explode', 'set'],
    n_range=[2**k for k in range(0, 18)],
    xlabel='N',
    equality_check=None
)
Answer 2:
If order is of no importance, you can take the column's underlying NumPy array and apply np.unique to each row's list in a list comprehension (note that np.unique returns the values sorted):
import numpy as np
df['C_Unique'] = [np.unique(item) for item in df['C'].values]
print(df)
   A  B             C      C_Unique
0  1  0  [1, 4, 4, 4]        [1, 4]
1  3  2  [1, 4, 4, 4]        [1, 4]
2  3  3  [3, 4, 4, 5]     [3, 4, 5]
3  4  4  [3, 4, 4, 5]     [3, 4, 5]
4  5  5  [4, 4, 2, 1]     [1, 2, 4]
5  3  6  [1, 2, 3, 4]  [1, 2, 3, 4]
6  3  7  [7, 8, 9, 1]  [1, 7, 8, 9]
Another method would be to use explode and groupby.unique:
df['CExplode'] = df['C'].explode().groupby(level=0).unique()
   A  B             C      C_Unique      CExplode
0  1  0  [1, 4, 4, 4]        [1, 4]        [1, 4]
1  3  2  [1, 4, 4, 4]        [1, 4]        [1, 4]
2  3  3  [3, 4, 4, 5]     [3, 4, 5]     [3, 4, 5]
3  4  4  [3, 4, 4, 5]     [3, 4, 5]     [3, 4, 5]
4  5  5  [4, 4, 2, 1]     [1, 2, 4]     [4, 2, 1]
5  3  6  [1, 2, 3, 4]  [1, 2, 3, 4]  [1, 2, 3, 4]
6  3  7  [7, 8, 9, 1]  [1, 7, 8, 9]  [7, 8, 9, 1]
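How this works: explode puts each list element on its own row while keeping the original index, and groupby(level=0).unique() then collects the unique values back per original row in order of first appearance. A quick look at the intermediate step, assuming C still holds the original lists from the question:

df['C'].explode().head(8)
# 0    1
# 0    4
# 0    4
# 0    4
# 1    1
# 1    4
# 1    4
# 1    4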
Answer 3:
You can use the apply function in pandas (like the set approach above, this does not preserve the original order within each list):
df['C'] = df['C'].apply(lambda x: list(set(x)))
Answer 4:
map and factorize
Let's throw one more into the mix.
df['C'].map(pd.factorize).str[1]
0          [1, 4]
1          [1, 4]
2       [3, 4, 5]
3       [3, 4, 5]
4       [4, 2, 1]
5    [1, 2, 3, 4]
6    [7, 8, 9, 1]
Name: C, dtype: object
Or,
df['C'].map(lambda x: pd.factorize(x)[1])
0          [1, 4]
1          [1, 4]
2       [3, 4, 5]
3       [3, 4, 5]
4       [4, 2, 1]
5    [1, 2, 3, 4]
6    [7, 8, 9, 1]
Name: C, dtype: object
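Both variants rely on pd.factorize returning a (codes, uniques) tuple, where uniques holds the values in order of first appearance; indexing with [1] (or .str[1] on a Series of tuples) keeps just the uniques:

codes, uniques = pd.factorize([1, 4, 4, 4])
codes    # array([0, 1, 1, 1])
uniques  # array([1, 4])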
Source: https://stackoverflow.com/questions/62872266/drop-duplicate-list-elements-in-column-of-lists