Question
How do I get the unique values of a column of lists in pandas or numpy, such that the second column would result in
'action', 'crime', 'drama'?
The closest (but non-functional) solutions I could come up with were:
genres = data['Genre'].unique()
But this predictably results in a TypeError saying how lists aren't hashable.
TypeError: unhashable type: 'list'
set seemed like a good idea:
genres = data.apply(set(), columns=['Genre'], axis=1)
but this also results in a
TypeError: set() takes no keyword arguments
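For reference, a minimal sketch of a DataFrame shaped like the one the question describes (the column names and movie titles are illustrative assumptions, not taken from the original data):

```python
import pandas as pd

# Hypothetical data: a 'Genre' column whose cells are Python lists,
# so that the unique values are 'action', 'crime', 'drama'.
data = pd.DataFrame({
    'Title': ['The Godfather: Part II', 'The Dark Knight'],
    'Genre': [['crime', 'drama'], ['action', 'crime', 'drama']],
})
print(data)
```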
Answer 1:
If you only want to find the unique values, I'd recommend using itertools.chain.from_iterable to concatenate all those lists:
import itertools
>>> np.unique([*itertools.chain.from_iterable(df.Genre)])
array(['action', 'crime', 'drama'], dtype='<U6')
Or even faster
>>> set(itertools.chain.from_iterable(df.Genre))
{'action', 'crime', 'drama'}
Timings
df = pd.DataFrame({'Genre':[['crime','drama'],['action','crime','drama']]})
df = pd.concat([df]*10000)
%timeit set(itertools.chain.from_iterable(df.Genre))
100 loops, best of 3: 2.55 ms per loop
%timeit set([x for y in df['Genre'] for x in y])
100 loops, best of 3: 4.09 ms per loop
%timeit np.unique([*itertools.chain.from_iterable(df.Genre)])
100 loops, best of 3: 12.8 ms per loop
%timeit np.unique(df['Genre'].sum())
1 loop, best of 3: 1.65 s per loop
%timeit set(df['Genre'].sum())
1 loop, best of 3: 1.66 s per loop
Answer 2:
You can use explode:
data = pd.DataFrame([
    {
        "title": "The Godfather: Part II",
        "genres": ["crime", "drama"],
        "director": "Francis Ford Coppola"
    },
    {
        "title": "The Dark Knight",
        "genres": ["action", "crime", "drama"],
        "director": "Christopher Nolan"
    }
])
# Changed from data.explode("genres")["genres"].unique() as suggested by rafaelc
data["genres"].explode().unique()
Results in:
array(['crime', 'drama', 'action'], dtype=object)
Answer 3:
Here are some options:
# toy data
df = pd.DataFrame({'Genre':[['crime','drama'],['action','crime','drama']]})
np.unique(df['Genre'].sum())
# 109 µs ± 2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
set(df['Genre'].sum())
# 87 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
set([x for y in df['Genre'] for x in y])
# 11.8 µs ± 126 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Answer 4:
If you're just looking to extract the information and not add back to the DataFrame, you can utilize Python's set method in a for loop:
import pandas as pd
df = pd.DataFrame({'movie':[[1,2,3],[1,2,6]]})
out = set()
for row in df['movie']:
    out.update({item for item in row})
print(out)
You could also wrap this in an apply call if you wanted (which would return None for each row but update the set in place):
out = set()
df['movie'].apply(lambda x: out.update({item for item in x}))
Personally I think the for loop is a bit clearer to read.
Answer 5:
Not sure if it's exactly what you wanted, but this will allow you to convert it into a set.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Movie':['The Godfather', 'Dark Knight'], 'Genre': [['Crime', 'Drama'],['Crime', 'Drama', 'Action']]})
genres = []
for sublist in df['Genre']:
    for item in sublist:
        genres.append(item)
genre_set = set(genres)
print(genre_set)
Output: {'Action', 'Drama', 'Crime'}
Answer 6:
Use the power of sets for chained uniqueness. I've used this technique with huge lists in big-data-like environments. The main advantage is cutting down the time needed to produce a final flat list.
- Convert each list in the column into a set
- Reduce all the sets into a final set, using union
Try:
from functools import reduce # for python 3
l = df.Genre.dropna().tolist()
sets = [ set(i) for i in l ]
final_set = reduce(lambda x, y: x.union(y), sets)
- In big-data environments like Spark, use map to convert each list into a set, then reduce as above.
- Change union to intersection if you need the values common to all lists.
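The intersection variant mentioned above can be sketched as follows (using the same toy df as the timing answers; the column name is assumed):

```python
from functools import reduce
import pandas as pd

df = pd.DataFrame({'Genre': [['crime', 'drama'], ['action', 'crime', 'drama']]})

# Convert each list into a set, then reduce with intersection instead
# of union to keep only the genres present in every row.
sets = [set(i) for i in df.Genre.dropna()]
common = reduce(lambda x, y: x.intersection(y), sets)
print(common)  # the genres shared by all rows (set, so order may vary)
```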
Source: https://stackoverflow.com/questions/58528989/pandas-get-unique-values-from-column-of-lists