pandas get unique values from column of lists

一曲冷凌霜 submitted on 2021-02-07 12:37:41


How do I get the unique values of a column of lists in pandas or NumPy, such that the second column from

would result in 'action', 'crime', 'drama'.

The closest (but non-functional) solutions I could come up with were:

 genres = data['Genre'].unique()

But this predictably results in a TypeError, because lists aren't hashable.

TypeError: unhashable type: 'list'
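
For reference, here is a minimal sketch that reproduces the error. The frame below is an assumed stand-in for the data in the question (the column names and titles are made up for illustration), so treat it as a reproduction aid rather than the original data.

```python
import pandas as pd

# Hypothetical stand-in for the question's data: one column holds lists.
data = pd.DataFrame({
    "Title": ["Movie A", "Movie B"],  # made-up titles
    "Genre": [["crime", "drama"], ["action", "crime", "drama"]],
})

try:
    # unique() hashes each cell, and a list cannot be hashed.
    data["Genre"].unique()
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

The failure comes from the cells being lists, not from the number of rows, so any approach has to flatten or convert the lists before deduplicating.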

Using a set seemed to be a good idea, but

genres = data.apply(set(), columns=['Genre'], axis=1)

also results in a TypeError: set() takes no keyword arguments


If you only want to find the unique values, I'd recommend using itertools.chain.from_iterable to concatenate all those lists

import itertools
import numpy as np

>>> np.unique([*itertools.chain.from_iterable(df.Genre)])
array(['action', 'crime', 'drama'], dtype='<U6')

Or even faster

>>> set(itertools.chain.from_iterable(df.Genre))
{'action', 'crime', 'drama'}


df = pd.DataFrame({'Genre':[['crime','drama'],['action','crime','drama']]})
df = pd.concat([df]*10000)

%timeit set(itertools.chain.from_iterable(df.Genre))
100 loops, best of 3: 2.55 ms per loop
%timeit set([x for y in df['Genre'] for x in y])
100 loops, best of 3: 4.09 ms per loop

%timeit np.unique([*itertools.chain.from_iterable(df.Genre)])
100 loops, best of 3: 12.8 ms per loop

%timeit np.unique(df['Genre'].sum())
1 loop, best of 3: 1.65 s per loop

%timeit set(df['Genre'].sum())
1 loop, best of 3: 1.66 s per loop
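
The %timeit lines above only work in IPython/Jupyter. As a rough sketch, the same comparison can be run from a plain script with the standard timeit module. The `candidates` dictionary and its labels are my own naming, and I have left out the two `.sum()` variants because they are slow; absolute numbers will differ by machine and library version.

```python
import itertools
import timeit

import numpy as np
import pandas as pd

# Same setup as the timings above: two rows repeated 10000 times.
df = pd.DataFrame({"Genre": [["crime", "drama"], ["action", "crime", "drama"]]})
df = pd.concat([df] * 10000)

# Hypothetical labels mapped to the expressions being compared.
candidates = {
    "set + chain": lambda: set(itertools.chain.from_iterable(df["Genre"])),
    "set + comprehension": lambda: set([x for y in df["Genre"] for x in y]),
    "np.unique + chain": lambda: np.unique([*itertools.chain.from_iterable(df["Genre"])]),
}

# Sanity check: collect each result as a set of plain strings.
results = {name: {str(v) for v in fn()} for name, fn in candidates.items()}

for name, fn in candidates.items():
    runs = 20
    seconds = timeit.timeit(fn, number=runs) / runs
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```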


You can use explode:

import pandas as pd

data = pd.DataFrame([
    {
        "title": "The Godfather: Part II",
        "genres": ["crime", "drama"],
        "director": "Francis Ford Coppola",
    },
    {
        "title": "The Dark Knight",
        "genres": ["action", "crime", "drama"],
        "director": "Christopher Nolan",
    },
])
# Changed from data.explode("genres")["genres"].unique() as suggested by rafaelc
data["genres"].explode().unique()

Results in:

array(['crime', 'drama', 'action'], dtype=object)
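
One thing to watch with explode: a row with an empty list becomes NaN, which would then show up among the unique values. Here is a small sketch of dropping it first. The third row is a hypothetical addition to show the edge case, and Series.explode needs pandas 0.25 or newer.

```python
import pandas as pd

data = pd.DataFrame({
    # The last row is hypothetical, added only to show the empty-list case.
    "title": ["The Godfather: Part II", "The Dark Knight", "Untitled"],
    "genres": [["crime", "drama"], ["action", "crime", "drama"], []],
})

# Each list element gets its own row; the empty list becomes a single NaN.
exploded = data["genres"].explode()

# Drop the NaN before collecting unique values (order of first appearance).
unique_genres = exploded.dropna().unique()
print(unique_genres)
```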


Here are some options:

# toy data
df = pd.DataFrame({'Genre':[['crime','drama'],['action','crime','drama']]})

# 109 µs ± 2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# 87 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

set([x for y in df['Genre'] for x in y])
# 11.8 µs ± 126 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


If you're just looking to extract the information and not add it back to the DataFrame, you can update a Python set in a for loop:

import pandas as pd
df = pd.DataFrame({'movie':[[1,2,3],[1,2,6]]})
out = set()
for row in df['movie']:
    out.update({item for item in row})

You could also wrap this in an apply call if you wanted (each call returns None, so the result is a Series of None, but the set is updated in place):

out = set()
df['movie'].apply(lambda x: out.update({item for item in x}))

Personally I think the for loop is a bit clearer to read.


Not sure if it's exactly what you wanted, but this will allow you to convert it into a set.

import pandas as pd
import numpy as np

df = pd.DataFrame({'Movie':['The Godfather', 'Dark Knight'], 'Genre': [['Crime', 'Drama'],['Crime', 'Drama', 'Action']]})

genres = []
for sublist in df['Genre']:
    for item in sublist:
        genres.append(item)

genre_set = set(genres)


Output: {'Action', 'Drama', 'Crime'}


Use the power of sets for chained uniqueness. I've used this technique with huge lists in big-data environments. The main advantage is that it cuts down the time needed to produce a final flat list.

  1. Convert the list-column into sets
  2. Reduce all sets into a final set, using union


from functools import reduce # for python 3

l = df.Genre.dropna().tolist()
sets = [ set(i) for i in l ]
final_set = reduce(lambda x, y: x.union(y), sets)
  • In big-data environments such as Spark, use map to convert each list into a set, then reduce as above.
  • Change union to intersection if you need the values common to all lists.
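
As a sketch of the union/intersection point, the same result can be had without reduce by unpacking the sets into set.union or set.intersection. The frame below is assumed for illustration, including the None row that dropna() removes.

```python
import pandas as pd

# Hypothetical data; the None row shows why dropna() is applied first.
df = pd.DataFrame({
    "Genre": [["crime", "drama"], ["action", "crime", "drama"], None],
})

# One set per non-missing row.
sets = [set(genres) for genres in df["Genre"].dropna()]

# Union: every value that appears in any list.
all_genres = set().union(*sets)

# Intersection: only values present in every list (guard against no rows).
common_genres = set.intersection(*sets) if sets else set()

print(all_genres)
print(common_genres)
```

Starting from an empty set for the union also means an empty column yields an empty set, whereas reduce without an initial value raises a TypeError on an empty list.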

