Let's say my DataFrame df is created like this:
df = pd.DataFrame({"title" : ["Robin Hood", "Madagaskar"],
"genres" :
Since pandas 0.25.0 there is a native method for this, DataFrame.explode.
It unnests each element of a list into its own row and repeats the values of the other columns.
So we first call Series.str.split on the string column to turn each string into a list of elements.
>>> df.assign(genres=df['genres'].str.split(', ')).explode('genres')
        title     genres
0  Robin Hood     Action
0  Robin Hood  Adventure
1  Madagaskar     Family
1  Madagaskar  Animation
1  Madagaskar     Comedy
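Note that explode keeps the original index, which is why the labels 0 and 1 repeat in the output above. A minimal sketch of the same call followed by reset_index to get a clean 0..n-1 index; the sample values for genres are an assumption reconstructed from the output shown, not the asker's exact frame:

```python
import pandas as pd

# Sample data assumed from the output shown above (hypothetical reconstruction).
df = pd.DataFrame({
    "title": ["Robin Hood", "Madagaskar"],
    "genres": ["Action, Adventure", "Family, Animation, Comedy"],
})

exploded = (
    df.assign(genres=df["genres"].str.split(", "))  # string -> list of genres
      .explode("genres")                             # one row per list element
      .reset_index(drop=True)                        # replace the repeated labels with 0..n-1
)
print(exploded)
```

If the separator is not always exactly a comma followed by one space, a regular expression such as r',\s*' can be passed to str.split instead.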
You can use numpy.repeat together with numpy.concatenate to flatten the lists.
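import numpy as np
import pandas as pd
# split each string into a list, allowing any whitespace after the comma
splitted = df['genres'].str.split(r',\s*')
# number of genres per row
l = splitted.str.len()
# repeat each title once per genre and flatten the genre lists
df1 = pd.DataFrame({'title': np.repeat(df['title'].values, l),
                    'genres': np.concatenate(splitted.values)}, columns=['title','genres'])
print(df1)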
        title     genres
0  Robin Hood     Action
1  Robin Hood  Adventure
2  Madagaskar     Family
3  Madagaskar  Animation
4  Madagaskar     Comedy
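To make the mechanics of this approach easier to follow, here is a small sketch on plain arrays, without pandas. The variable names and sample values are illustrative assumptions taken from the output above:

```python
import numpy as np

# Illustrative inputs (assumed from the output shown above).
titles = np.array(["Robin Hood", "Madagaskar"])
genre_lists = [["Action", "Adventure"], ["Family", "Animation", "Comedy"]]

# How many genres each title has.
lengths = [len(g) for g in genre_lists]

# Repeat each title once per genre so both arrays end up the same length.
repeated_titles = np.repeat(titles, lengths)

# Join the per-title lists end to end into one flat array.
flat_genres = np.concatenate(genre_lists)

for title, genre in zip(repeated_titles, flat_genres):
    print(title, genre)
```

Because both arrays are built in the same row order, position i of repeated_titles always belongs to position i of flat_genres.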
Timings:
df = pd.concat([df]*100000).reset_index(drop=True)
In [95]: %%timeit
    ...: splitted = df['genres'].str.split(r',\s*')
    ...: l = splitted.str.len()
    ...:
    ...: df1 = pd.DataFrame({'title': np.repeat(df['title'].values, l),
    ...:                     'genres': np.concatenate(splitted.values)}, columns=['title','genres'])
    ...:
1 loop, best of 3: 709 ms per loop
In [96]: %timeit (df.set_index('title')['genres'].str.split(r',\s*', expand=True).stack().reset_index(name='genre').drop('level_1', axis=1))
1 loop, best of 3: 750 ms per loop
In [33]: (df.set_index('title')
            ['genres'].str.split(r',\s*', expand=True)
            .stack()
            .reset_index(name='genre')
            .drop('level_1', axis=1))
Out[33]:
        title      genre
0  Robin Hood     Action
1  Robin Hood  Adventure
2  Madagaskar     Family
3  Madagaskar  Animation
4  Madagaskar     Comedy
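To see what the chain above does at each step, the sketch below splits it into an intermediate wide frame and the final long frame. The sample data is an assumption reconstructed from the output, and the explicit dropna() is a defensive addition in case stack() keeps the padding values in your pandas version:

```python
import pandas as pd

# Sample data assumed from the output shown above (hypothetical reconstruction).
df = pd.DataFrame({
    "title": ["Robin Hood", "Madagaskar"],
    "genres": ["Action, Adventure", "Family, Animation, Comedy"],
})

# Step 1: one column per genre position; shorter rows are padded with missing values.
wide = df.set_index("title")["genres"].str.split(r",\s*", expand=True)
print(wide)

# Step 2: move the genre positions into an index level, then tidy up.
long_df = (
    wide.stack()                    # wide -> long, indexed by (title, position)
        .dropna()                   # discard padding if stack() kept it
        .reset_index(name="genre")  # columns: title, level_1, genre
        .drop(columns="level_1")    # the position column is no longer needed
)
print(long_df)
```

The intermediate wide frame is as wide as the longest genre list, so this approach can use noticeably more memory when a few rows have many more elements than the rest.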
PS: here you can find a more generic approach.