I have a pandas dataframe in which one column of text strings contains comma-separated values. I want to split each CSV field and create a new row per entry (as
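import pandas as pd
import numpy as np

def explode_str(df, col, sep):
    s = df[col]
    # Repeat each row position once per value in its delimited string
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    # Join all strings, split once, and overwrite the column in place
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})

def explode_list(df, col):
    s = df[col]
    # Repeat each row position once per element of its list
    i = np.arange(len(s)).repeat(s.str.len())
    # Flatten the lists into one array and overwrite the column in place
    return df.iloc[i].assign(**{col: np.concatenate(s)})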
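The calls that follow operate on a dataframe named a, which is not defined in this answer. A minimal reconstruction, inferred from the printed output and therefore an assumption, could look like this:

```python
import pandas as pd

# Assumed sample data: reconstructed from the printed output,
# not taken from the original post
a = pd.DataFrame({
    'var1': ['a,b,c', 'd,e,f'],
    'var2': [1, 2],
})
```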
explode_str(a, 'var1', ',')
  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2
Let's create a new dataframe d that has lists:
d = a.assign(var1=lambda d: d.var1.str.split(','))
explode_list(d, 'var1')
  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2
I'll use np.arange with repeat to produce dataframe index positions that I can use with iloc.
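As a small illustration of that step, here is what the repeated positions look like for made-up counts. The counts list is hypothetical and stands in for the per-row number of values:

```python
import numpy as np

# Hypothetical per-row counts: the first row holds 3 values, the second holds 2
counts = [3, 2]

# One position per row, each repeated as many times as that row has values
positions = np.arange(len(counts)).repeat(counts)

print(positions.tolist())   # [0, 0, 0, 1, 1]
```

Passing these positions to iloc selects row 0 three times and row 1 twice.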
Why don't I use loc? Because the index may not be unique, and loc will return every row that matches a queried index label.
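A quick sketch of the difference, using a made-up frame with a duplicated index label (the data here is hypothetical):

```python
import pandas as pd

# Hypothetical frame with a duplicated index label 'x'
df = pd.DataFrame({'var2': [1, 2, 3]}, index=['x', 'x', 'y'])

by_label = df.loc['x']        # label lookup: both rows labelled 'x' come back
by_position = df.iloc[[0]]    # positional lookup: exactly the first row

print(len(by_label), len(by_position))   # 2 1
```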
Why don't I use the values attribute and slice that? When calling values, if the entirety of the dataframe is in one cohesive "block", pandas will return a view of the array that is the "block". Otherwise pandas will have to cobble together a new array. When cobbling, that array must be of a uniform dtype. Often that means returning an array with dtype object. By using iloc instead of slicing the values attribute, I spare myself from having to deal with that.
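To make the dtype point concrete, here is a minimal sketch on a hypothetical frame that mixes a string column with an integer column:

```python
import pandas as pd

# Hypothetical frame mixing a string column and an integer column
df = pd.DataFrame({'var1': ['a,b,c', 'd,e,f'], 'var2': [1, 2]})

as_array = df.values             # mixed columns end up in one object array
repeated = df.iloc[[0, 0, 1]]    # positional selection keeps per-column dtypes

print(as_array.dtype)                               # object
print(repeated['var2'].dtype == df['var2'].dtype)   # True
```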
Why do I use assign? When I use assign with the same column name that I'm exploding, I overwrite the existing column and maintain its position in the dataframe.
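The column-position behavior can be checked with a short sketch. The frame and the column names var0 and var3 are made up for illustration:

```python
import pandas as pd

# Hypothetical frame where the column to overwrite sits in the middle
df = pd.DataFrame({'var0': [0], 'var1': ['a'], 'var2': [1]})

overwritten = df.assign(**{'var1': ['z']})   # existing name: replaced in place
appended = df.assign(**{'var3': ['z']})      # new name: added at the end

print(list(overwritten.columns))   # ['var0', 'var1', 'var2']
print(list(appended.columns))      # ['var0', 'var1', 'var2', 'var3']
```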
By virtue of using iloc on repeated positions, the resulting index shows the same repeated pattern: one repeat for each element of the list or string.
This can be reset with reset_index(drop=True).
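For example, a sketch using the explode_str function from this answer on assumed sample data (the contents of a are inferred from the output shown earlier):

```python
import numpy as np
import pandas as pd

def explode_str(df, col, sep):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})

# Assumed sample data
a = pd.DataFrame({'var1': ['a,b,c', 'd,e,f'], 'var2': [1, 2]})

exploded = explode_str(a, 'var1', ',')
renumbered = exploded.reset_index(drop=True)

print(exploded.index.tolist())     # [0, 0, 0, 1, 1, 1]
print(renumbered.index.tolist())   # [0, 1, 2, 3, 4, 5]
```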
For strings, I don't want to have to split the strings prematurely. So instead I count the occurrences of the sep argument, on the assumption that if I were to split, the length of the resulting list would be one more than the number of separators.
I then use that same sep to join all the strings into one and split the result once.
def explode_str(df, col, sep):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})
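The two steps can be seen separately in the sketch below, on a hypothetical series. It also shows one caveat that I believe applies: Series.str.count treats its pattern as a regular expression, while str.join and str.split use the separator literally, so a separator such as '|' would probably need re.escape before counting.

```python
import re
import pandas as pd

s = pd.Series(['a,b,c', 'd,e'])   # hypothetical input

counts = s.str.count(',') + 1     # separators + 1 = number of pieces per row
pieces = ','.join(s).split(',')   # one join and one split for the whole column

print(counts.tolist())   # [3, 2]
print(pieces)            # ['a', 'b', 'c', 'd', 'e']

# A regex metacharacter used as the separator is escaped before counting
t = pd.Series(['a|b'])
escaped_count = t.str.count(re.escape('|'))
print(escaped_count.tolist())   # [1]
```

The total of counts equals the length of pieces, which is what lets assign line the new values up with the repeated rows.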
For lists, the approach is similar to the one for strings, except I don't need to count occurrences of sep because it's already split.
I use NumPy's concatenate to jam the lists together.
import pandas as pd
import numpy as np

def explode_list(df, col):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.len())
    return df.iloc[i].assign(**{col: np.concatenate(s)})
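A short sketch of the two building blocks for the list case, on a hypothetical series of lists. Unlike the function above, it calls tolist() before concatenating, purely so that NumPy receives plain Python lists:

```python
import numpy as np
import pandas as pd

# Hypothetical column of already-split lists
s = pd.Series([['a', 'b', 'c'], ['d', 'e']])

lengths = s.str.len()               # the .str accessor also reports list lengths
flat = np.concatenate(s.tolist())   # all lists joined into one flat array

print(lengths.tolist())   # [3, 2]
print(flat.tolist())      # ['a', 'b', 'c', 'd', 'e']
```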