I have a pandas dataframe as follows:
ticker account value date
aa assets 100,200 20121231, 20131231
bb liabilities
df.value = df.value.str.split(',')
df.date = df.date.str.split(',')
df = df.explode('value').explode("date").reset_index(drop=True)
df:
ticker account value date
0 aa assets 100 20121231
1 aa assets 100 20131231
2 aa assets 200 20121231
3 aa assets 200 20131231
4 bb liabilities 50 20141231
5 bb liabilities 50 20131231
6 bb liabilities 50 20141231
7 bb liabilities 50 20131231
You can first split columns, create Series
by stack and remove whitespaces by strip:
s1 = df.value.str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)
s2 = df.date.str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)
Then concat both Series
to df1
:
df1 = pd.concat([s1,s2], axis=1, keys=['value','date'])
Remove old columns value
and date
and join:
print (df.drop(['value','date'], axis=1).join(df1).reset_index(drop=True))
ticker account value date
0 aa assets 100 20121231
1 aa assets 200 20131231
2 bb liabilities 50 20141231
3 bb liabilities 150 20131231
Because I'm too new, I'm not allowed to write a comment, so I write an "answer".
@titipata your answer worked really good, but in my opinion there is a small "mistake" in your code I'm not able to find for my self.
I work with the example from this question and changed just the values.
df = pd.DataFrame([['title1', 'publisher1', '1.1,1.2', '1'],
['title2', 'publisher2', '2', '2.1,2.2']],
columns=['titel', 'publisher', 'print', 'electronic'])
explode(df, ['print', 'electronic'])
publisher titel print electronic
0 publisher1 title1 1.1 1
1 publisher1 title1 1.2 2.1
2 publisher2 title2 2 2.2
As you see, in the column 'electronic' should be in row '1' the value '1' and not '2.1'.
Because of that, the hole DataSet would change. I hope someone could help me to find a solution for this.
I wrote explode
function based on previous answers. It might be useful for anyone who want to grab and use it quickly.
def explode(df, cols, split_on=','):
"""
Explode dataframe on the given column, split on given delimeter
"""
cols_sep = list(set(df.columns) - set(cols))
df_cols = df[cols_sep]
explode_len = df[cols[0]].str.split(split_on).map(len)
repeat_list = []
for r, e in zip(df_cols.as_matrix(), explode_len):
repeat_list.extend([list(r)]*e)
df_repeat = pd.DataFrame(repeat_list, columns=cols_sep)
df_explode = pd.concat([df[col].str.split(split_on, expand=True).stack().str.strip().reset_index(drop=True)
for col in cols], axis=1)
df_explode.columns = cols
return pd.concat((df_repeat, df_explode), axis=1)
example given from @piRSquared:
df = pd.DataFrame([['aa', 'assets', '100,200', '20121231,20131231'],
['bb', 'liabilities', '50,50', '20141231,20131231']],
columns=['ticker', 'account', 'value', 'date'])
explode(df, ['value', 'date'])
output
+-----------+------+-----+--------+
| account|ticker|value| date|
+-----------+------+-----+--------+
| assets| aa| 100|20121231|
| assets| aa| 200|20131231|
|liabilities| bb| 50|20141231|
|liabilities| bb| 50|20131231|
+-----------+------+-----+--------+
I'm noticing this question a lot. That is, how do I split this column that has a list into multiple rows? I've seen it called exploding. Here are some links:
So I wrote a function that will do it.
def explode(df, columns):
idx = np.repeat(df.index, df[columns[0]].str.len())
a = df.T.reindex_axis(columns).values
concat = np.concatenate([np.concatenate(a[i]) for i in range(a.shape[0])])
p = pd.DataFrame(concat.reshape(a.shape[0], -1).T, idx, columns)
return pd.concat([df.drop(columns, axis=1), p], axis=1).reset_index(drop=True)
But before we can use it, we need lists (or iterable) in a column.
df = pd.DataFrame([['aa', 'assets', '100,200', '20121231,20131231'],
['bb', 'liabilities', '50,50', '20141231,20131231']],
columns=['ticker', 'account', 'value', 'date'])
df
split value
and date
columns:
df.value = df.value.str.split(',')
df.date = df.date.str.split(',')
df
Now we could explode on either column or both, one after the other.
explode(df, ['value','date'])
I removed strip
from @jezrael's timing because I could not effectively add it to mine. This is a necessary step for this question as OP has spaces in strings after commas. I was aiming at providing a generic way to explode a column given it already has iterables in it and I think I've accomplished that.
code
def get_df(n=1):
return pd.DataFrame([['aa', 'assets', '100,200,200', '20121231,20131231,20131231'],
['bb', 'liabilities', '50,50', '20141231,20131231']] * n,
columns=['ticker', 'account', 'value', 'date'])
small 2 row sample
medium 200 row sample
large 2,000,000 row sample