I have a csv file that contains a number of columns. Using pandas, I read this csv file into a dataframe and have a datetime index and five or six other columns. One of these columns holds a list of timestamps for each row, and I want to expand that list so that each timestamp gets its own row, with the values of the other columns repeated.
If you want to stay in pure pandas you can throw in a tricky groupby and apply, which ends up boiling down to a one-liner if you don't count the column rename.
In [1]: import pandas as pd
In [2]: d = {'date': ['4/1/11', '4/2/11'], 'ts': [[pd.Timestamp('2012-02-29 00:00:00'), pd.Timestamp('2012-03-31 00:00:00'), pd.Timestamp('2012-04-25 00:00:00'), pd.Timestamp('2012-06-30 00:00:00')], [pd.Timestamp('2014-01-31 00:00:00')]]}
In [3]: df = pd.DataFrame(d)
In [4]: df.head()
Out[4]:
     date                                                 ts
0  4/1/11  [2012-02-29 00:00:00, 2012-03-31 00:00:00, 201...
1  4/2/11                              [2014-01-31 00:00:00]
In [5]: df_new = df.groupby('date').ts.apply(lambda x: pd.DataFrame(x.values[0])).reset_index().drop('level_1', axis = 1)
In [6]: df_new.columns = ['date','ts']
In [7]: df_new.head()
Out[7]:
     date         ts
0  4/1/11 2012-02-29
1  4/1/11 2012-03-31
2  4/1/11 2012-04-25
3  4/1/11 2012-06-30
4  4/2/11 2014-01-31
Since the goal is to take the value of a column (in this case date) and repeat it across all of the rows you intend to create from the list, it's useful to think in terms of pandas indexing. We want the date to become the single index for the new rows, so we use groupby, which puts the desired row value into an index. Then, inside that operation, we want to split only the list belonging to that date, which is what apply will do for us.
I'm passing apply a pandas Series which consists of a single list, but I can access that list via .values[0], which pushes the sole row of the Series into an array with a single entry. To turn the list into a set of rows that will be passed back under the indexed date, I can just make it a DataFrame. This incurs the penalty of picking up an extra index, but we end up dropping that. We could make this an index itself, but that would preclude dupe values.
Once this is passed back out we have a multi-index, but we can force it into the row format we desire with reset_index; then we simply drop the unwanted index. It sounds involved, but really we're just leveraging the natural behaviors of pandas functions to avoid explicitly iterating or looping. Speed-wise this tends to be pretty good, and since it relies on apply, any parallelization tricks that work with apply work here too.
Optionally, if you want it to be robust to duplicate dates, each with its own nested list:
df_new = df.groupby('date').ts.apply(lambda x: pd.DataFrame([item for sublist in x.values for item in sublist]))
at which point the one-liner is getting dense and you should probably move it into a function.
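For example, a minimal sketch of such a function (the name explode_ts and its parameters are mine, purely illustrative); note that on pandas 0.25 and later the built-in df.explode('ts') covers this case directly:

import pandas as pd

def explode_ts(df, key='date', col='ts'):
    # Flatten every list under each key value into its own set of rows,
    # then clean up the helper index that apply/reset_index introduces.
    out = (df.groupby(key)[col]
             .apply(lambda x: pd.DataFrame([item for sublist in x.values
                                            for item in sublist]))
             .reset_index()
             .drop('level_1', axis=1))
    out.columns = [key, col]
    return out

df_new = explode_ts(df)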
The way I did it was to split the list into separate columns, and then melt it to put each timestamp in a separate row.
In [48]: df = pd.DataFrame([[1,2,[1,2,4]],[4,5,[1,3]],],columns=['a','b','TimeStamp'])
...: df
Out[48]:
   a  b  TimeStamp
0  1  2  [1, 2, 4]
1  4  5     [1, 3]
You can convert the column to a list and then back to a DataFrame to split it into columns:
In [53]: TScolumns = pd.DataFrame(df.TimeStamp.tolist(), )
...: TScolumns
Out[53]:
   0  1    2
0  1  2    4
1  1  3  NaN
And then splice it onto the original dataframe:
In [90]: df = df.drop('TimeStamp',axis=1)
In [58]: split = pd.concat([df, TScolumns], axis=1)
...: split
Out[58]:
   a  b  0  1    2
0  1  2  1  2    4
1  4  5  1  3  NaN
Finally, use melt to get it into the shape you want:
In [89]: pd.melt(split, id_vars=['a', 'b'], value_name='TimeStamp')
Out[89]:
   a  b variable TimeStamp
0  1  2        0         1
1  4  5        0         1
2  1  2        1         2
3  4  5        1         3
4  1  2        2         4
5  4  5        2       NaN
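One caveat: because the lists have unequal lengths, melt leaves a NaN TimeStamp for each padding cell, and the rows come out grouped by the variable column rather than by the original rows. A small follow-up sketch, assuming the split frame from above, that drops the padding and restores a row-major order:

result = (pd.melt(split, id_vars=['a', 'b'], value_name='TimeStamp')
            .dropna(subset=['TimeStamp'])          # remove the padding rows
            .sort_values(['a', 'b', 'variable'])   # restore per-row ordering
            .drop('variable', axis=1)
            .reset_index(drop=True))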
Probably not the best way from a performance perspective, but you can still leverage the itertools package:
from pandas import DataFrame, Timestamp
import itertools
d = {'date': ['4/1/11', '4/2/11'], 'ts': [[Timestamp('2012-02-29 00:00:00'), Timestamp('2012-03-31 00:00:00'), Timestamp('2012-04-25 00:00:00'), Timestamp('2012-06-30 00:00:00')], [Timestamp('2014-01-31 00:00:00')]]}
df = DataFrame(d)
res = df.to_dict()
data = []
for x in res['date'].keys():
    # zip_longest pads the one-element date list with the date itself
    # (fillvalue), pairing that date with every timestamp in the list.
    data.append(itertools.zip_longest([res['date'][x]], res['ts'][x], fillvalue=res['date'][x]))
new_data = list(itertools.chain.from_iterable(data))
df2 = DataFrame(new_data, columns=['date', 'timestamp'])
print(df2)
This will print:
     date  timestamp
0  4/1/11 2012-02-29
1  4/1/11 2012-03-31
2  4/1/11 2012-04-25
3  4/1/11 2012-06-30
4  4/2/11 2014-01-31
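As a side note, the same pairing can be written without the to_dict round trip; a sketch under the same assumptions, iterating over the frame's rows directly:

pairs = itertools.chain.from_iterable(
    # repeat() supplies the date for as many timestamps as the row holds;
    # zip() stops when the (finite) timestamp list runs out.
    zip(itertools.repeat(row.date), row.ts) for row in df.itertuples()
)
df2 = DataFrame(list(pairs), columns=['date', 'timestamp'])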
This doesn't feel very Pythonic, but it works (provided your CreateDate is unique!).
apply will only return more rows than it receives when combined with a groupby, so we're going to use groupby artificially (i.e. group by a column of unique values, so that each group is one line).
def splitRows(x):
    # Extract the actual list of timestamps.
    theList = x.TimeStamps.iloc[0]
    # Each row will be a dictionary in this list.
    listOfNewRows = list()
    # Iterate over the items in the list of timestamps,
    # putting each one in a dictionary to later convert to a row,
    # then adding the dictionary to a list.
    for i in theList:
        newRow = dict()
        newRow['CreateDate'] = x.CreateDate.iloc[0]
        newRow['TimeStamps'] = i
        listOfNewRows.append(newRow)
    # Now convert these dictionaries into rows in a new dataframe and return it.
    return pd.DataFrame(listOfNewRows)

df.groupby('CreateDate', as_index=False, group_keys=False).apply(splitRows)
Follow-up: if CreateDate is NOT unique, you can just reset the index to a new column and groupby that.
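A minimal sketch of that follow-up, assuming the same df and splitRows as above:

# The auto-generated 'index' column is unique by construction,
# so each group handed to splitRows is exactly one row.
out = (df.reset_index()
         .groupby('index', as_index=False, group_keys=False)
         .apply(splitRows)
         .reset_index(drop=True))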