Question
I need to reshape a dataframe that looks like df1 and turn it into df2. There are 2 considerations for this procedure:
- I need to be able to set the number of rows to be sliced as a parameter (length).
- I need to split date and time from the index, and use date in the reshape as the column names and keep time as the index.
Current df1
2007-08-07 18:00:00 1
2007-08-08 00:00:00 2
2007-08-08 06:00:00 3
2007-08-08 12:00:00 4
2007-08-08 18:00:00 5
2007-11-02 18:00:00 6
2007-11-03 00:00:00 7
2007-11-03 06:00:00 8
2007-11-03 12:00:00 9
2007-11-03 18:00:00 10
Desired Output df2 - With the parameter 'length=5'
          2007-08-07  2007-11-02
18:00:00           1           6
00:00:00           2           7
06:00:00           3           8
12:00:00           4           9
18:00:00           5          10
What I have done:
My approach was to create a multi-index (Date - Time) and then do a pivot table or some sort of reshape to achieve the desired df output.
import numpy as np
import pandas as pd

# First separate the date and the time.
df['TimeStamp'] = df.index
df['Date'] = df.index.date
df['Hour'] = df.index.time

# Then number the rows so the first row of each slice can be marked, and use
# those dates to build a multi-index.
df['Num'] = np.arange(len(df))
for index, row in df.iterrows():
    if row['Num'] % 5 == 0:
        # Only the first row of each slice receives an EventDate.
        df.loc[index, 'EventDate'] = df.loc[index, 'Date']
df.set_index(['EventDate', 'Hour'], inplace=True)
del df['Date']
del df['Num']
del df['TimeStamp']
Problem: A NaN appears next to each date in the first level of the multi-index. And even if that worked, I can't figure out how to get the result I need from a multi-index df.
I'm stuck. I appreciate any input.
Answer 1:
import numpy as np
import pandas as pd
import io
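data = '''\
val
2007-08-07 18:00:00  1
2007-08-08 00:00:00  2
2007-08-08 06:00:00  3
2007-08-08 12:00:00  4
2007-08-08 18:00:00  5
2007-11-02 18:00:00  6
2007-11-03 00:00:00  7
2007-11-03 06:00:00  8
2007-11-03 12:00:00  9
2007-11-03 18:00:00  10'''
# Columns are separated by two or more spaces, so the single space inside
# each timestamp is kept. The header has one field fewer than the data rows,
# so the first column becomes the (datetime) index.
df = pd.read_table(io.StringIO(data), sep=r'\s{2,}', engine='python',
                   parse_dates=True)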
chunksize = 5
chunks = len(df)//chunksize
df['Date'] = np.repeat(df.index.date[::chunksize], chunksize)[:len(df)]
index = df.index.time[:chunksize]
df['Time'] = np.tile(np.arange(chunksize), chunks)
df = df.set_index(['Date', 'Time'], append=False)
df = df['val'].unstack('Date')
df.index = index
print(df)
yields
Date      2007-08-07  2007-11-02
18:00:00           1           6
00:00:00           2           7
06:00:00           3           8
12:00:00           4           9
18:00:00           5          10
Note that the final DataFrame has an index with non-unique entries. (The 18:00:00 is repeated.) Some DataFrame operations are problematic when the index has repeated entries, so in general it is better to avoid this if possible.
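One possible way to avoid the repeated label, sketched here rather than taken from the answer above, is to keep each row's position within its chunk as an extra index level. The helper name `reshape_chunks` and the level names `Step` and `Time` are hypothetical. The sketch assumes the row count is a multiple of `length` and, as in the code above, that the times of the first chunk label every chunk.

```python
import numpy as np
import pandas as pd

def reshape_chunks(s, length):
    """Reshape a datetime-indexed Series into one column per chunk of `length` rows."""
    if len(s) % length != 0:
        raise ValueError("len(s) must be a multiple of length")
    n_chunks = len(s) // length
    # Each row of the reshaped array is one chunk; transpose so chunks become columns.
    values = s.to_numpy().reshape(n_chunks, length).T
    # Column labels: the date of the first row of every chunk.
    columns = pd.Index(s.index[::length].date, name="Date")
    # Row labels: (position within the chunk, time of day taken from the first chunk).
    index = pd.MultiIndex.from_arrays(
        [np.arange(length), s.index[:length].time], names=["Step", "Time"]
    )
    return pd.DataFrame(values, index=index, columns=columns)

idx = pd.to_datetime([
    "2007-08-07 18:00:00", "2007-08-08 00:00:00", "2007-08-08 06:00:00",
    "2007-08-08 12:00:00", "2007-08-08 18:00:00", "2007-11-02 18:00:00",
    "2007-11-03 00:00:00", "2007-11-03 06:00:00", "2007-11-03 12:00:00",
    "2007-11-03 18:00:00",
])
s = pd.Series(range(1, 11), index=idx, name="val")

df2 = reshape_chunks(s, length=5)
print(df2)
print(df2.index.is_unique)
```

Because the `Step` level is unique, label-based lookups stay unambiguous even though 18:00:00 occurs twice in the `Time` level.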
Answer 2:
First of all, I'm assuming your datetime column is actually a datetime type; if not, use df['t'] = pd.to_datetime(df['t']) to convert it.
Then set the index to a MultiIndex of (time, date) pairs and unstack:
df.index = pd.MultiIndex.from_tuples(df['t'].apply(lambda x: (x.time(), x.date())).tolist())
df['v'].unstack()
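For a self-contained run of this idea, a minimal sketch follows. It assumes the timestamps live in a column named `t` and the values in a column named `v`, as in the snippet above; the level names `time` and `date` are added here only for readability.

```python
import pandas as pd

df = pd.DataFrame({
    "t": pd.to_datetime([
        "2007-08-07 18:00:00", "2007-08-08 00:00:00", "2007-08-08 06:00:00",
        "2007-08-08 12:00:00", "2007-08-08 18:00:00", "2007-11-02 18:00:00",
        "2007-11-03 00:00:00", "2007-11-03 06:00:00", "2007-11-03 12:00:00",
        "2007-11-03 18:00:00",
    ]),
    "v": range(1, 11),
})

# Build a (time, date) MultiIndex from each timestamp.
df.index = pd.MultiIndex.from_tuples(
    [(ts.time(), ts.date()) for ts in df["t"]], names=["time", "date"]
)

# Move the date level into the columns; times stay as the row index.
wide = df["v"].unstack("date")
print(wide)
```

Note that this groups by calendar date, so the sample data produces four date columns with NaN wherever a date has no reading at a given time, rather than two columns of five rows.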
Answer 3:
This would be a canonical approach for pandas:
First, set up the imports and data:
import io
import pandas as pd
txt = '''2007-08-07 18:00:00 1
2007-08-08 00:00:00 2
2007-08-08 06:00:00 3
2007-08-08 12:00:00 4
2007-08-08 18:00:00 5
2007-11-02 18:00:00 6
2007-11-03 00:00:00 7
2007-11-03 06:00:00 8
2007-11-03 12:00:00 9
2007-11-03 18:00:00 10'''
Now read in the DataFrame, and pivot on the correct columns:
# The single-space separator splits each line into date, time, and value.
df1 = pd.read_csv(io.StringIO(txt), sep=' ',
                  names=['d', 't', 'n'])
print(df1.pivot(index='t', columns='d', values='n'))
prints a pivoted df:
d         2007-08-07  2007-08-08  2007-11-02  2007-11-03
t
00:00:00         NaN           2         NaN           7
06:00:00         NaN           3         NaN           8
12:00:00         NaN           4         NaN           9
18:00:00           1           5           6          10
You won't get a length of 5, though. The following,
          2007-08-07  2007-11-02
18:00:00           1           6
00:00:00           2           7
06:00:00           3           8
12:00:00           4           9
18:00:00           5          10
is incorrect: it lists 18:00:00 twice under the same date, whereas in your initial data those two readings belong to different dates.
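The duplicate can be seen directly if every row is labelled with the date that opens its chunk and the frame is then pivoted on time. The sketch below is only an illustration: the column names `chunk`, `t`, `n`, and `step` and the variable `pivot_error` are hypothetical. It shows pandas rejecting the duplicated (time, chunk) pair, and one way around it by pivoting on the position within the chunk instead.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime([
    "2007-08-07 18:00:00", "2007-08-08 00:00:00", "2007-08-08 06:00:00",
    "2007-08-08 12:00:00", "2007-08-08 18:00:00", "2007-11-02 18:00:00",
    "2007-11-03 00:00:00", "2007-11-03 06:00:00", "2007-11-03 12:00:00",
    "2007-11-03 18:00:00",
])
df = pd.DataFrame({"n": range(1, 11)}, index=idx)

length = 5
# Label every row with the date that opens its chunk of `length` rows.
df["chunk"] = np.repeat(idx.date[::length], length)[:len(df)]
df["t"] = idx.time

# 18:00:00 occurs twice within each chunk, so this pivot cannot be built.
try:
    df.pivot(index="t", columns="chunk", values="n")
    pivot_error = None
except ValueError as exc:
    pivot_error = str(exc)
print(pivot_error)

# The position within the chunk is unique per column, so this pivot works.
df["step"] = np.arange(len(df)) % length
wide = df.pivot(index="step", columns="chunk", values="n")
print(wide)
```

The second pivot gives the five-row layout, but with integer positions rather than times as row labels, which keeps the index unique.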
Source: https://stackoverflow.com/questions/25877255/how-to-properly-pivot-or-reshape-a-timeseries-dataframe-in-pandas