What's the most efficient way to convert time-series data into cross-sectional data?

Question

I have the dataset below, where date is the index:

date            value
2020-01-01      100
2020-02-01      140
2020-03-01      156
2020-04-01      161
2020-05-01      170
.
.
.

I want to transform it into this dataset:

value_t0    value_t1    value_t2    value_t3    value_t4 ...
100         NaN         NaN         NaN         NaN      ...
140         100         NaN         NaN         NaN      ...
156         140         100         NaN         NaN      ...
161         156         140         100         NaN      ...
170         161         156         140         100      ...

At first I thought about using pandas.pivot_table, but that would only give a different layout grouped by some column, which is not what I want. Then I thought about using pandasql with 'case when', but that would mean typing dozens of lines of code. So I'm stuck.

Answer 1:

Try this:

new_df = pd.DataFrame({f"value_t{i}": df['value'].shift(i) for i in range(len(df))})

The Series .shift(n) method gives you a single column of your desired output: it shifts every value down by n rows and fills the top n rows with NaN. The new DataFrame is then built from a dictionary of the form {column name: column data, ...}, using a dictionary comprehension over the lags 0 to len(df) - 1.
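
As a minimal sketch, here is the shift approach applied to the five sample rows from the question. The way the frame is constructed and the name `lagged` are assumptions for illustration, not part of the original answer:

```python
import pandas as pd

# Sample data from the question, with the dates as the index.
df = pd.DataFrame(
    {"value": [100, 140, 156, 161, 170]},
    index=pd.to_datetime(
        ["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01"]
    ),
)
df.index.name = "date"

# Column value_t{i} is the original series lagged by i rows;
# the first i rows of that column have no earlier observation and are NaN.
lagged = pd.DataFrame(
    {f"value_t{i}": df["value"].shift(i) for i in range(len(df))}
)

print(lagged)
```

The result keeps the date index of the original frame. If only the first few lags are needed, replacing `range(len(df))` with a smaller range avoids building a full square table.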

Answer 2:

I think the best approach is to use NumPy:

values = np.asarray(df['value'].astype(float))
new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
new_values[np.triu_indices(new_values.shape[0], 1)] = np.nan
new_df = pd.DataFrame(new_values).add_prefix('value_t')
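
One caveat worth checking before relying on the snippet above: `np.repeat([values], n, axis=0).T` puts `values[i]` in every column of row `i`, so after masking each row repeats its own value instead of stepping back through earlier observations. A possible NumPy-only sketch that reproduces the lagged layout from the question indexes the series with a matrix of `i - j` offsets instead. The names `lag_idx` and `lagged` and the small sample frame are assumptions for illustration, and this variant is not the code measured in the timings that follow:

```python
import numpy as np
import pandas as pd

# Sample values from the question (index omitted for brevity).
df = pd.DataFrame({"value": [100, 140, 156, 161, 170]})

values = df["value"].to_numpy(dtype=float)
n = values.shape[0]

# lag_idx[i, j] = i - j: the position in `values` to read for lag j at row i.
lag_idx = np.arange(n)[:, None] - np.arange(n)[None, :]

# Negative offsets point before the start of the series, so those cells are NaN.
# Clipping keeps the fancy index valid; np.where then masks the clipped cells.
lagged = np.where(lag_idx >= 0, values[np.clip(lag_idx, 0, None)], np.nan)

new_df = pd.DataFrame(lagged, index=df.index).add_prefix("value_t")
print(new_df)
```

Like the original snippet, this allocates a full n-by-n array, so memory grows quadratically with the number of rows.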

Timings for 5000 rows:

%%timeit
values = np.asarray(df['value'].astype(float))
new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
new_values[np.triu_indices(new_values.shape[0],1)] = np.nan
new_df = pd.DataFrame(new_values).add_prefix('value_t')
556 ms ± 35.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
new_df = pd.DataFrame({f"value_t{i}": df['value'].shift(i) for i in range(len(df))})
1.31 s ± 36.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Timing without add_prefix:

%%timeit
values = np.asarray(df['value'].astype(float))
new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
new_values[np.triu_indices(new_values.shape[0],1)] = np.nan
new_df = pd.DataFrame(new_values)

357 ms ± 8.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Source: https://stackoverflow.com/questions/64124732/whats-the-most-efficient-way-to-convert-a-time-series-data-into-a-cross-section

Tags

python

pandas

dataframe

data-cleaning