Add data for the missing dates based on previous hour data in pandas

Question

I have a DataFrame like the one below:

id	creTimestamp	CPULoad	instnceId
0	2021-01-22 18:00:00	22.0	instanceA
1	2021-01-22 19:00:00	22.5	instanceA
2	2021-01-22 20:00:00	23.5	instanceA
3	2021-01-22 18:00:00	24.0	instanceB
4	2021-01-22 19:00:00	24.5	instanceB
5	2021-01-22 20:00:00	22.5	instanceB
6	2021-01-24 18:00:00	23.0	instanceA
7	2021-01-24 19:00:00	23.5	instanceA
8	2021-01-24 20:00:00	24.0	instanceA
9	2021-01-24 18:00:00	25.5	instanceB
10	2021-01-24 19:00:00	28.5	instanceB
11	2021-01-24 20:00:00	23.5	instanceB

Data is missing for the following dates:

2021-01-23 and 2021-01-25

I want to fill in the rows for 2021-01-23 and 2021-01-25 using the data from the previous date. For example, the hourly data from the 22nd should be used for the 23rd. I have a huge dataset where the entire data for a date can be missing, or data can be missing for 2 hours. The dates may also need to be generated for a future date range, for example from 2021-02-01 18:00:00 to 2021-02-02 18:00:00.

The updated DataFrame should look like this:

id	creTimestamp	CPULoad	instnceId
0	2021-01-22 18:00:00	22.0	instanceA
1	2021-01-22 19:00:00	22.5	instanceA
2	2021-01-22 20:00:00	23.5	instanceA
3	2021-01-22 18:00:00	24.0	instanceB
4	2021-01-22 19:00:00	24.5	instanceB
5	2021-01-22 20:00:00	22.5	instanceB
6	2021-01-23 18:00:00	22.0	instanceA
7	2021-01-23 19:00:00	22.5	instanceA
8	2021-01-23 20:00:00	23.5	instanceA
9	2021-01-23 18:00:00	24.0	instanceB
10	2021-01-23 19:00:00	24.5	instanceB
11	2021-01-23 20:00:00	22.5	instanceB
12	2021-01-24 18:00:00	23.0	instanceA
13	2021-01-24 19:00:00	23.5	instanceA
14	2021-01-24 20:00:00	24.0	instanceA
15	2021-01-24 18:00:00	25.5	instanceB
16	2021-01-24 19:00:00	28.5	instanceB
17	2021-01-24 20:00:00	23.5	instanceB
18	2021-01-25 18:00:00	23.0	instanceA
19	2021-01-25 19:00:00	23.5	instanceA
20	2021-01-25 20:00:00	24.0	instanceA
21	2021-01-25 18:00:00	25.5	instanceB
22	2021-01-25 19:00:00	28.5	instanceB
23	2021-01-25 20:00:00	23.5	instanceB

The date range can cover the last 7 days.

Please help me with this requirement.

Thanks

Answer 1:

This is a continuation of "fill values".

Generate a DataFrame (df2) that combines every sampled (instance, time) pair with every date.
This generates 15 rows, as there are 3 times for instanceA and 2 times for instanceB across 3 dates: (3+2)*3.
Then use the same technique to fill both CPULoad and a synthesized memload column.
Tested against pandas 1.0.1 as well as 1.2.0.
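
The row count in the note above can be checked with a minimal sketch. It rebuilds only the (instance, time) pairs and the dates from the sample data by hand; the names slots, dates, grid and key are illustrative and are not part of the answer's code.

```python
import pandas as pd

# Hand-written miniature of the sample data (illustrative only):
# the (instance, time-of-day) pairs that occur at least once
slots = pd.DataFrame({
    "instnceId": ["instanceA", "instanceA", "instanceA", "instanceB", "instanceB"],
    "timestamp": ["18:00:00", "19:00:00", "20:00:00", "20:00:00", "18:00:00"],
})
# the distinct dates in the sample
dates = pd.DataFrame({"date": ["2021-01-22", "2021-01-23", "2021-01-24"]})

# Cross join through a constant helper key; this also works before pandas 1.2
grid = (
    slots.assign(key=1)
    .merge(dates.assign(key=1), on="key")
    .drop(columns="key")
)

print(len(grid))  # (3 + 2) * 3 rows
```

On pandas 1.2 or later, merge(..., how="cross") should give the same grid without the helper key. The full code follows.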

import pandas as pd
import io
import datetime as dt
import numpy as np
# columns are separated by two or more spaces, so the single space
# inside each timestamp is kept as part of the value
df = pd.read_csv(io.StringIO("""id  creTimestamp  CPULoad  instnceId
0  2021-01-22 18:00:00  22.0  instanceA
1  2021-01-22 19:00:00  22.0  instanceA
2  2021-01-22 20:00:00  23.0  instanceB
3  2021-01-23 18:00:00  24.0  instanceA
4  2021-01-23 20:00:00  22.0  instanceA
5  2021-01-24 18:00:00  23.0  instanceB
6  2021-01-24 20:00:00  23.5  instanceA
"""), sep=r"\s{2,}", engine="python", index_col=0)

df.creTimestamp = pd.to_datetime(df.creTimestamp)
df["memload"] = np.random.random(len(df))

# generate a DF for each time in instance in each date
df2 = (pd.merge(
    # for each time in instance
    df.assign(timestamp=df.creTimestamp.dt.time)
        .loc[:,["instnceId","timestamp"]]
        .drop_duplicates()
        .assign(foo=1),
    # for each date
    df.creTimestamp.dt.date.drop_duplicates().to_frame().assign(foo=1),
    on="foo"
).assign(creTimestamp=lambda dfa: dfa.apply(lambda r: dt.datetime.combine(r["creTimestamp"], r["timestamp"]), axis=1))
 .drop(columns="foo")
       # merge values back..
 .merge(df, on=["creTimestamp", "instnceId"], how="left")
)

# now get values to fill NaN
df2 = (df2.merge(df2.dropna().drop_duplicates(subset=["instnceId","timestamp"], keep="last"),
         on=["timestamp","instnceId"], suffixes=("","_pre"))
 .assign(CPULoad=lambda dfa: dfa.CPULoad.fillna(dfa.CPULoad_pre))
 .assign(memload=lambda dfa: dfa.memload.fillna(dfa.memload_pre))

)

Output (memload columns not shown)

    instnceId timestamp        creTimestamp  CPULoad    creTimestamp_pre  CPULoad_pre
0   instanceA  18:00:00 2021-01-22 18:00:00     22.0 2021-01-23 18:00:00         24.0
1   instanceA  18:00:00 2021-01-23 18:00:00     24.0 2021-01-23 18:00:00         24.0
2   instanceA  18:00:00 2021-01-24 18:00:00     24.0 2021-01-23 18:00:00         24.0
3   instanceA  19:00:00 2021-01-22 19:00:00     22.0 2021-01-22 19:00:00         22.0
4   instanceA  19:00:00 2021-01-23 19:00:00     22.0 2021-01-22 19:00:00         22.0
5   instanceA  19:00:00 2021-01-24 19:00:00     22.0 2021-01-22 19:00:00         22.0
6   instanceB  20:00:00 2021-01-22 20:00:00     23.0 2021-01-22 20:00:00         23.0
7   instanceB  20:00:00 2021-01-23 20:00:00     23.0 2021-01-22 20:00:00         23.0
8   instanceB  20:00:00 2021-01-24 20:00:00     23.0 2021-01-22 20:00:00         23.0
9   instanceA  20:00:00 2021-01-22 20:00:00     23.5 2021-01-24 20:00:00         23.5
10  instanceA  20:00:00 2021-01-23 20:00:00     22.0 2021-01-24 20:00:00         23.5
11  instanceA  20:00:00 2021-01-24 20:00:00     23.5 2021-01-24 20:00:00         23.5
12  instanceB  18:00:00 2021-01-22 18:00:00     23.0 2021-01-24 18:00:00         23.0
13  instanceB  18:00:00 2021-01-23 18:00:00     23.0 2021-01-24 18:00:00         23.0
14  instanceB  18:00:00 2021-01-24 18:00:00     23.0 2021-01-24 18:00:00         23.0
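
The code above only covers dates that already appear in the data. When whole days are absent, as in the question's sample, one possible variant is to build the calendar explicitly with pd.date_range and forward-fill each (instance, time-of-day) slot from the most recent earlier day. This is a sketch under assumptions, not the approach used above: the end date 2021-01-25 and the helper names day, offset, slots, grid and filled are chosen for illustration.

```python
import pandas as pd

# Sample data from the question (2021-01-23 and 2021-01-25 are absent)
df = pd.DataFrame({
    "creTimestamp": pd.to_datetime([
        "2021-01-22 18:00:00", "2021-01-22 19:00:00", "2021-01-22 20:00:00",
        "2021-01-22 18:00:00", "2021-01-22 19:00:00", "2021-01-22 20:00:00",
        "2021-01-24 18:00:00", "2021-01-24 19:00:00", "2021-01-24 20:00:00",
        "2021-01-24 18:00:00", "2021-01-24 19:00:00", "2021-01-24 20:00:00",
    ]),
    "CPULoad": [22.0, 22.5, 23.5, 24.0, 24.5, 22.5,
                23.0, 23.5, 24.0, 25.5, 28.5, 23.5],
    "instnceId": (["instanceA"] * 3 + ["instanceB"] * 3) * 2,
})

# Split each timestamp into a calendar day and an offset within that day
df["day"] = df["creTimestamp"].dt.normalize()
df["offset"] = df["creTimestamp"] - df["day"]

# Assumed calendar: every day from the first observed day to 2021-01-25
days = pd.date_range(df["day"].min(), pd.Timestamp("2021-01-25"), freq="D")

# Every (instance, offset) slot seen at least once, crossed with every day
slots = df[["instnceId", "offset"]].drop_duplicates()
grid = (
    pd.DataFrame({"day": days, "key": 1})
    .merge(slots.assign(key=1), on="key")
    .drop(columns="key")
)

# Attach the observed values; rows for missing days get NaN
filled = grid.merge(
    df[["day", "instnceId", "offset", "CPULoad"]],
    on=["day", "instnceId", "offset"],
    how="left",
).sort_values(["instnceId", "offset", "day"])

# Forward-fill within each slot, so a missing day copies the previous day
filled["CPULoad"] = filled.groupby(["instnceId", "offset"])["CPULoad"].ffill()

# Rebuild full timestamps and restore the question's row order
filled["creTimestamp"] = filled["day"] + filled["offset"]
filled = (
    filled.sort_values(["day", "instnceId", "creTimestamp"])
    .reset_index(drop=True)[["creTimestamp", "CPULoad", "instnceId"]]
)
print(filled)
```

With the question's sample this gives 24 rows (4 days times 6 slots). A forward fill leaves NaN for any slot that has no earlier observation, so days before the first recorded day would need a different rule.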

Source: https://stackoverflow.com/questions/66004487/add-data-for-the-missing-dates-based-on-previous-hour-data-in-pandas

Tags

python

pandas

dataframe

datetime

time