Add data for the missing dates based on previous hour data in pandas

断了今生、忘了曾经 posted on 2021-02-11 12:51:47

Question


I have a dataframe like the one below:

id creTimestamp CPULoad instnceId
0 2021-01-22 18:00:00 22.0 instanceA
1 2021-01-22 19:00:00 22.5 instanceA
2 2021-01-22 20:00:00 23.5 instanceA
3 2021-01-22 18:00:00 24.0 instanceB
4 2021-01-22 19:00:00 24.5 instanceB
5 2021-01-22 20:00:00 22.5 instanceB
6 2021-01-24 18:00:00 23.0 instanceA
7 2021-01-24 19:00:00 23.5 instanceA
8 2021-01-24 20:00:00 24.0 instanceA
9 2021-01-24 18:00:00 25.5 instanceB
10 2021-01-24 19:00:00 28.5 instanceB
11 2021-01-24 20:00:00 23.5 instanceB

Data is missing for the following dates:

2021-01-23 2021-01-25

I want to fill in the rows for 2021-01-23 and 2021-01-25 using the data from the previous date, hour by hour. For example, the rows for 2021-01-23 should reuse the hourly data from 2021-01-22. I have a huge dataset where the entire data for a date can be missing for 2 hours. The dates may also have to be generated for a future date range, for example from 2021-02-01 18:00:00 to 2021-02-02 18:00:00.

The updated dataframe should look like this:

id creTimestamp CPULoad instnceId
0 2021-01-22 18:00:00 22.0 instanceA
1 2021-01-22 19:00:00 22.5 instanceA
2 2021-01-22 20:00:00 23.5 instanceA
3 2021-01-22 18:00:00 24.0 instanceB
4 2021-01-22 19:00:00 24.5 instanceB
5 2021-01-22 20:00:00 22.5 instanceB
6 2021-01-23 18:00:00 22.0 instanceA
7 2021-01-23 19:00:00 22.5 instanceA
8 2021-01-23 20:00:00 23.5 instanceA
9 2021-01-23 18:00:00 24.0 instanceB
10 2021-01-23 19:00:00 24.5 instanceB
11 2021-01-23 20:00:00 22.5 instanceB
12 2021-01-24 18:00:00 23.0 instanceA
13 2021-01-24 19:00:00 23.5 instanceA
14 2021-01-24 20:00:00 24.0 instanceA
15 2021-01-24 18:00:00 25.5 instanceB
16 2021-01-24 19:00:00 28.5 instanceB
17 2021-01-24 20:00:00 23.5 instanceB
18 2021-01-25 18:00:00 23.0 instanceA
19 2021-01-25 19:00:00 23.5 instanceA
20 2021-01-25 20:00:00 24.0 instanceA
21 2021-01-25 18:00:00 25.5 instanceB
22 2021-01-25 19:00:00 28.5 instanceB
23 2021-01-25 20:00:00 23.5 instanceB

The date range can cover the last 7 days.

Please help me with this requirement.

Thanks


Answer 1:


This is a continuation of the fill-values technique.

  • generate a DataFrame (df2) that is the combination of every sampled hour per instance with every date
  • this yields 15 rows: instanceA has 3 sampled times and instanceB has 2, across 3 dates, so (3+2)*3 = 15
  • then use the same technique to fill both CPULoad and a synthesized memload column
  • tested against pandas 1.0.1 as well as 1.2.0
import pandas as pd
import io
import datetime as dt
import numpy as np
# sample data, comma-separated so the literal parses without relying on tab characters
df = pd.read_csv(io.StringIO("""id,creTimestamp,CPULoad,instnceId
0,2021-01-22 18:00:00,22.0,instanceA
1,2021-01-22 19:00:00,22.0,instanceA
2,2021-01-22 20:00:00,23.0,instanceB
3,2021-01-23 18:00:00,24.0,instanceA
4,2021-01-23 20:00:00,22.0,instanceA
5,2021-01-24 18:00:00,23.0,instanceB
6,2021-01-24 20:00:00,23.5,instanceA
"""), index_col=0)

df.creTimestamp = pd.to_datetime(df.creTimestamp)
df["memload"] = np.random.random(len(df))

# generate a DF with one row per (instance, time of day) for each date
df2 = (pd.merge(
    # for each time in instance
    df.assign(timestamp=df.creTimestamp.dt.time)
        .loc[:,["instnceId","timestamp"]]
        .drop_duplicates()
        .assign(foo=1),
    # for each date
    df.creTimestamp.dt.date.drop_duplicates().to_frame().assign(foo=1),
    on="foo"
).assign(creTimestamp=lambda dfa: dfa.apply(lambda r: dt.datetime.combine(r["creTimestamp"], r["timestamp"]), axis=1))
 .drop(columns="foo")
       # merge values back..
 .merge(df, on=["creTimestamp", "instnceId"], how="left")
)

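# now get values to fill NaN: keep="last" picks the last observed row for each
# (instance, time of day), which can come from a later date than the row being filled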
df2 = (df2.merge(df2.dropna().drop_duplicates(subset=["instnceId","timestamp"], keep="last"),
         on=["timestamp","instnceId"], suffixes=("","_pre"))
 .assign(CPULoad=lambda dfa: dfa.CPULoad.fillna(dfa.CPULoad_pre))
 .assign(memload=lambda dfa: dfa.memload.fillna(dfa.memload_pre))

)

Output:

    instnceId timestamp        creTimestamp  CPULoad    creTimestamp_pre  CPULoad_pre
0   instanceA  18:00:00 2021-01-22 18:00:00     22.0 2021-01-23 18:00:00         24.0
1   instanceA  18:00:00 2021-01-23 18:00:00     24.0 2021-01-23 18:00:00         24.0
2   instanceA  18:00:00 2021-01-24 18:00:00     24.0 2021-01-23 18:00:00         24.0
3   instanceA  19:00:00 2021-01-22 19:00:00     22.0 2021-01-22 19:00:00         22.0
4   instanceA  19:00:00 2021-01-23 19:00:00     22.0 2021-01-22 19:00:00         22.0
5   instanceA  19:00:00 2021-01-24 19:00:00     22.0 2021-01-22 19:00:00         22.0
6   instanceB  20:00:00 2021-01-22 20:00:00     23.0 2021-01-22 20:00:00         23.0
7   instanceB  20:00:00 2021-01-23 20:00:00     23.0 2021-01-22 20:00:00         23.0
8   instanceB  20:00:00 2021-01-24 20:00:00     23.0 2021-01-22 20:00:00         23.0
9   instanceA  20:00:00 2021-01-22 20:00:00     23.5 2021-01-24 20:00:00         23.5
10  instanceA  20:00:00 2021-01-23 20:00:00     22.0 2021-01-24 20:00:00         23.5
11  instanceA  20:00:00 2021-01-24 20:00:00     23.5 2021-01-24 20:00:00         23.5
12  instanceB  18:00:00 2021-01-22 18:00:00     23.0 2021-01-24 18:00:00         23.0
13  instanceB  18:00:00 2021-01-23 18:00:00     23.0 2021-01-24 18:00:00         23.0
14  instanceB  18:00:00 2021-01-24 18:00:00     23.0 2021-01-24 18:00:00         23.0
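
The answer above fills each gap from the last observed value for the same instance and time of day, which may come from a later date. The question asks for the previous date's values instead, so the following is a minimal sketch of a forward-fill variant, not the answerer's method. The helper name `fill_missing_dates` and its `start`/`end` parameters are hypothetical, and `merge(how="cross")` assumes pandas 1.2 or newer (on older versions the `foo=1` trick above does the same job).

```python
import pandas as pd

def fill_missing_dates(df, start=None, end=None):
    """Hypothetical helper: copy each instance's hourly CPULoad forward into dates with no rows."""
    tmp = df.copy()
    tmp["creTimestamp"] = pd.to_datetime(tmp["creTimestamp"])
    tmp["date"] = tmp["creTimestamp"].dt.normalize()    # midnight of each day
    tmp["offset"] = tmp["creTimestamp"] - tmp["date"]   # time of day as a Timedelta

    # every (instance, time of day) slot that was ever sampled
    slots = tmp[["instnceId", "offset"]].drop_duplicates()
    # every calendar date in the requested range, observed or not
    first = tmp["date"].min() if start is None else pd.Timestamp(start)
    last = tmp["date"].max() if end is None else pd.Timestamp(end)
    dates = pd.DataFrame({"date": pd.date_range(first, last, freq="D")})

    grid = slots.merge(dates, how="cross")              # assumes pandas >= 1.2
    out = grid.merge(tmp[["instnceId", "offset", "date", "CPULoad"]],
                     on=["instnceId", "offset", "date"], how="left")

    # within each slot, order by date and carry the previous date's value forward
    out = out.sort_values(["instnceId", "offset", "date"])
    out["CPULoad"] = out.groupby(["instnceId", "offset"])["CPULoad"].ffill()
    out["creTimestamp"] = out["date"] + out["offset"]

    # slots with no earlier observation stay NaN and are dropped
    return (out.dropna(subset=["CPULoad"])
               .sort_values(["date", "instnceId", "creTimestamp"])
               .reset_index(drop=True)[["creTimestamp", "CPULoad", "instnceId"]])

# the 12 rows from the question: 2021-01-22 and 2021-01-24, two instances, three hours
sample = pd.DataFrame({
    "creTimestamp": [f"{d} {h}:00:00"
                     for d in ("2021-01-22", "2021-01-24")
                     for _ in ("instanceA", "instanceB")
                     for h in (18, 19, 20)],
    "CPULoad": [22.0, 22.5, 23.5, 24.0, 24.5, 22.5,
                23.0, 23.5, 24.0, 25.5, 28.5, 23.5],
    "instnceId": (["instanceA"] * 3 + ["instanceB"] * 3) * 2,
})

# extend the range one day past the last observed date to cover 2021-01-25
result = fill_missing_dates(sample, end="2021-01-25")
print(len(result))  # 24
```

Passing a later `end` (or an earlier `start`) is how a future date range would be generated in this sketch. Dates before a slot's first observation have nothing to copy from, so they are left out rather than back-filled.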


Source: https://stackoverflow.com/questions/66004487/add-data-for-the-missing-dates-based-on-previous-hour-data-in-pandas
