Question
I have a DataFrame like the one below:
id | creTimestamp | CPULoad | instnceId |
---|---|---|---|
0 | 2021-01-22 18:00:00 | 22.0 | instanceA |
1 | 2021-01-22 19:00:00 | 22.5 | instanceA |
2 | 2021-01-22 20:00:00 | 23.5 | instanceA |
3 | 2021-01-22 18:00:00 | 24.0 | instanceB |
4 | 2021-01-22 19:00:00 | 24.5 | instanceB |
5 | 2021-01-22 20:00:00 | 22.5 | instanceB |
6 | 2021-01-24 18:00:00 | 23.0 | instanceA |
7 | 2021-01-24 19:00:00 | 23.5 | instanceA |
8 | 2021-01-24 20:00:00 | 24.0 | instanceA |
9 | 2021-01-24 18:00:00 | 25.5 | instanceB |
10 | 2021-01-24 19:00:00 | 28.5 | instanceB |
11 | 2021-01-24 20:00:00 | 23.5 | instanceB |
Data is missing for the following dates:
2021-01-23 and 2021-01-25
I want to fill in the rows for 2021-01-23 and 2021-01-25 using the data from the previous date. For example, the hourly data from the 22nd should be used for the 23rd. I have a huge dataset where the entire data for a date can be missing, or data can be missing for 2 hours. The dates may also need to be generated for a future date range, for example from 2021-02-01 18:00:00 to 2021-02-02 18:00:00.
The updated DataFrame should look like this:
id | creTimestamp | CPULoad | instnceId |
---|---|---|---|
0 | 2021-01-22 18:00:00 | 22.0 | instanceA |
1 | 2021-01-22 19:00:00 | 22.5 | instanceA |
2 | 2021-01-22 20:00:00 | 23.5 | instanceA |
3 | 2021-01-22 18:00:00 | 24.0 | instanceB |
4 | 2021-01-22 19:00:00 | 24.5 | instanceB |
5 | 2021-01-22 20:00:00 | 22.5 | instanceB |
6 | 2021-01-23 18:00:00 | 22.0 | instanceA |
7 | 2021-01-23 19:00:00 | 22.5 | instanceA |
8 | 2021-01-23 20:00:00 | 23.5 | instanceA |
9 | 2021-01-23 18:00:00 | 24.0 | instanceB |
10 | 2021-01-23 19:00:00 | 24.5 | instanceB |
11 | 2021-01-23 20:00:00 | 22.5 | instanceB |
12 | 2021-01-24 18:00:00 | 23.0 | instanceA |
13 | 2021-01-24 19:00:00 | 23.5 | instanceA |
14 | 2021-01-24 20:00:00 | 24.0 | instanceA |
15 | 2021-01-24 18:00:00 | 25.5 | instanceB |
16 | 2021-01-24 19:00:00 | 28.5 | instanceB |
17 | 2021-01-24 20:00:00 | 23.5 | instanceB |
18 | 2021-01-25 18:00:00 | 23.0 | instanceA |
19 | 2021-01-25 19:00:00 | 23.5 | instanceA |
20 | 2021-01-25 20:00:00 | 24.0 | instanceA |
21 | 2021-01-25 18:00:00 | 25.5 | instanceB |
22 | 2021-01-25 19:00:00 | 28.5 | instanceB |
23 | 2021-01-25 20:00:00 | 23.5 | instanceB |
The date range can cover the last 7 days.
Please help me with this requirement.
Thanks
Answer 1:
This is a continuation of fill values
- generate a DF that is the combination of sampled hours and instances (df2). This generates 15 rows, as there are 3 times for instanceA and 2 times for instanceB across 3 dates: (2+3)*3
- then use the same technique to fill both CPULoad and the synthesized memload
- tested against pandas 1.0.1 as well as 1.2.0
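For readers on pandas 1.2.0 or later, the combination step from the first bullet can also be written with `how="cross"` instead of a dummy join key. This is only a sketch under that version assumption; the names `slots`, `dates`, and `grid` are made up for illustration and are not part of the answer.

```python
import pandas as pd

# Same sample as the answer: 3 sampled times for instanceA, 2 for instanceB, 3 dates.
df = pd.DataFrame({
    "creTimestamp": pd.to_datetime([
        "2021-01-22 18:00:00", "2021-01-22 19:00:00", "2021-01-22 20:00:00",
        "2021-01-23 18:00:00", "2021-01-23 20:00:00",
        "2021-01-24 18:00:00", "2021-01-24 20:00:00",
    ]),
    "CPULoad": [22.0, 22.0, 23.0, 24.0, 22.0, 23.0, 23.5],
    "instnceId": ["instanceA", "instanceA", "instanceB", "instanceA",
                  "instanceA", "instanceB", "instanceA"],
})

# Distinct (instance, time-of-day) pairs that were ever sampled.
slots = (
    df.assign(timestamp=df["creTimestamp"].dt.time)[["instnceId", "timestamp"]]
    .drop_duplicates()
)

# Distinct calendar dates present in the data.
dates = df["creTimestamp"].dt.normalize().drop_duplicates().to_frame(name="date")

# Cartesian product of the two frames; how="cross" requires pandas >= 1.2.0.
grid = slots.merge(dates, how="cross")

# Rebuild the full timestamp from the date and the time of day.
grid["creTimestamp"] = grid["date"] + pd.to_timedelta(grid["timestamp"].astype(str))

print(len(grid))  # (3 + 2) * 3 = 15
```

On pandas 1.0.x the `foo=1` dummy key used in the answer remains the way to get the same product.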
import pandas as pd
import io
import datetime as dt
import numpy as np

# sample data; columns are separated by two or more spaces so that
# the single space inside each timestamp is not treated as a separator
df = pd.read_csv(io.StringIO("""id  creTimestamp  CPULoad  instnceId
0  2021-01-22 18:00:00  22.0  instanceA
1  2021-01-22 19:00:00  22.0  instanceA
2  2021-01-22 20:00:00  23.0  instanceB
3  2021-01-23 18:00:00  24.0  instanceA
4  2021-01-23 20:00:00  22.0  instanceA
5  2021-01-24 18:00:00  23.0  instanceB
6  2021-01-24 20:00:00  23.5  instanceA
"""), sep=r"\s\s+", engine="python", index_col=0)
df.creTimestamp = pd.to_datetime(df.creTimestamp)
# synthesized second metric, to show that several columns can be filled the same way
df["memload"] = np.random.random(len(df))

# generate a DF for each time in instance in each date
df2 = (pd.merge(
        # for each time in instance
        df.assign(timestamp=df.creTimestamp.dt.time)
          .loc[:, ["instnceId", "timestamp"]]
          .drop_duplicates()
          .assign(foo=1),
        # for each date
        df.creTimestamp.dt.date.drop_duplicates().to_frame().assign(foo=1),
        # the constant key "foo" turns the merge into a cross join
        on="foo"
    )
    # rebuild the full timestamp from the date and the time of day
    .assign(creTimestamp=lambda dfa: dfa.apply(
        lambda r: dt.datetime.combine(r["creTimestamp"], r["timestamp"]), axis=1))
    .drop(columns="foo")
    # merge values back..
    .merge(df, on=["creTimestamp", "instnceId"], how="left")
)

# now get values to fill NaN: the last non-null row for each instance and time of day
df2 = (df2.merge(df2.dropna().drop_duplicates(subset=["instnceId", "timestamp"], keep="last"),
                 on=["timestamp", "instnceId"], suffixes=("", "_pre"))
       .assign(CPULoad=lambda dfa: dfa.CPULoad.fillna(dfa.CPULoad_pre))
       .assign(memload=lambda dfa: dfa.memload.fillna(dfa.memload_pre))
)
Output (the memload columns are not shown):
| | instnceId | timestamp | creTimestamp | CPULoad | creTimestamp_pre | CPULoad_pre |
|---|---|---|---|---|---|---|
| 0 | instanceA | 18:00:00 | 2021-01-22 18:00:00 | 22.0 | 2021-01-23 18:00:00 | 24.0 |
| 1 | instanceA | 18:00:00 | 2021-01-23 18:00:00 | 24.0 | 2021-01-23 18:00:00 | 24.0 |
| 2 | instanceA | 18:00:00 | 2021-01-24 18:00:00 | 24.0 | 2021-01-23 18:00:00 | 24.0 |
| 3 | instanceA | 19:00:00 | 2021-01-22 19:00:00 | 22.0 | 2021-01-22 19:00:00 | 22.0 |
| 4 | instanceA | 19:00:00 | 2021-01-23 19:00:00 | 22.0 | 2021-01-22 19:00:00 | 22.0 |
| 5 | instanceA | 19:00:00 | 2021-01-24 19:00:00 | 22.0 | 2021-01-22 19:00:00 | 22.0 |
| 6 | instanceB | 20:00:00 | 2021-01-22 20:00:00 | 23.0 | 2021-01-22 20:00:00 | 23.0 |
| 7 | instanceB | 20:00:00 | 2021-01-23 20:00:00 | 23.0 | 2021-01-22 20:00:00 | 23.0 |
| 8 | instanceB | 20:00:00 | 2021-01-24 20:00:00 | 23.0 | 2021-01-22 20:00:00 | 23.0 |
| 9 | instanceA | 20:00:00 | 2021-01-22 20:00:00 | 23.5 | 2021-01-24 20:00:00 | 23.5 |
| 10 | instanceA | 20:00:00 | 2021-01-23 20:00:00 | 22.0 | 2021-01-24 20:00:00 | 23.5 |
| 11 | instanceA | 20:00:00 | 2021-01-24 20:00:00 | 23.5 | 2021-01-24 20:00:00 | 23.5 |
| 12 | instanceB | 18:00:00 | 2021-01-22 18:00:00 | 23.0 | 2021-01-24 18:00:00 | 23.0 |
| 13 | instanceB | 18:00:00 | 2021-01-23 18:00:00 | 23.0 | 2021-01-24 18:00:00 | 23.0 |
| 14 | instanceB | 18:00:00 | 2021-01-24 18:00:00 | 23.0 | 2021-01-24 18:00:00 | 23.0 |
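The answer builds its grid only from dates that already appear in the data and fills each gap with the last available sample for that instance and time of day. If whole dates are absent, as with 2021-01-23 and 2021-01-25 in the question, one possible variation is to reindex against an explicit date range and forward-fill within each (instance, time of day) group. The sketch below is not the answer's method; the date range and the names `tod`, `full_dates`, `full_index`, and `filled` are assumptions chosen for illustration.

```python
import pandas as pd

# Sample data from the question: two instances, three hours, two observed dates.
hours = ["18:00:00", "19:00:00", "20:00:00"]
observed = [
    ("2021-01-22", "instanceA", [22.0, 22.5, 23.5]),
    ("2021-01-22", "instanceB", [24.0, 24.5, 22.5]),
    ("2021-01-24", "instanceA", [23.0, 23.5, 24.0]),
    ("2021-01-24", "instanceB", [25.5, 28.5, 23.5]),
]
df = pd.DataFrame(
    [
        {"creTimestamp": f"{day} {hour}", "CPULoad": load, "instnceId": inst}
        for day, inst, loads in observed
        for hour, load in zip(hours, loads)
    ]
)
df["creTimestamp"] = pd.to_datetime(df["creTimestamp"])

# Split each timestamp into a calendar date and a time-of-day offset.
df["date"] = df["creTimestamp"].dt.normalize()
df["tod"] = df["creTimestamp"] - df["date"]

# Every date the result should contain (assumed range; extend it for future dates).
full_dates = pd.date_range("2021-01-22", "2021-01-25", freq="D")

# Index by (instance, time of day, date), then reindex to the full grid.
s = df.set_index(["instnceId", "tod", "date"])["CPULoad"]
full_index = pd.MultiIndex.from_product(
    [df["instnceId"].unique(), df["tod"].unique(), full_dates],
    names=["instnceId", "tod", "date"],
)

filled = (
    s.reindex(full_index)
    # Within each (instance, time of day) group, copy the most recent earlier date.
    .groupby(level=["instnceId", "tod"])
    .ffill()
    # Slots with no earlier observation stay NaN and are dropped.
    .dropna()
    .reset_index()
)
filled["creTimestamp"] = filled["date"] + filled["tod"]
filled = (
    filled.sort_values(["date", "instnceId", "tod"])
    .reset_index(drop=True)[["creTimestamp", "CPULoad", "instnceId"]]
)
print(filled)
```

With the question's sample this yields 24 rows, where the rows for 2021-01-23 repeat the values from 2021-01-22 and the rows for 2021-01-25 repeat those from 2021-01-24. Because the fill is a forward fill, a slot is only populated once it has been observed on some earlier date.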
Source: https://stackoverflow.com/questions/66004487/add-data-for-the-missing-dates-based-on-previous-hour-data-in-pandas