Question
I've got two DataFrames. One has a set of values corresponding to certain times and dates (`df_1`). The other has a set of values corresponding to certain dates (`df_2`). I want to merge these DataFrames such that the values of `df_2` for dates get applied to all times of `df_1` for the corresponding dates.

So, here is `df_1`:
|DatetimeIndex |value_1|
|-----------------------|-------|
|2015-07-18 13:53:33.280|10 |
|2015-07-18 15:43:30.111|11 |
|2015-07-19 13:54:03.330|12 |
|2015-07-20 13:52:13.350|13 |
|2015-07-20 16:10:01.901|14 |
|2015-07-20 16:50:55.020|15 |
|2015-07-21 13:56:03.126|16 |
|2015-07-22 13:53:51.747|17 |
|2015-07-22 19:45:14.647|18 |
|2015-07-23 13:53:29.346|19 |
|2015-07-23 20:00:30.100|20 |
And here is `df_2`:
|DatetimeIndex|value_2|
|-------------|-------|
|2015-07-18 |100 |
|2015-07-19 |200 |
|2015-07-20 |300 |
|2015-07-21 |400 |
|2015-07-22 |500 |
|2015-07-23 |600 |
I want to merge them like this:
|DatetimeIndex |value_1|value_2|
|-----------------------|-------|-------|
|2015-07-18 00:00:00.000|NaN |100 |
|2015-07-18 13:53:33.280|10.0 |100 |
|2015-07-18 15:43:30.111|11.0 |100 |
|2015-07-19 00:00:00.000|NaN |200 |
|2015-07-19 13:54:03.330|12.0 |200 |
|2015-07-20 00:00:00.000|NaN |300 |
|2015-07-20 13:52:13.350|13.0 |300 |
|2015-07-20 16:10:01.901|14.0 |300 |
|2015-07-20 16:50:55.020|15.0 |300 |
|2015-07-21 00:00:00.000|NaN |400 |
|2015-07-21 13:56:03.126|16.0 |400 |
|2015-07-22 00:00:00.000|NaN |500 |
|2015-07-22 13:53:51.747|17.0 |500 |
|2015-07-22 19:45:14.647|18.0 |500 |
|2015-07-23 00:00:00.000|NaN |600 |
|2015-07-23 13:53:29.346|19.0 |600 |
|2015-07-23 20:00:30.100|20.0 |600 |
So, the `value_2` for a date exists throughout that day.
What kind of merge is this called? How can it be done?
Code for the DataFrames is as follows:

    import pandas as pd

    # df_1: values recorded at specific times
    df_1 = pd.DataFrame(
        [
            [pd.Timestamp("2015-07-18 13:53:33.280"), 10],
            [pd.Timestamp("2015-07-18 15:43:30.111"), 11],
            [pd.Timestamp("2015-07-19 13:54:03.330"), 12],
            [pd.Timestamp("2015-07-20 13:52:13.350"), 13],
            [pd.Timestamp("2015-07-20 16:10:01.901"), 14],
            [pd.Timestamp("2015-07-20 16:50:55.020"), 15],
            [pd.Timestamp("2015-07-21 13:56:03.126"), 16],
            [pd.Timestamp("2015-07-22 13:53:51.747"), 17],
            [pd.Timestamp("2015-07-22 19:45:14.647"), 18],
            [pd.Timestamp("2015-07-23 13:53:29.346"), 19],
            [pd.Timestamp("2015-07-23 20:00:30.100"), 20]
        ],
        columns=[
            "datetime",
            "value_1"
        ]
    )
    # Move the "datetime" column into an unnamed DatetimeIndex
    df_1.index = df_1["datetime"]
    del df_1["datetime"]
    df_1.index = pd.to_datetime(df_1.index.values)

    # df_2: one value per date (midnight timestamps)
    df_2 = pd.DataFrame(
        [
            [pd.Timestamp("2015-07-18 00:00:00"), 100],
            [pd.Timestamp("2015-07-19 00:00:00"), 200],
            [pd.Timestamp("2015-07-20 00:00:00"), 300],
            [pd.Timestamp("2015-07-21 00:00:00"), 400],
            [pd.Timestamp("2015-07-22 00:00:00"), 500],
            [pd.Timestamp("2015-07-23 00:00:00"), 600]
        ],
        columns=[
            "datetime",
            "value_2"
        ]
    )
    df_2.index = df_2["datetime"]
    del df_2["datetime"]
    df_2.index = pd.to_datetime(df_2.index.values)

Answer 1:
Solution
Construct a new index that is the union of the two. Then use a combination of `reindex` and `map`:

    idx = df_1.index.union(df_2.index)
    df_1.reindex(idx).assign(value_2=idx.floor('D').map(df_2.value_2.get))

                             value_1  value_2
    2015-07-18 00:00:00.000      NaN      100
    2015-07-18 13:53:33.280     10.0      100
    2015-07-18 15:43:30.111     11.0      100
    2015-07-19 00:00:00.000      NaN      200
    2015-07-19 13:54:03.330     12.0      200
    2015-07-20 00:00:00.000      NaN      300
    2015-07-20 13:52:13.350     13.0      300
    2015-07-20 16:10:01.901     14.0      300
    2015-07-20 16:50:55.020     15.0      300
    2015-07-21 00:00:00.000      NaN      400
    2015-07-21 13:56:03.126     16.0      400
    2015-07-22 00:00:00.000      NaN      500
    2015-07-22 13:53:51.747     17.0      500
    2015-07-22 19:45:14.647     18.0      500
    2015-07-23 00:00:00.000      NaN      600
    2015-07-23 13:53:29.346     19.0      600
    2015-07-23 20:00:30.100     20.0      600

Explanation
- Taking the union of the two should be self-explanatory. However, when taking the union, we automatically get a sorted index as well. That's convenient!
- When we `reindex` `df_1` with this new and improved union of indices, some of the index values will not be present in the index of `df_1`. Without specifying other parameters, the column values for those previously non-existent indices will be `np.nan`, which is what we were going for.
- I use `assign` to add columns.
  - I think it's cleaner
  - It doesn't overwrite the dataframe I'm working with
  - It pipelines well
- `idx.floor('D')` gives me the day while keeping the characteristic of being a `pd.DatetimeIndex`. This allows me to `map` right after it.
- `pd.Index.map` takes a callable
  - I pass `df_2.value_2.get`, which feels a lot like `dict.get` (which I like)
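
The steps in the explanation can be traced one at a time. The following is a minimal sketch, not the answer's original code: it uses shortened stand-ins for `df_1` and `df_2` (three and two rows, chosen here for brevity) and names the intermediate results `reindexed` and `days`, which are illustrative names only.

```python
import pandas as pd

# Shortened stand-ins for the question's frames (assumption: three and two rows suffice).
df_1 = pd.DataFrame(
    {"value_1": [10, 11, 12]},
    index=pd.to_datetime(
        [
            "2015-07-18 13:53:33.280",
            "2015-07-18 15:43:30.111",
            "2015-07-19 13:54:03.330",
        ]
    ),
)
df_2 = pd.DataFrame(
    {"value_2": [100, 200]},
    index=pd.to_datetime(["2015-07-18", "2015-07-19"]),
)

# Step 1: the union holds every timestamp from both frames, in sorted order.
idx = df_1.index.union(df_2.index)

# Step 2: reindexing adds the midnight rows; value_1 is NaN on those rows.
reindexed = df_1.reindex(idx)

# Step 3: floor each timestamp to its day, then look that day up in df_2.
days = idx.floor("D")
result = reindexed.assign(value_2=days.map(df_2["value_2"].get))

print(result)
```

Keeping the intermediate frames separate makes it easy to inspect where the `NaN` rows come from before `value_2` is attached.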
Response to Comment
Suppose `df_2` has several columns. We could use `join` instead:

    df_1.join(df_2.loc[idx.date].set_index(idx), how='outer')

                             value_1  value_2
    2015-07-18 00:00:00.000      NaN      100
    2015-07-18 13:53:33.280     10.0      100
    2015-07-18 15:43:30.111     11.0      100
    2015-07-19 00:00:00.000      NaN      200
    2015-07-19 13:54:03.330     12.0      200
    2015-07-20 00:00:00.000      NaN      300
    2015-07-20 13:52:13.350     13.0      300
    2015-07-20 16:10:01.901     14.0      300
    2015-07-20 16:50:55.020     15.0      300
    2015-07-21 00:00:00.000      NaN      400
    2015-07-21 13:56:03.126     16.0      400
    2015-07-22 00:00:00.000      NaN      500
    2015-07-22 13:53:51.747     17.0      500
    2015-07-22 19:45:14.647     18.0      500
    2015-07-23 00:00:00.000      NaN      600
    2015-07-23 13:53:29.346     19.0      600
    2015-07-23 20:00:30.100     20.0      600

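
To illustrate the multi-column case, here is a hedged sketch with a hypothetical second column, `value_3`, which is invented for this example and does not appear in the question. It also swaps `idx.date` for `idx.floor("D")` in the lookup: this is an assumption made for portability, in case `.loc` with `datetime.date` keys raises a `KeyError` on your pandas version.

```python
import pandas as pd

# Shortened stand-in for df_1.
df_1 = pd.DataFrame(
    {"value_1": [10, 11, 12]},
    index=pd.to_datetime(
        [
            "2015-07-18 13:53:33.280",
            "2015-07-18 15:43:30.111",
            "2015-07-19 13:54:03.330",
        ]
    ),
)
# Hypothetical multi-column df_2; value_3 is invented for illustration.
df_2 = pd.DataFrame(
    {"value_2": [100, 200], "value_3": ["a", "b"]},
    index=pd.to_datetime(["2015-07-18", "2015-07-19"]),
)

idx = df_1.index.union(df_2.index)

# Look up each timestamp's day in df_2 (rows repeat once per timestamp),
# then relabel those rows with the full timestamps.
daily = df_2.loc[idx.floor("D")].set_index(idx)

# The outer join keeps the midnight rows, where value_1 is NaN.
result = df_1.join(daily, how="outer")

print(result)
```

Every column of `df_2` is carried along by the same lookup, so no per-column `map` call is needed.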
This may seem like a better answer because it is shorter, but it is slower for the single-column case. By all means, use it for the multi-column case.

    %timeit df_1.reindex(idx).assign(value_2=idx.floor('D').map(df_2.value_2.get))
    %timeit df_1.join(df_2.loc[idx.date].set_index(idx), how='outer')

    1.56 ms ± 69 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    2.38 ms ± 591 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

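
The `%timeit` magic only works in IPython or Jupyter. As a rough sketch of the same comparison in a plain script, the standard-library `timeit` module can be used. The frames are shortened stand-ins, the function names `reindex_map` and `join_outer` and the repeat count `runs` are illustrative choices, and the join uses `idx.floor("D")` rather than `idx.date` (an assumption, for portability across pandas versions). Timings will differ by machine and version, so no expected figures are given.

```python
import timeit

import pandas as pd

# Shortened stand-ins for the question's frames so the script stands alone.
df_1 = pd.DataFrame(
    {"value_1": [10, 11, 12, 13]},
    index=pd.to_datetime(
        [
            "2015-07-18 13:53:33.280",
            "2015-07-18 15:43:30.111",
            "2015-07-19 13:54:03.330",
            "2015-07-20 13:52:13.350",
        ]
    ),
)
df_2 = pd.DataFrame(
    {"value_2": [100, 200, 300]},
    index=pd.to_datetime(["2015-07-18", "2015-07-19", "2015-07-20"]),
)
idx = df_1.index.union(df_2.index)


def reindex_map():
    # Single-column approach: reindex, then map each day to its value_2.
    return df_1.reindex(idx).assign(value_2=idx.floor("D").map(df_2["value_2"].get))


def join_outer():
    # Join approach; idx.floor("D") stands in for idx.date (an assumption).
    return df_1.join(df_2.loc[idx.floor("D")].set_index(idx), how="outer")


runs = 200  # arbitrary repeat count
for name, func in [("reindex + map", reindex_map), ("join", join_outer)]:
    seconds = timeit.timeit(func, number=runs) / runs
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```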
Source: https://stackoverflow.com/questions/46654734/how-can-dataframes-be-merged-such-that-the-values-of-one-that-correspond-to-dat