Currently I have two data frames representing Excel spreadsheets. I wish to join the data where the dates are equal. This is a one-to-many join, as one spreadsheet has a dat
So here's the option with merging:
Assume you have two DataFrames:
import pandas as pd
df1 = pd.DataFrame({'date': ['2015-01-01', '2015-01-02', '2015-01-03'],
'data': ['A', 'B', 'C']})
df2 = pd.DataFrame({'date': ['2015-01-01 to 2015-01-02', '2015-01-01 to 2015-01-02', '2015-01-02 to 2015-01-03'],
'data': ['E', 'F', 'G']})
Now do some cleaning: split each range in df2 into start and end columns, and make sure all of the dates are datetime:
df1['date'] = pd.to_datetime(df1.date)
df2[['start', 'end']] = df2['date'].str.split(' to ', expand=True)
df2['start'] = pd.to_datetime(df2.start)
df2['end'] = pd.to_datetime(df2.end)
# No need for this anymore
df2 = df2.drop(columns='date')
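Now merge it all together. Merging on a constant dummy key is a cross join, so every row of df1 is paired with every row of df2. You'll get 99x10K rows.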
df = df1.assign(dummy=1).merge(df2.assign(dummy=1), on='dummy').drop(columns='dummy')
And subset to the dates that fall in between the ranges:
df[(df.date >= df.start) & (df.date <= df.end)]
#        date data_x data_y      start        end
#0 2015-01-01      A      E 2015-01-01 2015-01-02
#1 2015-01-01      A      F 2015-01-01 2015-01-02
#3 2015-01-02      B      E 2015-01-01 2015-01-02
#4 2015-01-02      B      F 2015-01-01 2015-01-02
#5 2015-01-02      B      G 2015-01-02 2015-01-03
#8 2015-01-03      C      G 2015-01-02 2015-01-03
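If you are on pandas 1.2 or newer, the dummy-key trick can likely be replaced by merge(..., how='cross'). Here is a minimal sketch of that variant, assuming df2 has already been split into start and end columns as above. The names crossed and result are only for illustration, and Series.between is swapped in for the two explicit comparisons:

```python
import pandas as pd

df1 = pd.DataFrame({'date': pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-03']),
                    'data': ['A', 'B', 'C']})
# Assumed shape of df2 after the cleaning step (start/end already datetime)
df2 = pd.DataFrame({'start': pd.to_datetime(['2015-01-01', '2015-01-01', '2015-01-02']),
                    'end': pd.to_datetime(['2015-01-02', '2015-01-02', '2015-01-03']),
                    'data': ['E', 'F', 'G']})

# how='cross' pairs every row of df1 with every row of df2 (3 x 3 = 9 rows here)
crossed = df1.merge(df2, how='cross')

# Keep only the pairs whose date falls inside the range, both ends included
result = crossed[crossed['date'].between(crossed['start'], crossed['end'])]
print(result)
```

This still materialises the full cross product before filtering, so memory use grows with len(df1) * len(df2), the same as with the dummy key.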
If, for instance, some dates in df2 were a single date, then since we're using .str.split we will get None for the second date. Then just use .loc to set it appropriately.
df2 = pd.DataFrame({'date': ['2015-01-01 to 2015-01-02', '2015-01-01 to 2015-01-02', '2015-01-02 to 2015-01-03',
'2015-01-03'],
'data': ['E', 'F', 'G', 'H']})
df2[['start', 'end']] = df2['date'].str.split(' to ', expand=True)
df2.loc[df2.end.isnull(), 'end'] = df2.loc[df2.end.isnull(), 'start']
# (shown without the original 'date' column)
#  data      start        end
#0    E 2015-01-01 2015-01-02
#1    F 2015-01-01 2015-01-02
#2    G 2015-01-02 2015-01-03
#3    H 2015-01-03 2015-01-03
Now the rest (the datetime conversion, dropping the old date column, the merge and the filter) follows unchanged.
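Put together, the single-date handling might look like the sketch below. It uses fillna in place of the .loc assignment, which should be equivalent here, and a shortened two-row df2 made up purely for illustration:

```python
import pandas as pd

# Illustrative data: one real range and one single date
df2 = pd.DataFrame({'date': ['2015-01-01 to 2015-01-02', '2015-01-03'],
                    'data': ['E', 'H']})

# Rows without ' to ' get a missing value in the second column
df2[['start', 'end']] = df2['date'].str.split(' to ', expand=True)

# A single date is treated as a range that starts and ends on the same day
df2['end'] = df2['end'].fillna(df2['start'])

df2['start'] = pd.to_datetime(df2['start'])
df2['end'] = pd.to_datetime(df2['end'])
df2 = df2.drop(columns='date')
print(df2)
```

One caveat: if no row at all contains ' to ', the split yields only one column and the two-column assignment will fail, so that case would need its own check.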
Let's use this numpy method by @piRSquared:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'date': ['2015-01-01', '2015-01-02', '2015-01-03'],
'data': ['A', 'B', 'C']})
df2 = pd.DataFrame({'date': ['2015-01-01 to 2015-01-02', '2015-01-01 to 2015-01-02', '2015-01-02 to 2015-01-03'],
'data': ['E', 'F', 'G']})
df2[['start', 'end']] = df2['date'].str.split(' to ', expand=True)
df2['start'] = pd.to_datetime(df2.start)
df2['end'] = pd.to_datetime(df2.end)
df1['date'] = pd.to_datetime(df1['date'])
a = df1['date'].values
bh = df2['end'].values
bl = df2['start'].values
i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))
pd.DataFrame(np.column_stack([df1.values[i], df2.values[j]]),
columns=df1.columns.append(df2.columns))
Output:
                  date  data                      date  data                start                  end
0  2015-01-01 00:00:00     A  2015-01-01 to 2015-01-02     E  2015-01-01 00:00:00  2015-01-02 00:00:00
1  2015-01-01 00:00:00     A  2015-01-01 to 2015-01-02     F  2015-01-01 00:00:00  2015-01-02 00:00:00
2  2015-01-02 00:00:00     B  2015-01-01 to 2015-01-02     E  2015-01-01 00:00:00  2015-01-02 00:00:00
3  2015-01-02 00:00:00     B  2015-01-01 to 2015-01-02     F  2015-01-01 00:00:00  2015-01-02 00:00:00
4  2015-01-02 00:00:00     B  2015-01-02 to 2015-01-03     G  2015-01-02 00:00:00  2015-01-03 00:00:00
5  2015-01-03 00:00:00     C  2015-01-02 to 2015-01-03     G  2015-01-02 00:00:00  2015-01-03 00:00:00
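Two things in that output are worth noting: the column names are duplicated, and stacking .values turns every column into object dtype. A possible variation, sketched below, keeps the same broadcasting comparison but builds the result with iloc and concat so the datetime columns keep their dtype. The data_x / data_y renames and the names left, right and joined are my own choices, not part of the original method:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'date': pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-03']),
                    'data': ['A', 'B', 'C']})
df2 = pd.DataFrame({'start': pd.to_datetime(['2015-01-01', '2015-01-01', '2015-01-02']),
                    'end': pd.to_datetime(['2015-01-02', '2015-01-02', '2015-01-03']),
                    'data': ['E', 'F', 'G']})

a = df1['date'].values
bl = df2['start'].values
bh = df2['end'].values

# a[:, None] has shape (len(df1), 1); comparing it with bl and bh, each of
# shape (len(df2),), broadcasts to a (len(df1), len(df2)) boolean grid.
# np.where returns the row (df1) and column (df2) positions of the True cells.
i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))

# Select the matching rows positionally and place them side by side
left = df1.iloc[i].reset_index(drop=True).rename(columns={'data': 'data_x'})
right = df2.iloc[j].reset_index(drop=True).rename(columns={'data': 'data_y'})
joined = pd.concat([left, right], axis=1)
print(joined)
```

The boolean grid still has len(df1) * len(df2) cells, but it holds one byte per pair rather than a full merged row, which is usually the reason this approach is lighter than the cross merge.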