Question
I am new to Python, and need some help with a question regarding datetimes. I have a DataFrame df_a with a column titled "time", and I am trying to create a new column "id" in df_a.
I want the "id" column to be determined by whether or not the time is contained within a range of times in df_b, between its "date" and "date_new" columns. For example, the first row of df_b has a "date" of "2019-01-07 20:52:41" and a "date_new" of "2019-01-07 21:07:41" (a 15-minute interval); I would like the index of that row (i.e. id=0) to appear as the id in df_a whenever the time is, say, "2019-01-07 20:56:30", and so on for all the rows in df_a.
This question is similar, but I cannot figure out how to make it work with mine:
python assign value to pandas df if falls between range of dates in another df
When I try
s = pd.Series(df_b['id'].values, pd.IntervalIndex.from_arrays(df_b['date'], df_b['date_new']))
df_a['id'] = df_a['time'].map(s)
I keep getting
ValueError: cannot handle non-unique indices
One caveat is that the ranges in df_b are not always unique; some of the intervals cover the same periods of time. In these cases it is fine if it uses the id of the first time period in df_b that the time falls in. Additionally, there are over 200 rows in df_b and 2000 in df_a, so it would take too long to define each time period in a for-loop style, unless there is an easier way than defining each one. Thank you in advance for all of your help! If this could use any clarification, please let me know!
df_a

time                  id
2019-01-07 22:02:56   NaN
2019-01-07 21:57:12   NaN
2019-01-08 09:35:30   NaN

df_b

date                  date_new              id
2019-01-07 21:50:56   2019-01-07 22:05:56   0
2019-01-08 09:30:30   2019-01-08 09:45:30   1
Expected Result
df_a

time                  id
2019-01-07 22:02:56   0
2019-01-07 21:57:12   0
2019-01-08 09:35:30   1
Answer 1:
Let me rephrase your problem. For each row in dataframe df_a, you want to check whether its value in df_a['time'] is in the interval given by the values in columns df_b['date'] and df_b['date_new']. If so, set the value in df_a['id'] to the corresponding df_b['id'].
If this is your question, this is a (very rough) solution:
for ia, ra in df_a.iterrows():
    for ib, rb in df_b.iterrows():
        if (ra["time"] >= rb["date"]) and (ra["time"] <= rb["date_new"]):
            df_a.loc[ia, "id"] = rb["id"]
            break  # keep the first matching interval only
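Since the question mentions roughly 2000 rows in df_a and 200 in df_b, the double loop above may be slow. A vectorized alternative, sketched here with NumPy broadcasting on the question's sample data (building the full rows-of-df_a by rows-of-df_b containment mask in one shot), could look like this:

```python
import numpy as np
import pandas as pd

# sample data from the question
df_a = pd.DataFrame({'time': pd.to_datetime(
    ['2019-01-07 22:02:56', '2019-01-07 21:57:12', '2019-01-08 09:35:30'])})
df_b = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-07 21:50:56', '2019-01-08 09:30:30']),
    'date_new': pd.to_datetime(['2019-01-07 22:05:56', '2019-01-08 09:45:30']),
    'id': [0, 1]})

# mask[i, j] is True when df_a time i falls inside df_b interval j
t = df_a['time'].values[:, None]
mask = (t >= df_b['date'].values) & (t <= df_b['date_new'].values)

# argmax over each row picks the FIRST matching interval, which is exactly
# the tie-breaking rule the question asks for; rows with no match get NaN
first = mask.argmax(axis=1)
df_a['id'] = np.where(mask.any(axis=1), df_b['id'].values[first], np.nan)
```

This also handles the overlapping intervals in df_b that made IntervalIndex raise "cannot handle non-unique indices", at the cost of materializing a 2000x200 boolean array (small at this scale).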
Answer 2:
pandas doesn't have great support for non-equi joins, which is what you are looking for, but it does have a function merge_asof which you might want to check out:
http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.merge_asof.html
This should significantly speed up your join.
For example:
df_a = pd.DataFrame({'time': ['2019-01-07 22:02:56', '2019-01-07 21:57:12', '2019-01-08 09:35:30']})
df_b = pd.DataFrame({'date': ['2019-01-07 21:50:56', '2019-01-08 09:30:30'],
                     'date_new': ['2019-01-07 22:05:56', '2019-01-08 09:45:30'],
                     'id': [0, 1]})
df_a['time'] = pd.to_datetime(df_a['time'])
df_b['date'] = pd.to_datetime(df_b['date'])
df_b['date_new'] = pd.to_datetime(df_b['date_new'])
# merge_asof requires df_a to be sorted on 'time' (and df_b on 'date')
df_a.sort_values('time', inplace=True)
result = pd.merge_asof(df_a, df_b, left_on='time', right_on='date')
#get rid of rows where df_a.time values are greater than df_b's new date
result = result[result.date_new > result.time]
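One thing to note (a variant sketch, not part of the original answer): the filter above drops rows of df_a whose time falls outside every interval. If you would rather keep those rows with id set to NaN, you can mask the id column instead of filtering rows, for example:

```python
import pandas as pd

df_a = pd.DataFrame({'time': ['2019-01-07 22:02:56', '2019-01-07 21:57:12', '2019-01-08 09:35:30']})
df_b = pd.DataFrame({'date': ['2019-01-07 21:50:56', '2019-01-08 09:30:30'],
                     'date_new': ['2019-01-07 22:05:56', '2019-01-08 09:45:30'],
                     'id': [0, 1]})
df_a['time'] = pd.to_datetime(df_a['time'])
df_b['date'] = pd.to_datetime(df_b['date'])
df_b['date_new'] = pd.to_datetime(df_b['date_new'])

df_a = df_a.sort_values('time')
result = pd.merge_asof(df_a, df_b, left_on='time', right_on='date')
# keep unmatched rows but blank out their id, instead of dropping them
result['id'] = result['id'].where(result['time'] <= result['date_new'])
result = result.drop(columns=['date', 'date_new'])
```

With the sample data every time matches, so the two variants agree; they differ only when some times fall in no interval.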
Source: https://stackoverflow.com/questions/55454173/assessing-if-date-time-function-in-each-row-of-df-falls-within-range-of-date-tim