assessing if date time function in each row of df falls within range of date time in another df

Submitted by こ雲淡風輕ζ on 2020-07-21 04:26:14

Question


I am new to Python, and need some help with a question I am having regarding datetime handling.

I have df_a which has a column titled time, and I am trying to create a new column id in this df_a.

I want the id column to be determined by whether the time falls within one of the ranges defined by the "date" and "date_new" columns of df_b. For example, the first row of df_b has a "date" of "2019-01-07 20:52:41" and a "date_new" of "2019-01-07 21:07:41" (a 15-minute interval); I would like the index of that row (id=0) to appear as the id in df_a wherever the time falls in that interval, e.g. "2019-01-07 20:56:30", and so on for all the rows in df_a.

This similar question suggests the approach below, but I cannot figure out how to make it work with my data, as I keep getting an error:

python assign value to pandas df if falls between range of dates in another df

s = pd.Series(df_b['id'].values,pd.IntervalIndex.from_arrays(df_b['date'],df_b['date_new'])) 
df_a['id']=df_a['time'].map(s)

ValueError: cannot handle non-unique indices

One caveat is that the ranges in df_b are not always unique: some of the intervals overlap the same periods of time. In those cases it is fine to use the id of the first interval in df_b that the time falls in. Additionally, there are over 200 rows in df_b and 2000 in df_a, so defining each time period in a for-loop would take too long, unless there is an easier way to do it. Thank you in advance for all of your help! If this could use any clarification, please let me know!

df_a

time                    id
2019-01-07 22:02:56     NaN
2019-01-07 21:57:12     NaN
2019-01-08 09:35:30     NaN


df_b

date                    date_new               id
2019-01-07 21:50:56    2019-01-07 22:05:56     0
2019-01-08 09:30:30    2019-01-08 09:45:30     1

Expected Result

df_a     
time                    id
2019-01-07 22:02:56     0
2019-01-07 21:57:12     0
2019-01-08 09:35:30     1
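The ValueError above comes from the overlapping intervals: `Series.map` over an `IntervalIndex` cannot resolve a time that falls in more than one interval. A minimal first-match workaround (a sketch, not part of the original question, using `IntervalIndex.contains` on the question's sample data) might look like:

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({'time': pd.to_datetime(
    ['2019-01-07 22:02:56', '2019-01-07 21:57:12', '2019-01-08 09:35:30'])})
df_b = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-07 21:50:56', '2019-01-08 09:30:30']),
    'date_new': pd.to_datetime(['2019-01-07 22:05:56', '2019-01-08 09:45:30']),
    'id': [0, 1]})

# closed='both' so the interval endpoints count as matches
intervals = pd.IntervalIndex.from_arrays(df_b['date'], df_b['date_new'],
                                         closed='both')

def first_matching_id(t):
    # boolean mask of intervals containing t; take the first hit, if any
    hits = np.flatnonzero(intervals.contains(t))
    return df_b['id'].iloc[hits[0]] if len(hits) else np.nan

df_a['id'] = df_a['time'].map(first_matching_id)
```

Because `first_matching_id` always takes the first hit, overlapping intervals no longer raise, matching the "use the first time period" requirement.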

Answer 1:


Let me rephrase your problem. For each row in dataframe df_a, you want to check whether its value in df_a['time'] falls in the interval given by the values in columns df_b['date'] and df_b['date_new']. If so, set the value in df_a["id"] to the corresponding df_b["id"].

If this is your question, this is a (very rough) solution:

# brute-force nested loop: for each row of df_a, scan df_b for the
# first interval that contains the time, then stop scanning
for ia, ra in df_a.iterrows():
    for ib, rb in df_b.iterrows():
        if (ra["time"] >= rb["date"]) and (ra["time"] <= rb["date_new"]):
            df_a.loc[ia, "id"] = rb["id"]
            break
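At ~2000 × 200 rows, the double loop above can also be replaced by a single broadcast comparison in NumPy. A rough sketch (using the question's toy data), where the first matching interval per row is kept, consistent with the `break` above:

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({'time': pd.to_datetime(
    ['2019-01-07 22:02:56', '2019-01-07 21:57:12', '2019-01-08 09:35:30'])})
df_b = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-07 21:50:56', '2019-01-08 09:30:30']),
    'date_new': pd.to_datetime(['2019-01-07 22:05:56', '2019-01-08 09:45:30']),
    'id': [0, 1]})

# (len(df_a), len(df_b)) matrix: True where the time falls in the interval
t = df_a['time'].values[:, None]
in_range = (t >= df_b['date'].values) & (t <= df_b['date_new'].values)

first = in_range.argmax(axis=1)  # index of the first matching interval
df_a['id'] = np.where(in_range.any(axis=1), df_b['id'].values[first], np.nan)
```

This trades memory (one boolean matrix) for speed, which is a fine trade at these sizes.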



Answer 2:


pandas doesn't have great support for non-equi joins, which is what you are looking for, but it does have a function merge_asof, which you might want to check out: http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.merge_asof.html

This should significantly speed up your join.

For example:

df_a = pd.DataFrame({'time': ['2019-01-07 22:02:56', '2019-01-07 21:57:12',
                              '2019-01-08 09:35:30']})
df_b = pd.DataFrame({'date': ['2019-01-07 21:50:56', '2019-01-08 09:30:30'],
                     'date_new': ['2019-01-07 22:05:56', '2019-01-08 09:45:30'],
                     'id': [0, 1]})
df_a['time'] = pd.to_datetime(df_a['time'])
df_b['date'] = pd.to_datetime(df_b['date'])
df_b['date_new'] = pd.to_datetime(df_b['date_new'])

# you need to sort df_a on 'time' (and df_b on 'date') before using merge_asof
df_a.sort_values('time', inplace=True)
result = pd.merge_asof(df_a, df_b, left_on='time', right_on='date')

# drop rows where df_a.time falls after the end of the matched interval
# (<= keeps the boundary inclusive, matching the question's requirement)
result = result[result.time <= result.date_new]
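Putting those steps together and keeping only the question's original columns, a complete self-contained sketch (on the question's sample data) could be:

```python
import pandas as pd

df_a = pd.DataFrame({'time': pd.to_datetime(
    ['2019-01-07 22:02:56', '2019-01-07 21:57:12', '2019-01-08 09:35:30'])})
df_b = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-07 21:50:56', '2019-01-08 09:30:30']),
    'date_new': pd.to_datetime(['2019-01-07 22:05:56', '2019-01-08 09:45:30']),
    'id': [0, 1]})

# both frames must be sorted on their join keys for merge_asof
merged = pd.merge_asof(df_a.sort_values('time'), df_b.sort_values('date'),
                       left_on='time', right_on='date')

# discard matches where the time falls past the end of the interval
merged = merged[merged['time'] <= merged['date_new']]
df_a = merged[['time', 'id']].reset_index(drop=True)
```

Note that merge_asof only matches each time to the nearest earlier 'date', so with heavily overlapping intervals it may miss earlier intervals that also contain the time; the interval-based approaches handle that case directly.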


Source: https://stackoverflow.com/questions/55454173/assessing-if-date-time-function-in-each-row-of-df-falls-within-range-of-date-tim
