Python dataframe check if a value in a column dataframe is within a range of values reported in another dataframe

Submitted by 試著忘記壹切 on 2020-01-15 05:31:29

Question


Apologies if the problem is trivial, but as a Python newbie I wasn't able to find the right solution.

I have two dataframes, and I need to add a column to the first dataframe that is True if a certain value of the first dataframe is between two values of the second dataframe, and False otherwise.

For example:

first_df = pd.DataFrame({'code1':[1,1,2,2,3,1,1],'code2':[10,22,15,15,7,130,2]})
second_df = pd.DataFrame({'code1':[1,1,2,2,3,1,1],'code2_start':[5,20,11,11,5,110,220],'code2_end':[15,25,20,20,10,120,230]})

first_df

  code1 code2
0   1   10
1   1   22
2   2   15
3   2   15
4   3   7
5   1   130
6   1   2

second_df

  code1 code2_end   code2_start
0   1   15          5
1   1   25          20
2   2   20          11
3   2   20          11
4   3   10          5
5   1   120         110
6   1   230         220

For each row in the first dataframe I need to check whether the value in the code2 column falls within one of the possible ranges defined by the rows of second_df that have the same code1. For example:

in row 1 of first_df code1=1 and code2=22

Checking second_df, I have 4 rows with code1=1 (rows 0, 1, 5 and 6). The value code2=22 is in the interval identified by code2_start=20 and code2_end=25, so the function should return True.

Considering an example where the function should return False,

in row 5 of first_df code1=1 and code2=130

but there is no interval containing 130 where code1=1

I have tried to use this function

def check(first_df,second_df):
    for i in range(len(first_df):
        return ((second_df.code2_start <= first_df.code2[i]) & (second_df.code2_end <= first_df.code2[i]) & (second_df.code1 == first_df.code1[i])).any()

and to vectorize it

first_df['output'] = np.vectorize(check)(first_df, second_df)

but obviously with no success.

I would be happy for any input you could provide.

thx.

A.

As a practical example:

first_df.code1[0] = 1

Therefore I need to search second_df for all the instances where

second_df.code1 == first_df.code1[0]
0     True
1     True
2    False
3    False
4    False
5     True
6     True

for the instances 0,1,5,6 where the status is True I need to check if the value

first_df.code2[0]
10

is within one of the ranges identified by

second_df[second_df.code1 == first_df.code1[0]][['code2_start','code2_end']]
    code2_start code2_end
0   5           15
1   20          25
5   110         120
6   220         230

Since the value of first_df.code2[0] is 10, it is between 5 and 15, the range identified by row 0, so my function should return True. For first_df.code1[6] the value is still 1, so the range table is the same as above, but first_df.code2[6] is 2 and no interval contains 2, so the result should be False.


Answer 1:


first_df['output'] = (second_df.code2_start <= first_df.code2) & (second_df.code2_end >= first_df.code2)

This works because when you do something like second_df.code2_start <= first_df.code2, the two Series (which must share the same index) are compared row by row.

You get a boolean Series. If you then perform a logical AND on two of these boolean Series, you get a Series which has value True where both Series were True and False otherwise. Note that this only compares row i of first_df against row i of second_df.

Here's an example:

>>> import pandas as pd
>>> a = pd.DataFrame([{1:2,2:4,3:6},{1:3,2:6,3:9},{1:4,2:8,3:10}])
>>> a['output'] = (a[2] <= a[3]) & (a[2] >= a[1])
>>> a
   1  2   3 output
0  2  4   6   True
1  3  6   9   True
2  4  8  10   True

EDIT:

So based on your updated question and my new interpretation of your problem, I would do something like this:

import pandas as pd

# Define some data to work with
df_1 = pd.DataFrame([{'c1':1,'c2':5},{'c1':1,'c2':10},{'c1':1,'c2':20},{'c1':2,'c2':8}])
df_2 = pd.DataFrame([{'c1':1,'start':3,'end':6},{'c1':1,'start':7,'end':15},{'c1':2,'start':5,'end':15}])

# Function checks if c2 value is within any range matching c1 value
def checkRange(x, code_range):
    idx = code_range.c1 == x.c1
    code_range = code_range.loc[idx]
    check = (code_range.start <= x.c2) & (code_range.end >= x.c2)
    return check.any()

# Apply the checkRange function to each row of the DataFrame
df_1['output'] = df_1.apply(lambda x: checkRange(x, df_2), axis=1)

What I do here is define a function called checkRange which takes two inputs: x, a single row of df_1, and code_range, the entire df_2 DataFrame. It first finds the rows of code_range whose c1 value equals x.c1, the c1 value of the given row. Then the non-matching rows are discarded. This is done in the first 2 lines:

idx = code_range.c1 == x.c1
code_range = code_range.loc[idx]

Next, we get a boolean Series which tells us if x.c2 falls within any of the ranges given in the reduced code_range DataFrame:

check = (code_range.start <= x.c2) & (code_range.end >= x.c2)

Finally, since we only care that the x.c2 falls within one of the ranges, we return the value of check.any(). When we call any() on a boolean Series, it will return True if any of the values in the Series are True.
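As a small illustration of any() on its own, here is a sketch with made-up boolean Series (the names some_match, no_match and empty are hypothetical and only serve this example). The empty case is worth noting: if no row of code_range shares the c1 value, the filtered DataFrame is empty and any() still returns False.

```python
import pandas as pd

# Hypothetical boolean Series, chosen only to illustrate any()
some_match = pd.Series([False, True, False])
no_match = pd.Series([False, False])
empty = pd.Series([], dtype=bool)  # what you get when no c1 value matches

print(some_match.any())  # True: at least one element is True
print(no_match.any())    # False: no element is True
print(empty.any())       # False: an empty Series has no True element
```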

To call the checkRange function on each row of df_1, we can use apply(). I define a lambda expression in order to send the checkRange function the row as well as df_2. axis=1 means that the function will be called on each row (instead of each column) for the DataFrame.
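To tie this back to the question, the sketch below runs the same row-wise logic on the question's first_df and second_df, and cross-checks it against a merge-based variant. The merge variant is not part of the original answer; it is one possible alternative, and the names check_range, merged, hit and output_merge are hypothetical. The merge builds one row per (first_df row, matching second_df row) pair, so it may use a lot of memory when many rows share a code1 value.

```python
import pandas as pd

# Data from the question
first_df = pd.DataFrame({'code1': [1, 1, 2, 2, 3, 1, 1],
                         'code2': [10, 22, 15, 15, 7, 130, 2]})
second_df = pd.DataFrame({'code1': [1, 1, 2, 2, 3, 1, 1],
                          'code2_start': [5, 20, 11, 11, 5, 110, 220],
                          'code2_end': [15, 25, 20, 20, 10, 120, 230]})

# Same logic as checkRange, adapted to the question's column names
def check_range(row, ranges):
    candidates = ranges.loc[ranges.code1 == row.code1]
    inside = (candidates.code2_start <= row.code2) & (candidates.code2_end >= row.code2)
    return bool(inside.any())

first_df['output'] = first_df.apply(lambda row: check_range(row, second_df), axis=1)

# Assumed alternative: pair every first_df row with all ranges of the same code1,
# test each pair, then reduce back to one value per original row.
merged = first_df.reset_index().merge(second_df, on='code1', how='left')
hit = (merged.code2_start <= merged.code2) & (merged.code2_end >= merged.code2)
first_df['output_merge'] = hit.groupby(merged['index']).any()

print(first_df[['code1', 'code2', 'output', 'output_merge']])
```

Both columns should agree: rows 0 to 4 are True, while rows 5 and 6 (code2 = 130 and 2) are False because no interval for code1 = 1 contains those values.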



Source: https://stackoverflow.com/questions/38334845/python-dataframe-check-if-a-value-in-a-column-dataframe-is-within-a-range-of-val
