for loop using iterrows in pandas

时光取名叫无心 · 2021-01-15 04:11

I have 2 dataframes as follows:

data1 looks like this:

id          address       
1          11123451
2          78947591

data2 has the columns lowerbound_address, upperbound_address and place. For each address in data1 I want to find the row in data2 whose range contains that address and attach its place.

1 Answer
  • 2021-01-15 04:38

    You can first do a cross join with merge and then filter the values by boolean indexing. Finally, remove the unnecessary columns with drop:

    import pandas as pd

    data1['tmp'] = 1
    data2['tmp'] = 1
    df = pd.merge(data1, data2, on='tmp', how='outer')
    df = df[(df.lowerbound_address <= df.address) & (df.upperbound_address >= df.address)]
    df = df.drop(['lowerbound_address','upperbound_address', 'tmp'], axis=1)
    print (df)
       id   address place
    1   1  11123451     Y
    2   2  78947591     X
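
    On pandas 1.2 or newer, merge also accepts how='cross' directly, so the tmp helper column is not needed. A minimal sketch of the same cross-join idea (not part of the original answer, data1 and data2 as in the question):

    df = pd.merge(data1, data2, how='cross')
    df = df[(df.lowerbound_address <= df.address) & (df.upperbound_address >= df.address)]
    df = df.drop(['lowerbound_address', 'upperbound_address'], axis=1)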
    

    Another solution uses itertuples and then builds the DataFrame with DataFrame.from_records:

    places = []
    for row1 in data1.itertuples():
        for row2 in data2.itertuples():
            #print (row1.address)
            if (row2.lowerbound_address <= row1.address <= row2.upperbound_address):
                places.append((row1.id, row1.address, row2.place))    
    print (places)
    [(1, 11123451, 'Y'), (2, 78947591, 'X')]
    
    df = pd.DataFrame.from_records(places)
    df.columns=['id','address','place']
    print (df)
       id   address place
    0   1  11123451     Y
    1   2  78947591     X
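
    The question title mentions iterrows; the same nested loop works with iterrows as well, although it is usually slower than itertuples because every row is materialised as a Series. A rough sketch of that variant (again assuming data1 and data2 from the question):

    places = []
    for _, row1 in data1.iterrows():
        for _, row2 in data2.iterrows():
            # keep the pair when the address falls inside the range
            if row2['lowerbound_address'] <= row1['address'] <= row2['upperbound_address']:
                places.append((row1['id'], row1['address'], row2['place']))

    df = pd.DataFrame.from_records(places, columns=['id', 'address', 'place'])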
    

    Another solution with apply:

    def f(x):
        for row2 in data2.itertuples():
            if (row2.lowerbound_address <= x <= row2.upperbound_address):
                return pd.Series([x, row2.place], index=['address','place'])
    
    df = data1.set_index('id')['address'].apply(f).reset_index()
    print (df)
       id   address place
    0   1  11123451     Y
    1   2  78947591     X
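
    If the address ranges in data2 never overlap, an IntervalIndex lookup avoids the Python-level loop entirely. This is not from the original answer, just a sketch under that no-overlap assumption:

    import numpy as np
    import pandas as pd

    # one closed interval per row of data2, then locate each address in it
    intervals = pd.IntervalIndex.from_arrays(data2['lowerbound_address'],
                                             data2['upperbound_address'],
                                             closed='both')
    pos = intervals.get_indexer(data1['address'])   # -1 where no range matches

    df = data1.copy()
    df['place'] = data2['place'].to_numpy()[pos]    # -1 wrongly picks the last row ...
    df.loc[pos == -1, 'place'] = np.nan             # ... so blank those out explicitly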
    

    EDIT:

    Timings:

    N = 1000:

    If some values are not in any range, they are omitted in solutions b and c; check the last row of df1.

    In [73]: %timeit (data1.set_index('id')['address'].apply(f).reset_index())
    1 loop, best of 3: 2.06 s per loop
    
    In [74]: %timeit (a(df1a, df2a))
    1 loop, best of 3: 82.2 ms per loop
    
    In [75]: %timeit (b(df1b, df2b))
    1 loop, best of 3: 3.17 s per loop
    
    In [76]: %timeit (c(df1c, df2c))
    100 loops, best of 3: 2.71 ms per loop
    

    Code for timings:

    import numpy as np
    import pandas as pd

    np.random.seed(123)
    N = 1000
    data1 = pd.DataFrame({'id':np.arange(1,N+1), 
                       'address': np.random.randint(N*10, size=N)}, columns=['id','address'])
    
    #add last row with value out of range
    data1.loc[data1.index[-1]+1, ['id','address']] = [data1.index[-1]+1, -1]
    data1 = data1.astype(int)
    print (data1.tail())
    
    data2 = pd.DataFrame({'lowerbound_address':np.arange(1, N*10,10), 
                          'upperbound_address':np.arange(10,N*10+10, 10),
                          'place': np.random.randint(40, size=N)})
    
    print (data2.tail())
    df1a, df1b, df1c = data1.copy(),data1.copy(),data1.copy()
    df2a, df2b ,df2c = data2.copy(),data2.copy(),data2.copy()
    

    def a(data1, data2):
        data1['tmp'] = 1
        data2['tmp'] = 1
        df = pd.merge(data1, data2, on='tmp', how='outer')
        df = df[(df.lowerbound_address <= df.address) & (df.upperbound_address >= df.address)]
        df = df.drop(['lowerbound_address','upperbound_address', 'tmp'], axis=1)
        return (df)
    

    def b(data1, data2):
        places = []
        for row1 in data1.itertuples():
            for row2 in data2.itertuples():
                #print (row1.address)
                if (row2.lowerbound_address <= row1.address <= row2.upperbound_address):
                    places.append((row1.id, row1.address, row2.place))    
    
        df = pd.DataFrame.from_records(places)
        df.columns = ['id','address','place']
    
        return (df)
    

    def f(x):
        #use for ... else to return NaN for values out of range
        #http://stackoverflow.com/q/9979970/2901002
        for row2 in data2.itertuples():
            if (row2.lowerbound_address <= x <= row2.upperbound_address):
                return pd.Series([x, row2.place], index=['address','place'])
        else:
            return pd.Series([x, np.nan], index=['address','place'])
    

    def c(data1,data2):
        data1 = data1.sort_values('address')
        data2 = data2.sort_values('lowerbound_address')
        df = pd.merge_asof(data1, data2, left_on='address', right_on='lowerbound_address')
        df = df.drop(['lowerbound_address','upperbound_address'], axis=1)
        return df.sort_values('id')
    
    
    print (data1.set_index('id')['address'].apply(f).reset_index())
    print (a(df1a, df2a))
    print (b(df1b, df2b))
    print (c(df1c, df2c))
    

    Only solution c, with merge_asof, performs well on a large DataFrame:

    N=1M:

    In [84]: %timeit (c(df1c, df2c))
    1 loop, best of 3: 525 ms per loop
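
    One caveat with merge_asof: it matches only on the lower bound, so if the ranges in data2 had gaps, an address falling into a gap would be assigned the nearest lower range. A sketch of a variant (the name c_checked is just illustrative) that also checks the upper bound before dropping the columns:

    def c_checked(data1, data2):
        data1 = data1.sort_values('address')
        data2 = data2.sort_values('lowerbound_address')
        df = pd.merge_asof(data1, data2, left_on='address', right_on='lowerbound_address')
        # keep only rows whose address really lies inside the matched interval
        df = df[df.address <= df.upperbound_address]
        df = df.drop(['lowerbound_address', 'upperbound_address'], axis=1)
        return df.sort_values('id')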
    

    More about merge_asof in the pandas documentation.
