I\'m attempting to populate a column in a data frame based on whether the index value of that record falls within a range defined by two columns in another data frame.
d
import pandas as pd
import numpy as np
# Here is your existing dataframe
df_existing = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
# Create a new empty dataframe with specific column names and data types
df_new = pd.DataFrame(index=None)
columns = ['field01','field02','field03','field04']
dtypes = [str,int,int,int]
for c,d in zip(columns, dtypes):
df_new[c] = pd.Series(dtype=d)
# Set the index on the new dataframe to same as existing
df_new['new_index'] = df_existing.index
df_new.set_index('new_index', inplace=True)
# Fill the new dataframe with specific fields from the existing dataframe
df_new[['field02','field03']] = df_existing[['B','C']]
print df_new
Alternative solution:
classdict = df2.set_index("CLASS").to_dict("index")
rangedict = {}
for key,value in classdict.items():
# get all items in range and assign value (the key)
for item in list(range(value["START"],value["STOP"]+1)):
rangedict[item] = key
extract rangedict:
{2: 1, 3: 1, 5: 2, 6: 2, 7: 2, 8: 3}
now map and possibly format(?):
df1['CLASS'] = df1.index.to_series().map(rangedict)
df1.applymap("{0:.0f}".format)
outputs:
a CLASS
0 4 nan
1 45 nan
2 7 1
3 5 1
4 48 nan
5 44 2
6 22 2
7 89 2
8 45 3
9 44 nan
10 23 nan
You can use IntervalIndex (requires v0.20.0).
First construct the index:
df2.index = pd.IntervalIndex.from_arrays(df2['START'], df2['STOP'], closed='both')
df2
Out:
START STOP CLASS
[2, 3] 2 3 1
[5, 7] 5 7 2
[8, 8] 8 8 3
Now if you index into the second DataFrame it will lookup the value in the intervals. For example,
df2.loc[6]
Out:
START 5
STOP 7
CLASS 2
Name: [5, 7], dtype: int64
returns the second class. I don't know if it can be used with merge or with merge_asof but as an alternative you can use map:
df1['CLASS'] = df1.index.to_series().map(df2['CLASS'])
Note that I first converted the index to a Series to be able to use the Series.map method. This results in
df1
Out:
a CLASS
0 4 NaN
1 45 NaN
2 7 1.0
3 5 1.0
4 48 NaN
5 44 2.0
6 22 2.0
7 89 2.0
8 45 3.0
9 44 NaN
10 23 NaN