Question
I have two big dataframes (100K rows). One has 'values', the other has 'types'. I want to assign a 'type' from df2 to a column in df1 based on depth. In both dataframes the depths are given as 'From' and 'To' columns, BUT the intervals are NOT the same: a df1 depth interval may span multiple df2 types.
I want to assign the df2 'types' to df1 and, where an interval covers multiple types, capture that information too. Example below.
import pandas as pd
import numpy as np
df1=pd.DataFrame(np.array([[1,3,0.001],[3,5,0.005],[5,7,0.002],[7,10,0.001]]),columns=['From', 'To', 'val'])
# a plain list keeps From/To numeric; wrapping mixed values in np.array turns every value into a string
df2=pd.DataFrame([[0,4,'A'],[4,5,'B'],[5,6,'C'],[6,8,'D'],[8,10,'E']],columns=['From', 'To', 'Type'])
df1
Out[1]:
   From    To    val
0   1.0   3.0  0.001
1   3.0   5.0  0.005
2   5.0   7.0  0.002
3   7.0  10.0  0.001
df2
Out[2]:
   From  To Type
0     0   4    A
1     4   5    B
2     5   6    C
3     6   8    D
4     8  10    E
Possible acceptable output:
Out[4]:
   From  To    val  Type
0     1   3  0.001  A
1     3   5  0.005  1 unit A,2 units B
2     5   7  0.002  1 unit C,1 unit D
3     7  10  0.001  1 unit D, 3 units E
Percentages of each type would also be a good output in the Type column.
One solution may be to create a new dataframe with high-resolution 'depths', forward fill the types, and do a sort of VLOOKUP on the 'From' and 'To' values.
I also thought about making a column in each df that holds a 'set' built from the 'From' and 'To' columns.
A join or merge might work, but the data would need to be made compatible first.
I don't know where to start and am hoping there is a neat way to tackle this. I have basically the exact same situation as this guy, but I don't speak 'R' and would also like to report the multiple-type info.
Answer 1:
From df2 create an auxiliary Series, marking each "starting point" of a unit (a range of length 1):
# int(float(...)) also copes with From/To stored as floats or strings
units = df2.set_index('Type').apply(lambda row: pd.Series(
    range(int(float(row.From)), int(float(row.To)))), axis=1).stack()\
    .reset_index(level=1, drop=True)
The result is:
Type
A    0.0
A    1.0
A    2.0
A    3.0
B    4.0
C    5.0
D    6.0
D    7.0
E    8.0
E    9.0
dtype: float64
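As a cross-check, the same auxiliary Series can be built without apply and stack. This is only a sketch under the same assumption as above, namely that every From and To in df2 is a whole number. The name units_alt is made up for this illustration and is not part of the answer's code.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    [[0, 4, 'A'], [4, 5, 'B'], [5, 6, 'C'], [6, 8, 'D'], [8, 10, 'E']],
    columns=['From', 'To', 'Type'])

# Assumption: interval boundaries are whole numbers, so each unit has length 1.
starts = df2['From'].astype(int).to_numpy()
stops = df2['To'].astype(int).to_numpy()

# One entry per unit start, labelled with the Type that owns that unit.
units_alt = pd.Series(
    np.concatenate([np.arange(a, b) for a, b in zip(starts, stops)]).astype(float),
    index=pd.Index(np.repeat(df2['Type'].to_numpy(), stops - starts), name='Type'),
)
print(units_alt)
```

The index repeats each Type once per unit it covers, which is what the groupby in the next step relies on.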
Then define a function generating Type for the current row:
def getType(row):
    # unit starts that fall inside [From, To), grouped by Type
    gr = units[units.ge(row.From) & units.lt(row.To)].groupby(level=0)
    if gr.ngroups == 1:
        # a single type covers the whole interval: return just its name
        return gr.ngroup().index[0]
    txt = []
    for key, grp in gr:
        siz = grp.size
        un = 'unit' if siz == 1 else 'units'
        txt.append(f'{siz} {un} {key}')
    return ','.join(txt)
To generate the Type column, apply it to each row:
df1['Type'] = df1.apply(getType, axis=1)
The result is:
   From    To    val  Type
0   1.0   3.0  0.001  A
1   3.0   5.0  0.005  1 unit A,1 unit B
2   5.0   7.0  0.002  1 unit C,1 unit D
3   7.0  10.0  0.001  1 unit D,2 units E
This result differs slightly from your expected output, but I think your expected output was built somewhat inconsistently.
I believe my solution is correct (or at least more consistent), because:
- Row 1.0 - 3.0 is entirely within the limits of 0 4 A, so the result is just A (like in your post).
- Row 3.0 - 5.0 can be "divided" into:
  - 3.0 - 4.0 is within the limits of 0 4 A (1 unit),
  - 4.0 - 5.0 is within the limits of 4 5 B (also 1 unit, but you want 2 units here).
- Row 5.0 - 7.0 can be again "divided" into:
  - 5.0 - 6.0 is within the limits of 5 6 C (1 unit),
  - 6.0 - 7.0 is within the limits of 6 8 D (1 unit, just like you did).
- Row 7.0 - 10.0 can be "divided" into:
  - 7.0 - 8.0 is within the limits of 6 8 D (1 unit, just like you did),
  - 8.0 - 10.0 is within the limits of 8 10 E (2 units, not 3 as you wrote).
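The division above is plain interval arithmetic: the overlap of a df1 row with a df2 row is min(To1, To2) - max(From1, From2), counted only when positive. Below is a minimal sketch of that arithmetic, which also gives the percentages mentioned in the question and does not need whole-number depths. The helpers overlap_lengths and describe are hypothetical names introduced here; they are not part of the solution above.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'From': [1.0, 3.0, 5.0, 7.0],
                    'To': [3.0, 5.0, 7.0, 10.0],
                    'val': [0.001, 0.005, 0.002, 0.001]})
df2 = pd.DataFrame({'From': [0.0, 4.0, 5.0, 6.0, 8.0],
                    'To': [4.0, 5.0, 6.0, 8.0, 10.0],
                    'Type': ['A', 'B', 'C', 'D', 'E']})

lo = df2['From'].to_numpy()
hi = df2['To'].to_numpy()
types = df2['Type'].to_numpy()

def overlap_lengths(start, stop):
    """Hypothetical helper: {Type: overlap length} for one depth interval."""
    # Overlap with every df2 interval at once; non-positive means no overlap.
    ov = np.minimum(stop, hi) - np.maximum(start, lo)
    mask = ov > 0
    return {str(t): float(v) for t, v in zip(types[mask], ov[mask])}

def describe(start, stop):
    """Hypothetical helper: overlaps as percentages of the df1 interval."""
    parts = overlap_lengths(start, stop)
    total = stop - start
    return ', '.join(f'{t} {100 * v / total:.0f}%' for t, v in parts.items())

overlaps = [overlap_lengths(a, b) for a, b in zip(df1['From'], df1['To'])]
df1['Type'] = [describe(a, b) for a, b in zip(df1['From'], df1['To'])]
print(df1)
```

For the last row this yields 1 unit of D and 2 units of E, matching the breakdown above. Each df1 row is compared against all of df2, so on two 100K-row frames this may be slow; treat it as a check of the arithmetic rather than a tuned implementation.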
Source: https://stackoverflow.com/questions/58570888/combine-two-dataframes-based-on-ranges-which-may-partially-overlap-using-pandas