Combine two dataframes based on ranges which may partially overlap using Pandas and track multiple values

天大地大妈咪最大 提交于 2020-01-25 08:52:06

问题


I have two big dataframes (100K rows), One has 'values', one has 'types'. I want to assign a 'type' from df2 to a column in df1 based on depth. The depths are given as depth 'From' and depth 'To' columns. The 'types' are also defined by depth 'From' and 'To'. BUT they are NOT the same intervals. df1 depths may span multiple df2 types.

I want to assign the df2 'types' to df1 and if there are multiple types, try and capture that information too. Example below.

import pandas as pd
import numpy as np
df1=pd.DataFrame(np.array([[1,3,0.001],[3,5,0.005],[5,7,0.002],[7,10,0.001]]),columns=['From', 'To', 'val'])

df2=pd.DataFrame(np.array([[0.0,4,'A'],[4,5,'B'],[5,6,'C'],[6,8,'D'],[8,10,'E']]),columns=['From', 'To', 'Type'])

df1

Out[1]: 
   From    To    val
0   1.0   3.0  0.001
1   3.0   5.0  0.005
2   5.0   7.0  0.002
3   7.0  10.0  0.001

df2

Out[2]: 
  From  To Type
0    0   4    A
1    4   5    B
2    5   6    C
3    6   8    D
4    8  10    E

Possible acceptable output:

Out[4]: 
  From  To    val                 Type
0    1   3  0.001                    A
1    3   5  0.005   1 unit A,2 units B
2    5   7  0.002    1 unit C,1 unit D
3    7  10  0.001  1 unit D, 3 units E

Percentages of Types would also be a good ouput in Type .

One solution may be to create a new dataframe with high resolution 'depths' and forward fill the types, and do a sort of VLOOKUP on the To and the From.

I also thought about the possibility of making a column in each df that is a 'set' based on the to and from cols.

Possible join or merge but need to get the data compatible first.

Don't know where to start. Hoping there is neat way to tackle this, I have basically the exact same situation as this guy but I don't speak 'R' and would like to possibly report multiple type info.


回答1:


From df2 create an auxiliary Series, marking each "starting point" of a unit (a range of length 1):

units = df2.set_index('Type').apply(lambda row: pd.Series(
    range(row.From, row.To)), axis=1).stack()\
    .reset_index(level=1, drop=True)

The result is:

Type
A    0.0
A    1.0
A    2.0
A    3.0
B    4.0
C    5.0
D    6.0
D    7.0
E    8.0
E    9.0
dtype: float64

Then define a function generating Type for the current row:

def getType(row):
    gr = units[units.ge(row.From) & units.lt(row.To)].groupby(level=0)
    if gr.ngroups == 1:
        return gr.ngroup().index[0]
    txt = []
    for key, grp in gr:
        siz = grp.size
        un = 'unit' if siz == 1 else 'units'
        txt.append(f'{siz} {un} {key}')
    return ','.join(txt)

And to generate Type column, apply it to each row:

df1['Type'] = df1.apply(getType, axis=1)

The result is:

   From    To    val                Type
0   1.0   3.0  0.001                   A
1   3.0   5.0  0.005   1 unit A,1 unit B
2   5.0   7.0  0.002   1 unit C,1 unit D
3   7.0  10.0  0.001  1 unit D,2 units E

This result is a bit different from your expected result, but I think you created it in a bit inconsequent way.

I think that my solution is correct (at least more consequent), because:

  1. Row 1.0 - 3.0 is entirely within the limits of 0 4 A, so the result is just A (like in your post).
  2. Row 3.0 - 5.0 can be "divided" into:
    • 3.0 - 4.0 is within the limits of 0 4 A (1 unit),
    • 4.0 - 5.0 is within the limits of 4 5 B (also 1 unit, but you want 2 units here).
  3. Row 5.0 - 7.0 can be again "divided" into:
    • 5.0 - 6.0 is within the limits of 5 6 C (1 unit),
    • 6.0 - 7.0 is within the limits of 6 8 D (1 unit, just like you did).
  4. Row 7.0 - 10.0 can be "divided" into:
    • 7.0 - 8.0 is within the limits of 6 8 D (1 unit, just like you did),
    • 8.0 - 10.0 is within the limits of 8 10 E (2 units, not 3 as you wrote).


来源:https://stackoverflow.com/questions/58570888/combine-two-dataframes-based-on-ranges-which-may-partially-overlap-using-pandas

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!