Python pandas - how to group close elements

Submitted by 倾然丶 夕夏残阳落幕 on 2021-02-04 16:15:22

Question


I have a dataframe in which I need to group elements whose values are no more than 1 apart. For example, if this is my df:

     group_number  val
0              1    5
1              1    8
2              1   12
3              1   13
4              1   22
5              1   26
6              1   31
7              2    7
8              2   16
9              2   17
10             2   19
11             2   29
12             2   33
13             2   62

So I need to group by both group_number and val, where rows fall into the same val group when the difference between consecutive val values is smaller than or equal to 1.

In this example, rows 2 and 3 would be grouped together, and so would rows 8 and 9.

I tried using diff or related functions, but I didn't figure it out.

Any help will be appreciated!


Answer 1:


Using diff is the right approach - just combine it with gt and cumsum and you have your groups.

The idea is to take a cumulative sum over a boolean mask of the differences. A difference larger than your threshold becomes True; a difference equal to or smaller than your threshold becomes False (as does the NaN that diff produces for the first row of each group). Because True counts as 1 and False as 0, the cumulative sum increases only at a large gap, so rows separated by small differences keep the same group number.
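To see what each step contributes, here is a minimal sketch on the val column of group 1 from the question. The column names diff, is_gap and cluster are illustrative only and do not appear in the original answer.

```python
import pandas as pd

# val column of group_number 1 from the question, already sorted
val = pd.Series([5, 8, 12, 13, 22, 26, 31])
max_distance = 1

diff = val.diff()               # distance to the previous row (NaN for the first row)
is_gap = diff.gt(max_distance)  # True where the gap exceeds the threshold; NaN compares as False
cluster = is_gap.cumsum()       # the counter increases only at a gap

steps = pd.DataFrame({"val": val, "diff": diff, "is_gap": is_gap, "cluster": cluster})
print(steps)
# cluster column: 0, 1, 2, 2, 3, 4, 5 (12 and 13 share cluster 2)
```

Only the step from 12 to 13 is within the threshold, so those two rows are the only ones that share a cluster id.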

max_distance = 1

df["group_diff"] = df.sort_values("val")\
                     .groupby("group_number")["val"]\
                     .diff()\
                     .gt(max_distance)\
                     .cumsum()

print(df)

    group_number  val  group_diff
0              1    5           0
1              1    8           1
2              1   12           2
3              1   13           2
4              1   22           5
5              1   26           6
6              1   31           8
7              2    7           0
8              2   16           3
9              2   17           3
10             2   19           4
11             2   29           7
12             2   33           9
13             2   62          10

You can now use groupby on group_number and group_diff and see the resulting groups with the following:

grouped = df.groupby(["group_number", "group_diff"])
print(grouped.groups)

{(1, 0): Int64Index([0], dtype='int64'),
 (1, 1): Int64Index([1], dtype='int64'),
 (1, 2): Int64Index([2, 3], dtype='int64'),
 (1, 5): Int64Index([4], dtype='int64'),
 (1, 6): Int64Index([5], dtype='int64'),
 (1, 8): Int64Index([6], dtype='int64'),
 (2, 0): Int64Index([7], dtype='int64'),
 (2, 3): Int64Index([8, 9], dtype='int64'),
 (2, 4): Int64Index([10], dtype='int64'),
 (2, 7): Int64Index([11], dtype='int64'),
 (2, 9): Int64Index([12], dtype='int64'),
 (2, 10): Int64Index([13], dtype='int64')}
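If the goal is a summary of each cluster rather than the raw index lists, the same grouping key can be fed into an aggregation. The following is a sketch under a few assumptions: the names cluster_id, cluster and summary are hypothetical, and sorting by group_number as well as val is a choice made for this example, not part of the original answer.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "group_number": [1] * 7 + [2] * 7,
    "val": [5, 8, 12, 13, 22, 26, 31, 7, 16, 17, 19, 29, 33, 62],
})
max_distance = 1

# Cluster id per row; sorting by group_number too is an assumption of this
# sketch that keeps each group's rows contiguous
cluster_id = (
    df.sort_values(["group_number", "val"])
      .groupby("group_number")["val"]
      .diff()
      .gt(max_distance)
      .cumsum()
      .rename("cluster")  # avoid a name clash with the "val" column
)

# One row per (group_number, cluster) with its size and value range
summary = (
    df.groupby(["group_number", cluster_id])["val"]
      .agg(["count", "min", "max"])
      .reset_index()
)

# Keep only clusters that actually merged more than one row
print(summary[summary["count"] > 1])
```

On the sample data this should leave two rows: the cluster spanning 12 to 13 in group 1 and the one spanning 16 to 17 in group 2.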

Thanks to @jezrael for the hint: you can avoid creating a new column, which improves performance, by passing the Series directly to groupby:

group_diff = df.sort_values("val")\
               .groupby("group_number")["val"]\
               .diff()\
               .gt(max_distance)\
               .cumsum()

grouped = df.groupby(["group_number", group_diff])
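One caveat may be worth checking on your own data. Because the frame is sorted by val alone, the cumulative sum runs across rows of all group_number values interleaved. If a row from another group falls between two close values and itself starts a new cluster, it bumps the counter and the two close values end up with different ids. This does not happen in the sample data above. The sketch below uses a small made-up frame to show the effect and one possible way around it, sorting by group_number first; the variable names interleaved and per_group are illustrative.

```python
import pandas as pd

# Made-up data: 12.0 and 13.0 in group 1 are close, but 12.5 in group 2
# sits between them in val order and is far from its predecessor 7.0
df = pd.DataFrame({
    "group_number": [1, 1, 2, 2],
    "val": [12.0, 13.0, 7.0, 12.5],
})
max_distance = 1

# Sorting by val only: the gap at 12.5 (group 2) increments the counter
# between the two group-1 rows
interleaved = (
    df.sort_values("val")
      .groupby("group_number")["val"]
      .diff()
      .gt(max_distance)
      .cumsum()
)

# Sorting by group_number, then val: each group's rows are contiguous,
# so only gaps inside the same group separate its rows
per_group = (
    df.sort_values(["group_number", "val"])
      .groupby("group_number")["val"]
      .diff()
      .gt(max_distance)
      .cumsum()
)

print(df.assign(interleaved=interleaved, per_group=per_group))
```

With the second variant the ids can still repeat across different group_number values, so they should be used together with group_number as the grouping key, as in the answer.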


Source: https://stackoverflow.com/questions/48109624/python-pandas-how-to-group-close-elements
