Question
I have a dataframe where I need to group elements whose values are no more than 1 apart. For example, if this is my df:
group_number val
0 1 5
1 1 8
2 1 12
3 1 13
4 1 22
5 1 26
6 1 31
7 2 7
8 2 16
9 2 17
10 2 19
11 2 29
12 2 33
13 2 62
So I need to group by both group_number and val, where consecutive values of val within the same group_number differ by no more than 1.
So, in this example, lines 2 and 3 would group together, and also lines 8 and 9 would group together.
I tried using diff and related functions, but I couldn't figure it out.
Any help will be appreciated!
Answer 1:
Using diff is the right approach: just combine it with gt and cumsum and you have your groups.
The idea is to take a cumulative sum over the differences that exceed your threshold. A difference larger than the threshold becomes True, while a difference less than or equal to the threshold becomes False. Since True counts as 1 and False as 0, the cumulative sum increases only at a True, so a row whose difference is within the threshold keeps the same group number as the row before it.
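To make the three steps concrete, here is a minimal sketch on a single, already sorted series. The values are the first five of group 1 from the question; the names step, is_break and labels are only illustrative and do not appear in the original answer.

```python
import pandas as pd

# Illustrative single-group data (first five values of group 1 in the question)
val = pd.Series([5, 8, 12, 13, 22])
max_distance = 1

step = val.diff()                 # NaN, 3, 4, 1, 9
is_break = step.gt(max_distance)  # NaN compares as False, so the first row never starts a break
labels = is_break.cumsum()        # the label increases only where a new cluster starts

print(pd.DataFrame({"val": val, "diff": step, "is_break": is_break, "label": labels}))
```

Rows 2 and 3 (values 12 and 13) end up with the same label because the difference between them does not exceed max_distance.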
max_distance = 1
df["group_diff"] = df.sort_values("val")\
.groupby("group_number")["val"]\
.diff()\
.gt(max_distance)\
.cumsum()
print(df)
group_number val group_diff
0 1 5 0
1 1 8 1
2 1 12 2
3 1 13 2
4 1 22 5
5 1 26 6
6 1 31 8
7 2 7 0
8 2 16 3
9 2 17 3
10 2 19 4
11 2 29 7
12 2 33 9
13 2 62 10
You can now use groupby on group_number and group_diff and see the resulting groups with the following:
grouped = df.groupby(["group_number", "group_diff"])
print(grouped.groups)
{(1, 0): Int64Index([0], dtype='int64'),
(1, 1): Int64Index([1], dtype='int64'),
(1, 2): Int64Index([2, 3], dtype='int64'),
(1, 5): Int64Index([4], dtype='int64'),
(1, 6): Int64Index([5], dtype='int64'),
(1, 8): Int64Index([6], dtype='int64'),
(2, 0): Int64Index([7], dtype='int64'),
(2, 3): Int64Index([8, 9], dtype='int64'),
(2, 4): Int64Index([10], dtype='int64'),
(2, 7): Int64Index([11], dtype='int64'),
(2, 9): Int64Index([12], dtype='int64'),
(2, 10): Int64Index([13], dtype='int64')}
Thanks to @jezrael for the hint that avoiding a new column improves performance:
group_diff = df.sort_values("val")\
.groupby("group_number")["val"]\
.diff()\
.gt(max_distance)\
.cumsum()
grouped = df.groupby(["group_number", group_diff])
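For readers who want to run this end to end, the following is a self-contained sketch that rebuilds the question's dataframe and groups with the Series key. The rename to "group_diff", the aggregation into lists, and the names clusters and close_pairs are my own additions for illustration, not part of the original answer.

```python
import pandas as pd

# Rebuild the dataframe from the question
df = pd.DataFrame({
    "group_number": [1] * 7 + [2] * 7,
    "val": [5, 8, 12, 13, 22, 26, 31, 7, 16, 17, 19, 29, 33, 62],
})
max_distance = 1

group_diff = (
    df.sort_values("val")
      .groupby("group_number")["val"]
      .diff()
      .gt(max_distance)
      .cumsum()
      .rename("group_diff")  # assumption: renamed only to avoid a key called "val"
)

# The Series key is aligned with df by index label, so its sorted order does not matter
clusters = df.groupby(["group_number", group_diff])["val"].agg(list)

# Keep only clusters that actually contain close neighbours
close_pairs = [vals for vals in clusters if len(vals) > 1]
print(close_pairs)
```

With the question's data, only the clusters holding 12 and 13 (group 1) and 16 and 17 (group 2) contain more than one value.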
Source: https://stackoverflow.com/questions/48109624/python-pandas-how-to-group-close-elements