Create index column based on existent groups of rows (with duplicates)

问题

The question is similar to (Update) Add index column to data.frame based on two columns

Here my example data.frame:

df = read.table(text = 'ID  Day Count   Count_group 
77661   14498   4   5
76552   14498   4   5
37008   14498   4   5
34008   14498   4   5
30004   14497   1   5
30004   14497   1   4   
28047   14496   3   4   
28049   14496   3   4   
29003   14496   3   4   
69012   14468   1   4   
69007   14467   3   4   
69012   14467   3   4   
69020   14467   3   4   
42003   13896   2   4   
42011   13896   2   4   
22001   13895   2   4   
23007   13895   2   4   
28047   14496   3   3   
28049   14496   3   3   
29003   14496   3   3   
69007   14467   3   3   
69012   14467   3   3   
69020   14467   3   3   
48005   14271   2   2   
48007   14271   2   2   
22001   13895   2   2   
23007   13895   2   2   
47011   14320   1   2   
73005   14319   1   2   
73005   14319   1   1', header = TRUE)

The Count col shows the sum of the ID values grouped by Day. The Count_group shows the sum of the unique Count vales grouped by Day and Day -1.

I need to create an index column which groups the Count_group by Day and Day -1 following the descending order of the df (with duplicates!).

Here my expected output:

ID     Day  Count Count_group index_col
77661   14498   4   5           1
76552   14498   4   5           1
37008   14498   4   5           1
34008   14498   4   5           1
30004   14497   1   5           1
30004   14497   1   4           2
28047   14496   3   4           2
28049   14496   3   4           2
29003   14496   3   4           2
69012   14468   1   4           3
69007   14467   3   4           3
69012   14467   3   4           3
69020   14467   3   4           3
42003   13896   2   4           4
42011   13896   2   4           4
22001   13895   2   4           4
23007   13895   2   4           4
28047   14496   3   3           5
28049   14496   3   3           5
29003   14496   3   3           5
69007   14467   3   3           6
69012   14467   3   3           6
69020   14467   3   3           6
48005   14271   2   2           7
48007   14271   2   2           7
22001   13895   2   2           8
23007   13895   2   2           8
47011   14320   1   2           9
73005   14319   1   2           9
73005   14319   1   1          10

And do the same but with index_col grouping by 3 days: Day, Day -1 and Day -2:

    df_2 = read.table(text = 'ID Day Count Count_group
30004   14497   1   5
28047   14496   3   5
28049   14496   3   5
29003   14496   3   5
69012   14495   1   5
69007   14467   3   5
69012   14467   3   5
69020   14467   3   5
42003   14466   1   5
42011   14465   1   5
28047   14496   3   4
28049   14496   3   4
29003   14496   3   4
69012   14995   1   4
22001   13895   2   4
23007   13895   2   4
28047   13894   2   4
28049   13894   2   4
42003   14466   1   2
42011   14465   1   2
28047   13894   2   2
28049   13894   2   2
69012   14995   1   1
42011   14465   1   1', header = TRUE)

Expected output:

ID     Day  Count Count_group index_col
30004   14497   1   5           1
28047   14496   3   5           1
28049   14496   3   5           1
29003   14496   3   5           1
69012   14495   1   5           1
69007   14467   3   5           2
69012   14467   3   5           2
69020   14467   3   5           2
42003   14466   1   5           2
42011   14465   1   5           2
28047   14496   3   4           3
28049   14496   3   4           3
29003   14496   3   4           3
69012   14995   1   4           3
22001   13895   2   4           4
23007   13895   2   4           4
28047   13894   2   4           4
28049   13894   2   4           4
42003   14466   1   2           5
42011   14465   1   2           5
28047   13894   2   2           6
28049   13894   2   2           6
69012   14995   1   1           7
42011   14465   1   1           8

Do you have any suggestion? I desire to create a generic code that could be applied (with a few adjustments) to both df, df_2 and to other data.frames with grouping variable of n days.

回答1:

Using dplyr:

df %>% mutate(index_col = cumsum(!c(+Inf,diff(Day))%in%c(0,-1)))

Explanation:

 c(+Inf,diff(Day))

As you want two consecutive days, I compute the difference on the Day with diff(Day). As diff return vector of size n-1, I have to add a value for the top of the vector, I choose +Inf.

!(... %in% c(0,-1))

I test that the value is the same Day or Day-1 as they must be grouped, I want when it's not the case.

cumsum(...)

Finally, I use cumsum to know how many of change occur.

The output:

It's work for your two exemples

> df %>% mutate(index_col = cumsum(!c(+Inf,diff(Day))%in%c(0,-1)))

      ID   Day Count Count_group index_col
1  30004 14497     1           4         1
2  28047 14496     3           4         1
3  28049 14496     3           4         1
4  29003 14496     3           4         1
5  69012 14468     1           4         2
6  69007 14467     3           4         2
7  69012 14467     3           4         2
8  69020 14467     3           4         2
9  42003 13896     2           4         3
10 42011 13896     2           4         3
11 22001 13895     2           4         3
12 23007 13895     2           4         3
13 28047 14496     3           3         4
14 28049 14496     3           3         4
15 29003 14496     3           3         4
16 69007 14467     3           3         5
17 69012 14467     3           3         5
18 69020 14467     3           3         5
19 48005 14271     2           2         6
20 48007 14271     2           2         6
21 22001 13895     2           2         7
22 23007 13895     2           2         7
23 47011 14320     1           2         8
24 73005 14319     1           2         8
25 73005 14319     1           1         8

and

> df_2 %>% mutate(index_col = cumsum(!c(+Inf,diff(Day))%in%c(0,-1)))

      ID   Day Count Count_group index_col
1  30004 14497     1           5         1
2  28047 14496     3           5         1
3  28049 14496     3           5         1
4  29003 14496     3           5         1
5  69012 14495     1           5         1
6  69007 14467     3           5         2
7  69012 14467     3           5         2
8  69020 14467     3           5         2
9  42003 14466     1           5         2
10 42011 14465     1           5         2
11 28047 14496     3           4         3
12 28049 14496     3           4         3
13 29003 14496     3           4         3
14 69012 14495     1           4         3
15 22001 13895     2           4         4
16 23007 13895     2           4         4
17 28047 13894     2           4         4
18 28049 13894     2           4         4
19 42003 14466     1           2         5
20 42011 14465     1           2         5
21 28047 13894     2           2         6
22 28049 13894     2           2         6
23 69012 14995     1           1         7
24 42011 14465     1           1         8

来源：https://stackoverflow.com/questions/44250468/create-index-column-based-on-existent-groups-of-rows-with-duplicates

标签

indexing

grouping

add

col