Aggregate adjacent rows, ignoring certain columns

Question

I have a data frame like the one below:

> head(df)
  OrderId           Timestamp ErrorCode
1 3000000 1455594300434609920        NA
2 3000001 1455594300434614272        NA
3 3000000 1455594300440175104         0
4 3000001 1455594300440179712         0
5 3000002 1455594303468741120        NA
6 3000002 1455594303469326848         0

I need to collapse the rows so that the output looks something like this:

> head(df)
  OrderId          Timestamp1          Timestamp2 ErrorCode Diff
  3000000 1455594300434609920 1455594300440175104         0
  3000001 1455594300434614272 1455594300440179712         0
  3000002 1455594303468741120 1455594303469326848         0

I used df2=aggregate(Timestamp~.,df,FUN=toString), but the output is:

   OrderId ErrorCode           Timestamp
10 3000001         0 1455594300440179712
11 3000002         0 1455594303469326848
12 3000003         0 1455594303713897984

When I drop the ErrorCode column and use the same command, I get the expected output:

> head(kf)
  OrderId           Timestamp
1 3000000 1455594300434609920
2 3000001 1455594300434614272
3 3000000 1455594300440175104
4 3000001 1455594300440179712
5 3000002 1455594303468741120
6 3000002 1455594303469326848
> kf2=aggregate(Timestamp~.,kf,FUN=toString)
> head(kf2)
   OrderId                                Timestamp
10 3000001 1455594300434614272, 1455594300440179712
11 3000002 1455594303468741120, 1455594303469326848
12 3000003 1455594303711330816, 1455594303713897984

How do I aggregate in the above manner without removing the ErrorCode column? There must be some small thing I am missing.

Answer 1:

I take it you actually just want to reshape your data into wide format, with separate columns for timestamps 1 and 2. One way is to first add a new column that numbers each row within its OrderId (the time point of the measurement), and then melt and cast the data using reshape2.

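# Add an index to the data.frame: number the rows within each OrderId (1, 2, ...)
df$time_ind <- NA_integer_
for (i in unique(df$OrderId)) {
  ii <- df$OrderId == i                  # rows belonging to this order
  df$time_ind[ii] <- seq_along(ii[ii])   # 1..n over those rows
}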
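
The same per-order counter can also be built without an explicit loop. A minimal sketch using base R's ave(), run on a cut-down copy of the question's data (only the OrderId column is needed here):

```r
# Cut-down sample: just the OrderId column from the question's data
df <- data.frame(
  OrderId = c(3000000, 3000001, 3000000, 3000001, 3000002, 3000002)
)

# ave() applies FUN within each OrderId group and returns the results
# in the original row order, so each row gets its position within its order
df$time_ind <- ave(seq_along(df$OrderId), df$OrderId, FUN = seq_along)

print(df)
```

Either version produces the time_ind column that the melt/dcast step relies on.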

library(reshape2)

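# Stack Timestamp and ErrorCode into long format, one row per value
df_long <- melt(df, id.vars = c("OrderId", "time_ind"),
                measure.vars = c("Timestamp", "ErrorCode"))

# Spread back out: one row per OrderId, one column per variable/time_ind pair
dcast(df_long, OrderId ~ variable + time_ind)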

which will give you

  OrderId         Timestamp_1         Timestamp_2 ErrorCode_1 ErrorCode_2
1 3000000 1455594300434609920 1455594300440175104        <NA>           0
2 3000001 1455594300434614272 1455594300440179712        <NA>           0
3 3000002 1455594303468741120 1455594303469326848        <NA>           0
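
As for why the original aggregate() call lost rows: with Timestamp ~ . the dot expands to every other column, so ErrorCode becomes a grouping variable too, and the formula method's default na.action = na.omit drops the rows where ErrorCode is NA before any grouping happens. If you would rather stay with aggregate(), a minimal sketch is to name the grouping column explicitly. The sample data frame is reconstructed from the question, storing Timestamp as character is an assumption, and df2, err and result are names chosen here for illustration:

```r
# Sample data reconstructed from the question. Storing Timestamp as character
# is an assumption (19-digit integers cannot in general be held exactly in a double).
df <- data.frame(
  OrderId   = c(3000000, 3000001, 3000000, 3000001, 3000002, 3000002),
  Timestamp = c("1455594300434609920", "1455594300434614272",
                "1455594300440175104", "1455594300440179712",
                "1455594303468741120", "1455594303469326848"),
  ErrorCode = c(NA, NA, 0, 0, NA, 0),
  stringsAsFactors = FALSE
)

# Group by OrderId only. ErrorCode is not part of the formula, so its NAs
# cannot cause rows to be dropped by na.omit.
df2 <- aggregate(Timestamp ~ OrderId, data = df, FUN = toString)

# Collapse ErrorCode separately. Here na.omit discards the NA rows on purpose,
# leaving the non-missing code for each order.
err <- aggregate(ErrorCode ~ OrderId, data = df, FUN = max)

# all.x = TRUE keeps orders that have no non-missing ErrorCode at all
result <- merge(df2, err, by = "OrderId", all.x = TRUE)
print(result)
```

This keeps both timestamps in a single comma-separated column, as in the kf2 output from the question. The reshape2 approach above is the better fit if you want them in separate columns.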

Source: https://stackoverflow.com/questions/42642091/aggregate-adjacent-rows-ignoring-certain-columns

Tags

aggregate

reshape