R combining duplicate rows in a time series with different column types in a datatable

本秂侑毒 提交于 2020-05-29 10:51:22

问题


This question is building up on another question R combining duplicate rows by ID with different column types in a dataframe. I have a datatable with a column time and some other columns of different types (factors and numerics). Here is an example:

dt <- data.table(time  = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4),
             abst  = c(0, NA, 2, NA, NA, NA, 0, 0, NA, 2, NA, 3, 4),
             farbe = as.factor(c("keine", NA, "keine", NA, NA, NA, "keine", "keine", NA, NA, NA, "rot", "blau")),
             gier  = c(0, NA, 5, NA, NA, NA, 0, 0, NA, 1, NA, 6, 2),
             goff  = as.factor(c("haus", "maus", "toll", NA, "haus", NA, "maus", NA, NA, NA, NA, NA, "maus")),
             huft  = as.factor(c(NA, NA, NA, NA, NA, "wolle", NA, NA, "wolle", NA, NA, "holz", NA)),
             mode  = c(4, 2, NA, NA, 6, 5, 0, NA, NA, NA, NA, NA, 3))

Now I want to combine the duplicate times in column time. The numeric columns are defined as the mean value of all identical IDs (without the NAs!). The factor columns are combined into one. The NAs can be omitted.

dtRes <- data.table(time  = c(1, 1, 1, 2, 3, 4, 4),
                abst  = c(1, 1, 1, 0, 0, 3, 3),
                farbe = as.factor(c("keine", "keine", "keine", "keine", "keine", "rot", "blau")),
                gier  = c(2.5, 2.5, 2.5, 0, 0, 3, 3),
                goff  = as.factor(c("haus", "maus", "toll", "maus", NA, "maus", "maus")),
                huft  = as.factor(c(NA, NA, NA, "wolle", "wolle", "holz", "holz")),
                mode  = c(4, 4, 4, 2.5, NA, 3, 3))

I need some fast calculation for this, because I have about a million observations.

Some extra thoughts to this problem: farbe may not be unique. In this case I think the best idea for my data is to have a duplicate row but only with a different farbe, so there are 2 identical times and all the rest stays the same but different values for farbe. This should be just very rare case, but would be a great addition.

Also: I have a lot more numeric and factor columns in my real data so I don't want to define every single column separately. In some data tables there are no factor columns. So the solution has to work even if there are no numeric (time is always there and numeric) or factor columns.

Thx in advance!


回答1:


We can do a group by mean

library(data.table)
library(tidyr)
library(dplyr)
dt[, lapply(.SD, function(x) if(is.numeric(x)) mean(x, na.rm = TRUE)
     else toString(unique(x[!is.na(x)]))), .(time)] %>%
     separate_rows(farbe, goff)
# A tibble: 7 x 7
#   time  abst farbe  gier goff   huft     mode
#  <dbl> <dbl> <chr> <dbl> <chr>  <chr>   <dbl>
#1     1     1 keine   2.5 "haus" ""        4  
#2     1     1 keine   2.5 "maus" ""        4  
#3     1     1 keine   2.5 "toll" ""        4  
#4     2     0 keine   0   "maus" "wolle"   2.5
#5     3     0 keine   0   ""     "wolle" NaN  
#6     4     3 rot     3   "maus" "holz"    3  
#7     4     3 blau    3   "maus" "holz"    3  

Or with cSplit

library(splitstackshape)
cSplit(dt[, lapply(.SD, function(x) if(is.numeric(x)) 
    mean(x, na.rm = TRUE) else toString(unique(x[!is.na(x)]))), .(time)], 
    c('farbe', 'goff'), sep= ',\\s*', 'long', fixed = FALSE)
#   time abst farbe gier goff  huft mode
#1:    1    1 keine  2.5 haus        4.0
#2:    1    1  <NA>  2.5 maus        4.0
#3:    1    1  <NA>  2.5 toll        4.0
#4:    2    0 keine  0.0 maus wolle  2.5
#5:    3    0 keine  0.0 <NA> wolle  NaN
#6:    4    3   rot  3.0 maus  holz  3.0
#7:    4    3  blau  3.0 <NA>  holz  3.0


来源:https://stackoverflow.com/questions/61876254/r-combining-duplicate-rows-in-a-time-series-with-different-column-types-in-a-dat

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!