Question
This question builds on another question, "R combining duplicate rows by ID with different column types in a dataframe". I have a data.table with a column time and some other columns of different types (factors and numerics). Here is an example:
dt <- data.table(time = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4),
abst = c(0, NA, 2, NA, NA, NA, 0, 0, NA, 2, NA, 3, 4),
farbe = as.factor(c("keine", NA, "keine", NA, NA, NA, "keine", "keine", NA, NA, NA, "rot", "blau")),
gier = c(0, NA, 5, NA, NA, NA, 0, 0, NA, 1, NA, 6, 2),
goff = as.factor(c("haus", "maus", "toll", NA, "haus", NA, "maus", NA, NA, NA, NA, NA, "maus")),
huft = as.factor(c(NA, NA, NA, NA, NA, "wolle", NA, NA, "wolle", NA, NA, "holz", NA)),
mode = c(4, 2, NA, NA, 6, 5, 0, NA, NA, NA, NA, NA, 3))
Now I want to combine the rows with duplicate values in column time. Each numeric column should become the mean of all rows sharing the same time (ignoring NAs). The factor columns should be combined into one value; NAs can be dropped. The desired result is:
dtRes <- data.table(time = c(1, 1, 1, 2, 3, 4, 4),
abst = c(1, 1, 1, 0, 0, 3, 3),
farbe = as.factor(c("keine", "keine", "keine", "keine", "keine", "rot", "blau")),
gier = c(2.5, 2.5, 2.5, 0, 0, 3, 3),
goff = as.factor(c("haus", "maus", "toll", "maus", NA, "maus", "maus")),
huft = as.factor(c(NA, NA, NA, "wolle", "wolle", "holz", "holz")),
mode = c(4, 4, 4, 2.5, NA, 3, 3))
This needs to be fast, because I have about a million observations.
Some further thoughts on this problem: farbe may not be unique within a time. In that case, I think the best option for my data is a duplicated row that differs only in farbe, so there are two rows with identical time and everything else the same, but with different values for farbe. This should be a very rare case, but handling it would be a great addition.
Also, my real data has many more numeric and factor columns, so I don't want to specify every column separately. Some data tables have no factor columns at all. So the solution has to work even if there are no numeric columns (time is always there and numeric) or no factor columns.
Thanks in advance!
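Answer 1:
We can group by time, take the mean of each numeric column, collapse the unique non-NA values of every other column into a comma-separated string, and then split those strings back into rows: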
library(data.table)
library(tidyr)
library(dplyr)
dt[, lapply(.SD, function(x) if(is.numeric(x)) mean(x, na.rm = TRUE)
else toString(unique(x[!is.na(x)]))), .(time)] %>%
separate_rows(farbe, goff)
# A tibble: 7 x 7
# time abst farbe gier goff huft mode
# <dbl> <dbl> <chr> <dbl> <chr> <chr> <dbl>
#1 1 1 keine 2.5 "haus" "" 4
#2 1 1 keine 2.5 "maus" "" 4
#3 1 1 keine 2.5 "toll" "" 4
#4 2 0 keine 0 "maus" "wolle" 2.5
#5 3 0 keine 0 "" "wolle" NaN
#6 4 3 rot 3 "maus" "holz" 3
#7 4 3 blau 3 "maus" "holz" 3
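To see what separate_rows is splitting, it can help to look at the intermediate result of the grouped lapply on its own. The following is a minimal sketch on a made-up three-row table (the names toy, val and lab are hypothetical, not from the question):

```r
library(data.table)

# Hypothetical toy data: one numeric and one factor column besides time
toy <- data.table(time = c(1, 1, 2),
                  val  = c(1, NA, 3),
                  lab  = factor(c("a", "b", NA)))

# Same aggregation as above: mean for numeric columns,
# comma-separated unique non-NA values for everything else
agg <- toy[, lapply(.SD, function(x) {
  if (is.numeric(x)) mean(x, na.rm = TRUE)
  else toString(unique(x[!is.na(x)]))
}), by = time]

print(agg)
```

For time 1, lab becomes the single string "a, b", which is what gets split into rows afterwards. For time 2, where every lab value is NA, toString returns an empty string rather than NA, which is why empty strings show up in the output above.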
Or with cSplit
library(splitstackshape)
cSplit(dt[, lapply(.SD, function(x) if(is.numeric(x))
mean(x, na.rm = TRUE) else toString(unique(x[!is.na(x)]))), .(time)],
c('farbe', 'goff'), sep= ',\\s*', 'long', fixed = FALSE)
# time abst farbe gier goff huft mode
#1: 1 1 keine 2.5 haus 4.0
#2: 1 1 <NA> 2.5 maus 4.0
#3: 1 1 <NA> 2.5 toll 4.0
#4: 2 0 keine 0.0 maus wolle 2.5
#5: 3 0 keine 0.0 <NA> wolle NaN
#6: 4 3 rot 3.0 maus holz 3.0
#7: 4 3 blau 3.0 <NA> holz 3.0
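Both outputs differ slightly from dtRes: empty strings or NaN appear where the question has NA, and the cSplit version does not repeat a single value across the extra rows. If the recycled form from the question is wanted, one possible data.table-only sketch is below. The helper name combine_dups and its key_col argument are made up for illustration. It assumes that a shorter set of unique values may simply be recycled to the longest one within a group, and it returns character columns rather than factors:

```r
library(data.table)

dt <- data.table(time = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4),
  abst = c(0, NA, 2, NA, NA, NA, 0, 0, NA, 2, NA, 3, 4),
  farbe = as.factor(c("keine", NA, "keine", NA, NA, NA, "keine", "keine", NA, NA, NA, "rot", "blau")),
  gier = c(0, NA, 5, NA, NA, NA, 0, 0, NA, 1, NA, 6, 2),
  goff = as.factor(c("haus", "maus", "toll", NA, "haus", NA, "maus", NA, NA, NA, NA, NA, "maus")),
  huft = as.factor(c(NA, NA, NA, NA, NA, "wolle", NA, NA, "wolle", NA, NA, "holz", NA)),
  mode = c(4, 2, NA, NA, 6, 5, 0, NA, NA, NA, NA, NA, 3))

# Hypothetical helper, not part of any package
combine_dups <- function(x, key_col = "time") {
  x[, {
    cols <- lapply(.SD, function(v) {
      if (is.numeric(v)) {
        m <- mean(v, na.rm = TRUE)
        if (is.nan(m)) NA_real_ else m          # all-NA group gives NA, not NaN
      } else {
        u <- unique(as.character(v[!is.na(v)]))
        if (length(u)) u else NA_character_     # no values left gives NA
      }
    })
    n <- max(lengths(cols))                     # rows needed for this group
    lapply(cols, rep_len, length.out = n)       # recycle shorter columns (assumption)
  }, by = key_col]
}

res <- combine_dups(dt)
print(res)
```

Because the columns are picked by type inside lapply, nothing has to be listed by name, and a table with only numeric columns simply yields one row per time. When two non-numeric columns both have several distinct values in the same group, the recycling pairs them positionally, which may not be what is wanted.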
Source: https://stackoverflow.com/questions/61876254/r-combining-duplicate-rows-in-a-time-series-with-different-column-types-in-a-dat