问题
I'd like to summarize a set of observations in a datatable and could use some help with the syntax.
I think this is as simple as a join but I'm trying to identify that specific values were seen on a specific observation DAY even if its across multiple measurements or sensors on that day.
- observations are summarized by date
- observations date have varied counts of measurements (rows per date)
- 'M'easurement columns indicate that a specific value was observed in ANY sensor for the day.
I've created 2 sample sets of data that I hope will clarify the goal. I've also created an image of an excel spreadsheet that hopes to show the relationship between the data.
library(data.table)
raw <- data.table(
Date = as.Date(c("2013-5-4","2013-5-4","2013-5-4", "2013-5-9","2013-5-9", "2013-5-16","2013-5-16","2013-5-16", "2013-5-30")),
S1 = c(4, 2, 3, 1, 1, 8, 7, 3, 3),
S2 = c(2, 5, 2, 4, 4, 9, 1, 6, 4),
S3 = c(6, 2, 2, 7, 3, 2, 7, 2, 1)
)
summarized <- data.table(
Date = as.Date(c("2013-5-4", "2013-5-9", "2013-5-16", "2013-5-30")),
M1 = c(FALSE,TRUE,TRUE,TRUE),
M2 = c(TRUE,FALSE,TRUE,FALSE),
M3 = c(TRUE,TRUE,TRUE,TRUE),
M4 = c(TRUE,FALSE,FALSE,TRUE),
M5 = c(TRUE,FALSE,FALSE,FALSE),
M6 = c(TRUE,FALSE,TRUE,FALSE),
M7 = c(FALSE,TRUE,TRUE,FALSE),
M8 = c(FALSE,FALSE,TRUE,FALSE),
M9 = c(FALSE,FALSE,TRUE,FALSE),
M10 = c(FALSE,FALSE,TRUE,FALSE)
)
Excel
Raw is the measurements input. Multiple measurements can happen on the same observation date (i.e. multiple rows).
Summarized is what I'm hoping to get out. Rows are summarized and the 'm'easurement columns merely indicate that the value (following the M, i.e. M1, M2) was observed on the day in any of the V columns. For example, the number 2 was seen on the first and last observation on 5/16, but the number 5 was not seen in any of the 9 values on 5/16.
I think I need to use a join but how to calculate the M columns escapes me.
Any help is much appreciated.
Question: is there a name for this type of operation in data science or mathematics?
Update: I'm trying the following
setkey(raw,Date)
s <- data.table( Date=unique(raw$Date)) # get a datatable of the unique dates
setkey(s,Date)
s[raw, M1:=(length(na.omit(match(c(raw$V1,raw$v2,raw$v3),1)))>=1)]
Note that the values are not what's expected for 5-4 (should be FALSE). I think this is becuase the raw rows are not being constrained in my match statement.
Date M1
1: 2013-05-04 TRUE
2: 2013-05-09 TRUE
3: 2013-05-16 TRUE
4: 2013-05-30 TRUE
My guess is I need to use something different to subset the raw rows in the join.
回答1:
This seems to work:
raw[,lapply(1:10,`%in%`,unique(unlist(.SD))),by=Date]
The result is
Date V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1: 2013-05-04 FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
2: 2013-05-09 TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
3: 2013-05-16 TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
4: 2013-05-30 TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
If you want the columns with "M" instead of "V", use c(M=1:10)
in place of 1:10
.
回答2:
This is a reshaping problem.
First, since it doesn't matter which sensor the data came from, let's aggregate your three columns into one column.
temp <- raw[,Reduce(union,list(S1,S2,S3)),by=Date]
Now we want to reshape from "long" to "wide" format.
A data table solution borrowed from this answer:
setkey(temp,Date,V1)
temp[CJ(unique(Date),unique(V1)), list(.N)][,
setNames(as.list(as.logical(N)), paste0("M",unique(V1))), by = Date]
# Date M1 M2 M3 M4 M5 M6 M7 M8 M9
# 1: 2013-05-04 FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
# 2: 2013-05-09 TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE
# 3: 2013-05-16 TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
# 4: 2013-05-30 TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
Base reshape
works this way:
as.data.table(reshape(temp, timevar = "V1", v.names = "V1", idvar = "Date", direction = "wide"))
# Date V1.4 V1.2 V1.3 V1.5 V1.6 V1.1 V1.7 V1.8 V1.9
# 1: 2013-05-04 4 2 3 5 6 NA NA NA NA
# 2: 2013-05-09 4 NA 3 NA NA 1 7 NA NA
# 3: 2013-05-16 NA 2 3 NA 6 1 7 8 9
# 4: 2013-05-30 4 NA 3 NA NA 1 NA NA NA
## to order by column
temp2 <- as.data.table(reshape(temp[order(V1)], timevar = "V1", v.names = "V1", idvar = "Date", direction = "wide"))
# Date V1.1 V1.2 V1.3 V1.4 V1.5 V1.6 V1.7 V1.8 V1.9
# 1: 2013-05-09 1 NA 3 4 NA NA 7 NA NA
# 2: 2013-05-16 1 2 3 NA NA 6 7 8 9
# 3: 2013-05-30 1 NA 3 4 NA NA NA NA NA
# 4: 2013-05-04 NA 2 3 4 5 6 NA NA NA
##converts to logical true/false
temp2[,lapply(.SD,function(x) {x[is.na(x)] <- 0; as.logical(x)}), by = Date]
# Date vv V1.1 V1.2 V1.3 V1.4 V1.5 V1.6 V1.7 V1.8 V1.9
# 1: 2013-05-09 TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE
# 2: 2013-05-16 TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
# 3: 2013-05-30 TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
# 4: 2013-05-04 TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
The package reshape2
is a bit more intuitive:
require(reshape2)
## dummy variable for TRUE/FALSE
temp[,vv := TRUE]
temp_reshape2 <- as.data.table(dcast(temp, Date ~ V1, value.var = "vv"))
## replace NA with FALSE
temp_reshape2[, lapply(.SD, function(x) {x[is.na(x)] <- FALSE; x}), by = Date]
# Date 1 2 3 4 5 6 7 8 9
# 1: 2013-05-04 FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
# 2: 2013-05-09 TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE
# 3: 2013-05-16 TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
# 4: 2013-05-30 TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
And for completion, a lame parse-eval solution:
limits <- temp[,c(min(V1),max(V1))]
sapply(temp[,min(V1) : max(V1)], function(x) {
temp[,eval(parse(text=paste0("M",x," := any(abs(V1 - ",x,") < .Machine$double.eps)"))),by = Date]
})
来源:https://stackoverflow.com/questions/18871614/r-datatable-join-and-constrain-rows