R DataTable Join and constrain rows

只谈情不闲聊 提交于 2019-12-11 02:49:11

问题


I'd like to summarize a set of observations in a datatable and could use some help with the syntax.

I think this is as simple as a join but I'm trying to identify that specific values were seen on a specific observation DAY even if its across multiple measurements or sensors on that day.

  • observations are summarized by date
  • observations date have varied counts of measurements (rows per date)
  • 'M'easurement columns indicate that a specific value was observed in ANY sensor for the day.

I've created 2 sample sets of data that I hope will clarify the goal. I've also created an image of an excel spreadsheet that hopes to show the relationship between the data.

library(data.table)
raw <- data.table(
  Date = as.Date(c("2013-5-4","2013-5-4","2013-5-4", "2013-5-9","2013-5-9", "2013-5-16","2013-5-16","2013-5-16", "2013-5-30")),
  S1 = c(4, 2, 3, 1, 1, 8, 7, 3, 3),
  S2 = c(2, 5, 2, 4, 4, 9, 1, 6, 4),
  S3 = c(6, 2, 2, 7, 3, 2, 7, 2, 1)
)

summarized <- data.table(
  Date = as.Date(c("2013-5-4", "2013-5-9", "2013-5-16", "2013-5-30")),
  M1 = c(FALSE,TRUE,TRUE,TRUE),
  M2 = c(TRUE,FALSE,TRUE,FALSE),
  M3 = c(TRUE,TRUE,TRUE,TRUE),
  M4 = c(TRUE,FALSE,FALSE,TRUE),
  M5 = c(TRUE,FALSE,FALSE,FALSE),
  M6 = c(TRUE,FALSE,TRUE,FALSE),
  M7 = c(FALSE,TRUE,TRUE,FALSE),
  M8 = c(FALSE,FALSE,TRUE,FALSE),
  M9 = c(FALSE,FALSE,TRUE,FALSE),
  M10 = c(FALSE,FALSE,TRUE,FALSE)
)

Excel

Raw is the measurements input. Multiple measurements can happen on the same observation date (i.e. multiple rows).

Summarized is what I'm hoping to get out. Rows are summarized and the 'm'easurement columns merely indicate that the value (following the M, i.e. M1, M2) was observed on the day in any of the V columns. For example, the number 2 was seen on the first and last observation on 5/16, but the number 5 was not seen in any of the 9 values on 5/16.

I think I need to use a join but how to calculate the M columns escapes me.

Any help is much appreciated.

Question: is there a name for this type of operation in data science or mathematics?

Update: I'm trying the following

setkey(raw,Date)
s <- data.table( Date=unique(raw$Date)) # get a datatable of the unique dates
setkey(s,Date)
s[raw, M1:=(length(na.omit(match(c(raw$V1,raw$v2,raw$v3),1)))>=1)]

Note that the values are not what's expected for 5-4 (should be FALSE). I think this is becuase the raw rows are not being constrained in my match statement.

         Date   M1
1: 2013-05-04 TRUE
2: 2013-05-09 TRUE
3: 2013-05-16 TRUE
4: 2013-05-30 TRUE

My guess is I need to use something different to subset the raw rows in the join.


回答1:


This seems to work:

raw[,lapply(1:10,`%in%`,unique(unlist(.SD))),by=Date]

The result is

         Date    V1    V2   V3    V4    V5    V6    V7    V8    V9   V10
1: 2013-05-04 FALSE  TRUE TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
2: 2013-05-09  TRUE FALSE TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
3: 2013-05-16  TRUE  TRUE TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
4: 2013-05-30  TRUE FALSE TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

If you want the columns with "M" instead of "V", use c(M=1:10) in place of 1:10.




回答2:


This is a reshaping problem.

First, since it doesn't matter which sensor the data came from, let's aggregate your three columns into one column.

temp <- raw[,Reduce(union,list(S1,S2,S3)),by=Date]

Now we want to reshape from "long" to "wide" format.

A data table solution borrowed from this answer:

setkey(temp,Date,V1)
temp[CJ(unique(Date),unique(V1)), list(.N)][,
  setNames(as.list(as.logical(N)), paste0("M",unique(V1))), by = Date]
         # Date    M1    M2   M3    M4    M5    M6    M7    M8    M9
# 1: 2013-05-04 FALSE  TRUE TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
# 2: 2013-05-09  TRUE FALSE TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
# 3: 2013-05-16  TRUE  TRUE TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
# 4: 2013-05-30  TRUE FALSE TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

Base reshape works this way:

as.data.table(reshape(temp, timevar = "V1", v.names = "V1", idvar = "Date", direction = "wide"))
         # Date V1.4 V1.2 V1.3 V1.5 V1.6 V1.1 V1.7 V1.8 V1.9
# 1: 2013-05-04    4    2    3    5    6   NA   NA   NA   NA
# 2: 2013-05-09    4   NA    3   NA   NA    1    7   NA   NA
# 3: 2013-05-16   NA    2    3   NA    6    1    7    8    9
# 4: 2013-05-30    4   NA    3   NA   NA    1   NA   NA   NA

## to order by column
temp2 <- as.data.table(reshape(temp[order(V1)], timevar = "V1", v.names = "V1", idvar = "Date", direction = "wide"))
         # Date V1.1 V1.2 V1.3 V1.4 V1.5 V1.6 V1.7 V1.8 V1.9
# 1: 2013-05-09    1   NA    3    4   NA   NA    7   NA   NA
# 2: 2013-05-16    1    2    3   NA   NA    6    7    8    9
# 3: 2013-05-30    1   NA    3    4   NA   NA   NA   NA   NA
# 4: 2013-05-04   NA    2    3    4    5    6   NA   NA   NA

##converts to logical true/false
temp2[,lapply(.SD,function(x) {x[is.na(x)] <- 0; as.logical(x)}), by = Date]
       # Date   vv  V1.1  V1.2 V1.3  V1.4  V1.5  V1.6  V1.7  V1.8  V1.9
# 1: 2013-05-09 TRUE  TRUE FALSE TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
# 2: 2013-05-16 TRUE  TRUE  TRUE TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
# 3: 2013-05-30 TRUE  TRUE FALSE TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
# 4: 2013-05-04 TRUE FALSE  TRUE TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE

The package reshape2 is a bit more intuitive:

require(reshape2)
## dummy variable for TRUE/FALSE
temp[,vv := TRUE]
temp_reshape2 <- as.data.table(dcast(temp, Date ~ V1, value.var = "vv"))
## replace NA with FALSE
temp_reshape2[, lapply(.SD, function(x) {x[is.na(x)] <- FALSE; x}), by = Date]
         # Date     1     2    3     4     5     6     7     8     9
# 1: 2013-05-04 FALSE  TRUE TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
# 2: 2013-05-09  TRUE FALSE TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
# 3: 2013-05-16  TRUE  TRUE TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
# 4: 2013-05-30  TRUE FALSE TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

And for completion, a lame parse-eval solution:

limits <- temp[,c(min(V1),max(V1))]
sapply(temp[,min(V1) : max(V1)], function(x) {
temp[,eval(parse(text=paste0("M",x," := any(abs(V1 - ",x,") < .Machine$double.eps)"))),by = Date]
})


来源:https://stackoverflow.com/questions/18871614/r-datatable-join-and-constrain-rows

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!