Correlation between groups in R data.table

南笙 2021-01-02 03:59

Is there a way of elegantly calculating the correlations between values if those values are stored by group in a single column of a data.table (other than converting the dat

3 Answers
  • 2021-01-02 04:13

    I've since found an even simpler alternative for doing this. You were actually pretty close with your dt[, cor(value, value), by="group"] approach. What you need is to first do a Cartesian self-join on the dates, and then group by both group columns, i.e.

    dt[dt, allow.cartesian=TRUE][, cor(value, value.1), by=list(group, group.1)]
    

    This has the advantage that it will join the series together (rather than assume they are the same length). You can then cast this into matrix form, or leave it as it is to plot as a heatmap in ggplot etc.

    Full Example

    setkey(dt, id)
    c <- dt[dt, allow.cartesian=T][, list(Cor = cor(value, value.1)), by = list(group, group.1)]
    c
    
       group group.1       Cor
    1:     a       a 1.0000000
    2:     b       a 0.1556371
    3:     a       b 0.1556371
    4:     b       b 1.0000000
    
    dcast(c, group~group.1, value.var = "Cor")
    
      group         a         b
    1     a 1.0000000 0.1556371
    2     b 0.1556371 1.0000000
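
A self-contained variant of the self-join idea above, as a minimal sketch. The data is rebuilt with set.seed(1) so it matches the values shown elsewhere in this thread. Depending on the data.table version, the duplicated columns produced by dt[dt] may not be named group.1 and value.1 (newer releases, as far as I know, use an i. prefix), so this sketch swaps in merge() with explicit suffixes. The names pairs, res and wide are illustrative only.

```r
library(data.table)

set.seed(1)  # reproducibility
# Long-format data shaped like the thread's example: two groups, four ids each
dt <- data.table(id    = rep(1:4, times = 2),
                 group = rep(c("a", "b"), each = 4),
                 value = rnorm(8))

# Self-join on id; explicit suffixes keep the column names predictable
pairs <- merge(dt, dt, by = "id", suffixes = c(".x", ".y"),
               allow.cartesian = TRUE)

# One correlation per ordered pair of groups (molten form)
res <- pairs[, .(Cor = cor(value.x, value.y)), by = .(group.x, group.y)]

# Cast to a square table if a matrix-like layout is wanted
wide <- dcast(res, group.x ~ group.y, value.var = "Cor")
print(wide)
```

Because the join is on id, groups with different sets of ids are matched only where the ids overlap, which is the advantage described above.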
    
  • 2021-01-02 04:16

    There is no simple way to do this with data.table. The first way you've provided:

    cor(dt["a"]$value, dt["b"]$value)
    

    is probably the simplest.

    An alternative is to reshape your data.table from "long" format, to "wide" format:

    > dtw <- reshape(dt, timevar="group", idvar="id", direction="wide")
    > dtw
       id    value.a    value.b
    1:  1 -0.6264538  0.3295078
    2:  2  0.1836433 -0.8204684
    3:  3 -0.8356286  0.4874291
    4:  4  1.5952808  0.7383247
    > cor(dtw[,list(value.a, value.b)])
              value.a   value.b
    value.a 1.0000000 0.1556371
    value.b 0.1556371 1.0000000
    

    Update: If you're using data.table version 1.9.0 or later, you can use dcast.data.table instead, which will be much faster. Check this post for more info.

    dcast.data.table(dt, id ~ group, value.var = "value")
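
To round off the long-to-wide route, here is a small sketch that goes from the molten table to a correlation matrix in two steps. It assumes the same toy data as the rest of the thread (rebuilt here with set.seed(1)) and uses the dcast generic that data.table exports. The names wide, group_cols and cmat are illustrative only.

```r
library(data.table)

set.seed(1)  # reproducibility
dt <- data.table(id    = rep(1:4, times = 2),
                 group = rep(c("a", "b"), each = 4),
                 value = rnorm(8))

# One row per id, one column per group
wide <- dcast(dt, id ~ group, value.var = "value")

# Drop the id column and correlate the remaining group columns
group_cols <- setdiff(names(wide), "id")
cmat <- cor(as.matrix(wide[, group_cols, with = FALSE]))
print(cmat)
```

An id that is missing from one group shows up as NA in the wide table, so for ragged data you would probably want to pass use = "pairwise.complete.obs" to cor().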
    
  • 2021-01-02 04:31

    I don't know a way to get it in matrix form straight away, but I find this solution useful:

    dt[, {x = value; dt[, cor(x, value), by = group]}, by=group]
    
       group group        V1
    1:     a     a 1.0000000
    2:     a     b 0.1556371
    3:     b     a 0.1556371
    4:     b     b 1.0000000
    

    since you started with a molten dataset and you end up with a molten representation of the correlation.

    Using this form you can also choose to calculate only certain pairs. In particular, it is a waste of time to calculate both off-diagonal halves, since the correlation matrix is symmetric. For example:

    dt[, {x = value; g = group; dt[group <= g, list(cor(x, value)), by = group]}, by=group]
    
       group group        V1
    1:     a     a 1.0000000
    2:     b     a 0.1556371
    3:     b     b 1.0000000
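
A different way to visit each pair only once, sketched with base R's combn() rather than the nested data.table call above. The third group "c" is hypothetical and added purely so that there is more than one pair. The sketch assumes every group has the same ids, so that sorting by id aligns the values.

```r
library(data.table)

set.seed(1)  # reproducibility
# Hypothetical data: group "c" is added only for illustration
dt <- data.table(id    = rep(1:4, times = 3),
                 group = rep(c("a", "b", "c"), each = 4),
                 value = rnorm(12))
setkey(dt, group, id)  # sort so values line up by id within each group

# All unordered pairs of distinct groups: (a,b), (a,c), (b,c)
pairs <- as.data.table(t(combn(unique(dt$group), 2)))
setnames(pairs, c("g1", "g2"))

# Correlate each pair exactly once
pair_cor <- function(ga, gb) {
  cor(dt$value[dt$group == ga], dt$value[dt$group == gb])
}
pairs[, Cor := mapply(pair_cor, g1, g2, USE.NAMES = FALSE)]
print(pairs)
```

This skips the diagonal as well, which is always 1 and rarely worth computing.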
    

    Alternatively, this form works just as well for the cross-correlation between two sets (i.e. the off-diagonal block):

    library(data.table)
    set.seed(1)             # reproducibility
    dt1 <- data.table(id=1:4, group=rep(letters[1:2], c(4,4)), value=rnorm(8))
    dt2 <- data.table(id=1:4, group=rep(letters[3:4], c(4,4)), value=rnorm(8))
    setkey(dt1, group)
    setkey(dt2, group)
    
    dt1[, {x = value; g = group; dt2[, list(cor(x, value)), by = group]}, by=group]
    
       group group          V1
    1:     a     c -0.39499814
    2:     a     d  0.74234458
    3:     b     c  0.96088312
    4:     b     d  0.08016723
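
If the off-diagonal block is wanted directly as a matrix, one option is to cast both tables to wide form and call cor() with two arguments, which correlates each column of the first matrix with each column of the second. This is a sketch of that alternative, not the nested-call method above. The names w1, w2, both and block are illustrative only.

```r
library(data.table)

set.seed(1)  # reproducibility
dt1 <- data.table(id    = rep(1:4, times = 2),
                  group = rep(c("a", "b"), each = 4),
                  value = rnorm(8))
dt2 <- data.table(id    = rep(1:4, times = 2),
                  group = rep(c("c", "d"), each = 4),
                  value = rnorm(8))

# Wide form: one column per group, one row per id
w1 <- dcast(dt1, id ~ group, value.var = "value")
w2 <- dcast(dt2, id ~ group, value.var = "value")

# Inner join on id so the rows of both sets are aligned
both <- merge(w1, w2, by = "id")

# Rows of the result are groups of dt1, columns are groups of dt2
block <- cor(as.matrix(both[, c("a", "b"), with = FALSE]),
             as.matrix(both[, c("c", "d"), with = FALSE]))
print(block)
```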
    

    If you ultimately want these in matrix form, you can use dcast or dcast.data.table. Note, however, that the examples above produce two columns with the same name; to fix this, rename them in the j expression. For the original problem:

    dcast.data.table(dt[, {x = value; g1=group; dt[, list(g1, g2=group, c =cor(x, value)), by = group]}, by=group], g1~g2, value.var = "c")
    
       g1         a         b
    1:  a 1.0000000 0.1556371
    2:  b 0.1556371 1.0000000
    