Calculate mean difference per row and per group

生来不讨喜 2021-01-26 22:46

I have a data.frame with many rows and columns and I want to calculate the mean difference of each value to each of the other values within a group.
Here an exa

2 Answers
  • 2021-01-26 23:40

    A solution using a cross join (CJ) from the data.table package. Its drawback is that duplicated rows in the original data frame are collapsed into one:

    > dt <- setDT(df)[,setNames(CJ(value, value), c("value", "value1")), .(ID)][,.(value_mean_diff = sum((value-value1)^2)/.N),.(ID, value)]
    > dt
       ID value value_mean_diff
    1:  1     4        3.333333
    2:  1     5        1.666667
    3:  1     7        4.333333
    4:  2     5        2.750000
    5:  2     6        1.250000
    6:  2     8        4.250000
    

    Because duplicated rows always have the same value_mean_diff, you can merge the result back onto the original data to recover all the duplicated rows:

    > merge(dt, df, by = c("ID", "value"))
       ID value value_mean_diff
    1:  1     4        3.333333
    2:  1     5        1.666667
    3:  1     7        4.333333
    4:  2     5        2.750000
    5:  2     6        1.250000
    6:  2     6        1.250000
    7:  2     8        4.250000
    

    Update: The above method is memory intensive, because the cross join creates n^2 rows for a group of n values. Instead, you can use the fact that value_mean_diff = (value - mean(value))^2 + variance(value), where variance(value) is the population variance within the group. You can prove this by expanding the variance from its definition. Using this identity, you can compute the result as follows:

    > setDT(df)[, value_mean_diff := (value - mean(value))^2 + var(value) * (.N - 1) / .N, .(ID)]
    > df
       ID value value_mean_diff
    1:  1     4        3.333333
    2:  1     5        1.666667
    3:  1     7        4.333333
    4:  2     8        4.250000
    5:  2     6        1.250000
    6:  2     5        2.750000
    7:  2     6        1.250000
    

    Keep in mind that the var() function in R calculates the sample variance, so you need to convert it to the population variance by multiplying by the factor (n-1)/n.
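
    For reference, the identity used in the update can be derived as sketched below. The notation is an assumption added here and is not part of the original answer: x_1, ..., x_n are the values in one group, \bar{x} is their mean, \sigma^2 is the population variance, and s^2 is the sample variance returned by var().

    ```latex
    \begin{aligned}
    \frac{1}{n}\sum_{j=1}^{n}(x_i - x_j)^2
      &= \frac{1}{n}\sum_{j=1}^{n}\bigl((x_i - \bar{x}) - (x_j - \bar{x})\bigr)^2 \\
      &= (x_i - \bar{x})^2
         - 2\,(x_i - \bar{x})\,\frac{1}{n}\sum_{j=1}^{n}(x_j - \bar{x})
         + \frac{1}{n}\sum_{j=1}^{n}(x_j - \bar{x})^2 \\
      &= (x_i - \bar{x})^2 + \sigma^2,
      \qquad \text{since } \sum_{j=1}^{n}(x_j - \bar{x}) = 0, \\
    \sigma^2 &= \frac{n-1}{n}\, s^2 .
    \end{aligned}
    ```

    For ID 1 with values 4, 5, 7, the mean is 16/3 and the population variance is 14/9, so for value 4 the result is 16/9 + 14/9 = 10/3, which matches the 3.333333 shown above.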

  • 2021-01-26 23:46

    Here's a solution using only base R:

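    myData <- data.frame(ID=c(1,1,1,2,2,2,2), value=c(4,5,7,8,6,5,6), diff=NA)
    # For each row, average the squared differences to all values in the same ID group
    for(i in seq_len(nrow(myData))) {
        same_group <- myData$value[myData$ID == myData$ID[i]]
        myData$diff[i] <- mean((myData$value[i] - same_group)^2)
    }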
    
    myData
    
      ID value     diff
    1  1     4 3.333333
    2  1     5 1.666667
    3  1     7 4.333333
    4  2     8 4.250000
    5  2     6 1.250000
    6  2     5 2.750000
    7  2     6 1.250000
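
    The loop above can also be written without explicit indexing. The following is a minimal sketch, not part of the original answer, that assumes the same example data and uses base R's ave() to apply a per-group function; the helper name mean_sq_diff is hypothetical.

    ```r
    # Same example data as in the loop above.
    myData <- data.frame(ID = c(1, 1, 1, 2, 2, 2, 2), value = c(4, 5, 7, 8, 6, 5, 6))

    # Hypothetical helper: for each value in a group, average the squared
    # differences to every value in that group (including itself).
    mean_sq_diff <- function(x) {
      sapply(x, function(xi) mean((xi - x)^2))
    }

    # ave() applies the function within each ID and returns the results
    # in the original row order, so duplicated rows are kept.
    myData$diff <- ave(myData$value, myData$ID, FUN = mean_sq_diff)
    myData
    ```

    Like the loop, this does O(n^2) work per group, so for large groups the variance identity from the other answer is likely the cheaper option.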
    