Calculate mean difference per row and per group

I have a data.frame with many rows and columns and I want to calculate the mean difference of each value to each of the other values within a group.
Here an exa

2 Answers

名媛妹妹

2021-01-26 23:40

A solution using a cross join (CJ) from the data.table package. One drawback is that it drops duplicated rows from the original data frame:

> dt <- setDT(df)[,setNames(CJ(value, value), c("value", "value1")), .(ID)][,.(value_mean_diff = sum((value-value1)^2)/.N),.(ID, value)]
> dt
   ID value value_mean_diff
1:  1     4        3.333333
2:  1     5        1.666667
3:  1     7        4.333333
4:  2     5        2.750000
5:  2     6        1.250000
6:  2     8        4.250000

Since duplicated rows always share the same value_mean_diff, you can merge the result back onto the original data to recover them:

> merge(dt, df, by = c("ID", "value"))
   ID value value_mean_diff
1:  1     4        3.333333
2:  1     5        1.666667
3:  1     7        4.333333
4:  2     5        2.750000
5:  2     6        1.250000
6:  2     6        1.250000
7:  2     8        4.250000

Update: The method above is memory intensive, because the cross join creates n^2 rows for a group of n values. You can instead use the identity value_mean_diff = (value - mean(value))^2 + variance(value), where the variance is the population variance within the group. You can prove it by expanding the variance from its definition. With that, you can compute the result as follows:

> setDT(df)[, value_mean_diff := (value - mean(value))^2 + var(value) * (.N - 1) / .N, .(ID)]
> df
   ID value value_mean_diff
1:  1     4        3.333333
2:  1     5        1.666667
3:  1     7        4.333333
4:  2     8        4.250000
5:  2     6        1.250000
6:  2     5        2.750000
7:  2     6        1.250000

Keep in mind that the var() function in R calculates the sample variance, so you need to convert it to the population variance by multiplying by the factor (n-1)/n.
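For reference, here is a short sketch of why the identity holds. The notation is my own, not from the question: x_1, ..., x_n are the values in one group, \bar{x} is their mean, \sigma^2 is the population variance and s^2 is the sample variance that var() returns.

```latex
\begin{aligned}
\frac{1}{n}\sum_{j=1}^{n}(x_i - x_j)^2
  &= \frac{1}{n}\sum_{j=1}^{n}\bigl((x_i - \bar{x}) - (x_j - \bar{x})\bigr)^2 \\
  &= (x_i - \bar{x})^2
     - 2\,(x_i - \bar{x})\,\frac{1}{n}\sum_{j=1}^{n}(x_j - \bar{x})
     + \frac{1}{n}\sum_{j=1}^{n}(x_j - \bar{x})^2 \\
  &= (x_i - \bar{x})^2 + \sigma^2,
  \qquad \text{since } \sum_{j=1}^{n}(x_j - \bar{x}) = 0, \\
\sigma^2 &= \frac{n-1}{n}\,s^2 .
\end{aligned}
```

The middle term vanishes because deviations from the mean sum to zero, which leaves the squared deviation of x_i plus the population variance of the group.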

青春惊慌失措

2021-01-26 23:46

Here's a solution using only base R:

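myData <- data.frame(ID = c(1, 1, 1, 2, 2, 2, 2), value = c(4, 5, 7, 8, 6, 5, 6), diff = NA)
# For each row, average the squared differences to all values in the same ID group
for (i in seq_len(nrow(myData))) {
    same_group <- myData$ID == myData$ID[i]
    myData$diff[i] <- mean((myData$value[i] - myData$value[same_group])^2)
}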

myData

  ID value     diff
1  1     4 3.333333
2  1     5 1.666667
3  1     7 4.333333
4  2     8 4.250000
5  2     6 1.250000
6  2     5 2.750000
7  2     6 1.250000
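If the explicit loop gets slow on a large data frame, the same per-group calculation can probably be written without it. The following is only a sketch, not part of the original answer: it uses base R's ave() to apply a function within each ID group, and outer() to build the matrix of pairwise differences for that group. The helper name mean_sq_diff is hypothetical.

```r
myData <- data.frame(ID = c(1, 1, 1, 2, 2, 2, 2), value = c(4, 5, 7, 8, 6, 5, 6))

# Hypothetical helper: for a vector v, return for each element the mean of
# its squared differences to every element of v (itself included).
mean_sq_diff <- function(v) rowMeans(outer(v, v, "-")^2)

# ave() applies the helper within each ID group and keeps the original row order
myData$diff <- ave(myData$value, myData$ID, FUN = mean_sq_diff)

myData
```

Note that outer() still builds an n-by-n matrix per group, so for very large groups the variance-based formula from the other answer should use less memory.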
