Between/within standard deviations

问题

When working on a hierarchical/multilevel/panel dataset, it may be very useful to adopt a package which returns the within- and between-group standard deviations of the available variables.

This is something that with the following data in Stata can be easily done through the command

xtsum, i(momid)

I made a research, but I cannot find any R package which can do that..

edit:

Just to fix ideas, an example of hierarchical dataset could be this:

son_id       mom_id      hispanic     mom_smoke     son_birthweigth

  1            1            1            1              3950
  2            1            1            0              3890
  3            1            1            0              3990
  1            2            0            1              4200
  2            2            0            1              4120
  1            3            0            0              2975
  2            3            0            1              2980

The "multilevel" structure is given by the fact that each mother (higher level) has two or more sons (lower level). Hence, each mother defines a group of observations.

Accordingly, each dataset variable can vary either between and within mothers or only between mothers. birtweigth varies among mothers, but also within the same mother. Instead, hispanic is fixed for the same mother.

For example, the within-mother variance of son_birthweigth is:

# mom1 means
    bwt_mean1 <- (3950+3890+3990)/3
    bwt_mean2 <- (4200+4120)/2
    bwt_mean3 <- (2975+2980)/2

# Within-mother variance for birthweigth
    ((3950-bwt_mean1)^2 + (3890-bwt_mean1)^2 + (3990-bwt_mean1)^2 + 
    (4200-bwt_mean2)^2 + (4120-bwt_mean2)^2 + 
    (2975-bwt_mean3)^2 + (2980-bwt_mean3)^2)/(7-1)

While the between-mother variance is:

# overall mean of birthweigth:
# mean <- sum(data$son_birthweigth)/length(data$son_birthweigth)
    mean <- (3950+3890+3990+4200+4120+2975+2980)/7

# within variance:
    ((bwt_mean1-mean)^2 + (bwt_mean2-mean)^2 + (bwt_mean3-mean)^2)/(3-1)

回答1:

I don't know what your stata command should reproduce, but to answer the second part of question about hierarchical structure , it is easy to do this with list. For example, you define a structure like this:

tree = list(
      "var1" = list(
         "panel" = list(type ='p',mean = 1,sd=0)
         ,"cluster" = list(type = 'c',value = c(5,8,10)))
      ,"var2" = list(
          "panel" = list(type ='p',mean = 2,sd=0.5)
         ,"cluster" = list(type="c",value =c(1,2)))
)

To create this lapply is convinent to work with list

tree <- lapply(list('var1','var2'),function(x){ 
  ll <- list(panel= list(type ='p',mean = rnorm(1),sd=0), ## I use symbol here not name
             cluster= list(type = 'c',value = rnorm(3)))  ## R prefer symbols
})
names(tree) <-c('var1','var2')

You can view he structure with str

str(tree)
List of 2
 $ var1:List of 2
  ..$ panel  :List of 3
  .. ..$ type: chr "p"
  .. ..$ mean: num 0.284
  .. ..$ sd  : num 0
  ..$ cluster:List of 2
  .. ..$ type : chr "c"
  .. ..$ value: num [1:3] 0.0722 -0.9413 0.6649
 $ var2:List of 2
  ..$ panel  :List of 3
  .. ..$ type: chr "p"
  .. ..$ mean: num -0.144
  .. ..$ sd  : num 0
  ..$ cluster:List of 2
  .. ..$ type : chr "c"
  .. ..$ value: num [1:3] -0.595 -1.795 -0.439

Edit after OP clarification

I think that package reshape2 is what you want. I will demonstrate this here.

The idea here is in order to do the multilevel analysis we need to reshape the data.

First to divide the variables into two groups :identifier and measured variables. library(reshape2) dat.m <- melt(dat,id.vars=c('son_id','mom_id')) ## other columns are measured

str(dat.m)
'data.frame':   21 obs. of  4 variables:
 $ son_id  : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 1 2 1 2 3 ...
 $ mom_id  : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 3 3 1 1 1 ...
 $ variable: Factor w/ 3 levels "hispanic","mom_smoke",..: 1 1 1 1 1 1 1 2 2 2 ...
 $ value   : num  1 1 1 0 0 0 0 1 0 0 ..

Once your have data in "moten" form , you can "cast" to rearrange it in the shape that you want:

# mom1 means for all variable
 acast(dat.m,variable~mom_id,mean)
                           1    2      3
hispanic           1.0000000    0    0.0
mom_smoke          0.3333333    1    0.5
son_birthweigth 3943.3333333 4160 2977.5
# Within-mother variance for birthweigth

acast(dat.m,variable~mom_id,function(x) sum((x-mean(x))^2))
                           1    2    3
hispanic           0.0000000    0  0.0
mom_smoke          0.6666667    0  0.5
son_birthweigth 5066.6666667 3200 12.5

## overall mean of each variable
acast(dat.m,variable~.,mean)
[,1]
hispanic           0.4285714
mom_smoke          0.5714286
son_birthweigth 3729.2857143

回答2:

I know this question is four years old, but recently I wanted to do the same in R and came up with the following function. It depends on dplyr and tibble. Where: df is the dataframe, columns is a numerical vector to subset the dataframe and individuals is the column with the individuals.

xtsumR<-function(df,columns,individuals){
  df<-dplyr::arrange_(df,individuals)
  panel<-tibble::tibble()
  for (i in columns){
    v<-df %>% dplyr::group_by_() %>%
      dplyr::summarize_(
        mean=mean(df[[i]]),
        sd=sd(df[[i]]),
        min=min(df[[i]]),
        max=max(df[[i]])
      )
    v<-tibble::add_column(v,variacao="overal",.before=-1)
    v2<-aggregate(df[[i]],list(df[[individuals]]),"mean")[[2]]
    sdB<-sd(v2)
    varW<-df[[i]]-rep(v2,each=12) #
    varW<-varW+mean(df[[i]])
    sdW<-sd(varW)
    minB<-min(v2)
    maxB<-max(v2)
    minW<-min(varW)
    maxW<-max(varW)
    v<-rbind(v,c("between",NA,sdB,minB,maxB),c("within",NA,sdW,minW,maxW))
    panel<-rbind(panel,v)
  }
  var<-rep(names(df)[columns])
  n1<-rep(NA,length(columns))
  n2<-rep(NA,length(columns))
  var<-c(rbind(var,n1,n1))
  panel$var<-var
  panel<-panel[c(6,1:5)]
  names(panel)<-c("variable","variation","mean","standard.deviation","min","max")
  panel[3:6]<-as.numeric(unlist(panel[3:6]))
  panel[3:6]<-round(unlist(panel[3:6]),2)
  return(panel)
}

来源：https://stackoverflow.com/questions/14165752/between-within-standard-deviations

标签

hierarchical-data

stata

multi-level