A caterpillar plot of just the “significant” random effects from a mixed effects model

问题

I've had great experiences asking for help here before and I'm hoping to get some help again.

I'm estimating a rather large mixed effects model in which one of the random effects has over 150 different levels. That would make a standard caterpillar plot to be quite unreadable.

I would like, if all possible, to get a caterpillar plot of just the levels of the random effect that are, for a lack of better term, "significant". That is: I want a caterpillar plot in which either the random intercept or the random slope for a varying coefficient has a "confidence interval" (I know that's not quite what it is) that does not include zero.

Consider this standard model from the sleepstudy data that is standard with lme4.

library(lme4)
fit <- lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
ggCaterpillar(ranef(fit,condVar=TRUE), QQ=FALSE, likeDotplot=TRUE, reorder=FALSE)[["Subject"]]

I would get this caterpillar plot.

a caterpillar plot

The caterpillar plot I use comes from this code. Do note I tend to use less conservative bounds for the intervals (i.e. 1.645*se and not 1.96*se).

Basically, I want a caterpillar plot that would just include the levels for 308, 309, 310, 330, 331, 335, 337, 349, 350, 352, and 370 because those levels had either intercepts or slopes whose intervals did not include zero. I ask because my caterpillar plot of over 150 different levels is kind of unreadable and I think this might be a worthwhile solution to it.

Reproducible code follows. I genuinely appreciate any help.

# https://stackoverflow.com/questions/34120578/how-can-i-sort-random-effects-by-value-of-the-random-effect-not-the-intercept
ggCaterpillar <- function(re, QQ=TRUE, likeDotplot=TRUE, reorder=TRUE) {
require(ggplot2)
f <- function(x) {
pv   <- attr(x, "postVar")
cols <- 1:(dim(pv)[1])
se   <- unlist(lapply(cols, function(i) sqrt(pv[i, i, ])))
if (reorder) {
  ord  <- unlist(lapply(x, order)) + rep((0:(ncol(x) - 1)) * nrow(x), each=nrow(x))
  pDf  <- data.frame(y=unlist(x)[ord],
                     ci=1.645*se[ord],
                     nQQ=rep(qnorm(ppoints(nrow(x))), ncol(x)),
                     ID=factor(rep(rownames(x), ncol(x))[ord], levels=rownames(x)[ord]),
                     ind=gl(ncol(x), nrow(x), labels=names(x)))
} else {
  pDf  <- data.frame(y=unlist(x),
                     ci=1.645*se,
                     nQQ=rep(qnorm(ppoints(nrow(x))), ncol(x)),
                     ID=factor(rep(rownames(x), ncol(x)), levels=rownames(x)),
                     ind=gl(ncol(x), nrow(x), labels=names(x)))
}

if(QQ) {  ## normal QQ-plot
  p <- ggplot(pDf, aes(nQQ, y))
  p <- p + facet_wrap(~ ind, scales="free")
  p <- p + xlab("Standard normal quantiles") + ylab("Random effect quantiles")
} else {  ## caterpillar dotplot
  p <- ggplot(pDf, aes(ID, y)) + coord_flip()
  if(likeDotplot) {  ## imitate dotplot() -> same scales for random effects
    p <- p + facet_wrap(~ ind)
  } else {           ## different scales for random effects
    p <- p + facet_grid(ind ~ ., scales="free_y")
  }
  p <- p + xlab("Levels of the Random Effect") + ylab("Random Effect")
}

p <- p + theme(legend.position="none")
p <- p + geom_hline(yintercept=0)
p <- p + geom_errorbar(aes(ymin=y-ci, ymax=y+ci), width=0, colour="black")
p <- p + geom_point(aes(size=1.2), colour="blue") 
return(p)
}

  lapply(re, f)
}


library(lme4)
fit <- lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
ggCaterpillar(ranef(fit,condVar=TRUE), QQ=FALSE, likeDotplot=TRUE, reorder=FALSE)[["Subject"]] 
ggsave(file="sleepstudy.png")

回答1:

First, thanks for putting "significant" in quotation marks ... everyone reading this should remember that significance doesn't have any statistical meaning in this context (it might be better to use a Z-statistic (value/std.error) criterion such as |Z|>1.5 or |Z|>1.75 instead, just to emphasize that this is not an inferential threshold ...)

I ended up getting a little carried away ... I decided that it would be better to refactor/modularize things a little bit, so I wrote an augment method (designed to work with the broom package) that constructs useful data frames from ranef.mer objects ... once this is done, the manipulations you want are pretty easy.

I put the augment.ranef.mer code at the end of my answer -- it's a bit long (you'll need to source it before you can run the code here). update: this augment method has been part of the broom.mixed package for a while now ...

library(broom)
library(reshape2)
library(plyr)

Apply the augment method to the RE object:

rr <- ranef(fit,condVar=TRUE)
aa <- augment(rr)

names(aa)
## [1] "grp"       "variable"  "level"     "estimate"  "qq"        "std.error"
## [7] "p"         "lb"        "ub"

Now the ggplot code is pretty basic. I'm using geom_errorbarh(height=0) rather than geom_pointrange()+coord_flip() because ggplot2 can't use coord_flip with facet_wrap(...,scales="free") ...

## Q-Q plot:
g0 <- ggplot(aa,aes(estimate,qq,xmin=lb,xmax=ub))+
    geom_errorbarh(height=0)+
    geom_point()+facet_wrap(~variable,scale="free_x")

## regular caterpillar plot:
g1 <- ggplot(aa,aes(estimate,level,xmin=lb,xmax=ub))+
    geom_errorbarh(height=0)+
    geom_vline(xintercept=0,lty=2)+
    geom_point()+facet_wrap(~variable,scale="free_x")

Now find the levels you want to keep:

aa2 <- ddply(aa,c("grp","level"),
             transform,
             keep=any(p<0.05))
aa3 <- subset(aa2,keep)

Update caterpillar plot with only levels with "significant" slopes or intercepts:

g1 %+% aa3

If you only wanted to highlight "significant" levels rather than removing "non-significant" levels entirely

ggplot(aa2,aes(estimate,level,xmin=lb,xmax=ub,colour=factor(keep)))+
    geom_errorbarh(height=0)+
    geom_vline(xintercept=0,lty=2)+
    geom_point()+facet_wrap(~variable,scale="free_x")+
    scale_colour_manual(values=c("black","red"),guide=FALSE)

##' @importFrom reshape2 melt
##' @importFrom plyr ldply name_rows 
augment.ranef.mer <- function(x,
                                 ci.level=0.9,
                                 reorder=TRUE,
                                 order.var=1) {
    tmpf <- function(z) {
        if (is.character(order.var) && !order.var %in% names(z)) {
            order.var <- 1
            warning("order.var not found, resetting to 1")
        }
        ## would use plyr::name_rows, but want levels first
        zz <- data.frame(level=rownames(z),z,check.names=FALSE)
        if (reorder) {
            ## if numeric order var, add 1 to account for level column
            ov <- if (is.numeric(order.var)) order.var+1 else order.var
            zz$level <- reorder(zz$level, zz[,order.var+1], FUN=identity)
        }
        ## Q-Q values, for each column separately
        qq <- c(apply(z,2,function(y) {
                  qnorm(ppoints(nrow(z)))[order(order(y))]
              }))
        rownames(zz) <- NULL
        pv   <- attr(z, "postVar")
        cols <- 1:(dim(pv)[1])
        se   <- unlist(lapply(cols, function(i) sqrt(pv[i, i, ])))
        ## n.b.: depends on explicit column-major ordering of se/melt
        zzz <- cbind(melt(zz,id.vars="level",value.name="estimate"),
                     qq=qq,std.error=se)
        ## reorder columns:
        subset(zzz,select=c(variable, level, estimate, qq, std.error))
    }
    dd <- ldply(x,tmpf,.id="grp")
    ci.val <- -qnorm((1-ci.level)/2)
    transform(dd,
              p=2*pnorm(-abs(estimate/std.error)), ## 2-tailed p-val
              lb=estimate-ci.val*std.error,
              ub=estimate+ci.val*std.error)
}

来源：https://stackoverflow.com/questions/34344599/a-caterpillar-plot-of-just-the-significant-random-effects-from-a-mixed-effects

标签

ggplot2

lme4