Convergence issues on hierarchical probit model with NUTS and Metropolis


Question


I am trying to extend the hierarchical model from Gelman and Hill reproduced in PyMC3 here to binary outcome data by adding a probit transformation and making the outcome follow a Bernoulli distribution. Right now I'm using toy data, so I know the true values. Alpha should be .1, beta should be .5.
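
In other words, the model I'm aiming for is roughly p_i = Phi(a[grp_i] + b[grp_i] * x_i) with y_i ~ Bernoulli(p_i), where Phi is the standard normal CDF and grp_i is the group of observation i.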

The model runs fine with a NUTS sampler before the extension. Once I add it, the estimates slowly and steadily increase until the sampler stalls, anywhere between 10 and 200 iterations in. Here's an image from a run that made it all the way to 120 iterations (a relatively long run).

Before the extension, Metropolis needs a good 200,000 iterations to get a good fix on the true parameter values, but it gets there eventually. After the extension it stalls somewhere between 30k and 50k iterations. Unlike NUTS, it crashes completely when you try to stop it post-stall, so I don't have a picture. Stopping it earlier gives an estimate of beta centered roughly at zero, but with a wide spread.

The code is pasted below.

I'm not sure if it's a sampling problem or a specification problem. Is there a better way to specify a probit? Any tips on other samplers to try? I've stripped my model down as far as I can for this test and narrowed the failure down to the point where I add the probit extension, but I'm at a loss for what to do next.

import numpy as np
import pandas as pd
import pymc3 as pm
import theano.tensor as tt

#Generate Data
n=100
#Determine how many observations per group
evts=np.random.randint(10,100,n)
#Determine which groups will receive treatment
x_g=np.random.binomial(1,.3,n)
#pre-create distribution of betas for groups 
mu = np.random.normal(.5,.2,n)
#preallocate space in a dataframe
i = np.zeros(evts.sum())
y_obs = pd.DataFrame({'y':i.copy(),
                      'x':i.copy(),
                      'grp':i.copy()},
                     index = range(evts.sum()))
#populate dataframe with simulated data
i=0
for grp in range(100):
    #index of observations for a given group
    ind = list(range(i,(i+evts[grp])))
    i += evts[grp]
    #generate outcomes using
    #different dgp depending on treatment
    if x_g[grp] ==1:
        #shortcut to make sure 1>p>0
        p_i = max((.1 + mu[grp]),0.01)
        p_i = min(p_i,1)
        out = np.random.binomial(1,p_i,evts[grp])
    else:
        out = np.random.binomial(1,.1,evts[grp])
    #Assign to dataframe
    y_obs.loc[ind,'y'] = out
    y_obs.loc[ind,'x'] = x_g[grp]
    y_obs.loc[ind,'grp'] = grp
y_obs = y_obs.astype(int)
print('starting model')
with pm.Model() as test_model:
    #hyperpriors
    mu_a=pm.Normal('mu_a',mu=0, sd=100**2)
    sig_a = pm.Uniform('sig_a',lower=0,upper=100)
    mu_b=pm.Normal('mu_b',mu=0, sd=100**2)
    sig_b = pm.Uniform('sig_b',lower=0,upper=100)
    #priors
    a = pm.Normal('a',mu=mu_a,sd = sig_a, shape=n)
    b = pm.Normal('b',mu=mu_b,sd = sig_b, shape=n)

    eps = pm.Uniform('eps',lower=0,upper=100)

    est = a[y_obs.grp] + b[y_obs.grp] * y_obs.x
    #I get correct estimates when I 
    #stop here using commented out line. 
#     y_hat = pm.Normal('y_hat',
#                       mu=est,
#                       sd=eps, 
#                       observed = y_obs.y)

    #Probit transformation:
    y_hat = pm.Normal('y_hat',
                      mu=est,
                      sd=eps, 
                      shape=y_obs.shape[0])

    mu_y = tt.mean(y_hat)
    eps_hat = tt.var(y_hat)
    p_hat = 0.5 * (1 + tt.erf((y_hat-mu_y) / (eps_hat*tt.sqrt(2))))

    y = pm.Bernoulli('y',p=p_hat, observed = y_obs.y)


with test_model:
    #Either:
    mu,sds,elbo = pm.variational.advi(n=100000)
    step = pm.NUTS(scaling=test_model.dict_to_array(sds),
                   is_cov=True)
    test_trace = pm.sample(200, step, start=mu)
    #or
#     step=pm.Metropolis()
#     test_trace = pm.sample(50000)

pm.traceplot(test_trace)#[-5000::3])

NOTE: edited to fix a typo in the line `step = pm.NUTS(scaling=test_model.dict_to_array(sds),`

EDIT: I made better simulation data for the probit extension than the data originally posted (the original data generation did not match the model being fit). Now ADVI gives better estimates, so it's starting in around the right place, but NUTS still stalls very quickly (at around ten iterations). Metropolis fails outright: I did a first round of 5,000 iterations and got an error when trying to plot the trace.

New data generation:

n=100
evts=np.random.randint(10,100,n)
x_g=np.random.binomial(1,.3,n)
i = np.zeros(evts.sum())
mu = np.random.normal(.5,.2,n)
mu0 = np.random.normal(.1,.05,n)
y_obs = pd.DataFrame({'y':i.copy(),'x':i.copy(),'grp':i.copy()},index = range(evts.sum()))
i=0
for grp in range(100):
    ind = list(range(i,(i+evts[grp])))
    i += evts[grp]
    if x_g[grp] ==1:
        est = mu0[grp] + mu[grp]
    else:
        est=mu0[grp]
    p_hat = tt.nnet.sigmoid(est).eval()
    y_obs.loc[ind,'y_hat'] = est
    y_obs.loc[ind,'y'] = np.random.binomial(1,p_hat,len(ind))
    y_obs.loc[ind,'x'] = x_g[grp]
    y_obs.loc[ind,'grp'] = grp
y_obs['grp']=y_obs.grp.astype(np.int64)

Error from Metropolis when pymc3 tries to plot a density:

ValueError: v cannot be empty


Answer 1:


Maybe I misunderstand what you are trying to do, but shouldn't this model work:

with pm.Model() as test_model:
    #hyperpriors
    mu_a = pm.Flat('mu_a')
    sig_a = pm.HalfCauchy('sig_a', beta=2.5)

    mu_b = pm.Flat('mu_b')
    sig_b = pm.HalfCauchy('sig_b', beta=2.5)

    #priors
    a_raw = pm.Normal('a_raw', mu=0, sd=1, shape=n)
    a = pm.Deterministic('a', mu_a + sig_a * a_raw)

    b_raw = pm.Normal('b_raw', mu=0, sd=1, shape=n)
    b = pm.Deterministic('b', mu_b + sig_b * b_raw)

    est = a[y_obs.grp.values] + b[y_obs.grp.values] * y_obs.x

    y = pm.Bernoulli('y', p=tt.nnet.sigmoid(est), observed = y_obs.y)

This is a logit, not a probit model. If you want a probit for some reason, you can replace tt.nnet.sigmoid with the standard normal CDF.
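
For example, a minimal sketch of that swap (the probit helper name is illustrative; est and y_obs are the variables from the model above):

import theano.tensor as tt

def probit(x):
    #standard normal CDF, written with the error function
    return 0.5 * (1.0 + tt.erf(x / tt.sqrt(2.0)))

#then swap the likelihood line for:
#y = pm.Bernoulli('y', p=probit(est), observed=y_obs.y)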

This still has a somewhat hard time with your dataset, but I think that is because of a mistake in the data generation: you assume a constant a of 0.1 for all groups, but in the model you allow the a values to differ by group. The sampler has trouble with very small values of sig_a (the true value is 0, after all...).

Edit: Some more explanation: the change to standard-normal a_raw and b_raw, which are then transformed into Normal(mu=mu_a, sd=sig_a) variables using pm.Deterministic, does not change the posterior, but it makes things easier for the sampler. This is called a non-centered parametrization. For a more in-depth treatment of the topic see e.g. http://mc-stan.org/documentation/case-studies/divergences_and_bias.html, which should also help you understand why very small variances can be problematic.
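
To make the contrast concrete, here is a minimal sketch of the two parametrizations of the group intercepts side by side (illustrative only; the hyperpriors are simplified and n comes from the data generation above):

import pymc3 as pm

#centered: a is drawn directly from Normal(mu_a, sig_a);
#NUTS struggles when sig_a becomes very small (the classic funnel)
with pm.Model() as centered:
    mu_a = pm.Normal('mu_a', mu=0, sd=10)
    sig_a = pm.HalfCauchy('sig_a', beta=2.5)
    a = pm.Normal('a', mu=mu_a, sd=sig_a, shape=n)

#non-centered: draw standard normals, then shift and scale;
#same posterior over a, but a much friendlier geometry for the sampler
with pm.Model() as non_centered:
    mu_a = pm.Normal('mu_a', mu=0, sd=10)
    sig_a = pm.HalfCauchy('sig_a', beta=2.5)
    a_raw = pm.Normal('a_raw', mu=0, sd=1, shape=n)
    a = pm.Deterministic('a', mu_a + sig_a * a_raw)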

Edit: New data generation

n = 100
evts=np.random.randint(10,100,n)
x_g=np.random.binomial(1,.3,n)
i = np.zeros(evts.sum())
mu = np.random.normal(.5,.2,n)
mu0 = np.random.normal(.1,.05,n)
y_obs = pd.DataFrame({'y':i.copy(),'x':i.copy(),'grp':i.copy()},
                     index = range(evts.sum()))
i = 0
for grp in range(100):
    ind = list(range(i,(i+evts[grp])))
    i += evts[grp]
    if x_g[grp] ==1:
        est = mu0[grp] + mu[grp]
    else:
        est=mu0[grp]
    p_hat = tt.nnet.sigmoid(est).eval()
    y_obs.loc[ind,'y_hat'] = est
    y_obs.loc[ind,'y'] = np.random.binomial(1,p_hat,len(ind))
    y_obs.loc[ind,'x'] = x_g[grp]
    y_obs.loc[ind,'grp'] = grp
y_obs['grp']=y_obs.grp.astype(np.int64)

Sampling using

with test_model:
    trace = pm.sample(2000, tune=1000, njobs=4)

This finishes in about three minutes:

Auto-assigning NUTS sampler...
Initializing NUTS using advi...
  8%|▊         | 15977/200000 [00:15<02:51, 1070.66it/s]Median ELBO converged.
Finished [100%]: Average ELBO = -4,458.8

100%|██████████| 2000/2000 [02:48<00:00,  9.99it/s]

No divergent transitions:

test_trace[1000:].diverging.sum()

All using pymc3 and theano master. (Both are about ready for a new release.)



Source: https://stackoverflow.com/questions/42779123/convergence-issues-on-hierarchical-probit-model-with-nuts-and-metropolis
