I've recently started learning pymc3 after using emcee exclusively for ages, and I'm running into some conceptual problems. I'm practising with Chapter 7 of Hogg's Fitting a model to data, which involves an MCMC fit to a straight line with arbitrary 2D uncertainties. I've accomplished this quite easily in emcee, but pymc3 is giving me some problems.
It essentially boils down to using a multivariate Gaussian likelihood. Here is what I have so far:
from pymc3 import *
import numpy as np
import matplotlib.pyplot as plt

size = 200
true_intercept = 1
true_slope = 2

true_x = np.linspace(0, 1, size)
# y = a + b*x
true_regression_line = true_intercept + true_slope * true_x

# add noise
# here the errors are all the same, but in the real world they usually are not!
std_y, std_x = 0.1, 0.1
y = true_regression_line + np.random.normal(scale=std_y, size=size)
x = true_x + np.random.normal(scale=std_x, size=size)

y_err = np.ones_like(y) * std_y
x_err = np.ones_like(x) * std_x

data = dict(x=x, y=y)

with Model() as model:  # model specifications in PyMC3 are wrapped in a with-statement
    # Define priors
    intercept = Normal('Intercept', 0, sd=20)
    gradient = Normal('gradient', 0, sd=20)

    # Define likelihood
    likelihood = MvNormal('y', mu=intercept + gradient * x,
                          tau=1. / (np.stack((y_err, x_err)) ** 2.),
                          observed=y)

    # start the mcmc!
    start = find_MAP()          # Find starting value by optimization
    step = NUTS(scaling=start)  # Instantiate MCMC sampling algorithm
    trace = sample(2000, step, start=start, progressbar=False)  # draw 2000 posterior samples using NUTS
This raises the error: LinAlgError: Last 2 dimensions of the array must be square
So I'm trying to pass MvNormal the measured values of x and y (as the mus) and their associated measurement uncertainties (y_err and x_err). But it appears that it doesn't like the 2D tau argument.
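For what it's worth, a quick shape check outside the model (just my own sketch of the arrays involved, not part of the model above) suggests the problem is that the stacked errors are not a square matrix:

import numpy as np

size = 200
y_err = np.ones(size) * 0.1
x_err = np.ones(size) * 0.1

# This is what I'm passing as tau: shape (2, 200), i.e. not square,
# which is presumably why the linear algebra inside MvNormal complains.
tau = 1. / (np.stack((y_err, x_err)) ** 2.)
print(tau.shape)  # (2, 200)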
Any ideas? This must be possible.
Thanks
You may try adapting the following model. It is a "regular" linear regression, but x and y have been replaced by Gaussian distributions. Here I am assuming we know not only the measured values of the input and output variables but also a reliable estimate of their errors (for example, as provided by a measurement device). If you do not trust those error values, you may instead try to estimate them from the data.
import pymc3 as pm

with pm.Model() as model:
    # Priors for the straight line and the intrinsic scatter
    intercept = pm.Normal('intercept', 0, sd=20)
    gradient = pm.Normal('gradient', 0, sd=20)
    epsilon = pm.HalfCauchy('epsilon', 5)

    # Latent "true" x and y values, centred on the measurements with the known errors
    obs_x = pm.Normal('obs_x', mu=x, sd=x_err, shape=len(x))
    obs_y = pm.Normal('obs_y', mu=y, sd=y_err, shape=len(y))

    likelihood = pm.Normal('y', mu=intercept + gradient * obs_x,
                           sd=epsilon, observed=obs_y)

    trace = pm.sample(2000)
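Once sampling has finished you can inspect the recovered slope and intercept in the usual way; a minimal sketch (the keyword is varnames in older PyMC3 releases and var_names in newer ones):

import pymc3 as pm
import matplotlib.pyplot as plt

# Posterior summaries and trace plots for the line parameters
pm.summary(trace, varnames=['intercept', 'gradient'])
pm.traceplot(trace, varnames=['intercept', 'gradient'])
plt.show()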
If you are estimating the errors from the data, it could be reasonable to assume they are correlated, and hence, instead of using two separate Gaussians, you can use a multivariate Gaussian. In that case you will end up with a model like the following:
import pandas as pd

df_data = pd.DataFrame(data)
cov = df_data.cov()

with pm.Model() as model:
    intercept = pm.Normal('intercept', 0, sd=20)
    gradient = pm.Normal('gradient', 0, sd=20)
    epsilon = pm.HalfCauchy('epsilon', 5)

    # Latent (x, y) pairs drawn from a multivariate Gaussian with the empirical covariance
    obs_xy = pm.MvNormal('obs_xy', mu=df_data, tau=pm.matrix_inverse(cov),
                         shape=df_data.shape)

    yl = pm.Normal('yl', mu=intercept + gradient * obs_xy[:, 0],
                   sd=epsilon, observed=obs_xy[:, 1])

    # Initialize with ADVI and then sample with NUTS
    mu, sds, elbo = pm.variational.advi(n=20000)
    step = pm.NUTS(scaling=model.dict_to_array(sds), is_cov=True)
    trace = pm.sample(1000, step=step, start=mu)
Notice that in the previous model the covariance matrix was computed from the data. If you are going to do that, then I think it is better to go with the first model; but if instead you are going to estimate the covariance matrix, then the second model could be a sensible approach.
For the second model I use ADVI to initialize it. ADVI can be a good way to initialize models; it often works much better than find_MAP().
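If you are on a more recent PyMC3 release, pm.sample can also run that ADVI initialization for you, so the second model can be sampled with something like the following sketch (the exact init options depend on your version):

with model:
    # Ask sample() to run ADVI internally to get the starting point and scaling,
    # instead of calling pm.variational.advi and building the NUTS step by hand.
    trace = pm.sample(1000, init='advi', n_init=20000)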
You may also want to check this repository by David Hogg, and the book Statistical Rethinking, where McElreath discusses the problem of doing linear regression including errors in the input and output variables.