Question
I'm trying to build a Bayesian multivariate ordered logit model using PyMC3. I have gotten a toy multivariate logit model working based on the examples in this book. I've also gotten an ordered logistic regression model running based on the example at the bottom of this page.
However, I cannot get an ordered, multivariate logistic regression to run. I think the issue is the way the cutpoints are specified, specifically the shape parameter. But I'm not sure why it would be any different with multiple independent variables than with just one, since the number of response categories has not changed.
Here's my code:
Data prep for MWE:
import pymc3 as pm
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(return_X_y=False)
iris = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                    columns=iris['feature_names'] + ['target'])
iris = iris.rename(index=str, columns={'sepal length (cm)': 'sepal_length',
                                       'sepal width (cm)': 'sepal_width',
                                       'target': 'species'})
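A quick sanity check (added here for illustration, not in the original question) confirms the response has three ordered categories, so an ordered logit needs exactly two cutpoints regardless of how many predictors enter the model:

print(iris['species'].unique())        # [0., 1., 2.] -> three categories
print(iris['species'].value_counts())  # 50 observations of each species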
Here is a working multivariate (binary) logit:
df = iris.loc[iris['species'].isin([0, 1])]
y = pd.Categorical(df['species']).codes
x = df[['sepal_length', 'sepal_width']].values
with pm.Model() as model_1:
    alpha = pm.Normal('alpha', mu=0, sd=10)
    beta = pm.Normal('beta', mu=0, sd=2, shape=x.shape[1])
    mu = alpha + pm.math.dot(x, beta)
    theta = 1 / (1 + pm.math.exp(-mu))
    y_ = pm.Bernoulli('yl', p=theta, observed=y)
    trace_1 = pm.sample(5000)
Here is a working ordered logit (with one independent variable):
x = iris['sepal_length'].values
y = pd.Categorical(iris['species']).codes
with pm.Model() as model:
    cutpoints = pm.Normal("cutpoints", mu=[-2, 2], sd=10, shape=2,
                          transform=pm.distributions.transforms.ordered)
    y_ = pm.OrderedLogistic("y", cutpoints=cutpoints, eta=x, observed=y)
    tr = pm.sample(1000)
Here is my attempt at a multivariate ordered logit, which breaks:
x = iris[['sepal_length', 'sepal_width']].values
y = pd.Categorical(iris['species']).codes
with pm.Model() as model:
    cutpoints = pm.Normal("cutpoints", mu=[-2, 2], sd=10, shape=2,
                          transform=pm.distributions.transforms.ordered)
    y_ = pm.OrderedLogistic("y", cutpoints=cutpoints, eta=x, observed=y)
    tr = pm.sample(1000)
The error I get is: "ValueError: all the input array dimensions except for the concatenation axis must match exactly."
This suggests it's a data problem (x, y), but the data looks the same as it does for the multivariate logit, which works.
How can I fix the ordered multivariate logit so it will run?
Answer 1:
I've never done multivariate ordinal regression before, but it seems like one must approach the modeling problem in one of two ways:
- Partition in the predictor space, in which case you'd need cutlines/curves instead of points.
- Partition in a transformed space where you've projected predictor space to a scalar value and can use cutpoints again.
If you want to use pm.OrderedLogistic, it seems like you have to do the latter, since it doesn't appear to support a multivariate eta out of the box.
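To make that concrete, here is a minimal numpy sketch (illustrative values of my own, not PyMC3 internals) of how an ordered logit maps one scalar eta and K-1 cutpoints to K category probabilities; with a 2-D eta there is no single value to compare against the cutpoints:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ordered_logit_probs(eta, cutpoints):
    # cumulative probabilities P(y <= k) at each cutpoint
    cum = sigmoid(cutpoints - eta)
    # category probabilities are differences of adjacent cumulative probs
    return np.diff(np.concatenate(([0.0], cum, [1.0])))

# one observation projected to a scalar, two cutpoints -> three categories
print(ordered_logit_probs(eta=0.3, cutpoints=np.array([-1.0, 1.0])))
# approximately [0.214 0.454 0.332], which sums to 1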
Here's my stab at it, but again, I'm not sure this is a standard approach.
import numpy as np
import pymc3 as pm
import pandas as pd
import theano.tensor as tt
from sklearn.datasets import load_iris
# Load data
iris = load_iris(return_X_y=False)
iris = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                    columns=iris['feature_names'] + ['target'])
iris = iris.rename(index=str, columns={
    'sepal length (cm)': 'sepal_length',
    'sepal width (cm)': 'sepal_width',
    'target': 'species'})
# Prep input data
Y = pd.Categorical(iris['species']).codes
X = iris[['sepal_length', 'sepal_width']].values
# augment X for simpler regression expression
X_aug = tt.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
# Model with sampling
with pm.Model() as ordered_mvlogit:
    # regression coefficients
    beta = pm.Normal('beta', mu=0, sd=2, shape=X.shape[1] + 1)
    # transformed space (univariate real)
    eta = X_aug.dot(beta)
    # points for separating categories
    cutpoints = pm.Normal("cutpoints", mu=np.array([-1, 1]), sd=1, shape=2,
                          transform=pm.distributions.transforms.ordered)
    y_ = pm.OrderedLogistic("y", cutpoints=cutpoints, eta=eta, observed=Y)
    trace_mvordlogit = pm.sample(5000)
This seems to converge fine and yields decent intervals.
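Standard PyMC3 diagnostics (not shown in the original answer) can confirm this:

pm.traceplot(trace_mvordlogit)                # visual check of mixing
print(pm.summary(trace_mvordlogit).round(2))  # posterior means, sds, HPD intervals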
If you then work the beta and cutpoint mean values back to predictor space, you get the following partitioning, which appears reasonable. However, sepal length and width aren't really the best for partitioning.
import matplotlib.pyplot as plt

# Extract mean parameter values
b0, b1, b2 = trace_mvordlogit.get_values(varname='beta').mean(axis=0)
cut1, cut2 = trace_mvordlogit.get_values(varname='cutpoints').mean(axis=0)

# plotting parameters
x_min, y_min = X.min(axis=0)
x_max, y_max = X.max(axis=0)
buffer = 0.2
num_points = 37

# compute eta (the projected scalar) over a grid in predictor space
x = np.linspace(x_min - buffer, x_max + buffer, num_points)
y = np.linspace(y_min - buffer, y_max + buffer, num_points)
X_plt, Y_plt = np.meshgrid(x, y)
Z_plt = b0 + b1*X_plt + b2*Y_plt

# contour + scatter plots; the outer levels just need to bracket eta's range
plt.figure(figsize=(8, 8))
plt.contourf(X_plt, Y_plt, Z_plt, levels=[-80, cut1, cut2, 50])
plt.scatter(iris.sepal_length, iris.sepal_width, c=iris.species)
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.show()
Second Order Terms
You could easily extend eta in the model to include interactions and higher-order terms, so that the final classifier cuts can be curves instead of simple lines. For example, here is the second-order model.
from sklearn.preprocessing import scale

Y = pd.Categorical(iris['species']).codes

# scale X for better sampling
X = scale(iris[['sepal_length', 'sepal_width']].values)

# augment with intercept and second-order terms
X_aug = tt.concatenate((
    np.ones((X.shape[0], 1)),
    X,
    (X[:, 0]*X[:, 0]).reshape((-1, 1)),
    (X[:, 1]*X[:, 1]).reshape((-1, 1)),
    (X[:, 0]*X[:, 1]).reshape((-1, 1))), axis=1)
with pm.Model() as ordered_mvlogit_second:
    beta = pm.Normal('beta', mu=0, sd=2, shape=6)
    eta = X_aug.dot(beta)
    cutpoints = pm.Normal("cutpoints", mu=np.array([-1, 1]), sd=1, shape=2,
                          transform=pm.distributions.transforms.ordered)
    y_ = pm.OrderedLogistic("y", cutpoints=cutpoints, eta=eta, observed=Y)
    trace_mvordlogit_second = pm.sample(tune=1000, draws=5000, chains=4, cores=4)
This samples nicely, and all the coefficients have non-zero HPDs.
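For instance, the posterior summary (a standard PyMC3 call, not shown in the original answer) reports an HPD interval for each coefficient:

print(pm.summary(trace_mvordlogit_second).round(2))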
And, as above, you can generate a plot of the classification regions; a sketch follows.
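Here is one way to do it (my addition, following the same grid approach as the first-order plot but evaluating the full second-order eta on the scaled features):

b = trace_mvordlogit_second.get_values(varname='beta').mean(axis=0)
cut1, cut2 = trace_mvordlogit_second.get_values(varname='cutpoints').mean(axis=0)

# grid over the *scaled* predictor space, since the model was fit on scaled X
x = np.linspace(X[:, 0].min() - 0.2, X[:, 0].max() + 0.2, 37)
y = np.linspace(X[:, 1].min() - 0.2, X[:, 1].max() + 0.2, 37)
X_plt, Y_plt = np.meshgrid(x, y)

# second-order eta: intercept, linear, quadratic, and interaction terms,
# in the same column order as X_aug above
Z_plt = (b[0] + b[1]*X_plt + b[2]*Y_plt
         + b[3]*X_plt**2 + b[4]*Y_plt**2 + b[5]*X_plt*Y_plt)

plt.figure(figsize=(8, 8))
plt.contourf(X_plt, Y_plt, Z_plt, levels=[Z_plt.min(), cut1, cut2, Z_plt.max()])
plt.scatter(X[:, 0], X[:, 1], c=Y)
plt.xlabel("Sepal Length (scaled)")
plt.ylabel("Sepal Width (scaled)")
plt.show()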
Source: https://stackoverflow.com/questions/53198259/running-a-multivariate-ordered-logit-in-pymc3