Question
I'm playing around with Hidden Markov Models for a stock market prediction problem. My data matrix contains various features for a particular security:
01-01-2001, .025, .012, .01
01-02-2001, -.005, -.023, .02
I fit a simple GaussianHMM:
from hmmlearn import GaussianHMM
mdl = GaussianHMM(n_components=3, covariance_type='diag', n_iter=1000)
mdl.fit(train[:,1:])
With the model (λ), I can decode an observation vector to find the most likely hidden state sequence for that observation vector:
print(mdl.decode(test[0:5, 1:]))
(72.75, array([2, 1, 2, 0, 0]))
Above, I've decoded the hidden state sequence of an observation vector O_t = (O_1, O_2, ..., O_d), which contains the first five instances in a test set. I'd like to estimate the hidden state of the sixth instance in the test set. The idea is to iterate over a discrete set of possible feature values for the sixth instance, and select the observation sequence O_t+1 with the highest likelihood, argmax P(O_1, O_2, ..., O_d+1 | λ). Once we observe the true feature values of O_d+1, we can shift the sequence (of length 5) by one and do it all over again:
from numpy import arange, array, vstack

l = 5
for i in range(len(test) - l):
    values = []
    for a in arange(-0.05, 0.05, .01):
        for b in arange(-0.05, 0.05, .01):
            for c in arange(-0.05, 0.05, .01):
                values.append(mdl.decode(vstack((test[i:i+l, 1:], array([a, b, c])))))
    print(max(enumerate(values), key=lambda x: x[1]))
The problem is that when I decode the observation vector O_t+1, the prediction with the highest likelihood is almost always the same (e.g. the estimate with the highest likelihood always has feature values for O_d+1 equal to [ 0.04 0.04 0.04] and hidden state [0]):
(555, (74.71248518927949, array([2, 1, 2, 0, 0, 0]))) [ 0.04 0.04 0.04]
(555, (69.41963358191555, array([2, 2, 0, 0, 0, 0]))) [ 0.04 0.04 0.04]
(555, (77.11516871816922, array([2, 0, 0, 0, 0, 0]))) [ 0.04 0.04 0.04]
It's entirely possible that I'm misunderstanding the purpose of mdl.decode, and thus using it incorrectly. If that's the case, how best can I go about iterating over possible values of O_d+1, and then maximizing P(O_1, O_2, ..., O_d+1 | λ)?
Answer 1:
Are your actual values bounded in [-0.05, 0.05)?
When I initially built a sample dataset to look at your problem, I drew random floats in [0, 1]. When I did that, I also got the same results you're observing: always the maximum value of (a, b, c) for the sixth entry, for every sequence, and always the same predicted class. But given that the distribution of my data (evenly distributed between 0 and 1) had a greater central tendency than the grid-search values for the sixth entry (between -.05 and .05), it made sense that the HMM always picked the highest-valued three-tuple (.04, .04, .04), as that was closest to the main density of the distributions it had been trained on.
When I rebuilt the dataset with draws from a uniform distribution over the same range of possible values as we allow for the sixth entry ([-0.05, 0.05)), the output was much more varied: both the O_t+1 selection and the class prediction showed reasonable differentiation across sequences. From your example data, it does seem like you have at least both positive and negative values, but you might try plotting the distribution of each feature to see whether the full range of candidate sixth-entry values is plausible, as in the sketch below.
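A minimal sketch of that check, assuming matplotlib is installed and using the df DataFrame constructed below (substitute your own feature matrix):
import matplotlib.pyplot as plt

# Histogram each feature and overlay the grid-search bounds, so a mismatch
# between the data's density and the candidate range stands out visually.
axes = df.hist(bins=30, figsize=(9, 3), layout=(1, 3))
for ax in axes.ravel():
    ax.axvline(-0.05, color="red", linestyle="--")  # lower grid bound
    ax.axvline(0.05, color="red", linestyle="--")   # upper grid bound
plt.tight_layout()
plt.show()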
Here's some sample data and evaluation code. It will print out a message each time there's a new optimal (a, b, c) sequence, or when the prediction for the 6th observation changes (just to show that they're not all the same). The highest likelihood for each 6-element sequence, along with the predictions and the optimal 6th data point, is stored in best_per_span.
First, construct a sample dataset:
import numpy as np
import pandas as pd
dates = pd.date_range(start="01-01-2001", end="31-12-2001", freq='D')
n_obs = len(dates)
n_feat = 3
values = np.random.uniform(-.05, .05, size=n_obs*n_feat).reshape((n_obs,n_feat))
df = pd.DataFrame(values, index=dates)
df.head()
                   0         1         2
2001-01-01  0.020891 -0.048750 -0.027131
2001-01-02  0.013571 -0.011283  0.041322
2001-01-03 -0.008102  0.034088 -0.029202
2001-01-04 -0.019666 -0.005705 -0.003531
2001-01-05 -0.000238 -0.039251  0.029307
Now split into training and test sets:
train_pct = 0.7
train_size = round(train_pct*n_obs)
train_ix = np.random.choice(range(n_obs), size=train_size, replace=False)
train_dates = df.index[train_ix]
train = df.loc[train_dates]
test = df.loc[~df.index.isin(train_dates)]
train.shape # (255, 3)
test.shape # (110, 3)
Fit a 3-state HMM on the training data:
# hmmlearn throws a lot of deprecation warnings; we'll suppress them.
import warnings
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    # in the most recent hmmlearn we can't import GaussianHMM directly anymore
    from hmmlearn import hmm
    mdl = hmm.GaussianHMM(n_components=3, covariance_type='diag', n_iter=1000)
    mdl.fit(train)
Now grid search for the optimal 6th (t+1) observation:
# length of O_t
span = 5
# final store of optimal configurations per O_t+1 sequence
best_per_span = []
# update these to demonstrate heterogeneous outcomes
current_abc = None
current_pred = None

for start in range(len(test) - span):
    flag = False
    end = start + span
    first_five = test.iloc[start:end].values
    output = []
    for a in np.arange(-0.05, 0.05, .01):
        for b in np.arange(-0.05, 0.05, .01):
            for c in np.arange(-0.05, 0.05, .01):
                sixth = np.array([a, b, c])[:, np.newaxis].T
                all_six = np.append(first_five, sixth, axis=0)
                output.append((mdl.decode(all_six), (a, b, c)))
    best = max(output, key=lambda x: x[0][0])
    best_dict = {"start": start,
                 "end": end,
                 "sixth": best[1],
                 "preds": best[0][1],
                 "lik": best[0][0]}
    best_per_span.append(best_dict)

    # below here is all reporting
    if best_dict["sixth"] != current_abc:
        current_abc = best_dict["sixth"]
        flag = True
        print("New abc for range {}:{} = {}".format(start, end, current_abc))
    if best_dict["preds"][-1] != current_pred:
        current_pred = best_dict["preds"][-1]
        flag = True
        print("New pred for 6th position: {}".format(current_pred))
    if flag:
        print("Test sequence StartIx: {}, EndIx: {}".format(start, end))
        print("Best 6th value: {}".format(best_dict["sixth"]))
        print("Predicted hidden state sequence: {}".format(best_dict["preds"]))
        print("Likelihood: {}\n".format(best_dict["lik"]))
Example reporting output while the loop is running:
New abc for range 3:8 = [-0.01, 0.01, 0.0]
New pred for 6th position: 1
Test sequence StartIx: 3, EndIx: 8
Best 6th value: [-0.01, 0.01, 0.0]
Predicted hidden state sequence: [0 2 2 1 0 1]
Likelihood: 35.30144407374163
New abc for range 18:23 = [-0.01, -0.01, -0.01]
New pred for 6th position: 2
Test sequence StartIx: 18, EndIx: 23
Best 6th value: [-0.01, -0.01, -0.01]
Predicted hidden state sequence: [0 0 0 1 2 2]
Likelihood: 34.31813078939214
Example output from best_per_span:
[{'end': 5,
  'lik': 33.791537281734904,
  'preds': array([0, 2, 0, 1, 2, 2]),
  'sixth': [-0.01, -0.01, -0.01],
  'start': 0},
 {'end': 6,
  'lik': 33.28967307589143,
  'preds': array([0, 0, 1, 2, 2, 2]),
  'sixth': [-0.01, -0.01, -0.01],
  'start': 1},
 {'end': 7,
  'lik': 34.446813870838156,
  'preds': array([0, 1, 2, 2, 2, 2]),
  'sixth': [-0.01, -0.01, -0.01],
  'start': 2}]
Aside from the reporting element, this isn't a significant change to your initial approach, but it does seem to work as expected, without maxing out each time.
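If it helps for inspection, the records in best_per_span drop straight into a DataFrame (a minimal sketch; pred_6th is just an illustrative name for the decoded state at the t+1 position):
import pandas as pd

# Tabulate the per-window results collected by the grid-search loop above.
results = pd.DataFrame(best_per_span)
# The last element of each Viterbi path is the predicted state for O_t+1.
results["pred_6th"] = results["preds"].apply(lambda p: p[-1])
print(results[["start", "end", "sixth", "pred_6th", "lik"]].head())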
Source: https://stackoverflow.com/questions/45538826/decoding-sequences-in-a-gaussianhmm