I am trying to generate a string kernel that feeds a support vector classifier. I tried it with a function that calculates the kernel, something like this (where string_similarity stands in for whatever similarity measure I use between two strings):

def string_kernel(X, Y):
    R = np.zeros((len(X), len(Y)))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            R[i, j] = string_similarity(x, y)
    return R

but when I pass the raw strings as samples, clf.fit fails because scikit-learn tries to convert the strings to floats.
This is a limitation in scikit-learn that has proved hard to get rid of. You can try this workaround: represent the strings as feature vectors with only one feature, which is really just an index into the table of strings.
>>> import numpy as np
>>> data = ["foo", "bar", "baz"]
>>> X = np.arange(len(data)).reshape(-1, 1)
>>> X
array([[0],
       [1],
       [2]])
Redefine the string kernel function to work on this representation:
>>> def string_kernel(X, Y):
...     # X and Y hold indices into data, not the strings themselves
...     R = np.zeros((len(X), len(Y)))
...     for i, x in enumerate(X):
...         for j, y in enumerate(Y):
...             # look up the actual strings by their index
...             a = data[int(x[0])]
...             b = data[int(y[0])]
...             # simplest kernel ever: do the first characters match?
...             R[i, j] = a[0] == b[0]
...     return R
...
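As a quick sanity check (illustrative output from a typical NumPy session), evaluating the kernel on the training set gives the Gram matrix: "foo" matches only itself on the first character, while "bar" and "baz" match each other.
>>> string_kernel(X, X)
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  1.],
       [ 0.,  1.,  1.]])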
>>> from sklearn.svm import SVC
>>> clf = SVC(kernel=string_kernel)
>>> clf.fit(X, ['no', 'yes', 'yes'])
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel=<function string_kernel at 0x7f5988f0bde8>, max_iter=-1,
  probability=False, random_state=None, shrinking=True, tol=0.001,
  verbose=False)
The downside to this is that to classify new samples, you have to add them to data, then construct new pseudo-feature vectors for them.
>>> data.extend(["bla", "fool"])
>>> clf.predict([[3], [4]])
array(['yes', 'no'],
      dtype='|S3')
(You can get around this by doing more interpretation of your pseudo-features, e.g., looking into a different table for i >= len(X_train). But it's still cumbersome.)
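For concreteness, here's one way that variant might look. This is just a sketch, assuming data still holds only the three training strings; test_data, lookup and string_kernel2 are names I'm making up here:

test_data = ["bla", "fool"]  # separate table for unseen samples

def lookup(i):
    # indices below len(data) are training strings; the rest live in test_data
    return data[i] if i < len(data) else test_data[i - len(data)]

def string_kernel2(X, Y):
    # same first-character kernel, but it can see strings that were
    # never appended to the training table
    R = np.zeros((len(X), len(Y)))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            R[i, j] = lookup(int(x[0]))[0] == lookup(int(y[0]))[0]
    return R

The k-th test string then gets the pseudo-feature vector [len(data) + k].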
This is an ugly hack, but it works (it's slightly less ugly for clustering, because there the dataset doesn't change after fit). Speaking on behalf of the scikit-learn developers, I'd say a patch to fix this properly would be welcome.
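To illustrate the clustering remark: since every sample is known up front, you can just precompute the Gram matrix once and never touch the string table again. A sketch using scikit-learn's SpectralClustering (affinity='precomputed' is a real option; the rest reuses the names from above):
>>> from sklearn.cluster import SpectralClustering
>>> K = string_kernel(X, X)  # Gram matrix over the full, fixed dataset
>>> labels = SpectralClustering(n_clusters=2, affinity='precomputed').fit_predict(K)
Given the block structure of K above, "bar" and "baz" should end up in the same cluster, with "foo" on its own.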
I think the Shogun library could be the solution; it is also free and open source. I suggest reviewing the string kernels in this directory: https://github.com/shogun-toolbox/shogun/tree/develop/src/shogun/kernel/string