Does tensorflow have something similar to scikit learn\'s one hot encoder for processing categorical data? Would using a placeholder of tf.string behave as categorical data
Maybe it's due to changes to Tensorflow since Nov 2015, but @dga's answer produced errors. I did get it to work with the following modifications:
sparse_labels = tf.reshape(label_batch, [-1, 1])
derived_size = tf.shape(sparse_labels)[0]
indices = tf.reshape(tf.range(0, derived_size, 1), [-1, 1])
concated = tf.concat(1, [indices, sparse_labels])
outshape = tf.concat(0, [tf.reshape(derived_size, [1]), tf.reshape(num_labels, [1])])
labels = tf.sparse_to_dense(concated, outshape, 1.0, 0.0)
You can use tf.sparse_to_dense:
The sparse_indices argument indicates where the ones should go, output_shape should be set to the number of possible outputs (e.g. the number of labels), and sparse_values should be 1 with the desired type (it will determine the type of the output from the type of sparse_values).
There's embedding_ops in Scikit Flow and examples that deal with categorical variables, etc.
If you just begin to learn TensorFlow, I would suggest you trying out examples in TensorFlow/skflow first and then once you are more familiar with TensorFlow it would be fairly easy for you to insert TensorFlow code to build a custom model you want (there are also examples for this).
Hope those examples for images and text understanding could get you started and let us know if you encounter any issues! (post issues or tag skflow in SO).
In [7]: one_hot = tf.nn.embedding_lookup(np.eye(5), [1,2])
In [8]: one_hot.eval()
Out[8]:
array([[ 0., 1., 0., 0., 0.],
[ 0., 0., 1., 0., 0.]])
works on TF version 1.3.0. As of Sep 2017.
My version of @CFB and @dga example, shortened a bit to ease understanding.
num_labels = 10
labels_batch = [2, 3, 5, 9]
sparse_labels = tf.reshape(labels_batch, [-1, 1])
derived_size = len(labels_batch)
indices = tf.reshape(tf.range(0, derived_size, 1), [-1, 1])
concated = tf.concat(1, [indices, sparse_labels])
labels = tf.sparse_to_dense(concated, [derived_size, num_labels], 1.0, 0.0)
Take a look at tf.nn.embedding_lookup. It maps from categorical IDs to their embeddings.
For an example of how it's used for input data, see here.