How to one-hot-encode sentences at the character level?

-上瘾入骨i 2021-01-15 15:16

I would like to convert a sentence into an array of one-hot vectors, where each vector is the one-hot representation of a letter of the alphabet. It would look like the following:

5 Answers
  •  臣服心动
    2021-01-15 15:30

    This is a common task when working with recurrent neural networks, and TensorFlow has a function specifically for this purpose, tf.one_hot, if you'd like to use it.

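    import numpy as np
    import tensorflow as tf
    
    # Map each character to its index; only 'a'..'o' are listed here, extend to 'z' as needed
    alphabets = {'a' : 0, 'b': 1, 'c':2, 'd':3, 'e':4, 'f':5, 'g':6, 'h':7, 'i':8, 'j':9, 'k':10, 'l':11, 'm':12, 'n':13, 'o':14}
    
    idxs = [alphabets[ch] for ch in 'hello']
    print(idxs)
    # [7, 4, 11, 11, 14]
    
    # @divakar's approach: read the ASCII bytes and subtract the code of 'a' (97).
    # np.fromstring is deprecated for binary data, so np.frombuffer is used here.
    idxs = np.frombuffer(b'hello', dtype=np.uint8) - 97
    
    # or, for a clearer understanding, use:
    idxs = np.frombuffer(b'hello', dtype=np.uint8) - ord('a')
    
    # depth 26 = one column per lowercase letter
    one_hot = tf.one_hot(idxs, 26, dtype=tf.uint8)
    # TensorFlow 1.x session API; under TensorFlow 2.x eager execution, one_hot.numpy() returns the array directly
    sess = tf.InteractiveSession()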
    
    In [15]: one_hot.eval()
    Out[15]: 
    array([[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint8)
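
If you just want to see what the encoding amounts to without TensorFlow, here is a minimal pure-Python sketch. The names one_hot_encode, ALPHABET and CHAR_TO_INDEX are made up for this illustration. It assumes a 26-letter lowercase alphabet and maps anything else (spaces, digits, punctuation) to an all-zero row, which is a choice of this sketch and may not be what you want for your data.

```python
import string

# Assumption: the alphabet is the 26 lowercase ASCII letters.
ALPHABET = string.ascii_lowercase
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot_encode(sentence):
    """Return one 26-element 0/1 row per character of `sentence`.

    Characters outside the alphabet become all-zero rows;
    that is a choice of this sketch, not a fixed rule.
    """
    rows = []
    for ch in sentence:
        row = [0] * len(ALPHABET)
        idx = CHAR_TO_INDEX.get(ch.lower())  # None if ch is not a letter a-z
        if idx is not None:
            row[idx] = 1
        rows.append(row)
    return rows


encoded = one_hot_encode("hello")
# Position of the single 1 in each row
print([row.index(1) for row in encoded])  # [7, 4, 11, 11, 14]
```

The result is a list with one row per character, so a whole sentence such as "hello world" gives 11 rows, with the row for the space being all zeros.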
    
