How to one-hot-encode sentences at the character level?

-上瘾入骨i 2021-01-15 15:16

I would like to convert a sentence to an array of one-hot vectors, where each vector is the one-hot representation of a letter of the alphabet. It would look like the following:

5 Answers
  •  粉色の甜心
    2021-01-15 15:22

    Here's a vectorized approach using NumPy broadcasting that gives us an (N, 26)-shaped array, where N is the number of characters (lowercase a-z assumed) -

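    import numpy as np

    # np.fromstring's binary mode is deprecated; read the ASCII bytes with np.frombuffer instead
    ints = np.frombuffer("hello".encode("ascii"), dtype=np.uint8) - 97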
    out = (ints[:,None] == np.arange(26)).astype(int)
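
To make the broadcasting step concrete, here is a small sketch that names the intermediate arrays and their shapes. The variable names `column`, `letters` and `mask` are only illustrative, and the input is assumed to contain lowercase a-z only.

```python
import numpy as np

# Encode "hello" as offsets from 'a' (ASCII 97); assumes lowercase a-z only.
ints = np.frombuffer("hello".encode("ascii"), dtype=np.uint8) - 97

column = ints[:, None]      # shape (5, 1): one row per character
letters = np.arange(26)     # shape (26,): one entry per letter
mask = column == letters    # broadcasts to shape (5, 26), dtype bool
out = mask.astype(int)      # True/False become 1/0

print(column.shape, letters.shape, out.shape)
```

Each row of `mask` is True only in the column whose index equals that character's offset, which is why every row of `out` contains exactly one 1.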
    

    If you are looking for performance, I would suggest preallocating an array of zeros and then assigning the ones by index -

    out = np.zeros((len(ints),26),dtype=int)
    out[np.arange(len(ints)), ints] = 1
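
As a quick sanity check, the sketch below builds the array both ways, compares the results, and decodes the one-hot rows back into text with `argmax`. The names `by_broadcast`, `by_index` and `decoded` are illustrative, and lowercase a-z input is assumed.

```python
import numpy as np

ints = np.frombuffer("hello".encode("ascii"), dtype=np.uint8) - 97

# Approach 1: broadcasting comparison.
by_broadcast = (ints[:, None] == np.arange(26)).astype(int)

# Approach 2: preallocate zeros, then set one entry per row.
by_index = np.zeros((len(ints), 26), dtype=int)
by_index[np.arange(len(ints)), ints] = 1

same = np.array_equal(by_broadcast, by_index)

# Decode: the column index of the 1 in each row is the letter's offset from 'a'.
decoded = "".join(chr(i + 97) for i in by_index.argmax(axis=1))
print(same, decoded)
```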
    

    Sample run -

    In [153]: ints = np.frombuffer("hello".encode("ascii"), dtype=np.uint8) - 97
    
    In [154]: ints
    Out[154]: array([ 7,  4, 11, 11, 14], dtype=uint8)
    
    In [155]: out = (ints[:,None] == np.arange(26)).astype(int)
    
    In [156]: print(out)
    [[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
     [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
     [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
     [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
     [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]]
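
The subtraction of 97 only works for lowercase a-z, so a full sentence with spaces or capitals needs an explicit alphabet. The following is a rough sketch of one way to do that, not part of the original answer: the function name `one_hot_sentence`, the choice of alphabet (a-z plus space), the lowercasing, and the decision to raise `ValueError` on unknown characters are all assumptions for illustration.

```python
import string
import numpy as np

# Assumed alphabet: lowercase a-z plus space. Adjust to suit the data.
ALPHABET = string.ascii_lowercase + " "

def one_hot_sentence(sentence, alphabet=ALPHABET):
    """Return one one-hot row per character of the lowercased sentence.

    Characters outside the alphabet raise ValueError.
    """
    index = {ch: i for i, ch in enumerate(alphabet)}
    try:
        ints = np.array([index[ch] for ch in sentence.lower()], dtype=np.intp)
    except KeyError as err:
        raise ValueError(f"character {err.args[0]!r} is not in the alphabet") from None
    # Same preallocate-and-assign idea as above, with a wider alphabet.
    out = np.zeros((len(ints), len(alphabet)), dtype=int)
    out[np.arange(len(ints)), ints] = 1
    return out

encoded = one_hot_sentence("Hello world")
print(encoded.shape)
```

With this alphabet the space character maps to the last column, so row 5 of `encoded` has its 1 at index 26.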
    
