How to one-hot-encode sentences at the character level?

-上瘾入骨i 2021-01-15 15:16

I would like to convert a sentence to an array of one-hot vectors, where each vector is the one-hot representation of a letter of the alphabet. It would look like the following:

5 answers

粉色の甜心 (OP)

2021-01-15 15:22

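Here's a vectorized approach that uses NumPy broadcasting to produce an array of shape (N, 26), where N is the number of characters. It assumes the input contains only lowercase ASCII letters, since subtracting 97 maps 'a' to index 0 -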

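import numpy as np
# np.frombuffer replaces the deprecated binary mode of np.fromstring
ints = np.frombuffer("hello".encode("ascii"), dtype=np.uint8) - 97  # 'a' -> 0, ..., 'z' -> 25
out = (ints[:, None] == np.arange(26)).astype(int)  # compare each code against 0..25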

If performance matters, I would suggest preallocating an array of zeros and then assigning the ones by index -

out = np.zeros((len(ints),26),dtype=int)
out[np.arange(len(ints)), ints] = 1
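
As a quick sanity check, the two variants above can be wrapped in functions and compared. This is only a sketch: the function names `one_hot_broadcast` and `one_hot_assign` are made up for illustration, and both still assume lowercase ASCII input.

```python
import numpy as np

def one_hot_broadcast(word):
    # Hypothetical helper: broadcasting comparison against 0..25.
    ints = np.frombuffer(word.encode("ascii"), dtype=np.uint8) - 97
    return (ints[:, None] == np.arange(26)).astype(int)

def one_hot_assign(word):
    # Hypothetical helper: preallocate zeros, then set one entry per row.
    ints = np.frombuffer(word.encode("ascii"), dtype=np.uint8) - 97
    out = np.zeros((len(ints), 26), dtype=int)
    out[np.arange(len(ints)), ints] = 1
    return out

a = one_hot_broadcast("hello")
b = one_hot_assign("hello")
same = np.array_equal(a, b)  # both variants should build the same array
```

The assignment variant writes N entries instead of building an N x 26 comparison, which is why it tends to be the cheaper of the two for long inputs.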

Sample run -

In [153]: ints = np.frombuffer("hello".encode("ascii"),dtype=np.uint8)-97

In [154]: ints
Out[154]: array([ 7,  4, 11, 11, 14], dtype=uint8)

In [155]: out = (ints[:,None] == np.arange(26)).astype(int)

In [156]: print(out)
[[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]]
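
The sample above covers a single lowercase word. A real sentence also contains spaces, capitals, and punctuation, and with `uint8` arithmetic any code below 97 wraps around to a large value that would index out of bounds. One possible way to handle that is sketched below. The function name `one_hot_sentence` and the policy choices (lowercase the input, give every character outside a-z an all-zero row) are assumptions made for this example, not part of the answer above.

```python
import numpy as np

def one_hot_sentence(sentence):
    # Hypothetical helper. Assumption: letters are lowercased, and any
    # character outside a-z gets an all-zero row.
    raw = sentence.lower().encode("ascii", errors="replace")
    # Cast to a signed type before subtracting so codes below 97 go
    # negative instead of wrapping around as they would in uint8.
    codes = np.frombuffer(raw, dtype=np.uint8).astype(np.int64) - 97
    valid = (codes >= 0) & (codes < 26)
    out = np.zeros((len(codes), 26), dtype=int)
    rows = np.flatnonzero(valid)   # positions that hold a letter
    out[rows, codes[rows]] = 1     # one 1 per letter row
    return out

encoded = one_hot_sentence("Hi there")
```

If spaces should carry information, an alternative is to widen the alphabet with an extra column for them rather than leaving those rows empty.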
