How to one-hot-encode sentences at the character level?

-上瘾入骨i 2021-01-15 15:16

I would like to convert a sentence to an array of one-hot vectors. Each vector would be the one-hot representation of a character over the alphabet. It would look like the following:

5 Answers
  • 2021-01-15 15:22

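    Here's a vectorized approach that uses NumPy broadcasting to produce an (N, 26)-shaped array:

    import numpy as np

    # np.fromstring is deprecated for binary data; np.frombuffer reads the same bytes
    ints = np.frombuffer(b"hello", dtype=np.uint8) - 97
    out = (ints[:, None] == np.arange(26)).astype(int)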
    

    If you are looking for performance, I would suggest initializing an array of zeros and then assigning the ones by index:

    out = np.zeros((len(ints),26),dtype=int)
    out[np.arange(len(ints)), ints] = 1
    

    Sample run:

    In [153]: ints = np.frombuffer(b"hello", dtype=np.uint8) - 97
    
    In [154]: ints
    Out[154]: array([ 7,  4, 11, 11, 14], dtype=uint8)
    
    In [155]: out = (ints[:, None] == np.arange(26)).astype(int)
    
    In [156]: print(out)
    [[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
     [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
     [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
     [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
     [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]]
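
    The indexed-assignment variant could be wrapped in a small function, roughly as sketched below. The name one_hot_lowercase and the range check are assumptions added for illustration, not part of the answer. The check is there because, once the values are converted to signed integers, a character such as "Z" would produce a negative index that NumPy silently wraps to a wrong column.

```python
import numpy as np

def one_hot_lowercase(text):
    """Return a (len(text), 26) int array (hypothetical helper name)."""
    # View the ASCII bytes as uint8, widen to int, then shift so 'a' maps to 0.
    ints = np.frombuffer(text.encode("ascii"), dtype=np.uint8).astype(int) - ord("a")
    # Assumption: reject anything outside a-z instead of silently mis-indexing.
    if ints.size and (ints.min() < 0 or ints.max() > 25):
        raise ValueError("only lowercase ASCII letters a-z are supported")
    out = np.zeros((len(ints), 26), dtype=int)
    # Set one column per row: row i gets a 1 at column ints[i].
    out[np.arange(len(ints)), ints] = 1
    return out

encoded = one_hot_lowercase("hello")
print(encoded.argmax(axis=1))
```

    Taking argmax along each row recovers the letter indices, which is a quick way to check the encoding.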
    
  • 2021-01-15 15:28

    You asked about "sentences" but your example provided only a single word, so I'm not sure what you wanted to do about spaces. But as far as single words are concerned, your example could be implemented with:

    def onehot(ltr):
        return [1 if i == ord(ltr) else 0 for i in range(97, 123)]
    
    def onehotvec(s):
        return [onehot(c) for c in s.lower()]
    
    onehotvec("hello")
    [[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
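
    For full sentences, one possible way to deal with spaces is to add the space character to the alphabet as a 27th column. The sketch below assumes that choice, and also assumes that any other character (digits, punctuation) should become an all-zero row. The names ALPHABET, CHAR_TO_INDEX and onehot_sentence are hypothetical.

```python
import string

# Assumption: the alphabet is a-z plus the space character (27 columns).
ALPHABET = string.ascii_lowercase + " "
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def onehot_sentence(sentence):
    """One-hot encode each character of a sentence (hypothetical helper)."""
    rows = []
    for ch in sentence.lower():
        row = [0] * len(ALPHABET)
        # Characters outside the alphabet keep an all-zero row.
        index = CHAR_TO_INDEX.get(ch)
        if index is not None:
            row[index] = 1
        rows.append(row)
    return rows

vectors = onehot_sentence("Hi there")
print(len(vectors), len(vectors[0]))  # 8 27
```

    Whether unknown characters should be all-zero, mapped to a dedicated "unknown" column, or rejected depends on what consumes the vectors.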
    
  • 2021-01-15 15:30

    This is a common task when working with recurrent neural networks, and TensorFlow has a function for exactly this purpose, tf.one_hot, if you'd like to use it.

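    import numpy as np
    import tensorflow as tf

    # Letter-to-index mapping (only a-o listed, which is enough for 'hello')
    alphabets = {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6, 'h': 7, 'i': 8, 'j': 9, 'k': 10, 'l': 11, 'm': 12, 'n': 13, 'o': 14}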
    
    idxs = [alphabets[ch] for ch in 'hello']
    print(idxs)
    # [7, 4, 11, 11, 14]
    
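    # @divakar's approach (np.frombuffer replaces the deprecated np.fromstring)
    idxs = np.frombuffer(b"hello", dtype=np.uint8) - 97
    
    # or, to make the intent clearer:
    idxs = np.frombuffer(b"hello", dtype=np.uint8) - ord('a')
    
    one_hot = tf.one_hot(idxs, 26, dtype=tf.uint8)
    # TensorFlow 1.x session API; with TensorFlow 2.x eager execution, one_hot.numpy() returns the array directly
    sess = tf.InteractiveSession()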
    
    In [15]: one_hot.eval()
    Out[15]: 
    array([[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint8)
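
    If TensorFlow is not available, the same index-to-one-hot step can be sketched in plain NumPy by selecting rows of an identity matrix. This is a swapped-in technique rather than what tf.one_hot does internally; row i of np.eye(26) is already the one-hot vector for index i.

```python
import numpy as np

# Letter indices for "hello": 'a' maps to 0, 'z' to 25.
idxs = np.frombuffer(b"hello", dtype=np.uint8) - ord("a")

# Each index picks the matching row of the 26x26 identity matrix.
one_hot = np.eye(26, dtype=np.uint8)[idxs]

print(one_hot.shape)           # (5, 26)
print(one_hot.argmax(axis=1))  # [ 7  4 11 11 14]
```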
    
  • 2021-01-15 15:37

    Just compare the letters in your passed string to a given alphabet:

    import string

    def string_vectorizer(strng, alphabet=string.ascii_lowercase):
        vector = [[0 if char != letter else 1 for char in alphabet]
                  for letter in strng]
        return vector
    

    Note that with a custom alphabet (e.g. "defbcazk"), the columns are ordered as the characters appear in that alphabet.

    The output of string_vectorizer('hello'):

    [[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
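
    To illustrate the note about custom alphabets, here is a small sketch that reuses the same function with the alphabet "defbcazk". The input "cab" is an arbitrary example chosen for illustration.

```python
import string

def string_vectorizer(strng, alphabet=string.ascii_lowercase):
    # One row per input character, one column per alphabet character.
    return [[1 if char == letter else 0 for char in alphabet]
            for letter in strng]

# Columns follow the order of the custom alphabet: d, e, f, b, c, a, z, k.
custom = string_vectorizer("cab", alphabet="defbcazk")
for row in custom:
    print(row)
# [0, 0, 0, 0, 1, 0, 0, 0]
# [0, 0, 0, 0, 0, 1, 0, 0]
# [0, 0, 0, 1, 0, 0, 0, 0]

# A character missing from the alphabet yields an all-zero row.
missing = string_vectorizer("x", alphabet="defbcazk")
print(missing)
# [[0, 0, 0, 0, 0, 0, 0, 0]]
```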
    
  • 2021-01-15 15:44

    With pandas, you can use pd.get_dummies by passing a categorical Series:

    import pandas as pd
    import string
    low = string.ascii_lowercase
    
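    s = 'hello'
    # astype('category', categories=...) was removed from pandas; use CategoricalDtype instead.
    # dtype=int keeps the 0/1 output shown here (newer pandas versions default to booleans).
    pd.get_dummies(pd.Series(list(s)).astype(pd.CategoricalDtype(categories=list(low))), dtype=int)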
    Out: 
       a  b  c  d  e  f  g  h  i  j ...  q  r  s  t  u  v  w  x  y  z
    0  0  0  0  0  0  0  0  1  0  0 ...  0  0  0  0  0  0  0  0  0  0
    1  0  0  0  0  1  0  0  0  0  0 ...  0  0  0  0  0  0  0  0  0  0
    2  0  0  0  0  0  0  0  0  0  0 ...  0  0  0  0  0  0  0  0  0  0
    3  0  0  0  0  0  0  0  0  0  0 ...  0  0  0  0  0  0  0  0  0  0
    4  0  0  0  0  0  0  0  0  0  0 ...  0  0  0  0  0  0  0  0  0  0
    
    [5 rows x 26 columns]
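
    The truncated display above hides the columns for "l" and "o", so rows 2 to 4 look empty. A minimal sketch for checking the full result is to convert the frame to a NumPy array; it assumes a pandas version that provides pd.CategoricalDtype and DataFrame.to_numpy.

```python
import string

import pandas as pd

low = string.ascii_lowercase
s = "hello"

# Fix the category set to a-z so every letter gets a column, even unused ones.
letters = pd.Series(list(s), dtype=pd.CategoricalDtype(categories=list(low)))
dummies = pd.get_dummies(letters, dtype=int)

# Convert to a plain array to inspect all 26 columns at once.
array = dummies.to_numpy()
print(array.shape)           # (5, 26)
print(array.argmax(axis=1))  # [ 7  4 11 11 14]
```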
    