TensorFlow.js tokenizer

Submitted by 徘徊边缘 on 2019-11-30 21:40:21

There are many ways to transform text into vectors, and the right one depends on the use case. The most intuitive approach uses term frequency: given the vocabulary of the corpus (all the possible words), each text document is represented as a vector in which each entry is the number of times the corresponding vocabulary word occurs in that document.

With this vocabulary:

["machine", "learning", "is", "a", "new", "field", "in", "computer", "science"]

the following text:

["machine", "is", "a", "field", "machine", "is", "is"] 

is transformed into this vector:

[2, 0, 3, 1, 0, 1, 0, 0, 0] 

One disadvantage of this technique is that the vector, which has the same size as the vocabulary of the corpus, may contain many zeros. That is why other techniques exist. Still, this bag-of-words representation is commonly referred to, and a slightly different version of it uses tf-idf weighting.

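// Vocabulary of the corpus and a tokenized document
const vocabulary = ["machine", "learning", "is", "a", "new", "field", "in", "computer", "science"];
const text = ["machine", "is", "a", "field", "machine", "is", "is"];
// For each vocabulary word, count how many times it occurs in the document
const parse = (t) => vocabulary.map((w) => t.reduce((count, token) => (token === w ? count + 1 : count), 0));
console.log(parse(text)); // [2, 0, 3, 1, 0, 1, 0, 0, 0]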
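
The tf-idf variant mentioned above can be sketched in the same style. This is only a minimal illustration under stated assumptions: the three-document corpus is made up for the example, and the weighting used is one common variant (raw count times log(N / df)); libraries often apply different smoothing or normalization.

```javascript
// Hypothetical toy corpus: each document is already split into words.
const vocabulary = ["machine", "learning", "is", "a", "new", "field", "in", "computer", "science"];
const corpus = [
  ["machine", "is", "a", "field", "machine", "is", "is"],
  ["machine", "learning", "is", "a", "new", "field"],
  ["computer", "science", "is", "a", "field"],
];

// Raw term counts for one document, in vocabulary order.
const termCounts = (doc) => vocabulary.map((w) => doc.filter((t) => t === w).length);

// idf(w) = log(N / df(w)), where df(w) is the number of documents containing w.
// Words that appear in no document get an idf of 0 (an assumption of this sketch).
const idf = vocabulary.map((w) => {
  const df = corpus.filter((doc) => doc.includes(w)).length;
  return df === 0 ? 0 : Math.log(corpus.length / df);
});

// tf-idf vector: term count times idf, entry by entry.
const tfidf = (doc) => termCounts(doc).map((count, i) => count * idf[i]);

console.log(tfidf(corpus[0]));
```

Words that occur in every document, such as "is" and "a" here, get a weight of zero, so only the more distinctive words contribute to the vector.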

There is also the following module that might help to achieve what you want

I faced this issue and handled it with the following steps:

  1. After tokenizer.fit_on_texts([data]), print tokenizer.word_index in your Python code.
  2. Copy the word_index output and save it as a JSON file.
  3. Refer to this JSON object (word2index below) to generate tokenized words, like this:

       function getTokenisedWord(seedWord) {
         const _token = word2index[seedWord.toLowerCase()];
         return tf.tensor1d([_token]);
       }

  4. Feed the token to the model:

       const seedWordToken = getTokenisedWord('Hello');
       model.predict(seedWordToken).data().then(predictions => {
         const resultIdx = tf.argMax(predictions).dataSync()[0];
         console.log('Predicted Word ::', index2word[resultIdx]);
       });

  5. index2word is the reverse mapping of the word2index JSON object.
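
The reverse mapping from step 5 and a whole-sentence lookup could be sketched as follows. The word2index contents and the textToSequence helper are hypothetical, made up for illustration. The helper only lowercases and splits on whitespace, so it does not reproduce the punctuation filtering of the Python tokenizer, and it simply skips words missing from the index.

```javascript
// Hypothetical excerpt of the word_index JSON exported from Python.
const word2index = { hello: 1, world: 2, machine: 3, learning: 4 };

// Reverse mapping: index -> word. Object keys become strings,
// but index2word[3] still works because property access coerces the key.
const index2word = Object.fromEntries(
  Object.entries(word2index).map(([word, index]) => [index, word])
);

// Convert a sentence to a list of indices; unknown words are skipped
// (an assumption of this sketch).
const textToSequence = (text) =>
  text
    .toLowerCase()
    .split(/\s+/)
    .filter((w) => w.length > 0)
    .map((w) => word2index[w])
    .filter((i) => i !== undefined);

console.log(textToSequence("Hello machine learning"));
console.log(index2word[3]);
```

The resulting array of indices can then be wrapped in a tensor, as in step 3, before being passed to the model.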