Sometimes we need to do the following:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=my_max)
Let's see what the next line of code does.
tokenizer.fit_on_texts(texts)  # texts must be a list of strings, not a single string
For example, consider the sentence "The earth is an awesome place live".
tokenizer.fit_on_texts(["The earth is an awesome place live"])
builds the word index {'the': 1, 'earth': 2, 'is': 3, 'an': 4, 'awesome': 5, 'place': 6, 'live': 7}, so 3 -> "is", 6 -> "place", and so on. Note that fit_on_texts returns nothing; it fills in tokenizer.word_index in place.
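A quick way to confirm the mapping is to print the fitted index yourself (a minimal, self-contained sketch; num_words=100 is an arbitrary stand-in for my_max):

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(["The earth is an awesome place live"])
print(tokenizer.word_index)
# {'the': 1, 'earth': 2, 'is': 3, 'an': 4, 'awesome': 5, 'place': 6, 'live': 7}

Note that the Tokenizer lowercases by default, which is why "The" shows up as 'the'.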
sequences = tokenizer.texts_to_sequences(["The earth is an great place live"])
returns [[1,2,3,4,6,7]].
You can see what happened here: the word "great" was not seen during fitting, so the tokenizer does not recognize it and simply drops it. In other words, fit_on_texts is applied once to the training data, and the fitted vocabulary index can then be used to encode a completely new set of word sequences. Fitting and encoding are two different processes, hence the two lines of code.
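Putting it all together, here is a minimal end-to-end sketch of the two processes (the sentences are the toy examples from above; in practice you would fit on your full training corpus, and depending on your Keras version the import may live under tensorflow.keras.preprocessing.text):

from keras.preprocessing.text import Tokenizer

train_texts = ["The earth is an awesome place live"]
new_texts = ["The earth is an great place live"]

# Process 1: learn the vocabulary from the training data.
tokenizer = Tokenizer(num_words=100)  # keep only the 99 most frequent words when encoding
tokenizer.fit_on_texts(train_texts)

# Process 2: encode unseen text with the already-fitted vocabulary.
sequences = tokenizer.texts_to_sequences(new_texts)
print(sequences)  # [[1, 2, 3, 4, 6, 7]] -- "great" was never fitted, so it is dropped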