Keras Text Preprocessing - Saving Tokenizer object to file for scoring

没有蜡笔的小新 2020-12-01 03:59

I've trained a sentiment classifier model using the Keras library by following the steps below (broadly):

  1. Convert the text corpus into sequences using a Tokenizer object
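
A minimal sketch of that first step, assuming the standard Keras preprocessing API (the corpus and parameters here are hypothetical):

    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences
    
    corpus = ["loved this movie", "terrible plot", "great acting"]  # toy stand-in corpus
    
    tokenizer = Tokenizer(num_words=10000)            # keep the 10,000 most frequent words
    tokenizer.fit_on_texts(corpus)                    # build the word -> index vocabulary
    sequences = tokenizer.texts_to_sequences(corpus)  # encode each text as index lists
    padded = pad_sequences(sequences, maxlen=20)      # pad/truncate to a fixed length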
4 Answers
  • 2020-12-01 04:38

    The most common way is to use either pickle or joblib. Here is an example of how to use pickle to save the Tokenizer:

    import pickle
    
    # saving
    with open('tokenizer.pickle', 'wb') as handle:
        pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
    
    # loading
    with open('tokenizer.pickle', 'rb') as handle:
        tokenizer = pickle.load(handle)
    
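    The answer also mentions joblib; a minimal sketch of the same save/load cycle with it (assuming joblib is installed):

    import joblib
    
    # saving
    joblib.dump(tokenizer, 'tokenizer.joblib')
    
    # loading
    tokenizer = joblib.load('tokenizer.joblib')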
  • 2020-12-01 04:39

    The accepted answer clearly demonstrates how to save the tokenizer. The following is a comment on the problem of (generally) scoring after fitting or saving. Suppose that a list texts consists of two lists Train_text and Test_text, where the set of tokens in Test_text is a subset of the set of tokens in Train_text (an optimistic assumption). Then fit_on_texts(Train_text) gives different results for texts_to_sequences(Test_text) than first calling fit_on_texts(texts) and then texts_to_sequences(Test_text).

    Concrete Example:

    from keras.preprocessing.text import Tokenizer
    
    docs = ["A heart that",
            "full up like",
            "a landfill",
            "no surprises",
            "and no alarms",
            "a job that slowly",
            "Bruises that",
            "You look so",
            "tired happy",
            "no alarms",
            "and no surprises"]
    # split so that every token in the test docs also appears in the training docs
    docs_train = docs[:9]
    docs_test = docs[9:]
    
    # EXPERIMENT 1: FIT TOKENIZER ONLY ON TRAIN
    T_1 = Tokenizer()
    T_1.fit_on_texts(docs_train)  # only train set
    encoded_train_1 = T_1.texts_to_sequences(docs_train)
    encoded_test_1 = T_1.texts_to_sequences(docs_test)
    print("result for test 1:\n%s" % (encoded_test_1,))
    
    # EXPERIMENT 2: FIT TOKENIZER ON BOTH TRAIN + TEST
    T_2 = Tokenizer()
    T_2.fit_on_texts(docs)  # both train and test set
    encoded_train_2 = T_2.texts_to_sequences(docs_train)
    encoded_test_2 = T_2.texts_to_sequences(docs_test)
    print("result for test 2:\n%s" % (encoded_test_2,))
    

    Results:

    result for test 1:
    [[3, 11], [10, 3, 9]]
    result for test 2:
    [[1, 6], [5, 1, 4]]
    

    Of course, if the above optimistic assumption is not satisfied and the set of tokens in Test_text is disjoint from that of Train_text, then experiment 1 yields a list of empty lists [].
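
    Not part of the original answer, but a common way to make scoring robust to unseen tokens is the Tokenizer's oov_token argument, which reserves index 1 for out-of-vocabulary words instead of silently dropping them. A minimal sketch, reusing docs_train and docs_test from above:

    from keras.preprocessing.text import Tokenizer
    
    # unseen words map to the reserved <OOV> index (1) rather than disappearing
    T_oov = Tokenizer(oov_token="<OOV>")
    T_oov.fit_on_texts(docs_train)
    print(T_oov.texts_to_sequences(docs_test))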

  • 2020-12-01 04:53

    I've created issue https://github.com/keras-team/keras/issues/9289 in the Keras repo. Until the API is changed, the issue links to a gist with code demonstrating how to save and restore a tokenizer without having the original documents the tokenizer was fit on. I prefer to store all my model information in a JSON file (for various reasons, but mainly because of a mixed JS/Python environment), and this allows for that, even with sort_keys=True.

  • 2020-12-01 05:03

    The Tokenizer class has a method to save its state in JSON format:

    import io
    import json
    
    tokenizer_json = tokenizer.to_json()
    with io.open('tokenizer.json', 'w', encoding='utf-8') as f:
        f.write(json.dumps(tokenizer_json, ensure_ascii=False))
    

    The data can be loaded using the tokenizer_from_json function from keras_preprocessing.text:

    import json
    from keras_preprocessing.text import tokenizer_from_json
    
    with open('tokenizer.json') as f:
        data = json.load(f)
        tokenizer = tokenizer_from_json(data)
    
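    A quick sanity check (a sketch; original_tokenizer is a hypothetical name for the fitted tokenizer that was saved) that the round trip preserved the vocabulary:

    # the reloaded tokenizer should encode text exactly like the original one
    sample = ["an example sentence"]
    assert tokenizer.word_index == original_tokenizer.word_index
    assert tokenizer.texts_to_sequences(sample) == original_tokenizer.texts_to_sequences(sample)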