How to build a language model using an LSTM that assigns a probability of occurrence to a given sentence

小蘑菇 2020-12-16 04:09

Currently, I am using a trigram model to do this. It assigns a probability of occurrence to a given sentence, but it is limited to a context of only two words. LSTMs can handle much longer contexts.

1 Answer
  • 2020-12-16 04:57

    I have just coded a very simple example showing how one might compute the probability of occurrence of a sentence with an LSTM model. The full code can be found here.

    Suppose we want to predict the probability of occurrence of a sentence for the following dataset (this rhyme was published in Mother Goose's Melody in London around 1765):

    # Data
    data = ["Two little dicky birds",
            "Sat on a wall,",
            "One called Peter,",
            "One called Paul.",
            "Fly away, Peter,",
            "Fly away, Paul!",
            "Come back, Peter,",
            "Come back, Paul."]
    

    First of all, let's use keras.preprocessing.text.Tokenizer to create a vocabulary and tokenize the sentences:

    # Preprocess data
    from keras.preprocessing.text import Tokenizer

    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(data)
    vocab = tokenizer.word_index
    seqs = tokenizer.texts_to_sequences(data)
    

    Our model will take a sequence of words as input (context), and will output the conditional probability distribution of each word in the vocabulary given the context. To this end, we prepare the training data by padding the sequences and sliding windows over them:

    from keras.preprocessing.sequence import pad_sequences
    import numpy as np

    def prepare_sentence(seq, maxlen):
        # Pads seq and slides windows
        x = []
        y = []
        for i, w in enumerate(seq):
            x_padded = pad_sequences([seq[:i]],
                                     maxlen=maxlen - 1,
                                     padding='pre')[0]  # Pads before each sequence
            x.append(x_padded)
            y.append(w)
        return x, y
    
    # Pad sequences and slide windows
    maxlen = max([len(seq) for seq in seqs])
    x = []
    y = []
    for seq in seqs:
        x_windows, y_windows = prepare_sentence(seq, maxlen)
        x += x_windows
        y += y_windows
    x = np.array(x)
    y = np.array(y) - 1  # The word <PAD> does not constitute a class
    y = np.eye(len(vocab))[y]  # One hot encoding
    

    I decided to slide windows separately for each verse, but this could be done differently.
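    For intuition, here is how the windowing works on plain Python lists, without Keras (a hypothetical mini-example that mirrors what prepare_sentence produces for one tokenized verse):

```python
def slide_windows(seq, maxlen):
    # For each position i, the context is seq[:i], left-padded with 0
    # (the <PAD> index) to length maxlen - 1; the target is seq[i].
    x, y = [], []
    for i, w in enumerate(seq):
        context = seq[:i]
        x.append([0] * (maxlen - 1 - len(context)) + context)
        y.append(w)
    return x, y

# Tokenized toy verse, e.g. "one called peter" -> [3, 4, 1]
x, y = slide_windows([3, 4, 1], maxlen=4)
print(x)  # [[0, 0, 0], [0, 0, 3], [0, 3, 4]]
print(y)  # [3, 4, 1]
```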

    Next, we define and train a simple LSTM model with Keras. The model consists of an embedding layer, an LSTM layer, and a dense layer with a softmax activation (which uses the LSTM output at the last timestep to produce the probability of each word in the vocabulary given the context):

    # Define model
    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, Dense

    model = Sequential()
    model.add(Embedding(input_dim=len(vocab) + 1,  # vocabulary size. Adding an
                                                   # extra element for <PAD> word
                        output_dim=5,  # size of embeddings
                        input_length=maxlen - 1))  # length of the padded sequences
    model.add(LSTM(10))
    model.add(Dense(len(vocab), activation='softmax'))
    model.compile('rmsprop', 'categorical_crossentropy')
    
    # Train network
    model.fit(x, y, epochs=1000)
    

    The joint probability P(w_1, ..., w_n) of occurrence of a sentence w_1 ... w_n can be computed using the chain rule of probability:

    P(w_1, ..., w_n)=P(w_1)*P(w_2|w_1)*...*P(w_n|w_{n-1}, ..., w_1)

    where each of these conditional probabilities is given by the LSTM model. Note that they might be very small, so it is sensible to work in log space in order to avoid numerical instability issues. Putting it all together:

    # Compute probability of occurrence of a sentence
    sentence = "One called Peter,"
    tok = tokenizer.texts_to_sequences([sentence])[0]
    x_test, y_test = prepare_sentence(tok, maxlen)
    x_test = np.array(x_test)
    y_test = np.array(y_test) - 1  # The word <PAD> does not constitute a class
    p_pred = model.predict(x_test)  # array of conditional probabilities
    vocab_inv = {v: k for k, v in vocab.items()}
    
    # Compute product
    # Efficient version: np.exp(np.sum(np.log(np.diag(p_pred[:, y_test]))))
    log_p_sentence = 0
    for i, prob in enumerate(p_pred):
        word = vocab_inv[y_test[i]+1]  # Index 0 from vocab is reserved to <PAD>
        history = ' '.join([vocab_inv[w] for w in x_test[i, :] if w != 0])
        prob_word = prob[y_test[i]]
        log_p_sentence += np.log(prob_word)
        print('P(w={}|h={})={}'.format(word, history, prob_word))
    print('Prob. sentence: {}'.format(np.exp(log_p_sentence)))
    
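    As a quick numeric illustration of why working in log space matters: multiplying many small conditional probabilities directly underflows float64 to 0.0, while summing their logs stays finite (the probabilities below are made up for illustration):

```python
import numpy as np

# Hypothetical conditional probabilities for a 200-word sentence
probs = np.full(200, 1e-3)

direct = np.prod(probs)        # 1e-600 underflows float64 to 0.0
log_p = np.sum(np.log(probs))  # 200 * log(1e-3) ≈ -1381.55, still finite

print(direct)  # 0.0
print(log_p)
```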

    NOTE: This is a very small toy dataset and we might be overfitting.
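    Also note that longer sentences naturally get lower joint probabilities simply because more factors are multiplied. If you want to compare sentences of different lengths, a common normalization is the average per-word log-probability, or equivalently perplexity. A minimal sketch (the conditional probabilities here are placeholders, not actual model outputs):

```python
import numpy as np

def perplexity(cond_probs):
    # exp(-mean(log p)): lower means the model is less "surprised"
    return float(np.exp(-np.mean(np.log(cond_probs))))

# Placeholder conditional probabilities for a short and a longer sentence
short = [0.2, 0.5]
longer = [0.2, 0.5, 0.4, 0.3]

# The joint probability of `longer` is smaller (0.012 vs 0.1),
# but the perplexities are comparable
print(perplexity(short))   # ≈ 3.162
print(perplexity(longer))  # ≈ 3.021
```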
