Currently, I am using a trigram model to do this. It assigns a probability of occurrence to a given sentence, but it is limited to a context of only 2 words. LSTMs, however, can handle longer contexts.
I have just coded a very simple example showing how one might compute the probability of occurrence of a sentence with an LSTM model. The full code can be found here.
Suppose we want to predict the probability of occurrence of a sentence for the following dataset (this rhyme was published in Mother Goose's Melody in London around 1765):
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# Data
data = ["Two little dicky birds",
        "Sat on a wall,",
        "One called Peter,",
        "One called Paul.",
        "Fly away, Peter,",
        "Fly away, Paul!",
        "Come back, Peter,",
        "Come back, Paul."]
First of all, let's use keras.preprocessing.text.Tokenizer to create a vocabulary and tokenize the sentences:
# Preprocess data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data)
vocab = tokenizer.word_index
seqs = tokenizer.texts_to_sequences(data)
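To see what this step produces without Keras, here is a simplified pure-Python sketch of the same idea: words are indexed by descending frequency starting at 1 (index 0 is reserved for padding), with frequency ties kept in order of first appearance. This is a rough replica for illustration, not Keras's actual implementation:

```python
from collections import Counter

def build_vocab(texts):
    # Lowercase, strip trailing punctuation, index by descending frequency;
    # index 0 is implicitly reserved for the <PAD> word
    words = [w.strip('.,!?').lower() for t in texts for w in t.split()]
    counts = Counter(words)
    # Stable sort: ties keep first-appearance order
    ordered = sorted(dict.fromkeys(words), key=lambda w: -counts[w])
    return {w: i + 1 for i, w in enumerate(ordered)}

def to_sequences(texts, vocab):
    return [[vocab[w.strip('.,!?').lower()] for w in t.split()] for t in texts]

data = ["One called Peter,", "One called Paul.", "Fly away, Peter,"]
vocab = build_vocab(data)
seqs = to_sequences(data, vocab)
print(vocab)  # frequent words get the lowest indices
print(seqs)   # each verse becomes a list of integer indices
```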
Our model will take a sequence of words as input (context), and will output the conditional probability distribution of each word in the vocabulary given the context. To this end, we prepare the training data by padding the sequences and sliding windows over them:
def prepare_sentence(seq, maxlen):
    # Pads seq and slides windows
    x = []
    y = []
    for i, w in enumerate(seq):
        x_padded = pad_sequences([seq[:i]],
                                 maxlen=maxlen - 1,
                                 padding='pre')[0]  # Pads before each sequence
        x.append(x_padded)
        y.append(w)
    return x, y
# Pad sequences and slide windows
maxlen = max([len(seq) for seq in seqs])
x = []
y = []
for seq in seqs:
    x_windows, y_windows = prepare_sentence(seq, maxlen)
    x += x_windows
    y += y_windows
x = np.array(x)
y = np.array(y) - 1 # The word <PAD> does not constitute a class
y = np.eye(len(vocab))[y] # One hot encoding
I decided to slide windows separately for each verse, but this could be done differently.
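To see concretely what prepare_sentence produces, here is a small self-contained sketch of the same padding-and-windowing logic in pure Python (no Keras), applied to one tokenized verse. The word indices here are made up for illustration:

```python
def prepare_sentence(seq, maxlen):
    # For each word, pair the pre-padded preceding context with that word
    x, y = [], []
    for i, w in enumerate(seq):
        context = seq[:i]
        x.append([0] * (maxlen - 1 - len(context)) + context)  # pad with 0 on the left
        y.append(w)
    return x, y

seq = [3, 4, 1]  # hypothetical indices for "one called peter"
maxlen = 5       # length of the longest verse in the corpus
x, y = prepare_sentence(seq, maxlen)
for ctx, target in zip(x, y):
    print(ctx, '->', target)
# Each row predicts the next word from its left context:
# [0, 0, 0, 0] -> 3
# [0, 0, 0, 3] -> 4
# [0, 0, 3, 4] -> 1
```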
Next, we define and train a simple LSTM model with Keras. The model consists of an embedding layer, an LSTM layer, and a dense layer with a softmax activation (which uses the output at the last timestep of the LSTM to produce the probability of each word in the vocabulary given the context):
# Define model
model = Sequential()
model.add(Embedding(input_dim=len(vocab) + 1,  # vocabulary size, plus an
                                               # extra element for <PAD>
                    output_dim=5,              # size of embeddings
                    input_length=maxlen - 1))  # length of the padded sequences
model.add(LSTM(10))
model.add(Dense(len(vocab), activation='softmax'))
model.compile('rmsprop', 'categorical_crossentropy')
# Train network
model.fit(x, y, epochs=1000)
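The softmax at the output is what turns the network's raw scores into a conditional distribution over the vocabulary given the context. A minimal NumPy sketch of that final step, with made-up logits:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1, -1.0])  # hypothetical scores, one per vocab word
probs = softmax(logits)
print(probs)        # a valid probability distribution over the vocabulary
print(probs.sum())  # sums to 1
```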
The joint probability P(w_1, ..., w_n) of occurrence of a sentence w_1 ... w_n can be computed using the chain rule of probability:
P(w_1, ..., w_n) = P(w_1) * P(w_2|w_1) * ... * P(w_n|w_{n-1}, ..., w_1)
where each of these conditional probabilities is given by the LSTM model. Note that they might be very small, so it is sensible to work in log space in order to avoid numerical instability issues. Putting it all together:
# Compute probability of occurrence of a sentence
sentence = "One called Peter,"
tok = tokenizer.texts_to_sequences([sentence])[0]
x_test, y_test = prepare_sentence(tok, maxlen)
x_test = np.array(x_test)
y_test = np.array(y_test) - 1 # The word <PAD> does not constitute a class
p_pred = model.predict(x_test) # array of conditional probabilities
vocab_inv = {v: k for k, v in vocab.items()}
# Compute product
# Efficient version: np.exp(np.sum(np.log(np.diag(p_pred[:, y_test]))))
log_p_sentence = 0
for i, prob in enumerate(p_pred):
    word = vocab_inv[y_test[i] + 1]  # Index 0 of vocab is reserved for <PAD>
    history = ' '.join([vocab_inv[w] for w in x_test[i, :] if w != 0])
    prob_word = prob[y_test[i]]
    log_p_sentence += np.log(prob_word)
    print('P(w={}|h={})={}'.format(word, history, prob_word))
print('Prob. sentence: {}'.format(np.exp(log_p_sentence)))
NOTE: This is a very small toy dataset and we might be overfitting.
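As a side note on the log-space recommendation above: multiplying many small conditional probabilities directly underflows to 0.0 in double precision, while summing their logs stays perfectly representable. A quick illustration:

```python
import math

probs = [1e-5] * 80  # 80 tiny conditional probabilities

direct = 1.0
for p in probs:
    direct *= p      # 1e-400 is below the smallest positive float (~5e-324)

log_p = sum(math.log(p) for p in probs)  # sum of logs stays well-behaved

print(direct)  # 0.0 due to underflow
print(log_p)   # about -921, i.e. 80 * log(1e-5)
```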