Question
This is a tough question, but I might as well try. I'm implementing the architecture from this paper, https://arxiv.org/pdf/1503.08895.pdf, for language modeling. See page 2 for a diagram, and the top of page 5 for the section on positional or "temporal" encoding. More on positional encoding can be found at the bottom of page 5 and the top of page 6 of https://arxiv.org/pdf/1706.03762.pdf. (I was directed to that second paper by the authors of the first.)
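For readers who don't have the second paper open, here is a rough sketch of its fixed sinusoidal encoding, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). It is plain Python for illustration only; the function name `sinusoidal_encoding` and its parameters are made up here, and this is not the learned T_A / T_C encoding used in the Keras code below.

```python
import math

def sinusoidal_encoding(seq_len, embed_dim, base=10000.0):
    """Return a seq_len x embed_dim table of fixed sinusoidal encodings.

    Hypothetical helper for illustration; not part of Keras or of my model.
    """
    table = []
    for pos in range(seq_len):
        row = []
        for k in range(embed_dim):
            # Dimensions 2i and 2i+1 share the same frequency.
            angle = pos / base ** (2 * (k // 2) / embed_dim)
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_encoding(seq_len=4, embed_dim=6)
```

Unlike a learned matrix, this table has no trainable parameters, so it adds nothing for the optimizer to fit.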
So here's my Keras implementation in a nutshell:
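# Imports assumed for standalone Keras 2.x
from keras.layers import Input, Embedding, Dense, Dropout, Dot, Softmax, Add, Layer
from keras.models import Model
from keras.optimizers import SGD
from keras.initializers import RandomNormal
word_seq = Input(shape = (SEQ_LEN,), dtype = "int32", name = "word_seq")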
query = Input(shape = (EMBED_DIM, ), dtype = "float32", name = "q_input")
#the query for lang. modeling is a constant vector filled with 0.1, as described at the bottom of page 7 in the first linked paper
T_A = Added_Weights(input_dim = (SEQ_LEN, EMBED_DIM))
#Added_Weights is a custom layer I wrote, which I'll post below
#These are the "positional encoding" components
T_C = Added_Weights(input_dim = (SEQ_LEN, EMBED_DIM))
Emb_A = Embedding(output_dim = EMBED_DIM, input_dim = VOCAB_SIZE, input_length = SEQ_LEN, name = "Emb_A")
Emb_C = Embedding(output_dim = EMBED_DIM, input_dim = VOCAB_SIZE, input_length = SEQ_LEN, name = "Emb_C")
int_state_weights = Dense(units = EMBED_DIM, activation = 'linear',
                          kernel_initializer=RandomNormal(mean=0., stddev = 0.05, seed = None))
layer_output = query
#the loop uses the output from the previous layer as the query, but the first layer's query is just that constant vector
for i in range(0, NUM_LAYERS - 1):
    memories = Emb_A(word_seq) #these all re-use the weights instantiated earlier.
    memories = T_A(memories)
    memories = Dropout(DROPOUT_R)(memories)
    content = Emb_C(word_seq)
    content = T_C(content)
    mem_relevance = Dot(axes=[1, 2])([layer_output, memories])
    weighted_internal_state = int_state_weights(mem_relevance)
    mem_relevance = Softmax()(mem_relevance)
    content_relevance = Dot(axes=1)([mem_relevance, content]) # weight each piece of content by its probability of being relevant
    layer_output = Add()([content_relevance, weighted_internal_state])
    layer_output = Dropout(DROPOUT_R)(layer_output)
final_output = Dense(units = VOCAB_SIZE, activation ='relu',
                     kernel_initializer=RandomNormal(mean=0., stddev = 0.05, seed = None))(layer_output)
model = Model(inputs = [word_seq, query], outputs = prediction)
model.compile(optimizer = SGD(lr = 0.01, clipnorm = 50.), loss = 'categorical_crossentropy', metrics = ['accuracy'])
model.fit(x = [td_seqs, td_query], y = [td_labels],
          batch_size = BATCH_SIZE, callbacks = [lr_adjust, lr_termination, for_csv], epochs=200, verbose = 1)
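To make the attention step inside the loop concrete, here is a small plain-Python sketch for a single example with the batch dimension dropped: score each memory against the query, softmax the scores, and take the weighted sum of the content vectors. The names `softmax` and `attention_hop` and the toy numbers are invented for illustration; dropout, the embeddings, and the `int_state_weights` mapping are left out, so this is only an approximation of what the Keras graph computes.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_hop(query, memories, contents):
    """One attention step: returns (weights, weighted sum of contents).

    Hypothetical helper; query has length d, memories and contents are
    lists of seq_len vectors of length d.
    """
    # One relevance score per memory slot: dot(query, memory_i).
    scores = [sum(q * m for q, m in zip(query, mem)) for mem in memories]
    weights = softmax(scores)
    dim = len(contents[0])
    # Weight each content vector by its probability of being relevant.
    output = [sum(w * c[d] for w, c in zip(weights, contents))
              for d in range(dim)]
    return weights, output

query = [0.1, 0.1]  # constant query, as in the language-modeling setup
memories = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contents = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
weights, output = attention_hop(query, memories, contents)
```

With a constant query, the only thing that separates the scores is the memory vectors themselves, which includes whatever T_A adds to them.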
BATCH_SIZE is currently 128. This went well on ~35,000 training samples before I added the T_A and T_C parts, ending at 96% accuracy. As soon as I implemented T_A and T_C (the positional encoding), training ended at around 10% accuracy with a training loss of about 5.2. I increased the training data by a factor of 10 and didn't see any real improvement. Here's my Added_Weights class:
class Added_Weights(Layer):
    def __init__(self, input_dim, **kwargs):
        super(Added_Weights, self).__init__(**kwargs)
        self.input_dim = input_dim

    def build(self, input_shape):
        # Create a trainable weight variable for this layer.
        self.kernel = self.add_weight(name='kernel',
                                      shape=(self.input_dim[0], self.input_dim[1]),
                                      initializer=RandomNormal(mean=0., stddev=0.05, seed=None),
                                      trainable=True)
        super(Added_Weights, self).build(input_shape)

    def call(self, x, **kwargs):
        return x + self.kernel

    def compute_output_shape(self, input_shape):
        return input_shape
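As far as I understand it, `x + self.kernel` broadcasts the (SEQ_LEN, EMBED_DIM) kernel over the batch dimension, so every sample gets the same per-position offsets. Here is a tiny plain-Python sketch of that behaviour with made-up numbers; `add_position_table` is a hypothetical stand-in for the layer's `call`, not Keras code.

```python
def add_position_table(batch, table):
    """Add one (seq_len x embed_dim) table to every sequence in the batch."""
    out = []
    for seq in batch:
        if len(seq) != len(table):
            raise ValueError("sequence length must match the table")
        # Position i of every sequence receives row i of the table.
        out.append([[x + t for x, t in zip(vec, row)]
                    for vec, row in zip(seq, table)])
    return out

table = [[0.5, 0.5], [-1.0, 1.0]]   # stands in for the learned kernel
batch = [
    [[1.0, 2.0], [3.0, 4.0]],       # sample 0: two positions, dim 2
    [[0.0, 0.0], [0.0, 0.0]],       # sample 1: all zeros
]
shifted = add_position_table(batch, table)
```

The all-zero sample comes back equal to the table itself, which shows that the offset depends only on position and not on the word at that position.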
I am agonizing over why this won't work, after reading both of these awesome papers explicitly stating that it SHOULD work. If anyone can manage to help with this, that would be amazing.
Source: https://stackoverflow.com/questions/50400481/positional-encodings-leads-to-worse-convergence-language-modeling