I have a simple network of one LSTM and two Dense layers, like so:

model = tf.keras.Sequential()
model.add(layers.LSTM(20, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(layers.Dense(20))                                      # hidden Dense layer (width here is a placeholder)
model.add(layers.Dense(train_Y.shape[1], activation='sigmoid'))  # output Dense layer
I eventually found two answers to the problem, both from libraries on pypi.org. The first is self-attention, which can be implemented with standalone Keras (the pre-TF-2.0 version of Keras) as follows:
import keras
from keras_self_attention import SeqSelfAttention  # pip install keras-self-attention

model = keras.models.Sequential()
model.add(keras.layers.LSTM(cfg.LSTM,
                            input_shape=(cfg.TIMESTEPS, cfg.FEATURES),
                            return_sequences=True))
model.add(SeqSelfAttention(attention_width=cfg.ATTNWIDTH,
                           attention_type=SeqSelfAttention.ATTENTION_TYPE_MUL,
                           attention_activation='softmax',
                           name='Attention'))
model.add(keras.layers.Dense(cfg.DENSE))
model.add(keras.layers.Dense(cfg.OUTPUT, activation='sigmoid'))
The second way to do it is a more general attention solution that works with the Keras integrated into TF 2.0 (tf.keras), as follows:
import tensorflow as tf
from tensorflow.keras import layers
from attention import Attention  # assuming the 'attention' package from PyPI (pip install attention)

model = tf.keras.models.Sequential()
model.add(layers.LSTM(cfg.LSTM,
                      input_shape=(cfg.SEQUENCES, train_X.shape[2]),
                      return_sequences=True))
model.add(Attention(name='attention_weight'))
model.add(layers.Dense(train_Y.shape[2], activation='sigmoid'))
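
The dimensionality difference described below is easy to miss, so a quick sanity check is to print the layer shapes of whichever model you just built (this uses only standard Keras calls, nothing from either attention library):

model.summary()            # per-layer output shapes
print(model.output_shape)  # shape of the model's final output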
They each behave a little differently and produce very different results. The self-attention library reduces the dimensions from 3 to 2, and when predicting you get a prediction per input vector. The general attention mechanism maintains the 3D data and outputs 3D, and when predicting you only get a prediction per batch. You can work around this by reshaping your prediction data to have batch sizes of 1 if you want a prediction per input vector.
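As a rough sketch of that reshaping workaround (the names pred_X, n_vectors and n_features are hypothetical, and it assumes the model was built to accept length-1 sequences, e.g. input_shape=(None, n_features) rather than a fixed number of timesteps):

import numpy as np

# pred_X: vectors to score, flattened to shape (n_vectors, n_features) -- hypothetical data
n_vectors, n_features = pred_X.shape
per_vector = pred_X.reshape(n_vectors, 1, n_features)  # each input vector becomes its own batch of size 1
predictions = model.predict(per_vector, batch_size=1)  # one prediction per input vector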
As for results, the self-attention did produce superior results to the LSTM alone, but not better than other enhancements such as dropout or additional Dense layers. The general attention does not seem to add any benefit to an LSTM model and in many cases makes things worse, but I'm still investigating.
In any case, it can be done, but so far it's dubious whether it should be done.