How to Merge Numerical and Embedding Sequential Models to treat categories in RNN

后悔当初 2021-02-01 09:50

I would like to build a one layer LSTM model with embeddings for my categorical features. I currently have numerical features and a few categorical features, such as Location, w

2 Answers
  • 2021-02-01 10:42

    One solution, as you mentioned, is to one-hot encode the categorical data (or even use them as they are, in index-based format) and feed them along with the numerical data to an LSTM layer. You can also have two LSTM layers here, one for processing the numerical data and another for processing the categorical data (in one-hot encoded or index-based format), and then merge their outputs.
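As a rough illustration of the one-hot option, the sketch below encodes one index-based categorical feature with NumPy and concatenates it with the numerical features along the feature axis. The array sizes and the names `X_num`, `X_cat` and `n_categories` are assumptions made up for this example; they are not part of the question.

```python
import numpy as np

# Toy sizes, assumed for illustration only.
n_samples, n_steps = 4, 5
n_numerical_feats = 3
n_categories = 6  # hypothetical number of categories in one categorical feature

rng = np.random.default_rng(0)
X_num = rng.normal(size=(n_samples, n_steps, n_numerical_feats))
X_cat = rng.integers(0, n_categories, size=(n_samples, n_steps))  # index-based format

# Row i of the identity matrix is the one-hot vector for category index i.
X_cat_onehot = np.eye(n_categories, dtype=X_num.dtype)[X_cat]  # shape (4, 5, 6)

# Concatenate along the last axis so every timestep carries both kinds of features.
X_merged = np.concatenate([X_num, X_cat_onehot], axis=-1)  # shape (4, 5, 9)
print(X_merged.shape)
```

The resulting array has the `(samples, timesteps, features)` layout an LSTM layer expects. With large category counts the one-hot width grows quickly, which is one reason to prefer embeddings.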

    Another solution is to have a separate embedding layer for each categorical feature. Each embedding layer may have its own embedding dimension (and, as suggested above, you may have more than one LSTM layer for processing numerical and categorical features separately):

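    from keras.layers import Input, Embedding, TimeDistributed, Reshape, LSTM, concatenate
    from keras.models import Model
    
    num_cats = 3 # number of categorical features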
    n_steps = 100 # number of timesteps in each sample
    n_numerical_feats = 10 # number of numerical features in each sample
    cat_size = [1000, 500, 100] # number of categories in each categorical feature
    cat_embd_dim = [50, 10, 100] # embedding dimension for each categorical feature
    
    numerical_input = Input(shape=(n_steps, n_numerical_feats), name='numeric_input')
    cat_inputs = []
    for i in range(num_cats):
        cat_inputs.append(Input(shape=(n_steps,1), name='cat' + str(i+1) + '_input'))
    
    cat_embedded = []
    for i in range(num_cats):
        embed = TimeDistributed(Embedding(cat_size[i], cat_embd_dim[i]))(cat_inputs[i])
        cat_embedded.append(embed)
    
    cat_merged = concatenate(cat_embedded)
    cat_merged = Reshape((n_steps, -1))(cat_merged)
    merged = concatenate([numerical_input, cat_merged])
    lstm_out = LSTM(64)(merged)
    
    model = Model([numerical_input] + cat_inputs, lstm_out)
    model.summary()
    

    Here is the model summary:

    Layer (type)                    Output Shape         Param #     Connected to                     
    ==================================================================================================
    cat1_input (InputLayer)         (None, 100, 1)       0                                            
    __________________________________________________________________________________________________
    cat2_input (InputLayer)         (None, 100, 1)       0                                            
    __________________________________________________________________________________________________
    cat3_input (InputLayer)         (None, 100, 1)       0                                            
    __________________________________________________________________________________________________
    time_distributed_1 (TimeDistrib (None, 100, 1, 50)   50000       cat1_input[0][0]                 
    __________________________________________________________________________________________________
    time_distributed_2 (TimeDistrib (None, 100, 1, 10)   5000        cat2_input[0][0]                 
    __________________________________________________________________________________________________
    time_distributed_3 (TimeDistrib (None, 100, 1, 100)  10000       cat3_input[0][0]                 
    __________________________________________________________________________________________________
    concatenate_1 (Concatenate)     (None, 100, 1, 160)  0           time_distributed_1[0][0]         
                                                                     time_distributed_2[0][0]         
                                                                     time_distributed_3[0][0]         
    __________________________________________________________________________________________________
    numeric_input (InputLayer)      (None, 100, 10)      0                                            
    __________________________________________________________________________________________________
    reshape_1 (Reshape)             (None, 100, 160)     0           concatenate_1[0][0]              
    __________________________________________________________________________________________________
    concatenate_2 (Concatenate)     (None, 100, 170)     0           numeric_input[0][0]              
                                                                     reshape_1[0][0]                  
    __________________________________________________________________________________________________
    lstm_1 (LSTM)                   (None, 64)           60160       concatenate_2[0][0]              
    ==================================================================================================
    Total params: 125,160
    Trainable params: 125,160
    Non-trainable params: 0
    __________________________________________________________________________________________________
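The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias, and reuses the sizes from the code above; `lstm_units` is just a name introduced here for the 64 units.

```python
cat_size = [1000, 500, 100]    # number of categories in each categorical feature
cat_embd_dim = [50, 10, 100]   # embedding dimension for each categorical feature
n_numerical_feats = 10
lstm_units = 64                # name introduced for this check

# An embedding layer stores one vector of length `dim` per category.
embedding_params = [size * dim for size, dim in zip(cat_size, cat_embd_dim)]

# The LSTM sees the numerical features plus all embedding vectors at each timestep.
lstm_input_dim = n_numerical_feats + sum(cat_embd_dim)

# Four gates, each with input weights, recurrent weights and a bias vector.
lstm_params = 4 * ((lstm_input_dim + lstm_units) * lstm_units + lstm_units)

total_params = sum(embedding_params) + lstm_params
print(embedding_params, lstm_input_dim, lstm_params, total_params)
# prints: [50000, 5000, 10000] 170 60160 125160
```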
    

    There is yet another solution you can try: use a single embedding layer for all the categorical features. It involves some preprocessing, though: you need to re-index all the categories so that they are distinct from each other. For example, the categories in the first categorical feature would be numbered from 1 to size_first_cat, the categories in the second categorical feature from size_first_cat + 1 to size_first_cat + size_second_cat, and so on. The drawback is that all the categorical features share the same embedding dimension, since there is only one embedding layer.
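The re-indexing step can be sketched with NumPy offsets. This sketch assumes the raw ids of each feature are 0-based (so the shared ids run from 0 to the total number of categories minus 1); if 0 is reserved, for example for padding, add 1 to every id to get the 1-based numbering described above. The toy array `X_cat` is made up for illustration.

```python
import numpy as np

cat_size = [1000, 500, 100]  # number of categories in each categorical feature

# Offset of each feature = total size of all features before it.
offsets = np.concatenate([[0], np.cumsum(cat_size)[:-1]])  # [0, 1000, 1500]

# Toy data with shape (n_samples, n_steps, num_cats), 0-based ids per feature.
X_cat = np.array([[[0, 0, 0],
                   [999, 499, 99]]])

# Broadcasting adds the matching offset to each feature column.
X_cat_shared = X_cat + offsets

# Size of the shared vocabulary for the single embedding layer.
vocab_size = sum(cat_size)
print(X_cat_shared.tolist(), vocab_size)
# prints: [[[0, 1000, 1500], [999, 1499, 1599]]] 1600
```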


    Update: Now that I think about it, you can also reshape the categorical features to (n_steps,) per sample, either in the data preprocessing stage or in the model itself, to get rid of the TimeDistributed layers and the Reshape layer (this may increase the training speed as well):

    numerical_input = Input(shape=(n_steps, n_numerical_feats), name='numeric_input')
    cat_inputs = []
    for i in range(num_cats):
        cat_inputs.append(Input(shape=(n_steps,), name='cat' + str(i+1) + '_input'))
    
    cat_embedded = []
    for i in range(num_cats):
        embed = Embedding(cat_size[i], cat_embd_dim[i])(cat_inputs[i])
        cat_embedded.append(embed)
    
    cat_merged = concatenate(cat_embedded)
    merged = concatenate([numerical_input, cat_merged])
    lstm_out = LSTM(64)(merged)
    
    model = Model([numerical_input] + cat_inputs, lstm_out)
    

    As for fitting the model, you need to feed each input layer separately with its own corresponding numpy array, for example:

    X_tr_numerical = X_train[:,:,:n_numerical_feats]
    
    # extract categorical features: you can use a for loop to do this as well.
    # note that we reshape categorical features to make them consistent with the updated solution
    X_tr_cat1 = X_train[:,:,cat1_idx].reshape(-1, n_steps)
    X_tr_cat2 = X_train[:,:,cat2_idx].reshape(-1, n_steps)
    X_tr_cat3 = X_train[:,:,cat3_idx].reshape(-1, n_steps)
    
    # don't forget to compile the model ...
    
    # fit the model
    model.fit([X_tr_numerical, X_tr_cat1, X_tr_cat2, X_tr_cat3], y_train, ...)
    
    # or you can use input layer names instead
    model.fit({'numeric_input': X_tr_numerical,
               'cat1_input': X_tr_cat1,
               'cat2_input': X_tr_cat2,
               'cat3_input': X_tr_cat3}, y_train, ...)
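The extraction step can also be written as a loop that builds the name-keyed dict directly. In this sketch the layout of `X_train` (numerical columns first, then one column per categorical feature) and the column positions in `cat_idx` are assumptions for illustration; use whatever positions your own data has.

```python
import numpy as np

n_samples, n_steps, n_numerical_feats = 8, 100, 10
cat_size = [1000, 500, 100]
cat_idx = [10, 11, 12]  # hypothetical column positions of the categorical features

# Build a toy X_train of shape (n_samples, n_steps, n_numerical_feats + num_cats).
rng = np.random.default_rng(0)
X_train = np.empty((n_samples, n_steps, n_numerical_feats + len(cat_size)))
X_train[:, :, :n_numerical_feats] = rng.normal(
    size=(n_samples, n_steps, n_numerical_feats))
for col, size in zip(cat_idx, cat_size):
    X_train[:, :, col] = rng.integers(0, size, size=(n_samples, n_steps))

# Split into one array per input layer, keyed by the input layer names.
inputs = {'numeric_input': X_train[:, :, :n_numerical_feats]}
for i, col in enumerate(cat_idx):
    # Indexing a single column already gives shape (n_samples, n_steps).
    inputs['cat%d_input' % (i + 1)] = X_train[:, :, col].astype('int32')

for name, arr in inputs.items():
    print(name, arr.shape, arr.dtype)
```

The resulting `inputs` dict has the same structure as the one passed to `model.fit` above.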
    

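    If you would like to use fit_generator(), nothing changes (note that in recent TensorFlow/Keras versions fit_generator() is deprecated, and fit() accepts generators and Sequence objects directly):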

    # if you are using a generator
    def my_generator(...):
    
        # prep the data ...
    
        yield [batch_tr_numerical, batch_tr_cat1, batch_tr_cat2, batch_tr_cat3], batch_tr_y
    
        # or, instead of the list above, yield a dict keyed by the input layer names:
        # yield {'numeric_input': batch_tr_numerical,
        #        'cat1_input': batch_tr_cat1,
        #        'cat2_input': batch_tr_cat2,
        #        'cat3_input': batch_tr_cat3}, batch_tr_y
    
    model.fit_generator(my_generator(...), ...)
    
    # or if you are subclassing Sequence class
    class MySequence(Sequence):  # Sequence is keras.utils.Sequence
        def __init__(self, x_set, y_set, batch_size):
            # initialize the data
            ...
    
        def __len__(self):
            # return the number of batches per epoch
            ...
    
        def __getitem__(self, idx):
            # fetch data for the given batch index (i.e. idx)
            # same as the generator above but use `return` instead of `yield`
            ...
    
    model.fit_generator(MySequence(...), ...)
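For reference, here is one way the generator skeleton above could be filled in. It is only a sketch: the function name `batch_generator`, its parameters and the toy arrays are hypothetical, and it uses plain NumPy so that it runs without Keras. It loops forever, as Keras expects from a training generator, and yields the name-keyed dict form.

```python
import numpy as np

def batch_generator(inputs, y, batch_size, shuffle=True, seed=0):
    """Yield (inputs_batch, y_batch) forever; `inputs` maps input names to arrays."""
    n_samples = len(y)
    rng = np.random.default_rng(seed)
    while True:  # one pass of the inner loop is one epoch
        order = rng.permutation(n_samples) if shuffle else np.arange(n_samples)
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            # Slice every input array with the same sample indices.
            yield {name: arr[idx] for name, arr in inputs.items()}, y[idx]

# Toy data, assumed for illustration only.
inputs = {
    'numeric_input': np.zeros((10, 100, 10)),
    'cat1_input': np.zeros((10, 100), dtype='int32'),
}
y = np.arange(10)

gen = batch_generator(inputs, y, batch_size=4, shuffle=False)
x_batch, y_batch = next(gen)
print(x_batch['numeric_input'].shape, x_batch['cat1_input'].shape, y_batch)
```

Because the generator never stops, the number of batches per epoch has to be passed separately (the `steps_per_epoch` argument), here `ceil(10 / 4) = 3`.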
    
  • 2021-02-01 10:45

    One other solution I can think of is to concatenate the numerical features (after normalizing them) and the categorical features together before feeding them to the LSTM.

    During backpropagation, allow the gradients to flow only through the embedding layer, since by default the gradients will flow through both branches.
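The normalization step mentioned above could look like the following sketch, which standardizes each numerical feature using statistics computed on the training set only. The array names, the toy data and the small `eps` constant are assumptions for illustration; other scalings (such as min-max) would work as well.

```python
import numpy as np

# Toy numerical arrays of shape (n_samples, n_steps, n_numerical_feats).
rng = np.random.default_rng(0)
X_tr_numerical = rng.normal(loc=5.0, scale=3.0, size=(32, 100, 10))
X_val_numerical = rng.normal(loc=5.0, scale=3.0, size=(8, 100, 10))

# Per-feature statistics over samples and timesteps, from the training set only.
mean = X_tr_numerical.mean(axis=(0, 1), keepdims=True)  # shape (1, 1, 10)
std = X_tr_numerical.std(axis=(0, 1), keepdims=True)    # shape (1, 1, 10)
eps = 1e-8  # guards against division by zero for constant features

X_tr_norm = (X_tr_numerical - mean) / (std + eps)
# Reuse the training statistics for validation/test data to avoid leakage.
X_val_norm = (X_val_numerical - mean) / (std + eps)
print(X_tr_norm.shape, X_val_norm.shape)
```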
