Question
I'm trying to solve a time series problem. In short, for each customer and material (SKU code), I have different orders placed in the past. I need to build a model that predicts the number of days until the next order for each customer and material.
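For context, this is roughly how I derive the target for each (customer, material) pair (a minimal sketch with made-up column names and dates):

import pandas as pd

orders = pd.DataFrame({
    'customer': ['A', 'A', 'A', 'B', 'B'],
    'material': ['X', 'X', 'X', 'X', 'X'],
    'order_date': pd.to_datetime(['2019-01-01', '2019-01-06', '2019-01-20',
                                  '2019-01-03', '2019-01-10']),
}).sort_values(['customer', 'material', 'order_date'])

# Days until the next order of the same (customer, material) pair;
# NaN for the last known order, which is what the model has to predict
orders['days_to_next_order'] = (
    orders.groupby(['customer', 'material'])['order_date'].shift(-1)
    - orders['order_date']
).dt.days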
What I'm trying to do is build an LSTM model in Keras, where for each customer and material I have a history padded to a maximum of 50 timesteps, and I'm using a mix of numeric features (# of days since the previous order, average days between orders in the last 60 days, etc.) and categorical features (SKU code, customer code, type of SKU, etc.).
For the categoricals, I'm trying to use the popular entity embedding technique. I started from an example published on GitHub that was not using an LSTM (it was embedding with input_length=1) and generalized it to a longer input length so that the embedded sequences could be fed to an LSTM.
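To illustrate the generalization with toy sizes (a vocabulary of 100 and an embedding size of 8, both made up):

from keras.layers import Input, Embedding

# Original GitHub example: one categorical value per sample
inp_flat = Input(shape=(1,))
emb_flat = Embedding(input_dim=100, output_dim=8, input_length=1)(inp_flat)
# emb_flat has shape (batch, 1, 8)

# My generalization: one categorical value per timestep, so the
# embedded sequence has the 3D shape an LSTM expects
inp_seq = Input(shape=(50,))
emb_seq = Embedding(input_dim=100, output_dim=8, input_length=50,
                    mask_zero=True)(inp_seq)
# emb_seq has shape (batch, 50, 8)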
Below is my full model code:
import numpy as np
from keras.models import Model
from keras.layers import Input, Embedding, Masking, Concatenate, LSTM, TimeDistributed, Dense
from keras.initializers import he_normal
from keras.optimizers import SGD

MAX_TIMESTEP = 50  # max padded history length per (customer, material)

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']  # numeric dtypes

# The feature columns: the first 5 are categorical, the last 5 numerical
features = ['CAT_Cliente_le', 'CAT_Famiglia_le', 'CAT_Materiale_le',
            'CAT_Settimana', 'CAT_Sotto_Famiglia_le',
            'NUM_Data_diff_comprato', 'NUM_Data_diff_comprato_avg',
            'NUM_Data_diff_comprato_avg_sf', 'NUM_Qty', 'NUM_Rank']

input_models = []
output_embeddings = []

for categorical_var in np.arange(len(features) - 5):
    # Name of the categorical variable, used in the Keras Embedding layer
    cat_emb_name = features[categorical_var] + '_Embedding'
    # Define the embedding size; max size is 10
    # (dataset_train is my training DataFrame)
    no_of_unique_cat = dataset_train.loc[:, features[categorical_var]].nunique()
    embedding_size = int(min(np.ceil((no_of_unique_cat + 1) / 2), 10))
    # One Input/Embedding pair for each categorical variable
    input_model = Input(shape=(MAX_TIMESTEP,))
    output_model = Embedding(no_of_unique_cat + 1, embedding_size,
                             name=cat_emb_name, input_length=MAX_TIMESTEP,
                             mask_zero=True)(input_model)
    # Collect the categorical inputs and their embeddings
    input_models.append(input_model)
    output_embeddings.append(output_model)

# The other 5 feature columns are numerical; mask the padded timesteps
input_numeric = Input(shape=(MAX_TIMESTEP, 5))
mask_numeric = Masking(mask_value=0.)(input_numeric)
input_models.append(input_numeric)
output_embeddings.append(mask_numeric)

# Concatenate the embeddings and the masked numerics along the feature axis
output = Concatenate(axis=2)(output_embeddings)
output = LSTM(units=25,
              use_bias=True,
              kernel_initializer=he_normal(seed=14),
              recurrent_initializer=he_normal(seed=14),
              unit_forget_bias=True,
              return_sequences=True)(output)
# One prediction per timestep
output = TimeDistributed(Dense(1))(output)

model = Model(inputs=input_models, outputs=output)
model.compile(loss='mae',
              optimizer=SGD(lr=0.2, decay=0.001, momentum=0.9, nesterov=False),
              metrics=['mape'])
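As a sanity check on the input shapes, I feed the model dummy arrays like this (batch size and index values are illustrative; the indices just have to stay below each embedding's input_dim):

batch = 32
# One 2D array of category indices per categorical input...
dummy_inputs = [np.random.randint(1, 3, size=(batch, MAX_TIMESTEP))
                for _ in range(5)]
# ...plus one 3D array for the numeric block
dummy_inputs.append(np.random.rand(batch, MAX_TIMESTEP, 5))

preds = model.predict(dummy_inputs)
print(preds.shape)  # (batch, MAX_TIMESTEP, 1)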
What I observed is that:

- the model shows good performance with numeric features only;
- adding the categorical features does nothing to improve performance (I would at least expect the model to overfit by learning very specific rules, like "client X ordered material Y in week Z after 5 days", but this never happens).
My question is: is there something conceptually wrong with using entity embeddings in an LSTM like this? Should I change something?
Thanks a lot in advance
Source: https://stackoverflow.com/questions/57052889/entity-embedding-of-categorical-within-time-series-data-and-lstm