Entity Embedding of Categorical within Time Series Data and LSTM


Question


I'm trying to solve a time series problem. In short, for each customer and material (SKU code), I have different orders placed in the past. I need to build a model that predicts the number of days before the next order for each customer and material.

What I'm trying to do is build an LSTM model in Keras, where for each customer and material I have a history padded to a maximum of 50 timesteps, and I'm using a mix of numeric features (number of days since the previous order, average days between orders in the last 60 days, etc.) and categorical features (SKU code, customer code, type of SKU, etc.).
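For concreteness, this is roughly how I prepare the padded inputs (a minimal sketch; variable names like sku_histories are illustrative, not my actual pipeline):

from keras.preprocessing.sequence import pad_sequences

MAX_TIMESTEP = 50

# Two example (customer, material) pairs with ragged SKU-code histories
sku_histories = [[12, 7, 7, 3], [5, 5]]
X_sku = pad_sequences(sku_histories, maxlen=MAX_TIMESTEP,
                      padding='pre', value=0)  # 0 is reserved for padding
print(X_sku.shape)  # (2, 50)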

For the categorical features, I'm trying to use the popular entity embedding technique. I started from an example published on GitHub that was not using an LSTM (it embedded each variable with input_length = 1) and generalized it to embed whole sequences that I could feed to the LSTM.
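To illustrate the generalization (a minimal sketch with made-up vocabulary and embedding sizes; only the output shapes matter here):

from keras.layers import Input, Embedding

# Original GitHub example: one categorical value per sample
inp_static = Input(shape=(1,))
emb_static = Embedding(100, 8, input_length=1)(inp_static)   # (None, 1, 8)

# Generalized here: one categorical value per timestep
inp_seq = Input(shape=(50,))
emb_seq = Embedding(100, 8, input_length=50,
                    mask_zero=True)(inp_seq)                 # (None, 50, 8)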

Below is my code.

from keras.layers import Input, Embedding, Masking, Concatenate, LSTM, TimeDistributed, Dense
from keras.models import Model
from keras.optimizers import SGD
from keras.initializers import he_normal
import numpy as np

input_models = []
output_embeddings = []

## The first 5 features are categorical, the last 5 numeric
features = ['CAT_Cliente_le', 'CAT_Famiglia_le', 'CAT_Materiale_le',
            'CAT_Settimana', 'CAT_Sotto_Famiglia_le',
            'NUM_Data_diff_comprato', 'NUM_Data_diff_comprato_avg',
            'NUM_Data_diff_comprato_avg_sf', 'NUM_Qty', 'NUM_Rank']

for categorical_var in np.arange(len(features) - 5):

    # Name of the categorical variable, reused for the Keras Embedding layer
    cat_emb_name = features[categorical_var] + '_Embedding'

    # Define the embedding size, capped at a maximum of 10
    no_of_unique_cat = dataset_train.loc[:, features[categorical_var]].nunique()
    embedding_size = int(min(np.ceil((no_of_unique_cat + 1) / 2), 10))

    # One Embedding layer for each categorical variable; index 0 is the padding value
    input_model = Input(shape=(MAX_TIMESTEP,))
    output_model = Embedding(no_of_unique_cat + 1, embedding_size, name=cat_emb_name,
                             input_length=MAX_TIMESTEP, mask_zero=True)(input_model)

    # Appending all the categorical inputs
    input_models.append(input_model)

    # Appending all the embeddings
    output_embeddings.append(output_model)

# The other 5 columns are numeric; mask timesteps that are all-zero padding
input_numeric = Input(shape=(MAX_TIMESTEP, 5))
mask_numeric = Masking(mask_value=0.)(input_numeric)
input_models.append(input_numeric)
output_embeddings.append(mask_numeric)

# Concatenate embeddings and numeric features along the feature axis
output = Concatenate(axis=2)(output_embeddings)

output = LSTM(units=25,
              use_bias=True,
              kernel_initializer=he_normal(seed=14),
              recurrent_initializer=he_normal(seed=14),
              unit_forget_bias=True,
              return_sequences=True)(output)

# One prediction per timestep (days to next order)
output = TimeDistributed(Dense(1))(output)

model = Model(inputs=input_models, outputs=output)
model.compile(loss='mae',
              optimizer=SGD(lr=0.2, decay=0.001, momentum=0.9, nesterov=False),
              metrics=['mape'])
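For completeness, the model is then fit with one array per categorical input plus the numeric block (a hypothetical call; the array names are illustrative):

# One (samples, MAX_TIMESTEP) integer array per categorical feature, then one
# (samples, MAX_TIMESTEP, 5) float array for the numeric features. y has shape
# (samples, MAX_TIMESTEP, 1) because the LSTM returns sequences and the Dense
# head is TimeDistributed.
model.fit(
    [X_cliente, X_famiglia, X_materiale, X_settimana, X_sotto_famiglia,
     X_numeric],
    y_days_to_next_order,
    epochs=20, batch_size=128, validation_split=0.1)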

What I observed is that:

- the model shows good performance with numeric features only;
- adding the categorical features does nothing to improve performance (I would at least expect the model to overfit by learning very specific rules, like "client X orders material Y in week Z after 5 days", but this never happens).

My question is: is there something conceptually wrong with using entity embeddings in an LSTM like this? Should I change something?

Thanks a lot in advance

Source: https://stackoverflow.com/questions/57052889/entity-embedding-of-categorical-within-time-series-data-and-lstm
