What embedding-layer output_dim is really needed for a dictionary of just 10000 words?

Submitted by £可爱£侵袭症+ on 2019-12-24 03:03:16

Question


I'm training an RNN with a very reduced set of word features, around 10,000. I was planning to start with an embedding layer before the RNN layers, but it is very unclear to me what dimensionality is really needed. I know that I can try out different values (32, 64, etc.), but I'd rather have some intuition going in first. For example, if I use a 32-dimensional embedding vector, then only 3 distinct values are needed per dimension to fully describe the space (3**32 >> 10000).

Alternatively, for a vocabulary this small, does one even need an embedding layer at all, or does it make more sense to go from the input layer straight to the RNN?
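To make that alternative concrete, here is roughly what I mean by skipping the embedding layer and feeding one-hot vectors straight to the RNN (the sequence length and layer sizes below are just placeholders):

# Sketch of the "no embedding layer" alternative: each timestep is a 10,000-way one-hot vector.
from keras.layers import GRU, Dense, Input
from keras.models import Model

seq_input = Input(shape=(50, 10000))          # sequence length 50 is an arbitrary placeholder
x = GRU(64)(seq_input)                        # RNN consumes the one-hot vectors directly
out = Dense(1, activation='sigmoid')(x)
model = Model(inputs=seq_input, outputs=out)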


Answer 1:


This is a good question that does not have a good answer. You should surely use an embedding layer and not just go straight to an LSTM/GRU. However, the latent dimension of the embedding layer should be "as large as possible while still maintaining peak validation performance". For a vocabulary of around your size, 128 or 256 should be a reasonable choice. I doubt you will see drastically different performance between them.
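For concreteness, a minimal version of that setup might look like this (the sequence length, GRU size and output head below are arbitrary placeholders):

# Minimal sketch: 10,000-word vocabulary, 128-dimensional embedding, small recurrent model on top.
from keras.layers import Dense, Embedding, GRU, Input
from keras.models import Model

inputs = Input(shape=(50,), dtype='int32')                 # integer word indices
x = Embedding(input_dim=10000, output_dim=128)(inputs)     # 128-dim embedding for a 10k vocabulary
x = GRU(64)(x)
outputs = Dense(1, activation='sigmoid')(x)                # e.g. a binary-classification head

model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])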

However, something that will really hurt your results on a small data set is not using pre-trained word embeddings: your embeddings will badly overfit to your training data. I recommend using GloVe word embeddings. After downloading the GloVe data, you can use it to initialize the weights of your embedding layer, and the embedding layer will then fine-tune those weights to your use case. Here is some code I use for loading GloVe embeddings with Keras. It lets you load different embedding sizes and also caches the resulting matrix so that it is fast to run the second time around.

import os
from enum import Enum

import numpy as np


class GloVeSize(Enum):

    tiny = 50
    small = 100
    medium = 200
    large = 300


__DEFAULT_SIZE = GloVeSize.small


def get_pretrained_embedding_matrix(word_to_index,
                                    vocab_size=10000,
                                    glove_dir="./bin/GloVe",
                                    use_cache_if_present=True,
                                    cache_if_computed=True,
                                    cache_dir='./bin/cache',
                                    size=__DEFAULT_SIZE,
                                    verbose=1):

    """
    get pre-trained word embeddings from GloVe: https://github.com/stanfordnlp/GloVe
    :param word_to_index: a word to index map of the corpus
    :param vocab_size: the vocab size
    :param glove_dir: the dir of glove
    :param use_cache_if_present: whether to use a cached weight file if present
    :param cache_if_computed: whether to cache the result if re-computed
    :param cache_dir: the directory of the project's cache
    :param size: an enumerated choice of GloVeSize
    :param verbose: the verbosity level of logging
    :return: a matrix of the embeddings
    """
    def vprint(*args, with_arrow=True):
        if verbose > 0:
            if with_arrow:
                print(">>", *args)
            else:
                print(*args)

    if not os.path.exists(cache_dir):
        os.makedirs(cache_dir)

    cache_path = os.path.join(cache_dir, 'glove_%d_embedding_matrix.npy' % size.value)
    if use_cache_if_present and os.path.isfile(cache_path):
        return np.load(cache_path)
    else:
        vprint('computing embeddings', with_arrow=True)
        embeddings_index = {}
        size_value = size.value
        # GloVe files are plain UTF-8 text: each line is a word followed by its vector components
        f = open(os.path.join(glove_dir, 'glove.6B.' + str(size_value) + 'd.txt'),
                 encoding="utf-8", errors='ignore')

        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs

        f.close()
        vprint('Found', len(embeddings_index), 'word vectors.')

        # random-normal initialization; rows for words found in GloVe are overwritten below,
        # so only out-of-vocabulary words keep random vectors
        embedding_matrix = np.random.normal(size=(vocab_size, size.value))

        non = 0
        for word, index in word_to_index.items():
            embedding_vector = embeddings_index.get(word)
            if embedding_vector is not None:
                embedding_matrix[index] = embedding_vector
            else:
                non += 1

        vprint(non, "words did not have mappings")
        vprint(with_arrow=False)

        if cache_if_computed:
            np.save(cache_path, embedding_matrix)

        return embedding_matrix

then instantiate your embedding layer with that weight matrix:

from keras.layers import Embedding

embedding_size = GloVeSize.small
embedding_matrix = get_pretrained_embedding_matrix(data.word_to_index,
                                                   size=embedding_size)

# Prepend a zero row for the masked padding index 0, so real word indices start at 1.
embedding = Embedding(
    output_dim=embedding_size.value,
    input_dim=vocabulary_size + 1,
    input_length=input_length,
    mask_zero=True,
    weights=[np.vstack((np.zeros((1, embedding_size.value)),
                        embedding_matrix))],
    name='embedding'
)(input_layer)
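From there you can finish the model as usual. A rough continuation sketch, assuming input_layer is the integer-index Input defined in your own model code, and with the LSTM size and output head as illustrative placeholders:

# Continuation sketch: recurrent layer and head on top of the GloVe-initialized embedding.
# trainable=True is the Embedding default, so the pre-trained weights get fine-tuned;
# pass trainable=False to the Embedding layer above instead if you want to freeze them.
from keras.layers import LSTM, Dense
from keras.models import Model

x = LSTM(128)(embedding)                          # LSTM size is an arbitrary choice
output_layer = Dense(1, activation='sigmoid')(x)

model = Model(inputs=input_layer, outputs=output_layer)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])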


Source: https://stackoverflow.com/questions/51328516/what-embedding-layer-output-dim-is-really-needed-for-a-dictionary-of-just-10000
