I want to train a DNN model on training data with more than one billion feature dimensions, so the shape of the first-layer weight matrix will be (1,000,000,000, 512). This weight matrix is too large to fit on a single machine. How should I handle it in TensorFlow?
The answer to this question depends greatly on what operations you want to perform on the weight matrix.
The typical way to handle such a large number of features is to treat the 512-dimensional vector per feature as an embedding. If each example in your data set has only one of the 1 billion features, then you can use the tf.nn.embedding_lookup function to look up the embeddings for the features present in a mini-batch of examples. If each example has more than one feature, but presumably only a handful of them, then you can use tf.nn.embedding_lookup_sparse to look up the embeddings.
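As a rough sketch of both cases (the shard count, shapes, and variable names here are made up for illustration; the real shards would be far larger and live on different machines):

```python
import tensorflow as tf

embedding_dim = 512
num_shards = 4
rows_per_shard = 250_000   # illustrative; the real matrix has ~1e9 rows in total

# The big weight matrix, stored as a list of shards. Passing a list as `params`
# is what lets the lookup ops work against a partitioned matrix.
params = [
    tf.Variable(tf.random.normal([rows_per_shard, embedding_dim]),
                name="weights_shard_%d" % i)
    for i in range(num_shards)
]

# Case 1: each example carries exactly one feature id.
ids = tf.constant([3, 42, 999_999])                        # a mini-batch of feature ids
one_feature_embeddings = tf.nn.embedding_lookup(params, ids)   # shape [3, 512]

# Case 2: each example carries a handful of feature ids, given as a SparseTensor.
sp_ids = tf.SparseTensor(indices=[[0, 0], [0, 1], [1, 0]],      # (example, slot) pairs
                         values=tf.constant([3, 42, 7], dtype=tf.int64),  # feature ids
                         dense_shape=[2, 2])
multi_feature_embeddings = tf.nn.embedding_lookup_sparse(
    params, sp_ids, sp_weights=None, combiner="mean")           # shape [2, 512]
```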
In both these cases, your weight matrix can be distributed across many machines. That is, the params argument to both of these functions accepts a list of tensors. You would shard your large weight matrix and place the shards on different machines. Please look at tf.device and the primer on distributed execution to understand how data and computation can be distributed across many machines.
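For example, a TF1-style sketch of pinning two shards to different parameter-server tasks (the job/task names and shard shapes are hypothetical, and only graph construction is shown; nothing is allocated until the variables are initialized on those tasks):

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

# Pin each shard of the (1e9, 512) weight matrix to a different parameter-server
# task; the device strings are placement hints at graph-construction time.
with tf.device("/job:ps/task:0"):
    shard_0 = tf.get_variable("weights_shard_0", shape=[500_000_000, 512])
with tf.device("/job:ps/task:1"):
    shard_1 = tf.get_variable("weights_shard_1", shape=[500_000_000, 512])

params = [shard_0, shard_1]   # fed as `params` to the embedding lookup ops above
```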
If you really want to perform a dense operation on the weight matrix, say, multiply it by another matrix, that is still conceivable, although there are no ready-made recipes in TensorFlow for it. You would still shard your weight matrix across machines, but you would then have to manually construct a sequence of matrix multiplies on the distributed blocks of the weight matrix and combine the results.
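One way such a blocked multiply could look, under the assumption that W is sharded by rows and the other matrix is split into matching column blocks (a toy sketch, not a ready-made recipe):

```python
import tensorflow as tf

def sharded_matmul(x_blocks, w_shards):
    """Computes x @ W, where W is the vertical stack of w_shards and
    x is split into matching column blocks x_blocks."""
    partial = [tf.matmul(x, w) for x, w in zip(x_blocks, w_shards)]
    return tf.add_n(partial)   # combine the partial products

# Tiny illustration with 2 shards of 3 rows each; in practice each shard (and
# each partial matmul) would be placed on its own machine with tf.device.
w_shards = [tf.random.normal([3, 512]) for _ in range(2)]
x_blocks = [tf.random.normal([8, 3]) for _ in range(2)]
y = sharded_matmul(x_blocks, w_shards)   # shape [8, 512]
```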