Feeding integer CSV data to a Keras Dense first layer in sequential model

Question

The documentation for CSV Datasets stops short of showing how to use a CSV dataset for anything practical like using the data to train a neural network. Can anyone provide a straightforward example to demonstrate how to do this, with clarity around data shape and type issues at a minimum, and preferably considering batching, shuffling, repeating over epochs as well?

For example, I have a CSV file of M rows, each row being an integer class label followed by N integers from which I hope to predict the class label using an old-style 3-layer neural network with H hidden neurons:

model = Sequential()
model.add(Dense(H, activation='relu', input_dim=N))
...
model.fit(train_ds, ...)

For my data, M > 50000 and N > 200. I have tried creating my dataset by using:

train_ds = tf.data.experimental.make_csv_dataset('mydata.csv', batch_size=B)

However... this leads to compatibility problems between the dataset and the model... but it's not clear where these compatibility problems lie - are they in the input shape, the integer (not float) data, or somewhere else?

Answer 1:

This question may provide some help, although the answers mostly relate to TensorFlow 1.x.

It may be that CSV Datasets are not required for this task. The data size you indicate will probably fit in memory, and a tf.data.Dataset may wrap your data in more complexity than valuable functionality. You can do it without datasets (as shown below) so long as ALL your data is integers.

If you persist with the CSV Dataset approach, understand that there are many ways CSVs are used, and different approaches to loading them (e.g. see here and here). Because CSVs can have a variety of column types (numerical, boolean, text, categorical, ...), the first step is usually to load the CSV data in a column-oriented format. This provides access to the columns via their labels - useful for pre-processing. However, you probably want to provide rows of data to your model, so translating from columns to rows may be one source of confusion. At some point you will probably need to convert your integer data to float, but this may occur as a side-effect of certain pre-processing.
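The column-to-row translation and the integer-to-float conversion described above can be sketched roughly as follows. This is a minimal sketch, not a definitive recipe: the temporary CSV, the column names (`label`, `f1`, `f2`, `f3`), the helper name `pack_row`, and the layer sizes are all assumptions made up for illustration. With `label_name` set, `make_csv_dataset` yields batches of `(dict of column tensors, label)`, which a `Dense` first layer cannot consume directly, so the columns are stacked into one float tensor of shape `(batch, N)`.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Write a tiny all-integer CSV so the sketch is self-contained (made-up data).
rng = np.random.default_rng(0)
rows = rng.integers(0, 100, size=(20, 4))
rows[:, 0] = rng.integers(0, 3, size=20)      # column 0: class label in {0, 1, 2}
csv_path = os.path.join(tempfile.mkdtemp(), "mydata.csv")
np.savetxt(csv_path, rows, fmt="%d", delimiter=",",
           header="label,f1,f2,f3", comments="")


def pack_row(features, label):
    """Hypothetical helper: turn a dict of (batch,) columns into (batch, N) floats."""
    x = tf.stack(list(features.values()), axis=-1)   # columns -> rows
    return tf.cast(x, tf.float32), label             # int -> float for Dense


train_ds = tf.data.experimental.make_csv_dataset(
    csv_path,
    batch_size=5,
    label_name="label",        # split the label column off from the features
    num_epochs=1,              # one pass per iteration; Keras re-iterates per epoch
    shuffle=True,
    shuffle_buffer_size=20,
).map(pack_row)

x_batch, y_batch = next(iter(train_ds))   # x_batch: (5, 3) float32, y_batch: (5,)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])
# Integer class labels pair with the sparse variant of the loss.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
history = model.fit(train_ds, epochs=2, verbose=0)
```

Leaving `num_epochs` at its default of `None` makes the dataset repeat indefinitely, in which case `model.fit` would also need `steps_per_epoch`.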

So long as your CSVs contain integers only, without missing data, and with a header row, you can do it without a tf.data.Dataset, step-by-step as follows:

import numpy as np
from numpy import genfromtxt
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

train_data = genfromtxt('train set.csv', delimiter=',')
test_data = genfromtxt('test set.csv', delimiter=',')
train_data = np.delete(train_data, (0), axis=0)    # delete header row
test_data = np.delete(test_data, (0), axis=0)      # delete header row
train_labels = train_data[:,[0]]
test_labels = test_data[:,[0]]
train_labels = tf.keras.utils.to_categorical(train_labels)
# count the classes found in the training set; one-hot encode the test set on the
# same basis, even if the test set only uses a subset of the classes seen in training
K = len(train_labels[ 0 ])
test_labels = tf.keras.utils.to_categorical(test_labels, K)
train_data = np.delete(train_data, (0), axis=1)    # delete label column
test_data = np.delete(test_data, (0), axis=1)      # delete label column
# genfromtxt reads the data in as float, but you may still want scaling/normalization
scale = lambda x: x/1000.0 - 500.0                 # change to suit
train_data = scale(train_data)                     # keep the result; scale() does not work in place
test_data = scale(test_data)

N_train = len(train_data[0])        # columns in training set
N_test = len(test_data[0])          # columns in test set
if N_train != N_test:
  print("Datasets have incompatible column counts: %d vs %d" % (N_train, N_test))
  exit()
M_train = len(train_data)           # rows in training set
M_test = len(test_data)             # rows in test set

print("Training data size: %d rows x %d columns" % (M_train, N_train))
print("Test set data size: %d rows x %d columns" % (M_test, N_test))
print("Training to predict %d classes" % (K))

model = Sequential()
model.add(Dense(H, activation='relu', input_dim=N_train))     # H not yet defined...
...
model.compile(...)
model.fit( train_data, train_labels, ... )    # see docs for shuffle, batch, etc
model.evaluate( test_data, test_labels )
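For the batching, shuffling and epoch handling that the question asks about, `model.fit` on in-memory NumPy arrays takes these as plain arguments. The following is a minimal sketch under stated assumptions: the random stand-in data, the sizes `M`, `N`, `K`, `H`, the second `Dense` layer, and the optimizer and loss choices are illustrative and not taken from the steps above.

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in data: M rows, N integer features, K classes, H hidden units.
M, N, K, H = 200, 10, 3, 16
rng = np.random.default_rng(0)
features = rng.integers(0, 1000, size=(M, N)).astype("float32") / 1000.0
labels = tf.keras.utils.to_categorical(rng.integers(0, K, size=M), K)  # one-hot

model = tf.keras.Sequential([
    tf.keras.Input(shape=(N,)),
    tf.keras.layers.Dense(H, activation="relu"),
    tf.keras.layers.Dense(K, activation="softmax"),
])
# One-hot labels pair with categorical_crossentropy.
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(
    features, labels,
    batch_size=32,    # rows per gradient step
    epochs=3,         # full passes over the data
    shuffle=True,     # reshuffle the rows before each epoch
    verbose=0,
)
predictions = model.predict(features[:4], verbose=0)  # shape (4, K)
```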

Source: https://stackoverflow.com/questions/61607656/feeding-integer-csv-data-to-a-keras-dense-first-layer-in-sequential-model

Tags

tensorflow

keras

tensorflow-datasets