Streaming large training and test files into Tensorflow's DNNClassifier

后端未结

关注

 2  553

I have a huge training CSV file (709M) and a large testing CSV file (125M) that I want to send into a DNNClassifier in the context of using the high-level Tenso

相关标签:

2条回答

悲&欢浪女

2020-11-29 02:39

I agree with DomJack about using the Dataset API, except the need to read the whole csv file and then convert to TfRecord. I am hereby proposing to emply TextLineDataset - a sub-class of the Dataset API to directly load data into a TensorFlow program. An intuitive tutorial can be found here.

The code below is used for the MNIST classification problem for illustration and hopefully, answer the question of the OP. The csv file has 784 columns, and the number of classes is 10. The classifier I used in this example is a 1-hidden-layer neural network with 16 relu units.

Firstly, load libraries and define some constants:

# load libraries
import tensorflow as tf
import os

# some constants
n_x = 784
n_h = 16
n_y = 10

# path to the folder containing the train and test csv files
# You only need to change PATH, rest is platform independent
PATH = os.getcwd() + '/' 

# create a list of feature names
feature_names = ['pixel' + str(i) for i in range(n_x)]

Secondly, we create an input function reading a file using the Dataset API, then provide the results to the Estimator API. The return value must be a two-element tuple organized as follows: the first element must be a dict in which each input feature is a key, and then a list of values for the training batch, and the second element is a list of labels for the training batch.

def my_input_fn(file_path, batch_size=32, buffer_size=256,\
                perform_shuffle=False, repeat_count=1):
    '''
    Args:
        - file_path: the path of the input file
        - perform_shuffle: whether the data is shuffled or not
        - repeat_count: The number of times to iterate over the records in the dataset.
                    For example, if we specify 1, then each record is read once.
                    If we specify None, iteration will continue forever.
    Output is two-element tuple organized as follows:
        - The first element must be a dict in which each input feature is a key,
        and then a list of values for the training batch.
        - The second element is a list of labels for the training batch.
    '''
    def decode_csv(line):
        record_defaults = [[0.]]*n_x # n_x features
        record_defaults.insert(0, [0]) # the first element is the label (int)
        parsed_line = tf.decode_csv(records=line,\
                                    record_defaults=record_defaults)
        label = parsed_line[0]  # First element is the label
        del parsed_line[0]  # Delete first element
        features = parsed_line  # Everything but first elements are the features
        d = dict(zip(feature_names, features)), label
        return d

    dataset = (tf.data.TextLineDataset(file_path)  # Read text file
               .skip(1)  # Skip header row
               .map(decode_csv))  # Transform each elem by applying decode_csv fn
    if perform_shuffle:
        # Randomizes input using a window of 256 elements (read into memory)
        dataset = dataset.shuffle(buffer_size=buffer_size)
    dataset = dataset.repeat(repeat_count)  # Repeats dataset this # times
    dataset = dataset.batch(batch_size)  # Batch size to use
    iterator = dataset.make_one_shot_iterator()
    batch_features, batch_labels = iterator.get_next()

    return batch_features, batch_labels

Then, the mini-batch can be computed as

next_batch = my_input_fn(file_path=PATH+'train1.csv',\
                         batch_size=batch_size,\
                         perform_shuffle=True) # return 512 random elements

Next, we define the feature columns are numeric

feature_columns = [tf.feature_column.numeric_column(k) for k in feature_names]

Thirdly, we create an estimator DNNClassifier:

classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,  # The input features to our model
    hidden_units=[n_h],  # One layer
    n_classes=n_y,
    model_dir=None)

Finally, the DNN is trained using the test csv file, while the evaluation is performed on the test file. Please change the repeat_count and steps to ensure that the training meets the required number of epochs in your code.

# train the DNN
classifier.train(
    input_fn=lambda: my_input_fn(file_path=PATH+'train1.csv',\
                                 perform_shuffle=True,\
                                 repeat_count=1),\
                                 steps=None)    

# evaluate using the test csv file
evaluate_result = classifier.evaluate(
    input_fn=lambda: my_input_fn(file_path=PATH+'test1.csv',\
                                 perform_shuffle=False))
print("Evaluation results")
for key in evaluate_result:
    print("   {}, was: {}".format(key, evaluate_result[key]))

0 讨论(0)

时光取名叫无心

2020-11-29 02:40

Check out the tf.data.Dataset API. There are a number of ways to create a dataset. I'll outline four - but you'll only have to implement one.

I assume each row of your csv files is n_features float values followed by a single int value.

Creating a `tf.data.Dataset`

Wrap a python generator with `Dataset.from_generator`

The easiest way to get started is to wrap a native python generator. This can have performance issues, but may be fine for your purposes.

def read_csv(filename):
    with open(filename, 'r') as f:
        for line in f.readlines():
            record = line.rstrip().split(',')
            features = [float(n) for n in record[:-1]]
            label = int(record[-1])
            yield features, label

def get_dataset():
    filename = 'my_train_dataset.csv'
    generator = lambda: read_csv(filename)
    return tf.data.Dataset.from_generator(
        generator, (tf.float32, tf.int32), ((n_features,), ()))

This approach is highly versatile and allows you to test your generator function (read_csv) independently of TensorFlow.

Use Tensorflow Datasets API

Supporting tensorflow versions 1.12+, tensorflow datasets is my new favourite way of creating datasets. It automatically serializes your data, collects statistics and makes other meta-data available to you via info and builder objects. It can also handle automatic downloading and extracting making collaboration simple.

import tensorflow_datasets as tfds

class MyCsvDatasetBuilder(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version("0.0.1")

  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        description=(
            "My dataset"),
        features=tfds.features.FeaturesDict({
            "features": tfds.features.Tensor(
              shape=(FEATURE_SIZE,), dtype=tf.float32),
            "label": tfds.features.ClassLabel(
                names=CLASS_NAMES),
            "index": tfds.features.Tensor(shape=(), dtype=tf.float32)
        }),
        supervised_keys=("features", "label"),
    )

  def _split_generators(self, dl_manager):
    paths = dict(
      train='/path/to/train.csv',
      test='/path/to/test.csv',
    )
    # better yet, if the csv files were originally downloaded, use
    # urls = dict(train=train_url, test=test_url)
    # paths = dl_manager.download(urls)
    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TRAIN,
            num_shards=10,
            gen_kwargs=dict(path=paths['train'])),
        tfds.core.SplitGenerator(
            name=tfds.Split.TEST,
            num_shards=2,
            gen_kwargs=dict(cvs_path=paths['test']))
    ]

  def _generate_examples(self, csv_path):
    with open(csv_path, 'r') as f:
        for i, line in enumerate(f.readlines()):
            record = line.rstrip().split(',')
            features = [float(n) for n in record[:-1]]
            label = int(record[-1])
            yield dict(features=features, label=label, index=i)

Usage:

builder = MyCsvDatasetBuilder()
builder.download_and_prepare()  # will only take time to run first time
# as_supervised makes output (features, label) - good for model.fit
datasets = builder.as_dataset(as_supervised=True)

train_ds = datasets['train']
test_ds = datasets['test']

Wrap an index-based python function

One of the downsides of the above is shuffling the resulting dataset with a shuffle buffer of size n requires n examples to be loaded. This will either create periodic pauses in your pipeline (large n) or result in potentially poor shuffling (small n).

def get_record(i):
    # load the ith record using standard python, return numpy arrays
    return features, labels

def get_inputs(batch_size, is_training):

    def tf_map_fn(index):
        features, labels = tf.py_func(
            get_record, (index,), (tf.float32, tf.int32), stateful=False)
        features.set_shape((n_features,))
        labels.set_shape(())
        # do data augmentation here
        return features, labels

    epoch_size = get_epoch_size()
    dataset = tf.data.Dataset.from_tensor_slices((tf.range(epoch_size,))
    if is_training:
        dataset = dataset.repeat().shuffle(epoch_size)
    dataset = dataset.map(tf_map_fn, (tf.float32, tf.int32), num_parallel_calls=8)
    dataset = dataset.batch(batch_size)
    # prefetch data to CPU while GPU processes previous batch
    dataset = dataset.prefetch(1)
    # Also possible
    # dataset = dataset.apply(
    #     tf.contrib.data.prefetch_to_device('/gpu:0'))
    features, labels = dataset.make_one_shot_iterator().get_next()
    return features, labels

In short, we create a dataset just of the record indices (or any small record ID which we can load entirely into memory). We then do shuffling/repeating operations on this minimal dataset, then map the index to the actual data via tf.data.Dataset.map and tf.py_func. See the Using with Estimators and Testing in isolation sections below for usage. Note this requires your data to be accessible by row, so you may need to convert from csv to some other format.

TextLineDataset

You can also read the csv file directly using a tf.data.TextLineDataset.

def get_record_defaults():
  zf = tf.zeros(shape=(1,), dtype=tf.float32)
  zi = tf.ones(shape=(1,), dtype=tf.int32)
  return [zf]*n_features + [zi]

def parse_row(tf_string):
    data = tf.decode_csv(
        tf.expand_dims(tf_string, axis=0), get_record_defaults())
    features = data[:-1]
    features = tf.stack(features, axis=-1)
    label = data[-1]
    features = tf.squeeze(features, axis=0)
    label = tf.squeeze(label, axis=0)
    return features, label

def get_dataset():
    dataset = tf.data.TextLineDataset(['data.csv'])
    return dataset.map(parse_row, num_parallel_calls=8)

The parse_row function is a little convoluted since tf.decode_csv expects a batch. You can make it slightly simpler if you batch the dataset before parsing.

def parse_batch(tf_string):
    data = tf.decode_csv(tf_string, get_record_defaults())
    features = data[:-1]
    labels = data[-1]
    features = tf.stack(features, axis=-1)
    return features, labels

def get_batched_dataset(batch_size):
    dataset = tf.data.TextLineDataset(['data.csv'])
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(parse_batch)
    return dataset

TFRecordDataset

Alternatively you can convert the csv files to TFRecord files and use a TFRecordDataset. There's a thorough tutorial here.

Step 1: Convert the csv data to TFRecords data. Example code below (see read_csv from from_generator example above).

with tf.python_io.TFRecordWriter("my_train_dataset.tfrecords") as writer:
    for features, labels in read_csv('my_train_dataset.csv'):
        example = tf.train.Example()
        example.features.feature[
            "features"].float_list.value.extend(features)
        example.features.feature[
            "label"].int64_list.value.append(label)
        writer.write(example.SerializeToString())

This only needs to be run once.

Step 2: Write a dataset that decodes these record files.

def parse_function(example_proto):
    features = {
        'features': tf.FixedLenFeature((n_features,), tf.float32),
        'label': tf.FixedLenFeature((), tf.int64)
    }
    parsed_features = tf.parse_single_example(example_proto, features)
    return parsed_features['features'], parsed_features['label']

def get_dataset():
    dataset = tf.data.TFRecordDataset(['data.tfrecords'])
    dataset = dataset.map(parse_function)
    return dataset

Using the dataset with estimators

def get_inputs(batch_size, shuffle_size):
    dataset = get_dataset()  # one of the above implementations
    dataset = dataset.shuffle(shuffle_size)
    dataset = dataset.repeat()  # repeat indefinitely
    dataset = dataset.batch(batch_size)
            # prefetch data to CPU while GPU processes previous batch
    dataset = dataset.prefetch(1)
    # Also possible
    # dataset = dataset.apply(
    #     tf.contrib.data.prefetch_to_device('/gpu:0'))
    features, label = dataset.make_one_shot_iterator().get_next()

estimator.train(lambda: get_inputs(32, 1000), max_steps=1e7)

Testing the dataset in isolation

I'd strongly encourage you to test your dataset independently of your estimator. Using the above get_inputs, it should be as simple as

batch_size = 4
shuffle_size = 100
features, labels = get_inputs(batch_size, shuffle_size)
with tf.Session() as sess:
    f_data, l_data = sess.run([features, labels])
print(f_data, l_data)  # or some better visualization function

Performance

Assuming your using a GPU to run your network, unless each row of your csv file is enormous and your network is tiny you probably won't notice a difference in performance. This is because the Estimator implementation forces data loading/preprocessing to be performed on the CPU, and prefetch means the next batch can be prepared on the CPU as the current batch is training on the GPU. The only exception to this is if you have a massive shuffle size on a dataset with a large amount of data per record, which will take some time to load in a number of examples initially before running anything through the GPU.